Development status can be tracked here.
The acrypto project is closed because the vanilla kernel now supports an asynchronous crypto API.
More details can be found here.
Combined (ipsec, dm-crypt, acrypto) patchsets can be found in the archive.
The latest released version is for the 2.6.20 kernel tree.
Release notes can be found here.
The 2.6.16, 2.6.17 and 2.6.18 kernel trees have been moved into maintenance mode.
Combined patchsets include:
- acrypto core
- IPsec ESP4 port to acrypto
- dm-crypt port to acrypto
- OCF to acrypto bridge, which allows OCF device drivers (for example ixp4xx)
to run with acrypto; requires OCF to be installed.
Acrypto supports the following features:
- multiple asynchronous crypto device queues
- crypto session routing (a single crypto session completes only when
several operations (crypto, hmac, anything) have all completed)
- crypto session binding (bind crypto processing to a specified device)
- modular load balancing (one can create a load balancer which takes
into account, for example, the pid of the calling process)
- crypto session batching, inherent in the design (acrypto
hands the whole data structure to the crypto device, so it is
possible to use acrypto as a bridge which routes requests between
completely different devices, since it does not differentiate between
users and just handles requests)
- crypto session priority
- different kinds of crypto operations (RNG, asymmetric crypto, HMAC and
any other)
IXP4xx benchmark with the OCF to acrypto bridge:
- with 1500 buffers it runs at 150 Mbit/sec
- software-only encryption on that processor only reaches ~1.5 Mbit/sec
- IPsec shows about 20 Mbit/sec
2006_05_29 version.
New acrypto package and combined patchsets have been released.
Short changelog:
- fixed bug in dm-crypt/ipsec acrypto port in AES-192/256 type detection.
- reduced number of atomic operations in simple load balancer.
- removed reference counter leak for broken sessions.
- small code cleanups.
Hardware: VIA C3 Ezra with 256 MB of RAM.
Big file copy from an unencrypted to an encrypted partition.
HIFN (7955 in a 32/33 PCI slot) with acrypto dm-crypt:
687235072 bytes (687 MB) copied, 78.1593 seconds, 8.8 MB/s
0.19user 28.15system 1:18.32elapsed 36%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (3major+210minor)pagefaults 0swaps
687235072 bytes (687 MB) copied, 79.6745 seconds, 8.6 MB/s
0.14user 28.30system 1:20.57elapsed 35%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (3major+210minor)pagefaults 0swaps
687235072 bytes (687 MB) copied, 75.5192 seconds, 9.1 MB/s
0.13user 31.09system 1:16.33elapsed 40%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (4major+209minor)pagefaults 0swaps
687235072 bytes (687 MB) copied, 75.9418 seconds, 9.0 MB/s
0.12user 28.85system 1:16.40elapsed 37%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (4major+209minor)pagefaults 0swaps
SW dm-crypt:
687235072 bytes (687 MB) copied, 91.1585 seconds, 7.5 MB/s
0.16user 12.82system 1:31.25elapsed 14%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2major+211minor)pagefaults 0swaps
687235072 bytes (687 MB) copied, 91.5068 seconds, 7.5 MB/s
0.10user 13.12system 1:32.36elapsed 14%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (4major+211minor)pagefaults 0swaps
687235072 bytes (687 MB) copied, 91.4392 seconds, 7.5 MB/s
0.10user 12.98system 1:32.13elapsed 14%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (4major+210minor)pagefaults 0swaps
687235072 bytes (687 MB) copied, 94.6944 seconds, 7.3 MB/s
0.10user 12.82system 1:35.47elapsed 13%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (3major+211minor)pagefaults 0swaps
As you can see, the HIFN results are clearly better. CPU usage is higher,
but that figure is not CPU time taken away from what the user might use;
it is the share of time during which the copy itself was being performed,
so a higher value is better.
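To put the two VIA C3 series side by side, the runs can be averaged. The sketch below only redoes the arithmetic on the elapsed times printed above; the helper names (`mean`, `avg_mb_per_sec`) are made up for this illustration and are not part of acrypto.

```c
#include <stddef.h>

/* Bytes copied in every VIA C3 run above. */
#define COPY_BYTES 687235072.0

/* Elapsed seconds of each run, copied from the results above. */
static const double hifn_secs[] = { 78.1593, 79.6745, 75.5192, 75.9418 };
static const double sw_secs[]   = { 91.1585, 91.5068, 91.4392, 94.6944 };

/* Arithmetic mean of n samples. */
static double mean(const double *v, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Average throughput in MB/s (1 MB = 10^6 bytes, as in the output above). */
static double avg_mb_per_sec(const double *secs, size_t n)
{
    return COPY_BYTES / mean(secs, n) / 1e6;
}
```

Calling `avg_mb_per_sec(hifn_secs, 4)` and `avg_mb_per_sec(sw_secs, 4)` gives about 8.9 MB/s for HIFN and about 7.5 MB/s for software dm-crypt, so HIFN is roughly 19% faster on this machine.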
Hardware: Celeron 1.3 GHz with 504 MB of RAM.
HIFN (7955 in a 32/33 PCI slot) with acrypto dm-crypt:
727478272 bytes transferred in 64.528197 seconds (11273804 bytes/sec)
0.19user 17.22system 1:04.55elapsed 26%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+211minor)pagefaults 0swaps
727478272 bytes transferred in 67.712388 seconds (10743651 bytes/sec)
0.18user 15.33system 1:08.23elapsed 22%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (13major+252minor)pagefaults 0swaps
SW dm-crypt:
727478272 bytes transferred in 59.352731 seconds (12256863 bytes/sec)
0.22user 10.50system 0:59.54elapsed 18%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (11major+256minor)pagefaults 0swaps
727478272 bytes transferred in 60.080185 seconds (12108456 bytes/sec)
0.26user 11.07system 1:01.19elapsed 18%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (17major+248minor)pagefaults 0swaps
HIFN is slower here, probably because of the PCI bridge.
Trivial copy benchmark on a Xeon 2.4 GHz (HT enabled) with 1 GB of RAM, the
HIFN driver (7955 adapter in a PCI-X slot) and dm-crypt:
mkfs.ext3 (12 Gb partition):
real 0m17.898s
user 0m0.032s
sys 0m4.088s
$ time cp /storage1/iso/FC-5-i386-disc1.iso /mnt/1/
real 0m25.397s
user 0m0.104s
sys 0m5.064s
$ time cp /storage1/iso/FC-5-i386-disc1.iso /mnt/1/
cp: overwrite `/mnt/1/FC-5-i386-disc1.iso'? y
real 0m26.245s
user 0m0.092s
sys 0m5.792s
$ time cp /storage1/iso/FC-5-i386-disc1.iso /mnt/1/
cp: overwrite `/mnt/1/FC-5-i386-disc1.iso'? y
real 0m29.867s
user 0m0.096s
sys 0m5.476s
With the software crypto provider (which is roughly equal to synchronous crypto speed)
the above numbers are up to 20% slower. There is one special case worth noting:
sometimes I see cp/kjournald starvation,
i.e. CPU usage is about 2-3% and the input data flow into dm-crypt
is very small, so overall performance is low too. But in that case CPU
usage is very low as well.
2006_05_21 version.
New acrypto package and combined patchsets are
available in the archive.
Changes since 2006_04_12:
- removed connector symbols.
- VIA PadLock driver moved out of the acrypto tree into a separate driver.
- New HIFN driver.
2006_04_12 version.
A new acrypto combined patch for the 2.6.15 kernel tree has been released. It
fixes IPsec ESP4 tunnel mode processing and the initialization dependency on
connector when acrypto is built statically.
Many thanks to Yakov Lerner for testing.
IPsec ESP4 transport mode benchmark:
2.6.16-1.2069_FC4smp -> vanilla 2.6.16-git: ~11.8 MB/s
vanilla 2.6.16-git -> 2.6.16-1.2069_FC4smp: ~13.2 MB/s
2.6.16-1.2069_FC4smp -> acrypto 2.6.16: ~12.6 MB/s
acrypto 2.6.16 -> 2.6.16-1.2069_FC4smp: ~13.5 MB/s
I've also run the IPsec benchmark with the HIFN driver:
2.6.16-1.2069_FC4smp -> vanilla 2.6.16-git: ~11.8 MB/s
vanilla 2.6.16-git -> 2.6.16-1.2069_FC4smp: ~13.2 MB/s
2.6.16-1.2069_FC4smp -> acrypto HIFN 2.6.16: ~13.2 MB/s
acrypto HIFN 2.6.16 -> 2.6.16-1.2069_FC4smp: ~13.5 MB/s
As you might expect, CPU usage with the HIFN driver and acrypto is noticeably lower.
The above numbers drift over time, especially when the machine running
the stock FC4 kernel overheats, and then the numbers decrease to 12-13 MB/s.
The more CPUs or hardware accelerators you have, the more you gain
from an acrypto setup, since it was designed with scalability and
fully asynchronous processing in mind. The difference is not that big
and changes over time, but it shows that acrypto performs no worse
than the usual synchronous crypto processing,
which scales neither with hardware accelerators nor with SMP.
Combined patches for 2.6.16 and 2.6.15 are
available in the archive.
The combined patch includes acrypto, the IPsec ESP4 port and the dm-crypt port.
New standalone acrypto source has been released. It is a sync with the combined
patch, so it only includes the resolution of the dependency on connector when
acrypto is built statically.
The tarball is available in the archive.
Main work is concentrated on the 2.6.16 IPsec port, since IPsec changed noticeably
after 2.6.15.
New 2005_12_27 version.
Implemented priority queues for acrypto.
Implementation note:
A new priority queue is allocated and linked into the list of queues
the first time a crypto session with that priority is allocated, and
it is never freed until the device is removed. This caching is done for
performance reasons, but it has a disadvantage: when there are no crypto sessions
with higher priority, access to the lower priority queues includes
the overhead of traversing all higher priority lists and checking whether
those lists are empty.
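The trade-off described above can be sketched in plain user-space C. This is not the acrypto source: the structure and function names (`prio_queue`, `get_queue`, `enqueue`, `dequeue`) are hypothetical, and locking is left out. The `checked` counter makes the traversal overhead visible.

```c
#include <stdlib.h>

/* Hypothetical session: only the fields this sketch needs. */
struct session {
    int priority;
    struct session *next;
};

/* One FIFO per priority level, kept in a list sorted by descending priority. */
struct prio_queue {
    int priority;
    struct session *head, *tail;
    struct prio_queue *next;
};

/* Find the queue for 'priority', allocating and linking it on first use.
 * Queues are never freed here, mirroring the caching described above. */
static struct prio_queue *get_queue(struct prio_queue **list, int priority)
{
    struct prio_queue **pp = list;
    struct prio_queue *q;

    while (*pp && (*pp)->priority > priority)
        pp = &(*pp)->next;
    if (*pp && (*pp)->priority == priority)
        return *pp;

    q = calloc(1, sizeof(*q));
    if (!q)
        return NULL;
    q->priority = priority;
    q->next = *pp;
    *pp = q;
    return q;
}

/* Append a session to the tail of the queue matching its priority. */
static int enqueue(struct prio_queue **list, struct session *s)
{
    struct prio_queue *q = get_queue(list, s->priority);

    if (!q)
        return -1;
    s->next = NULL;
    if (q->tail)
        q->tail->next = s;
    else
        q->head = s;
    q->tail = s;
    return 0;
}

/* Pop the oldest session of the highest non-empty priority.
 * '*checked' counts the queues inspected, i.e. the traversal overhead. */
static struct session *dequeue(struct prio_queue *list, int *checked)
{
    *checked = 0;
    for (struct prio_queue *q = list; q; q = q->next) {
        (*checked)++;
        if (q->head) {
            struct session *s = q->head;

            q->head = s->next;
            if (!q->head)
                q->tail = NULL;
            return s;
        }
    }
    return NULL;
}
```

Once a high priority queue has been created and drained, every later `dequeue()` of a low priority session still inspects it first, which is exactly the cost the note above mentions.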
New 2005_12_23 version.
Here is a small changelog:
- Removed crypto load balancer thread.
- Eliminated main_crypto_dev.
- Removed unused session states.
- Simplified locking and reference counting.
- Simplified load balancing schema.
- Simple load balancer is part of acrypto module now.
- Added direct completion mode. If the session's callback can be invoked in any context
or if the crypto provider can call complete_session() from
process context, one can set SESSION_DIRECT in the session flags,
which makes complete_session() invoke the callback directly
instead of from a workqueue.
- Here are dm-crypt port changes:
  - Reduced memory usage.
  - Use memory pools.
  - Removed several race conditions.
  - Code simplification.
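The direct completion mode from the changelog can be illustrated with a small user-space model. Only the names SESSION_DIRECT and complete_session() come from the text; the flag value, the structure layout and the list standing in for the kernel workqueue are assumptions of this sketch.

```c
#include <stddef.h>

#define SESSION_DIRECT 0x1   /* name from the changelog; value assumed */

struct crypto_session {
    unsigned int flags;
    void (*callback)(struct crypto_session *s);
    struct crypto_session *next;    /* link in the deferred list */
    int called;                     /* demo only: set by the callback */
};

/* Stand-in for the workqueue: sessions wait here until run_deferred(). */
static struct crypto_session *deferred;

static void complete_session(struct crypto_session *s)
{
    if (s->flags & SESSION_DIRECT) {
        s->callback(s);             /* runs in the caller's context */
        return;
    }
    s->next = deferred;             /* otherwise hand off for later */
    deferred = s;
}

/* What the worker would do: invoke every postponed callback. */
static void run_deferred(void)
{
    while (deferred) {
        struct crypto_session *s = deferred;

        deferred = s->next;
        s->callback(s);
    }
}

/* Example callback used to observe when completion happens. */
static void mark_called(struct crypto_session *s)
{
    s->called = 1;
}
```

With SESSION_DIRECT set, `called` is already 1 when complete_session() returns; without it, the callback runs only when the deferred list is drained.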
Bonnie++ benchmark was run on a 2.4 GHz Xeon (1+HT) with 1 GB of RAM on an ext3 partition,
with only the SW async_provider crypto provider for AES-128 CBC loaded:
| Size:Chunk Size | Seq. output per char K/sec | % CPU | Seq. output block K/sec | % CPU | Seq. output rewrite K/sec | % CPU | Seq. input per char K/sec | % CPU | Seq. input block K/sec | % CPU | Random seeks /sec | % CPU |
| 2000M | 23054 | 96 | 35757 | 35 | 13802 | 13 | 19284 | 61 | 35445 | 4 | 123.3 | 0 |
| 2000M | 19987 | 97 | 27425 | 18 | 12486 | 6 | 19955 | 64 | 35415 | 5 | 116.4 | 0 |

| Num Files | Seq. create /sec | % CPU | Seq. read /sec | % CPU | Seq. delete /sec | % CPU | Random create /sec | % CPU | Random read /sec | % CPU | Random delete /sec | % CPU |
| 16 | 2407 | 99 | +++++ | +++ | +++++ | +++ | 2392 | 99 | +++++ | +++ | 5986 | 100 |
| 16 | 2365 | 98 | +++++ | +++ | +++++ | +++ | 2441 | 99 | +++++ | +++ | 5793 | 95 |
I'm pleased to announce an asynchronous crypto layer for Linux kernel 2.6.
It supports the following features:
- multiple asynchronous crypto device queues
- crypto session routing
- crypto session binding
- modular load balancing
- crypto session batching, inherent in the design
- crypto session priority
- different kinds of crypto operations (RNG, asymmetric crypto, HMAC and
any other)
Some design notes:
acrypto has one main crypto session queue (a doubly linked list; probably
it should be done like the crypto_route or sk_buff queue), into which each
newly allocated session is inserted, and this is where load
balancing finds its food. When a new session is being prepared for
insertion, acrypto calls the load balancer's ->find_device() method, which should
return a suitable device if one exists (the current simple_lb load balancer
returns the device with the lowest load, i.e. the device with the fewest
sessions in its queue). After a crypto_device is returned, acrypto creates
a new crypto routing entry which points to the returned device and adds it to
the crypto session routing queue. The crypto session is inserted into
the device's queue according to its priority, and it is the crypto device driver
that should process its session list according to each session's priority.
All insertions and deletions are guarded by appropriate locks, but
session_list traversal is not guarded in crypto_lb_thread(), since
a session can be removed _only_ from that function by design. So if the crypto
device (atomically) marks a session as completed and no longer being processed,
and uses list_for_each_safe() to traverse its queue, all should be OK.
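The lowest-load selection attributed to simple_lb above could look roughly like this. It is a user-space sketch, not the real load balancer: the `crypto_device` fields, the array-based device list and the function name `simple_find_device` are assumptions, and the kernel version would walk a locked device list instead.

```c
#include <stddef.h>

/* Hypothetical device descriptor: only what the selection needs. */
struct crypto_device {
    int broken;           /* set by the driver when the hardware fails */
    int queue_len;        /* sessions currently queued on this device */
    int max_queue_len;    /* capacity limit advertised by the device */
};

/* Return the usable device with the fewest queued sessions, or NULL
 * when every device is broken or full. */
static struct crypto_device *
simple_find_device(struct crypto_device *devs, size_t n)
{
    struct crypto_device *best = NULL;

    for (size_t i = 0; i < n; i++) {
        struct crypto_device *d = &devs[i];

        if (d->broken || d->queue_len >= d->max_queue_len)
            continue;
        if (!best || d->queue_len < best->queue_len)
            best = d;
    }
    return best;
}
```

A balancer that also weighs, say, the calling process or a per-device speed rating would replace only this function, which is the point of keeping load balancing modular.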
Each crypto load balancer must implement 2 methods,
->rehash() and ->find_device(), which may be called from any context and
under a spinlock.
The ->rehash() method is called to redistribute crypto sessions among device
queues. For example, if a driver decides that its device is broken, it
marks itself as broken, and the load balancer (or scheduler if you like)
should move all sessions from this queue to some other devices.
If a session can not be completed, the scheduler must mark it as broken and
complete it (by calling first broke_session() and then complete_session()
and stop_process_session()). The consumer must check whether the operation was
successful (and therefore the session is not broken).
The ->find_device() method should return an appropriate crypto device.
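A minimal model of the two-method interface and of what ->rehash() is expected to do is shown below. The method names come from the text; the signatures, structure fields and helper names are guesses made for illustration. Moved sessions are pushed to the head of the target queue, so priority ordering is ignored here.

```c
#include <stddef.h>

struct crypto_session {
    int broken;
    int completed;
    struct crypto_session *next;
};

struct crypto_device {
    int broken;
    int queue_len;
    struct crypto_session *queue;
};

/* The two methods every balancer must provide (signatures assumed). */
struct crypto_lb {
    struct crypto_device *(*find_device)(struct crypto_device *devs, size_t n);
    void (*rehash)(struct crypto_lb *lb, struct crypto_device *devs, size_t n);
};

/* find_device: the healthy device with the shortest queue, or NULL. */
static struct crypto_device *least_loaded(struct crypto_device *devs, size_t n)
{
    struct crypto_device *best = NULL;

    for (size_t i = 0; i < n; i++)
        if (!devs[i].broken && (!best || devs[i].queue_len < best->queue_len))
            best = &devs[i];
    return best;
}

/* rehash: move every session off broken devices. Sessions with nowhere
 * to go are marked broken and completed so the consumer sees the failure. */
static void simple_rehash(struct crypto_lb *lb, struct crypto_device *devs,
                          size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct crypto_device *src = &devs[i];

        if (!src->broken)
            continue;
        while (src->queue) {
            struct crypto_session *s = src->queue;
            struct crypto_device *dst;

            src->queue = s->next;
            src->queue_len--;
            dst = lb->find_device(devs, n);
            if (dst) {
                s->next = dst->queue;
                dst->queue = s;
                dst->queue_len++;
            } else {
                s->next = NULL;
                s->broken = 1;      /* stands in for broke_session() */
                s->completed = 1;   /* stands in for complete_session() */
            }
        }
    }
}
```

Since both methods may run under a spinlock, a real implementation must not sleep or allocate with blocking flags inside them; the sketch respects that by doing only pointer manipulation.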
For a crypto session to be successfully allocated, the crypto consumer must
provide two structures: struct crypto_session_initializer
(hmm, why only one z?) and struct crypto_data.
struct crypto_session_initializer contains the data needed to find an
appropriate device, like the type of operation, mode of operation, some
flags (for example SESSION_BOUND, which means that the session must be bound
to the crypto device specified in the bdev field; it is useful for TCPA/TPM),
session priority, and the callback which will be called after all routing for
the given session is finished.
struct crypto_data contains scatterlists for src, dst, key and iv.
It also has a void *priv field and its size; this area is allocated and may be
used by any crypto agent (for example, the VIA PadLock driver uses it to store
the aes_ctx field, and crypto_session can use this field to store some pointers
needed in ->callback()).
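As a rough picture of the two structures, here is a user-space approximation. The structure names, SESSION_BOUND, bdev and priv come from the text; every other field name, the flag value, the `sg_buf` stand-in for a kernel scatterlist and the `initializer_is_valid()` helper are assumptions and do not reflect the real acrypto headers.

```c
#include <stddef.h>

#define SESSION_BOUND 0x1   /* flag name from the text; value is assumed */

/* Placeholder device type so that bdev has something to point at. */
struct crypto_device {
    int id;
};

/* User-space stand-in for a kernel scatterlist entry. */
struct sg_buf {
    void  *data;
    size_t len;
};

struct crypto_data {
    struct sg_buf src, dst, key, iv;
    void  *priv;          /* scratch area for drivers or the consumer */
    size_t priv_size;
};

struct crypto_session_initializer {
    int operation;        /* e.g. encrypt or decrypt */
    int type;             /* e.g. which cipher */
    int mode;             /* e.g. CBC */
    int priority;
    unsigned int flags;
    struct crypto_device *bdev;   /* used only with SESSION_BOUND */
    void (*callback)(struct crypto_session_initializer *ci,
                     struct crypto_data *data);
};

/* Minimal sanity check a consumer might run before allocating a session. */
static int initializer_is_valid(const struct crypto_session_initializer *ci,
                                const struct crypto_data *d)
{
    if (!ci->callback)
        return 0;
    if ((ci->flags & SESSION_BOUND) && !ci->bdev)
        return 0;
    if (!d->src.data || !d->dst.data)
        return 0;
    return 1;
}
```

The callback takes only the initializer and the data, which matches the idea that it should not need to know about sessions, routes or devices.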
Currently the callback is called from queue_work, but I suppose it is
better not to assume any calling context.
->callback() will be called after all crypto routing for the given session
is done, with the same parameters as were provided at initialisation
time (if the session has only one route, the callback will be called with
the original parameters, but if it has several routes, the callback will be
called with the parameters from the last processed one). I believe the crypto
callback should not know about crypto sessions, routings, devices and so
on; proper restriction is always a good idea.
Crypto routing.
This feature allows the same session to be processed by several
devices/algorithms. For example, if you need to encrypt data and then
sign it in a TPM device, you can create one route to the encryption device and
then route it to the TPM device. It can also be used for tweakable cipher
encryption.
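The encrypt-then-sign example can be modelled as a session carrying an ordered list of routes, with the callback fired only after the last one. Everything here is a toy: the "devices" are plain functions, `toy_encrypt` is a byte-wise XOR and `toy_sign` a one-byte checksum, chosen only so the flow can be followed. None of the names are taken from the acrypto source, and real routes complete asynchronously rather than in a loop.

```c
#include <stddef.h>

#define MAX_ROUTES 4

struct crypto_data {
    unsigned char *buf;
    size_t len;
};

/* In this sketch a "device" is just a function that transforms the data. */
typedef void (*route_fn)(struct crypto_data *d);

struct crypto_session {
    route_fn routes[MAX_ROUTES];
    size_t nr_routes;
    struct crypto_data *data;
    void (*callback)(struct crypto_data *d);
};

/* Append one more processing stage to the session. */
static int add_route(struct crypto_session *s, route_fn fn)
{
    if (s->nr_routes == MAX_ROUTES)
        return -1;
    s->routes[s->nr_routes++] = fn;
    return 0;
}

/* Pass the session through every route in order; the callback fires
 * only once, after the last route has finished. */
static void process_session(struct crypto_session *s)
{
    for (size_t i = 0; i < s->nr_routes; i++)
        s->routes[i](s->data);
    if (s->callback)
        s->callback(s->data);
}

/* Toy stand-in for an encryption device: XOR all bytes but the last. */
static void toy_encrypt(struct crypto_data *d)
{
    for (size_t i = 0; i + 1 < d->len; i++)
        d->buf[i] ^= 0x5a;
}

/* Toy stand-in for a signing device: store a checksum in the last byte. */
static void toy_sign(struct crypto_data *d)
{
    unsigned char sum = 0;

    if (d->len == 0)
        return;
    for (size_t i = 0; i + 1 < d->len; i++)
        sum += d->buf[i];
    d->buf[d->len - 1] = sum;
}
```

A consumer would call `add_route(&s, toy_encrypt)`, then `add_route(&s, toy_sign)`, then `process_session(&s)`, and would see a single callback invocation with the data as left by the last route.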
Crypto device.
It can be either a software emulator or a hardware accelerator chip (like
HIFN 79*/83* or VIA PadLock ACE/RNG, or even a TPM device like every IBM
ThinkPad and some HP laptops have
(gentle hint: _they_ even have _windows_ software for them :) )).
It can be registered with the asynchronous crypto layer and must provide
some data for it:
->data_ready() method - it is called each time a new session is added to the
device's queue.
Array of struct crypto_capability and its length -
struct crypto_capability describes each operation the given device can
handle, and has a maximum session queue length parameter.
Note: this structure can [be extended to] include a "rate" parameter to
show the absolute speed of the given operation in some units, which
can then be used by the scheduler (load balancer) for proper device selection.
Actually, queue length can to some extent reflect the device's "speed".
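What a device hands to the layer can be pictured as follows. Only ->data_ready() and struct crypto_capability are named in the text; the field names, `find_capability()` and `queue_session()` are hypothetical, and the session itself is reduced to its operation/type/mode triple.

```c
#include <stddef.h>

/* One operation a device can handle, plus its queue limit. */
struct crypto_capability {
    int operation;      /* e.g. encrypt or decrypt */
    int type;           /* e.g. which cipher */
    int mode;           /* e.g. CBC */
    int max_qlen;       /* maximum session queue length */
};

struct crypto_device {
    const char *name;
    const struct crypto_capability *cap;
    size_t cap_number;
    int queue_len;
    void (*data_ready)(struct crypto_device *dev);  /* new session queued */
};

/* Return the matching capability, or NULL if the device cannot do it. */
static const struct crypto_capability *
find_capability(const struct crypto_device *dev, int op, int type, int mode)
{
    for (size_t i = 0; i < dev->cap_number; i++) {
        const struct crypto_capability *c = &dev->cap[i];

        if (c->operation == op && c->type == type && c->mode == mode)
            return c;
    }
    return NULL;
}

/* Queue one session: check capability and queue limit, then notify. */
static int queue_session(struct crypto_device *dev, int op, int type, int mode)
{
    const struct crypto_capability *c = find_capability(dev, op, type, mode);

    if (!c || dev->queue_len >= c->max_qlen)
        return -1;
    dev->queue_len++;
    if (dev->data_ready)
        dev->data_ready(dev);
    return 0;
}
```

A "rate" field, as suggested in the note above, would simply be one more member of `struct crypto_capability` that a load balancer could read.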
Acrypto has full userspace support through ioctl and through direct access to the process' vmas and pages.
The first mechanism uses ioctl() with 2 copies of the data: from and to userspace.
Session processing consists of 3 major parts:
1. Session creation. CRYPTO_SESSION_ALLOC ioctl.
User must provide a special structure which has the src, dst, key and iv data sizes
and the crypto initializer (crypto operation, mode, type and priority).
2. Data filling. User must call several CRYPTO_FILL_DATA ioctls.
Each one requires the data size and data type (structure crypto_user_data) and the data itself.
3. Finish. User must call the CRYPTO_SESSION_ADD ioctl with a pointer to the area where the result must be stored.
The latter ioctl will sleep while the session is being processed.
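The three steps above might be driven from userspace along these lines. The request names and struct crypto_user_data are taken from the text; the numeric request values, the field layouts, `struct crypto_user` and `run_session()` are invented for this sketch and will not match the real headers. The transport is passed in as a function pointer so the flow can be exercised without the kernel module; a real client would wrap ioctl(2) there.

```c
#include <stddef.h>

/* Request codes: names from the text, numeric values invented here. */
enum { CRYPTO_SESSION_ALLOC = 1, CRYPTO_FILL_DATA, CRYPTO_SESSION_ADD };

/* Kinds of data a session carries. */
enum { DATA_SRC, DATA_DST, DATA_KEY, DATA_IV, DATA_TYPES };

struct crypto_user {            /* argument of CRYPTO_SESSION_ALLOC */
    size_t sizes[DATA_TYPES];   /* src, dst, key and iv sizes */
    int operation, mode, type, priority;
};

struct crypto_user_data {       /* argument of CRYPTO_FILL_DATA */
    int data_type;
    size_t size;
    const void *data;
};

/* Signature of the transport; a real client would wrap ioctl(2). */
typedef int (*ioctl_fn)(int fd, int request, void *arg);

/* Run one session: allocate, fill src/key/iv, then add and wait. */
static int run_session(ioctl_fn do_ioctl, int fd, struct crypto_user *cu,
                       const void *src, const void *key, const void *iv,
                       void *result)
{
    const void *bufs[DATA_TYPES] = { src, NULL, key, iv };

    if (do_ioctl(fd, CRYPTO_SESSION_ALLOC, cu) < 0)       /* 1. create */
        return -1;
    for (int t = 0; t < DATA_TYPES; t++) {                /* 2. fill */
        struct crypto_user_data ud = { t, cu->sizes[t], bufs[t] };

        if (bufs[t] && do_ioctl(fd, CRYPTO_FILL_DATA, &ud) < 0)
            return -1;
    }
    return do_ioctl(fd, CRYPTO_SESSION_ADD, result);      /* 3. finish */
}
```

The last call is the one that sleeps until the session has been processed, so `run_session()` as a whole is synchronous from the caller's point of view.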
The second userspace communication mechanism is based on direct access to the process'
vmas and pages from acrypto; pointers are transferred using a special kernel connector structure.
Obviously it can not be used with most hardware or with sizes of more than one page,
but I like the idea itself.
Proof of concept userspace code can be found in the archive.
Acrypto supports input/output ESP4 IPsec processing, and dm-crypt has been ported to acrypto.
It also supports several hardware chips; all drivers can be found in
the archive.
One can find benchmark results of acrypto+asynchronous block device vs. cryptoloop vs. dm-crypt on the
bd page.
My future acrypto
prognosis.
The latest version can be found here.
Here one can find a short README about
supported hardware.
Here one can browse the acrypto-capable drivers tree.