Acrypto - asynchronous crypto layer for linux kernel 2.6

Development status can be tracked here.

The acrypto project is closed because async API support is now available in the vanilla kernel.
More details can be found here.

Combined (ipsec, dm-crypt, acrypto) patchsets can be found in archive.

The latest released version is for 2.6.20 kernel tree.
Release notes can be found here.
The 2.6.16, 2.6.17 and 2.6.18 kernel trees have been moved into maintenance mode.

Combined patchsets include:
  • acrypto core
  • IPsec ESP4 port to acrypto
  • dm-crypt port to acrypto
  • OCF to acrypto bridge, which allows OCF device drivers (for example ixp4xx) to run with acrypto; requires OCF to be installed.


Acrypto supports the following features:
  • multiple asynchronous crypto device queues
  • crypto session routing (a single crypto session completes only when several operations (crypto, hmac, anything) have completed)
  • crypto session binding (bind crypto processing to a specified device)
  • modular load balancing (one can create a load balancer which takes into account, for example, the pid of the calling process)
  • crypto session batching, inherent in the design (acrypto passes the whole data structure to the crypto device, i.e. acrypto can be used as a bridge which routes requests between completely different devices, since it does not differentiate between users and just handles requests)
  • crypto session priority
  • different kinds of crypto operation (RNG, asymmetrical crypto, HMAC and any other)


IXP4xx benchmark with OCF to acrypto bridge:
  • with 1500 buffers it runs at 150 Mbit/sec
  • software-only encryption on that processor only reaches ~1.5 Mbit/sec
  • IPsec shows about 20 Mbit/sec



2006_05_29 version.
New acrypto package and combined patchsets have been released.
Short changelog:
  • fixed bug in dm-crypt/ipsec acrypto port in AES-192/256 type detection.
  • reduced number of atomic operations in simple load balancer.
  • removed reference counter leak for broken sessions.
  • small code cleanups.
Hardware: VIA C3 Ezra with 256 MB of RAM.
Test: copy of a big file from an unencrypted to an encrypted partition.

HIFN (7955 in 32/33 pci slot) with acrypto dm-crypt:
687235072 bytes (687 MB) copied, 78.1593 seconds, 8.8 MB/s
0.19user 28.15system 1:18.32elapsed 36%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (3major+210minor)pagefaults 0swaps

687235072 bytes (687 MB) copied, 79.6745 seconds, 8.6 MB/s
0.14user 28.30system 1:20.57elapsed 35%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (3major+210minor)pagefaults 0swaps

687235072 bytes (687 MB) copied, 75.5192 seconds, 9.1 MB/s
0.13user 31.09system 1:16.33elapsed 40%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (4major+209minor)pagefaults 0swaps

687235072 bytes (687 MB) copied, 75.9418 seconds, 9.0 MB/s
0.12user 28.85system 1:16.40elapsed 37%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (4major+209minor)pagefaults 0swaps
SW dm-crypt:
687235072 bytes (687 MB) copied, 91.1585 seconds, 7.5 MB/s
0.16user 12.82system 1:31.25elapsed 14%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2major+211minor)pagefaults 0swaps

687235072 bytes (687 MB) copied, 91.5068 seconds, 7.5 MB/s
0.10user 13.12system 1:32.36elapsed 14%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (4major+211minor)pagefaults 0swaps

687235072 bytes (687 MB) copied, 91.4392 seconds, 7.5 MB/s
0.10user 12.98system 1:32.13elapsed 14%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (4major+210minor)pagefaults 0swaps

687235072 bytes (687 MB) copied, 94.6944 seconds, 7.3 MB/s
0.10user 12.82system 1:35.47elapsed 13%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (3major+211minor)pagefaults 0swaps
As you can see, the HIFN results are definitely better. CPU usage is higher too, but that figure is not CPU time taken away from what the user might otherwise use; it is the share of elapsed time during which the copy itself was actually running, so a higher value is better here.
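
To put the runs above side by side, the aggregate throughput can be recomputed from the byte count and the elapsed times that dd reports. The following is a minimal sketch in plain C; the helper name aggregate_mb_per_sec and the two arrays are illustrative, and MB means 10^6 bytes, as in the dd output.

```c
#include <stddef.h>

/* Aggregate throughput in MB/s (1 MB = 10^6 bytes) for n runs that each
 * copied `bytes` bytes and took secs[i] seconds. */
double aggregate_mb_per_sec(double bytes, const double *secs, size_t n)
{
    double total = 0.0;

    for (size_t i = 0; i < n; i++)
        total += secs[i];
    if (total <= 0.0)
        return 0.0;                     /* no runs, nothing to report */
    return ((double)n * bytes) / total / 1e6;
}

/* Elapsed times copied from the VIA C3 runs above. */
static const double hifn_secs[] = { 78.1593, 79.6745, 75.5192, 75.9418 };
static const double sw_secs[]   = { 91.1585, 91.5068, 91.4392, 94.6944 };
```

Called with 687235072 bytes and four runs each, this gives roughly 8.9 MB/s for HIFN against roughly 7.5 MB/s for software dm-crypt, i.e. a little under 20% in favour of HIFN on this machine.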

Hardware: Celeron 1.3 GHz with 504 MB of RAM.
HIFN (7955 in 32/33 pci slot) with acrypto dm-crypt:
727478272 bytes transferred in 64.528197 seconds (11273804 bytes/sec)
0.19user 17.22system 1:04.55elapsed 26%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+211minor)pagefaults 0swaps

727478272 bytes transferred in 67.712388 seconds (10743651 bytes/sec)
0.18user 15.33system 1:08.23elapsed 22%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (13major+252minor)pagefaults 0swaps
SW dm-crypt:
727478272 bytes transferred in 59.352731 seconds (12256863 bytes/sec)
0.22user 10.50system 0:59.54elapsed 18%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (11major+256minor)pagefaults 0swaps

727478272 bytes transferred in 60.080185 seconds (12108456 bytes/sec)
0.26user 11.07system 1:01.19elapsed 18%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (17major+248minor)pagefaults 0swaps
HIFN is slower here, probably because of the PCI bridge.

Trivial copy benchmark on a Xeon 2.4 GHz (HT enabled) with 1 GB of RAM, the HIFN driver (7955 adapter in a PCI-X slot) and dm-crypt:
mkfs.ext3 (12 Gb partition):
real    0m17.898s
user    0m0.032s
sys     0m4.088s

$ time cp /storage1/iso/FC-5-i386-disc1.iso /mnt/1/

real    0m25.397s
user    0m0.104s
sys     0m5.064s

$ time cp /storage1/iso/FC-5-i386-disc1.iso /mnt/1/
cp: overwrite `/mnt/1/FC-5-i386-disc1.iso'? y

real    0m26.245s
user    0m0.092s
sys     0m5.792s

$ time cp /storage1/iso/FC-5-i386-disc1.iso /mnt/1/
cp: overwrite `/mnt/1/FC-5-i386-disc1.iso'? y

real    0m29.867s
user    0m0.096s
sys     0m5.476s
With the software crypto provider (which is roughly equal to synchronous crypto speed) the above numbers are up to 20% slower. One special case is worth mentioning: sometimes I see cp/kjournald starvation, i.e. CPU usage is about 2-3% and the input data flow into dm-crypt is very small, so overall performance is low too. But in that case CPU usage is very low as well.



2006_05_21 version.
New acrypto package and combined patchsets are available in archive.
Changes since 2006_04_12:
  • removed connector symbols.
  • VIA padlock driver moved outside acrypto tree into separate driver.
  • New HIFN driver.



2006_04_12 version.
New acrypto combined patch for 2.6.15 kernel tree has been released, which fixes IPsec ESP4 tunnel mode processing and initialization dependency on connector when acrypto is built statically. Many thanks to Yakov Lerner for testing.

IPsec ESP4 transport mode benchmark:
	2.6.16-1.2069_FC4smp -> vanilla 2.6.16-git: ~11.8 MB/s
	vanilla 2.6.16-git -> 2.6.16-1.2069_FC4smp: ~13.2 MB/s

	2.6.16-1.2069_FC4smp -> acrypto 2.6.16: ~12.6 MB/s
	acrypto 2.6.16 -> 2.6.16-1.2069_FC4smp: ~13.5 MB/s
I've also run IPsec benchmark with HIFN driver:
	2.6.16-1.2069_FC4smp -> vanilla 2.6.16-git: ~11.8 MB/s
	vanilla 2.6.16-git -> 2.6.16-1.2069_FC4smp: ~13.2 MB/s

	2.6.16-1.2069_FC4smp -> acrypto HIFN 2.6.16: ~13.2 MB/s
	acrypto HIFN 2.6.16 -> 2.6.16-1.2069_FC4smp: ~13.5 MB/s
As you might expect, CPU usage with the HIFN driver and acrypto is noticeably lower.
The numbers above drift over time, especially when the machine running the stock FC4 kernel overheats; they then drop to 12-13 MB/s.

The more CPUs or hardware accelerators you have, the more you gain from an acrypto setup, since it was designed with scalability and fully asynchronous processing in mind. The difference here is not big and varies over time, but it shows that acrypto performs no worse than the usual synchronous crypto processing, which scales neither with hardware accelerators nor with SMP.

Combined patches for 2.6.16 and 2.6.15 are available in archive.

The combined patch includes acrypto, the IPsec ESP4 port and the dm-crypt port.

New standalone acrypto source released. It is synced with the combined patch, so the only change is the fix for the dependency on connector when acrypto is built statically. Tarball is available in archive.

Main work is concentrated on the 2.6.16 IPsec port, since IPsec changed noticeably after 2.6.15.

New 2005_12_27 version.
Implemented priority queues for acrypto.
Implementation note:
A new priority queue is allocated and linked into the list of queues the first time a crypto session with that priority is allocated, and it is never freed until the device is removed. This caching is done for performance reasons, but it has a disadvantage: when there are no crypto sessions with higher priority, access to the lower priority queues includes the overhead of traversing all higher priority lists and checking that they are empty.
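
The trade-off described above can be sketched in userspace C. This is not the acrypto code: the struct and function names are made up for illustration, a larger number is assumed to mean higher priority, and locking is left out. The `visited` counter makes the traversal overhead visible.

```c
#include <stdlib.h>

struct session {                     /* stand-in for a crypto session */
    int priority;
    struct session *next;
};

struct prio_queue {                  /* one FIFO per priority, cached until teardown */
    int priority;
    struct session *head, *tail;
    struct prio_queue *next;         /* next lower priority */
};

/* Find the queue for `prio`; allocate and link it on first use. */
struct prio_queue *get_queue(struct prio_queue **list, int prio)
{
    struct prio_queue **pp = list;
    struct prio_queue *q;

    while (*pp && (*pp)->priority > prio)
        pp = &(*pp)->next;
    if (*pp && (*pp)->priority == prio)
        return *pp;

    q = calloc(1, sizeof(*q));
    if (!q)
        return NULL;
    q->priority = prio;
    q->next = *pp;                   /* keep the list sorted, highest first */
    *pp = q;
    return q;
}

int enqueue(struct prio_queue **list, struct session *s)
{
    struct prio_queue *q = get_queue(list, s->priority);

    if (!q)
        return -1;
    s->next = NULL;
    if (q->tail)
        q->tail->next = s;
    else
        q->head = s;
    q->tail = s;
    return 0;
}

/* Pop the highest priority session; *visited counts the queues checked. */
struct session *dequeue(struct prio_queue *list, int *visited)
{
    *visited = 0;
    for (struct prio_queue *q = list; q; q = q->next) {
        (*visited)++;
        if (q->head) {
            struct session *s = q->head;

            q->head = s->next;
            if (!q->head)
                q->tail = NULL;
            return s;
        }
    }
    return NULL;
}
```

Once the high priority queue has drained, every dequeue of a low priority session still walks past the empty high priority queue first, which is exactly the cost the note mentions.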

New 2005_12_23 version.
Here is small changelog:
  • Removed crypto load balancer thread.
  • Eliminated main_crypto_dev.
  • Removed unused session states.
  • Simplified locking and reference counting.
  • Simplified load balancing schema.
  • Simple load balancer is part of acrypto module now.
  • Added direct completion mode. If the session's callback can be invoked in any context, or if the crypto provider can call complete_session() from process context, one can set SESSION_DIRECT in the session flags, which makes complete_session() invoke the callback directly instead of from the workqueue.
  • Here are dm-crypt port changes:
    • Reduced memory usage.
    • Use memory pools.
    • Removed several race conditions.
    • Code simplification.
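
The direct completion mode from the changelog above can be illustrated with a small userspace model. The names SESSION_DIRECT and complete_session() come from the text; the flag value, the struct layout and the singly linked list standing in for the kernel workqueue are assumptions made for this sketch.

```c
#include <stddef.h>

#define SESSION_DIRECT 0x1u          /* name from the text; value is a placeholder */

struct session {
    unsigned int flags;
    void (*callback)(struct session *s);
    struct session *next;            /* link in the deferred list */
    int done;
};

static struct session *deferred_head;   /* stand-in for the workqueue */

/* Direct sessions get their callback right here; others are deferred. */
void complete_session(struct session *s)
{
    if (s->flags & SESSION_DIRECT) {
        s->callback(s);
    } else {
        s->next = deferred_head;
        deferred_head = s;
    }
}

/* Stand-in for the workqueue worker: runs all deferred callbacks. */
void run_deferred(void)
{
    while (deferred_head) {
        struct session *s = deferred_head;

        deferred_head = s->next;
        s->callback(s);
    }
}

/* Example callback used to observe when completion happened. */
void mark_done(struct session *s)
{
    s->done = 1;
}
```

The direct path saves a context switch, which is why it is only safe when the callback does not care about its calling context.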

Bonnie++ benchmark was run on a 2.4 GHz Xeon (1+HT) with 1 GB of RAM on an ext3 partition; only the SW async_provider crypto provider for AES-128 CBC is loaded.

Sequential output, sequential input and random seeks (size 2000M):

            Seq. Output     Seq. Output     Seq. Output     Seq. Input      Seq. Input      Random
            Per Char        Block           Rewrite         Per Char        Block           Seeks
            K/sec   %CPU    K/sec   %CPU    K/sec   %CPU    K/sec   %CPU    K/sec   %CPU    /sec    %CPU
sync        23054     96    35757     35    13802     13    19284     61    35445      4    123.3      0
acrypto2    19987     97    27425     18    12486      6    19955     64    35415      5    116.4      0

Sequential create and random create (16 files):

            Seq. Create     Seq. Read       Seq. Delete     Rand. Create    Rand. Read      Rand. Delete
            /sec    %CPU    /sec    %CPU    /sec    %CPU    /sec    %CPU    /sec    %CPU    /sec    %CPU
sync         2407     99    +++++    +++    +++++    +++     2392     99    +++++    +++     5986    100
acrypto2     2365     98    +++++    +++    +++++    +++     2441     99    +++++    +++     5793     95



I'm pleased to announce the asynchronous crypto layer for Linux kernel 2.6.
It supports the following features:
- multiple asynchronous crypto device queues
- crypto session routing
- crypto session binding
- modular load balancing
- crypto session batching, inherent in the design
- crypto session priority
- different kinds of crypto operation (RNG, asymmetrical crypto, HMAC and any other)

Some design notes:
acrypto has one main crypto session queue (a doubly linked list; it should probably be done like the crypto_route or sk_buff queue) into which each newly allocated session is inserted, and this is where the load balancer looks for its food. When a new session is being prepared for insertion, acrypto calls the load balancer's ->find_device() method, which should return a suitable device if one exists (the current simple_lb load balancer returns the device with the lowest load, i.e. the device with the fewest sessions in its queue). Once a crypto_device has been returned, acrypto creates a new crypto routing entry which points to that device and adds it to the crypto session routing queue. The crypto session is inserted into the device's queue according to its priority, and it is the crypto device driver that should process its session list according to session priority.
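
A least-loaded selection in the spirit of simple_lb might look like the sketch below. It is a userspace illustration only: the struct fields, the array-based device table and the function signature are assumptions, not the real ->find_device() prototype.

```c
#include <stddef.h>

struct crypto_device {               /* illustrative fields only */
    const char *name;
    int broken;                      /* driver marked itself unusable */
    unsigned int queue_len;          /* sessions currently queued */
    unsigned int max_queue_len;
};

/* Return the usable device with the fewest queued sessions, or NULL
 * when every device is broken or full. */
struct crypto_device *find_device(struct crypto_device *devs, size_t n)
{
    struct crypto_device *best = NULL;

    for (size_t i = 0; i < n; i++) {
        struct crypto_device *d = &devs[i];

        if (d->broken || d->queue_len >= d->max_queue_len)
            continue;
        if (!best || d->queue_len < best->queue_len)
            best = d;
    }
    return best;
}
```

Queue length is a cheap proxy for load; a balancer that knows more about the devices could weigh them differently without changing the callers.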

All insertions and deletions are guarded by appropriate locks, but session_list traversal is not guarded in crypto_lb_thread(), since by design a session can be removed _only_ from that function. So if the crypto device (atomically) marks a session as completed and not being processed, and uses list_for_each_safe() to traverse its queue, all should be OK.

Each crypto load balancer must implement 2 methods, ->rehash() and ->find_device(), which can be called from any context and under a spinlock.
->rehash() should be called to redistribute crypto sessions among device queues. For example, if a driver decides that its device is broken, it marks itself as broken, and the load balancer (or scheduler, if you like) should move all sessions from that queue to other devices. If a session can not be completed, the scheduler must mark it as broken and complete it (by calling first broke_session(), then complete_session() and stop_process_session()). The consumer must check whether the operation was successful (i.e. that the session is not broken).
->find_device() should return an appropriate crypto device.
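
The two-method interface and the rehash behaviour described above can be modelled as follows. Everything here is a hypothetical userspace sketch: the struct layouts and signatures are invented, locking is omitted, and the `broken` and `completed` fields merely stand in for the effect of broke_session() and complete_session().

```c
#include <stddef.h>

struct session {
    int broken;
    int completed;
    struct session *next;
};

struct crypto_device {
    int broken;
    unsigned int queue_len;
    struct session *queue;
};

struct crypto_lb {                   /* hypothetical layout of the interface */
    void (*rehash)(struct crypto_device *devs, size_t n);
    struct crypto_device *(*find_device)(struct crypto_device *devs, size_t n);
};

static struct crypto_device *lb_find_device(struct crypto_device *devs, size_t n)
{
    struct crypto_device *best = NULL;

    for (size_t i = 0; i < n; i++)
        if (!devs[i].broken && (!best || devs[i].queue_len < best->queue_len))
            best = &devs[i];
    return best;
}

/* Drain every broken device; sessions with nowhere to go are failed. */
static void lb_rehash(struct crypto_device *devs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct crypto_device *d = &devs[i];

        if (!d->broken)
            continue;
        while (d->queue) {
            struct session *s = d->queue;
            struct crypto_device *to;

            d->queue = s->next;
            d->queue_len--;
            to = lb_find_device(devs, n);
            if (to) {
                s->next = to->queue; /* head insert: order is not preserved */
                to->queue = s;
                to->queue_len++;
            } else {
                s->next = NULL;
                s->broken = 1;       /* models broke_session() */
                s->completed = 1;    /* models complete_session() */
            }
        }
    }
}

const struct crypto_lb simple_lb = { lb_rehash, lb_find_device };
```

A consumer would then look at the broken flag in its callback, as the text requires, before trusting the result.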

For a crypto session to be successfully allocated, the crypto consumer must provide two structures: struct crypto_session_initializer (hmm, why only one z?) and struct crypto_data. struct crypto_session_initializer contains the data needed to find an appropriate device: type of operation, mode of operation, some flags (for example SESSION_BOUND, which means that the session must be bound to the crypto device specified in the bdev field; this is useful for TCPA/TPM), session priority, and the callback which will be called after all routes for the given session have finished.
struct crypto_data contains scatterlists for src, dst, key and iv. It also has a void *priv field and its size; this area is allocated and may be used by any crypto agent (for example, the VIA PadLock driver uses it to store the aes_ctx field, and a crypto_session can use it to store pointers needed in ->callback()).
Actually the callback will be called from queue_work, but I suppose it is better not to assume a calling context.
->callback() will be called after all crypto routing for the given session is done, with the same parameters as were provided at initialisation time (if the session has only one route, the callback is called with the original parameters, but if it has several routes, the callback is called with the parameters from the last processed one). I believe the crypto callback should not know about crypto sessions, routes, devices and so on; proper restriction is always a good idea.
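
How SESSION_BOUND interacts with device selection can be sketched like this. Only the names crypto_session_initializer, SESSION_BOUND and bdev come from the text; the remaining field names, the flag value and the pick_device() helper are guesses made for illustration.

```c
#include <stddef.h>

#define SESSION_BOUND 0x1u           /* name from the text; value is a placeholder */

struct crypto_device {
    int id;
    unsigned int queue_len;
};

/* Hypothetical layout: only the kinds of fields are taken from the text. */
struct crypto_session_initializer {
    int operation;
    int mode;
    int priority;
    unsigned int flags;
    struct crypto_device *bdev;      /* honoured when SESSION_BOUND is set */
    void (*callback)(void *priv);    /* runs after the last route finishes */
};

/* Bound sessions go to bdev; all others go to the least loaded device. */
struct crypto_device *pick_device(const struct crypto_session_initializer *ci,
                                  struct crypto_device *devs, size_t n)
{
    struct crypto_device *best = NULL;

    if (ci->flags & SESSION_BOUND)
        return ci->bdev;             /* binding bypasses load balancing */

    for (size_t i = 0; i < n; i++)
        if (!best || devs[i].queue_len < best->queue_len)
            best = &devs[i];
    return best;
}
```

Binding matters for devices such as a TPM, where the key material lives in one specific chip and no other device can stand in for it.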

Crypto routing.
This feature allows the same session to be processed by several devices/algorithms. For example, if you need to encrypt data and then sign it in a TPM device, you can create one route to the encryption device and then route the session to the TPM device; this can also be used for tweakable cipher encryption.
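
Routing can be pictured as a chain of stages with a single callback at the end. The sketch below is a userspace toy: the struct layouts and session_step() are invented for illustration, and the two stages (an XOR and an increment) are stand-ins for real devices, not crypto.

```c
#include <stddef.h>

struct crypto_session;

struct crypto_route {
    void (*process)(unsigned char *buf, size_t len);  /* stand-in for a device */
    struct crypto_route *next;
};

struct crypto_session {
    struct crypto_route *current;    /* next route to run */
    unsigned char *buf;
    size_t len;
    void (*callback)(struct crypto_session *s);
    int done_calls;
};

/* Run one route. Returns 1 once the whole session is finished; the
 * callback fires exactly once, after the last route. */
int session_step(struct crypto_session *s)
{
    if (!s->current)
        return 1;                    /* already finished, no second callback */
    s->current->process(s->buf, s->len);
    s->current = s->current->next;
    if (s->current)
        return 0;
    s->callback(s);
    return 1;
}

/* Toy stages standing in for "encrypt" and "sign". */
void xor_5a(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= 0x5a;
}

void add_one(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] += 1;
}

void count_done(struct crypto_session *s)
{
    s->done_calls++;
}
```

In the real layer each step would be triggered by a device completing its part asynchronously; here the caller drives the steps by hand to keep the example small.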

Crypto device.
It can be either a software emulator or a hardware accelerator chip (like HIFN 79*/83* or VIA PadLock ACE/RNG, or even a TPM device such as every IBM ThinkPad and some HP laptops have (gentle hint: _they_ even have _windows_ software for them :) )). It can be registered with the asynchronous crypto layer and must provide some data for it:
->data_ready() method: it is called each time a new session is added to the device's queue.
An array of struct crypto_capability and its length: struct crypto_capability describes each operation the given device can handle, and has a maximum session queue length parameter. Note: this structure can [be extended to] include a "rate" parameter to show the absolute speed of the given operation in some units, which the scheduler (load balancer) can then use for proper device selection. Actually the queue length can already reflect the device's "speed" to some extent.
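
What a device hands to the layer can be modelled as a capability table plus a ->data_ready() hook. In this userspace sketch the field names, find_capability() and queue_session() are assumptions; only the ideas of a capability array, a maximum queue length and ->data_ready() come from the text.

```c
#include <stddef.h>

struct crypto_capability {           /* one entry per supported operation */
    int operation;
    int mode;
    unsigned int max_qlen;           /* maximum session queue length */
};

struct crypto_device {
    const struct crypto_capability *caps;
    size_t cap_num;
    unsigned int queue_len;
    unsigned int kicks;              /* how often data_ready() ran */
    void (*data_ready)(struct crypto_device *dev);
};

const struct crypto_capability *find_capability(const struct crypto_device *dev,
                                                int operation, int mode)
{
    for (size_t i = 0; i < dev->cap_num; i++)
        if (dev->caps[i].operation == operation && dev->caps[i].mode == mode)
            return &dev->caps[i];
    return NULL;
}

/* Queue a session if the device supports it and has room, then notify it. */
int queue_session(struct crypto_device *dev, int operation, int mode)
{
    const struct crypto_capability *cap = find_capability(dev, operation, mode);

    if (!cap || dev->queue_len >= cap->max_qlen)
        return -1;
    dev->queue_len++;
    dev->data_ready(dev);            /* called once per newly queued session */
    return 0;
}

/* Example hook that only counts notifications. */
void count_kick(struct crypto_device *dev)
{
    dev->kicks++;
}
```

A "rate" field as suggested in the note would slot into struct crypto_capability next to max_qlen.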

Acrypto has full userspace support, through ioctl and through direct access to the process' vmas and pages. The first mechanism uses ioctl() with 2 copies of the data, from and to userspace.
Session processing consists of 3 major parts:
1. Session creation. CRYPTO_SESSION_ALLOC ioctl.
The user must provide a special structure which holds the src, dst, key and iv data sizes and the crypto initializer (crypto operation, mode, type and priority).
2. Data filling. The user must call several CRYPTO_FILL_DATA ioctls.
Each one requires the data size and data type (struct crypto_user_data) and the data itself.
3. Finish. The user must call the CRYPTO_SESSION_ADD ioctl with a pointer to the area where the crypto result must be stored.
The latter ioctl will sleep while the session is being processed.
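
The call order of the three steps can be captured in a small state model. This is not the real ioctl interface: the request numbers and structure layouts are not given here, so the sketch uses plain functions named after the steps, and all types and names in it are hypothetical. It only assumes that src, key and iv are the pieces filled from userspace, while dst is the result area passed at the end.

```c
#include <stddef.h>

enum { DATA_SRC, DATA_KEY, DATA_IV, DATA_TYPES };   /* hypothetical type tags */

struct model_session {
    int allocated;
    size_t want[DATA_TYPES];         /* sizes declared at allocation */
    size_t got[DATA_TYPES];          /* bytes supplied so far */
};

/* Step 1, models CRYPTO_SESSION_ALLOC: declare the data sizes. */
int model_alloc(struct model_session *s, size_t src, size_t key, size_t iv)
{
    s->allocated = 1;
    s->want[DATA_SRC] = src;
    s->want[DATA_KEY] = key;
    s->want[DATA_IV] = iv;
    for (int t = 0; t < DATA_TYPES; t++)
        s->got[t] = 0;
    return 0;
}

/* Step 2, models CRYPTO_FILL_DATA: may be called several times per type. */
int model_fill(struct model_session *s, int type, size_t size)
{
    if (!s->allocated || type < 0 || type >= DATA_TYPES)
        return -1;
    if (size > s->want[type] - s->got[type])
        return -1;                   /* more data than was declared */
    s->got[type] += size;
    return 0;
}

/* Step 3, models CRYPTO_SESSION_ADD: only valid once everything is filled. */
int model_add(struct model_session *s)
{
    if (!s->allocated)
        return -1;
    for (int t = 0; t < DATA_TYPES; t++)
        if (s->got[t] != s->want[t])
            return -1;
    s->allocated = 0;                /* session handed over for processing */
    return 0;
}
```

In the real interface the last step blocks until the session has been processed; the model returns immediately because it only checks ordering.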

The second userspace communication mechanism is based on direct access to the process' vmas and pages from acrypto; pointers are transferred using a special kernel connector structure.
Obviously it can not be used with most hardware or with sizes larger than one page, but I like the idea itself.

Proof of concept userspace code can be found in archive.

Acrypto supports input/output ESP4 IPsec processing, dm-crypt has been ported to acrypto, and several hardware chips are supported; all drivers can be found in archive.

One can find benchmark results of acrypto+asynchronous block device vs. cryptoloop vs. dm-crypt at bd page.

My future acrypto prognosis.

The latest version can be found here.
Here one can find little README about supported hardware.
Here one can browse acrypto capable drivers tree.