|
|
Network tree allocator.
|
Developement status can be tracked here.
Network tree allocator can be used to allocate memory for all network
operations from any context. Main designed features are:
- reduced fragmentation (self defragmentation)
- possibility to create zero-copy sending and receiving (ground work for userspace netchannels and
high-performance sniffers)
- greater than SLAB speed
- full per CPU allocation and freeing (objects are never freed on different CPU)
- dynamically grown cache
- separate network allocations from main system's ones
Design notes.
Original idea was to store meta information used for allocation in an
AVL tree,
but since I found a way to use some "unused" fields in struct page,
tree is unused in the allocator.
Terms.
node - container for per-page (of different order) meta information.
chunk - region of free or allocated memory.
Each allocation is aligned to the minimum allocation size (L1 cache line size bytes).
Each node contains a bitmask of used/unused chunks of memory for memory
region bound to that node. Each free chunk is placed into double linked
list inside array, indexed of size (divided by minimal allocation size).
Allocation algo.
When user requests some memory regiosn, it's size is rounded upto
minimum allocation size (instead of power of two in SLAB) and
appropriate entry in array of lists of free chunks is selected. If that
list contains free elements, first one is returned, otherwise next list
is selected with bigger size until non-empty list is found.
Then allocator determines a node for appropriate chunk (using fields in
corresponding struct page->lru), and bits corresponding to selected
chunk are marked as used. If chunk had bigger size than requested, the
rest is placed back into the list for appropriate size.
Thus allocator gets requested memory aligned to minimum allocation size.
Freeing algo.
When user frees chunk of the memory, appropriate node and allocation CPU
are selected for given memory regions (by dereferencing page->lru
fields). If current CPU does not correspond to allocation CPU, then
chunk is dereferenced into single-linked list entry and placed into list
of semi-free objects for freeing on original (allocation) CPU.
If freeing CPU is the same as allocation one, appropriate node is
checked and neighbours for being freed chunk are found. Free
neighbour(s) are then dereferenced into double linked list entry and
removed from appropriate lists of free elements. Then all free chunks
are combined into one and appropriate bitmask is updated in the node.
Chunk of (possibly) bigger size then is dereferenced into double-linked
list and placed into list of free objects for appropriate size.
Then freeing algo checks if list of "semi-free" objects (objects which
were started to be freed on different CPUs, but should be freed actually
on current one) is not empty, and if so, all chunks from that list are
freed as described above.
All above lists and arrays are not accessed by different CPUs, except
"semi-freed" list, which is accessed under appropriate lock (each CPU
has it's own lock and list).
Defragmentation is a part of freeing algorithm and initial fragmentation
avoidance is being done at allocation time by removing power-of-two
allocations. Rate of fragmentation can be found in some userspace
modlling tests being done for both power-of-two SLAB-like and NTA
allocators.
For initial test I've run NTA with set of pseudo-random sized allocations
until first allocation fails, when it hapens I decrease maximum allocation
size in two times. Each graph below shows free and used chunks
inside each page (there are 4094 pages), green points correspond
to free and red ones - to used chunks (each one of 32 bytes).
Maximum allocation size is equal to PAGE_SIZE, failed allocation was for 1912 bytes
(60 chunks of 32 bytes):

Maximum allocation size is equal to PAGE_SIZE/2 (decreased after allocation failure),
failed allocation was for 968 bytes (31 chunks of 32 bytes):

Maximum allocation size is equal to PAGE_SIZE/4 (decreased in two times after first and second
allocation failures), last failed allocation was for 504 bytes (16 chunks of 32 bytes):

Maximum allocation size is equal to PAGE_SIZE/8 (decreased in two times after each of three
allocation failures), last failed allocation was for 252 bytes (8 chunks of 32 bytes):

This tests do not show how fragmentation is changed with the time, when there are a lot of allocations and
freeings are completed, but even existing results show that network tree
allocator performs very well. Next time I will run the same tests
after some pseudo-random allocation and freeing periods.
For comparison I've run the same test with power-of-2 slab-like allocator (actually
it is much more simple, but it has the same ideas as SLAB allocator and
probably can behave even better if we get into account big-sized chunks).
Picture does not change when maximum allocation size
is being decreased after allocation failures, since most of the
overhead and fragmentation is obtained from power of 2 rounds.

This SLAB-like power-of-two allocator overhead and fragmentation actually looks different
than on the picture, since almost all allocations have fragmentation overhead,
so each vertical line actually must contain several red(used)-green(free or fragmentation overhead)
pieces, where sum of all pieces of the same colour will be equal to what is shown
on the picture. But picture presents that absolute amount of fragmentation overhead
is extremely high for power-of-2 allocators. For the real SLAB allocator picture
will be better for small-sized chunks (since chunks never share pools with different sized ones,
except when they can steal pages when cache is refilled),
but much worse for big-sized ones.
Difference in used and free chunks position on the pictures is due to
the fact, that in network tree allocator chunks in page are shown
in reverse order (i.e. higher addresses are first).
Benchmarks with trivial epoll based web server showed noticeble (more
than 40%) imrovements of the request rates (1600-1800 requests per
second vs. more than 2450 ones). It can be described by more
cache-friendly freeing algorithm, by tighter objects packing and thus
reduced cache line ping-pongs, reduced lookups into higher-layer caches
and so on.
Design of allocator allows to map all node's pages into userspace thus
allows to have true zero-copy support for both sending and receiving
dataflows. Based on this ability zero-copy sniffer and sending programm
can be found in archive.
As described in recent threads [1] it is also possible to eliminate any
kind of main system OOM influence on network dataflow processing, thus
it is possible to prevent deadlock for systems, which use network as
memory storage (swap over network, iSCSI, NBD and so on).
Network allocator allows to dynamically grow it's cache when there is
strong demand on this.
Things in TODO list:
- implement userspace mapping and move netchannels into userspace
1. Discussions about deadlock prevention in network based storage devices.
http://thread.gmane.org/gmane.linux.kernel/434411/focus=434411
http://thread.gmane.org/gmane.linux.kernel/435986/focus=435986
http://thread.gmane.org/gmane.linux.kernel.mm/11682/focus=11682
Patch is available in archive.
|
|
|