Zbr's days.
September
Sun Mon Tue Wed Thu Fri Sat
  6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30        
2008
Months
Sep
Oct Nov Dec

About TODO Blog RSS Old blog Projects Gallery Notes

Sat, 01 Sep 2007

Memory defragmentation.

Christoph Lameter from SGI has posted a patchset aimed to implement memory defragmentation in his SLUB memory allocator.
Main idea of the new version is to find pages, which are in the slab, but are not referenced, to free them and combine into bigger chunks (compound pages).

SLAB/SLUB/SLOB still does not support in-page defragmentation, which was one of the main issues in my network (tree) allocator, which combined any objects close to each other, so that allowed bigger allocations, but not only pages into compound pages (which is can do too). Main feature of the network allocator was the idea, that objects should never be freed on the different CPU than where it was allocated. In SLAB this approach is never used and objects can be freed on different than allocation CPUs which leads to the fragmentation.

/devel/networking/nta :: Link / Comments (0)


Fri, 20 Apr 2007

Power-of-two allocators.


Eric Dumazet has rised an interesting question about existing power-of-two allocator related to no-mmu implementation of the - is it possible to allocate higher-order page and then return part of it as unused (for example if someone has allocated 10-order page and then return 8-order part).
As far as I can tell (I'm not absolutely sure though) it is impossible with SLAB one - each page can only be 'split' into the same-sized parts, so either 10-order, or two 9-order, or 4 8-order, but not one 8-order and one 10-order minus 8-order.
That was one of the reasons I created network allocator, which I proved does fix such power-of-two overhead in the single page, i.e. blocks are combined when freed to form bigger one, and it is possible to allocate exactly requested block not aligned to power-of-two boundary.
But my allocator did not get enough attention (did I say that already for something unrelated? :), so it was a bit postponed.
Let's see, if there will be some interesting suggestions in the thread.

Update: David Howells of RedHat seems to be sure, that it is possible to allocate an order-10 page, then release part of it (say an order-8 subpage). But from the whole thread it seems that he says about no-mmu case, which can work on top of SLOB allocator in some embedded system.

/devel/networking/nta :: Link / Comments (0)


Wed, 13 Dec 2006

Avoiding - and fixing - memory fragmentation.


Interesting article in LWN about possible solutions for avoiding and fixing memory fragmentation.
This issue was main goal of the network tree allocator, which among others has ability to perform zero-copy sending and receiving, which I plan to integrate into netchannels.
Main difference is the fact, that network tree allocator combines chunks inside pages including composed pages, while solution proposed by Mel Gorman is to combine pages.

/devel/networking/nta :: Link / Comments (0)


Wed, 06 Sep 2006

Network allocator is now configurable.


I.e. one can select in kernel config to compile it or not, the same is applied to zero-copy sniffer, which allows to remove sniffer's overhead and fix some other mentioned temporal limitations.

/devel/networking/nta :: Link / Comments (0)


Sat, 19 Aug 2006

Zero-copy networking.


Initial zero-copy implementation is receiving side for zero-copy sniffer based on network allocator.

/devel/networking/nta :: Link / Comments (0)


Fri, 18 Aug 2006

Towards full zero-copy network support.


I've started zero-copy sniffer implementation, which is quite straightforward - each node contains bitmask of free/used chunks and bitmask of mapped (and used) to userspace chunks, when some area is mapped and is marked as being used and is going to be freed, freeing algorithm checks if it can do it or not, so freeing actually can be postponed (for arbitrary long time). Userspace reads from special char device set of structures which show allocated pointers and theirs sizes, so it can access raw data. Writing the same structures to that char device marks appropriate chunks of memory as mmaped but unused, so it can be freed when needed. Mmap itself is not implemented yet.

/devel/networking/nta :: Link / Comments (0)


Wed, 16 Aug 2006

Network allocator.


I've released second version of network allocator and sent it to mail lists for review. Short changelog:

  • added dynamically grown cache
  • changed some inline issues
  • reduced code size
  • removed AVL tree implementation from the sources
  • changed minimum allocation size to l1 cache line size (some arches require that)
  • removed skb->__tsize parameter
  • added a lot of comments
  • a lot of small cleanups

As usual patch is available in archive.

/devel/networking/nta :: Link / Comments (0)


Tue, 15 Aug 2006

Network allocator.


After some cleanups it is possible to achieve more than 2460 requests per second with trivial epoll based web server on system with network allocator instead of usual kmalloc/SLAB one for network payload data (for reference: system with kmalloc/SLAB allocator can only handle 1600-1800 requests per second).

/devel/networking/nta :: Link / Comments (0)


Mon, 14 Aug 2006

Network tree allocator homepage.


I've create one here. It includes design description, benchmarks, TODO items and all related information.

NTA implementation with design notes has been sent for review. This work was supposed to be funded by external company, but since they dissapeared I will release it in a way I want.
Patch is available in archive.

/devel/networking/nta :: Link / Comments (0)


Network tree allocator.


Weekend was quite productive: I've completed per CPU support for NTA, so it is fully per-cpu right now excecpt one case when freeing happens on different CPU than original allocation, in that case I put a chunk into queue to be freed on original CPU.

I've also added support for combined pages, so it is possible to allocate upto 16k on x86 with netwrok tree allocator right now.

While hacking on NTA I've decided to completely drop tree from the allocator, since struct page has enough place to put there a pointer to the node. I'm also working on removing so called container cache for network tree allocator (container is a structure which holds free chunks in a list), so when that tasks are completed I will do first release. I expect it to be done today.

Ok, I've removed container cache entirely, so neither allocation, nor freeing requires any kind allocation anymore (sounds really crazy, but it is).

There is some problem with extensive struct page usage in the network tree allocator - combined pages use page->private member as a pointer to the head of combined pages, while it is a spinlock_t for mapping code, so it is impossible to map combined pages and mappind destroys combining, so I need to create some tricks with page->lru instead of stock combining usage.

Here we go: when chunk of memory is free, it is stored in special LIFO list, since it is free, it is possible to dereference it into list entry itself without any kind of containers around it, since each chunk is at least 32bytes long (it should be L1 cache size actually), it is possbile to store there double linked entry, so removing as long as lookup of that entry takes O(1) (lookup is just a dereferencing of the pointer into list entry).
Since each page->lru has two pointers unused (well, they are used in by kernel, but since allocator is not supposed to return it's pages to the kernel, it is perfectly ok to overwrite them), I placed there a pointer to the node and a cpu number where that page was allocated. So freeing just gets that pointers and checks if CPU it runs on differs from allocation one or not, in case it is the same CPU, node is obtained from page->lru and appropriate neighbour pointers are calculated, which are then dereferenced into struct list_head and removed from appropriate lists. Pointers are combined and thus fragmentation is greatly reduced.

/devel/networking/nta :: Link / Comments (0)


Fri, 11 Aug 2006

Network tree allocator.


Scalability issues.
SLAB allocator is essentially per-cpu - memory being freed stays on the CPU which calls kfree() even if it was not originally allocated on that CPU. From one point of view it is bad (the same address must live in allocation and freeing CPUs caches and so on), but from other point it is very good, since allocator becomes lock-free. Since SLAB allocator by design can only contain chunks of memory of predefined size even from completely different pages, it can not perform any kind of fragmentation avoidance.
Network tree allocator was designed to be able to combine neighbour chunks into region of bigger size, so when freeing happens allocator will search for neighbours. So if NTA will become per-cpu, allocator must search for neighbours not on freeing CPU, but on CPU which was used for allocation, and since it is possible to simultaneously free different chunks which were originally allocated on the same CPU, there must exist some locking between them. Since freeing allows to change allocation state - i.e. some chunks of free memory can be removed and combined with other chunks, freeing logic must lock part of allocation logic (so allocator would not get chunk which is going to be combined with currently being freed one), so basically we need to introduce at least two locks - per free list (all free chunks are combined into FIFO lists) and per node (since the same node can contain chunks of the different sizes which can be simultaneously freed on different CPUs). Such complex locking can not be cheap, and the worst thing is that each node must contain a lock, which increases it's size from 12 to 36 bytes when debugging is turned off and thus does not fit into single cache line on a lot of arches. Decision to combine chunks only when freeing happens on the same CPU as allocation is not considered, since it is unlikely condition, so it will lead to constant increase of fragmentation.
As practice shows this solution is bad, since there is a problem with locking - allocation path locks list of free objects, gets free chunk, drops free list lock, locks corresponding node, updates node's bitmask, drops node lock; while freeing path gets node from freeing pointer, locks that node, updates it's bitmask, locks list of free objects of one neighbour, searches for that neighbour, drops the lock, locks list of objects for the next neighbour, searches for that neighbour, drops the lock and finally drops the lock for node. This approach has a race.

Interesting idea is not to free objects if freeing happens on different CPU than allocation, and put free object into queue for freeing on the original CPU. When CPU, where allocation originally happend, is going to perform next freeing or allocation, it can combine those batched objects.
In this scheme there is only tiny locking place when object being freed is going to be placed or removed from queue of "semi-free" objects (i.e. queue of objects allocated on different CPU and thus scheduled for freeing there).

/devel/networking/nta :: Link / Comments (0)


Wed, 09 Aug 2006

Network tree allocator performance test.


I've run epoll based web server from kevent testbed and got 2301 requests per second, while with usual code it is about 1600-1800 requests per second.
It can be explained by tons of reasons, but this test clearly shows that network tree allocator can behave not worse and maybe better than usual slab one (all debugging options are turned off) for network traffic allocations.

This test (single-threaded web server and httperf as client) has been run without any SMP performance tuning (and I have one gigantic lock right now around all allocations and freeings, but do not think too bad about my mental abilities, it will be completely eliminated after per-cpu tuning is completed (similar to how it is implemented in SLAB-allocator)).

All changes in the core network stack (not including allocator itself) conains of *kmalloc()/kfree() replacement in *alloc_skb()/skb_release_data() and addition of a new field into struct sk_buff which holds total allocation size, since ->totalsize variable can be changed while skb is being processed in the kernel.

/devel/networking/nta :: Link / Comments (0)


Network tree allocator.


While thinking some more about generic tree and hash table data representation, I've come to the conclusion, that tree should be more appropriate case for the structures which can dynamically grow/shrink with the time. For example with netwrok tree allocator it is trivial task to add new memory into the cache, and it is easy task to remove pages (but not trivial, since AVL-tree removing algo is very complex, although fast (and to be 100% honest with the reader, I want to note, that I did not implement it for NTA)), so memory hotplug and various OOM conditions can be handled much more nicely than with table based approach where parts of the table must be relocated.

The same issue comes in mind with recent changes in network hash tables manipulations - table dynamic grow/shrink sometimes requires the whole table relocation, which can be extremely large. As far as I recall there was a discussion about tree vs. table approach and the later was selected, but I do not recall any details already. Well, maybe it's time to reimplement the thing... At least for upcoming fast NAT rework I plan to use trees instead of hash tables to store NAT entries, and since most of my work looks for the most people like researching-only (it is not actually) projects far from reality (only two of them are in the kernel tree), I can create any crazy schemes I like.

/devel/networking/nta :: Link / Comments (0)


Tue, 08 Aug 2006

Network tree allocator.


I've moved it into the kernel and made all network traffic to be allocated using it. It is not tuned for SMP performance yet (it requires some per-cpu-alike magic), NTA does not support cache grow when there is requirement for that and context allows and there are no interfaces for the zero-copy networking yet, but the most complex part has been implemented already, although there are some bugs there yet.

After I complete SMP tuning I will run some performance tests and start sending and receiving zero-copy network stack implementation.

/devel/networking/nta :: Link / Comments (0)


Mon, 07 Aug 2006

Network tree allocator.


Additional 100 milliards of allocations have been done for network tree allocator. It's time to move it into the kernel.

While hacking on NTA I've created special SLAB-like 3 layer cache for struct avl_container - special structures used to store pointers to free chunks inside special crafted FIFO lists.

Now there is only following allocation being done using Linux memory allocation primitives:

  • initial storage structures for AVL trees (i.e. pages of data (which will be reused by tree allocator) and tree nodes) and array of lists of free chunks
  • container cache layers (l1 and l2 are pages, l3 is element of the list, which should be allocated very rarely), which are only allocated when appropriate layer is empty

So in run-time there are no allocations from main memory except rare page-sized allocations to refill container cache. As expected after some short period of time container cache stopped to grow.

Interesting note that after switching to cache allocator from usual malloc()/free() for containers general allocation speeds has increased.

/devel/networking/nta :: Link / Comments (0)


Sun, 06 Aug 2006

Network tree allocator.


Let's see how fragmentation problem is being solved in NTA.
For initial test I've run NTA with set of pseudo-random sized allocations until first allocation fails, when it hapens I decrease maximum allocation size in two times. Each graph below shows free and used chunks inside each page (there are 4094 pages), green points correspond to free and red ones - to used chunks (each one of 32 bytes).
Maximum allocation size is equal to PAGE_SIZE, failed allocation was for 1912 bytes (60 chunks of 32 bytes):
Fragmentation graph

Maximum allocation size is equal to PAGE_SIZE/2 (decreased after allocation failure), failed allocation was for 968 bytes (31 chunks of 32 bytes):
Fragmentation graph

Maximum allocation size is equal to PAGE_SIZE/4 (decreased in two times after first and second allocation failures), last failed allocation was for 504 bytes (16 chunks of 32 bytes):
Fragmentation graph

Maximum allocation size is equal to PAGE_SIZE/8 (decreased in two times after each of three allocation failures), last failed allocation was for 252 bytes (8 chunks of 32 bytes):
Fragmentation graph

This tests do not show how fragmentation is changed with the time, when there are a lot of allocations and freeings are completed, but even existing results show that network tree allocator performs very well. Next time I will run the same tests after some pseudo-random allocation and freeing periods.

For comparison I've run the same test with power-of-2 slab-like allocator (actually it is much more simple, but it has the same ideas as SLAB allocator and probably can behave even better if we get into account big-sized chunks). Picture does not change when maximum allocation size is being decreased after allocation failures, since most of the overhead and fragmentation is obtained from power of 2 rounds.
SLAB-like power-of-two allocator overhead and fragmentation

This SLAB-like power-of-two allocator overhead and fragmentation actually looks different than on the picture, since almost all allocations have fragmentation overhead, so each vertical line actually must contain several red(used)-green(free or fragmentation overhead) pieces, where sum of all pieces of the same colour will be equal to what is shown on the picture. But picture presents that absolute amount of fragmentation overhead is extremely high for power-of-2 allocators. For the real SLAB allocator picture will be better for small-sized chunks (since chunks never share pools with different sized ones, except when they can steal pages when cache is refilled), but much worse for big-sized ones.

Difference in used and free chunks position on the pictures is due to the fact, that in network tree allocator chunks in page are shown in reverse order (i.e. higher addresses are first).

/devel/networking/nta :: Link / Comments (0)


Sat, 05 Aug 2006

Network tree alocator.


First test stage completed - more than 52 milliards of allocations of different sizes (currently tested only from 1 to PAGE_SIZE with 32 bytes granularity) have been done, so roughly it is correct. To prove correctness more I plan to start second testing stage, which will include much faster freeing (test will keep pointers to allocated objects inside array, which indexes will be structured to not contain any gaps) and full utilisation of allocated page pools, so if allocation fails, it will be restarted with smaller size until all allocator's memory is used. That will give an interesting statistics of memory usage and fragmentation in tree allocator. I will also include periodical dump of bitmasks of free and used objects so it would be possible to visually observe fragmentation issues.

I've found a way to determine if given address belongs to PAGE_SIZEd chunk or to bigger contiguous region - it is quite simply by looking at page->lru.next and page->private, which can be used to detect compound pages and it's order, which solves a problem of converting a freed address into a page which holds freed area.

/devel/networking/nta :: Link / Comments (0)


Fri, 04 Aug 2006

Network tree allocator.


I've started stress testing for new allocator.
First one is quite simple - system tries to allocate a lot of chunks of random size, when there is no memory or number of allocated chunks exceeds some threshold (1 million allocations), system starts to free them one by one from the begining. It is quite slow test, since test's freeing logic (do not confuse with freeing logic inside allocator) runs through the whole array (currently it contains 1 million entries) of allocated chunks and tries to free them all.

/devel/networking/nta :: Link / Comments (0)


Wed, 02 Aug 2006

Network tree allocator.


Userspace model has been completed.
I have not run stress tests yet, but it already can allocate set of objects and combine them back when they are freed. Let's see an example:

  • allocate 120 bytes
  • allocate 200 bytes
  • allocate 70 bytes

  • free 120 bytes
  • free 70 bytes
  • free 200 bytes
This ends up in the following sequence:
  • get one page
  • split it to 120 and PAGE_SIZE-120 bytes parts
  • mark PAGE_SIZE-120 bytes part as free and move it's container into new list
  • split PAGE_SIZE-120 into 200 and PAGE_SIZE-120-200 bytes parts (second allocation)
  • mark PAGE_SIZE-120-200 bytes part as free and move it's container into new list
  • split PAGE_SIZE-120-200 into 70 and PAGE_SIZE-120-200-70 bytes parts (third allocation)
  • mark PAGE_SIZE-120-200-70 bytes part as free and move it's container into new list

  • free first chunk (120 bytes) - it was first in the page and it does not have free neighbours (above 200 bytes allocation was done right after this chunk and it is used)
  • allocate new container and add it into the list for 120 bytes (aligned to AVL_MIN_SIZE actually) free objects
  • free third chunk (70 bytes) - it has a big free area at the left (note that I'm talking about little endian here), of PAGE_SIZE-120-200-70 bytes and second allocated (currently used) chunk of 200 bytes at the right, so this freeing reuse container from the left chunk and move it into the list for PAGE_SIZE-120-200-70+70 bytes
  • free 200 bytes chunk - it has both free neighbours: with PAGE_SIZE-120-200-70+70 bytes and 120 bytes. So it will reuse container for 120 bytes, so it results in PAGE_SIZE chunk.

For those who likes to look at unknown logs and symbols (like I do, especially new dmesgs), here is debug output from initial implementation of network tree allocator, produced by described above steps:
PAGE_SIZE: 4096, max nodes: 4094, node size: 32.
avl_update_node: node: 0x523800, value: 02554000, ptr: 0x2554000, cpos: 127, size: 128, num: 4.
avl_fill_bits: num: 4, pos: 0, idx: 0, p: f, start: 0, stop: 4, fffffffffffffff0.
avl_update_node: reuse container 0x2555f60 in pos 123 with ptr 0x2554080.
main: allocated ptr: 0x2554000, size: 120.

avl_update_node: node: 0x523800, value: 02554000, ptr: 0x2554080, cpos: 123, size: 224, num: 7.
avl_fill_bits: num: 7, pos: 4, idx: 0, p: 7f0, start: 4, stop: 11, fffffffffffff800.
avl_update_node: reuse container 0x2555f60 in pos 116 with ptr 0x2554160.
main: allocated ptr: 0x2554080, size: 200.

avl_update_node: node: 0x523800, value: 02554000, ptr: 0x2554160, cpos: 116, size: 96, num: 3.
avl_fill_bits: num: 3, pos: 11, idx: 0, p: 3800, start: 11, stop: 14, ffffffffffffc000.
avl_update_node: reuse container 0x2555f60 in pos 113 with ptr 0x25541c0.
main: allocated ptr: 0x2554160, size: 70.

avl_free: ptr: 0x2554000 [02554000], pos: 0, sbits: 4, size: 120.
avl_fill_bits: num: 4, pos: 0, idx: 0, p: f, start: 0, stop: 4, ffffffffffffc00f.
avl_combine: lp: (nil), lbits: 0, lc: (nil), rp: (nil), rbits: 0, rc: (nil), 
	current: ptr: 0x2554000, bits: 4, combined: ptr: 0x2554000, idx: 3, cont: (nil).
avl_combine: Added new container for pointer 0x2554000, size: 128.
main: freed ptr: 0x2554000.

avl_free: ptr: 0x2554160 [02554000], pos: 11, sbits: 3, size: 70.
avl_fill_bits: num: 3, pos: 11, idx: 0, p: 3800, start: 11, stop: 14, fffffffffffff80f.
avl_free: found free left neighbour at 0x25541c0, bits: 114.
avl_combine: lp: 0x25541c0, lbits: 114, lc: 0x2555f60, rp: (nil), rbits: 0, rc: (nil), 
	current: ptr: 0x2554160, bits: 3, combined: ptr: 0x2554160, idx: 116, cont: 0x2555f60.
avl_combine: Using existing container for pointer 0x2554160, size: 3744.
main: freed ptr: 0x2554160.

avl_free: ptr: 0x2554080 [02554000], pos: 4, sbits: 7, size: 200.
avl_fill_bits: num: 7, pos: 4, idx: 0, p: 7f0, start: 4, stop: 11, ffffffffffffffff.
avl_free: found free left neighbour at 0x2554160, bits: 117.
avl_free: found free right neighbour at 0x2554000, bits: 4.
avl_combine: lp: 0x2554160, lbits: 117, lc: 0x2555f60, rp: 0x2554000, rbits: 4, rc: 0x2555f80, 
	current: ptr: 0x2554080, bits: 7, combined: ptr: 0x2554000, idx: 127, cont: 0x2555f80.
avl_combine: Using existing container for pointer 0x2554000, size: 4096.
main: freed ptr: 0x2554080.

Completed.

/devel/networking/nta :: Link / Comments (0)


Tue, 01 Aug 2006

Network tree allocator.


While creating various bitfield operations I've found, that several existing Linux kernel ones are way too suboptimal, for example set_bit_string() and __clear_bit_string() on x86_64 (actually I have not seen in other arches). And I'm saying not about assembler optimisations, but usual C ones.
So right now I'm a bit snapped between kevents and friends, tree allocator, slacking and paid work (yep, I need to work not less than 8 hours every day to get some beer and other goodies), but I plan to complete userspace implementation very soon, since most of the things are already implemented.

/devel/networking/nta :: Link / Comments (0)


Mon, 31 Jul 2006

Network tree allocator.


Freeing implementation uncovered a problem in the last algorithm - when system tries to free an object and searches in the AVL tree node's bitmask of free and used chunks, it is impossible to determine it's size if left neighbour (with increased size) is marked as used too. So I decided that freeing function will get a size of the object as parameter. Alternative variant is to store pointer to free objects inside some data structure (array, list or anything else), but that obviously requires additional memory and CPU overhead. Network stack always knows size of the each freed object, and actually even SLAB has special binding for cache objects and it's size, so I just decided to not introduce additional overhead right now and implement this later if there will be any need for that.

/devel/networking/nta :: Link / Comments (0)


Fri, 28 Jul 2006

Network tree allocator.


After some thoughts and small bottle of "Krusovice" allocation seems to work (I create userspace model first), although no stress tests were run yet.
It is slightly simpler than freeing, so if things will continue to be as they are, tomorrow full userspace network memory tree allocator will work.

/devel/networking/nta :: Link / Comments (0)


Network tree allocator.


And to complicate thins even more let's do everything simple.

Each allocated physical page (or pointers to larger contiguous blocks, they must differ and there must exist some way to obtain pointer to the start of the block from it's internal pointer, for example it is trivial to obtain pointer to the physical page (do not confuse with struct page) from the pointer inside that page - one just needs to AND given address with PAGE_MASK) is inserted into the AVL tree and live there forever. Each node has a bitmask of free and empty regions inside that page (if minimum allowed allocation block is 32 bytes, this bitmask takes PAGE_SIZE/32 bits), where each bit corresponds to part of the page which is used or free.
Each free region is placed into array of single-linked lists, where each list entry stores pointer to the actual free region and a pointer to the next entry. Array is indexed using allocation size (aligned to the minimum allowed allocation size).

When region is freed, appropriate page pointer is calculated (using trivial AND with PAGE_MASK) and page is selected inside the tree. Using offset from the page address to freed address we can select two neighbour regions which are either used or free and theirs sizes using search inside bitmask of free and used regions in each node. Free ones are searched inside lists for the appropriate sizes (yes, using slow search inside single-linked list, call me a loser), removed from that lists and concatenated (with node's bitmask update), then new free region is placed into the list for appropriate size.

When system is asked to allocate new region it checks list of free regions for the appropriate size, if it is empty it tries to get next list (with increased size). Eventually it will find the list with free region, which will be splitted into the used one (and appropriate pointer will be returned to the caller), and free one which will be moved into the new list for appropriate size. Bitmask of free and used regions for the node, which hosts page where returned region is allocated, will be updated accordingly.

If search (through single-linked list) for the free region which should be concatenated with freed one will take too much time (for example SLOB allocator works with similar searching for free object and no one complains), it is possible to replace it with double-linked one and set of pointers inside node which will point directly to the free region, so it can be removed and concatenated without speed overhead, but that will require list or array of pointers inside each node, so we will have additional memory overhead.

There is also a problem with jumbo frames and more than one page allocations. For the above scheme to work with more than a page contiguous blocks it is required that there is a way to determine initial pointer (start of the contiguous block) from any pointer inside that block (which is trivial with page sized chunks). It can be solved by allocation of the big sized pages from special address ranges (i.e. we can get page address and if it is from big sized pool of addresses, it will be updated accordingly).

But enough for theory, talks are cheap, so I plan to create initial implementation today.

P.S. And how that could be easier if I would not care too much about memory overhead... But it is another story.

/devel/networking/nta :: Link / Comments (0)


Thu, 27 Jul 2006

Network tree allocator.


Brief and simplified allocation/freeing algorithm design description.

Each node inside AVL tree will have a head of the list(s), each list entry can be dereferenced into special container which holds info about it's two neighbours (if it is used region, then it's neighbours are either NULLs or free regions).

Free algorithm.
When some chunk is freed it is found in the tree and then it's neighbours are found too, then each of three containers are removed from appropriate nodes and regions are concatenated into one free region and new (or existing) node selected to store this new memeber inside node's list.

Allocation algorithm.
If there is a node with exactly requested (aligned) size, then one of it's chunks is used and that chunk is moved from free into used list.
If there is no such a node, then the nearest with larger regions is used. System removes one chunk from free list and split it into the two - chunks are moved into used and free lists inside nodes of the appropriate sizes (chunk with requested size into used list, empty chunk (and new container for it) is moved into empty list).

There are no pointers inside each AVL node (it's size is 10 bytes on every arch), but instead variable size bitfields are used which are offsets inside global array of nodes. Mentioned above containers for free and used memory regions must have at least one pointer to the actual region, a pointer (or it's id) to the next container (to create lists of used/free chunks) and two bitfields to show it's neighbours.

While moving home I changed my opinion on above notes.
Everything that uses lists and is not FIFO or LIFO is broken. Or at least a lot of.
So, no lists.
Main AVL tree will be replaced with array of trees, each one corresponds to the set of used chunks of the selected size, each tree will be indexed with a pointer to the used chunk (or it's part). All free chunks will be placed into the single-linked list. Allocation and freeing algorithms are essentially the same as described above.

/devel/networking/nta :: Link / Comments (0)


Wed, 26 Jul 2006

Network tree allocator.


AVL tree implementation has passed 90 milliards search/insert operations (tree was recreated after 10 millions operations), which confirms it's correctness.
Test used full 64k-node tree, which took about 768kb of memory, maxiumum tree height is 16. If use that tree as a base for memeory allocator in kernel and limit maximum allocation of 64kb and align each allocation to 32 bytes, the whole tree will have 2048 nodes (height does not exceed 11), which takes 24kb. If kernel power-of-2 allocator is used for 1500 bytes sized chunks, it's overhead will become higher than AVL-tree one when 50 allocations are in flight, if 9k jumbo frames are used, then it is enough to allocate 4 frames.
AVL-tree overhead does not get into account how free chunks will be placed inside the node. Since both input and output part of the stack can allocate the whole pages, it is not correct to put some list header into each page in the NTA pool, so pages and other free chunks must be stored inside arrays bound to each node inside AVL tree. That arrays will be refilled from main pools each time node is empty and there is a request for appropriate size.

/devel/networking/nta :: Link / Comments (0)


Tue, 25 Jul 2006

Network tree allocator.


I've started stress tests for AVL-tree implementation, it passed about several hundreds of millions search/insert operations (splitted into 100k and 1 million chunks) for pseudo random values, so it seems that tree has been implemented correctly. I leave stress test for about 10 hours to complete this developement stage.

Implementation does not use dynamic allocation of the nodes, the are obtained from preallocated pool. Ten millions search/insert operations take about 4.5 seconds on my AMD64 Athlon 3500+ machine.

/devel/networking/nta :: Link / Comments (0)


Mon, 24 Jul 2006

Network tree allocator.


Here is brief description of my efforts directed to full sending and receiving zero-copy support for netchannels.

Main idea is to replace slab allocator with netchannel's own, so that all allocator's pool could be mapped from userspace. In that case all skbs allocated from that pool automagically become visible to userspace applications without any single memory copy.
If that allocator will be created in a right way, it will also solve problem with memory fragmentation and more generic problem with power of 2 allocators - big alignment overhead.

I read some papers about memory allocation in MMU-less systems (like uClinux kernels), and found that it is quite common problem, and there are several non-power-of-2 allocators, which might be used with my approach.
But it is not interesting way, I think, so I started to draw some initial design notes about how netchannel allocator will look like.

It will have several pools of contiguous memory chunks, at least pool of page sized chunks (which will be used for usual alocations) and one contiguos region (probably allocated at a boot time) for jumbo frames (and actually management of that pool can be used for MMU-less allocators).
Each contiguous region will have some control block attached which will store pointers to free and used sub-blocks inside the region (I plan to use chunks aligned to predefined constant value, for example 32 bytes). When some block is freed inside the contiguous region, it will be concatenated with neighbour free blocks thus reducing memory fragmentation. Pointers to the free regions will be placed into the AVL-tree as the fastest search/insert/delete tree among existing binary trees (all even worst times never exceed O(log2(N)), where N is number of nodes).
If size of the free area inside contiguous region differs after allocation/freeing, pointer will be moved inside the tree, thus rebalancing it.

Each AVL node is designed to be very memory efficient, so it only uses 12 bytes on any arch. This limits maximum number of nodes to 64k-1, if each node corresponds to value multiple of 32 bytes, this allows maximum 2mb contiguous block allocation, which is more than enough right now. If that limit is going to be changed, one just need to change definition of one type (avl_t). Current proof-of-concept userspace model implementation of the tree works with 16bit type to store each node's value parameter.

/devel/networking/nta :: Link / Comments (0)