Zbr's days.
August
Sun Mon Tue Wed Thu Fri Sat
   
11
   
2006
Months
Aug

About TODO Blog RSS Old blog Projects Gallery Notes

Fri, 11 Aug 2006

Network tree allocator.


Scalability issues.
SLAB allocator is essentially per-cpu - memory being freed stays on the CPU which calls kfree() even if it was not originally allocated on that CPU. From one point of view it is bad (the same address must live in allocation and freeing CPUs caches and so on), but from other point it is very good, since allocator becomes lock-free. Since SLAB allocator by design can only contain chunks of memory of predefined size even from completely different pages, it can not perform any kind of fragmentation avoidance.
Network tree allocator was designed to be able to combine neighbour chunks into region of bigger size, so when freeing happens allocator will search for neighbours. So if NTA will become per-cpu, allocator must search for neighbours not on freeing CPU, but on CPU which was used for allocation, and since it is possible to simultaneously free different chunks which were originally allocated on the same CPU, there must exist some locking between them. Since freeing allows to change allocation state - i.e. some chunks of free memory can be removed and combined with other chunks, freeing logic must lock part of allocation logic (so allocator would not get chunk which is going to be combined with currently being freed one), so basically we need to introduce at least two locks - per free list (all free chunks are combined into FIFO lists) and per node (since the same node can contain chunks of the different sizes which can be simultaneously freed on different CPUs). Such complex locking can not be cheap, and the worst thing is that each node must contain a lock, which increases it's size from 12 to 36 bytes when debugging is turned off and thus does not fit into single cache line on a lot of arches. Decision to combine chunks only when freeing happens on the same CPU as allocation is not considered, since it is unlikely condition, so it will lead to constant increase of fragmentation.
As practice shows this solution is bad, since there is a problem with locking - allocation path locks list of free objects, gets free chunk, drops free list lock, locks corresponding node, updates node's bitmask, drops node lock; while freeing path gets node from freeing pointer, locks that node, updates it's bitmask, locks list of free objects of one neighbour, searches for that neighbour, drops the lock, locks list of objects for the next neighbour, searches for that neighbour, drops the lock and finally drops the lock for node. This approach has a race.

Interesting idea is not to free objects if freeing happens on different CPU than allocation, and put free object into queue for freeing on the original CPU. When CPU, where allocation originally happend, is going to perform next freeing or allocation, it can combine those batched objects.
In this scheme there is only tiny locking place when object being freed is going to be placed or removed from queue of "semi-free" objects (i.e. queue of objects allocated on different CPU and thus scheduled for freeing there).

/devel/networking/nta :: Link / Comments (0)

Please solve this captcha to be allowed to post (need to reload in a minute): 40 * 27

Comments are closed for this story.