|
|
About ::
TODO ::
Blog ::
RSS ::
Old blog ::
Projects ::
GIT ::
Gallery ::
Notes
Fri, 11 Aug 2006
Network tree allocator.
Scalability issues.
SLAB allocator is essentially per-cpu - memory being freed stays
on the CPU which calls kfree() even if it was not
originally allocated on that CPU. From one point of view
it is bad (the same address must live in allocation and freeing
CPUs caches and so on), but from other point it is very good, since
allocator becomes lock-free. Since SLAB allocator by design
can only contain chunks of memory of predefined size even from completely
different pages, it can not perform any kind of fragmentation avoidance.
Network tree allocator was designed to be able to combine neighbour chunks into
region of bigger size, so when freeing happens allocator will search
for neighbours. So if NTA will become per-cpu, allocator must search for
neighbours not on freeing CPU, but on CPU which was used for allocation,
and since it is possible to simultaneously free different chunks which were
originally allocated on the same CPU, there must exist some locking between
them. Since freeing allows to change allocation state - i.e. some chunks of
free memory can be removed and combined with other chunks, freeing logic
must lock part of allocation logic (so allocator would not get
chunk which is going to be combined with currently being freed one),
so basically we need to introduce at least two locks -
per free list (all free chunks are combined into FIFO lists) and per node
(since the same node can contain chunks of the different sizes which can be
simultaneously freed on different CPUs). Such complex locking can not be cheap,
and the worst thing is that each node must contain a lock, which increases
it's size from 12 to 36 bytes when debugging is turned off and thus does not fit
into single cache line on a lot of arches. Decision to combine chunks only
when freeing happens on the same CPU as allocation is not considered, since
it is unlikely condition, so it will lead to constant increase of fragmentation.
As practice shows this solution is bad, since there is a problem with locking -
allocation path locks list of free objects, gets free chunk, drops free list lock,
locks corresponding node, updates node's bitmask, drops node lock; while freeing
path gets node from freeing pointer, locks that node, updates it's bitmask,
locks list of free objects of one neighbour, searches for that neighbour, drops
the lock, locks list of objects for the next neighbour, searches for that neighbour,
drops the lock and finally drops the lock for node. This approach has a race.
Interesting idea is not to free objects if freeing happens on different CPU than allocation,
and put free object into queue for freeing on the original CPU. When CPU, where
allocation originally happend, is going to perform next freeing or allocation,
it can combine those batched objects.
In this scheme there is only tiny locking place when object being freed is going
to be placed or removed from queue of "semi-free" objects (i.e. queue of objects allocated on
different CPU and thus scheduled for freeing there).
/devel/networking/nta :: Link / Comments ()
|