|
|
About ::
TODO ::
Blog ::
RSS ::
Old blog ::
Projects ::
GIT ::
Gallery ::
Notes
Mon, 31 Jul 2006
Network tree allocator.
Freeing implementation uncovered a problem in the last algorithm - when
system tries to free an object and searches in the AVL tree node's bitmask
of free and used chunks, it is impossible to determine it's size if left neighbour
(with increased size) is marked as used too. So I decided that freeing function
will get a size of the object as parameter. Alternative variant is to store
pointer to free objects inside some data structure (array, list or anything else),
but that obviously requires additional memory and CPU overhead.
Network stack always knows size of the each freed object, and actually
even SLAB has special binding for cache objects and it's size, so I just decided
to not introduce additional overhead right now and implement this later
if there will be any need for that.
/devel/networking/nta :: Link / Comments (0)
Weekend.
Nothing forboded a thunder, but it happend. One-by-one.
I've awakened Saturday and went to Ilya Glazunov's gallery.
I spent there about 4 hours and under strong impression moved home. There are several gigantic
paints about Russia and it's history from about 1000 to 2000 years. I especially like "The whole Russia"
and small paint "Magic asphalt". One says that he paints special eyes. That's definitely true on
the paints devoted to Jesus Christ - they are so deep, so it is not that easy to take eyes off.
"Kulikov's battle" serie about Mongol invasion in 14 century impressed me with colour madness.
Ilya Glazunov has especially perfect red-coloured paints.
"The whole Russia" paint shows who made significant influence on my country evolution.
From scientists to commanders, from regents to people of art, from tirans to saints - there are hundreds
of people on that gigantic paint, but it leaves me with some bad thoughts about Russia evolution...
There are a lot of other great paints, some of them were taken from Tretyakov's gallery, and it is
impossible to even somehow describe my feelings in a couple of words here. I definitely recommend
you to visit it if you will be in Moscow.
But let's move further.
When I was at home, I'd known a great surprise - mephody (Alexander Boykov) is going to marry right now
(actually when I was at home he had already married). He and Ira completed that process in two days
only and decided to celebrate theirs wedding.
I moved to the restaurant, then to night club for after-pary. I met a lot of old friends,
but a lot of them did not appear, so I'm quite dissapointed about them.
Anyway it was great time. I wish Alexander and Ira the only best things ever, and I'm quite sure
they will be ok.
Sunday started with feeling that everything goes well, so I started some preparation to Jim Klimov's
wedding, which I was invited to. They celebrated it in our Alma Mater, so I moved there.
While there were some preparations I had a very interesting talk with
Russian SWSoft team leader.
On Jim's wedding I met my old university friends and we spent a lot of really good time.
I wish Jim and Kathreen love and respect, I'm sure they will be happy together.
/life :: Link / Comments (0)
Fri, 28 Jul 2006
Network tree allocator.
After some thoughts and small bottle of "Krusovice"
allocation seems to work (I create userspace model first),
although no stress tests were run yet.
It is slightly simpler than freeing, so if things will continue
to be as they are, tomorrow full userspace network memory tree
allocator will work.
/devel/networking/nta :: Link / Comments (0)
Network tree allocator.
And to complicate thins even more let's do everything simple.
Each allocated physical page (or pointers to larger contiguous blocks, they
must differ and there must exist some way to obtain pointer to the start
of the block from it's internal pointer, for example it is trivial to
obtain pointer to the physical page (do not confuse with struct page)
from the pointer inside that page - one just needs to AND given address with
PAGE_MASK) is inserted into the AVL tree and live there forever.
Each node has a bitmask of free and empty regions inside that page (if minimum
allowed allocation block is 32 bytes, this bitmask takes PAGE_SIZE/32
bits), where each bit corresponds to part of the page which is used or free.
Each free region is placed into array of single-linked lists, where each
list entry stores pointer to the actual free region and a pointer to the next
entry. Array is indexed using allocation size (aligned to the minimum allowed allocation
size).
When region is freed, appropriate page pointer is calculated (using trivial
AND with PAGE_MASK) and page is selected inside the tree. Using
offset from the page address to freed address we can select two neighbour
regions which are either used or free and theirs sizes using search inside
bitmask of free and used regions in each node. Free ones are searched
inside lists for the appropriate sizes (yes, using slow search inside single-linked list,
call me a loser), removed from that lists and concatenated (with node's bitmask update),
then new free region is placed into the list for appropriate size.
When system is asked to allocate new region it checks list of free regions
for the appropriate size, if it is empty it tries to get next list (with increased size).
Eventually it will find the list with free region, which will be splitted
into the used one (and appropriate pointer will be returned to the caller),
and free one which will be moved into the new list for appropriate size.
Bitmask of free and used regions for the node, which hosts page where returned
region is allocated, will be updated accordingly.
If search (through single-linked list) for the free region which should be
concatenated with freed one will take too much time (for example SLOB allocator
works with similar searching for free object and no one complains), it is possible to replace
it with double-linked one and set of pointers inside node which will point directly
to the free region, so it can be removed and concatenated without speed overhead,
but that will require list or array of pointers inside each node, so we will have
additional memory overhead.
There is also a problem with jumbo frames and more than one page allocations.
For the above scheme to work with more than a page contiguous blocks it is required
that there is a way to determine initial pointer (start of the contiguous block)
from any pointer inside that block (which is trivial with page sized chunks).
It can be solved by allocation of the big sized pages from special address ranges (i.e.
we can get page address and if it is from big sized pool of addresses, it will be updated
accordingly).
But enough for theory, talks are cheap, so I plan to create initial implementation today.
P.S. And how that could be easier if I would not care too much about memory overhead... But
it is another story.
/devel/networking/nta :: Link / Comments (0)
Thu, 27 Jul 2006
Network tree allocator.
Brief and simplified allocation/freeing algorithm design description.
Each node inside AVL tree will have a head of the list(s), each list entry
can be dereferenced into special container which holds info about it's two
neighbours (if it is used region, then it's neighbours are either NULLs or
free regions).
Free algorithm.
When some chunk is freed it is found in the tree and then it's neighbours
are found too, then each of three containers are removed from appropriate nodes
and regions are concatenated into one free region and new (or existing) node
selected to store this new memeber inside node's list.
Allocation algorithm.
If there is a node with exactly requested (aligned) size, then one of it's chunks
is used and that chunk is moved from free into used list.
If there is no such a node, then the nearest with larger regions is used.
System removes one chunk from free list and split it into the two - chunks
are moved into used and free lists inside nodes of the appropriate sizes
(chunk with requested size into used list, empty chunk (and new container for it)
is moved into empty list).
There are no pointers inside each AVL node (it's size is 10 bytes on every arch),
but instead variable size bitfields are used which are offsets inside global array of
nodes. Mentioned above containers for free and used memory regions must have at least
one pointer to the actual region, a pointer (or it's id) to the next container (to create
lists of used/free chunks) and two bitfields to show it's neighbours.
While moving home I changed my opinion on above notes.
Everything that uses lists and is not FIFO or LIFO is broken.
Or at least a lot of.
So, no lists.
Main AVL tree will be replaced with array of trees, each one
corresponds to the set of used chunks of the selected size,
each tree will be indexed with a pointer to the used chunk (or it's part).
All free chunks will be placed into the single-linked list.
Allocation and freeing algorithms are essentially the same as described above.
/devel/networking/nta :: Link / Comments (0)
Wed, 26 Jul 2006
Network tree allocator.
AVL tree implementation has passed 90 milliards search/insert operations (tree was recreated
after 10 millions operations), which confirms it's correctness.
Test used full 64k-node tree, which took about 768kb of memory, maxiumum
tree height is 16. If use that tree as a base for memeory allocator in kernel and limit
maximum allocation of 64kb and align each allocation to 32 bytes, the whole tree will have
2048 nodes (height does not exceed 11), which takes 24kb. If kernel power-of-2 allocator is
used for 1500 bytes sized chunks, it's overhead will become higher than AVL-tree one when
50 allocations are in flight, if 9k jumbo frames are used, then it is enough to allocate 4
frames.
AVL-tree overhead does not get into account how free chunks will be placed inside the node.
Since both input and output part of the stack can allocate the whole pages, it is
not correct to put some list header into each page in the
NTA pool,
so pages and other free chunks must be stored inside arrays bound to each node inside AVL tree.
That arrays will be refilled from main pools each time node is empty and there is a request
for appropriate size.
/devel/networking/nta :: Link / Comments (0)
Climbing.
There were a lot of people today in Skala-city,
so training was, let's say, "bursty", i.e. we wait quite a long while interesting
ropes become free and started several traces one-by-one, so I completed not so many of them,
but tired noticebly - most of the traces were combined in pairs.
Grange brought his new girlfriend, so he must
complete man's start (trace on negative slope without using legs), and he created new record
for the climbing zone (among usual visitors, but not instructors). Mine is about
2 holds lower.
Training was completed with campus-board exercises, sauna and excellent mood.
/life :: Link / Comments (0)
Kevent.
I've resent whole kevent patchset including
kevent,
network aio,
AIO, poll/select and timer notifications to linux-kernel@ and netdev@, which resulted in
interesting discussion. David Miller and (I do not believe!) Christoph Hellwig have
positive feedback on that.
/devel/kevent :: Link / Comments (0)
Tue, 25 Jul 2006
Network tree allocator.
I've started stress tests for AVL-tree implementation, it passed about
several hundreds of millions search/insert operations (splitted into 100k and 1 million chunks)
for pseudo random values, so it seems that tree has been implemented correctly. I leave
stress test for about 10 hours to complete this developement stage.
Implementation does not use dynamic allocation of the nodes, the are obtained from preallocated pool.
Ten millions search/insert operations take about 4.5 seconds on my AMD64 Athlon 3500+ machine.
/devel/networking/nta :: Link / Comments (0)
Mon, 24 Jul 2006
Climbing.
It was lazy training today - part of it was devoted
to my thoughts about Confucianism, the rest of the time
was spent in couple of traverses and some not very complex old traces.
Campus board exercises were replaced with man's starts - climbing
on negative slope without using legs - it seems that no one
(except may be instructors) can finish at least one half of the wall.
/life :: Link / Comments (0)
Network tree allocator.
Here is brief description of my efforts directed to
full sending and receiving zero-copy support for
netchannels.
Main idea is to replace slab allocator with netchannel's own,
so that all allocator's pool could be mapped from userspace.
In that case all skbs allocated from that pool automagically
become visible to userspace applications without any single memory copy.
If that allocator will be created in a right way, it will also solve problem
with memory fragmentation and more generic problem with power of 2
allocators - big alignment overhead.
I read some papers about memory allocation in MMU-less systems
(like uClinux kernels), and found that it is quite common problem,
and there are several non-power-of-2 allocators, which might be used with
my approach.
But it is not interesting way, I think, so I started to draw some initial
design notes about how netchannel allocator will look like.
It will have several pools of contiguous memory chunks, at least pool of
page sized chunks (which will be used for usual alocations) and one contiguos
region (probably allocated at a boot time) for jumbo frames (and actually
management of that pool can be used for MMU-less allocators).
Each contiguous region will have some control block attached which will store
pointers to free and used sub-blocks inside the region (I plan to use chunks aligned
to predefined constant value, for example 32 bytes). When some block is freed
inside the contiguous region, it will be concatenated with neighbour free blocks
thus reducing memory fragmentation. Pointers to the free regions will be placed into
the AVL-tree as the fastest search/insert/delete
tree among existing binary trees (all even worst times never exceed O(log2(N)),
where N is number of nodes).
If size of the free area inside contiguous region differs after allocation/freeing,
pointer will be moved inside the tree, thus rebalancing it.
Each AVL node is designed to be very memory efficient, so it only uses 12 bytes
on any arch. This limits maximum number of nodes to 64k-1, if each node
corresponds to value multiple of 32 bytes, this allows maximum 2mb contiguous block
allocation, which is more than enough right now. If that limit is going to be changed,
one just need to change definition of one type (avl_t).
Current proof-of-concept userspace model implementation of the tree works with 16bit
type to store each node's value parameter.
/devel/networking/nta :: Link / Comments (0)
Fri, 21 Jul 2006
Netchannels.
Our discussion
about netchannels ended up with following topics from socket apologists:
- process context protocol processing allows to have spurious retransmits if receiving
side is blocked on some non-tcp task like data writing.
- atcp works faster because it does not have some features, which if enabled
in socket code resulted in worse perfomance.
- atcp has some bugs (like absence of slow start), which will break connections in
some situations, and thus it is faster.
My position is following:
- socket code and netchannels behave exactly the same, when userspace task is blocked:
either socket starts to queue the data and eventually it will drop it due to memory
limits, or netchannel will queue the data, but do not process it and send acks.
Neither condition can fire in usual life, and if life becomes unusual, troubles
will be cought by both sockets and netchannels.
- no one knows where existing tcp implementation has those features, which should
be turned off by default, and no one is motivited enough to start searhing them.
- atcp has slow start and does not have other specified "bugs", and actually
opponent does not know why atcp behaves faster with netchannels (and he agree with me).
Besides above issues, I want to specify my point of view, that netchannels
perform faster not only because of different TCP implementation, but
because of architectural changes (no BH/irq processing, no bh/irq locks,
no complex queue management, no atomic allocations, no false (I mean acks
for unread data) acks, thus no possible queue overfull, natural flow
control and other things).
According to the idea, that it is impossible to have faster processing
(with existing socket code), when protocol management is moved totally
into process context, but it was shown
with my initial netchannel implementation,
that it can be done - and there was small, but 100% reproducible steady
performance win (about 2-3 MB/sec and several % of CPU usage) with
big-sized chunks. Unfortunately I did not test small-sized ones, which
show big perfomance win with netchannels and atcp. Those results were
not enough for me, so I implemented different stack, which does not
have anything related to the two step processing, and it can be one of the reasons
for faster processing. It can have bugs, but the whole idea was proven
to be absolutely correct (when using either socket code, or atcp).
After such a discussion, I expect netchannels idea will be buried by linux netdev community,
so I think I can say, that this project has completely achieved all
designed goals (not to bury the whole idea, but to show that protocol
processing in process context can be (sometimes much) faster than
two-level usual one) and will be stopped.
Right now I work on the network allocator
which can be used both by netchannels and usual networking code
to get full zero-copy sending and receiving capabilities.
/devel/networking :: Link / Comments (0)
Thu, 20 Jul 2006
Climbing.
Excellent training today - several old traces on vertical walls,
negative slope start and a lot of traverses (I went from
work in 17:00 and was in Skala-City
earlier than 18:00). I recalled very interesting traverse
over all holds which a placed on the first shield only, i.e.
about 1 meter high - it got almost all my power, but I completed
it twice (backward and forward ways). Training was finished
with already usual campus-board exercises. As you can expect, I have
excellent mood now.
/life :: Link / Comments (0)
Wed, 19 Jul 2006
Netchannels.
Interested reader can find a
discussion
about netchannels in general
and my implementation in particular, which happend after I presented my code in netdev@.
Unfortunately today I'm unable to answer due to some problems in
russian internet segment, so expect the continue tomorrow.
/devel/networking :: Link / Comments (0)
Tue, 18 Jul 2006
Climbing.
Usual good training today - nothing major: "Mini cooper" trace
over small holds, couple of traces on negative slope and several traverses,
then usual exercises on campus board, and I have
a very good mood as guaranteed result.
/life :: Link / Comments (0)
Netchannels.
I've fixed an issue with small-sized reads, and now netchannels outperform
sockets in any type of bulk transfers.
/devel/networking :: Link / Comments (0)
Mon, 17 Jul 2006
Netchannels.
Listening statate support has been completed.
The latest patch and userspace utility are available in archive.
After running some tests I've found, that the only wrong behaviour
is receiving when userspace reads data using small-sized chunks (i.e. less than MSS),
in that case netchannel behaves noticebly worse than sockets, and noticebly better otherwise.
/devel/networking :: Link / Comments (0)
Userspace network stack.
I've created separate project for userspace network stack, which was used
as a base for netchannel's
alternative TCP/IP stack.
It is released under GPL v2 or any later version.
Project's page can be found here.
/devel/networking :: Link / Comments (0)
Towards full sending and receiving zero-copy support for netchannels.
Main idea is to replace slab allocator with netchannel's own, so that all allocator's pool
could be mapped from userspace. In that case all skbs allocated from that pool
automagically become visible to userspace applications without any single memory
copy.
If that allocator will be created in a right way, it will also solve problem with memory
fragmentation and more generic problem with power of 2 allocators - big alignment overhead.
I read some papers about memory allocation in MMU-less systems (like uClinux kernels),
and found that it is quite common problem, and there are several non-power-of-2 allocators,
which might be used with my approach.
But it is not interesting way, I think, so I started to draw some initial design notes
about how netchannel allocator will look like.
Currently I think it will have several pools of contiguous memory chunks,
at least pool of page sized chunks (which will be used for usual alocations) and
one contiguos region (probably allocated at a boot time) for jumbo frames (and
actually management of that pool can be used for MMU-less allocators).
Each contiguous region will have some control block attached which will store
pointers to free and used sub-blocks inside the region (I plan to use
chunks aligned to predefined constant value, for example 1kb).
When some block is freed inside the contiguous region, it will be concatenated
with neighbour free blocks thus reducing memory fragmentation. Pointers
to the control blocks (and thus pointers to the free and used regions) will
be placed into the tree, so search for the empty block would take predisctible time.
If size of the free area inside contiguous region differs after allocation/freeing,
pointer to it's control block will be moved inside the tree, thus rebalancing it.
/devel/networking :: Link / Comments (0)
Programming the Kernel for Web 2.0.
Scary topic, isn't it?
/devel/other :: Link / Comments (0)
Sat, 15 Jul 2006
Netchannels.
Ok, I think I've solved all problems with sending (congestion control)
and receiving (there were a typo in size calculation) with netchannels.
The only thing I want to implement (actually check) now is TCP_LISTEN state support.
I'm not going to create server-like accept()
mechanism right now, but instead just allow to create netchannel in
listen state which will be changed into established after appropriate
three-way handshake.
/devel/networking :: Link / Comments (0)
Fri, 14 Jul 2006
Climbing.
It was excellent climbing training today - several old traces,
some new one. There were everything - old complex traces, negative slope,
new one, which I tried first time on previous training -
very interesting trace over extremely small holds - now it has a name -
Mini Cooper. Then several starts on campus board and I feel myself
very good. Finally there was a powerfull rain, so it is not that hot.
Not bad working week ends with tasty bottle of "Krusovice",
Machiavellian's "Sovereign" book and excellent mood.
/life :: Link / Comments (0)
Netchannels and Moore's laws.
Problems happen when you do not expect them.
It was found that my congestion avoidance algorithm uses too aggressive
cwnd update, so it frequently results in a too many duplicate ACKs and
even sometimes full traffic suspending, since there is a bug
in congestion avoidance/fast retransmit logic.
And I also found, that my latest updates completely broke receiving support.
I expect to resolve both issues tomorrow and to start
interesting implementation of true
receiving and sending zero-copy support for any given device.
I plan to use that feature with netchannels, although it is possible
to create true zero-copy sniffer without any VM hacks and header split capable hardwar,
using that ideae.
There was some conversation about that idea, but it looks like it is stopped,
so I will try to implement it as is.
/devel/networking :: Link / Comments (0)
Thu, 13 Jul 2006
Netchannels.
After fixing some bugs in duplicate ACK processing I'm pleased
to announce that netchannels
can outperform socket code even on big sized data chunks. For 40k
sending netchannel's speed is more than 88 MB/sec while socket code
can not run faster than 85.5 MB/sec.
/devel/networking :: Link / Comments (0)
Alternative TCP/IP stack.
I've released under GPL v2 or any later version my userspace alternative network
stack. One can find it in archive.
Supported features:
- TCP/UDP sending and receiving.
- Timestamp, window scaling, MSS TCP options.
- PAWS.
- Slow start and congestion control.
- Route table (including startic ARP cache).
- IP and ethernet processing code.
TODO:
- complete retransmit algorithm.
- fast retransmit support.
- data split to MSS sized chunks.
- support for TCP listen state.
Some of the todo items are already implemented in
Linux 2.6 netchannels alternative TCP/IP stack.
/devel/networking :: Link / Comments (0)
Netchannels.
After some other tricks I'm able to run netchannel sending with 87 bytes packets
with 57 MB/sec, while socket code can only achieve 22 MB/sec.
I've also implemented support for hardware checksum offloading for netchannel
packets, but I've turned that off, since for e1000 at least socket code resulted
in worse performance when checksumming is on for both sending and receiving.
Speed benchmark for 44 bytes bulk sending:
- netchannel: 32 Mb/sec
- socket: 10 MB/sec
/devel/networking :: Link / Comments (0)
Wed, 12 Jul 2006
Climbing.
It was easy training after two weeks without holds.
I've completed couple of old traces and found, that there are three
new - two of them were successfully completed on-sight (well,
I had two falls on the second one - very interesting trace on the central
sector over extremely small orange holds).
I finished a lot of excercises on campus-board, so I expect it was an easy training,
since usually I can not do more than 2-3 starts.
/life :: Link / Comments (0)
2006.07 canoe trip.
One can find my brief description of the canoe trip over Pra river from Spas-Klepiki down to Brykin Bor
here.
There were Eugene (WiJo) and Alexandra, Alexander (Mephody) and Ira, and me.
One can find a lot of photos in the gallery.
Enjoy.
/life :: Link / Comments (0)
Tue, 11 Jul 2006
Netchannels.
IBM guys showed theirs initial implementation of netchannels,
which steals data in interrupt/softint and queue theirs special buffers into some queue. Later in
process context either default netchannel (kernel thread) or (not implemented yet)
socket bound netchannel gets that data from the queue, convert to skb and
passed directly into netif_receive_skb() where it is delivered into
destination socket.
This approach is very similar to what I did for some initial netchannel
implementations, although
I used sockets callbacks directly without all netif_receive_skb() path.
That approach showed
that it is possible to have fast protocol processing in process context,
and performance improvement was about 1-2 MB/sec.
Unfortunately IBM folks did not run any tests on theirs code, so it is impossible
to say if they were able to achieve better results or not.
Currenly that approach is used in David Miller's
netchannel tree.
/devel/networking :: Link / Comments (0)
Lemmings strike back!
I'm on 9th level of "Fun" section currently.
/other :: Link / Comments (0)
Mon, 10 Jul 2006
Netchannels.
I've fixed issue with receiving window handling.
Sending now works quite good, but I do not like performance results for small-sized packets:
- netchannel: 16 MB/sec, ~50% CPU usage
- socket: 22 MB/sec, ~90% CPU usage
Current code does not use sending optimisations mentioned earlier,
so it is expected to be slower.
After turning that optimisation on, I'm pleased to announce following numbers for small packets (80 bytes):
- netchannel: 57 MB/sec, ~80% CPU usage
- socket: 22 MB/sec, ~90% CPU usage
Actually it is not 100% correct to call that results correct, since
it uses hacks which can lead to broken behaviour for non-bulk transfers
or transfers over bad lines.
There is another optimisation for netchannels userspace interface - currently each netchannel
command requires to hash input values, lock hash table and run through the list of
netchannels in the given bucket. All those things can be easily eliminated, if I decide to
return file descriptor for each netchannel, so that optimisation should endup in some CPU
usage decrease.
/devel/networking :: Link / Comments (0)
Acrypto.
I've released new set of combined patches, which include:
- acrypto core
- IPsec ESP4 port to acrypto
- dm-crypt port to acrypto
- OCF to acrypto bridge,
which allows to run OCF device drivers with acrypto (for example ixp4xx).
Work by Yakov Lerner
With this release I drop support for old style tarballs and 2.6.15 combined
patchsets. Currently supported kernel versions are 2.6.16 and 2.6.17.
IXP4xx benchmark with OCF to acrypto bridge:
- with 1500 buffers it runs with 150 Mbit/sec
- software-only ecryption on that processor only allows to get ~1.5 Mbit/sec
- IPsec shows about 20 Mbit/sec
/devel/acrypto :: Link / Comments (0)
Sun, 09 Jul 2006
Kevent patchset has been sent to linux-kernel@ and netdev@ mail lists.
Let's see where that will end up...
/devel/kevent :: Link / Comments (0)
Vacation is over.
I am sure any work after such a rest will be felt like a rest.
My skin was burned by sun continuously, so it has from crimson to brown colour.
And I have a berd now.
More detailed review of my canoe trip with photos will be ready in a couple of days.
/life :: Link / Comments (0)
|