Zbr's days.

Sat, 30 Sep 2006

Kevent is going to be included into -mm tree.


I think I know why it happened :).

Thanks a lot!

/devel/kevent :: Link / Comments (0)


Fri, 29 Sep 2006

Climbing.


Because of the flat renovation I still cannot start regular training, so I had another week's delay. Nevertheless, today's training was not bad. I completed a lot of various traverses and found two new routes, which I will try when Grange starts climbing again. One of them is on a vertical wall and does not have that high a grade, so I expect to complete it on-sight. The other is quite a complex route on a negative slope. I have only climbed one such complex route on a negative slope before, but this one at least has a start of medium difficulty, not an easy one (I tried only the first three meters, where instructors allow climbing without a belay; that was about 6 or 7 holds, since the route zig-zags).

/life :: Link / Comments (0)


Userspace network stack.


I've finally returned to it and implemented a bunch of stuff (mostly ported from the netchannels alternative kernel TCP/IP stack); the main parts are the retransmit queue implementation and some congestion control tweaks. It works quite robustly, but its speed (still through a packet socket) is not sufficient, so I will investigate this issue and then move to zero-copy sending and receiving.

/devel/networking :: Link / Comments (0)


I know what should be done to get kevent included into mainline, but I will not do it.


It is just a matter of adding struct sigmask to the main kevent function, sys_kevent_get_events(). This was done for epoll(), poll() and select() by introducing additional syscalls, but let's see what the reason for such behaviour is. Why is it needed at all?

Existing systems become more and more complex, and frequently (if not always) they require some kind of asynchronous notification, which is done through signals in modern OSes. But frequently a delivered signal cannot be processed (for example when the signal queue overflows), and even if the appropriate handler can be called, it is not a good idea to perform complex tasks in a signal handler, so it should somehow notify the main process that additional operations need to be performed. Frequently sys_poll() and friends are called in the core of the event handling state machine, so it is very logical to put signal processing there too. But with the usual asynchronous delivery of signals this requires some locks in the handler and some locks inside the sys_poll() check loop, and even then, with the existing design of event delivery mechanisms, it is impossible to determine 100% correctly what happened first - the signal was delivered or the event became ready.
To solve this problem, a set of special syscalls was designed (sys_pselect() in POSIX, and sys_ppoll() as an add-on to sys_poll() in Linux, for example). They take a signal mask (struct sigmask) as a parameter, which, if not NULL, is atomically installed as the process signal mask for the duration of the call, so the signals we are interested in can be kept blocked everywhere else and delivered only while we wait in sys_ppoll(). This allows signals to be delivered without races, since when sys_ppoll() returns it is either because one of those signals interrupted it, or due to the usual cases of error, timeout or ready events.

Let's see why kevent does not need it at all.
Kevent can drive any kind of event, not only those supported by sys_poll(), which means that the correct solution, instead of the POSIX workaround, is to implement signal events as just another kevent user (like AIO completion) and add them into the kevent queue in the usual way. Event readiness is atomic in kevent, which means that whatever happens first, a signal or another event, will be put into the ready queue first. In that case userspace should just check the type of the returned event and, if it is a signal, perform the appropriate operations. People who want struct sigmask there actually just do not know what kevent is (judging by the comments in e-mails, I doubt the code or mails were even read) or how it is possible to work with it.
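As a rough userspace analogy of "signals as just another event source" (this is not kevent and not its API), Python's signal.set_wakeup_fd() turns a delivered signal into a byte on a descriptor, so one readiness loop reports both signals and socket data and the caller only checks the event type. The function name collect_events, the use of SIGUSR1 and the socketpair standing in for a network socket are assumptions made for illustration; unlike kevent, this sketch gives no ordering guarantee between the two sources and it is Unix-only.

```python
import os
import selectors
import signal
import socket
import time


def collect_events(timeout=2.0):
    """Wait on one queue for both a signal and socket data; return {type: payload}."""
    sel = selectors.DefaultSelector()
    sig_r, sig_w = socket.socketpair()    # carries signal numbers as single bytes
    data_r, data_w = socket.socketpair()  # stands in for an ordinary network socket
    sig_r.setblocking(False)
    sig_w.setblocking(False)              # set_wakeup_fd() requires a non-blocking fd

    # A Python-level handler must exist, otherwise SIGUSR1 would kill the process.
    old_handler = signal.signal(signal.SIGUSR1, lambda signum, frame: None)
    old_fd = signal.set_wakeup_fd(sig_w.fileno())
    sel.register(sig_r, selectors.EVENT_READ, "signal")
    sel.register(data_r, selectors.EVENT_READ, "socket")
    seen = {}
    try:
        os.kill(os.getpid(), signal.SIGUSR1)  # a signal "event"
        data_w.send(b"x")                     # a socket "event"
        deadline = time.monotonic() + timeout
        while len(seen) < 2 and time.monotonic() < deadline:
            for key, _ in sel.select(0.1):
                # key.data is the event type; the main loop just dispatches on it.
                seen[key.data] = key.fileobj.recv(64)
        return seen
    finally:
        signal.set_wakeup_fd(old_fd)
        signal.signal(signal.SIGUSR1, old_handler)
        sel.close()
        for s in (sig_r, sig_w, data_r, data_w):
            s.close()
```

The main loop never needs a signal mask argument here: a signal shows up as an event whose payload is the signal number.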

That is why I will not add struct sigmask to the syscall parameters, and that is why kevent will stay stuck where it is now.

But actually I do not care: I did not do it specifically for kernel inclusion (although that would be good), but just because I like it.

I've just thought about the situation - it is becoming just funny.
There is a very useful (I do not exaggerate) feature. A lot of people want it included and want to use it, and there has been demand for it for several years already, with regular talks about how it could be implemented. Now that it is done (and not only done, but with all the features requested in those empty talks in mind), it will not be included, just because yet another feature was not added (note: just added, i.e. it does not require replacing or breaking anything, just adding another kevent user). And the people who want that missing feature do not and are not going to implement it, which will force yet another period of empty talks and handwaving...
LOL.

But enough, the crying is over; I have some interesting work to do on the userspace network stack for netchannels.

/devel/kevent :: Link / Comments (0)


Thu, 28 Sep 2006

New asynchronous crypto layer (acrypto) release.


I'm pleased to announce the asynchronous crypto layer (acrypto) release for the 2.6.18 kernel tree. Acrypto allows crypto requests to be handled asynchronously in hardware.

With this release of the combined patchset for 2.6.18 I drop feature extensions for the 2.6.16 and 2.6.17 trees and move them into maintenance state.

The combined patchset (190k) and drivers for various acrypto providers can be found on the project's homepage.

/devel/acrypto :: Link / Comments (0)


Added some photos from my window and roof of the house I live in.


You can find them in gallery.

/life :: Link / Comments (0)


Wed, 27 Sep 2006

Grange added another one-wire adapter to OpenBSD.


Here is a small dmesg of a running OpenBSD system with w1 subsystem drivers:

uow0 at uhub2 port 2
uow0: Dallas Semiconductor USB-FOB/iBUTTON, rev 1.00/0.02, addr 2
onewire0 at uow0
owtemp0 at onewire0 family 0x10 sn 0008005343fb
owid0 at onewire0 family 0x01 sn 0000002078ee
owtemp1 at onewire0 family 0x10 sn 0008005343fb

[grange@fatso grange]$ sysctl hw.sensors | grep ow
hw.sensors.13=owtemp0, Temp, 23.00 degC
hw.sensors.14=owid0, ID, 2128110 raw
hw.sensors.15=owtemp1, Temp, 23.50 degC

It (read: his laziness) took him more than two years to write initial support for the w1 subsystem (reset, search and bit-banging commands) after we bought the first ds18b20 thermal sensors, and another half a year to complete the ds2490 (USB <-> w1 adapter) driver :)

But nevertheless, my congratulations!

/devel/other :: Link / Comments (0)


Tue, 26 Sep 2006

Day of various flat development things.


I visited the "Leroy Merlin" shop again and bought a bunch of small tools, and also got some dry mixes (mostly filler), several sheets of plasterboard (cardboard combined with gypsum, which makes quite thick sheets of about 9 mm, 1.2 x 2.5 meters in size), underfloor heating for the bathroom, and a lavatory pan (!). It will be delivered on Sep 28, so I will probably feel a bit more comfortable after I set it up.
I also finished filling and priming the bathroom, so it is almost ready for ceramic tile, but I have not bought it yet. Since the electricity is already completed, I started to putty the walls - as a man who tries to do it for the first time, I selected the biggest and most visible one - and since the water ran out this evening I only completed half of it (being dirty as a pig). Well, it really looks not bad, I would even say good (if I do not praise myself, who will?), if I did not know that the walls are actually not straight and that it is quite problematic to fix that with putty. But if you do not know where to look, you likely will not notice it. The puttying took about 3 hours, but most of that time I spent finding the right technique, trying to fill the main wall curvature and the like - there are only three such places in the flat, and probably a couple on the ceiling, but I have some plans to create a suspended ceiling in some places... I will take a couple of photos of my loft and of the view from the windows and roof in a couple of days.

/devel/flat :: Link / Comments (0)


Mon, 25 Sep 2006

Acrypto has been ported to 2.6.18.


Combined patchset includes:

  • acrypto core
  • IPsec ESP4 port to acrypto
  • dm-crypt port to acrypto
  • OCF to acrypto bridge, which allows OCF device drivers to run with acrypto (for example ixp4xx)


The issue with strange IPsec behaviour with the vanilla tree and my setup is not resolved yet, and although it does not matter whether the system runs acrypto or the vanilla tree, I am postponing the official release notes and mailing list presentation until it is resolved (if it will be: since it is my test system and users do not complain about it on their machines, I think it does not have too high a priority, and I will not bother developers if things cannot be easily resolved). For now, one can download the patch from the archive.

/devel/acrypto :: Link / Comments (0)


Fri, 22 Sep 2006

Climbing.


It was an interesting and hard training session today - I did not climb the whole week, so I was ready for a good training, but my first exercises showed that I am losing form - the second traverse over all holds on the first shield alone was completed in only one direction, and my arms were very tired after it. Since Grange decided not to climb, I climbed with a local climber, Irin. I completed several old routes ("Mini-cooper", the route with a dynamic jump, and a not so old complex route over blue holds in the central sector, which was created recently). She completed several old routes and then, under my guidance, tried the jumping one, although removing two holds and making a jump was only my "improvement", so she only tried the original version. It was a very good training after quite a big delay; I hope I will return to systematic training next week.

/life :: Link / Comments (0)


IPsec was changed again in 2.6.18 (and now it is broken).


So I need to run through my IPsec related acrypto changes again.
I've noticed a strange thing with the current port - an incoming connection can be easily established and runs quite smoothly, while an outgoing one is very slow, and it looks like there are a lot of spurious retransmits all over the place, which definitely does not make it easy to find a single point of failure. It looks like XFRM state is destroyed very frequently and packets are queued until renegotiation happens, but I may be mistaken.

It looks like it is a 2.6.18 kernel bug and not an acrypto one, since with the default kernel I get the same strange result. Here is tcpdump output between the 2.6.18 kernel (192.168.4.78) and the 2.6.17-1.2139_FC5smp kernel (192.168.4.79). I try telnet 192.168.4.79 22 after the key daemons have exchanged keys, and this results in quite a long response time:

15:15:47.396925 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x21), length 84
15:15:47.397391 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x18), length 84
15:15:47.397025 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x22), length 84
15:15:47.404166 IP 192.168.4.79.ssh > 192.168.4.78.47256: P 2541002438:2541002458(20) ack 1601271418 win 91 
15:15:48.279375 IP 192.168.4.79.ssh > 192.168.4.78.47256: P 0:20(20) ack 1 win 91 
15:15:50.031487 IP 192.168.4.79.ssh > 192.168.4.78.47256: P 0:20(20) ack 1 win 91 
15:15:53.535710 IP 192.168.4.79.ssh > 192.168.4.78.47256: P 0:20(20) ack 1 win 91 
15:16:00.544154 IP 192.168.4.79.ssh > 192.168.4.78.47256: P 0:20(20) ack 1 win 91 
15:16:14.561064 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x19), length 100
15:16:14.561218 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x23), length 84

As you can see, there are unencrypted messages between the machines, which I suspect are the result of broken behaviour somewhere in the XFRM stack. ping works OK though:

15:15:37.919617 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x1c), length 116
15:15:37.919858 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x13), length 116
15:15:38.920772 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x1d), length 116
15:15:38.920823 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x14), length 116
15:15:39.920823 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x1e), length 116
15:15:39.920883 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x15), length 116
15:15:40.920848 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x1f), length 116
15:15:40.920893 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x16), length 116

telnet from 2.6.17 to 2.6.18 works OK too:

15:32:57.742011 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x21), length 84
15:32:57.742173 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x33), length 84
15:32:57.742278 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x22), length 84
15:32:57.750256 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x34), length 100
15:32:57.750329 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x23), length 84
15:33:01.201502 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x24), length 84
15:33:01.201640 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x35), length 84
15:33:01.201698 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x36), length 100

It was definitely introduced somewhere in the 2.6.18 release cycle, since 2.6.17 works OK both with acrypto and vanilla kernels. As far as I recall, I created the initial port of acrypto to 2.6.18 after some major changes in the XFRM stack, and it worked too.

It looks like the problem exists even in the 2.6.16 vanilla tree; it really looks broken to me.
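The "spurious retransmits" can be read directly off the first capture: what appear to be five copies of the same 20-byte ssh segment arrive in cleartext from 192.168.4.79. A small sketch (the helper name gaps is mine, the timestamps are copied from the tcpdump output above) computes the spacing between them. The roughly doubling intervals look like ordinary TCP retransmission timeout backoff, which would be consistent with the sender never seeing an ACK, though that is only my reading of the trace.

```python
from datetime import datetime

# Timestamps of the repeated cleartext ssh segments in the first capture.
STAMPS = [
    "15:15:47.404166",
    "15:15:48.279375",
    "15:15:50.031487",
    "15:15:53.535710",
    "15:16:00.544154",
]


def gaps(stamps):
    """Return the time difference in seconds between consecutive timestamps."""
    times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    for gap in gaps(STAMPS):
        print(f"{gap:.3f} s")
```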

/devel/acrypto :: Link / Comments (0)


Thu, 21 Sep 2006

Electricity fixup completed.


Late in the evening, by the light of an electric torch, I completed the electricity panel and even installed one appliance receptacle. I will create additional wiring channels soon and complete the lights and switches (all temporary, since they will be replaced by better ones after the walls are filled). The next task is the water system setup, which will contain filters for both hot and cold water, collectors, check valves, a boiler and the actual pipes across the small bathroom for the water and sewerage systems.

/life :: Link / Comments (0)


Intel folks implemented TCP socket splicing.


Here is initial presentation.
It looks like Linus prefers that way of doing pseudo zero-copy receiving, although there are other ways too, which allow real zero-copy support into the VFS cache and userspace to be created (well, to be 100% fair I need to admit that I know only my own two implementations: the old one, and the one based on the network allocator). There is also an implementation by Alexey Kuznetsov, which does only one copy and is very similar to Intel's splice work.
The main problem on the receiving side is that received data is almost always misaligned for use in the VFS or userspace. No modern hardware makes it easy to specify where to put that data, and only a few NICs allow header split, i.e. putting headers into skb->data and data into the list of fragments. This means that most of the time data must be copied to fill gaps in the VFS cache, which completely kills the whole idea.
It is very unlikely that vendors will add header split to their hardware, although it could be done as a marketing step by some of them which are closely connected with the Linux network development team, like Neterion.
Using simple header split and the ability to specify data alignment, it is possible to completely eliminate additional copies for any kind of received data, even if its chunk size is not a power of two, as was shown in the initial zero-copy implementation. If MMIO copy were not that slow, it would be possible to outperform modern NICs in server-like workloads with a cheap card.
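To make the alignment problem concrete, here is a small arithmetic sketch. All numbers are assumptions for illustration and do not come from the post: a 14-byte Ethernet header, 20-byte IPv4 and TCP headers without options, a 1500-byte MTU and 4096-byte pages. It shows where the payload starts inside a received frame and how consecutive payloads fall onto page cache pages.

```python
ETH_HDR, IP_HDR, TCP_HDR = 14, 20, 20   # assumed header sizes, no options
MTU, PAGE = 1500, 4096                  # assumed MTU and page size


def payload_layout(packets):
    """Return (payload offset in frame, payload size, per-packet placement).

    Each placement is (first page, offset in that page, last page) for the
    file region that the packet's payload has to end up in.
    """
    payload = MTU - IP_HDR - TCP_HDR
    offset_in_frame = ETH_HDR + IP_HDR + TCP_HDR
    layout = []
    for i in range(packets):
        start = i * payload
        end = start + payload - 1
        layout.append((start // PAGE, start % PAGE, end // PAGE))
    return offset_in_frame, payload, layout


if __name__ == "__main__":
    offset, size, layout = payload_layout(4)
    print(f"payload starts {offset} bytes into the frame, {size} bytes long")
    for first, off, last in layout:
        note = "straddles two pages" if first != last else "fits in one page"
        print(f"page {first} offset {off}: {note}")
```

Under these assumptions the payload begins 54 bytes into the receive buffer and 1460-byte chunks never line up with page boundaries, which is why, without header split and controllable placement, the data has to be copied into the page cache.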

Let's return to Intel's TCP splice implementation. Since they use splicing, they need to put data into pages and provide them as a pipe buffer, so for NICs that do not use a fragment list it will require a per-packet page allocation, its mapping, copying of the data and placing it into the pipe buffer.
The splice pipe itself is just a wrapper over wake_up(), i.e. it is only called "pages were put into the pipe". Actually a special structure allocated on the stack is provided to splice_to_pipe() and stores pointers to those pages; splice_to_pipe() just performs some checks and wakes the remote side up so it can get the provided pages.
One can see here that splicing introduces another postponing of work with sleeping/awakening, which in some places can end up in major performance degradation.

So TCP splice has two major problems, which are there by splice design - the required allocation/mapping/copying (compared to copy_*_user() copying only) and additional work postponing. The usual socket code has a lot of optimisations for the case when the receiving process does not sleep, which increase socket code performance a lot (and make sockets a bit closer to netchannels by design), and which are completely removed by the splicing work.
Probably all of the above are the reasons for the performance drop for receiving splice shown by the Intel folks.

/devel/networking :: Link / Comments (0)


Wed, 20 Sep 2006

New kevent release 'take19' - very minor cleanups.


Short changelog:

  • use __init instead of __devinit
  • removed 'default N' from config for user statistic
  • removed kevent_user_fini() since kevent can not be unloaded
  • use KERN_INFO for statistic output

I've sent it to the linux-kernel@ and netdev@ mailing lists and asked for inclusion. It looks like the number of comments about kevent has finally hit zero per release, and 2.6.18 has been released, so let's see how it will end up.

/devel/kevent :: Link / Comments (0)


Mon, 18 Sep 2006

Theatre day.


I've visited an excellent comedy, "There always is simplicity for every sage" ("Na vsyakogo mudreca dovol'no prostoti") by Alexander Ostrovskiy, in the Russian academic youth theatre.
Although Bezrukov forgot his words a few times, it was definitely an excellent performance, since his energy made up for it and lets one forgive small flubs. Although it is a real classic (it was written in 1868), morals and manners are the same even now.

Many thanks to TanyaZ. Abr, I really envy you for having such a wife :)

             And only then I fair was 
             When wrote all that about you.
	        Egor Glumov (Sergey Bezrukov).

P.S. Words order is specially mangled...

/life :: Link / Comments (0)


Wedding weekend is over.


I've returned from the wedding of Alexander Silich and Juliana Sviridovskaya, where I was a wedding witness. There were quite a lot of fun, hot and miserable moments for me. Although I should regret some of them, I do not. I have discussed some of them in private with the concerned parties who wanted it.
Nevertheless it was a really interesting and fun wedding. I'm more than sure that the newlyweds will live very happily with love and respect, and I wish them the same - just do not forget how you feel right now, and keep it in mind and heart forever.

/life :: Link / Comments (0)


Fri, 15 Sep 2006

Netchannel's atcp stack has been ported to userspace stack.


I will test it slightly with a packet socket and will start to move it to the network allocator and full zero-copy next week.
I've also made userspace netchannels behave similarly to sockets when a connection is being established - now a netchannel in the 'CONNECT' state will wait for some time and receive packets instead of immediately returning in the 'SYN_SENT' state. If after a predefined number of packets (although it should be a timeout) the state has not changed to 'ESTABLISHED', the connection and the netchannel are destroyed and netchannel_create() returns an error.
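The connect behaviour described above can be modelled in a few lines. This is only a toy model under my own assumptions, not the real userspace stack: netchannel_create_sketch and max_packets are hypothetical names, and each element of incoming_states stands for the connection state after the stack has processed one received packet.

```python
def netchannel_create_sketch(incoming_states, max_packets=5):
    """Wait in CONNECT until ESTABLISHED is seen or the packet budget runs out."""
    state = "CONNECT"
    for processed, new_state in enumerate(incoming_states, start=1):
        state = new_state  # state after the stack processed this packet
        if state == "ESTABLISHED":
            return state
        if processed >= max_packets:
            break
    # Budget exhausted or no more packets: the connection and the netchannel
    # would be destroyed here, and the caller gets an error.
    raise ConnectionError(f"connection not established, last state {state}")


if __name__ == "__main__":
    print(netchannel_create_sketch(["SYN_SENT", "ESTABLISHED"]))
```

A real implementation would bound the wait by time rather than by packet count, as the post itself notes.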

/devel/networking :: Link / Comments (0)


Wed, 13 Sep 2006

I've started to move netchannels to userspace.


The initial goal is to port the alternative kernel TCP stack used in netchannels to the userspace network stack, which I am currently trying to complete. The next step is to implement netchannel receiving through the network allocator, which will only require a mechanism similar to the zero-copy sniffer to notify userspace about new data that arrived through the netchannel. Sending support is quite a simple task - just use the same code which is used in the zero-copy sending implementation.

/devel/networking :: Link / Comments (0)


Tue, 12 Sep 2006

New development toys have arrived!


BOSCH 2-24 DFR

BOSCH GSR 12v

BOSCH GWS 11-125 CIE V

/devel/flat :: Link / Comments (0)


Kevent. New record!


I've updated kevent to use an RB tree instead of a hash list and changed the socket notification check slightly, and now it is possible to handle 3388 requests per second on my hardware. epoll() on the same hardware allows about 2200, sometimes up to 2500 requests per second. It is possible that the above limit is due to the maximum number of kevents allowed at a time, which is 4096 events.

I've released the 'take18' patchset and asked for inclusion. Short changelog:

  • use RB tree instead of hash table
  • changed readiness check for socket notifications

/devel/kevent :: Link / Comments (0)


Mon, 11 Sep 2006

Climbing day.


It was a training session on old routes - I completed several very simple routes in one go, some old and even very old ones. An aching finger still does not allow me to start climbing on the negative slope, so I try only vertical walls for now. Hopefully next week I will start on the interesting routes on the negative slope, since there are only one or two left on the vertical one, and with an aching finger they automagically become uninteresting...

/life :: Link / Comments (0)


Kevent and trees.


Thinking some more about trees in kevent, I've come to the conclusion that, at least for a web server, the frequency of addition/deletion of kevents is comparable to the number of search accesses, i.e. most of the time events are added, accessed only a couple of times and then removed. This justifies using an RB tree over an AVL tree, since the latter has much slower deletion, although faster search. So for kevent I plan to use an RB tree for now, and later, when my AVL tree implementation is ready, it will be possible to compare them.

/devel/kevent :: Link / Comments (0)


New blog tag - my loft development.


As you probably do not know, I bought myself a tiny flat on the 17th floor in a nice new district (about a year ago, actually), and last week I started to live there (although I do not have official permission, the property is not registered to me yet, I had to enter into a corrupt relationship (again) to get access, plus other unpleasant moments of the Russian construction process). I live like a homeless person, although I have a roof, and if I told you its price you would not believe me (and it is just a panel apartment, not even in Moscow). There are no conveniences at all, not even electricity, but I started to live there to fix all those things up.
I will describe the repair process of my loft under this tag. I put it into the devel section, since I think it is real development, so it deserves to live with my other interesting projects.
The nearest thing to do is the electricity fixup (since I graduated from the Department of Physical and Quantum Electronics at MIPT, which is essentially the same, I think it will not be that hard) and initial repair of walls and ceiling, which I plan to start this week. Since it is my first challenge in this area, I think it will be interesting (for me at least). Yesterday I ordered a lot of materials and tools in the "Leroy Merlin" home improvement shop (actually it is not that good a shop, since it was quite problematic to select plumbing fixtures, so I only got basic rough materials and tools there, which is enough for now).
Sometimes I will post photos and descriptions of the most interesting development moments. Feel free to contact me with ideas and suggestions, since it looks like I completely lack an artist's imagination :)

/devel/flat :: Link / Comments (0)


Fri, 08 Sep 2006

Climbing.


The second training after quite a long delay was much more productive than the first one - I completed several routes. I failed the new one, "Mad Point" over blue holds, but only in its second part, so progress is there. I also tried to complete quite a complex route over rock-crack holds, but failed even in the first half - one of my fingers aches quite noticeably, especially on routes where the third phalanx is used, which disappoints me a lot.

/life :: Link / Comments (0)


Thu, 07 Sep 2006

What is NAT?


NAT - network address translation - is a mechanism which allows several hosts to share the same "real" IP (i.e. an IP address which can be accessed from the internet or from other such "real" addresses). NAT requires rebuilding each packet so that its addresses and even ports are changed and the packet looks like it originated from the machine with the "real" IP. Thus the machine which does NAT has to keep information about each set of source/destination addresses and ports for each connection it changes (even if the connection does not have ports, like ICMP; in that case only IP addresses are changed). Some OSes like OpenBSD do it using an approach similar to bind(), i.e. for each new connection the NAT server uses a unique port: it changes addresses and ports so that the packet looks like it originated from its own "real" IP address and the selected port. This limits it to 64k simultaneous connections, which is quite a high number, but let's try to look into the future: 64k connections is just 2^16 sockets, and each TCP one eats about one page, i.e. it is just 256 MB of memory for a busy server (not including the size of the data which can be placed into those 64k socket queues). One might think that having more than 64k TCP clients is impossible, since the port number is limited to 16 bits, but Linux does not reserve a port for each newly accepted socket, so it is possible to have as many sockets as your memory allows.

Anyway, any limitation in the initial design phase (or even in thoughts about features) is a very bad sign which will likely kick us later, so port reservation is not a scalable way to go.
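A minimal sketch of the port-reservation approach and of the memory estimate above. The class PortNat and its methods are hypothetical names for illustration and are not OpenBSD's or anyone else's implementation; the 4096-byte page size is an assumption, and the addresses are documentation ranges.

```python
PAGE_SIZE = 4096   # assumed page size
PORTS = 1 << 16    # 16-bit port number space

# 2^16 sockets at about one page each, as estimated in the text.
TABLE_MEMORY = PORTS * PAGE_SIZE


class PortNat:
    """Toy NAT that reserves one public port per translated connection."""

    def __init__(self, public_ip, first_port=1024, last_port=65535):
        self.public_ip = public_ip
        self.free = iter(range(first_port, last_port + 1))
        self.out = {}    # (src_ip, src_port, dst_ip, dst_port) -> public port
        self.back = {}   # public port -> (src_ip, src_port)

    def translate_out(self, src_ip, src_port, dst_ip, dst_port):
        key = (src_ip, src_port, dst_ip, dst_port)
        port = self.out.get(key)
        if port is None:
            port = next(self.free, None)
            if port is None:
                # This is the hard limit the text argues against.
                raise RuntimeError("no free ports left for a new connection")
            self.out[key] = port
            self.back[port] = (src_ip, src_port)
        return self.public_ip, port

    def translate_in(self, public_port):
        return self.back.get(public_port)


if __name__ == "__main__":
    nat = PortNat("203.0.113.1")
    print(nat.translate_out("192.168.0.2", 40000, "198.51.100.7", 80))
    print(TABLE_MEMORY // (1024 * 1024), "MB for", PORTS, "one-page sockets")
```

The sketch never releases ports, which a real implementation obviously must do; it only illustrates where the 64k ceiling comes from.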

Let's see how NAT is implemented in Linux netfilter.
The base and Holy Grail for NAT and other manipulation elements in the Linux kernel is connection tracking - a system which holds information about each connection, even if you do not use mangling features (like NAT) but have only enabled and loaded the connection tracking modules. The complexity of the connection tracking system cannot be described in a couple of words (and actually I do not know it well enough to, for example, fix some bugs just by looking at a trace), but basically the connection tracking system is a set of callbacks which parse incoming and outgoing packets according to their protocol on various levels. Each set of callbacks is organized into special structures which are placed into hash tables, where lookups happen each time a new packet enters the connection tracking system.
Connection tracking is placed into netfilter hooks, where NAT puts itself too.

When a packet reaches the NAT hook, the NAT core gets all relevant info through the pointer to the connection tracking entry stored in the skb (it was placed there by the connection tracking system, which parsed the packet before NAT and other netfilter hooks; besides other information, the entry holds the new addresses and ports). It then changes the appropriate packet data, recalculates the checksum and pushes the packet forward to the next netfilter hook or protocol processing function.
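The "change the data, recalculate the checksum" step can be sketched for the IPv4 header alone. This is a userspace illustration, not netfilter code: the function names are mine, only the source address is rewritten, and a real NAT also has to fix the TCP/UDP checksum (which covers a pseudo-header containing the addresses) and possibly the ports.

```python
import socket
import struct


def checksum(data: bytes) -> int:
    """Internet checksum: one's complement of the one's complement 16-bit sum."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def snat_ipv4(header: bytes, new_src: str) -> bytes:
    """Return a copy of a 20-byte IPv4 header with a new source address."""
    hdr = bytearray(header)
    hdr[12:16] = socket.inet_aton(new_src)  # source address field
    hdr[10:12] = b"\x00\x00"                # checksum is computed over a zeroed field
    struct.pack_into("!H", hdr, 10, checksum(bytes(hdr)))
    return bytes(hdr)


if __name__ == "__main__":
    # UDP packet header from 192.168.0.1 to 192.168.0.199, checksum field zeroed.
    original = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
    translated = snat_ipv4(original, "203.0.113.1")
    print(translated.hex())
```

A header with a correct checksum sums to zero when passed through checksum() again, which is a convenient way to verify the rewrite.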

Everything looks simple until we start to look at how it is designed. NAT itself has several abstraction layers; some of them are accessed through netfilter hooks, some of them as NAT table helpers. Connection tracking itself has additional tons of layers. All entries are placed into lists inside hash tables which cannot grow or shrink and which are protected by global locks. The number of netfilter *tables and *conntrack implementations pushes me into depression. It is a known issue that netfilter itself is not that fast, and actually it could be designed in a slightly different manner, and that connection tracking slows things down very noticeably.

So, looking at all of the above issues, I've come to the strange idea that fixing connection tracking is just an impossible task, and that NAT, which is heavily connected to it, is not worth changing. Instead one could create a new implementation of NAT (and reinvent the wheel), and even of related helpers like FTP and others, which would not require the existing connection tracking system at all. It is possible to put it into the netfilter framework though, so people would not be so scared of such an intrusion.

So how would I design NAT? First of all: no lists and hash tables. Practice shows us every day that a good hash table cannot be created in advance without major knowledge of system behaviour. Eventually it will be either too small, in which case we need to grow it and copy data between the old and new tables, or to increase the number of elements in the list in each hash entry; or it will be too big, so a lot of space will just be wasted.
That problem is easily solved by using trees, with their nodes as the elements which hold all information about packet manipulation.

Each tree can have some flags to show what part of the dataflow it wants to see. For example, not every netfilter rule (actually only a minority of them) wants to see every packet, but only the initial and final sets of them (NAT obviously requires all packets), so the same entry can simultaneously live in several trees (for example a NAT entry should live in the "connect", "established", "final" and so on trees, or whatever they will be called). But this introduces the problem of correctly selecting the type of dataflow per protocol: for example, in TCP the "connect" phase can be described by the SYN bit being set, but what about something like SCTP? Or UDP, which does not have any phases at all? It can be solved by a per-protocol hook which returns the index of the tree in which to search for the connection entry; in that case UDP will always return the same tree no matter how it is called.
Each connection entry must have a processing function of its own and some private area to store helper information, and that's all for now.
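The per-protocol hook and the shape of a connection entry could look roughly like this. Everything here is a sketch under my own assumptions: the names are hypothetical, plain dicts stand in for the real trees, and the TCP rule (SYN selects "connect", FIN or RST selects "final") is just one possible mapping.

```python
from dataclasses import dataclass, field
from typing import Callable

IPPROTO_TCP, IPPROTO_UDP = 6, 17
TCP_FIN, TCP_SYN, TCP_RST = 0x01, 0x02, 0x04
TREE_NAMES = ("connect", "established", "final")


@dataclass
class ConnEntry:
    process: Callable[[bytes], bytes]            # per-entry processing function
    private: dict = field(default_factory=dict)  # private area for helper data


def tcp_tree_index(flags: int) -> int:
    if flags & TCP_SYN:
        return 0                                 # "connect" phase
    if flags & (TCP_FIN | TCP_RST):
        return 2                                 # "final" phase
    return 1                                     # "established"


def udp_tree_index(flags: int) -> int:
    return 1                                     # UDP has no phases: always the same tree


PROTO_HOOKS = {IPPROTO_TCP: tcp_tree_index, IPPROTO_UDP: udp_tree_index}

# Plain dicts stand in for the real trees in this sketch.
trees = {name: {} for name in TREE_NAMES}


def register(entry: ConnEntry, key, tree_names=TREE_NAMES):
    """Put the same entry into every tree it wants to see packets from."""
    for name in tree_names:
        trees[name][key] = entry


def find_entry(proto: int, flags: int, key):
    """Ask the protocol hook which tree to search, then look the entry up."""
    name = TREE_NAMES[PROTO_HOOKS[proto](flags)]
    return name, trees[name].get(key)
```

A NAT entry would be registered in all three trees, while a rule that only cares about connection setup would be registered in "connect" alone and never be looked up for established traffic.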

When an administrator sets up NAT rules, he/she will frequently use wildcards, i.e. "transform all packets from interface eth0 (or from a given subnet) to look like they have a source/destination/something_else which belongs to the provided interface (the old MASQUERADE iptables target) or is equal to the provided number (like it is done for SNAT/DNAT)". In that case each packet must be checked against all such rules to find the matching one, to which our tree of possible manipulations will be attached. If there is a matching manipulation rule but no connection entry in the appropriate tree, the entry should be created.
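The rule walk and on-demand entry creation can be sketched as follows. The Rule class, its fields and the lookup function are hypothetical illustrations rather than iptables structures, a dict again stands in for the per-rule tree, and the addresses are documentation ranges.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rule:
    iface: Optional[str]                 # None matches any interface
    src_net: ipaddress.IPv4Network       # wildcard: a whole subnet
    new_src: str                         # SNAT-like replacement address
    entries: dict = field(default_factory=dict)  # stands in for the rule's tree


def lookup(rules, iface, src, sport, dst, dport):
    """Find the first matching rule and return its connection entry, creating it if needed."""
    addr = ipaddress.ip_address(src)
    for rule in rules:
        if rule.iface is not None and rule.iface != iface:
            continue
        if addr not in rule.src_net:
            continue
        key = (src, sport, dst, dport)
        entry = rule.entries.get(key)
        if entry is None:
            # Matching rule but no connection entry yet: create it.
            entry = rule.entries[key] = {"new_src": rule.new_src, "packets": 0}
        entry["packets"] += 1
        return entry
    return None  # no rule wants to manipulate this packet


if __name__ == "__main__":
    rules = [Rule("eth0", ipaddress.ip_network("192.168.0.0/24"), "203.0.113.1")]
    print(lookup(rules, "eth0", "192.168.0.2", 40000, "198.51.100.7", 80))
```

Only the first packet of a flow pays for the linear walk over wildcard rules in a fuller design; later packets could go straight to the tree lookup.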

The tree implementation must be selected with performance in mind - the most obvious candidates are the RB tree, the AVL tree and the splay tree. The latter is actually not that good, since it requires rebalancing after each access, which is not the best approach for a server which does not know in advance which dataflows will be accessed more frequently. The choice between AVL and RB trees, in my opinion, should be made in favour of the AVL tree, since the search depth in that tree (i.e. the per-packet overhead) never exceeds about 1.44*log2(N), where N is the number of elements, while with the RB tree it can reach 2*log2(N). But deletion in an AVL tree can take up to O(log(N)) rotations, while in RB trees it never exceeds 3, so it is possible to create some kind of fake deletion for the AVL tree, where a node is just marked as deleted and the deletion itself is either postponed to a better context or not performed at all (if the system has a lot of connections open, that empty slot will be reused eventually, and if it does not have a lot of connections, the price of postponed deletion is not that high).
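To put numbers on the two height bounds, here is a small calculation. It uses the standard textbook worst-case bounds, about 1.4405*log2(N+2) - 0.3277 for AVL trees and 2*log2(N+1) for RB trees, which are slightly more precise forms of the figures quoted above; the function names are mine.

```python
import math


def avl_max_height(n: int) -> float:
    """Worst-case AVL tree height for n nodes (standard textbook bound)."""
    return 1.4405 * math.log2(n + 2) - 0.3277


def rb_max_height(n: int) -> float:
    """Worst-case red-black tree height for n nodes."""
    return 2 * math.log2(n + 1)


if __name__ == "__main__":
    for n in (1023, 65535, 1048575):
        print(f"N={n}: AVL <= {avl_max_height(n):.1f}, RB <= {rb_max_height(n):.1f}")
```

For 64k entries the worst-case search path is under 23 nodes in an AVL tree versus up to 32 in an RB tree; typical trees of both kinds are much closer to log2(N).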

An appropriate tree implementation can also be used by kevent to store kevents, instead of the (I must admit it now) badly prepared hash table.

/devel/networking :: Link / Comments (0)


Kevent 'take17' patchset has been released.


It contains trivial cleanups mentioned before. Short changelog:

  • misc cleanups (__read_mostly, const ...)
  • created special macro which is used for mmap size (number of pages) calculation
  • export kevent_socket_notify(), since it is used in network protocols which can be built as modules (IPv6 for example)

/devel/kevent :: Link / Comments (0)


Wed, 06 Sep 2006

Finally climbing day.


I had not visited the climbing zone for several weeks, so today's training was really simple. I climbed two traces: the old interesting "Do not want" over black holds and a new one, "Mad Point", which I failed several times, so it was not really completed. This new trace takes second place in my own ranking of complex traces I have climbed, and since it is on a vertical wall I will definitely complete it after several attempts.

/life :: Link / Comments (0)


Network allocator is now configurable.


I.e. one can select in the kernel config whether to compile it or not. The same applies to the zero-copy sniffer, which makes it possible to remove the sniffer's overhead and fixes some of the other temporary limitations mentioned earlier.

/devel/networking/nta :: Link / Comments (0)


Kevent 'take16' patchset is released and subsystem is frozen for some time.


Since the number of comments has dropped almost to zero, I am freezing kevent development for some time (since resending practically the same patches into /dev/null is not that interesting a task) and switching to the implementation of a special tree, which will probably be used with kevents instead of the hash table, and for other projects.

Short changelog for just released 'take16' patchset:

  • converted kevent_timer to high-resolution timers, this forces timer API documentation update
  • use struct ukevent * instead of void * in syscalls (documentation has been updated)
  • added warning in kevent_add_ukevent() if ring has broken index (for testing)

/devel/kevent :: Link / Comments (0)


Mon, 04 Sep 2006

Kevent 'take15' has been released.


Short changelog:

  • added kevent_wait(). This syscall waits until either the timeout expires or at least one event becomes ready. It also commits that @num events starting from @start have been processed by userspace and thus can be removed or rearmed (depending on their flags). It can be used to commit events read by userspace through the mmap interface. Example userspace code (evtest.c) can be found on the project's homepage.
  • added socket notifications (send/recv/accept). A trivial web server based on kevent and these features instead of epoll shows a more than noticeable performance increase. More details about the benchmark and the server itself (evserver_kevent.c) can be found on the project's homepage.

Split patches are available in the archive.
I've also updated the documentation here.

/devel/kevent :: Link / Comments (0)


Sat, 02 Sep 2006

Zero-copy sniffer.


I was wrong about the magical ab symbols found in the sniffer dump - it is usual data being sent, but since the sending side reserves MAX_TCP_HEADER bytes, in my setup the ethernet header starts at an offset of 190 bytes from the beginning of the allocated buffer when sending, and at an offset of 16 bytes when receiving.

Sending sequence number graph for tcpdump and zero-copy sniffers.

As you can see, there are no gaps in the graphs (although it is just an scp transfer using a 100Mb NIC (3c59x)). CPU usage for the zero-copy sniffer when data is being written into /dev/null is about two times less than when it is written using tcpdump.

I've released new version of network allocator and zero-copy sniffer and sent it to netdev@ with a question about possibility of inclusion into mainline.

Zero-copy sniffer has following overheads:

  • several atomic operations (in the worst case one atomic_set(), one atomic_inc() and one or two atomic_dec_and_test())
  • one lock (bad global lock per sniffer device), which is held when information about new packet is being put into sniffer's queue when skb is freed
  • delayed freeing, which can lead to increased memory usage, or (as implemented), when a maximum amount of data "locked" by the sniffer is introduced, to some packets being dropped by the sniffer.

Limitations of the current version (introduced not due to design problems, but intentionally, to test various special usage cases):

  • use NTA only for netdev_alloc_skb() and sk_stream_alloc_pskb(), i.e. only for allocations of traffic received by NIC and sent through send() syscall over stream socket.
  • always compile the zero-copy sniffer in, which increases memory usage and adds the overhead described above.
  • skb_copy() always allocates data from the SLAB allocator, although it could check whether the original skb's data was allocated through NTA, but I think that skb_copy() is completely incompatible with high performance.
  • it is possible to eliminate several atomic operations (I'm lazy).
  • debug code (poisoning of the tail of the buffer and additional reference counter) is always compiled in.
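
The delayed freeing with a cap on "locked" data listed among the overheads above could look roughly like the following userspace sketch. It uses C11 atomics instead of the kernel atomic_*() helpers, all names are hypothetical, and it only illustrates the idea, not the actual NTA or sniffer code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical data buffer shared by the network stack and the sniffer. */
struct buf {
    atomic_int refcnt;   /* one reference is owned by the stack */
    size_t size;
    bool freed;          /* stands in for returning memory to the allocator */
};

struct sniffer {
    atomic_size_t locked;  /* bytes currently held by the sniffer */
    size_t max_locked;     /* cap on the amount of "locked" data */
};

static void buf_init(struct buf *b, size_t size)
{
    atomic_init(&b->refcnt, 1);
    b->size = size;
    b->freed = false;
}

/* Drop one reference; the last owner actually frees the data. */
static void buf_put(struct buf *b)
{
    if (atomic_fetch_sub(&b->refcnt, 1) == 1)
        b->freed = true;
}

/*
 * The sniffer tries to keep the buffer alive until userspace has read it.
 * If that would exceed the cap, the packet is dropped by the sniffer only;
 * the network stack is not affected.
 */
static bool sniffer_hold(struct sniffer *s, struct buf *b)
{
    size_t old = atomic_fetch_add(&s->locked, b->size);

    if (old + b->size > s->max_locked) {
        atomic_fetch_sub(&s->locked, b->size);
        return false;
    }
    atomic_fetch_add(&b->refcnt, 1);
    return true;
}

/* Called when userspace is done with the data. */
static void sniffer_release(struct sniffer *s, struct buf *b)
{
    atomic_fetch_sub(&s->locked, b->size);
    buf_put(b);
}
```

With this scheme the stack's own free only drops a reference, so memory stays in use until the sniffer releases it, which is exactly the increased memory usage mentioned above.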

/devel/networking/zcs :: Link / Comments (0)


Fri, 01 Sep 2006

Zero-copy sniffer.


I've fixed the mapping bug and forced the network stack to use the network allocator only for packets which are created either by a network device (receiving) or through the send() syscall over a stream socket, so the current version does not catch netlink messages, unix sockets and so on. Here is a typical zero-copy sniffer log:

dump  447.1024: ptr: 0xc19b0f80, start: 0xc19b0000, size: 1956, off: 200576: entry: 0, cpu: 0: 
	ab:ab:ab:ab:ab:ab -> ab:ab:ab:ab:ab:ab, type: abab, 
dump  448.1024: ptr: 0xc19fa880, start: 0xc19f8000, size: 1828, off: 501888: entry: 0, cpu: 0: 
	00:11:09:61:eb:0e -> 00:10:22:fd:c4:d6, type: 0800, 192.168.0.48:57758 -> 192.168.4.78:5632, proto: 6, 
dump  449.1024: ptr: 0xc1a01080, start: 0xc1a00000, size: 1828, off: 528512: entry: 0, cpu: 0: 
	00:11:09:61:eb:0e -> 00:10:22:fd:c4:d6, type: 0800, 192.168.0.48:57758 -> 192.168.4.78:5632, proto: 6, 
dump  450.1024: ptr: 0xc19f4800, start: 0xc19f4000, size: 1828, off: 477184: entry: 0, cpu: 0: 
	00:11:09:61:eb:0e -> 00:10:22:fd:c4:d6, type: 0800, 192.168.0.48:57758 -> 192.168.4.78:5632, proto: 6, 
dump  451.1024: ptr: 0xc1a01f80, start: 0xc1a00000, size: 1828, off: 532352: entry: 0, cpu: 0: 
	00:11:09:61:eb:0e -> 00:10:22:fd:c4:d6, type: 0800, 192.168.0.48:57758 -> 192.168.4.78:5632, proto: 6,
dump  318.1024: ptr: 0xc1b80780, start: 0xc1b80000, size: 1828, off: 1920: entry: 0, cpu: 1: 
	02:30:9b:0c:89:e8 -> ff:ff:ff:ff:ff:ff, type: 0800, 192.168.4.9:43281 -> 255.255.255.255:43281, proto: 17, 
dump  330.1024: ptr: 0xc1b86580, start: 0xc1b84000, size: 1828, off: 25984: entry: 0, cpu: 1: 
	02:00:63:1f:2d:81 -> 01:00:5e:00:01:14, type: 0800, 192.168.5.231:43281 -> 224.0.1.20:43281, proto: 17, 
dump  331.1024: ptr: 0xc1b86d00, start: 0xc1b84000, size: 1828, off: 27904: entry: 0, cpu: 1: 
	02:3a:d1:7e:6e:65 -> 01:00:5e:00:01:14, type: 0800, 192.168.5.232:43281 -> 224.0.1.20:43281, proto: 17,

Look at the strange line with ab symbols instead of the ethernet fields - this is an skb which was freed in tcp_clean_rtx_queue() when an ACK was received. The network allocator fills the allocated area with ab bytes for debugging purposes, and it looks like the TCP state machine preallocates some packets and then frees them without actual usage. The number of such empty allocations is actually not so small.

I plan to run an interesting benchmark tomorrow - the test machine will generate traffic using different packet sizes and the sniffer will log TCP sequence numbers on that sending machine, then I will plot a graph of sent and missed packets for the zero-copy sniffer and tcpdump.

/devel/networking/zcs :: Link / Comments (0)


I've added photos from the Scandinavian trip to the gallery.


Have a nice time!

/life :: Link / Comments (0)