Sat, 30 Sep 2006
Kevent is going to be included in the -mm tree.
I think I know why it happened :).
Thanks a lot!
/devel/kevent :: Link / Comments (0)
Fri, 29 Sep 2006
Climbing.
Due to flat development I still cannot start regular training,
so I had a week's delay again. Nevertheless, today's training was not bad.
I completed a lot of various traverses and found two new traces,
which I will try when Grange
starts to climb again. One of them is on a vertical wall
and does not have that high an index, so I expect to complete it on-sight.
The other is quite a complex trace on a negative slope. I have climbed only
one such complex trace on a negative slope, but this one at least
has a moderate, though not easy, start
(I tried only the first three meters, where instructors allow climbing
without a belay; that was about 6 or 7 holds, since the trace zig-zags).
/life :: Link / Comments (0)
Userspace network stack.
I've finally returned to it and implemented a bunch of stuff (mostly
ported from the
netchannels kernel alternative TCP/IP stack);
the main parts are the retransmit queue implementation and some congestion
control tweaks. It works quite robustly, but its speed (still through a packet socket)
is not good enough, so I will investigate this issue and then move to
zero-copy sending and receiving.
/devel/networking :: Link / Comments (0)
I know what should be done for kevent so that it would be included into mainline, but I will not do it.
It is just a matter of adding struct sigmask to the main
kevent
function, sys_kevent_get_events(). It was done for epoll(),
poll() and select() by introducing additional syscalls,
but let's see what the reason for such behaviour is. Why is it needed at all?
Existing systems become more and more complex, and frequently (if not always) they
require some kind of asynchronous notification, which is done through signals in
modern OSes. But frequently a delivered signal cannot be processed (for example when
the signal queue overflows), and even if the appropriate handler can be called, it is not
a good idea to perform complex tasks in a signal handler, so it should somehow notify
the main process about the need to perform additional operations. Frequently sys_poll()
and friends are called in the core of the event handling state machine, so it is very
logical to put signal processing there too, but with the usual asynchronous delivery
of signals it is required to put some locks into the handler and also some locks inside
the sys_poll() check loop, and even if that is done, with the existing
design of event delivery mechanisms it is impossible to determine 100% correctly
what happened first: the signal was delivered or the event became ready.
To solve this problem POSIX designed a set of special syscalls
(sys_ppoll() as an addon to sys_poll(), for example), which take struct sigmask
as a parameter. If it is not NULL, it is a signal mask that is installed atomically
for the duration of the call, so signals which are kept blocked elsewhere can be
delivered only while the process waits inside sys_ppoll(). This allows signals to be delivered
without races, since when sys_ppoll() returns it is either
due to one of the signals allowed by struct sigmask, or due to the usual
cases of error, timeout or ready events.
Let's see why kevent does not need it at all.
Kevent can drive any kind of event, not only those which
are supported by sys_poll(), which means that the correct
solution, instead of the POSIX workaround, is to just implement signal
events as kevent users (like AIO completion) and add them in the usual
way into the kevent queue. Event readiness is atomic in kevent, which
means that whatever happens first, a signal or another event, will be put
into the ready queue first. In that case userspace should just check
the type of the returned event and, if it is a signal, perform the appropriate
operations. People who want struct sigmask there actually just
do not know what kevent is (judging by the comments in e-mails, I doubt the code/mails were even read)
or how it is possible to work with it.
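The argument above can be illustrated with a toy userspace model. This is only a sketch and not the kevent API: the ReadyQueue class, the Event type and the EV_* names are hypothetical, and a lock stands in for the atomic readiness update. It shows that once signals are just another event source feeding one ready queue, the consumer sees them in arrival order and only has to check the event type.

```python
from collections import deque
from dataclasses import dataclass
from threading import Lock

# Hypothetical event types; the real kevent constants are not shown in the post.
EV_SOCKET = "socket"
EV_SIGNAL = "signal"


@dataclass(frozen=True)
class Event:
    kind: str   # which event source produced it
    data: int   # file descriptor or signal number


class ReadyQueue:
    """Toy model of a single ready queue shared by all event sources."""

    def __init__(self):
        self._lock = Lock()
        self._ready = deque()

    def mark_ready(self, event):
        # Enqueueing under one lock preserves the order in which events fired.
        with self._lock:
            self._ready.append(event)

    def get_events(self, max_events):
        # Return up to max_events ready events, oldest first.
        with self._lock:
            count = min(max_events, len(self._ready))
            return [self._ready.popleft() for _ in range(count)]


def dispatch(events):
    """Userspace only checks the type of each returned event."""
    log = []
    for ev in events:
        if ev.kind == EV_SIGNAL:
            log.append(f"signal {ev.data}")
        else:
            log.append(f"socket fd {ev.data} ready")
    return log


queue = ReadyQueue()
queue.mark_ready(Event(EV_SOCKET, 5))
queue.mark_ready(Event(EV_SIGNAL, 15))
queue.mark_ready(Event(EV_SOCKET, 7))
result = dispatch(queue.get_events(16))
print(result)
```

No signal mask is passed anywhere in this model: the ordering question ("signal first or event first?") is answered by the position in the queue.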
That is why I will not add struct sigmask to the syscall parameters,
and that is why kevent will be stuck where it is now.
But actually I do not care; I did it not specifically for the purpose of kernel inclusion
(although that would be good), but just because I like it.
I've just thought about the situation - it is becoming just funny.
There is a very useful (I do not exaggerate) feature, there are a lot of people
who want it included and want to use it, there has been demand for it for several years already
with regular talks about how it could be implemented, and
now that it is done (and not only done, but done with all the features requested
in the earlier empty talks in mind), it will not be included, just because
yet another feature (or features) was not added (note: just added,
i.e. it does not require replacing or breaking anything,
just adding another kevent user), and the people who want that missing feature
do not and are not going to implement it, which will force
yet another period of empty talks and handwaving...
LOL.
But enough, the crying is over; I have some interesting work on the
userspace network stack for
netchannels.
/devel/kevent :: Link / Comments (0)
Thu, 28 Sep 2006
New asynchronous crypto layer (acrypto) release.
I'm pleased to announce the asynchronous crypto layer (acrypto)
release for the 2.6.18 kernel tree. Acrypto allows crypto requests to be handled
asynchronously in hardware.
With this release of the combined patchset for 2.6.18 I drop feature
extensions for the 2.6.16 and 2.6.17 trees and move them into maintenance
state.
The combined patchset (190k) and drivers for various acrypto providers can
be found on the project's homepage.
/devel/acrypto :: Link / Comments (0)
Added some photos from my window and roof of the house I live in.
You can find them in gallery.
/life :: Link / Comments (0)
Wed, 27 Sep 2006
Grange added another one-wire adapter to OpenBSD.
Here is a short dmesg from a running OpenBSD system with w1 subsystem drivers:
uow0 at uhub2 port 2
uow0: Dallas Semiconductor USB-FOB/iBUTTON, rev 1.00/0.02, addr 2
onewire0 at uow0
owtemp0 at onewire0 family 0x10 sn 0008005343fb
owid0 at onewire0 family 0x01 sn 0000002078ee
owtemp1 at onewire0 family 0x10 sn 0008005343fb
[grange@fatso grange]$ sysctl hw.sensors | grep ow
hw.sensors.13=owtemp0, Temp, 23.00 degC
hw.sensors.14=owid0, ID, 2128110 raw
hw.sensors.15=owtemp1, Temp, 23.50 degC
It (read: his laziness) took him more than two years to write initial support for the w1 subsystem (reset, search
and bit-banging commands) after we bought the first ds18b20 thermal
sensors, and another half a year to complete the ds2490 (usb <-> w1 adapter) driver :)
But nevertheless, my congratulations!
/devel/other :: Link / Comments (0)
Tue, 26 Sep 2006
Day of various flat development things.
I visited the "Leroy Merlin" shop again and bought a bunch of
small tools, and also got some dry mixes (mostly filler),
several sheets of cardboard (mixed with stucco, which makes
quite thick sheets of about 9mm, 1.2x2.5 meters in size), a warm floor for the bathroom
and a lavatory pan (!). It will be delivered on Sep 28, so
I will probably feel a bit more comfortable after I set it up.
I also finished filling and priming the bathroom, so it is almost
ready for ceramic tile, but I have not bought the tile yet. Since the electricity
is already completed, I started to putty the walls.
As a man who is trying it for the first time, I selected the most visible
and the biggest one; since the water ran out this evening, I only completed
half of it (and got dirty as a pig).
Well, it really looks not bad, I would even say good
(if I do not praise myself, who will?),
if I didn't know that the walls are actually not straight, and that it is
quite problematic to fix that with putty. But if you do not know where
to look, you will likely not notice it. The puttying took about 3 hours,
but most of that time I spent working out the right technique, trying to fill the main
wall curvature and the like. There are only three such spots in the flat,
and probably a couple on the ceiling, but I have some plans to create a suspended
ceiling in some places... I will take a couple of photos of my loft and of the
view from the windows and roof in a couple of days.
/devel/flat :: Link / Comments (0)
Mon, 25 Sep 2006
Acrypto has been ported to 2.6.18.
The combined patchset includes:
- acrypto core
- IPsec ESP4 port to acrypto
- dm-crypt port to acrypto
- OCF to acrypto bridge, which allows OCF device drivers (for example ixp4xx) to run with acrypto
The issue with strange IPsec behaviour with the vanilla tree on my setup is not resolved
yet, and although it occurs whether the system runs acrypto or the vanilla tree,
I am postponing the official release notes and mailing list announcement until it is resolved
(if it ever is: since it is my test system and users do not complain about it on their machines,
I think it does not have too high a priority, and I will not bother developers
if things are not easily resolved). For now, one can download the patch
from the archive.
/devel/acrypto :: Link / Comments (0)
Fri, 22 Sep 2006
Climbing.
It was an interesting and hard training today. I had not climbed the whole week,
so I was ready for a good training, but my first exercises showed
that I am losing form: the second traverse over all holds
on the first shield alone was completed in only one direction, and
my arms were very tired after it. Since Grange
decided not to climb, I climbed with a local climber, Irin. I completed
several old traces ("Mini-cooper", a trace with a dynamic jump, and
a not so old complex trace over blue holds in the central sector, which
was created recently). She completed several old traces and
then under my guidance tried the jumping one, although it was only my "improvement"
to remove two holds and make a jump, so she only tried the original version.
It was a very good training after quite a big delay; I hope I will return
to systematic training next week.
/life :: Link / Comments (0)
IPsec was changed again in 2.6.18 (and now it is broken).
So I need to run through my IPsec-related
acrypto changes
again.
I've noticed a strange thing with the current port: an incoming connection can be easily established
and runs quite smoothly, while an outgoing one is very slow, and it looks like
there are a lot of spurious retransmits all over the place, which
definitely makes it hard to find a single point of failure.
It looks like the XFRM state is destroyed very frequently and packets are queued
until renegotiation happens, but I may be mistaken.
It looks like it is a 2.6.18 kernel bug and not an acrypto one, since with the default kernel
I get the same strange result. Here is tcpdump output between a 2.6.18 kernel (192.168.4.78)
and a 2.6.17-1.2139_FC5smp kernel (192.168.4.79). I try telnet 192.168.4.79 22
after the key daemons have exchanged keys, and this results in quite a long response time:
15:15:47.396925 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x21), length 84
15:15:47.397391 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x18), length 84
15:15:47.397025 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x22), length 84
15:15:47.404166 IP 192.168.4.79.ssh > 192.168.4.78.47256: P 2541002438:2541002458(20) ack 1601271418 win 91
15:15:48.279375 IP 192.168.4.79.ssh > 192.168.4.78.47256: P 0:20(20) ack 1 win 91
15:15:50.031487 IP 192.168.4.79.ssh > 192.168.4.78.47256: P 0:20(20) ack 1 win 91
15:15:53.535710 IP 192.168.4.79.ssh > 192.168.4.78.47256: P 0:20(20) ack 1 win 91
15:16:00.544154 IP 192.168.4.79.ssh > 192.168.4.78.47256: P 0:20(20) ack 1 win 91
15:16:14.561064 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x19), length 100
15:16:14.561218 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x23), length 84
As you can see, there are unencrypted messages between the machines, which I suspect are the result
of broken behaviour somewhere in the XFRM stack. ping works ok though:
15:15:37.919617 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x1c), length 116
15:15:37.919858 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x13), length 116
15:15:38.920772 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x1d), length 116
15:15:38.920823 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x14), length 116
15:15:39.920823 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x1e), length 116
15:15:39.920883 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x15), length 116
15:15:40.920848 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x1f), length 116
15:15:40.920893 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x16), length 116
telnet from 2.6.17 to 2.6.18 works ok too:
15:32:57.742011 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x21), length 84
15:32:57.742173 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x33), length 84
15:32:57.742278 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x22), length 84
15:32:57.750256 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x34), length 100
15:32:57.750329 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x23), length 84
15:33:01.201502 IP 192.168.4.79 > 192.168.4.78: ESP(spi=0x0961a360,seq=0x24), length 84
15:33:01.201640 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x35), length 84
15:33:01.201698 IP 192.168.4.78 > 192.168.4.79: ESP(spi=0x027181f9,seq=0x36), length 100
It was definitely introduced somewhere in the 2.6.18 release cycle, since 2.6.17 works ok both with
acrypto and vanilla kernels. As far as I recall, I created the initial port of acrypto to 2.6.18
after some major changes in the XFRM stack, and it worked too.
It looks like that problem exists even in the 2.6.16 vanilla tree; it really looks broken to me.
/devel/acrypto :: Link / Comments (0)
Thu, 21 Sep 2006
Electricity fixup completed.
Late in the evening, by the light of an electric torch, I completed
the electricity panel and even installed one appliance receptacle.
I will create additional electric wire channels soon and
complete the lights and switches (all temporary, since they will be replaced
with better ones after wall filling). The next task is the water system setup,
which will contain filters for both hot and cold water, collectors,
check valves, a boiler and the actual pipes across the small bathroom
for the water and sewerage systems.
/life :: Link / Comments (0)
Intel folks implemented TCP socket splicing.
Here
is the initial presentation.
It looks like Linus prefers this way of doing pseudo zero-copy receiving,
although there are some other ways too, which allow real zero-copy
support into the VFS cache and userspace (well, to be 100% fair I need to admit
that I know of only my own two implementations (
the old one,
and one based on the network allocator)).
There is also an implementation by Alexey Kuznetsov, which does only one copy
and is very similar to Intel's splice work.
The main problem with the receiving side is that received data is almost always unaligned
for use in the VFS or userspace. No modern hardware easily allows specifying
where to put that data, and only a few NICs allow header split,
i.e. putting headers into skb->data and data into a list of fragments.
This means that most of the time data must be copied to fill gaps in the VFS cache,
which completely kills the whole idea.
It is very unlikely that vendors will add header split to their hardware,
although it could be done as a marketing step by some of them that are
closely connected with the Linux network development team, like Neterion.
Using simple header split and the ability to specify data alignment,
it is possible to completely eliminate additional copies for any kind
of received data, even if its chunk size is not rounded to a power of two,
as was shown in the
initial zero-copy implementation.
If MMIO copy was not that slow, it would be possible for a cheap card
to outperform modern NICs in server-like workloads.
Let's return to Intel's TCP splice implementation.
Since they use splicing, they need to put data into pages and provide them as a pipe
buffer, so for NICs that do not use a fragment list it will require a per-packet page allocation,
its mapping, copying of the data and placing it into the pipe buffer.
The splice pipe itself is just a wrapper over wake_up(), i.e.
it is only called "pages were put into pipe"; actually a special structure, allocated
on the stack, is provided to splice_to_pipe() and
stores pointers to those pages. splice_to_pipe() just performs
some checks and wakes the remote side up, so it can get the provided pages.
One can see here that splicing introduces another postponing of work with sleeping/awakening,
which in some places can end up in major performance degradation.
So, TCP splice has two major problems, which are there by splice design:
the needed allocation/mapping/copying (compared to copy_*_user() copying only)
and additional work postponing. The usual socket code has a lot of optimisations for
the case when the receiving process does not sleep, which increase socket code performance a lot
(and make sockets a bit closer to netchannels by design),
and which are completely lost with the splicing work.
All of the above are probably the reasons for the receive splice performance drop shown
by the Intel folks.
/devel/networking :: Link / Comments (0)
Wed, 20 Sep 2006
New kevent release 'take19' - very minor cleanups.
Short changelog:
- use __init instead of __devinit
- removed 'default N' from config for user statistic
- removed kevent_user_fini() since kevent can not be unloaded
- use KERN_INFO for statistic output
I've sent it to the linux-kernel@ and netdev@ mailing lists and asked for inclusion.
It looks like the number of comments about kevent has finally hit zero per release,
and 2.6.18 has been released, so let's see how it ends up.
/devel/kevent :: Link / Comments (0)
Mon, 18 Sep 2006
Theatre day.
I've visited an excellent comedy,
"There always is simplicity for every sage" ("Na vsyakogo mudreca dovol'no prostoti")
by Alexander Ostrovskiy, in the Russian academic youth theatre.
Although Bezrukov forgot his words a few times, it was definitely an excellent performance, since his energy
makes up for it and lets one forgive small flubs. Although it is a real classic (it was written in 1868),
morals and manners are the same even now.
Many thanks to TanyaZ; Abr, I really envy you for having such a wife :)
And only then I fair was
When wrote all that about you.
Egor Glumov (Sergey Bezrukov).
P.S. Words order is specially mangled...
/life :: Link / Comments (0)
Wedding weekend is over.
I've returned from the wedding of Alexander Silich and Juliana Sviridovskaya,
where I was a wedding witness. There were quite a lot of fun, hot and miserable moments
for me. Although I should regret some of them, I do not.
I have discussed some of them in private with the concerned parties who
wanted it.
Nevertheless, it was a really interesting and fun wedding. I'm more than sure
that the newlyweds will live very happily with love and respect, and I wish
them the same: just do not forget how you feel right now, and keep
it in mind and heart forever.
/life :: Link / Comments (0)
Fri, 15 Sep 2006
Netchannel's atcp stack has been ported to the userspace stack.
I will test it a little with a packet socket and will start to move it to the
network allocator and full
zero-copy next week.
I've also made userspace netchannels behave similarly to sockets when a connection
is being established: now a netchannel in the 'CONNECT' state will wait for some time
and receive packets instead of immediately returning in the 'SYN_SENT' state.
If after a predefined number of packets (although it should be a timeout)
the state has not changed to 'ESTABLISHED', the connection and netchannel are destroyed
and netchannel_create() returns an error.
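The connect-time behaviour described above can be sketched as a small state machine. This is only an illustration under stated assumptions, not code from the userspace stack: the function names, the packet representation (plain strings) and the budget of 8 packets are all hypothetical.

```python
from enum import Enum
from itertools import islice


class State(Enum):
    SYN_SENT = "SYN_SENT"
    ESTABLISHED = "ESTABLISHED"
    CLOSED = "CLOSED"


# Hypothetical budget; the post only says "predefined number of packets".
MAX_CONNECT_PACKETS = 8


def handle_packet(state, packet):
    """Stand-in for TCP input processing: only a SYN+ACK completes the handshake."""
    if state is State.SYN_SENT and packet == "SYN+ACK":
        return State.ESTABLISHED
    return state


def connect_with_packet_budget(packets, budget=MAX_CONNECT_PACKETS):
    """Consume at most `budget` packets while waiting for ESTABLISHED."""
    state = State.SYN_SENT
    for packet in islice(packets, budget):
        state = handle_packet(state, packet)
        if state is State.ESTABLISHED:
            return state
    # Budget exhausted: the caller would destroy the connection and report an error.
    return State.CLOSED


ok = connect_with_packet_budget(["ARP", "SYN+ACK"])
late = connect_with_packet_budget(["noise"] * 8 + ["SYN+ACK"])
print(ok.value, late.value)
```

A packet budget makes the wait depend on unrelated traffic, which is why a timeout would be the better criterion, as noted above.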
/devel/networking :: Link / Comments (0)
Wed, 13 Sep 2006
I've started to move netchannels to userspace.
The initial goal is to port the alternative kernel TCP stack used in netchannels
to the userspace network stack,
which I am currently trying to complete. The next step is to implement netchannel receiving through the
network allocator,
which will only require creating a mechanism similar to the
zero-copy sniffer,
which will notify userspace about new data that has arrived through the netchannel. Sending
support is quite a simple task: just use the same code which is used in the zero-copy
sending implementation.
/devel/networking :: Link / Comments (0)
Tue, 12 Sep 2006
New development toys have arrived!

BOSCH 2-24 DFR

BOSCH GSR 12v

BOSCH GWS 11-125 CIE V
/devel/flat :: Link / Comments (0)
Kevent. New record!
I've updated kevent to use an RB tree instead of a hash list and changed
the socket notification check slightly, and now it is possible to handle
3388 requests per second on my hardware.
epoll() on the same hardware allows about 2200, sometimes up to 2500 requests
per second. It is possible that the above limit is due to the maximum number of kevents allowed
in a time limit, which is 4096 events.
I've released the 'take18' patchset and asked for inclusion. Short changelog:
- use RB tree instead of hash table
- changed readiness check for socket notifications
/devel/kevent :: Link / Comments (0)
Mon, 11 Sep 2006
Climbing day.
It was a training on old traces: I completed several very simple traces in one go,
some old and even very old ones. An aching finger still does not allow me to
start climbing on the negative slope, so I try only vertical walls for now.
Hopefully next week I will start interesting traces on the negative slope, since there are
only one or two left on the vertical one, and with an aching finger they automagically become uninteresting...
/life :: Link / Comments (0)
Kevent and trees.
Thinking some more about trees in kevents,
I've come to the conclusion that, at least for a web server, the frequency of addition/deletion of kevents
is comparable with the number of search accesses, i.e. most of the time events
are added, accessed only a couple of times and then removed. That justifies
using an RB tree over an AVL tree, since the latter has much slower deletion,
although faster search. So for kevents I plan to use an RB tree for now,
and later, when my AVL tree implementation is ready, it will be possible
to compare them.
/devel/kevent :: Link / Comments (0)
New blog tag - my loft development.
As you probably do not know, I bought myself a tiny flat on the 17th floor
in a nice new district (about a year ago, actually),
and last week I started to live there (although I do not have official
permission and am not yet the registered owner; to get access I entered into a corrupt
relationship (again), along with other unpleasant moments of the Russian development process).
I live like a homeless person, although I have a roof, and actually
if I told you its price you would not believe me (and it is just a
panel apartment, not even in Moscow). There are no amenities at all,
not even electricity, but I started to live there to fix all those things up.
I will describe the repair process of my loft under
this tag.
I put it into the devel
category, since I think it is real development, so it deserves to live with my other
interesting projects.
The nearest thing to do is the electricity fixup (since I graduated from the
Department of Physical and Quantum Electronics at
MIPT, which is essentially the same, I think it will
not be that hard) and initial repair of the walls and ceiling,
which I plan to start this week. Since it is my first challenge in this area,
I think it will be interesting (for me at least). I ordered a lot of materials
and tools yesterday in the "Leroy Merlin"
development shop (actually it is not that good a shop, since it was quite problematic
to select plumbing fixtures, so I only got basic rough materials and tools there,
which is enough for now).
From time to time I will post photos and descriptions of the most interesting development moments.
Feel free to contact me with ideas and suggestions, since it looks like I completely lack an artist's imagination :)
/devel/flat :: Link / Comments (0)
Fri, 08 Sep 2006
Climbing.
The second training after quite a long delay was much more productive than the first one.
I completed several traces; the new one, "Mad Point" over blue holds, I failed,
but only in the second part, so progress is there. I also tried to complete
quite a complex trace over rock-crack holds, but failed even in the first half:
one of my fingers aches quite noticeably, especially on the traces where the third
phalanx is used, which disappoints me a lot.
/life :: Link / Comments (0)
Thu, 07 Sep 2006
What is NAT?
NAT, network address translation, is a mechanism which allows several
hosts to share the same "real" IP (i.e. an IP address which can be accessed from the internet
or from other such "real" addresses). NAT requires rebuilding each
packet so that its addresses and even ports are changed and the packet looks
like it originated from the machine with the "real" IP. Thus the machine which does NAT
should keep information about each set of source/destination addresses and ports
for each connection it changes (even if the connection does not have ports, like ICMP;
in that case only IP addresses are changed). Some OSes, like OpenBSD, do it
using an approach similar to bind(), i.e. for each new connection
the NAT server uses a unique port: it changes addresses and ports as if the packet originated
from its own "real" IP address and the selected port. This limits the number
of simultaneous connections to 64k, which is quite a high number, but let's try
to look into the future, where 64k connections is just 2^16 sockets;
each TCP one eats about one page, i.e. it is just 256MB of memory for a busy server
(not including the size of the data which can be placed into those 64k socket queues).
One could object that having more than 64k TCP clients is impossible, since the port number
is limited to 16 bits, but Linux does not reserve a port for each newly accepted socket,
so it is possible to have as many sockets as your memory allows.
Anyway, any limitation at the initial design (or even feature thinking) phase is a very bad sign,
which will likely bite us later, so port reservation is not a scalable way to go.
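The arithmetic and the port limit above can be checked with a short sketch. It assumes a 4096-byte page (the post only says "about one page"), and the PortNat class is a hypothetical toy, not how OpenBSD or any real NAT is implemented; the addresses come from documentation ranges.

```python
PORT_BITS = 16
PAGE_SIZE = 4096  # assumption: 4 KB pages, one page per TCP socket

max_ports = 2 ** PORT_BITS
memory_bytes = max_ports * PAGE_SIZE
memory_mib = memory_bytes // 2 ** 20
print(max_ports, "sockets need about", memory_mib, "MB")


class PortNat:
    """Toy NAT that reserves one public port per translated connection."""

    def __init__(self, public_ip, first_port, last_port):
        self.public_ip = public_ip
        self._free_ports = iter(range(first_port, last_port + 1))
        # (src_ip, src_port, dst_ip, dst_port) -> reserved public port
        self.table = {}

    def translate(self, src_ip, src_port, dst_ip, dst_port):
        key = (src_ip, src_port, dst_ip, dst_port)
        if key not in self.table:
            port = next(self._free_ports, None)
            if port is None:
                # The design limit: no more connections than available ports.
                raise RuntimeError("out of ports")
            self.table[key] = port
        return self.public_ip, self.table[key]


# A deliberately tiny port range makes the limit visible immediately.
nat = PortNat("203.0.113.1", 40000, 40001)
first = nat.translate("10.0.0.2", 1234, "198.51.100.7", 80)
second = nat.translate("10.0.0.3", 1234, "198.51.100.7", 80)
try:
    nat.translate("10.0.0.4", 1234, "198.51.100.7", 80)
    exhausted = False
except RuntimeError:
    exhausted = True
print(first, second, exhausted)
```

With the full 16-bit range the same exhaustion happens at roughly 64k connections, regardless of how much memory the box has.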
Let's see how NAT is implemented in Linux netfilter.
The base and Holy Grail for NAT and other manipulation elements in the Linux kernel is connection
tracking: a system which holds information about each connection, even if you do not
use mangling features (like NAT) but have only enabled and loaded the connection tracking modules.
The complexity of the connection tracking system cannot be described in a couple of words
(and actually I do not know it well enough to, for example, fix bugs just by looking
at a trace), but basically the connection tracking system is a set of callbacks
which parse incoming and outgoing packets according to their protocol on various
levels. Each set of callbacks is organized into special structures which are placed into
hash tables, where lookups happen each time a new packet enters the connection tracking system.
Connection tracking is placed into netfilter hooks, where NAT puts itself too.
When a packet reaches the NAT hook, the NAT core can get all relevant info through the pointer stored in the skb
(which was placed there by the connection tracking system, which parsed the packet before NAT and other
netfilter hooks)
to the connection tracking entry (where, besides other information, the new addresses and ports are placed),
and then change the appropriate packet data, recalculate the checksum and push the packet forward
to the next netfilter hook or protocol processing function.
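The "change packet data and recalculate checksum" step can be shown on a bare IPv4 header. This is a minimal userspace sketch, not netfilter code: it rewrites only the source address and refreshes only the IP header checksum (the RFC 1071 one's-complement sum); a real NAT must also fix the TCP/UDP checksum, which covers a pseudo-header with the addresses. The helper names and addresses are made up for the example.

```python
import socket
import struct


def ipv4_checksum(header: bytes) -> int:
    """RFC 1071 checksum: one's-complement of the one's-complement 16-bit sum."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        # Fold carries back into the low 16 bits.
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def rewrite_source(header: bytes, new_src: str) -> bytes:
    """Replace the source address (bytes 12..15) and refresh the checksum (bytes 10..11)."""
    hdr = bytearray(header)
    hdr[12:16] = socket.inet_aton(new_src)
    hdr[10:12] = b"\x00\x00"  # checksum field must be zero while summing
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return bytes(hdr)


def build_header(src: str, dst: str) -> bytes:
    """Minimal 20-byte IPv4 header for a TCP packet, checksum filled in."""
    raw = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 40,      # version/IHL, TOS, total length
        0, 0,             # identification, flags/fragment offset
        64, 6, 0,         # TTL, protocol (TCP), checksum placeholder
        socket.inet_aton(src), socket.inet_aton(dst),
    )
    return rewrite_source(raw, src)


original = build_header("10.0.0.2", "198.51.100.7")
translated = rewrite_source(original, "203.0.113.1")
# Summing a header that already contains a correct checksum yields 0.
print(ipv4_checksum(original), ipv4_checksum(translated))
```

Production code usually updates the checksum incrementally from the changed words instead of re-summing the whole header; the full recomputation is used here only because it is the easiest form to verify.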
Everything looks simple until we start to look at how it is designed. NAT itself
has several abstraction layers; some of them are accessed through netfilter hooks, some of them
as NAT table helpers. Connection tracking itself has additional tons of layers.
All entries are placed into lists inside hash tables which cannot grow or shrink and which
are protected by global locks. The number of netfilter *tables and *conntrack implementations
pushes me into depression. It is a known issue that netfilter itself is not that fast,
and actually it could be designed in a slightly different manner, and that connection tracking
slows things down very noticeably.
So, looking at all the above issues, I've come to the strange idea that fixing connection
tracking is just an impossible task, and NAT, which is heavily connected to it, is not worth changing.
Instead one could create a new implementation (and reinvent the wheel) of NAT (and even related
helpers like FTP and others), which would not require the existing connection tracking system at all.
It is possible to put it into the netfilter framework though, so people would not be so scared
of such an intrusion.
So how would I design NAT? First of all: no lists and hash tables. Practice shows us
every day that a good hash table cannot be created in advance without major knowledge of
system behaviour, and eventually it will be either too small, in which case we need
to grow it and copy data between the old and new tables, or to increase the number of elements
in the list in each hash entry; or it will be too big, so a lot of space will be just wasted
for nothing.
That problem is easily solvable by using trees, with their nodes as the elements which
hold all information about packet manipulation.
Each tree can have some flags to show what part of the dataflow it wants to see.
For example, not every netfilter rule (actually only a minority of them)
wants to see every packet, but only the initial and final sets of them (NAT obviously requires all packets),
so the same entry can simultaneously live in several trees (for example, a NAT entry
should live in the "connect", "established", "final" and so on trees (or whatever they will be called)).
But it introduces the problem of correctly selecting the type of dataflow per protocol:
for example, in TCP the "connect" phase can be described by the SYN bit being set, but what about something like
SCTP? Or UDP, which does not have any phases at all? It can be solved by a per-protocol
hook, which will return the index of the tree in which to search for the connection entry;
in that case UDP will always return the same tree no matter how it is called.
Each connection entry must have a processing function of its own and some private area to store helper information,
and that's all for now.
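The per-protocol hook idea above can be sketched as follows. Everything here is hypothetical naming for illustration (FlowTables, Entry, the phase names as dictionary keys), and a plain dict stands in for each balanced tree, since only the selection logic is being shown.

```python
PHASES = ("connect", "established", "final")


def tcp_phase(packet):
    """TCP hook: pick the tree from the flags carried by the packet."""
    flags = packet.get("flags", ())
    if "FIN" in flags or "RST" in flags:
        return "final"
    if "SYN" in flags:
        return "connect"
    return "established"


def udp_phase(packet):
    """UDP has no phases, so the hook always returns the same tree."""
    return "established"


# Per-protocol hooks returning the index (here: name) of the tree to search.
PHASE_HOOKS = {"tcp": tcp_phase, "udp": udp_phase}


class Entry:
    """Connection entry: a processing function plus a private area."""

    def __init__(self, key, process, private=None):
        self.key = key
        self.process = process
        self.private = private if private is not None else {}


class FlowTables:
    def __init__(self):
        # One container per phase; a dict stands in for a balanced tree.
        self.trees = {phase: {} for phase in PHASES}

    def add(self, entry, phases=PHASES):
        # The same entry may live in several trees at once.
        for phase in phases:
            self.trees[phase][entry.key] = entry

    def lookup(self, packet):
        phase = PHASE_HOOKS[packet["proto"]](packet)
        return self.trees[phase].get(packet["key"])


tables = FlowTables()
nat = Entry(("10.0.0.2", 1234), process=lambda pkt: "nat")
log_syn = Entry(("10.0.0.3", 80), process=lambda pkt: "log")
tables.add(nat)                           # NAT wants to see every packet
tables.add(log_syn, phases=("connect",))  # this rule only cares about new connections

syn = {"proto": "tcp", "key": ("10.0.0.3", 80), "flags": ("SYN",)}
data = {"proto": "tcp", "key": ("10.0.0.3", 80), "flags": ("ACK",)}
udp = {"proto": "udp", "key": ("10.0.0.2", 1234)}
print(tables.lookup(syn) is log_syn, tables.lookup(data), tables.lookup(udp) is nat)
```

Established-phase packets never touch the entries of rules that only registered for "connect", which is the point of splitting the trees by dataflow phase.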
When an administrator is going to set up NAT rules, he/she will frequently use wildcards, i.e.
"translate all packets from interface eth0 (or from a given subnet) to look like they have
a source/destination/something_else which belongs to the provided interface (the old MASQUERADE iptables target)
or is equal to the provided number (like it is done for SNAT/DNAT)". In that case each packet
must be checked against all such rules to find the matching one, to which our tree of possible manipulations
will be attached. If there is a matching manipulation rule, but there is no connection entry in the appropriate tree,
it should be created.
The tree implementation must be selected with performance in mind; the most obvious candidates are
the RB tree, AVL tree and splay tree. The latter is actually not that good, since it requires
rebalancing after each access, which is not the best approach for a server which does not know
in advance which dataflows will be accessed more frequently. The selection between AVL and RB trees,
in my opinion, should be made in favour of the AVL tree, since searching in that
tree (i.e. per-packet overhead) never exceeds about 1.44*log2(N) steps, where N is the number of elements, while with
an RB tree it can reach 2*log2(N). But deletion in an AVL tree can take up to O(log(N))
rotations, while in RB trees it never exceeds 3 rotations, so it is possible to create some kind of fake
deletion for the AVL tree, where a node is just marked as deleted, but the deletion itself can be either postponed
to a better context, or not performed at all (if the system has a lot of connections opened,
that empty slot will be reused eventually, and if it does not have a lot of connections,
the price of postponed deletion is not that high).
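To put numbers on the search bounds above, here is a small sketch that evaluates the classic worst-case height limits: an AVL tree with N nodes is lower than 1.4405*log2(N+2) - 0.3277, and a red-black tree is at most 2*log2(N+1) high. The function names are made up for the example, and these are worst cases only, not measurements of any implementation.

```python
import math


def avl_max_height(n: int) -> float:
    """Upper bound on AVL tree height for n nodes."""
    return 1.4405 * math.log2(n + 2) - 0.3277


def rb_max_height(n: int) -> float:
    """Upper bound on red-black tree height for n nodes."""
    return 2 * math.log2(n + 1)


# Compare the worst-case number of levels a lookup may have to descend.
for n in (2 ** 10, 2 ** 16, 2 ** 20):
    print(f"N={n:>8}  AVL < {avl_max_height(n):5.1f}  RB <= {rb_max_height(n):5.1f}")
```

For 2^16 connections the worst-case lookup path differs by roughly nine levels, which is the per-packet cost that the argument for AVL trees rests on.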
An appropriate tree implementation can be used by kevents
to store kevents there instead of the (I must admit it now) badly prepared hash table.
/devel/networking :: Link / Comments (0)
Kevent 'take17' patchset has been released.
It contains trivial cleanups mentioned before. Short changelog:
- misc cleanups (__read_mostly, const ...)
- created special macro which is used for mmap size (number of pages) calculation
- export kevent_socket_notify(), since it is used in network protocols which can be
built as modules (IPv6 for example)
/devel/kevent :: Link / Comments (0)
Wed, 06 Sep 2006
Finally, a climbing day.
I have not visited the climbing zone for several weeks,
so today's training was really simple. I tried
two traces - the old interesting "Do not want" over black
holds and a new one, "Mad Point", on which I fell several times,
so it was not really completed. This new trace
takes second place in my own ranking of the complex traces I have climbed,
and since it is on the vertical wall I will definitely
complete it after several attempts.
/life :: Link / Comments (0)
Network allocator is now configurable.
I.e. one can select in the kernel config whether to compile it or not;
the same applies to the zero-copy sniffer,
which allows removing the sniffer's overhead and fixes some of the other
temporary limitations mentioned earlier.
/devel/networking/nta :: Link / Comments (0)
Kevent 'take16' patchset is released and subsystem is frozen for some time.
Since the number of comments has dropped almost to zero, I am freezing kevent
development for some time (since resending practically the same patches into /dev/null
is not that interesting a task) and switching to the implementation of a special tree,
which will probably be used with kevents instead of the hash table, and for other
projects.
Short changelog for the just released 'take16' patchset:
- converted kevent_timer to high-resolution timers, this forces timer API
documentation update
- use struct ukevent * instead of void * in syscalls (documentation has been updated)
- added warning in kevent_add_ukevent() if ring has broken index (for testing)
/devel/kevent :: Link / Comments (0)
Mon, 04 Sep 2006
Kevent 'take15' has been released.
Short changelog:
- added kevent_wait(). This syscall waits until either the timeout expires or at least one event
becomes ready. It also commits that @num events starting from @start have been processed
by userspace and thus can be removed or rearmed (depending on their flags).
It can be used to commit events read by userspace through the mmap interface.
Example userspace code (evtest.c) can be found on the project's homepage.
- added socket notifications (send/recv/accept). With a trivial web server based on kevent and these features
instead of epoll, its performance increased more than noticeably.
More details about the benchmark and the server itself (evserver_kevent.c)
can be found on the project's homepage.
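The commit semantics described for kevent_wait() can be illustrated with a generic ring buffer in userspace C. This is a sketch of the idea only, not the kevent ABI: the structure, its fields, RING_SIZE and both functions are hypothetical, and the assumption that a commit must start exactly at the oldest unprocessed slot is mine.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 8u    /* hypothetical number of slots */

struct event_ring {
	unsigned int head;      /* next slot the producer (kernel side) fills */
	unsigned int tail;      /* oldest slot not yet committed by the consumer */
	unsigned int used;      /* ready events that are not committed yet */
	int events[RING_SIZE];
};

/* Producer side: publish one ready event, fail if the ring is full. */
static bool ring_produce(struct event_ring *r, int ev)
{
	assert(r->used <= RING_SIZE);
	if (r->used == RING_SIZE)
		return false;
	r->events[r->head] = ev;
	r->head = (r->head + 1) % RING_SIZE;
	r->used++;
	return true;
}

/* Consumer side: declare that @num events starting from @start were
 * processed, so their slots can be handed back to the producer. */
static bool ring_commit(struct event_ring *r, unsigned int start, unsigned int num)
{
	if (start != r->tail || num > r->used)
		return false;       /* stale or oversized commit */
	r->tail = (r->tail + num) % RING_SIZE;
	r->used -= num;
	return true;
}
```

The consumer reads events directly from the shared array and only tells the other side how far it has got, so no event data is copied on commit.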
Split patches are available in the archive.
I've also updated the documentation here.
/devel/kevent :: Link / Comments (0)
Sat, 02 Sep 2006
Zero-copy sniffer.
I was wrong
about the magical ab symbols found in the sniffer dump - it is ordinary data being sent.
Since the sending side reserves MAX_TCP_HEADER bytes,
in my setup the ethernet header of a sent packet starts at an offset of 190 bytes, and that of a received packet at an
offset of 16 bytes, from the beginning of the allocated buffer.
Sending sequence number graph for tcpdump and zero-copy sniffers.

As you can see, there are no gaps in the graphs (although it is just an scp transfer
over a 100Mb NIC (3c59x)). CPU usage for the zero-copy sniffer when data is being written into /dev/null
is about two times lower than when it is written using tcpdump.
I've released a new version of the network allocator
and the zero-copy sniffer and sent it to netdev@
with a question about the possibility of inclusion into mainline.
The zero-copy sniffer has the following overheads:
- several atomic operations (in the worst case one atomic_set(), one atomic_inc()
and one or two atomic_dec_and_test())
- one lock (a bad global lock per sniffer device), which is held while
information about a new packet is put into the sniffer's queue when the skb is freed
- delayed freeing, which can lead to increased memory usage,
or (as implemented) if a maximum amount of data "locked" by the sniffer is introduced,
some packets can be dropped by the sniffer.
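The trade-off between delayed freeing and dropping can be sketched in userspace C with C11 atomics. This is not the sniffer's code: all names here are hypothetical, the "freed" flag stands in for returning memory to the allocator, and the accounting in struct sniffer is assumed to be protected by the per-device lock mentioned above (the sketch itself is single-threaded).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct pkt_buf {
	atomic_int refcnt;
	size_t size;
	bool freed;             /* stands in for giving memory back to the allocator */
};

struct sniffer {
	size_t locked;          /* bytes currently pinned by the sniffer */
	size_t max_locked;      /* budget for pinned data */
	unsigned long dropped;  /* packets the sniffer had to skip */
};

static void pkt_init(struct pkt_buf *p, size_t size)
{
	atomic_init(&p->refcnt, 1);     /* the network stack's own reference */
	p->size = size;
	p->freed = false;
}

/* Drop one reference; only the last owner really frees the buffer. */
static void pkt_put(struct pkt_buf *p)
{
	int old = atomic_fetch_sub(&p->refcnt, 1);

	assert(old > 0);
	if (old == 1)
		p->freed = true;
}

/* Sniffer pins the buffer unless that would exceed its budget,
 * in which case the packet is dropped from the sniffer's view only. */
static bool sniffer_hold(struct sniffer *s, struct pkt_buf *p)
{
	if (s->locked + p->size > s->max_locked) {
		s->dropped++;
		return false;
	}
	atomic_fetch_add(&p->refcnt, 1);
	s->locked += p->size;
	return true;
}

/* Called once userspace has consumed the packet. */
static void sniffer_release(struct sniffer *s, struct pkt_buf *p)
{
	s->locked -= p->size;
	pkt_put(p);
}
```

With a budget of 3000 bytes and packets of 1828 bytes, the second packet that arrives before the first one is released is counted as dropped by the sniffer, while the stack itself still processes it normally.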
Limitations of the current version (introduced not due to design problems,
but intentionally, to test various special usage cases):
- NTA is used only for netdev_alloc_skb() and sk_stream_alloc_pskb(), i.e.
only for allocations of traffic received by the NIC and sent through the send() syscall
over a stream socket.
- the zero-copy sniffer is always compiled in, which increases memory usage
and adds the overhead described above.
- skb_copy() always allocates data from the SLAB allocator, although it
could check whether the original skb's data was allocated through NTA, but I
think that skb_copy() is completely incompatible with high
performance.
- it is possible to eliminate several atomic operations (I'm lazy).
- debug code (poisoning of the tail of the buffer and an additional
reference counter) is always compiled in.
/devel/networking/zcs :: Link / Comments (0)
Fri, 01 Sep 2006
Zero-copy sniffer.
I've fixed the mapping bug and forced the network stack to use
the network allocator
only for packets which are created either by a network device (receiving) or through
the send() syscall over a stream socket, so the current version does not catch netlink messages,
unix sockets and so on. Here is a typical zero-copy sniffer log:
dump 447.1024: ptr: 0xc19b0f80, start: 0xc19b0000, size: 1956, off: 200576: entry: 0, cpu: 0:
ab:ab:ab:ab:ab:ab -> ab:ab:ab:ab:ab:ab, type: abab,
dump 448.1024: ptr: 0xc19fa880, start: 0xc19f8000, size: 1828, off: 501888: entry: 0, cpu: 0:
00:11:09:61:eb:0e -> 00:10:22:fd:c4:d6, type: 0800, 192.168.0.48:57758 -> 192.168.4.78:5632, proto: 6,
dump 449.1024: ptr: 0xc1a01080, start: 0xc1a00000, size: 1828, off: 528512: entry: 0, cpu: 0:
00:11:09:61:eb:0e -> 00:10:22:fd:c4:d6, type: 0800, 192.168.0.48:57758 -> 192.168.4.78:5632, proto: 6,
dump 450.1024: ptr: 0xc19f4800, start: 0xc19f4000, size: 1828, off: 477184: entry: 0, cpu: 0:
00:11:09:61:eb:0e -> 00:10:22:fd:c4:d6, type: 0800, 192.168.0.48:57758 -> 192.168.4.78:5632, proto: 6,
dump 451.1024: ptr: 0xc1a01f80, start: 0xc1a00000, size: 1828, off: 532352: entry: 0, cpu: 0:
00:11:09:61:eb:0e -> 00:10:22:fd:c4:d6, type: 0800, 192.168.0.48:57758 -> 192.168.4.78:5632, proto: 6,
dump 318.1024: ptr: 0xc1b80780, start: 0xc1b80000, size: 1828, off: 1920: entry: 0, cpu: 1:
02:30:9b:0c:89:e8 -> ff:ff:ff:ff:ff:ff, type: 0800, 192.168.4.9:43281 -> 255.255.255.255:43281, proto: 17,
dump 330.1024: ptr: 0xc1b86580, start: 0xc1b84000, size: 1828, off: 25984: entry: 0, cpu: 1:
02:00:63:1f:2d:81 -> 01:00:5e:00:01:14, type: 0800, 192.168.5.231:43281 -> 224.0.1.20:43281, proto: 17,
dump 331.1024: ptr: 0xc1b86d00, start: 0xc1b84000, size: 1828, off: 27904: entry: 0, cpu: 1:
02:3a:d1:7e:6e:65 -> 01:00:5e:00:01:14, type: 0800, 192.168.5.232:43281 -> 224.0.1.20:43281, proto: 17,
Look at the strange line with ab symbols instead of the ethernet fields - this is
an skb which was freed in tcp_clean_rtx_queue() when an ACK was received.
The network allocator fills the allocated area with ab bytes for debugging purposes,
and it looks like the TCP state machine preallocates some packets and then frees them without
actually using them. The number of such empty allocations is actually not that small.
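How a dump line ends up showing ab in every ethernet field can be reproduced with a small userspace C sketch. The names, the 14-byte header length constant and the helper functions are assumptions for illustration, not the allocator's code. Note that the check depends entirely on the header offset the caller passes in: looking at the wrong offset of a buffer that is in use also shows nothing but poison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define POISON_BYTE 0xab
#define ETH_HDR_LEN 14  /* dst MAC (6) + src MAC (6) + ethertype (2) */

/* Allocate a buffer and fill it with the debug pattern. */
static unsigned char *debug_alloc(size_t size)
{
	unsigned char *p = malloc(size);

	if (p)
		memset(p, POISON_BYTE, size);
	return p;
}

/* True if the ethernet header expected at @off still holds only poison,
 * i.e. nothing was ever written there. */
static bool header_is_poison(const unsigned char *buf, size_t len, size_t off)
{
	assert(buf != NULL);
	if (off + ETH_HDR_LEN > len)
		return false;   /* header would not fit, cannot tell */
	for (size_t i = 0; i < ETH_HDR_LEN; i++)
		if (buf[off + i] != POISON_BYTE)
			return false;
	return true;
}
```

For example, if a header is written at offset 190 but the dump code looks at offset 16, header_is_poison() reports true for offset 16 and false for offset 190.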
I plan to run an interesting benchmark tomorrow - test machine will generate traffic using different packet sizes
and sniffer will log TCP sequence numbers on that sending machine, then I will plot a graph of sent and
missed packets for zero-copy sniffer and tcpdump.
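Finding missed data in such a log of TCP sequence numbers could look roughly like the C sketch below. It is an assumption about how the log might be post-processed, not the benchmark's code: struct seg and missed_bytes() are hypothetical names, the log is assumed to hold the sequence number and payload length of each captured segment in capture order, and retransmitted or reordered segments that fall behind the expected sequence number are simply ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One logged segment: starting sequence number and payload length. */
struct seg {
	uint32_t seq;
	uint32_t len;
};

/* Sum the bytes that were never seen between consecutive segments.
 * Differences are taken modulo 2^32 and interpreted as signed values,
 * so sequence number wraparound is handled. */
static uint64_t missed_bytes(const struct seg *log, size_t n)
{
	uint64_t missed = 0;
	uint32_t next;

	assert(log != NULL || n == 0);
	if (n == 0)
		return 0;

	next = log[0].seq + log[0].len;     /* first byte we expect to see next */
	for (size_t i = 1; i < n; i++) {
		int32_t gap = (int32_t)(log[i].seq - next);
		uint32_t end = log[i].seq + log[i].len;

		if (gap > 0)
			missed += (uint32_t)gap;    /* hole in the captured stream */
		if ((int32_t)(end - next) > 0)
			next = end;                 /* advance only on new data */
	}
	return missed;
}
```

Running it over the logs of both sniffers for the same transfer gives one number per sniffer, which can then be plotted against the packet size.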
/devel/networking/zcs :: Link / Comments (0)
I've added photos from the Scandinavian trip to the gallery.
Have a nice time!
/life :: Link / Comments (0)