Zbr's days.
November
Sun Mon Tue Wed Thu Fri Sat
     
   
2006
Months
Nov

About :: TODO :: Blog :: RSS :: Old blog :: Projects :: GIT :: Gallery :: Notes

Thu, 30 Nov 2006

New kevent 'take26'release.


Short changelog:

  • use struct timespec as timeout parameter.
  • added high-resolution timer to handle absolute timeouts.
  • added flags to waiting and initialization syscalls.
  • kevent_commit() has new_uidx parameter.
  • kevent_wait() has old_uidx parameter, which, if not equal to u->uidx, results in immediate wakeup (usefull for the case when entries are added asynchronously from kernel (not supported for now)).
  • added interface to mark any event as ready.
  • kevent POSIX timers support.
  • return -ENOSYS if there is no registered event type.
  • provided file descriptor must be checked for fifo type (spotted by Eric Dumazet).
  • signal notifications.
  • documentation update.
  • lighttpd patch updated.
Appropriate applications (ring_buffer.c, signal.c, posix_timer.c) and lighttpd patch can be found in archive.

/devel/kevent :: Link / Comments (0)


Hoax or not? New kevent benchmark.


I've completed a bunch of kevent updates including major kevent interface changes and will release new version soon. I've also created some locking changes, which I call speculative locking (i.e. some fields (ret_flags mostly) can be updated without locks, but it should not harm the system), which can be wrong, but results are impressive.

Hardware setup.
Client: Intel Core Duo 3.4 Ghz (run with 3.7 Ghz each core), 2 GB of RAM, sky2 driven gigabit NIC (Marvell 88E8053).
Server: AMD Athlon 64 3500+ 2.2 Ghz, 1 GB of RAM, r8169 gigabit NIC.

Software.
Both client and server systems have this network timewait socket setup:

# echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
# echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
# ulimit -n100000
it is the only changes from default startup.

Client runs Apache benchmark utility, which was started with following command line:
  ab -n80000 -c8000 http://192.168.0.48/
i.e. 80k requests total with 8k requests started concurrently.

Results.
kevent about 7200-7400 requests per second
epoll about 4300-5300 requests per second
With 10k concurrence performance drops to
kevent about 4000 requests per second
epoll about 3800 requests per second

it is possible that I made some errors in that benchmarks though, so I will not publish results to linux-kernel@ or make awailable from kevent homepage.

Update: even without speculative lock changes I got the same outperforming results. Kernel is compiled for SMP setup, all debugging is turned off (in my previous run on that platform I had some debugging options turned on, and performance was about the same and even worse for kevent), preemption model is 'Low-Latency Desktop'.

Update 2: forgot to say, that web server is lighttpd-1.4.13.

/devel/kevent :: Link / Comments (0)


Grange's song on OpenBSD hardware mini-hachathon h2k6.




Here is mp3 version (6.1 Mb, renamed to moscow.mp for those who has corporate filters for mp3 files) of the song from Mongol Shuudan band. This song and some others are available from band's homepage - mostly punk style.

/devel/other :: Link / Comments (0)


My J. Michael TR-300S trumpet has arrived.


I have not tried to extract sounds from it yet - I'm in office and I want to get home without major damages, which are unavoidable if I would try to 'play' right now - there are several people here, and they even try to work, so my first lesson is postponed to the evening.

I will start to rape and crack neighbours ears later today at home - likely no one will suffer too much, building is quite new, so there are only about 20 'alive' flats in the whole building (about 350 apartments).
I think my sounds will be in a harmony with sounds extracted from concrete breaking perforators and corner-grinding machines.

/life :: Link / Comments (0)


Wed, 29 Nov 2006

Perfect. Just perfect.


I'm talking about my climbing training, if you have not understood.
Actually there were nothing special from climbing point of view - the same traverses and bouldering, some progress here and there, bits of mixed traces and so on. But the whole training was extremely good - when interesting people smile to you and you smile to them, when you found new moving on the old trace, when body aching after the holds, when ... - all that small bits create excellent mood and I become to think that I just spent perfect time, and actually most of the time (I would even say overwhelming part of the time) I spent was really good.

/life :: Link / Comments (0)


OpenBSD hardware minihackathon is over.


Grange has really burnt them brains - in the brazilian pub he took away a guitar from singer and started to sing russian song "Moscow" with Sergey Esenin words.
I would say that Grange signs not that bad, and especially when he is drunk, and song itself is very good too, but the whole situation made it even better, so he was called 'the most talented man at h2k6'. It is quite possible. One can even find his records on the homepage (in russian).
So, it was very interesting meeting, as we can see.
Among good mood he also got bunch of various hardware, so I hope he will finally start hacking interesting stuff instead of slacking his ... fifth point.

/devel/other :: Link / Comments (0)


Multidimensional trie implementation.


Although long-running test was completed without errors (actually it is not interesting test at all, since 10 rules is a too small set), after some thoughts I've found a bug in the algorithm - I tried to avoid recursive searching in case of wildcard nodes found traversing trie, but it seems that it is unavoidable.
I also added trivial memory leak detector, which found another bug in algorithm - there is possibility to not free next-dimensional roots in some cases.
So I started to change algorithm itself, and although its theoretical ground is fully completed, practical implementation has a place where bug lives - I know exactly what is the bug and where is that buggy place in the code, what are the reasons, actually I know everything to fix it, but do not know how to make a good fix. Yet.

/devel/networking :: Link / Comments (0)


Tue, 28 Nov 2006

Multidimensional trie implementation has been completed.


It is a first and the most significant step in netchannels wildcard implementation. I tested it with different number of dimensions (1-5) and different number of added/searched/removed rules (upto several millions in each dimension).
Algorithm is simple and fast - my previous estimation was incorrect - currently 1 million 5-dimensional insertions take about 6 seconds, 1 million 1-dimension insertions take about 1.1 seconds on my test machine.
When I add search/remove operation (one-by-one, i.e. add one, search for it and remove it), the whole tuple takes about 15 and 3.2 seconds for 5-dimensional and 1-dimensional case accordingly.

I will (probably) create some graphs describing algorithm and how it will be used with netchannels tomorrow, but if algorithm itself is implemented correctly, it is the most significant part of wildcards support fot netchannels.

I have started following testing case for 5-dimensional case for the night: 10 rules are added, then 10 various values (actually inserted values, but with 0xffffffff mask) are searched for, and then all they are removed, let's see how this will end up tomorrow.

/devel/networking :: Link / Comments (0)


New musical toy.


JMichael trumpet

P.S. Do not scream and laugh, but I can not play trumpet - will start to learn.

/life :: Link / Comments (0)


That is what I call good hacking.


Good hacking evening

Other photos from h2k6 hackathon.

/devel/other :: Link / Comments (0)


Mon, 27 Nov 2006

Climbing evening.


After quite empty day (I did not hack, did not work, did not do anything usefull) I moved to climbing zone and spent there excellent time - there were several old traverses completed, couple of new boulderings. The most interesting part was relief traverse - each day I add more and more holds there, although it is marked not as the most complex trace, I only saw one person who tried to work with it (except me), others somehow do not like that trace. unfortunately in Skala-city there are no relief vertiacl traces, which I'm quite sure would be my favourite ones.

/life :: Link / Comments (0)


New Year is coming.


I've started to think about presents.
First of all, present for myself (well, second one actually) - I think I want to start to make photos, so I decided to get Cannon EOS 400D. Not sure if I will really buy it, but at least I think about it.
The best present for myself would be flat repair complete, but we will see...

So, to touch the perfect, I've subscribed to couple of interesting livejournals of photo masters.
There are indeed several types of photos I like - first of all, it is night photos, then probably building/stuff photos (especially from higher floors or roofs) and panorams.
I have extremely small experience in that area, but will try to learn. Maybe.

/life :: Link / Comments (0)


OpenBSD h2k6 hardware mini-hackathon.


Day first (in russian).
Grange did not forget bottle of "Russian Size", but it was seized in Zurich.

As became known from authoritative sources, Grange tries to add support for IBM ServeRAID to OpenBSD.
It looks roughly like this:
Some hardware

Small photo archive of the meeting.

This algorithm describes what is hackathon and what they are doing there:
OpenBSD hackathon

For comparison: Linux network developers conference NetConf is completely different meeting - hackers sing karaoke songs and seems to not drink at all, I never participated there though so can not say for sure. On picture core linux network hackers (David Miller, Rusty Russell and not-yet-a-netdev-hacker Soyoung Park) select karaoke song.

Linux network hackers select karaoke song

/devel/other :: Link / Comments (0)


Sat, 25 Nov 2006

Initial trie implementations for netchannels wildcard support.


I completed 1-dimensional trie implementation and started its wild testing - random prefix and address mask is created in infinite loop and then random address is being searched there - it has passed more than 123 millions insertion/searches and ate about 1.8Gb of ram - each trie node is 20 bytes in size, which would take 2.29 Gb of RAM if they lived as-is, but due to the nature of tries, which allows to reuse the same entries for several prefixes, memory usage is smaller. 100k insert+search operations took always less than 2.5 seconds on my test machine.

Next step, which I will start after weekend, is multi-dimensional trie implementation, which is actually netchannel wildcard support.

/devel/networking :: Link / Comments (0)


Grange has flown to OpenBSD hardware mini-hackathon.


This small party, also known as 'I-can-drink-more-than-you-can-hack' (there is no real evidence that they spend there time with profit for the world or at least for own health), will be spent in University of Coimbra in Portugal.
Grange will hack there probably Dallas one-wire bus, maybe something else. Actually I do not know, why he is going there except that to become drunk with people which live about tens of thousands kilometers away.
I hope he did not forget to get one bottle of 'Russian Size'.
So, let's wish them nice time there.

/devel/other :: Link / Comments (0)


Fri, 24 Nov 2006

The main development words.


Rusty Russell's "Hanging Out With Smart People".
Hanging Out With Smart People

/devel :: Link / Comments (0)


New books.


I bought yesterday evening two books about history of two interesting states:

  • "History of Russia. From the ancient times to 21 century." A. N. Saharov (Russain Academy of Sciences).
  • "All under heaven. A complete history of China." Rayne Kruger.
Work and other is postponed - I'm going another world.

/life :: Link / Comments (0)


Kevent last words.


Ok, after I was requested to change interfaces again to the state they were in the previous release, I've thrown the last words - I'm waiting for about one week and listen for other developers to comment on the interfaces, if no one will comment, I will gather all the latest suggestions, implement them and make the new release - no interface changes after that day.
And I do not care if there will be some problems and/or some people do not have time to review all my releases or something else - I have time to implement them on per-week basis (I could create new release every day actually), so there is plenty of time to review for those who cares.
If you have something to say about kevent, your opinion on its inclusion/declining or on my behaviour and so on, it is the right time to say.

Thank you.
I'm going to implement netchannels wildcards - it is much more interesting than spending hours on completely void tasks.

/devel/kevent :: Link / Comments (0)


New kevent features scheduled for the next release.


  • struct timespec and possibility to have absolute timeouts in syscalls instead of simple number of nanoseconds.
  • signal mask parameter for syscalls to allow temporal blocking of the signals while syscall is executed (almost, but I'm still trying to show that it is stupid idea with kevents).
  • kevent posix timers support.
  • allow kevent to mark any events as ready from userspace.
  • allow to break any waiting syscalls.
  • various errno return values update.
  • various interface changes, like addition of some new fields, and removal of otheres, which will not affect functionality behind that syscalls.
  • remove some locks in the hottest path - it is indeed needed task. I plan to refactor searching/modifying code so that there were places in the struct ukevent which are never modified by another than caller users (like simultaneous changes in readiness state machine and searching part), so there would be possible to only have one lock in the readiness state machine around ready queue.
Call me a looser, but although I'm strongly against first two entries, I likely will add them to make at least some progress on kevents.

Crap, I spent 2.5 hours sending e-mails into empty discussion. I wish my day would have not 36, but 48 hours...

/devel/kevent :: Link / Comments (0)


Thu, 23 Nov 2006

Kevent posix timers support.


Early morning hack has been sent to mail lists for review.
I have not tested it (except compilation test), and frankly do not want, but if my flow of thoughts is correct, I will create simple application, which will use timer_create() and friends to test kevent posix timers support.
Implementation (and design of POSIX timers API) does not allow to have KEVENT and SIGNAL or THREAD delivery mechanism simultaneously, at least I concluded it after quick look into the code, which is probably is not correct.
Let's see if my patch is good or not.

/devel/kevent :: Link / Comments (0)


Wed, 22 Nov 2006

Climbing evening.


That was quite easy training - I ran several old traverses and couple of bouldering, also created one traverse, which previously contained green holds, but most of them were removed or replaced with time, so it was a new trace, that's all - nothing special this day.
The most pleasant part was sauna and shower which definitely were required by development related dirtiness.
Now I feel mysef just bloody excellent.

/life :: Link / Comments (0)


New features scheduled (implemented) for the next kevent release.


Most of them are implemented in the local tree, I'm just waiting for ack from Ulrich Drepper:

  • new command, which allows to mark any events as ready. It is also possible to just wake up listener of the given kevent queue if number of events requested to be marked as ready is zero. It allows glibc to notify other event processing threads, which are parked in the waiting syscalls like kevent_wait() or kevent_get_events(), that they can process new events in cae of originally awakened thread was terminated.
  • fixed unsufficient parameters check in kevent pipe notifications (spotted by Eric Dumazet).
  • return -ENOSYS if there is no registered event type instead of -EINVAL.
  • add 'flags' parameters to kevent_init() syscall (not implemented yet).

Actually looking at how kevent makes fast progress and how slowly I get positive feedback and how complex are negotiations about features and requirement addons are performed with some people, I seriously doubt there will be enough resources to really complete the task in the way I want and think is the best for future development.
I have several possible ways to solve existing (stagnating) problem with APIs and feature addons:
  • just implement all horrible workarounds and ugliness I'm asked about, which I think will allow quite fast kevent integration into mainline.
  • continuously perform seems to be endless negotiations with mutual insults in linux-kernel@. This will not force kevent inclusion, and likely will not end up with it at all, but there is probability (I think it is somewhere between zero and void), that I can convince people that I am right. Practice shows that it is hardly to convince people, which do not hear what you talk to them, no matter correct it is or not.
  • say that 'I do not care and will not even discuss all requested utterly damaged ideas' or something like that and stop discussion. It can or can not help with kevent integration, but in any case I at least will have enough time for more interesting tasks than empty talks in linux-kernel@. For example netchannels wildcard support and fast NAT implementation, zero-copy support for netchannels, asynchronous IO, threading library, log-structured filesystem
The more I think, the more I like third variant...

/devel/kevent :: Link / Comments (0)


First part (about one fourth) of the hinged ceiling is completed.


Something like that.
I screwed it yesterday evening, and can say that although making hinged ceiling alone is not that simple task, it is quite doable.
Next task is to perform second layer, it will be lower and take more square, but it is task for another Sunday.

Hinged ceiling setup

Will create better photos when it is ready.

/devel/flat :: Link / Comments (0)


Tue, 21 Nov 2006

New kevent 'take25' release.


Short changelog:

  • new (old (new)) ring buffer implementation with kernel and user indexes.
  • added initialization syscall instead of opening /dev/kevent.
  • kevent_commit() syscall to commit ring buffer entries.
  • changed KEVENT_REQ_WAKEUP_ONE flag to KEVENT_REQ_WAKEUP_ALL, kevent wakes only first thread always if that flag is not set.
  • KEVENT_REQ_ALWAYS_QUEUE flag. If set, kevent will be queued into ready queue instead of copying back to userspace when kevent is ready immediately when it is added.
  • lighttpd patch (Hail! Although nothing really outstanding compared to epoll).

  • And special blog edition changelog entry (interesting, do people read my patches to detect this lyrics? Let's see...):
  • small three-line verse (I think I created not bad step for non-english speaking person) about one-line optimisation. Casted by Shakespeare.
I installed slightly used, but still functional (bought on ebay) remote mind reader, and set it up to read Ulrich's alpha brain waves (I hope he agrees that it is a good decision), which took me the whole week. So I think the last ring buffer implementation is what we all wanted. Details in documentation part.

/devel/kevent :: Link / Comments (0)


Hinged ceiling installation.


That is how I spent Sunday.
Weight of that part is about 20 kg, it was challenging to lift and fasten the detail and it is still in a hung state, since I do not have long enough screws. But I will complete it soon, maybe even today.

Hinged ceiling setup

/devel/flat :: Link / Comments (0)


Mon, 20 Nov 2006

New kevent release.


It will be 'take25', but I have not released it yet, since it is in testing mode right now.
This release will include:

  • new (old) ring buffer implementation with head and tail in userspace buffer with two syscalls to wait and to commit events (the latter becomes just an index update).
  • additional feature of the ring buffer - overflow counter to prevent situation when two threads are going to free the same events, but one of them was scheduled away for too long, so ring indexes were wrapped, so when that thread will be awakened, it will free not those events, which it suppose to free.
  • wake-all-threads flag - when it is set all interested threads will be awakened, otherwise only first one.
  • initalization syscall and removed /dev dependency.
  • small three-line verse (I think I created not bad step for non-english speaking person) about on-line optimisation. Casted by Shakespeare.

I wil probably include there userspace notifications.

Did I say that httperf is unfair benchmark?
It definitely is, and actually all existing kevent results are unfair. So...
Apache ab benchmarks shows kevent win, but I do not know how fair it is and how it works at all.
It is also possible, that my epoll based server is broken too.

So, let's state for now, that my benchmarks can not show kevent superiority in socket tests.
PIPE/socket test, created by Eric Dumazet proves kevent works better that epoll.
Johann Borck's tests show that kevent slightly outperforms epoll in web benchamrks.
But not mine.

So I have not gone to climbing zone (and I'm dirty as pig after weekend hinged ceiling setup) and started lighttpd patch to create real-world benchmark.

Here we go.
I've patched lighttpd with kevent support (although it is limited, since lighttpd does not support edge-triggered notifications and immediate events).
Server runs on Xeon 2.4 Ghz with 1 GB of ram and e100 NIC.
Client uses apache ab benchmark, which was started with following parameters:
  ab -n50000 -c5000 http://192.168.4.78/
i.e. 50k requests with 5k concurrent requests. Document size is 699 bytes to not make bandwidth a limiting factor.

Here are the results:
lighttpd_kevent:  4902.17 req/sec
lighttpd_epoll :  4843.71 req/sec
With 6k concurrent requests rates dropped down to:
lighttpd_kevent:   4307.18 req/sec
lighttpd_epoll :   4027.34 req/sec
although it can be just a file/socket rollup.
So essentially kevent socket code behaves like epoll in real-world benchmarks, as long as allows a lot of other interesting features.

/devel/kevent :: Link / Comments (0)


New district and shops.


I live in a new district which is only became to be inhabited.
There is a bus station near the buildings and generally there are a lot of people there and obviously it is a good place for small shop or stall which would have cigaretts, beer and other such stuff.

And what would you think is the first shop opened there?
It is flavour shop!
I love my country.

/life :: Link / Comments (0)


Sat, 18 Nov 2006

New generic wildcard search design.


Well, it is not new, but it is different from what was described before. And although previous variant has higher lookup speed, it requires too much memory, so it is not acceptible.

New one is quite simple - it will be implemented on top of trie structures. Main feature of tries is that they are quite compact (in case of PATRICIA tries) and allow W^d lookup speed in the worst case, where W is width of the interval in bits (i.e. 32 bits in current netchannels implementation, which does not support IPv6 due to my laziness and absence of requirements feedback from the community), and d is number of dimensions (5 currently), which results in very bad number of 33.5 millions operations in the worst case of (2^W)*(2^d) == 2^(W+d) (137.5 milliards) rules.
Actually with netchannels it will be much smaller since only two dimensions use full 32 bit space (source and destination addresses), while others use either 16 bits (source and destination ports) or 8 bits (protocol number), so it will be something about 2 millions operations.

Basic idea behind tries is following: trie is a simple unbalanced tree each node of which represent only one bit (i'th level of the tree represent bits in i-1 position in the searched value), so it's maximum depth is 32 (in PATRICIA trie several bits can be combined into the same node if there are no other rules, which have different bits in that positions). In that case rule 192.168.0.0 - 192.168.0.255 will be presented as 0b11000000101010000000000000000000(192.168.0.0) - 0b11000000101010000000000011111111(192.168.0.255), so actually trie will be stopped at bit 24 (counted from the most significant one) and the rest will present a wildcard (0-255 in the least significant byte in host order) or it will contain additional nodes if there are additional rules which has bits in that range, for example 192.168.0.0-192.168.0.20.
Each direction is a base for own tree.
Each node contains a pointer to the subtree of the next-dimension tree, where its rule starts. So when each node is examined (i.e. each bit is checked) in the first dimension and bit is found to to belong to some rule, next dimension is checked from the bit position specified in the node and this is being done recursively for each dimension. If in some dimension requested bit is not found in the trie (where it should be by the previous dimension rule), it means that there are no rules for requested values.

In the previous memory-hungry design, if rules do not have ranges but only single tuples (i.e. single source/destination IP address and single ports and protocol), or if all rules present the same range (for example all port rules have the same 1-65535 range), it can not lead to the described problem with memory usage. The more range overlaps rulset has, the more memory such algorithm will use - in theory, when each new rule requires interval split, i.e. when each new rule overlaps with all previous and has additional range (for example when each new rule is slightly bigger than previous one), it is O(d*W*(N^d)) where N is number of rules and d is number of directions, W is width in bits of each dimension (actually it can be hidden by O(), since it is a constant for different set of rules).

/devel/networking :: Link / Comments (0)


I wish my day would have 36 hours.


So I could quickly complete flat development, moved my laptop to HP repair (its disk fails after couple of hours of work) and started to work more than currently.

Likely if you do not see entry in the blog, that means that I'm too busy at paid work (or even unavailable at all) to work on my own projects.

/devel/other :: Link / Comments (0)


True asynchronous IO.


All existing AIO - both mainline and kevent based lack major feature - they are not fully asyncronous, i.e. they require synchronous set of steps, some of which can be asynchronous. For example aio_sendfile() requires open of the file descriptor and only then aio_sendfile() call. The same applies to mainline AIO and read/write calls.

My idea is to create real asyncronous IO - i.e. some entity which will describe set of tasks which should be performed asynchronously (from user point of view, although read and write obviously must be done after open and before close), for example syscall which gets as parameter destination socket and local filename (with optional offset and length fields), which will asynchronously from user point of view open a file and transfer requested part to the destination socket and then return opened file descriptor (or it can be closed if requested). Similar mechanism can be done for read/write calls.

This approach as long as asynchronous IO at all requires access to user memory from kernels thread or even interrupt handler (that is where kevent based AIO completes its requests) - it can be done in the way similar to how existing kevent ring buffer implementation and also can use dedicated kernel thread or workqueue to copy data into process memory.

It is very interesting task and should greatly speed up workloads of busy web/ftp and other servers, which can work with a huge number of files and huge number of clients.
I've put it into TODO list.

Someone, please stop the time for several days, so I could create some really good things for the universe.

/devel/other :: Link / Comments (0)


Fri, 17 Nov 2006

Climbing evening.


It was quite easy training, since there were crowds of people, which usually means, that it is quite problematic to climb horizontal traverses. But nevertheless I completed several old traces and started parts of the relief traverse (longer and longer each training, which is a good sign). At the end I also completed couple of start of complex traces on negative slope and found an interesting bouldering (read: lurked in climbing school training which happend near the place where I had a rest), but failed to complete it since was tired already.
Sauna, shower at climbing zone and a little beer with Gabriel Garcia Marquez book at home completed the day - yet another good day has gone...

/life :: Link / Comments (0)


Netchannels wildcard design and implementation.


It was first described here, but I did not present memory usage for that scenario.
After some calculations I found it is about O(d*(N^d)/4 + N) where N is number of netchannels in the whole tree and d is number of dimensions we search in. With 5 dimensions (source/desctination address/port and protocol number) it becomes completely unfeasible.

And after my implementation completed (although it requires some cleaning now) I've found that memory limit on practice.

Presented design allows extremely fast searching, but it is too memory hungry, so I will modify it to not replicate sub-trees of intervals in each interval, but instead to have different kind of trees which will reduce memory consumption down to O(d*N) instead of above O(d*(N^d)) and increased search time.
So, stay tuned...

/devel/networking :: Link / Comments (0)


Wed, 15 Nov 2006

Theatre evening.


I've seen Eugene Grishkovec's 'Po Po' performance created under expression of Edgar Po works.
Performance is a compilation of many novels written by famous author, all of them are easily detectible.
Grishkovec plays with Alexander Cekalo, they call it 'fun story about the death' and it is really so. I would even say that Grishkovec plays second role in that single-actor monologue performance.
There are some opinions that Grishkovec has star fever, but I would not say so - neither during his presentation before the performance, nor during the whole act he showed any signs of it, I managed to get tickets in the first raw, so I saw actors right in a couple of meters. Even after performance, when we met Grishkovec on the street with his girlfriend and asked couple of questions and presented a flower he behaved very good.
I like his performances, they are really different from the mos others, but unfortunately he plays quite rarely in Moscow. I hope I will see his 'Drednouti' ('Dreadnoughts') performance this season.

Later we got couple of beers in 'Vobla' beer restaurant and moved home. I think the whole day indeed succeeded, although it looks I caught a cold in addon to my current sick...

/life :: Link / Comments (0)


Packets in alternative worlds of netchannels.


Do you have an excellent spatial imagination?
Can you consider yourself and easily operate say in 3-dimensional world?
Likely you can, since we live in such a world. Probably we can think about 4-dimensional world too.

But can you describe to someone 5-dimensional world?
Or can you put yourself into one of the N worlds, where N is potentially millions, each of which has number dimensions smaller than previous world by one and upto N subworlds in each dimension? And while you - poor walker - are searching for the shortest way to the destination point in that brothel, someone looks into your universe through the glas and loudly lough counting your times clock.

I've just described couple of moments in a single network packet live, when it enters netchannels subsystem with this wildcards design.
It looks like it will have complex live, but we will see, since I've implemented generic multi-dimensional tree hierarchy and will complete searching algorithm and start testing tomorrow.

And now I want to finish my fantastic narration, since I'm going into a theatre.

/devel/networking :: Link / Comments (0)


Reading S. Minaev "Duhless" (aka "Spirit less").


It is extremely 'trendy' book, here one can find a chapter called 'Office'.

Crap. Complete crap - some kind of cool artistic slightly squeamish flood of mind(less).
One can probably think that it is trendy to read it, that it is cool, it is another view on our usual life and so on...
Maybe it is right, but I think it is a crap created to show 'how different author is from all those stupids'.

There are quite a few fun moments though - probably life of the top management really looks like it is described.
"What about putting the sea or beach or wild nature against a background of a woman".

But why so declamative? Why so depreciatingly? Crap.

/life :: Link / Comments (0)


Tue, 14 Nov 2006

Delayed allocation and fragmentation.


Quotation from Andreas Dilger:

Essentially, delaying the disk allocation until a file is large/complete
(delalloc) and then using a buddy allocator in memory to get contiguous
chunks of disk is better than hard 64K+ cluster because it avoids
internal fragmentation and allows much more optimal/efficient placement
than just a factor of 16 reduction in the block count.
This completely confirms my thoughts about delayed allocation.

But I want to add even more - log-structured file system, i.e. file systems, which allows to write updated block essentialy everywhere on the disk instead of the place where previous one was placed, have self-defragmentation mechanism by design - there is no need to reallocate inodes and move files all over the disk - new version can be put in new location, which can be carefully selected and can be written together with unupdated part thus removing fragmentation. Journaling filesystems require defragmentator for that task.

/devel/fs :: Link / Comments (0)


Why trees are better than binary search.


For netchannels wildcard support I need to create set of non-crossing intervals which will be examined for each packet. It is possible to put them one-by-one and thus form an array of intervals which can be traversed using binary search, but there is a major problem when we will add new interval, which should be placed somewhere inside the existing array - that will force array reallocation and copy. This problem also applies to hash tables and actually all other table-based approaches, but does not exist in trees, although tree requres a bit more memory.
both have the same search speed about O(log2(N)), where N is number of intervals.

Rough code snippet (no return value checks and reference counters update), which splits intervals into non-crossing ranges:


	if (in->min < new->min) {
		netchannel_interval_split(in, in->min, new->min-1, GFP_KERNEL);
		if (in->max > new->max) {
			netchannel_interval_split(NULL, new->min, new->max, GFP_KERNEL);
			netchannel_interval_split(new, new->max+1, in->max, GFP_KERNEL);
		} else if (in->max < new->max) {
			netchannel_interval_split(NULL, new->min, in->max, GFP_KERNEL);
			netchannel_interval_split(new, in->max+1, new->max, GFP_KERNEL);
		} else {
			netchannel_interval_split(new, new->min, in->max, GFP_KERNEL);
		}
	} else if (in->min > new->min) {
		netchannel_interval_split(in, new->min, in->min-1, GFP_KERNEL);
		if (in->max > new->max) {
			netchannel_interval_split(NULL, in->min, new->max, GFP_KERNEL);
			netchannel_interval_split(new, new->max+1, in->max, GFP_KERNEL);
		} else if (in->max < new->max) {
			netchannel_interval_split(NULL, in->min, in->max, GFP_KERNEL);
			netchannel_interval_split(new, in->max+1, new->max, GFP_KERNEL);
		} else {
			netchannel_interval_split(new, in->min, new->max, GFP_KERNEL);
		}
	} else {
		if (in->max > new->max) {
			netchannel_interval_split(in, in->min, new->max, GFP_KERNEL);
			netchannel_interval_split(new, new->max+1, in->max, GFP_KERNEL);
		} else if (in->max < new->max) {
			netchannel_interval_split(in, in->min, in->max, GFP_KERNEL);
			netchannel_interval_split(new, in->max+1, new->max, GFP_KERNEL);
		} else {
			netchannel_interval_split(in, in->min, in->max, GFP_KERNEL);
		}
	}
I think my head will be broken in a couple of moments...
Tricky algorithm should be used there probably to split them up, but I currently fail to research one, so I use stupid but the fastest - there are only two checks and one allocation (or two if new interval was allocated in stack) for the worst case of crossing, when two crossed intervals are put into three non-crossing ones.

/devel/networking :: Link / Comments (0)


Filesystem stress testing and emulation tool.


I wonder why Linux does not have FS stress testing and emulation tool like netem for network, which allows to perform drop, corruption, reordering, duplication and other interesting options extremely usefull for protocol testing.

Consider the same tool which works under FS layer, for example as ATA/SCSI controller, which can perform various message mangling, delaying, reordering and other features emulating incorrect behaviour of the hardware. It can even send data down to the real hard drive, thus emulating even various software conditions like scheduling starvation, irq storms and so on.

I think such tool is a must for fs development and should be implemented at least in parallel or even before various userlevel stress tools are started to stress the filesystem.

This deserves entry in TODO list.

/devel/fs :: Link / Comments (0)


Do you know what Google Earth lacks at the first place?


No, not 3d buildings - it lacks ion cannon.

/other :: Link / Comments (0)


Mon, 13 Nov 2006

Climbing day.


It was day of bouldering training - I tried different parts of interesting relief traverse and several old traverses over usual holds.
There are quite interesting moments when one needs to place his/her leg above the head to complete the hold and be able to get next one.
I think I like relif pathes more than any other traces, but unfortunately there are no vertical traces over reiefs in Skala-city.
Training was completed with campus-board exercises, sauna, shower and excellent mood.
I like it.

/life :: Link / Comments (0)


Ceiling design.


I think I will end up with something like this (handmade paintings with very bad quality).
If you feel yourself having enough spatial imagination, try this picture of the room I have in mind.

Room's ceiling model

If you know some site where I can get an inspiration, related to my interests, do not hesitate and drop me a link...

/devel/flat :: Link / Comments (0)


Sun, 12 Nov 2006

Meanwhile on flat development front.


I completed to work with plaster - all walls and ceiling are ready, although they require polishing, which is quite fast procedure, but too dusty, so it will be done slowly.
I've found a way to create very nice corners in doorways using special fabric called 'serpyanka'.

Next step is to start hanged ceiling implementation in the room - I still have not created design of that monster, but I already completed electricity wiring for additional lights and neon cord.

I also decided to not put neither bath nor shower cabin into my bathroom, instead I will develop special podium for water sink and create some kind of a wall - it will be hand-made shower cabin 'on the floor', currently without glass dorrs, but I will add them eventually.

/devel/flat :: Link / Comments (0)


Sat, 11 Nov 2006

Added kevent signal notifications.


Signals which were requested to be delivered through kevent subsystem must be registered through usual signal() and others syscalls, this option allows alternative delivery. With KEVENT_SIGNAL_NOMASK flag being set in kevent for set of signals, they will not be delivered in a usual way. Kevents for appropriate signals are not copied when process forks, new process must add new kevents after fork(). Mask of signals is copied as before.

Here is output of the test program signal.c available in archive:

sig_handler, signo: 10.
Dequeued signal: 10, nodel: 0.
Dequeued signal: 12, nodel: 1.
Program registers SIGUSR1 and SIGUSR2 signals to be delivered through kevent, the former is delivered both through signal callback and kevent, the latter only through kevent.

/devel/kevent :: Link / Comments (0)


Fri, 10 Nov 2006

Climbing evening.


That was great time!
I do not exaggerate - I spent perfect time there - although nothing major happend, but the whole process was not usual (or it was, but I had different shape) - I completed several old traverses and couple of new starts, finished several interesting boulderings - just like usual training, but something was different.
I do not know what, but I want more - yes it is addictive.
Since Grange still can not get his, well something, out of hypo-narcotic state of delirious metaphysical processing or whatever he is doing instead of climbing, it is possible that regular trainigs with vertical climbing will be postponed, since it is not always possible to get a mate in climbing zone, so I will climb for along for some time.

/life :: Link / Comments (0)


Netchannels wildcards algos.


Here was initial design of netchannels wildcards with appropriate picture.

Let's analyze first problem we strike when trying to implement that design.
It is presentation of wildcard intervals (for example two rules: 192.168.0.0-192.168.0.255 and 192.168.0.10-192.168.0.20) in one-dimensional set of non-crossing ranges. Our example splits into three ranges: 192.168.0.0-192.168.0.9, 192.168.0.10-192.168.0.20, 192.168.0.21-192.168.0.255.
All those intervals can be easily represented by structures in C language and can be placed into tree for the fastest access. Each interval structure will have a pointer to the next dimansion range tree (like tree for ports and protocols), nodes of the last tree of intervals will contain pointers to the actual netchannels, which parameters (address, ports and so on) cover selected intervals in appropriate dimensions.
Interval splitting happens in context of netchannel creation and must be protected against simultaneous access both from softirqs (where packets enter network stack) and process context, where other netchannel can be added or removed.
In theory all netchannel trees of intervals can be protected by BH-disabling locks - but it is not nice solution - even if tree traversal itself is very fast and can be done under BH-disabling lock, there is possibility that interval currently being checked must be split into several ones, which requires intervals allocations, which should not be perfomed in atomic context.
So I propose simple scheme lurked in Linux socket locking - under BH-disabling lock we set the flag, that we currently process the tree, so process context will sleep on that flag until it is cleared and (under the same BH-disabling lock) set it again protecting itself against other process trying to access the tree. So usual lock will serialize process and softirq process and semaphore-like locking will be used to serialize process context.

/devel/networking :: Link / Comments (0)


New CARP release.


I've ported Common Address Redundancy Protocol (CARP) module to 2.6.19 API and added INSTALL instructions and also updated README.
Now it just works after reading README and INSTALL files, although much more testing is required.
I only ran it on x86_64 and x86 arches with different failover modes - nodes switch to/from master/backup mode and appropriate userspace applications (and registered kernel users like connector) are executed when machines are turned on/off.

More details on configuration and various parameters as long as step-by-step intruction can be found in README and INSTALL files in the sources.

Feel free to download the sources and create your own high-availability cluster.

/devel/other :: Link / Comments (0)


Thu, 09 Nov 2006

POSIX threading model vs. Erlang threads.


I decided to check if this benchmark of different threading models is correct.
So I've installed Erlang on my Debian Etch desktop (Intel Core Duo 3.4 Ghz (speeded up to 3.7 Ghz), 2 GB of RAM) and tried this Erlang application. It was compiled fine, but failed to execute.
So I've read some FAQs and initial tutorials about Erlang and ended up removing some bits of initialization, since my version of Erlang failed to translate string to number, so I hardcoded required number by hand into the program.
I also compiled this (1) C++ example and this (2) and performed a comparison.

Basically all those programs do is to create 500 threads and exchange 3000 messages between them one-by-one.

Results are quite deplorable for POSIX threads and C++:

  Erlang                   : 0.443 seconds
  C++/POSIX NPTL threads 1 : 4.374 seconds
  C++/POSIX NPTL threads 2 : 0.190 seconds
i.e. C++ application which uses POSIX NPTL threads is about 10 (!) times slower than native Erlang threads, but if thread start/stop functions are written in assembler (if I did not forget it completely), C++ variant becomes 2 times faster than Erlang's one, which seems to be either real magic, or some bug.

I understand that Erlang has completely differet thread model (i.e. it is part of it's virtual machine, but not system as a whole), but there is major issue to think about.

Eventually I created own application, which uses NPTL POSIX threads, which I compared to above (this (1)).
Above numbers are the best numbers reported by time test, so I put gettimeofday() inside main() in both C++ and my C versions and run them 13 times.
Here are averaged results:
  C++/POSIX NPTL threads 1                        : 5083343 usecs
  C/POSIX NPTL threads                            : 4821212 usecs
  C++/POSIX NPTL threads 1 (-fomit-frame-pointer) : 4738057 usecs
  C/POSIX NPTL threads (-fomit-frame-pointer)     : 4430010 usecs
Sources are available here.
Compilation options were: -pipe -Wall -O3 -lpthread -fomit-frame-pointer

This is definitely an issue to think about...

/devel/other :: Link / Comments (0)


New version of the CARP module has been released.


I'm an author of the Common Address Redundancy Protocol (CARP) module for Linux kernel 2.6.
CARP is an improved version of the Virtual Router Redundancy Protocol (VRRP) standard. The latest protocol to help provide high availability and network redundancy, it was developed because router giant Cisco Systems believes that its Hot Standby Router Protocol (HSRP) patent covers some of the same technical areas as VRRP.

Newsforge article about CARP.
Linux kernel CARP is based on OpenBSD CARP protocol but is not compatible with it, since OpenBSD implementation does not contain protection against repeated message sending attack.

New release includes compilation fixes for the latest kernels.

/devel/networking :: Link / Comments (0)


New kevent 'take24' release.


Short changelog:

  • added kevent PIPE notifications
  • KEVENT_REQ_LAST_CHECK flag, which allows to perform last check at dequeueing time
  • fixed poll/select notifications (were broken due to tree manipulations)
  • made Documentation/kevent.txt look nice in 80-col terminal
  • fix for copy_to_user() failure report for the first kevent (Andrew Morton)
  • minor function renames

/devel/kevent :: Link / Comments (0)


Kevent pipe benchmark.


Using the same benchmark created by Eric Dumazet to test epoll, which I ported to kevent, it is possible to test unix sockets, inet sockets and pipes.
Application has additional posix thread, which writes into number of preallocated sockets or pipes, main process reads from appropriate pairs. Each successfull read is counted as event.

Here is pipe result with kevent_pipe kernel kevent part with 2000 pipes:

epoll (edge-triggered):   248408 events/sec
kevent (edge-triggered):  269282 events/sec
Busy reading loop:        269519 events/sec
Kevent is definitely a winner with extremely small overhead.

/devel/kevent :: Link / Comments (0)


Wed, 08 Nov 2006

Kevent updates.


Some kevent fixes for misteriously lived in the tree errors.
poll/select notifications were completely broken, which I was pointed to by Davide Libenzi.
I do not know, how it could happen, but probably during tree changes from old one, which contained network and FS AIO, to current I dropped some file updates.
I've fixed it and also changed a bit logic behind poll/select notifications - they are called with newly introduced KEVENT_REQ_LAST_CHECK flag, which if set allows to perform the last check on kevent (call appropriate callback) when kevent is marked as ready and has been removed from ready queue. If it will be confirmed that kevent is ready (k->callbacks.callback(k) returns true) then kevent will be copied to userspace, otherwise it will be requeued back to storage. Second (checking) call is performed with this bit _cleared_ so callback can detect when it was called from kevent_storage_ready() - bit is set, or kevent_dequeue_ready() - bit is cleared. If kevent will be requeued, bit will be set again.

Benchmarks compared to epoll() shows sightly better results, but it can be fluctuation and can be unrelated to that change.

I also created pipe/fifo notifications, but have not aggressively tested it yet, will do it tomorrow as long as release new kevent patchset with poll/select notifications fix.

/devel/kevent :: Link / Comments (0)


Threading libraries and it's speed.


Looking into threading benhcmark hosted on http://shootout.alioth.debian.org I've found that synthetic threading model (i.e. which has threading capabilities built into language interpreter like Erlang) works generally faster than posix threading model.
I think I will try to implement that benchmark in C and run different applications hosted there to compare if NPTL is really that slow.

If it is really so, I will think some more about possibility to create such library. Obviously I will not start it's implementation tomorrow, but some generic thoughts about how it could look like, what algorithms could be used and so on... Just a thoughts...

/devel/other :: Link / Comments (0)


I want to have 36 hours in my day.


This ppc32 work is (probably already 'was') in my todo list too.

Just thought, that ppc32 uses round-robin algorithm to update TLB cache. It looks like it would be possible to use weighted RR algo instead - i.e. create each entry's usage statistic and flush only entry with the smallest value.

I always wanted to work with low-level hardware and specially with some interesting arch. I probably would not want to be a maintainer of some common arch like PPC, but work with something interesting like LUNA machines based on Motorola 88000 CPU for example. Just a though...

/devel/other :: Link / Comments (0)


Tue, 07 Nov 2006

I am ready new sick is kevent release.


No, I am sick and new kevent release is ready.
Actually not. Looking at how much I've eaten and how much I still want to eat, one can say that I'm definitely ok.

Nevertheless, new kevent release 'take23' is out.
There are several major changes in this release:

  • new ring buffer implementation in process' memory. it works similar to this description, but structure has changed - I removed user's index, instead there is only kernel one (i.e. index where kernel will put new ready event). When user calls kevent_get_events() or kevent_wait() requested number of kevents is copied into ring buffer (if it was initialized using
    int kevent_ring_init(int ctl_fd, struct ring_buffer *ring, unsigned int num);
    function and kernel's index is increased. Simple. Userspace should itself store pointer to it's previous index (actually it can be calculated using returned value from kevent_wait() - it is number of events copied into ring buffer and current kernel's index, but it is only correct in case of special locking between threads.
  • wakeup-one-thread flag. When several threads wait on the same kevent queue and requested the same event, for example 'wake me up when new client has connected, so I could call accept()', then all threads will be awakened when new client has connected, but only one of them can process the data. This problem is known as 'thundering nerd problem'. Events which have this flag set will not be marked as ready (and appropriate threads will not be awakened) if at least one event has been already marked.
  • edge-triggered behaviour. It is an optimisation which allows to move ready and dequeued (i.e. copied to userspace) event to move into set of interest for given storage (socket, inode and so on) again. It is very usefull for cases when the same event should be used many times (like reading from pipe). It is similar to epoll()'s EPOLLET flag.

Eric Dumazet created special benchmark which creates set of AF_INET sockets and two threads start to simultaneously read and write data from/into them to test epoll() and I ported it to kevent (ring_buffer.c application can be found in archive).
Here are the results:
  epoll (no EPOLLET):                  57428 events/sec
  kevent (no ET):                      59794 events/sec
  epoll (with EPOLLET):                71000 events/sec
  kevent (with ET):                    78265 events/sec
  Maximum (busy loop reading events):  88482 events/sec

So, kevent works faster than epoll() and it was confirmed by three different benchmarks (Eric's code (epoll_bench.c and appropriate kevent_bench.c can be found in archive), Johann Borck's web server and my benchmarks presented on kevent homepage)).

Jeff Garzik today joined kevent supporting team.
So there are serious people on kevent side including such heavyweights like David Miller (network core maintainer) and Jeff Garzik (network device drivers, SATA maintainer). Andrew Morton (I think you heard this name) included kevent in his -mm tree several releases ago.

So, things develop from bad to probably not bad.

/devel/kevent :: Link / Comments (0)


Mon, 06 Nov 2006

New userspace ring buffer implementation.


It looks like I've completed new ring buffer implementation, but I have not extensively tested it due to aching head. Will complete it tomorrow and also implement wake-one-thread flag.

Background bits.
New ring buffer is implemented fully in userspace in process' memory, which means that there are no memory pinned, its size can have almost any length, several threads and processes can access it simultaneously.
There is new system call

int kevent_ring_init(int ctl_fd, struct ring_buffer *ring, unsigned int num);
which initializes kevent's ring buffer (int ctl_fd is a kevent file descriptor, struct ring_buffer *ring is a userspace allocated ring buffer, and unsigned int num is maximum number of events (struct ukevent) which can be placed into that buffer).
Ring buffer is described with following structure:
struct kevent_ring
{
	unsigned int		ring_kidx, ring_uidx;
	struct ukevent		event[0];
};
where unsigned int ring_kidx, ring_uidx are last kernel's position (i.e. position which points to the first place after the last kevent put by kernel into the ring buffer) and last userspace commit (i.e. position where first unread kevent lives) positions appropriately.
I will release appropriate userspace test application when tests are completed.

When kevent is removed (not dequeued when it is ready, but just removed), even if it was ready, it is not copied into ring buffer, since if it is removed, no one cares about it (otherwise user would wait until it becomes ready and got it through usual way using kevent_get_events() or kevent_wait()) and thus no need to copy it to the ring buffer.
Dequeueing of the kevent (calling kevent_get_events()) means that user has processed previously dequeued kevent and is ready to process new one, which means that position in the ring buffer previously ocupied but that event can be reused by currently dequeued event. In the world where only one type of syscalls to get events is used (either usual way and kevent_get_events() or ring buffer and kevent_wait()) it should not be a problem, since kevent_wait() only allows to mark number of events as processed by userspace starting from the beginning (i.e. from the last processed event), but if several threads will use different models, that can rise some questions, for example one thread can start to read events from ring buffer, and in that time other thread will call kevent_get_events(), which can rewrite that events. Actually other thread can call kevent_wait() to commit that events (i.e. mark them as processed by userspace so kernel could free them or requeue), so appropriate locking is required in userspace in any way.
So I want to repeat, that it is possible with userspace ring buffer, that events in the ring buffer can be replaced without knowledge for the thread currently reading them (when other thread calls kevent_get_events() or kevent_wait()), so appropriate locking between threads or processes, which can simultaneously access the same ring buffer, is required.

Having userspace ring buffer allows to make all kevent syscalls as so called 'cancellation points' by glibc, i.e. when thread has been cancelled in kevent syscall, thread can be safely removed and no events will be lost, since each syscall will copy event into special ring buffer, accessible from other threads or even processes (if shared memory is used).

/devel/kevent :: Link / Comments (0)


Life sucks - I caught a cold.


Crap, it is second time this year already (first one was early spring) and third one during the last preiod I recall (about 3 years I think), so it becomes quite bad. I only know how to treat head aching during hang-over (the whole two drugs!), so I decided to surrender to collegues - I obtained some drugs, which among others can result in nausea, retching, aching in epigastric area, hepatotoxic and nefrotoxic action, hepatic failure, encephalopathy and state of coma.

So I decided to drink a lot of tea with lemon instead of that shit.
But there is a positive moment in it - when I have such condition I frequently switch into creative mode (probably due to increased blood circulation in head), which results in very interesting talks, stories and the like (you probably do not know, but some times ago I created world evolution model based on liquid for optical axes ointment usage in various areas of Earth)...
Right now I discuss in linux-kernel@ why kevent was not a pure FreeBSD kqueue rewrite (my infinite patience comes to it's end). In a couple of messages I think I will switch and start to write a creatiff.

And I was invited to cinema today... I hate to be sick.

/life :: Link / Comments (0)


Sun, 05 Nov 2006

Netchannels wildcard design picture.


Netchannels wildcard design picture (modified with some description).

Each dimension of the ruleset is splitted into non-crossing intervals, which form the tree (AVL tree on the picture). Each interval has a list of rules, which include part of that interval, second dimension of that rules forms new set of intervals, which forms tree, bound to the interval in previous dimension.
The worst search time complexity is O(d*log2(N)), where d is number of dimensions, and N is number of rules.

Theoretical research is over, let's implement this for Linux netchanels.

Actually it is some kind of the Grand Unified Flow Cache, described by Rusty Russel, netchannels already can host different protocols in the same cache, but only on top of IPv4 currently, since I have not implemented comparison helpers for different protocols.
Each netchannel has initialization and cleanup callbacks and processing function, which is called when packet has arrived. This set of callbacks allow to create netfilter, IPsec and any other transformation/encapsulation protocols as long as usual ones. Each netchannels also has data reading and writing callbacks, which can be used to copy data to/from userspace or for other usage.
Using netchannels it is possible to implement direct hardware-user bridges without any influence from OS.

/devel/networking :: Link / Comments (0)


Sat, 04 Nov 2006

Did dolphins live on the land?


Japanese researchers found dolphin with 'remains of legs'.
Japanese researchers said Sunday that a bottlenose dolphin captured last month has an extra set of fins that could be the remains of back legs, a discovery that may provide further evidence that ocean-dwelling mammals once lived on land.
A freak mutation may have caused the ancient trait to reassert itself, Osumi said.

And maybe that dolphin just wanted to move to the land and is not an ancient but modern trait? Probably in that case we would find more and more such a dolphins with different sized extra fins. Who knows...

/other :: Link / Comments (0)


Fri, 03 Nov 2006

I'm becoming a pigman or finally powerfull climbing training.


It was really hard training, although I did not run a lot of traces or performed complex exercises. I just did not climb for a long time - body lost lightness, muscles became inert, blood - slow. But everything has changed - I climbed for about 3 hours, completed one new trace and couple of old traverses, and now everything is in pain - arms, legs, back, the whole body rises from the sleep.
I have tired a lot, but I feel myself just bloody excellent!

/life :: Link / Comments (0)


This is not an any kind of political blog, but...


Tomorrow (Nov 4) Russia celebrates "Day of people's unity", some stupid people want to change this day into nationalistic and nazi appearance.

Do not participate in idiotic "Russian march" - just be smart.
Move to the cinema, watch a video, play computer games or build an ikebana, but do not jeopardize your life and lifes of your friends.

If some persons are just idiots, it can not be changed even if you beat them in public, and some really good people suffer from such stupids.
Be smart, be gentle, be cool.

P.S. Nov 4 1612 irregulars under the direction of Minin and Pozharskiy turned out interventionists from Moscow.

/life :: Link / Comments (0)


My english language troubles.


I always thought that having inverted comma (') and character s after the word means some belongings to that word.
For example with following sentence 'it's property' I meant 'property of it', but actually it means 'it is property', correct is 'its property' without inverted comma after 'it'.

So don't be confused - english is not my native language (I spoke with other person in english only once (although about 4 hours), not counting simple phrases like 'where is the bus station'). Until I find some manual I will try to restructure the sentences to avoid problematic places.

Ugh, well, and definite/undefinite articles...

/other :: Link / Comments (0)


Netchannels roadmap.


Following issues have the highest priority in my network development process:

  • netchannels wildcard support.
  • NAT implementation.
  • more extended userspace network stack testing.

And I need to implemnt most of the kevent and netchannels roadmaps very quickly (preferably this weekend) due to some other issues...

/devel/networking :: Link / Comments (0)


Kevent roadmap.


I plan to implement following features in kevent in order as they are presented:

  • new ring buffer implementation.
    It will be exactly like existing one, except that it will work with provided by userspace using special syscall.
    This will allow to make all kevent syscalls as so called 'cancellation points' by glibc, i.e. when thread has been cancelled in kevent syscall, thread can safely removed and no events will be lost, since each syscall will copy event into special ring buffer, accessible from other threads or even processes (if shared memory is used).
    Actually it is already possible, and it was there from the beginning of the kevent - one can provide shared memory (or memory used as shared buffer between threads) to kevent_get_events(), but in this case userspace should manage indexes (offsets where to put new ready events) by itself.
  • wake-up-one-thread flag. When several threads wait on the same kevent queue and requested the same event, for example 'wake me up when new client has connected, so I could call accept()', then all threads will be awakened when new client has connected, but only one of them can process the data. This problem is known as 'thundering nerd problem'. With new flag it will be possible to mark some events, so only threads requested to be in the thundering nerd (i.e. those threads which have not set the flag on given event) would be awakened when event is ready.
  • FIFO read/write notifications.
  • userspace notifications. Userspace will be able to register a special event, which can be marked as ready by some other thread. In theory it is just usual pipe between users, so it is possible (and likely) that I will not do it.
  • signal notifications. Unlikely to be implemented by me - I already have huge experience in 'working' with signal/rt-signal gurus, so additional project will likely mar my karma even more.
First one will be available in the new release (I expect to start working on it either after confirmation from other developers that our private discussion and my conclusions are correct or after some timeout, likely this weekend). I will think some more on second point (wake-up-one-thread flag), and if things will not have major problems, will try to implement this too.
I schedule next release for the next week, likely it's start, but it is possible that it will take much longer.

/devel/kevent :: Link / Comments (0)


Thu, 02 Nov 2006

Photos of the river trip early october have been added to gallery.


Enjoy the views of the Moscow center from small ship called "Chiszhik-2" (small siskin number two). There were only two ships - 'Chiszhik-2' and 'Atlantis' for us, we decided to select first one, although crew refused to answer, what happend with the first 'Chiszhik'.

/life :: Link / Comments (0)


NAT implementation on top of netchannels is not ready yet.


I was a bit optimistic about the fact, that netchannels support wildcards now - there is some work there (including theoretical aspects), so stay tuned.

/devel/networking :: Link / Comments (0)


Wed, 01 Nov 2006

Thanks to timing gods!


I almost completed NAT implementation on top of netchannels.
Of course 'almost' is not counted, but tomorrow I plan to finish it.
It requires connector for configuration, but there is no userspace yet.
The most complex part was actually wildcards support for netchannels tree.

And now I need some sleep...

/devel/networking :: Link / Comments (0)


Linux event handling.

Here is my reply to Ulrich Drepper's article about event handling (Ulrich's text is marked with italic style).

I had a little chat with Andrew Morton on Monday. The topic was the event handling, async I/O, etc. All the same stuff I talked about in Ottawa at OLS this year (paper and slides on my web page).

Unfortunately not much has happened since OLS. Lots of people said they would be willing and able to help but life intervened, I guess. I'm no kernel developer but perhaps I have to start taking this up (this is a threat...).

There is one stream of patches which exist but they are not usable in my opinion. I'm talking about Evgeniy Polyakov's kevent patches. I send my analysis to lkml but this didn't result into much (except removing the ring buffer support).


It is not correct.
After ring buffer implementation request, no one wanted to participate in discussion, so I implemented it like I wanted. After 13 releases I was pointed that 'it is broken' and Andrew Morton was asked to not include kevent into official releases due to that. I removed ring buffer implementation and started new discussion on how it is supposed to be implemented where only me and Eric Dumazet participated.
After two weeks of waiting for additional commends I implemented new version, which, looking into description, I think was not seen by those who requested the feature.

There are more fundamental problems with his proposed approach but I don't get through to him and it appears as if I'm the only one complaining. Andrew keeps saying that this is because nobody else understands the topic. Well, let's be a bit more specific. I'll try again to motivate the case, describe how I think the userlevel interface should look like and then deduce something about the kernel interfaces. A bit of the latter Andrew and I did today.
There are also network and file AIO aspects to this and then the DMA handling, but I'm restricting myself here to the event handling.

The fundamental premise is that multi-threaded code is absolutely necessary going forward. All processors grow horizontally by having ever and ever larger numbers of execution contexts (cores, hyper-threads, ...). For applications to scale there are two types of approaches:

  • Parallelize large chunks of code. This is not a suitable approach for small chunks of code (because the startup, and distribution and collection of results are not free) or for code which is hardly parallelizable. If code can be parallelized people hopefully will start using OpenMP and similar methods. Unfortunately not everything is parallelizable in this form.
  • Parallelize individual requests.

A web server is an example where the former approach is not usable but the latter is. Usually each request is easy to handle. Even for dynamically built pages the work is usually small. It makes no sense to parallelize the code. But many requests can be processed in parallel.

We are talking about completely differnet approaches to parallelize application - code and data parallelization. The former approach works with already separated data, so we have individual requests moved into different code processing units, so it already requires to split a flow of events. The whole topic is devoted exaclty to that problem.

So it is logical to achieve parallelism by having many threads (or processes as in Apache).

Such approach, if being used largely, quickly beckomes a bottleneck of the whole system.
Existing scheduling allows fast context switching and serialization could be reduced, but it is overhead which in badly designed cases becomes the slowest part of the system.
There should not be any 'logical to do' things in system design, instead the maximum performance should be the main goal, which means removing all unnecessarry overheads from context switches and serialization. It is only doable when one CPU runs one thread, and that thread has correct set of tasks to do, so it should not be scheduled when there is no work for it, it should not busywait until other threads complete it's work.

This is what programs do today. But this is not without problems:
  • The current network interfaces are not well prepared. If you park all inactive threads in the accept() call to get a new incoming restriction you get a trampling herd effect if there is an incoming connection: all threads get woken up. But only one or a few will find that the accept() succeeded. The same is true for poll() etc calls.
  • A simple solution - do not park all threads into accept(). Application design should (and I would even say must) not split single dataflow into different users (like split flow of clients to different users), instead several dataflows should be created, so each thread would have been awakened only when it can handle the data. It is easily doable with kevents, but not implemented yet. I think next release will have this feature (by introducing special per event flag).
  • If the number is requests is high it is a bad approach to increase the number of threads until no connection remains unanswered. The problem is that this adds a lot of administrative overhead in the kernel and processor (context switches). If the processing of the requests uses 100% of the CPU time we end up with a higher total processing time because of the overhead.
  • If the CPU time is not 100% we are wasting even more. If a thread does not have 100% utilization (and this is almost always the case) then the threads blocks somewhere, either a system call or a synchronization primitive. But this means there is CPU time available for the thread to do something else like working on another requests. So the total system time increases because of management overhead for no good reason: one thread could do the work needed for two or more requests.

I completely agree.

The optimal programming model would therefore use only as many threads as are needed to utilize the system at 100%. This means one or two threads per CPU or core.

My point is still the same - only without _any_ kind of overhead introduced by context switches and serialization problems perfomance of the system can achieve the maximum.

To scale with this fixed number of threads and achieve 100% utilization it is necessary to avoid any blocking. If a thread has to perform an operation which might block (such as reading from a socket or waiting on a mutex) the normal call cannot be executed. Instead the thread requests an event to be signaled when the reason for the blocking is gone and while this is going on it starts/continues working on another request.

This is a model of appealing simplicity: scaling is achieved by adding threads into a central loop. Add threads corresponding to the capabilities of the machine not the complexity of the program. All threads are using the same code.

Obviously such approach introduces major problems for naive usage.
Having one thread per taks essentially leads to the situation when number of threads which can proceed with data exceeds number of idle CPUs which can run that threads, such situation must be generally avoided.

This simplicity in the inner loop does come at a price: code has to be restructured to avoid all blocking calls and the central loop has to be able to wait for events of all types corresponding to operations which can block and are avoided.

This is where my proposal comes in. The new event handling mechanism is the bit in the inner loop. We can define its abilities and properties as this:
  • We can wait for all kinds of events corresponding to all the possible interfaces which can block

  • kevents were designed to be able to host all existing event types.

  • Event notification better be fast. Already today with 1Gb/s Ethernet we can create requests faster than they can be processed. This is only to increase with 10Gb/s Ethernet or fabrics.

  • kevent notifications are fast - there are number of benchmarks where they are compared with epoll(). It is also possible to get into account network AIO benchmark, which I created on top of kevents.

  • Obviously, the interface must be usable in multi-threaded situations. This means multiple threads must be able to wait in the kernel for event. The alternative is to have one waiting and signal a thread in a thread pool that an event is available. This is unnecessarily slow. When multiple threads are used it must be avoided that an unnecessary number gets woken if only one or a few events are available. Ideally we wake one thread per available event.

  • kevents are completely thread-safe.

  • Ideally the interface is also usable from multiple processes.

  • Is it about having the same queue of events for several processes?
    Is it about the way of putting more and more users for the same pipe, when it is possible to create aditional pipe?
    Is it about the case, when application is designed so, that it is possible to lose events, and thus they would be picked by other application?
    In any way, one can create a fifo and put all events from the same queue into it, so several application could read them.

A consequence waking only one thread per available event is that no wakeup must be lost. This is a new concept which does not exist in the POSIX interfaces and the kernel (except the level triggered epoll interface). For this to work we have to have a mechanism to tell the kernel I cannot handle this event, tell another thread. This is important, for instance, when the system call to wait for the event is canceled (as in thread cancellation). In this case the userlevel function call does not return and the program does not even know the notification is lost. Therefore the system library wrapper the wait-for-event system call needs to tell the kernel about the notification loss.

It is not correct behaviour - it is possible to use different queues for different types of events and bind threads to that queues, so only users interested in given event would be awakened. Any case when event is dequeued and then put back are error-prone - it is possible than new event can not be put back due to memory pressure. Any case when something can be lost is just unacceptible - there are a lot of troubles for those who work with rt-signals and when queue overflows, it is a real pain. So no new design sould be even remotely possible to have possibilities to just lost an event.

The requirement for such a system call to request additional notifications has itself a consequence. If the wait-for-event system call returns itself the data for the events instead of just a notification, then this data would have to be pushed back to the kernel as well. This is a high overhead. And more problematic: if there is not sufficient memory to store the event data, what to do then?

I completely agree that it is a dead way.

It is therefore better to separate the signalling of an event from delivering the data associated with the event. The data can be retrieved using another system call. Or: we can use shared memory. Shared between kernel and userlevel. Or userlevels (plural), see the multi-process requirement above.

If we implement the shared memory mechanism this calls for a ring buffer. Since there can be multiple threads working on the same ring buffer it must be possible to mark a ring buffer entry as worked on and/or at least as finished. The bulk of the ring buffer needs not be modified. In fact, it might be best to not allow the userlevel code to change it at all. This way the kernel care store information in the ring buffer in a form which might cause problems if the values are changed (e.g., pointers of some sort). Not all the fields of a ring buffer entry need to be used or even usable by the userlevel code.

There are at least two shared ring buffer implementation, created by me, but I never got any feedback at design descussion time, and even after it was implemented, it wook quite a lot of time and pressure from my side (and other developers) to get back the feedback, that it is broken.

Btw, it is possible to use shared memory for kevent_get_events() syscall, so that ready events put there could be shared between processes. It is even possible to completely implement such ring buffer in userspace without any additional copies.

Proposed Kernel Interfaces

This is the current state of my/our thinking. We want to use a ring buffer. The memory must not be required to be pinned down. The memory might only be used among the threads in a process or among several processes. Depending on the needs the application should allocate memory either anonymously or file-backed. The size of the buffer can be chosen by the application.

To create an event queue we pass the address and size of the buffer to the kernel:

int kev_create (void *addr, size_t size, int flags);

We throw in a flags parameter to be on the safe side for future use.

We need some structure in the buffer. At the very least we need to be able to locate the ring buffer and identify the elements of the buffer which are currently active. If we are going to implement CPU ID-based cache line optimization (see next section) we need some sort of mapping from CPU ID to ring buffer entry. And we need to chain the ring buffer entries to reach only the entries for the CPU. Searching for them defeats the purpose.

Ideally we would split the memory area in two pieces. One piece is the ring buffer itself. This part needs not be written to by the userlevel code. See above for the advantages. The writable section is the part where the program(s) provide feedback to the kernel about the entries which have been handled. Here we have two possibilities:
  • We use only a front and a tail pointer. The front pointer points to the first unused entry, the tail pointer to the last still-used entry. Everything between tail and front is assumed used even though some entries might actually have been handled already. Everything between front and tail is free and available to receive new events. This approach requires little work from the kernel when queuing new events but it means we might run out of space even if there are empty slots. The latter can be avoided by making it the policy that userlevel code has to evacuate the ring buffer slots ASAP.
  • The alternative is to use per-ring buffer-entry accounting. Each entry has a flag associated with it. This means less memory requirement. We can still have front and tail pointers to mark regions about which we have absolute knowledge but the kernel would also try to look for holes and use those entries.

The alternatives do not differ too much. When moving the tail pointer we always have to have knowledge about the individual entries. It's just that in the second case the work is all done by the userlevel code. The kernel has an easier life and we avoid possibly long delays (imagine a huge ring buffer with many thousand entries). Maybe the best argument in favor of the first solution: we only have one pointer is modifiable state (the tail pointer, the kernel maintains the front pointer). This means we potentially can do away with the writable part of the mapping altogether. Just pass a void ** value to kev_create() and maintain this variable in the runtime. This advantage would go away if we need more writable state.

A few words about the memory handling of the ring buffer. I would imagine that on 64bit platforms we can use large areas. Several MBs if necessary. This would cover worst case scenarios. The key points are that
  • the memory needs not be pinned down (interrupt handlers can try to write and just punt to the helper code in case that fails because the memory is swapped out) and
  • we can enforce a policy where the page freed by advancing the tail pointer are simply discarded (madvise(MADV_REMOVE)). This can be done implicitly in the kernel or explicitly at userlevel. There is therefore no huge memory usage associated with the ring buffers unless the program never frees entries. It is all ordinary anonymous or file-backed memory, subject to existing rlimit restrictions. No need to invent yet more rlimits.

There are two differencies in existing ring buffer implementation:
  • it does not use special syscall to reserve that memory, instead predefined number of pages is used. Since kevent initialization syscall was removed, I do not have easy way to allocate required number of bytes, althoug it is possible to change that.
  • main difference - existing implementation uses pinned memory (Here I was asked for it), but in this post Ulrich proposes not pinned memory to be used. It requires that data copying happens always in process context, so there must be either special kernel thread (and thus additional context switches in addon to context swich, required to start actual working userspace process), or syscall. So userspace already should run a syscall, so that kernel would put events into ring buffer. In existing syscall kevent_get_event() exactly the same is porformed, only pointer to area where to put new event is provided as parameter. So it _is_ the same without any special ring buffer handling and additional interfaces.

Until I completely do not understand what is the point, it looks like it is redundant functionality, since it is exactly how kevents work right now, but with additional unneded complexity.

To register delivery of events through an event queued we likely need a number of interfaces. Fortunately one interface already exists: the sigevent structure. I described all this in my paper but some people keep misunderstanding this point. The use of sigevent has nothing whatsoever to do with using signals. It's just a convenient way to request for a whole bunch of existing interfaces that the new method of event delivery should be used. This applies, so far, to POSIX AIO (which might be extended for networking) and POSIX timers.

As a side note - there is kevent based network AIO, which has much more convenient and intuitevly used interface than POSIX ones.

Other uses will need new interfaces. For instance, signal delivery could be requested through an extension of sigaction(). We define a new flag to be ORed into sa_flags and add int sa_kev to the union already containing sa_handler and sa_sigaction. None of these would be use at the same time.

My biggest worry in the moment is to find a suitable interface for futexes. I'll think about it once the work actually starts.

I always asked myself, why should we use POSIX, when we have excellent poll() interface?
It is ery convenient, it does not have unneded parameters, it is simple and it perfectly works for our needs.
Anyway, anyone who likes to create new interfaces can use special function in kevent to add them from your own calls - that is how network AIO aio_send()/aio_recv()/aio_sendfile() work.

Implementation Notes

These are a few notes about implementation details Andrew and I came up with during the discussions. It is not necessarily all usable:
  • Ring buffer entries can be annotated with CPU ID (or perhaps even core ID). This would allow the scheduler to chose among the threads which are waiting for events to chose one which is scheduled for the same CPU. At userlevel we have the getcpu() system call and we are therefore able to pick an entry from the ring buffer which matches the CPU. This means no unnecessary cache line transfers.
  • To locate the entries for a CPU the ring buffer could have a hash table or something similar which points to the appropriate entry. It would also be possible for the wait-for-event system call to return the index of the ring buffer entry for the current CPU.
  • I think the memory for the ring buffer should be either anonymous or backed by tmpfs. Nothing in a persistent file system. We should not in any form or away encourage that people even look at the data in the file. It should be meaningless and should not survive a reboot.


So in general I agree with Ulrich, although ring buffer implementation in userspace not-pinned memory is a real questing for me, since it looks like it does not differ from what kevent already provides.

But if we dig deeper, we can see that actually main problem is not technical one.
Things develop from bad to worse - I do not get feedback, try to determine it by myself, Ulrich proposes already implemnted ideas, votes against kevent, I try to understand and fix my implementation, but do not get feedback... and so on.

/devel/kevent :: Link / Comments (0)


New kevent 'take22' has been released.


Short changelog:

  • minor cleanups (different return values, removed unneded variables, whitespaces and so on)
  • fixed bug in kevent removal in case when kevent being removed is the same as overflow_kevent (spotted by Eric Dumazet)

/devel/kevent :: Link / Comments (0)