added high-resolution timer to handle absolute timeouts.
added flags to waiting and initialization syscalls.
kevent_commit() has new_uidx parameter.
kevent_wait() has old_uidx parameter, which, if not equal to u->uidx,
results in immediate wakeup (usefull for the case when entries
are added asynchronously from kernel (not supported for now)).
added interface to mark any event as ready.
kevent POSIX timers support.
return -ENOSYS if there is no registered event type.
provided file descriptor must be checked for fifo type (spotted by Eric Dumazet).
signal notifications.
documentation update.
lighttpd patch updated.
Appropriate applications (ring_buffer.c, signal.c, posix_timer.c) and lighttpd patch
can be found in archive.
I've completed a bunch of kevent updates including major kevent interface changes
and will release new version soon. I've also created some locking changes, which
I call speculative locking (i.e. some fields (ret_flags mostly)
can be updated without locks, but it should not harm the system),
which can be wrong, but results are impressive.
Hardware setup.
Client: Intel Core Duo 3.4 Ghz (run with 3.7 Ghz each core), 2 GB of RAM, sky2 driven gigabit NIC
(Marvell 88E8053).
Server: AMD Athlon 64 3500+ 2.2 Ghz, 1 GB of RAM, r8169 gigabit NIC.
Software.
Both client and server systems have this network timewait socket setup:
Client runs Apache benchmark utility, which was started with following command line:
ab -n80000 -c8000 http://192.168.0.48/
i.e. 80k requests total with 8k requests started concurrently.
Results.
kevent about 7200-7400 requests per second
epoll about 4300-5300 requests per second
With 10k concurrence performance drops to
kevent about 4000 requests per second
epoll about 3800 requests per second
it is possible that I made some errors in that benchmarks though, so I will not publish
results to linux-kernel@ or make awailable from kevent
homepage.
Update: even without speculative lock changes I got the same outperforming results. Kernel is compiled for SMP setup,
all debugging is turned off (in my previous run on that platform I had some debugging options turned on,
and performance was about the same and even worse for kevent), preemption model is 'Low-Latency Desktop'.
Update 2: forgot to say, that web server is lighttpd-1.4.13.
Here is mp3 version (6.1 Mb, renamed to moscow.mp for those who
has corporate filters for mp3 files) of the song from Mongol Shuudan band.
This song and some others are available from band's homepage - mostly punk style.
I have not tried to extract sounds from it yet - I'm in office and I want to get home without
major damages, which are unavoidable if I would try to 'play' right now - there are several people
here, and they even try to work, so my first lesson is postponed to the evening.
I will start to rape and crack neighbours ears later today at home - likely no one will suffer too much,
building is quite new, so there are only about 20 'alive' flats in the
whole building (about 350 apartments).
I think my sounds will be in a harmony with sounds extracted from concrete breaking perforators and corner-grinding machines.
I'm talking about my climbing training, if you have not understood.
Actually there were nothing special from climbing point of view - the same
traverses and bouldering, some progress here and there, bits of mixed traces
and so on. But the whole training was extremely good - when
interesting people smile to you and you smile to them, when you found
new moving on the old trace, when body aching after the holds, when ... -
all that small bits create excellent mood and I become to think that I just
spent perfect time, and actually most of the time (I would even say overwhelming
part of the time) I spent was really good.
Grange has really burnt them brains -
in the brazilian pub he took away a guitar from singer and started
to sing russian song "Moscow" with Sergey Esenin words.
I would say that Grange signs not that bad, and especially when
he is drunk, and song itself is very good too, but the whole situation made it even better,
so he was called 'the most talented man at h2k6'. It is quite possible. One can even find
his records on the homepage (in russian).
So, it was very interesting meeting, as we can see.
Among good mood he also got bunch of various hardware, so I hope he will finally
start hacking interesting stuff instead of slacking his ... fifth point.
Although long-running test was completed without errors (actually
it is not interesting test at all, since 10 rules is a too small set),
after some thoughts I've found a bug in the algorithm - I tried to avoid recursive searching
in case of wildcard nodes found traversing trie, but it seems that it is unavoidable.
I also added trivial memory leak detector, which found another bug in algorithm -
there is possibility to not free next-dimensional roots in some cases.
So I started to change algorithm itself, and although its theoretical ground
is fully completed, practical implementation has a place where bug lives - I know
exactly what is the bug and where is that buggy place in the code, what are the reasons,
actually I know everything to fix it, but do not know how to make a good fix. Yet.
It is a first and the most significant step in netchannels wildcard implementation.
I tested it with different number of dimensions (1-5) and different number
of added/searched/removed rules (upto several millions in each dimension).
Algorithm is simple and fast - my previous
estimation was incorrect - currently 1 million 5-dimensional insertions take about 6 seconds,
1 million 1-dimension insertions take about 1.1 seconds on my test machine.
When I add search/remove operation (one-by-one, i.e. add one, search for it and remove it),
the whole tuple takes about 15 and 3.2 seconds for 5-dimensional and 1-dimensional case accordingly.
I will (probably) create some graphs describing algorithm and how it will be used
with netchannels tomorrow, but if algorithm itself is implemented correctly, it is
the most significant part of wildcards support fot netchannels.
I have started following testing case for 5-dimensional case for the night:
10 rules are added, then 10 various values (actually inserted values, but with 0xffffffff mask)
are searched for, and then all they are removed, let's see how this will end up tomorrow.
After quite empty day (I did not hack, did not work, did not do anything usefull)
I moved to climbing zone and spent there excellent time - there were
several old traverses completed, couple of new boulderings. The most interesting
part was relief traverse - each day I add more and more holds there, although
it is marked not as the most complex trace, I only saw one person who tried
to work with it (except me), others somehow do not like that trace.
unfortunately in Skala-city there are no
relief vertiacl traces, which I'm quite sure would be my favourite ones.
I've started to think about presents.
First of all, present for myself (well, second one actually) -
I think I want to start to make photos, so I decided to get
Cannon EOS 400D. Not sure if I will really buy it, but at least
I think about it.
The best present for myself would be flat repair complete,
but we will see...
So, to touch the perfect, I've subscribed to couple of interesting
livejournals of photo masters.
There are indeed several types of photos
I like - first of all, it is night photos, then probably building/stuff
photos (especially from higher floors or roofs) and panorams.
I have extremely small experience in that area, but will try to learn. Maybe.
This algorithm describes what is hackathon and what they are doing there:
For comparison: Linux network developers conference NetConf
is completely different meeting - hackers sing karaoke songs and seems to not drink at all, I never participated there though
so can not say for sure. On picture core linux network hackers
(David Miller,
Rusty Russell and
not-yet-a-netdev-hacker Soyoung Park) select karaoke song.
I completed 1-dimensional trie implementation and started its wild testing -
random prefix and address mask is created in infinite loop and then
random address is being searched there - it has passed more than 123 millions insertion/searches
and ate about 1.8Gb of ram - each trie node is 20 bytes in size,
which would take 2.29 Gb of RAM if they lived as-is, but due to the nature
of tries, which allows to reuse the same entries for several prefixes, memory usage
is smaller. 100k insert+search operations took always less than 2.5 seconds on my test machine.
Next step, which I will start after weekend, is multi-dimensional trie implementation,
which is actually netchannel wildcard support.
Grange has flown to OpenBSD hardware mini-hackathon.
This small party, also known as 'I-can-drink-more-than-you-can-hack'
(there is no real evidence that they spend there time with profit for the world
or at least for own health), will be spent
in University of Coimbra in Portugal.
Grange will hack there probably Dallas one-wire bus, maybe something else.
Actually I do not know, why he is going there except that to become drunk with
people which live about tens of thousands kilometers away.
I hope he did not forget to get one bottle of 'Russian Size'.
So, let's wish them nice time there.
Ok, after I was requested to change interfaces again to the state they were in the previous release,
I've thrown the last words - I'm waiting for about one week and listen for other developers to comment
on the interfaces, if no one will comment, I will gather all the latest suggestions, implement them and make
the new release - no interface changes after that day.
And I do not care if there will be some problems and/or
some people do not have time to review all my releases or something else - I have time to implement them on per-week
basis (I could create new release every day actually), so there is plenty of time to review for those who cares.
If you have something to say about kevent, your opinion on its inclusion/declining or on my behaviour and so on,
it is the right time to say.
Thank you.
I'm going to implement netchannels wildcards -
it is much more interesting than spending hours on completely void tasks.
struct timespec and possibility to have absolute timeouts in syscalls instead of simple number of nanoseconds.
signal mask parameter for syscalls to allow temporal blocking of the signals while syscall is executed
(almost, but I'm still trying to show that it is stupid idea with kevents).
kevent posix timers support.
allow kevent to mark any events as ready from userspace.
allow to break any waiting syscalls.
various errno return values update.
various interface changes, like addition of some new fields, and removal of otheres, which will not affect
functionality behind that syscalls.
remove some locks in the hottest path - it is indeed needed task. I plan to refactor searching/modifying code so that
there were places in the struct ukevent which are never modified by another than caller users (like
simultaneous changes in readiness state machine and searching part), so there would be possible to only
have one lock in the readiness state machine around ready queue.
Call me a looser, but although I'm strongly against first two entries, I likely will add them
to make at least some progress on kevents.
Crap, I spent 2.5 hours sending e-mails into empty discussion. I wish my day
would have not 36, but 48 hours...
Early morning hack has been sent to mail lists for review.
I have not tested it (except compilation test), and frankly do not want,
but if my flow of thoughts is correct, I will create simple application,
which will use timer_create() and friends to test
kevent posix timers support.
Implementation (and design of POSIX timers API) does not allow to have
KEVENT and SIGNAL or THREAD delivery mechanism simultaneously, at least I
concluded it after quick look into the code, which is probably is not correct.
Let's see if my patch is good or not.
That was quite easy training - I ran several old traverses
and couple of bouldering, also created one traverse, which previously
contained green holds, but most of them were removed or replaced with time,
so it was a new trace, that's all - nothing special this day.
The most pleasant part was sauna and shower which definitely were required
by development related dirtiness.
Now I feel mysef just bloody excellent.
Most of them are implemented in the local tree, I'm just waiting
for ack from Ulrich Drepper:
new command, which allows to mark any events as ready. It is also possible to just
wake up listener of the given kevent queue if number of events requested to be marked as ready is zero.
It allows glibc to notify other event processing threads, which are parked in the
waiting syscalls like kevent_wait() or kevent_get_events(),
that they can process new events in cae of originally awakened thread was terminated.
fixed unsufficient parameters check in kevent pipe notifications (spotted by Eric Dumazet).
return -ENOSYS if there is no registered event type instead of -EINVAL.
add 'flags' parameters to kevent_init() syscall (not implemented yet).
Actually looking at how kevent
makes fast progress and how slowly I get positive feedback and how complex are negotiations
about features and requirement addons are performed with some people, I seriously doubt
there will be enough resources to really complete the task in the way I want and think
is the best for future development.
I have several possible ways to solve existing (stagnating) problem with APIs and feature addons:
just implement all horrible workarounds and ugliness I'm asked about, which I think
will allow quite fast kevent integration into mainline.
continuously perform seems to be endless negotiations with mutual insults in linux-kernel@. This will not force
kevent inclusion, and likely will not end up with it at all, but there is probability (I think it is somewhere
between zero and void), that I can convince people that I am right. Practice shows that it is hardly to
convince people, which do not hear what you talk to them, no matter correct it is or not.
say that 'I do not care and will not even discuss all requested utterly damaged ideas' or something like that
and stop discussion. It can or can not help with kevent integration, but in any case I at least will have enough
time for more interesting tasks than empty talks in linux-kernel@.
For example netchannels wildcard
support and fast NAT implementation, zero-copy
support for netchannels, asynchronous IO, threading library,
log-structured filesystem
The more I think, the more I like third variant...
Something like that.
I screwed it yesterday evening, and can say that although making hinged ceiling alone
is not that simple task, it is quite doable.
Next task is to perform second layer, it will be lower and take more square, but it is
task for another Sunday.
new (old (new)) ring buffer implementation with kernel and user indexes.
added initialization syscall instead of opening /dev/kevent.
kevent_commit() syscall to commit ring buffer entries.
changed KEVENT_REQ_WAKEUP_ONE flag to KEVENT_REQ_WAKEUP_ALL, kevent wakes
only first thread always if that flag is not set.
KEVENT_REQ_ALWAYS_QUEUE flag. If set, kevent will be queued into ready queue
instead of copying back to userspace when kevent is ready immediately when it is added.
lighttpdpatch
(Hail! Although nothing really outstanding compared to epoll).
And special blog edition changelog entry (interesting, do people read my patches to detect
this lyrics? Let's see...):
small three-line verse (I think I created not bad step for non-english speaking person) about one-line optimisation.
Casted by Shakespeare.
I installed slightly used, but still functional (bought on ebay) remote
mind reader, and set it up to read Ulrich's alpha brain waves (I hope he
agrees that it is a good decision), which took me the whole week.
So I think the last ring buffer implementation is what we all wanted.
Details in documentation part.
That is how I spent Sunday.
Weight of that part is about 20 kg, it was challenging to lift and fasten the detail and
it is still in a hung state, since I do not have long enough screws. But I will complete it soon,
maybe even today.
It will be 'take25', but I have not released it yet, since
it is in testing mode right now.
This release will include:
new (old) ring buffer implementation with head and tail in userspace buffer with two syscalls
to wait and to commit events (the latter becomes just an index update).
additional feature of the ring buffer - overflow counter to prevent situation when two threads
are going to free the same events, but one of them was scheduled away for too long,
so ring indexes were wrapped, so when that thread will be awakened, it will free
not those events, which it suppose to free.
wake-all-threads flag - when it is set all interested threads will be awakened, otherwise
only first one.
initalization syscall and removed /dev dependency.
small three-line verse (I think I created not bad step for non-english speaking person) about on-line optimisation.
Casted by Shakespeare.
I wil probably include there userspace notifications.
Did I say that httperf is unfair benchmark?
It definitely is, and actually all existing kevent results are unfair. So... Apache ab benchmarks shows
kevent win, but I do not know how fair it is and how it works at all.
It is also possible, that my epoll based server is broken too.
So, let's state for now, that my benchmarks can not show kevent superiority in socket tests.
PIPE/socket test, created by Eric Dumazet proves kevent works better that epoll.
Johann Borck's tests show that kevent slightly outperforms epoll in web benchamrks.
But not mine.
So I have not gone to climbing zone (and I'm dirty as pig after weekend hinged ceiling setup) and
started lighttpd patch to create real-world benchmark.
Here we go.
I've patched lighttpd with kevent support (although it is limited,
since lighttpd does not support edge-triggered notifications and immediate events).
Server runs on Xeon 2.4 Ghz with 1 GB of ram and e100 NIC.
Client uses apache ab benchmark,
which was started with following parameters:
ab -n50000 -c5000 http://192.168.4.78/
i.e. 50k requests with 5k concurrent requests. Document size is 699 bytes to not make bandwidth a limiting factor.
although it can be just a file/socket rollup.
So essentially kevent socket code behaves like epoll in real-world benchmarks,
as long as allows a lot of other interesting features.
I live in a new district which is only became to be inhabited.
There is a bus station near the buildings and generally there are
a lot of people there and obviously it is a good place for small
shop or stall which would have cigaretts, beer and other such stuff.
And what would you think is the first shop opened there?
It is flavour shop!
I love my country.
Well, it is not new, but it is different from what was described
before.
And although previous variant has higher lookup speed,
it requires too much memory, so it is not acceptible.
New one is quite simple - it will be implemented on top
of trie structures. Main feature of tries is that they are quite compact
(in case of PATRICIA tries) and allow W^d lookup speed in the worst case,
where W is width of the interval in bits (i.e. 32 bits in current netchannels
implementation, which does not support IPv6 due to my laziness and absence of
requirements feedback from the community), and d is number of dimensions
(5 currently), which results in very bad number of 33.5 millions operations
in the worst case of (2^W)*(2^d) == 2^(W+d) (137.5 milliards) rules.
Actually with netchannels it will be much smaller since only two dimensions use
full 32 bit space (source and destination addresses), while others use either 16 bits
(source and destination ports) or 8 bits (protocol number), so it will be
something about 2 millions operations.
Basic idea behind tries is following: trie is a simple unbalanced tree each node
of which represent only one bit (i'th level of the tree represent bits in
i-1 position in the searched value), so it's maximum depth is 32 (in PATRICIA
trie several bits can be combined into the same node if there are no other rules,
which have different bits in that positions). In that case rule 192.168.0.0 - 192.168.0.255
will be presented as 0b11000000101010000000000000000000(192.168.0.0) - 0b11000000101010000000000011111111(192.168.0.255),
so actually trie will be stopped at bit 24 (counted from the most significant one) and the rest will present a wildcard
(0-255 in the least significant byte in host order) or it will contain additional nodes if there are
additional rules which has bits in that range, for example 192.168.0.0-192.168.0.20.
Each direction is a base for own tree.
Each node contains a pointer to the subtree of the next-dimension tree, where its rule starts.
So when each node is examined (i.e. each bit is checked) in the first dimension and bit is found
to to belong to some rule, next dimension is checked from the bit position specified in the node
and this is being done recursively for each dimension. If in some dimension requested bit is not found
in the trie (where it should be by the previous dimension rule), it means that there are no rules
for requested values.
In the previous memory-hungry design,
if rules do not have ranges but only single tuples (i.e. single source/destination IP address and single ports and protocol),
or if all rules present the same range (for example all port rules have the same 1-65535 range),
it can not lead to the described problem with memory usage. The more range overlaps rulset has,
the more memory such algorithm will use - in theory, when each new rule requires interval split,
i.e. when each new rule overlaps with all previous and has additional range (for example when each new rule
is slightly bigger than previous one), it is O(d*W*(N^d)) where N
is number of rules and d is number of directions, W is width in bits of each dimension (actually
it can be hidden by O(), since it is a constant for different set of rules).
So I could quickly complete
flat development,
moved my laptop to HP repair (its disk fails after couple of hours of work) and started to work
more than currently.
Likely if you do not see entry in the blog, that means that I'm too busy at paid work
(or even unavailable at all) to work on my own projects.
All existing AIO - both mainline and kevent based lack major feature -
they are not fully asyncronous, i.e. they require synchronous set of steps,
some of which can be asynchronous. For example
aio_sendfile() requires
open of the file descriptor and only then aio_sendfile() call. The same applies
to mainline AIO and read/write calls.
My idea is to create real asyncronous IO - i.e. some entity which will describe
set of tasks which should be performed asynchronously (from user point of view,
although read and write obviously must be done after open and before close),
for example syscall which gets as parameter destination socket and local filename
(with optional offset and length fields), which will asynchronously from user point of view
open a file and transfer requested part to the destination socket and then return opened
file descriptor (or it can be closed if requested). Similar mechanism can be done for
read/write calls.
This approach as long as asynchronous IO at all requires access to user memory from
kernels thread or even interrupt handler (that is where kevent based AIO completes
its requests) - it can be done in the way similar to how existing
kevent ring buffer
implementation and also can use dedicated kernel thread or workqueue to copy data into process
memory.
It is very interesting task and should greatly speed up workloads of busy web/ftp and other servers,
which can work with a huge number of files and huge number of clients.
I've put it into TODO list.
Someone, please stop the time for several days, so I could create some really good things for the universe.
It was quite easy training, since there were crowds of people,
which usually means, that it is quite problematic to climb horizontal
traverses. But nevertheless I completed several old traces
and started parts of the relief traverse (longer and longer each training,
which is a good sign). At the end I also completed couple of start of complex
traces on negative slope and found an interesting bouldering
(read: lurked in climbing school training which happend near the place
where I had a rest), but failed to complete it since was tired already.
Sauna, shower at climbing zone and a little beer with Gabriel Garcia Marquez book
at home completed the day - yet another good day has gone...
It was first described here,
but I did not present memory usage for that scenario.
After some calculations I found it is about O(d*(N^d)/4 + N) where
N is number of netchannels in the whole tree and d is
number of dimensions we search in. With 5 dimensions (source/desctination address/port and protocol number)
it becomes completely unfeasible.
And after my implementation completed (although it requires some cleaning now)
I've found that memory limit on practice.
Presented design allows extremely fast searching, but it is too memory hungry,
so I will modify it to not replicate sub-trees of intervals in each interval,
but instead to have different kind of trees which will reduce
memory consumption down to O(d*N) instead of above O(d*(N^d))
and increased search time.
So, stay tuned...
I've seen Eugene Grishkovec's 'Po Po' performance
created under expression of Edgar Po works.
Performance is a compilation of many novels written by famous author,
all of them are easily detectible.
Grishkovec plays with Alexander Cekalo, they call it 'fun story about the death'
and it is really so. I would even say that Grishkovec plays second role in that
single-actor monologue performance.
There are some opinions that Grishkovec has star fever, but I would not say so - neither
during his presentation before the performance, nor during the whole act he showed
any signs of it, I managed to get tickets in the first raw, so I saw actors right in a couple of
meters. Even after performance, when we met Grishkovec on the street with his girlfriend
and asked couple of questions and presented a flower he behaved very good.
I like his performances, they are really different from the mos others, but unfortunately
he plays quite rarely in Moscow. I hope I will see his 'Drednouti' ('Dreadnoughts')
performance this season.
Later we got couple of beers in 'Vobla' beer restaurant and moved home. I think the whole
day indeed succeeded, although it looks I caught a cold in addon to my current sick...
Do you have an excellent spatial imagination?
Can you consider yourself and easily operate say in 3-dimensional world?
Likely you can, since we live in such a world. Probably we can think about 4-dimensional world too.
But can you describe to someone 5-dimensional world?
Or can you put yourself into one of the N worlds, where N is potentially millions,
each of which has number dimensions smaller than previous world by one and upto N subworlds in each dimension?
And while you - poor walker - are searching for the shortest way to the destination point
in that brothel, someone looks into your universe through the glas and loudly lough counting
your times clock.
I've just described couple of moments in a single network packet live,
when it enters netchannels subsystem with
this wildcards design.
It looks like it will have complex live, but we will see, since I've implemented
generic multi-dimensional tree hierarchy and will complete searching algorithm
and start testing tomorrow.
And now I want to finish my fantastic narration, since I'm going into
a theatre.
It is extremely 'trendy' book, here
one can find a chapter called 'Office'.
Crap. Complete crap - some kind of cool artistic slightly squeamish flood of mind(less).
One can probably think that it is trendy to read it, that it is cool, it is another
view on our usual life and so on...
Maybe it is right, but I think it is a crap created to show 'how different author is from all
those stupids'.
There are quite a few fun moments though - probably life of the top management really looks
like it is described.
"What about putting the sea or beach or wild nature against a background of a woman".
But why so declamative? Why so depreciatingly? Crap.
Essentially, delaying the disk allocation until a file is large/complete
(delalloc) and then using a buddy allocator in memory to get contiguous
chunks of disk is better than hard 64K+ cluster because it avoids
internal fragmentation and allows much more optimal/efficient placement
than just a factor of 16 reduction in the block count.
This completely confirms my thoughts
about delayed allocation.
But I want to add even more - log-structured file system, i.e. file systems, which
allows to write updated block essentialy everywhere on the disk instead of the place
where previous one was placed, have self-defragmentation mechanism by design - there is no need
to reallocate inodes and move files all over the disk - new version can be put
in new location, which can be carefully selected and can be written together with unupdated part
thus removing fragmentation. Journaling filesystems require defragmentator for that task.
For netchannels wildcard support I need to create set of non-crossing
intervals which will be examined for each packet. It is possible
to put them one-by-one and thus form an array of intervals which can be traversed
using binary search, but there is a major problem when we will add new interval, which
should be placed somewhere inside the existing array - that will force array reallocation
and copy. This problem also applies to hash tables and actually all other table-based
approaches, but does not exist in trees, although tree requres a bit more memory.
both have the same search speed about O(log2(N)), where N is number
of intervals.
Rough code snippet (no return value checks and reference counters update),
which splits intervals into non-crossing ranges:
I think my head will be broken in a couple of moments...
Tricky algorithm should be used there probably to split them up, but I currently fail
to research one, so I use stupid but the fastest - there are only two checks
and one allocation (or two if new interval was allocated in stack)
for the worst case of crossing, when two crossed intervals are put into three
non-crossing ones.
I wonder why Linux does not have FS stress testing and emulation tool
like netem for network,
which allows to perform drop, corruption, reordering, duplication and other
interesting options extremely usefull for protocol testing.
Consider the same tool which works under FS layer, for example as ATA/SCSI controller,
which can perform various message mangling, delaying, reordering and other
features emulating incorrect behaviour of the hardware. It can even send data
down to the real hard drive, thus emulating even various software conditions like
scheduling starvation, irq storms and so on.
I think such tool is a must for fs development and should be implemented at least
in parallel or even before various userlevel stress tools are started to stress the filesystem.
It was day of bouldering training - I tried different parts of
interesting relief traverse and several old traverses over usual holds.
There are quite interesting moments when one needs to place his/her leg above the head to complete
the hold and be able to get next one.
I think I like relif pathes more than any other traces, but unfortunately
there are no vertical traces over reiefs in Skala-city.
Training was completed with campus-board exercises, sauna, shower and excellent mood.
I like it.
I think I will end up with something like
this (handmade paintings with very bad quality).
If you feel yourself having enough spatial imagination, try this picture of the room I have in mind.
If you know some site where I can get an inspiration, related to my interests,
do not hesitate and drop me a link...
I completed to work with plaster - all walls and ceiling are ready, although they require
polishing, which is quite fast procedure, but too dusty, so it will be done slowly.
I've found a way to create very nice corners in doorways using special fabric called 'serpyanka'.
Next step is to start hanged ceiling implementation in the room - I still have not created
design of that monster, but I already completed electricity wiring for additional lights and neon cord.
I also decided to not put neither bath nor shower cabin into my bathroom, instead I will develop
special podium for water sink and create some kind of a wall - it will be hand-made shower cabin 'on the floor',
currently without glass dorrs, but I will add them eventually.
Signals which were requested to be delivered through kevent
subsystem must be registered through usual signal() and others
syscalls, this option allows alternative delivery.
With KEVENT_SIGNAL_NOMASK flag being set in kevent for set of
signals, they will not be delivered in a usual way.
Kevents for appropriate signals are not copied when process forks,
new process must add new kevents after fork(). Mask of signals
is copied as before.
Here is output of the test program signal.c available in archive:
Program registers SIGUSR1 and SIGUSR2 signals
to be delivered through kevent, the former is delivered both through
signal callback and kevent, the latter only through kevent.
That was great time!
I do not exaggerate - I spent perfect time there - although nothing major
happend, but the whole process was not usual (or it was, but I had different shape) -
I completed several old traverses and couple of new starts, finished several
interesting boulderings - just like usual training, but something was different.
I do not know what, but I want more - yes it is addictive.
Since Grange still can not get his,
well something, out of hypo-narcotic state of delirious metaphysical processing
or whatever he is doing instead of climbing, it is possible that regular trainigs
with vertical climbing will be postponed, since it is not always possible to get
a mate in climbing zone, so I will climb for along for some time.
Here
was initial design of netchannels wildcards with appropriate picture.
Let's analyze first problem we strike when trying to implement that design.
It is presentation of wildcard intervals (for example two rules: 192.168.0.0-192.168.0.255
and 192.168.0.10-192.168.0.20) in one-dimensional set of non-crossing ranges.
Our example splits into three ranges: 192.168.0.0-192.168.0.9, 192.168.0.10-192.168.0.20,
192.168.0.21-192.168.0.255.
All those intervals can be easily represented by structures in C language and can be placed
into tree for the fastest access. Each interval structure will have a pointer
to the next dimansion range tree (like tree for ports and protocols), nodes of the last tree of
intervals will contain pointers to the actual netchannels, which parameters (address, ports and so on)
cover selected intervals in appropriate dimensions.
Interval splitting happens in context of netchannel creation and must be protected
against simultaneous access both from softirqs (where packets enter network stack) and process
context, where other netchannel can be added or removed.
In theory all netchannel trees of intervals can be protected by BH-disabling locks - but it is
not nice solution - even if tree traversal itself is very fast and can be done under BH-disabling
lock, there is possibility that interval currently being checked must be split into several
ones, which requires intervals allocations, which should not be perfomed in atomic context.
So I propose simple scheme lurked in Linux socket locking - under BH-disabling lock we set the flag,
that we currently process the tree, so process context will sleep on that flag until it is cleared
and (under the same BH-disabling lock) set it again protecting itself against other process
trying to access the tree. So usual lock will serialize process and softirq process and semaphore-like locking
will be used to serialize process context.
I've ported Common Address Redundancy Protocol (CARP) module
to 2.6.19 API and added INSTALL instructions and also updated README.
Now it just works after reading README and INSTALL files, although much more testing is required.
I only ran it on x86_64 and x86 arches with different failover modes - nodes switch to/from master/backup
mode and appropriate userspace applications (and registered kernel users like connector) are executed
when machines are turned on/off.
More details on configuration and various parameters as long as step-by-step intruction can be found in
README and INSTALL files in the sources.
Feel free to download the sources
and create your own high-availability cluster.
I decided to check if this
benchmark of different threading models is correct.
So I've installed Erlang on my Debian Etch desktop (Intel Core Duo 3.4 Ghz (speeded up to 3.7 Ghz), 2 GB of RAM)
and tried this
Erlang application. It was compiled fine, but failed to execute.
So I've read some FAQs and initial tutorials about Erlang and ended up
removing some bits of initialization, since my version of Erlang failed
to translate string to number, so I hardcoded required number by hand into the program.
I also compiled this (1)
C++ example and this (2)
and performed a comparison.
Basically all those programs do is to create 500 threads and exchange 3000 messages
between them one-by-one.
Results are quite deplorable for POSIX threads and C++:
i.e. C++ application which uses POSIX NPTL threads is about 10 (!) times slower than native Erlang threads,
but if thread start/stop functions are written in assembler (if I did not forget it completely),
C++ variant becomes 2 times faster than Erlang's one, which seems to be either real magic,
or some bug.
I understand that Erlang has completely differet thread model
(i.e. it is part of it's virtual machine, but not system as a whole),
but there is major issue to think about.
Eventually I created own application, which uses NPTL POSIX threads, which
I compared to above
(this (1)).
Above numbers are the best numbers reported by time test, so I put gettimeofday()
inside main() in both C++ and my C versions and run them 13 times.
Here are averaged results:
I'm an author of the
Common Address Redundancy Protocol (CARP)
module for Linux kernel 2.6.
CARP is an improved version of the Virtual Router Redundancy Protocol (VRRP) standard.
The latest protocol to help provide high availability and network redundancy,
it was developed because router giant Cisco Systems believes that its Hot Standby Router Protocol (HSRP)
patent covers some of the same technical areas as VRRP.
Newsforge article about CARP.
Linux kernel CARP
is based on OpenBSD CARP protocol but is not compatible with it, since OpenBSD implementation
does not contain protection against repeated message sending attack.
New release includes compilation fixes for the latest kernels.
Using the same benchmark created by Eric Dumazet to test
epoll, which I ported to kevent, it is possible
to test unix sockets, inet sockets and pipes.
Application has additional posix thread, which writes into number
of preallocated sockets or pipes, main process reads from appropriate pairs.
Each successfull read is counted as event.
Here is pipe result with kevent_pipe kernel kevent part with
2000 pipes:
Some kevent fixes for misteriously lived in the tree errors. poll/select notifications were completely broken,
which I was pointed to by Davide Libenzi.
I do not know, how it could happen, but probably during tree changes from
old one, which contained network and FS AIO, to current I dropped some file
updates.
I've fixed it and also changed a bit logic behind poll/select notifications -
they are called with newly introduced KEVENT_REQ_LAST_CHECK flag,
which if set allows to perform the last check on kevent (call appropriate callback) when
kevent is marked as ready and has been removed from ready queue.
If it will be confirmed that kevent is ready
(k->callbacks.callback(k) returns true) then kevent will be copied
to userspace, otherwise it will be requeued back to storage.
Second (checking) call is performed with this bit _cleared_ so
callback can detect when it was called from
kevent_storage_ready() - bit is set, or kevent_dequeue_ready() - bit is cleared.
If kevent will be requeued, bit will be set again.
Benchmarks compared to epoll() shows sightly better results,
but it can be fluctuation and can be unrelated to that change.
I also created pipe/fifo notifications, but have not aggressively tested it yet,
will do it tomorrow as long as release new kevent patchset with poll/select notifications fix.
Looking into threading benhcmark
hosted on http://shootout.alioth.debian.org I've found
that synthetic threading model (i.e. which has threading capabilities built into language interpreter
like Erlang) works generally faster than posix threading model.
I think I will try to implement that benchmark in C and run different applications hosted there
to compare if NPTL is really that slow.
If it is really so, I will think some more about possibility to create such library.
Obviously I will not start it's implementation tomorrow, but some generic thoughts about how
it could look like, what algorithms could be used and so on... Just a thoughts...
This
ppc32 work is (probably already 'was') in my todo list too.
Just thought, that ppc32 uses round-robin algorithm to update TLB cache.
It looks like it would be possible to use weighted RR algo instead - i.e.
create each entry's usage statistic and flush only entry with the smallest
value.
I always wanted to work with low-level hardware and specially with some
interesting arch. I probably would not want to be a maintainer of some common
arch like PPC, but work with something interesting like LUNA
machines based on Motorola 88000 CPU for example.
Just a though...
No, I am sick and new kevent release is ready.
Actually not. Looking at how much I've eaten and how much
I still want to eat, one can say that I'm definitely ok.
Nevertheless, new kevent release 'take23' is out.
There are several major changes in this release:
new ring buffer implementation in process' memory.
it works similar to this
description, but structure has changed - I removed user's index,
instead there is only kernel one (i.e. index where kernel will put new ready event).
When user calls kevent_get_events() or kevent_wait() requested
number of kevents is copied into ring buffer (if it was initialized using
int kevent_ring_init(int ctl_fd, struct ring_buffer *ring, unsigned int num);
function and kernel's index is increased. Simple. Userspace should itself store pointer to
it's previous index (actually it can be calculated using returned value from
kevent_wait() - it is number of events copied into ring buffer and
current kernel's index, but it is only correct in case of special locking between threads.
wakeup-one-thread flag.
When several threads wait on the same kevent queue and requested the same event,
for example 'wake me up when new client has connected, so I could call accept()',
then all threads will be awakened when new client has connected, but only one of them
can process the data. This problem is known as 'thundering nerd problem'.
Events which have this flag set will not be marked as ready (and appropriate threads
will not be awakened) if at least one event has been already marked.
edge-triggered behaviour.
It is an optimisation which allows to move ready and dequeued (i.e. copied to userspace)
event to move into set of interest for given storage (socket, inode and so on) again.
It is very usefull for cases when the same event should be used many times (like reading
from pipe). It is similar to epoll()'s EPOLLET flag.
Eric Dumazet created special benchmark which creates set of AF_INET
sockets and two threads start to simultaneously read and write data from/into them
to test epoll() and I ported it to kevent (ring_buffer.c
application can be found in archive).
Here are the results:
epoll (no EPOLLET): 57428 events/sec
kevent (no ET): 59794 events/sec
epoll (with EPOLLET): 71000 events/sec
kevent (with ET): 78265 events/sec
Maximum (busy loop reading events): 88482 events/sec
So, kevent works faster than epoll() and it was confirmed by three different
benchmarks (Eric's code (epoll_bench.c and appropriate kevent_bench.c
can be found in archive),
Johann Borck's web server and my benchmarks presented on kevent
homepage)).
Jeff Garzik today joined kevent supporting team.
So there are serious people on kevent side including such heavyweights like David Miller (network core maintainer) and
Jeff Garzik (network device drivers, SATA maintainer). Andrew Morton (I think you heard this name)
included kevent in his -mm tree several releases ago.
It looks like I've completed new
ring buffer implementation, but I have not extensively tested it due to aching head.
Will complete it tomorrow and also implement wake-one-thread flag.
Background bits.
New ring buffer is implemented fully in userspace in process' memory, which means that there are no memory pinned,
its size can have almost any length, several threads and processes can access it simultaneously.
There is new system call
int kevent_ring_init(int ctl_fd, struct ring_buffer *ring, unsigned int num);
which initializes kevent's ring buffer (int ctl_fd is a kevent file descriptor, struct ring_buffer *ring
is a userspace allocated ring buffer, and unsigned int num is maximum number of events
(struct ukevent) which can be placed into that buffer).
Ring buffer is described with following structure:
where unsigned int ring_kidx, ring_uidx are last kernel's position
(i.e. position which points to the first place after the last kevent put by kernel into the ring buffer)
and last userspace commit (i.e. position where first unread kevent lives) positions appropriately.
I will release appropriate userspace test application when tests are completed.
When kevent is removed (not dequeued when it is ready, but just removed), even if it was ready, it is not
copied into ring buffer, since if it is removed, no one cares about it (otherwise user would wait until
it becomes ready and got it through usual way using kevent_get_events() or kevent_wait())
and thus no need to copy it to the ring buffer.
Dequeueing of the kevent (calling kevent_get_events()) means that user has processed previously dequeued kevent and is ready to
process new one, which means that position in the ring buffer previously ocupied but that event
can be reused by currently dequeued event. In the world where only one type of syscalls to get events is used
(either usual way and kevent_get_events() or ring buffer and kevent_wait())
it should not be a problem, since kevent_wait() only allows to mark number of events as processed
by userspace starting from the beginning (i.e. from the last processed event), but if several threads will use different models,
that can rise some questions, for example one thread can start to read events from ring buffer,
and in that time other thread will call kevent_get_events(), which can rewrite that events.
Actually other thread can call kevent_wait() to commit that events (i.e. mark them as
processed by userspace so kernel could free them or requeue), so appropriate locking is required in userspace in any way.
So I want to repeat, that it is possible with userspace ring buffer, that events in the ring buffer can be replaced
without knowledge for the thread currently reading them (when other thread calls
kevent_get_events() or kevent_wait()), so appropriate locking
between threads or processes, which can simultaneously access the same ring buffer,
is required.
Having userspace ring buffer allows to make all kevent syscalls as so called 'cancellation points' by glibc,
i.e. when thread has been cancelled in kevent syscall, thread can be safely removed and no events will be lost,
since each syscall will copy event into special ring buffer, accessible from other threads or even processes
(if shared memory is used).
Crap, it is second time this year already (first one was early spring) and third one
during the last preiod I recall (about 3 years I think), so it becomes
quite bad. I only know how to treat head aching during hang-over (the whole two drugs!),
so I decided to surrender to collegues - I obtained some drugs, which among others can
result in nausea, retching, aching in epigastric area, hepatotoxic and nefrotoxic action,
hepatic failure, encephalopathy and state of coma.
So I decided to drink a lot of tea with lemon instead of that shit.
But there is a positive moment in it - when I have such condition I frequently switch into
creative mode (probably due to increased blood circulation in head), which results
in very interesting talks, stories and the like (you probably do not know, but some times
ago I created world evolution model based on liquid for optical axes ointment usage in
various areas of Earth)...
Right now I discuss in linux-kernel@ why kevent was not a pure FreeBSD kqueue rewrite (my infinite patience comes to it's end).
In a couple of messages I think I will switch and start to write a creatiff.
And I was invited to cinema today... I hate to be sick.
Netchannels wildcard design picture
(modified with some description).
Each dimension of the ruleset is splitted into non-crossing intervals,
which form the tree (AVL tree on the picture). Each interval has a list of rules,
which include part of that interval, second dimension of that rules forms new set of
intervals, which forms tree, bound to the interval in previous dimension.
The worst search time complexity is O(d*log2(N)), where d is number
of dimensions, and N is number of rules.
Theoretical research is over, let's implement this for Linux netchanels.
Actually it is some kind of the Grand Unified Flow Cache, described by Rusty Russel,
netchannels already can host different protocols in the same cache, but only on top of IPv4
currently, since I have not implemented comparison helpers for different protocols.
Each netchannel has initialization and cleanup callbacks and processing function, which is called
when packet has arrived. This set of callbacks allow to create netfilter, IPsec and any other
transformation/encapsulation protocols as long as usual ones. Each netchannels also has data reading
and writing callbacks, which can be used to copy data to/from userspace or for other usage.
Using netchannels it is possible to implement direct hardware-user bridges without any influence
from OS.
Japanese researchers found dolphin with 'remains of legs'.
Japanese researchers said Sunday that a bottlenose dolphin captured last month has
an extra set of fins that could be the remains of back legs, a discovery that may
provide further evidence that ocean-dwelling mammals once lived on land.
A freak mutation may have caused the ancient trait to reassert itself, Osumi said.
And maybe that dolphin just wanted to move to the land and is not an ancient but modern trait?
Probably in that case we would find more and more such a dolphins with different sized extra fins.
Who knows...
It was really hard training, although I did not run a lot of traces
or performed complex exercises. I just did not climb for a long time -
body lost lightness, muscles became inert, blood - slow. But everything has changed -
I climbed for about 3 hours, completed one new trace and couple of old traverses,
and now everything is in pain - arms, legs, back, the whole body rises from the sleep.
I have tired a lot, but I feel myself just bloody excellent!
Tomorrow (Nov 4) Russia celebrates "Day of people's unity", some stupid people want to
change this day into nationalistic and nazi appearance.
Do not participate in idiotic "Russian march" - just be smart.
Move to the cinema, watch a video, play computer games or build an ikebana,
but do not jeopardize your life and lifes of your friends.
If some persons are just idiots, it can not be changed even if you beat them in public,
and some really good people suffer from such stupids.
Be smart, be gentle, be cool.
P.S. Nov 4 1612 irregulars under the direction of Minin and Pozharskiy turned out interventionists from Moscow.
I always thought that having inverted comma (')
and character s after the word means some belongings to that word.
For example with following sentence 'it's property' I meant 'property of it',
but actually it means 'it is property', correct is 'its property' without inverted
comma after 'it'.
So don't be confused - english is not my native language (I spoke with
other person in english only once (although about 4 hours), not counting simple phrases like
'where is the bus station'). Until I find some manual I will try to restructure
the sentences to avoid problematic places.
I plan to implement following features in kevent in order
as they are presented:
new ring buffer implementation.
It will be exactly like existing one, except that it will work
with provided by userspace using special syscall.
This will allow to make all kevent syscalls as so called 'cancellation points' by glibc,
i.e. when thread has been cancelled in kevent syscall, thread can safely removed and no events
will be lost, since each syscall will copy event into special ring buffer, accessible from other threads
or even processes (if shared memory is used).
Actually it is already possible, and it was there from the beginning of the kevent - one can
provide shared memory (or memory used as shared buffer between threads) to kevent_get_events(),
but in this case userspace should manage indexes (offsets where to put new ready events) by itself.
wake-up-one-thread flag. When several threads wait on the same kevent queue and requested the same event,
for example 'wake me up when new client has connected, so I could call accept()',
then all threads will be awakened when new client has connected, but only one of them can process the data.
This problem is known as 'thundering nerd problem'. With new flag it will be possible to mark
some events, so only threads requested to be in the thundering nerd (i.e. those threads which
have not set the flag on given event) would be awakened when event is ready.
FIFO read/write notifications.
userspace notifications. Userspace will be able to register a special event, which can be marked as ready
by some other thread. In theory it is just usual pipe between users, so it is possible (and likely) that I will
not do it.
signal notifications. Unlikely to be implemented by me - I already have huge experience in 'working'
with signal/rt-signal gurus, so additional project will likely mar my karma even more.
First one will be available in the new release (I expect to start working on it either after confirmation
from other developers that our private discussion and my conclusions are correct or after some timeout,
likely this weekend). I will think some more on second point (wake-up-one-thread flag), and if
things will not have major problems, will try to implement this too.
I schedule next release for the next week, likely it's start, but it is possible that it will take much longer.
Enjoy the views of the Moscow center
from small ship called "Chiszhik-2" (small siskin number two). There were only two ships - 'Chiszhik-2' and 'Atlantis'
for us, we decided to select first one, although crew refused to answer, what happend
with the first 'Chiszhik'.
I was a bit optimistic about the fact, that netchannels support
wildcards now - there is some work there (including theoretical aspects),
so stay tuned.
I almost completed
NAT implementation on top of
netchannels.
Of course 'almost' is not counted, but tomorrow I plan to finish it.
It requires connector
for configuration, but there is no userspace yet.
The most complex part was actually wildcards
support for netchannels tree.
Here is my reply to Ulrich Drepper's article about event
handling (Ulrich's text is marked with italic style).
I had a little chat with Andrew Morton on Monday. The topic was the event handling, async I/O, etc.
All the same stuff I talked about in Ottawa at OLS this year (paper and slides on my web page).
Unfortunately not much has happened since OLS. Lots of people said they would be willing and able
to help but life intervened, I guess. I'm no kernel developer but perhaps I have to start taking
this up (this is a threat...).
There is one stream of patches which exist but they are not usable in my opinion.
I'm talking about Evgeniy Polyakov's kevent patches. I send my analysis to lkml but this didn't
result into much (except removing the ring buffer support).
It is not correct.
After ring buffer implementation request, no one wanted to participate in discussion, so I implemented
it like I wanted. After 13 releases I was pointed that 'it is broken' and Andrew Morton was asked to not
include kevent into official releases due to that. I removed ring buffer implementation and started new
discussion on how it is supposed to be implemented where only me and Eric Dumazet participated.
After two weeks of waiting for additional commends I implemented new version, which, looking into description,
I think was not seen by those who requested the feature.
There are more fundamental problems
with his proposed approach but I don't get through to him and it appears as if I'm the only one complaining.
Andrew keeps saying that this is because nobody else understands the topic. Well, let's be a bit more specific.
I'll try again to motivate the case, describe how I think the userlevel interface should look like and then
deduce something about the kernel interfaces. A bit of the latter Andrew and I did today.
There are also network and file AIO aspects to this and then the DMA handling, but I'm restricting myself here
to the event handling.
The fundamental premise is that multi-threaded code is absolutely necessary going forward.
All processors grow horizontally by having ever and ever larger numbers of execution contexts
(cores, hyper-threads, ...). For applications to scale there are two types of approaches:
Parallelize large chunks of code. This is not a suitable approach for small chunks of code
(because the startup, and distribution and collection of results are not free) or for code which is
hardly parallelizable. If code can be parallelized people hopefully will start using OpenMP and similar
methods. Unfortunately not everything is parallelizable in this form.
Parallelize individual requests.
A web server is an example where the former approach is not usable but the latter is.
Usually each request is easy to handle. Even for dynamically built pages the work is usually small.
It makes no sense to parallelize the code. But many requests can be processed in parallel.
We are talking about completely differnet approaches to parallelize application - code and data parallelization.
The former approach works with already separated data, so we have individual requests moved
into different code processing units, so it already requires to split a flow of events.
The whole topic is devoted exaclty to that problem.
So it is logical to achieve parallelism by having many threads (or processes as in Apache).
Such approach, if being used largely, quickly beckomes a bottleneck of the whole system.
Existing scheduling allows fast context switching and serialization could be reduced, but it is
overhead which in badly designed cases becomes the slowest part of the system.
There should not be any 'logical to do' things in system design, instead the maximum performance
should be the main goal, which means removing all unnecessarry overheads from context switches and
serialization. It is only doable when one CPU runs one thread, and that thread has correct
set of tasks to do, so it should not be scheduled when there is no work for it, it should not busywait
until other threads complete it's work.
This is what programs do today. But this is not without problems:
The current network interfaces are not well prepared. If you park all inactive threads in the accept()
call to get a new incoming restriction you get a trampling herd effect if there is an incoming connection:
all threads get woken up. But only one or a few will find that the accept() succeeded.
The same is true for poll() etc calls.
A simple solution - do not park all threads into accept().
Application design should (and I would even say must) not split single dataflow into different users (like
split flow of clients to different users), instead several dataflows should be created, so each thread
would have been awakened only when it can handle the data. It is easily doable with kevents, but not implemented
yet. I think next release will have this feature (by introducing special per event flag).
If the number is requests is high it is a bad approach to increase the number of threads until no connection
remains unanswered. The problem is that this adds a lot of administrative overhead in the kernel and processor
(context switches). If the processing of the requests uses 100% of the CPU time we end up with a higher total
processing time because of the overhead.
If the CPU time is not 100% we are wasting even more. If a thread does not have 100% utilization (and this is
almost always the case) then the threads blocks somewhere, either a system call or a synchronization primitive.
But this means there is CPU time available for the thread to do something else like working on another requests.
So the total system time increases because of management overhead for no good reason: one thread could do the
work needed for two or more requests.
I completely agree.
The optimal programming model would therefore use only as many threads as are needed to utilize the system at 100%.
This means one or two threads per CPU or core.
My point is still the same - only without _any_ kind of overhead introduced by context switches and serialization
problems perfomance of the system can achieve the maximum.
To scale with this fixed number of threads and achieve 100% utilization it is necessary to avoid any blocking.
If a thread has to perform an operation which might block (such as reading from a socket or waiting on a mutex)
the normal call cannot be executed. Instead the thread requests an event to be signaled when the reason for the
blocking is gone and while this is going on it starts/continues working on another request.
This is a model of appealing simplicity: scaling is achieved by adding threads into a central loop.
Add threads corresponding to the capabilities of the machine not the complexity of the program.
All threads are using the same code.
Obviously such approach introduces major problems for naive usage.
Having one thread per taks essentially leads to the situation when number of threads which can proceed with data
exceeds number of idle CPUs which can run that threads, such situation must be generally avoided.
This simplicity in the inner loop does come at a price: code has to be restructured to avoid all blocking calls
and the central loop has to be able to wait for events of all types corresponding to operations which can block
and are avoided.
This is where my proposal comes in. The new event handling mechanism is the bit in the inner loop. We can define
its abilities and properties as this:
We can wait for all kinds of events corresponding to all the possible interfaces which can block
kevents were designed to be able to host all existing event types.
Event notification better be fast. Already today with 1Gb/s Ethernet we can create requests faster than they
can be processed. This is only to increase with 10Gb/s Ethernet or fabrics.
kevent notifications are fast - there are number of benchmarks where they are compared with epoll().
It is also possible to get into account network AIO benchmark, which I created on top of kevents.
Obviously, the interface must be usable in multi-threaded situations. This means multiple threads must be able
to wait in the kernel for event. The alternative is to have one waiting and signal a thread in a thread pool
that an event is available. This is unnecessarily slow. When multiple threads are used it must be avoided that
an unnecessary number gets woken if only one or a few events are available. Ideally we wake one thread per
available event.
kevents are completely thread-safe.
Ideally the interface is also usable from multiple processes.
Is it about having the same queue of events for several processes?
Is it about the way of putting more and more users for the same pipe, when it is possible to create aditional pipe?
Is it about the case, when application is designed so, that it is possible to lose events, and thus they would be
picked by other application?
In any way, one can create a fifo and put all events from the same queue into it, so several application could
read them.
A consequence waking only one thread per available event is that no wakeup must be lost.
This is a new concept which does not exist in the POSIX interfaces and the kernel
(except the level triggered epoll interface). For this to work we have to have a mechanism to tell the kernel
I cannot handle this event, tell another thread. This is important, for instance, when the system call to wait
for the event is canceled (as in thread cancellation). In this case the userlevel function call does not return
and the program does not even know the notification is lost. Therefore the system library wrapper the
wait-for-event system call needs to tell the kernel about the notification loss.
It is not correct behaviour - it is possible to use different queues for different types of events
and bind threads to that queues, so only users interested in given event would be awakened.
Any case when event is dequeued and then put back are error-prone - it is possible than new event can not be put
back due to memory pressure. Any case when something can be lost is just unacceptible - there are
a lot of troubles for those who work with rt-signals and when queue overflows, it is a real pain.
So no new design sould be even remotely possible to have possibilities to just lost an event.
The requirement for such a system call to request additional notifications has itself a consequence.
If the wait-for-event system call returns itself the data for the events instead of just a notification,
then this data would have to be pushed back to the kernel as well. This is a high overhead.
And more problematic: if there is not sufficient memory to store the event data, what to do then?
I completely agree that it is a dead way.
It is therefore better to separate the signalling of an event from delivering the data associated with the event.
The data can be retrieved using another system call. Or: we can use shared memory. Shared between kernel and userlevel.
Or userlevels (plural), see the multi-process requirement above.
If we implement the shared memory mechanism this calls for a ring buffer. Since there can be multiple threads
working on the same ring buffer it must be possible to mark a ring buffer entry as worked on and/or at least
as finished. The bulk of the ring buffer needs not be modified. In fact, it might be best to not allow the userlevel
code to change it at all. This way the kernel care store information in the ring buffer in a form which might cause
problems if the values are changed (e.g., pointers of some sort). Not all the fields of a ring buffer entry need to be
used or even usable by the userlevel code.
There are at least two shared ring buffer implementation, created by me, but I never got any feedback at design
descussion time, and even after it was implemented, it wook quite a lot of time and pressure from my side (and other developers)
to get back the feedback, that it is broken.
Btw, it is possible to use shared memory for kevent_get_events() syscall, so that
ready events put there could be shared between processes. It is even possible to completely implement
such ring buffer in userspace without any additional copies.
Proposed Kernel Interfaces
This is the current state of my/our thinking. We want to use a ring buffer. The memory must not be required to be pinned down.
The memory might only be used among the threads in a process or among several processes. Depending on the needs the application
should allocate memory either anonymously or file-backed. The size of the buffer can be chosen by the application.
To create an event queue we pass the address and size of the buffer to the kernel:
int kev_create (void *addr, size_t size, int flags);
We throw in a flags parameter to be on the safe side for future use.
We need some structure in the buffer. At the very least we need to be able to locate the ring buffer and identify the
elements of the buffer which are currently active. If we are going to implement CPU ID-based cache line optimization
(see next section) we need some sort of mapping from CPU ID to ring buffer entry. And we need to chain the ring buffer
entries to reach only the entries for the CPU. Searching for them defeats the purpose.
Ideally we would split the memory area in two pieces. One piece is the ring buffer itself. This part needs not be written
to by the userlevel code. See above for the advantages. The writable section is the part where the program(s) provide
feedback to the kernel about the entries which have been handled. Here we have two possibilities:
We use only a front and a tail pointer. The front pointer points to the first unused entry, the tail pointer to the
last still-used entry. Everything between tail and front is assumed used even though some entries might actually have
been handled already. Everything between front and tail is free and available to receive new events. This approach requires
little work from the kernel when queuing new events but it means we might run out of space even if there are empty slots.
The latter can be avoided by making it the policy that userlevel code has to evacuate the ring buffer slots ASAP.
The alternative is to use per-ring buffer-entry accounting. Each entry has a flag associated with it. This means less memory
requirement. We can still have front and tail pointers to mark regions about which we have absolute knowledge but the kernel
would also try to look for holes and use those entries.
The alternatives do not differ too much. When moving the tail pointer we always have to have knowledge about the individual entries.
It's just that in the second case the work is all done by the userlevel code. The kernel has an easier life and we avoid possibly
long delays (imagine a huge ring buffer with many thousand entries). Maybe the best argument in favor of the first solution: we only
have one pointer is modifiable state (the tail pointer, the kernel maintains the front pointer). This means we potentially can do
away with the writable part of the mapping altogether. Just pass a void ** value to kev_create()
and maintain this variable in the runtime. This advantage would go away if we need more writable state.
A few words about the memory handling of the ring buffer. I would imagine that on 64bit platforms we can use large areas.
Several MBs if necessary. This would cover worst case scenarios. The key points are that
the memory needs not be pinned down (interrupt handlers can try to write and just punt to the helper code in case that
fails because the memory is swapped out) and
we can enforce a policy where the page freed by advancing the tail pointer are simply discarded (madvise(MADV_REMOVE)).
This can be done implicitly in the kernel or explicitly at userlevel. There is therefore no huge memory usage associated
with the ring buffers unless the program never frees entries. It is all ordinary anonymous or file-backed memory, subject
to existing rlimit restrictions. No need to invent yet more rlimits.
There are two differencies in existing ring buffer implementation:
it does not use special syscall to reserve that memory, instead predefined number of pages is used.
Since kevent initialization syscall was removed, I do not have easy way to allocate required number of bytes,
althoug it is possible to change that.
main difference - existing implementation uses pinned memory (Here I was asked for it),
but in this post Ulrich proposes not pinned memory to be used.
It requires that data copying happens always in process context, so there must be either special
kernel thread (and thus additional context switches in addon to context swich, required to start
actual working userspace process), or syscall.
So userspace already should run a syscall, so that kernel would put events into ring buffer.
In existing syscall kevent_get_event() exactly the same is porformed, only pointer to area where
to put new event is provided as parameter. So it _is_ the same without any special ring buffer handling and additional
interfaces.
Until I completely do not understand what is the point, it looks like it is redundant functionality, since
it is exactly how kevents work right now, but with additional unneded complexity.
To register delivery of events through an event queued we likely need a number of interfaces.
Fortunately one interface already exists: the sigevent structure.
I described all this in my paper but some people keep misunderstanding this point.
The use of sigevent has nothing whatsoever to do with using signals.
It's just a convenient way to request for a whole bunch of existing interfaces that the new method of event
delivery should be used. This applies, so far, to POSIX AIO (which might be extended for networking) and POSIX timers.
As a side note - there is kevent based network AIO, which has much more convenient and intuitevly used
interface than POSIX ones.
Other uses will need new interfaces. For instance, signal delivery could be requested through an extension of
sigaction().
We define a new flag to be ORed into sa_flags and add int sa_kev
to the union already containing sa_handler and sa_sigaction.
None of these would be use at the same time.
My biggest worry in the moment is to find a suitable interface for futexes. I'll think about it once the work actually starts.
I always asked myself, why should we use POSIX, when we have excellent poll() interface?
It is ery convenient, it does not have unneded parameters, it is simple and it perfectly works for our needs.
Anyway, anyone who likes to create new interfaces can use special function in kevent to add them
from your own calls - that is how network AIO aio_send()/aio_recv()/aio_sendfile() work.
Implementation Notes
These are a few notes about implementation details Andrew and I came up with during the discussions. It is not necessarily all usable:
Ring buffer entries can be annotated with CPU ID (or perhaps even core ID). This would allow the scheduler to chose among
the threads which are waiting for events to chose one which is scheduled for the same CPU. At userlevel we have the getcpu()
system call and we are therefore able to pick an entry from the ring buffer which matches the CPU. This means no unnecessary
cache line transfers.
To locate the entries for a CPU the ring buffer could have a hash table or something similar which points to the appropriate
entry. It would also be possible for the wait-for-event system call to return the index of the ring buffer entry for the
current CPU.
I think the memory for the ring buffer should be either anonymous or backed by tmpfs. Nothing in a persistent file system.
We should not in any form or away encourage that people even look at the data in the file. It should be meaningless and
should not survive a reboot.
So in general I agree with Ulrich, although ring buffer implementation in userspace not-pinned memory
is a real questing for me, since it looks like it does not differ from what kevent already provides.
But if we dig deeper, we can see that actually main problem is not technical one.
Things develop from bad to worse - I do not get feedback, try to determine it by myself,
Ulrich proposes already implemnted ideas, votes against kevent,
I try to understand and fix my implementation, but do not get feedback... and so on.