Today we celebrated Yuliana Silich (former Sviridovskaya)
and Perec (Pavel Sherstnev) - they decided to make a party in Dolgoprudny forest
with fire, vodka and shashlik. And they did not lose - it was great time,
although quite hold especially later, but eventually we heated and
it had become quite warm. I met old friends which I did not see
for several years already, some of them were in the abroad, some of them
just did not appear.
Later party continued in Yuliana's house - we drunk as far as I recall upto 3 A.M.
with Mephody - we talked about life, work, future plans and the whole situation -
eventually we concluded that although I work for small salary, but with extremely
cool technologies and this is essentialy work for myself, although
he works with interesting ideas in Dell, but he is just a small,
although quite significant, screw in the someone other's machine.
That was great time with friends, which smoothly flows into New Year celebration.
It was quite new trainig - I climbed on the vertical walls and on negative slope
second time for the last three months. My insurance partner was instructor Anya,
who helped a lot and even tried to push me up on the final trace - but after four
quite complex old traces, combined into series of two in one run,
I was too tired to complete on-sight new one on negative slope
(start is almost horizontal wall, but next part was quite easy on vertical wall),
so I miserably failed several times on start until helped myself with other holds.
It as quite new feelings to climb on vertical walls, but I found that I perform
quite good and make some new interesting movings trained 'on the ground' with
boulderings.
My next training will be only in a week or so - climbing zone is closed for vacations.
I organized Mephody and Ira meeting today - we
spent a great time in "5 oborotov"
restaurant with a lot of old friends - there were
Mephody (btw, his name is Alexander Boykov) and his wife Ira,
Fedor and Ira Schtrom, Perec, Lyasha with wife Olga, white John
(Burnyakov Evgeniy) and me.
As Mephody says, he visits Russia to remove his Europe depression -
everything is so damn good and right, that one wants to blow it up -
and Russia is completely different - dirty streats, bad weather, all with
problems - extremely cool place to live compared to swamp-like table Europe.
It was just bloody perfect training - a lot of new
small and bigger traverses and boulderings, many
old ones, talks with new people, theirs traces...
I did not have such training quite for a while alreday -
there were no negative moments at all. I think life becomes
better and longer with each such evening.
Later I spent several hours preparing to Mephody's meeting -
last tickets check, New Year booking and so on.
Now everything is good. I mean EVERYTHING is just bloody good.
See how monkey screams "Stop working!". I hear it right now.
I start celebration today evening after climbing training
with Schtrom's family (they got theirs keys for new flat), then tomorrow with
ireland mug Mephody, Irina and friends. Friday will be short working day
with climbing training.
Saturday will be devoted to non-celebrated birthdays - I will be in Dolgoprudny
forest with friends, fire and vodka. Sunday will be (started) spent in
"5 oborotov" new year party with friends.
I will be back to normal life somewhere about Jan 4-5.
Jonathan Corbet (LWN editor) has published
an article about kevent in
LinuxWorld, which has reached me via
Linux today news.
As a side note, kevent homepage
got the first place in the Google's rank for 'kevent' word.
it becomes popular, and major thing kevent lacks is
true AIO,
which will be started very soon.
It was a good training - all traverses and boulderings were quite old,
but I started to combine them, so it was hard traces. I almost
completed relife traverse over the whole climbing zone - there is
only one place where I do not know how to move (well, I have some ideas
on that place, but theirs testing will absolutely end up with failing,
which will not be that pleasant since there is only small floor-mat there).
Actually I failed to complete the whole traverse, but I did not stop
at the middle when ars started to tell me that something is not usual,
and they feel itself a bit uncomfortable - probably at the moment
I had quite monster view (wry face due to pain, heavy music from head-phones,
likely I produced some non-human roars trying to stimulate
myself to complete complex or hard movings), since all people in opposite
direction quickly moved away...
Anyway, it was good training.
Climbing is quite addictive and I became a victim.
It is inevitable like New Year.
I've fixed bug, introduced in take19
release and plan to put new version to the wild.
I've also setup new machine with gigabit NIC, so I will test NAT performance
and then release new version.
This is scheduled to tomorrow, and now I am, dirty as pig after yesterday
flat development, going to climbing zone for my regular (almost daily)
dose of good mood and shower.
I've completed my hinged ceiling. It is not 100% ready yet,
but about 90-95% - small board is created,
almost all ceiling seams are filled with plaster, all
parts are combined together. The only things to complete
are: dotty lights setup, plaster filling complete and final
postprocessing (includion the rest of the ceiling).
I never worked before with stucco cardboard, but ceiling looks
really cool - it has two rounded layers with boards, the whole covered
area is about 8-10 square meters, distance between layers and
main ceiling is about 11 and 23 santimeters accordingly.
Hinged ceiling will contain dotty lights over the area
and neon cord in the board.
Next thing is to complete the whole ceiling final plaster postprocessing -
it will include three layers of plaster (and one stucco ground already filled) -
ground, finish plaster which will form the paint and finall fixative colour wax.
This is quite simple work, but getting into account the fact, that I only
perform my flat development each Sunday, this can take some time...
And I'm still in doubts about bathroom - I do not have bricks to
complete my idea about small podium for handmade shower cabin, so probably
I will fill the bathromm with ceramic tiles and then will setup usual shower cabin -
we'll see...
changed ALWAYS_QUEUE behaviour with poll/select notifications - previously
kevent was not queued into poll wait queue when ALWAYS_QUEUE flag is set
added KEVENT_POLL_POLLRDHUP definition into ukevent.h header
libevent-1.2 patch (Jamal, your request is completed, so I'm waiting two weeks
before starting final countdown :)
All regression tests passed successfully except test_evbuffer(), which
crashes on my amd64 linux 2.6 test machine for all types of notifications,
probably it was fixed in libevent-1.2a version, I did not check.
Patch and README can be found here.
libevent is very BSD specific with a lot
of workarounds for epoll, /dev/poll and finally for
kevent,
especially for signals in the first two, but kevent does not require it - this is another
argument in favour of absent unneded sigmask parameter for kevent.
Kevent support for libevent can be further optimized though,
current code adds event when appropriate callback is invoked, while
kqueue
postpones it to dispatching time - they can use it since kqueue() contains
additional array of events, which are considered as changes to existing ones. It is
possible that kevent addition could be postponed too.
All kernel kevent options must be turned on (namely CONFIG_KEVENT_POLL, CONFIG_KEVENT_SOCKET, CONFIG_KEVENT_PIPE).
I did not hack 'configure' to check for supported notification types.
Call me lazy boom slacker.
In the most simple way - it currently only supports signal notifications and
poll for all others, which should be optimized for sockets and pipes.
It is good library indeed, although it is verykqueue()
centric.
And it does not contain any documentation about internal structures (i.e. there is no
documentation for those who want to add new event dispatching mechanism into library),
although there is a good man page for users.
And it is made by OpenBSD developer Niels Provos,
this OS, if you do not know, claims that it has the best documentation ever
(as a side note: no matter how good documentation is, my work with mbuf
in its network stack completely killed _any_ possible wish to
work with its kernel in future :) ).
Fortunately there is a source code for kqueue(), so
kevent poring was easy.
Porting uncovered interesting 'feature' of one of the kevent flags - KEVENT_REQ_ALWAYS_QUEUE,
it was added to workaround applications which can not deal with the fact, that
event can be immediately ready and thus will not be queued (like lighttpd),
in case of poll/select notifications and this flag being set, kevent was not added
into wait queue for given file and thus will never be processed later,
which leads to impossibility to work with this flag and poll notifications.
Another issue is that set of regression tests in libevent
crashes on every event dispatching mechanism I tested (poll/select and epoll) on my
AMD64 test machine, but I have not looked deeply into this issue.
I will fix poll/select notifications and KEVENT_REQ_ALWAYS_QUEUE flag issue
and release new kevent version and library port tomorrow. I will also include
socket/pipe optimisation into library port (not supported right now).
This version contains locking bug which leads to crash on netchanel removal
when a lot of them were created.
Scalability testing.
Script has created 22k wildcard (only one dimension (source address)
is wildcard to save some mem) NAT rules
(16k $rand.$rand.$rand.$rand/255.255.248.0 rules and
6k $rand.$rand.$rand.$rand/255.255.255.$rand).
I.e. it is equal to 22k following netfilter rules:
I will try to start and get some results about
true AIO for that date.
Btw, due to the fact that kevent stalls, its inclusion into lighttpd
official tree is postponed/dropped.
Update: I've put a task to patch libevent
to support kevent into my schedule - site is currently down, but I hope it will be ready soon,
so I can complete this task this week.
This is second resending of 'take28'
patchset - the last one will be sent either saturday or monday.
Due to complete lack of feedback (oh, no, Ulrich Drepper said that kevent build fails
on his machine, but did not present actual files and did not confirm that
it was fixed by 'make prepare' command, so it can not be counted).
Jonathan Corbet has published an excellent article
in the Dec 13 LWN edition about kevents.
It was first time for the last three months I climbed
on the walls and not traverses or boulderings. I held
a girl from DDS (which is closed, so there are enourmous crowds of people
in Skala-city), which actually most of the time were in hinged state,
so the whole training was not that active.
I completed three old traces and found that although
I can do the things, it was not that easy as I expected,
but nevertheless it was good training although easy and short (if number
of minutes I spent on the walls are calculated).
Next time unknown girl will ask me to insure her, I should ask
if she permit me to climb a bit more than 3-4 traces :)
One can add/remove source/destination NAT using
connector (Documentation/connector/ in your source tree).
NAT over netchannels uses two aditional netchannels per dataflow
(input and output from NAT server point of view)
and one main (what administrator sets up using source/destination rules)
as I described previously.
It uses the same tries as other netchannels (userspace or others),
so only one lookup is performed to transfer packet.
Packet processing happens in process context on behalf of special threads,
dedicated specially for netchannel processing, when packet enters
netchannel stack and netchannel has been selected, it is possible
to schedule its processing either to current thread, or to
thread on other CPU.
Netchannel searching is lockless (protected by RCU), skb queueing is proceted
by spinlock.
So, brief list of netchannel features:
multidimensional wildcards support
RCU searching
single multidimensional trie for different kinds of dataflows
dedicated processing threads with possibility to schedule
processing on different CPUs for those netchannel types which are not acked with processing context
userspace netchannel backend (allows to receive packets to userspace), which can be used for:
high-performance sniffers
tun/tap device replacement
packet socket replacement (note, that netchannels steal packets from main stack)
own protocol stack implementaion (from VPN tunnels to TOE)
netfilter netchannel backend (only NAT is supported as the most interesting user, NAT caches appropriate route,
so essentially routing becomes part of the netchannel trie)
Some testing was done in 'emulator', i.e. pre-netchannel userspace multidimensional trie implementation.
Results are here.
This version (netchannels.18) can be found in archive.
it has been also sent to netdev@ for review and comments.
Only because of drink evening and korean stories
I have not completed NAT support for netchannels, but I will do it tomorrow or in a day
(if climbing training will not be that hard).
First impression - they drink. No, they do drink. Even, they DO DRINK.
That's a good news if I will ever visit that country.
He has brought us a box (about 20 small bottles of 0.3 litre) of super-puper
known product called 'Soju' or something like that - it is korean vodka,
which has only 20.1 degrees compared to russian 40, but nevertheless is quite strong.
We drunk a littel and listened a lot of fun stories about koreans
(I'm sorry if it will be a bit unpleasant, but what would you expect
from big white monkeys?).
First of all, theirs language. From russian point of view it is fun.
For example they lived in big hotel somewhere on the sea-side (15 minutes from Seoul),
and every time they entered a lift (in Samsung) some movie was shown -
how many degrees must be in different bows and what are other rules
of decencies...
After several such movies they started to use interesting korean phrases as toasts
in korean restaurants (well, they did drink a lot there).
So, the most interesting (I managed to recall after couple of bottles of theirs 'Soju')
phrases which sounds specially in russian:
'ebazie' - something like 'hello' when you answer a phone call.
'humida' - something like 'to be' (actually it is pure russian pronuncation, used for
phrases like 'that thigs', i.e. 'eta humida').
'ebti' - something like 'next door' - does not require a translation for russians,
we will forget it in a minute (I specially wrote on paper), since it sounds too fun.
So, imagine several big white monkeys sitting in the korean restaurant, which drink
for 'Hello', drink for 'Doors are closing', 'Next door' and 'to be' words, i.e.
one of them stands and starts using grave voice:
Let's drink for the next door.
Collegue has told interesting story about sea products and how they eat it.
In Russia we sometimes exit the city and go 'po gribi - po yagodi', i.e.
for mushrooms and berries, and we eat berries right in the forest.
Koreans do the same with sea-livers - consider a family, which
has a good car(s), which moved to the sea-side from Seoul to get some fresh
monsters - after low tide they get alive ratchets and eat them right on the coast.
If ratchets want to run away and dig into the sand, they have special small spades
- just like russian mushroomers have special knifes.
Another story was about _alive_ octopus being cookied right for them,
so even after its skin was removed and it was sliced, its parts still moved
(even in your mouth). Or some other alive, which cried when you eat it.
They start to drink thursday evening (at least in Samsung), they drink a lot, but automagically friday morning
they are fresh (although each evening they are so tired that can sleep on neighbours shoulder
in the bus), then exactly the same happens firday and saturday (including major drinks
and parties), but sunday they get themself and go shopping (but can sleep in the shop).
It looks like they live completely different lives than we.
I think I would like to visit Korea.
Today we took a film action. Actually not film, but "Usage instruction"
TV show - two girls as presenters, operator, some other people...
I hope I was not screened, but eventually they stopped on my traverse way, so I
couple of time asked them to move (very civilly indeed).
Among this it was quite good usual training - old traverses and boulderings,
one or two new one - I started to try 8b/8c start - obviously I can not climb
such complex trace, but I try to make several start holds - quite successfully though.
Who will win if turned on flatiron will be closed in the refrigerator?
One man decided to check and closed faltiron in the refrigerator,
after about 30 minutes, when flatiron almost completely defreezed
refrigerator, a very strong steam in the inner area cracked something
in flatiron and it had died, refrigerator got the situation
under own control and iced the door.
Draw?
"Who I am, where I am, what is my address, where is exit?"
One day of DHCP packet's life.
- What is galvanic resistance?
- It is rise of the batteries.
Source NAT works in one direction - i.e. it is possible to send
packets from internal network to the outside using address of NAT machine
as a source, but packets are not received yet (actually they are,
but no address changes happens).
Since I got no feedback about three previous releases and one resending,
I started decrementing counter of resendings - when it hits zero, I will
stop resending and move kevent
into maintenance mode without further pushing it upstream and will
concentrate on more interesting work than sending mails into
non-feedbackable black hole called linux-kernel@.
Situation with kevent becomes idiotic from fun - a lot of people
want that functionlity, kevent works faster
than epoll, there are patches for real-life users, implemented
tons of functionality which allows kevent to scale to hell,
and things stopped.
So, if people for whom it was developed do not care, I do not care even more - I created it, I can kill it.
Thanks a lot for those who supported kevent and tried to push it
upstream, but it looks like we wasted our time for nothing :)
crypto session routing (allows to complete single crypto session when
several operations (crypto, hmac, anything) are completed)
crypto session binding (bind crypto processing to specified device)
modular load balancing (one can created load balancer which will get
into account for example pid of the calling process)
crypto session batching genetically implemented by design (acrypto
provides the whole data structure to crypto device, i.e. it is
possible to use acrypto as a bridge which routes requests between
completely different devices, since it does not differentiate between
users, just handles requests)
crypto session priority
different kinds of crypto operation(RNG, asymmetrical crypto, HMAC and
any other)
Combined patchsets include:
acrypto core
IPsec ESP4 port to acrypto
dm-crypt port to acrypto
OCF to acrypto bridge, which allows to run OCF device drivers with acrypto (for example ixp4xx),
requires OCF installed.
I've bought myself an mp3 player since Skala-city's
sound accompaniment does not hold any critics (in my opinion), so I started to climb
with own music in the head.
That was great training - likely because of excellent mood and good music -
although most of the boulderings I made were old ones, but I started to climb
several new ones - quite complex, but very interesting. Eventually I even completed
part of the trace instructors failed to do by using parts of the relief
they dit not found (it looks like some other master created that trace, not local trainers),
but I'm quite sure they will complete it soon too.
I got an extremely super good mood - people likely with great surprise looked
at man who climbs alone and smiles itself even when fail.
That was great time.
My boss has read my blog...
And probably is reading right now - if I will be fired soon,
that will not be a surprise.
At least I will stop doing uninteresting things :)
Changes from 2.6.18 are small enough, but my testing machine is not functional yet,
so I will start testing tomorrow and then will announce release in mail lists.
added falgs to callback structures - currently used to check if kevent
can be requested from kernelspace only (posix timers) or
userspace (all others)
With this release I start 3 days resending timeout - i.e. each third day I
will send either new version (if something new was requested and agreed to
be implemented) or resending with back counter started from three.
When back counter hits zero after three resendings I consider there is no
interest in subsystem and I will stop further sending.
Thanks for understanding and your time.
Not bad and quite long training - today it was devoted to bouldering traces.
I ran quite a few, although most were old enough, but also found couple of new ones.
Found why instructors climb in running shoes but not in special climbing shoes most of the time -
they say that when new trace is created and they can complete it in running shoes, then
likely usual people can complete them in climbing shoes, otherwise we need to be able to climb
7'th cathegory and higher. Well, maybe that is true, most of the time they create simple
traces it does not matter in what shoes one can run over it.
Talked with instructor Anya about climbing areas in Turkey - she wants to open climbing
tourist agency there, although I'm not sure it will have major profits - there are
a lot of good climbable rocks, weather and area are good for that too, but I doubt there
are a lot of climbers all over the world who will visit it again and again - people
tend to like different places.
Interesting article in LWN about
possible solutions for avoiding and fixing memory fragmentation.
This issue was main goal of the
network tree allocator,
which among others has ability to perform zero-copy sending and receiving, which I plan
to integrate into netchannels.
Main difference is the fact, that network tree allocator combines chunks inside
pages including composed pages, while solution proposed by Mel Gorman is to combine pages.
Netchannels itself work - one can configure either userspace
netchannels or kernel ones, which are processed by kernel threads.
NAT is special and thus interesting case - packets must be mangled
and resent upto two times - when packet is received from one
network (before changes), then it should be changed, sent to
likely other network, reply received, changed and sent back to original host.
There are at least two types of rules in such scenario.
One type contains NAT rules which are created by system administrator, another one
is per connection rule, which includes real manipulation data.
At minimum it contains two routes - to original host and
to remote host. Main mangling rule setup by administrator,
for example 192.168.0.0/255.255.0.0 convert to look like source is 1.0.0.1,
can contains several different networks, which are connected
through different network adapters, so input routes are essentially per-connection,
the same applies to destination routes.
So, what is the logic behind NAT in case of netchannels or netfilter?
I will describe how netchannel implementation will work.
When packet first time enters network stack and it hits NAT netchannel,
it will be queued for processing in special thread, where new netchannel
with exact source/destination parameters will be created, including
routes for input and output directions (actually they are both output directions for NAT server
(or only one direction, if connection is to/from NAT server itself),
but one of them is input and other is output from the initial packet point of view,
so I will call it that way), then packet will be changed and sent to appropriate
destination. When next packet or reply will enter the stack, it will be caught
by that new netchannel, since its priority is higher, and eventually packet will
be changed and resent in special thread. Each new netchannel, created by NAT system,
will have either kernel timer attached or special state machine will be created
to track too old netchannels which potentially can be freed (i.e. some kind of
garbage collection).
Since each new netchannel, created by NAT system, has always exact parameters
without wildcards, it would be possible to put them not into trie, but into
special binary tree with faster search time, but such tree must be put into
main NAT netchannel to not mess with other rules, which were setup in trie,
so actually there will be even more lookups than in case when new netchannels are put
directly into the same trie - one lookup for main NAT netchannel and then additional
tree search for subnetchannel (lets give such a name to netchannels created by NAT system).
In theory it is possible to not create new subnetchannel for each new connection -
system can take an advantage of the fact, that several dataflows (potentially mapped
to individual netchannels) can use the same routes, so they could share the same
subnetchannel, but in practice there is no way to select set of addresses which
contain the same route, only possible to get route by addresses.
This rises a question actually, since I want to use exactly one netchannel for server-like
scenarious, when there is only one listening netchannel, which transfers data
to/from userspace, which handles protocol itself. I will think about posible
good implementation of such a feture, but until it is done, new netchannel must be created
for each new connection.
Netfilter NAT work essentially the same.
It is based on famous connection tracking subsystem - each packet, when it enters
the stack, will get connection tracking entry attached to it, when packet within the same
connection (i.e. source/destination parameters) enters the stack, it will
be changed according to that connection tracking entry. Connection tracking entries
live in own cache independent on NAT, but called from NAT rule - so we have exactly
two lookups which I decided to avoid in described above netchannel scheme. Netfilter/connection tracking
lookups are perfomed in fixed-sized hash tables, contained linked lists,
which does not scale neither with number of packets, nor with number of rules.
It was quite short training, only about two hours, but quite hard -
I started new relief traverse part - from the begining to the middle,
where previously I failed, now there are only couple of places on the
whole trace which I can not complete, but I can not run the whole trace
even without that parts - do not have enough power endurance, although
trace's rating is not that high (about 6c).
Also several old boulderings and traverses were completed and that was all.
Now I feel pain in virtually every pipece of my body, and strangely enough
I like that masochistic feeling. Well, all except pain in knees and head,
which I damaged a bit during unsuccessfull failings from the wall,
i.e. all muscle pain is realy pleasant - it is not something really strange,
but if I would have a psychoanalyst it could be probably a thing to worry about.
Let's see how 'commit early' linux motto sucks (well, it only works for major known developers) - I have not received any feedback from
that major release, which included most (all except sigmask syscall parameter) issues mentioned by other people
as long as a lot of additional features.
And kevent (although obsoleted 'take25' release) was removed even from -mm tree.
Maybe I should heat an interest in kevent a bit with this
kevent benchmark?..
I splitted
netchannels
into core processing unit, which selects netchannel from the trie when packet arrives,
queues packet, export add/remove and other operations and other similar core tasks.
Trie layer itself is just an implementation of the multidimensional data structures,
used for fast searching of the final destination point (i.e. netchannel),
it can be replaced with some other implementation easily as long as new
one provides the same three function to add/remove/search operations
and obey the same locking rules, which should not be hard, since netchannel layer
does not perform any kind of special locking in that operations, except that
requirement for search RCU protection and RCU-aware netchannel freeing.
Other layers are just netchannel users - userspace with ability to work with netchannels
using syscall interface, NAT implementation (work in progress yet).
Interesting decision was done for some netchannel subsystems which can not have
backed process - like NAT, when packets must be processed in kernel only. There is
set of special threads, which performs netchannel processing (it is quite simple -
they just calls nc_process() callback of each linked ready netchannel),
processing thread selection can be performed when new skb has been received -
system can select local thread or schedule netchannel processing on the some other
CPU bound thread (for example if netchannel is being processed on different CPU,
cache thrashing overhead will be much smaller if system will move skb to different CPU,
than the whole netchannel to current CPU).
Thus this subsystem only performs RCU protected netchannel selection, spinlock protected
skb queueing and awakening of the actuall processing thread (either userspace parked in syscall,
or kernel thread).
All the real work is performed in either userspace or by kernel users (currently only NAT).
I plan to complete NAT implementation, perform some testing and release new major update this week.
I've completed the heaviest work - weight of the second layer is about 40-50 kg
with about 6-7 square meters, the whole ceiling thus is about 60 kg and about 10 squares
with two levels. Doing that alone was not that simple task,
but it is done already, fortunately without major damages to the neighbour nature.
Now it requires plaster filling, rubbing, board creation and final postprocessing.
Likely I will complete that part this week.
After it is done, I can say, that the dirties parts are done and I can start
finish processing (although I did not even start bathroom development -
did not not get bricks and concrete).
This is phone-made photo of the part of the ceiling completed about midnight.
RCU is not a perfect solution for existing schemas actually.
Lets consider synchronous situation when object freeing does not only
return memory to the cache, but also perform some additional state machine changes -
like putting reference counters and so on - obviously postponing of such work
likely will not be very good solution, since when we work with fast path, we want
the whole sequence be completed quickly, but not fast first half and postpone
other for a long time.
But even pure memory freeing - i.e. returning mempory to the cache, if being postponed
to RCU callback, can lead to very noticeble performance degradation.
Thinking about better and faster skb processing for netchannels I created
simple patch which just postpones skb freeing (kfree_skbmem(), i.e. pure
releasing memory back to skb cache) to RCU callback invoked from __kfree_skb().
This leads to the following performance degradation (receiving of small packets):
Speed is about 2.5 times slower, although CPU usage is smaller too - likely due to the increased
work of RCU tasklet and increased number of context switches.
As a conclusion: using RCU protected lists of skbs for sockets will lead to major performance
degradation.
As a second conclusion: RCU is not a panacea, so its usage will be limited in netchannels.
This tree is never rebalanced by design, its pointers
are always the same except the case when node is added or removed,
thus it is possible to use RCU protection when traversing
or modifying trie structure - just using rcu_dereference()
and rcu_assign_pointer() it is possible to create completely
lockless network stack.
Let's consider trie modification which leads to node removing (when appropriate
reference counter becomes zero).
Trie removal algo deletes pointer
from root node to its child using rcu_assign_pointer() and
all subsequent subnodes and schedules node freeing through call_rcu() mechanism.
Trie traversal, which uses rcu_dereference(), will eventually fail
to dereference next pointer or will complete netchannel processing
and thus will exit RCU protected section, which will lead to freeing appropriate nodes.
Node addition is essentially the same - set of node insertions (if new trie
set of nodes requires allocations and insertions and does not reuse existing
inserted nodes) will be 'published' one by one through rcu_assign_pointer(),
which can be interrupted between different isnertions and lead to obtaining
NULL pointer, which is perfectly fine, since netchannel is not yet added.
RCU callbacks are scheduled for invocation after so called grace period is completed,
in practice it means that callback will be invoked some time after context switch happens
(surely since context has switched, section is no longer executed, thus it is ended),
so we need to protect our execution path from being scheduled away while traversing
the trie with rcu_read_lock()/rcu_read_unlock(), which is normal,
since searching path is executed in softirq context. So when searching
is completed new data is added into netchannels queue and appropriate user
is awakened (either userspace process or kernel thread/work queue for things that
do not have userspace process, for example netfilter checks,
for the latter case netchannel will be put into appropriate per-cpu list, accessible
from kernel thread, bound to given CPU).
Since netchannel processing happens in different context than searching, it means
that grace period has elapsed, thus netchannel and related nodes can be freed, while
netchannel will be accessed from processing context. To solve this problem
and increase performance I design special multeplexing schema which will be used in
softirq/hardirq fast path and allow better scalability by minimizing
number of context switches and data moving between CPU caches.
Addition is simple - while holding mutex, which will prevent simultaneous
deletion, it will allocate nodes one by one and publish them using
rcu_assign_pointer(), since fast path running in different context uses
rcu_dereference() it will either get new value with all above logic,
or get NULL pointer, thus returning without processing.
Addition will not be protected by rcu_read_lock()/rcu_read_unlock(),
since simultaneous deletion is not allowed due to mutex
being held, so sleepeing allocation can be used.
I completed to fill self-leveling floor in kitchen, bathroom and part of the hall -
do not believe the advertisement, it does not self leveling, no matter
how thick the layer is - one needs to level it during filling. Its nature
is about the same as plaster, so it behaves exactly like it, but it is more
fluid. I spent about an hour mixing and filling, and then about 3 hours
to finally level it - get palette-knife and meter-by-meter stating to cut all
non even parts, with some water oiling it becomes quite good.
Thinking some more about hinged ceiling I decided to implement second layer
even although there are no long enough hangers - I will use parts of ceiling profiles
with cut middle part to implement some kind of 'U' hangers - without middle part
it is possible to level ceiling.
It ran for almost two days and about 7.4 milliards of insert/search
operations were completed in 1 million series. No crashes, no failed
searches or inserts, no bugs. It's a win.
I start to combine netchannels
and this algorithm so wildcards get supported.
Thus netchannels become extremely useful functionality for Linux network stack -
it allows to create and test different network stacks in userspace or kernelspace, thus
creating TOE engines without influence on core stack (since netchannels engine runs before
core network protocol hooks are called), it allows to create very fast and scalable network filters
and packet manipulation entities like NAT. Can you ever consider Linux system with one million
netfilter rules? System with million of sockets is already quite hard to get.
I also have some plans to implement userspace network stack
as a library and patch existing web server (likely lighttpd
to work with it - it only requires one netchannel (wildcard of course), i.e. only
one socket in existing notation, and socket and related structures like inodes and files allocation
and freeing and related syscall overhead is quite major limiting factor in existing systems.
Other system usage include packet-socket-like interface - when userspace can select set of IP addressed
to work with without main kernel stack - like BGP/OSPF daemons, TUN/TAP interface, VPN proxies
and much more.
So there are plenty of interesting ideas, which I plan to sublimate my energy into...
I spent the whole day in Leroy merlin
development shop - from early morning to 22-00 without food and drink I selected
ceramic tiles, walpapers, floor covering and so on. I can not say I liked it, but
it should be done and now I have everything to complete my repair process.
Since there are no standard long hangers I will create only one layer for hinged
ceiling. I also bought new finish plaster for decorative ceiling smearing - looking
at what it is and how it should be used I can conclude that I cound do the same
with usual finish plaster (not stucco, but oil) and it could about 5-10 times cheaper,
but let's see what our french friends invented. Also bought transformer and set
of ceiling dotty lights - I plan to complete hinged ceiling this week, and it would
be great if I could do it tomorrow.
Main thing is of course ceramic tiles - women in shop agreed that they do not like and
do not even imagine how grey/blue marbled tiles on the walls (it is actually floor tiles)
with blue marbled floor can look like. Then it was decided that it is man's bathroom, where
everything is possible. This took about 4-5 hours. Well, let's see what my imagination created.
Wallpapers selection was simpler - I almost did not have power to spend there as much time
as I did when selected ceramic, so I bought a lot of wallpapers for room, kitchen and checkroom.
Floor covering was easier - although it is a bit early, but this will force me to faster
complete developemnt - various types from black ceramic granite to blue carpet covering.
Major things I missed are bricks and concrete for small podium I want to create for
shower cabin, this will stop bathroom development actaully, but I plan to get them next week.
Eventually I bought all that and other stuff, which will be delivered Wednesday, so while waiting for
the bus I decided to visit nearly located Auchan gigantic shop -
I just wanted to find immersion heater - but I completely lost in the area of the couple of stadiums
filled with various stuff - nature brought me to the area where strong alcohol drinks were sold,
so I got bottle of rum (they do not have "Captain Morgan", only "Baccardi") and moved away.
While waiting for the bus and drinking some rum I decided that it was not that bad day actually.
I'm back to work on this problem and here are results.
Hardware.
Intel Core Duo 3.4 Ghz (each core runs with 3.7 Ghz) with 2Gb of RAM
(userspace application is singlethreaded, so only one core is used).
Test description.
A number of 3-dimensional pseudo-random rules has been inserted/searched/inserted+searched.
More than 80% of rules contain pseudo-random wildcards (wildcard is formed with
higher bits from 0 to 31 set without gaps). Each rule is a 32bit range,
for example 192.168.0.0/255.255.255.0 produces such a range.
Searching is performed for the non-wildcard rule, i.e. rule with 0xffffffff mask.
Such setup is a powerfull IPv4 packet filter emulation.
Results.
Graph shows number of microseconds needed to insert (including time for memory allocation),
search and sum one multidimensional rule depending on total number of rules.
This graph shows memory usage for trie implementation. Theoretical maximum limit is calculated
by multiplying number of rules, number of directions, size of each rule object (in onde dimension), 32 (number of bits,
i.e. number of rule objects to represent the rule) and 0.6 (20% is non-wildcard rules
and 80% is wildcards with equal distribution of 1-31 bits, so 0.2 + 0.8*0.5).
Each rule contains 28 bytes on x86 of auxiliary information,
which is used to build the tree, it can be reduced though.
Test run for the case without wildcards shows increased memory usage (since with wildcard existing algorithm allocates
nodes only for masked bits, i.e. the biger mask is (0xf0000000 is bigger mask than 0xffff0000 in above notation)
the less memory it requires, thus insert time was increased, but still is constant value about 12 microseconds.
Search speed becomes constant and is about 1 microsecond.
I've started following test to run till Monday - system creates and inserts 500.000 rules with
random data and prefixes, then 500.000 rules without wildcard, after each insertion it search
for the rule, after one million rules has been isnerted, program starts from the beginning (by existing
and restarting). I do not put rule deleting test, since there is no memory to hold all rules
and all intervals used to create those rules (deleting was successfully tested on smaller amounts).
It was interesting training - besides usual traverses and boulderings I created
new type of exercises. Consider the case when you got big hold, where two arms can be placed
and there is some place for feet - yes, while you are in hinged state, you get your feet
over the head to arms (another leg can be setup for friction stop),
put the feet onto the hold and try to move your body so that you sat on the hold -
some kind of lifting your body using arms and leg on hold from the initial place
(somewhere over the head) to the place when you sit on the hold. It is not that complex
exercises when you use two holds - one for arms and different one for leg, but when
there is only obe hold it becomes very interesting - I spent quite big part of the training
trying to complete it using different holds - big reliefs were done quite easily, but usual
big holds required quite a lot of efforts. That was great time.
Something like that (photo from bouldering championship in Skala-city,
man on photo is not me), but with only one hold about 20 santimeters width (like big blue hold rotated to 90 degrees). Final position
is when you sit on the hold, where right hand is.
And I would not call them beautiful. Actually couple of times
I produced really nice sound, but I can not easily reproduce it,
so most of the time (I started about 23-30 and practiced about an hour)
it was horrible wheeze and creak, but I alsmost start to figure out
how to play notes...
I think it will take looong time to produce something interesting, but I
do not hurry.
added high-resolution timer to handle absolute timeouts.
added flags to waiting and initialization syscalls.
kevent_commit() has new_uidx parameter.
kevent_wait() has old_uidx parameter, which, if not equal to u->uidx,
results in immediate wakeup (usefull for the case when entries
are added asynchronously from kernel (not supported for now)).
added interface to mark any event as ready.
kevent POSIX timers support.
return -ENOSYS if there is no registered event type.
provided file descriptor must be checked for fifo type (spotted by Eric Dumazet).
signal notifications.
documentation update.
lighttpd patch updated.
Appropriate applications (ring_buffer.c, signal.c, posix_timer.c) and lighttpd patch
can be found in archive.
I've completed a bunch of kevent updates including major kevent interface changes
and will release new version soon. I've also created some locking changes, which
I call speculative locking (i.e. some fields (ret_flags mostly)
can be updated without locks, but it should not harm the system),
which can be wrong, but results are impressive.
Hardware setup.
Client: Intel Core Duo 3.4 Ghz (run with 3.7 Ghz each core), 2 GB of RAM, sky2 driven gigabit NIC
(Marvell 88E8053).
Server: AMD Athlon 64 3500+ 2.2 Ghz, 1 GB of RAM, r8169 gigabit NIC.
Software.
Both client and server systems have this network timewait socket setup:
Client runs Apache benchmark utility, which was started with following command line:
ab -n80000 -c8000 http://192.168.0.48/
i.e. 80k requests total with 8k requests started concurrently.
Results.
kevent about 7200-7400 requests per second
epoll about 4300-5300 requests per second
With 10k concurrence performance drops to
kevent about 4000 requests per second
epoll about 3800 requests per second
it is possible that I made some errors in that benchmarks though, so I will not publish
results to linux-kernel@ or make awailable from kevent
homepage.
Update: even without speculative lock changes I got the same outperforming results. Kernel is compiled for SMP setup,
all debugging is turned off (in my previous run on that platform I had some debugging options turned on,
and performance was about the same and even worse for kevent), preemption model is 'Low-Latency Desktop'.
Update 2: forgot to say, that web server is lighttpd-1.4.13.
Here is mp3 version (6.1 Mb, renamed to moscow.mp for those who
has corporate filters for mp3 files) of the song from Mongol Shuudan band.
This song and some others are available from band's homepage - mostly punk style.
I have not tried to extract sounds from it yet - I'm in office and I want to get home without
major damages, which are unavoidable if I would try to 'play' right now - there are several people
here, and they even try to work, so my first lesson is postponed to the evening.
I will start to rape and crack neighbours ears later today at home - likely no one will suffer too much,
building is quite new, so there are only about 20 'alive' flats in the
whole building (about 350 apartments).
I think my sounds will be in a harmony with sounds extracted from concrete breaking perforators and corner-grinding machines.
I'm talking about my climbing training, if you have not understood.
Actually there were nothing special from climbing point of view - the same
traverses and bouldering, some progress here and there, bits of mixed traces
and so on. But the whole training was extremely good - when
interesting people smile to you and you smile to them, when you found
new moving on the old trace, when body aching after the holds, when ... -
all that small bits create excellent mood and I become to think that I just
spent perfect time, and actually most of the time (I would even say overwhelming
part of the time) I spent was really good.
Grange has really burnt them brains -
in the brazilian pub he took away a guitar from singer and started
to sing russian song "Moscow" with Sergey Esenin words.
I would say that Grange signs not that bad, and especially when
he is drunk, and song itself is very good too, but the whole situation made it even better,
so he was called 'the most talented man at h2k6'. It is quite possible. One can even find
his records on the homepage (in russian).
So, it was very interesting meeting, as we can see.
Among good mood he also got bunch of various hardware, so I hope he will finally
start hacking interesting stuff instead of slacking his ... fifth point.
Although long-running test was completed without errors (actually
it is not interesting test at all, since 10 rules is a too small set),
after some thoughts I've found a bug in the algorithm - I tried to avoid recursive searching
in case of wildcard nodes found traversing trie, but it seems that it is unavoidable.
I also added trivial memory leak detector, which found another bug in algorithm -
there is possibility to not free next-dimensional roots in some cases.
So I started to change algorithm itself, and although its theoretical ground
is fully completed, practical implementation has a place where bug lives - I know
exactly what is the bug and where is that buggy place in the code, what are the reasons,
actually I know everything to fix it, but do not know how to make a good fix. Yet.
It is a first and the most significant step in netchannels wildcard implementation.
I tested it with different number of dimensions (1-5) and different number
of added/searched/removed rules (upto several millions in each dimension).
Algorithm is simple and fast - my previous
estimation was incorrect - currently 1 million 5-dimensional insertions take about 6 seconds,
1 million 1-dimension insertions take about 1.1 seconds on my test machine.
When I add search/remove operation (one-by-one, i.e. add one, search for it and remove it),
the whole tuple takes about 15 and 3.2 seconds for 5-dimensional and 1-dimensional case accordingly.
I will (probably) create some graphs describing algorithm and how it will be used
with netchannels tomorrow, but if algorithm itself is implemented correctly, it is
the most significant part of wildcards support fot netchannels.
I have started following testing case for 5-dimensional case for the night:
10 rules are added, then 10 various values (actually inserted values, but with 0xffffffff mask)
are searched for, and then all they are removed, let's see how this will end up tomorrow.
After quite empty day (I did not hack, did not work, did not do anything usefull)
I moved to climbing zone and spent there excellent time - there were
several old traverses completed, couple of new boulderings. The most interesting
part was relief traverse - each day I add more and more holds there, although
it is marked not as the most complex trace, I only saw one person who tried
to work with it (except me), others somehow do not like that trace.
unfortunately in Skala-city there are no
relief vertiacl traces, which I'm quite sure would be my favourite ones.
I've started to think about presents.
First of all, present for myself (well, second one actually) -
I think I want to start to make photos, so I decided to get
Cannon EOS 400D. Not sure if I will really buy it, but at least
I think about it.
The best present for myself would be flat repair complete,
but we will see...
So, to touch the perfect, I've subscribed to couple of interesting
livejournals of photo masters.
There are indeed several types of photos
I like - first of all, it is night photos, then probably building/stuff
photos (especially from higher floors or roofs) and panorams.
I have extremely small experience in that area, but will try to learn. Maybe.
This algorithm describes what is hackathon and what they are doing there:
For comparison: Linux network developers conference NetConf
is completely different meeting - hackers sing karaoke songs and seems to not drink at all, I never participated there though
so can not say for sure. On picture core linux network hackers
(David Miller,
Rusty Russell and
not-yet-a-netdev-hacker Soyoung Park) select karaoke song.
I completed 1-dimensional trie implementation and started its wild testing -
random prefix and address mask is created in infinite loop and then
random address is being searched there - it has passed more than 123 millions insertion/searches
and ate about 1.8Gb of ram - each trie node is 20 bytes in size,
which would take 2.29 Gb of RAM if they lived as-is, but due to the nature
of tries, which allows to reuse the same entries for several prefixes, memory usage
is smaller. 100k insert+search operations took always less than 2.5 seconds on my test machine.
Next step, which I will start after weekend, is multi-dimensional trie implementation,
which is actually netchannel wildcard support.
Grange has flown to OpenBSD hardware mini-hackathon.
This small party, also known as 'I-can-drink-more-than-you-can-hack'
(there is no real evidence that they spend there time with profit for the world
or at least for own health), will be spent
in University of Coimbra in Portugal.
Grange will hack there probably Dallas one-wire bus, maybe something else.
Actually I do not know, why he is going there except that to become drunk with
people which live about tens of thousands kilometers away.
I hope he did not forget to get one bottle of 'Russian Size'.
So, let's wish them nice time there.
Ok, after I was requested to change interfaces again to the state they were in the previous release,
I've thrown the last words - I'm waiting for about one week and listen for other developers to comment
on the interfaces, if no one will comment, I will gather all the latest suggestions, implement them and make
the new release - no interface changes after that day.
And I do not care if there will be some problems and/or
some people do not have time to review all my releases or something else - I have time to implement them on per-week
basis (I could create new release every day actually), so there is plenty of time to review for those who cares.
If you have something to say about kevent, your opinion on its inclusion/declining or on my behaviour and so on,
it is the right time to say.
Thank you.
I'm going to implement netchannels wildcards -
it is much more interesting than spending hours on completely void tasks.
struct timespec and possibility to have absolute timeouts in syscalls instead of simple number of nanoseconds.
signal mask parameter for syscalls to allow temporal blocking of the signals while syscall is executed
(almost, but I'm still trying to show that it is stupid idea with kevents).
kevent posix timers support.
allow kevent to mark any events as ready from userspace.
allow to break any waiting syscalls.
various errno return values update.
various interface changes, like addition of some new fields, and removal of otheres, which will not affect
functionality behind that syscalls.
remove some locks in the hottest path - it is indeed needed task. I plan to refactor searching/modifying code so that
there were places in the struct ukevent which are never modified by another than caller users (like
simultaneous changes in readiness state machine and searching part), so there would be possible to only
have one lock in the readiness state machine around ready queue.
Call me a looser, but although I'm strongly against first two entries, I likely will add them
to make at least some progress on kevents.
Crap, I spent 2.5 hours sending e-mails into empty discussion. I wish my day
would have not 36, but 48 hours...
Early morning hack has been sent to mail lists for review.
I have not tested it (except compilation test), and frankly do not want,
but if my flow of thoughts is correct, I will create simple application,
which will use timer_create() and friends to test
kevent posix timers support.
Implementation (and design of POSIX timers API) does not allow to have
KEVENT and SIGNAL or THREAD delivery mechanism simultaneously, at least I
concluded it after quick look into the code, which is probably is not correct.
Let's see if my patch is good or not.
That was quite easy training - I ran several old traverses
and couple of bouldering, also created one traverse, which previously
contained green holds, but most of them were removed or replaced with time,
so it was a new trace, that's all - nothing special this day.
The most pleasant part was sauna and shower which definitely were required
by development related dirtiness.
Now I feel mysef just bloody excellent.
Most of them are implemented in the local tree, I'm just waiting
for ack from Ulrich Drepper:
new command, which allows to mark any events as ready. It is also possible to just
wake up listener of the given kevent queue if number of events requested to be marked as ready is zero.
It allows glibc to notify other event processing threads, which are parked in the
waiting syscalls like kevent_wait() or kevent_get_events(),
that they can process new events in cae of originally awakened thread was terminated.
fixed unsufficient parameters check in kevent pipe notifications (spotted by Eric Dumazet).
return -ENOSYS if there is no registered event type instead of -EINVAL.
add 'flags' parameters to kevent_init() syscall (not implemented yet).
Actually looking at how kevent
makes fast progress and how slowly I get positive feedback and how complex are negotiations
about features and requirement addons are performed with some people, I seriously doubt
there will be enough resources to really complete the task in the way I want and think
is the best for future development.
I have several possible ways to solve existing (stagnating) problem with APIs and feature addons:
just implement all horrible workarounds and ugliness I'm asked about, which I think
will allow quite fast kevent integration into mainline.
continuously perform seems to be endless negotiations with mutual insults in linux-kernel@. This will not force
kevent inclusion, and likely will not end up with it at all, but there is probability (I think it is somewhere
between zero and void), that I can convince people that I am right. Practice shows that it is hardly to
convince people, which do not hear what you talk to them, no matter correct it is or not.
say that 'I do not care and will not even discuss all requested utterly damaged ideas' or something like that
and stop discussion. It can or can not help with kevent integration, but in any case I at least will have enough
time for more interesting tasks than empty talks in linux-kernel@.
For example netchannels wildcard
support and fast NAT implementation, zero-copy
support for netchannels, asynchronous IO, threading library,
log-structured filesystem
The more I think, the more I like third variant...
Something like that.
I screwed it yesterday evening, and can say that although making hinged ceiling alone
is not that simple task, it is quite doable.
Next task is to perform second layer, it will be lower and take more square, but it is
task for another Sunday.
new (old (new)) ring buffer implementation with kernel and user indexes.
added initialization syscall instead of opening /dev/kevent.
kevent_commit() syscall to commit ring buffer entries.
changed KEVENT_REQ_WAKEUP_ONE flag to KEVENT_REQ_WAKEUP_ALL, kevent wakes
only first thread always if that flag is not set.
KEVENT_REQ_ALWAYS_QUEUE flag. If set, kevent will be queued into ready queue
instead of copying back to userspace when kevent is ready immediately when it is added.
lighttpdpatch
(Hail! Although nothing really outstanding compared to epoll).
And special blog edition changelog entry (interesting, do people read my patches to detect
this lyrics? Let's see...):
small three-line verse (I think I created not bad step for non-english speaking person) about one-line optimisation.
Casted by Shakespeare.
I installed slightly used, but still functional (bought on ebay) remote
mind reader, and set it up to read Ulrich's alpha brain waves (I hope he
agrees that it is a good decision), which took me the whole week.
So I think the last ring buffer implementation is what we all wanted.
Details in documentation part.
That is how I spent Sunday.
Weight of that part is about 20 kg, it was challenging to lift and fasten the detail and
it is still in a hung state, since I do not have long enough screws. But I will complete it soon,
maybe even today.
It will be 'take25', but I have not released it yet, since
it is in testing mode right now.
This release will include:
new (old) ring buffer implementation with head and tail in userspace buffer with two syscalls
to wait and to commit events (the latter becomes just an index update).
additional feature of the ring buffer - overflow counter to prevent situation when two threads
are going to free the same events, but one of them was scheduled away for too long,
so ring indexes were wrapped, so when that thread will be awakened, it will free
not those events, which it suppose to free.
wake-all-threads flag - when it is set all interested threads will be awakened, otherwise
only first one.
initalization syscall and removed /dev dependency.
small three-line verse (I think I created not bad step for non-english speaking person) about on-line optimisation.
Casted by Shakespeare.
I wil probably include there userspace notifications.
Did I say that httperf is unfair benchmark?
It definitely is, and actually all existing kevent results are unfair. So... Apache ab benchmarks shows
kevent win, but I do not know how fair it is and how it works at all.
It is also possible, that my epoll based server is broken too.
So, let's state for now, that my benchmarks can not show kevent superiority in socket tests.
PIPE/socket test, created by Eric Dumazet proves kevent works better that epoll.
Johann Borck's tests show that kevent slightly outperforms epoll in web benchamrks.
But not mine.
So I have not gone to climbing zone (and I'm dirty as pig after weekend hinged ceiling setup) and
started lighttpd patch to create real-world benchmark.
Here we go.
I've patched lighttpd with kevent support (although it is limited,
since lighttpd does not support edge-triggered notifications and immediate events).
Server runs on Xeon 2.4 Ghz with 1 GB of ram and e100 NIC.
Client uses apache ab benchmark,
which was started with following parameters:
ab -n50000 -c5000 http://192.168.4.78/
i.e. 50k requests with 5k concurrent requests. Document size is 699 bytes to not make bandwidth a limiting factor.
although it can be just a file/socket rollup.
So essentially kevent socket code behaves like epoll in real-world benchmarks,
as long as allows a lot of other interesting features.
I live in a new district which is only became to be inhabited.
There is a bus station near the buildings and generally there are
a lot of people there and obviously it is a good place for small
shop or stall which would have cigaretts, beer and other such stuff.
And what would you think is the first shop opened there?
It is flavour shop!
I love my country.
Well, it is not new, but it is different from what was described
before.
And although previous variant has higher lookup speed,
it requires too much memory, so it is not acceptible.
New one is quite simple - it will be implemented on top
of trie structures. Main feature of tries is that they are quite compact
(in case of PATRICIA tries) and allow W^d lookup speed in the worst case,
where W is width of the interval in bits (i.e. 32 bits in current netchannels
implementation, which does not support IPv6 due to my laziness and absence of
requirements feedback from the community), and d is number of dimensions
(5 currently), which results in very bad number of 33.5 millions operations
in the worst case of (2^W)*(2^d) == 2^(W+d) (137.5 milliards) rules.
Actually with netchannels it will be much smaller since only two dimensions use
full 32 bit space (source and destination addresses), while others use either 16 bits
(source and destination ports) or 8 bits (protocol number), so it will be
something about 2 millions operations.
Basic idea behind tries is following: trie is a simple unbalanced tree each node
of which represent only one bit (i'th level of the tree represent bits in
i-1 position in the searched value), so it's maximum depth is 32 (in PATRICIA
trie several bits can be combined into the same node if there are no other rules,
which have different bits in that positions). In that case rule 192.168.0.0 - 192.168.0.255
will be presented as 0b11000000101010000000000000000000(192.168.0.0) - 0b11000000101010000000000011111111(192.168.0.255),
so actually trie will be stopped at bit 24 (counted from the most significant one) and the rest will present a wildcard
(0-255 in the least significant byte in host order) or it will contain additional nodes if there are
additional rules which has bits in that range, for example 192.168.0.0-192.168.0.20.
Each direction is a base for own tree.
Each node contains a pointer to the subtree of the next-dimension tree, where its rule starts.
So when each node is examined (i.e. each bit is checked) in the first dimension and bit is found
to to belong to some rule, next dimension is checked from the bit position specified in the node
and this is being done recursively for each dimension. If in some dimension requested bit is not found
in the trie (where it should be by the previous dimension rule), it means that there are no rules
for requested values.
In the previous memory-hungry design,
if rules do not have ranges but only single tuples (i.e. single source/destination IP address and single ports and protocol),
or if all rules present the same range (for example all port rules have the same 1-65535 range),
it can not lead to the described problem with memory usage. The more range overlaps rulset has,
the more memory such algorithm will use - in theory, when each new rule requires interval split,
i.e. when each new rule overlaps with all previous and has additional range (for example when each new rule
is slightly bigger than previous one), it is O(d*W*(N^d)) where N
is number of rules and d is number of directions, W is width in bits of each dimension (actually
it can be hidden by O(), since it is a constant for different set of rules).
So I could quickly complete
flat development,
moved my laptop to HP repair (its disk fails after couple of hours of work) and started to work
more than currently.
Likely if you do not see entry in the blog, that means that I'm too busy at paid work
(or even unavailable at all) to work on my own projects.
All existing AIO - both mainline and kevent based lack major feature -
they are not fully asyncronous, i.e. they require synchronous set of steps,
some of which can be asynchronous. For example
aio_sendfile() requires
open of the file descriptor and only then aio_sendfile() call. The same applies
to mainline AIO and read/write calls.
My idea is to create real asyncronous IO - i.e. some entity which will describe
set of tasks which should be performed asynchronously (from user point of view,
although read and write obviously must be done after open and before close),
for example syscall which gets as parameter destination socket and local filename
(with optional offset and length fields), which will asynchronously from user point of view
open a file and transfer requested part to the destination socket and then return opened
file descriptor (or it can be closed if requested). Similar mechanism can be done for
read/write calls.
This approach as long as asynchronous IO at all requires access to user memory from
kernels thread or even interrupt handler (that is where kevent based AIO completes
its requests) - it can be done in the way similar to how existing
kevent ring buffer
implementation and also can use dedicated kernel thread or workqueue to copy data into process
memory.
It is very interesting task and should greatly speed up workloads of busy web/ftp and other servers,
which can work with a huge number of files and huge number of clients.
I've put it into TODO list.
Someone, please stop the time for several days, so I could create some really good things for the universe.
It was quite easy training, since there were crowds of people,
which usually means, that it is quite problematic to climb horizontal
traverses. But nevertheless I completed several old traces
and started parts of the relief traverse (longer and longer each training,
which is a good sign). At the end I also completed couple of start of complex
traces on negative slope and found an interesting bouldering
(read: lurked in climbing school training which happend near the place
where I had a rest), but failed to complete it since was tired already.
Sauna, shower at climbing zone and a little beer with Gabriel Garcia Marquez book
at home completed the day - yet another good day has gone...
It was first described here,
but I did not present memory usage for that scenario.
After some calculations I found it is about O(d*(N^d)/4 + N) where
N is number of netchannels in the whole tree and d is
number of dimensions we search in. With 5 dimensions (source/desctination address/port and protocol number)
it becomes completely unfeasible.
And after my implementation completed (although it requires some cleaning now)
I've found that memory limit on practice.
Presented design allows extremely fast searching, but it is too memory hungry,
so I will modify it to not replicate sub-trees of intervals in each interval,
but instead to have different kind of trees which will reduce
memory consumption down to O(d*N) instead of above O(d*(N^d))
and increased search time.
So, stay tuned...
I've seen Eugene Grishkovec's 'Po Po' performance
created under expression of Edgar Po works.
Performance is a compilation of many novels written by famous author,
all of them are easily detectible.
Grishkovec plays with Alexander Cekalo, they call it 'fun story about the death'
and it is really so. I would even say that Grishkovec plays second role in that
single-actor monologue performance.
There are some opinions that Grishkovec has star fever, but I would not say so - neither
during his presentation before the performance, nor during the whole act he showed
any signs of it, I managed to get tickets in the first raw, so I saw actors right in a couple of
meters. Even after performance, when we met Grishkovec on the street with his girlfriend
and asked couple of questions and presented a flower he behaved very good.
I like his performances, they are really different from the mos others, but unfortunately
he plays quite rarely in Moscow. I hope I will see his 'Drednouti' ('Dreadnoughts')
performance this season.
Later we got couple of beers in 'Vobla' beer restaurant and moved home. I think the whole
day indeed succeeded, although it looks I caught a cold in addon to my current sick...
Do you have an excellent spatial imagination?
Can you consider yourself and easily operate say in 3-dimensional world?
Likely you can, since we live in such a world. Probably we can think about 4-dimensional world too.
But can you describe to someone 5-dimensional world?
Or can you put yourself into one of the N worlds, where N is potentially millions,
each of which has number dimensions smaller than previous world by one and upto N subworlds in each dimension?
And while you - poor walker - are searching for the shortest way to the destination point
in that brothel, someone looks into your universe through the glas and loudly lough counting
your times clock.
I've just described couple of moments in a single network packet live,
when it enters netchannels subsystem with
this wildcards design.
It looks like it will have complex live, but we will see, since I've implemented
generic multi-dimensional tree hierarchy and will complete searching algorithm
and start testing tomorrow.
And now I want to finish my fantastic narration, since I'm going into
a theatre.
It is extremely 'trendy' book, here
one can find a chapter called 'Office'.
Crap. Complete crap - some kind of cool artistic slightly squeamish flood of mind(less).
One can probably think that it is trendy to read it, that it is cool, it is another
view on our usual life and so on...
Maybe it is right, but I think it is a crap created to show 'how different author is from all
those stupids'.
There are quite a few fun moments though - probably life of the top management really looks
like it is described.
"What about putting the sea or beach or wild nature against a background of a woman".
But why so declamative? Why so depreciatingly? Crap.
Essentially, delaying the disk allocation until a file is large/complete
(delalloc) and then using a buddy allocator in memory to get contiguous
chunks of disk is better than hard 64K+ cluster because it avoids
internal fragmentation and allows much more optimal/efficient placement
than just a factor of 16 reduction in the block count.
This completely confirms my thoughts
about delayed allocation.
But I want to add even more - log-structured file system, i.e. file systems, which
allows to write updated block essentialy everywhere on the disk instead of the place
where previous one was placed, have self-defragmentation mechanism by design - there is no need
to reallocate inodes and move files all over the disk - new version can be put
in new location, which can be carefully selected and can be written together with unupdated part
thus removing fragmentation. Journaling filesystems require defragmentator for that task.
For netchannels wildcard support I need to create set of non-crossing
intervals which will be examined for each packet. It is possible
to put them one-by-one and thus form an array of intervals which can be traversed
using binary search, but there is a major problem when we will add new interval, which
should be placed somewhere inside the existing array - that will force array reallocation
and copy. This problem also applies to hash tables and actually all other t