Zbr's days.
March
Sun Mon Tue Wed Thu Fri Sat
       
2007
Months
Mar

About TODO Blog RSS Old blog Projects Gallery Notes

Sat, 31 Mar 2007

Mikhail Alexandrovich.


Today we celebrated birth of the new man - Mikhail Alexandrovich Silich.
I met with many old friends, although not everyone were able to meet. Growth problem were discussed, examples from our lives and education, fun stories. Mikhail did not keep silence and several times per hour requested parents attention, we sometimes tried to determine the reason by scream analysis - mostly failed.
He is two weeks old, but already hairy and always tried to get out of the wear. I failed to determine to whom of the parents he looks like.
Alexander and Yuliana looks happy :)

/life :: Link / Comments (0)


Thu, 29 Mar 2007

FS userspace emulation tool.


Ok, here are features I will implement:

  • a library which will provide synchronous read and write calls, each one will get sector number and amount of bytes (maybe with vectored variants). Each such call will block until data is really written/read, i.e. it will include cost for sequential write/read and seek into the place.
  • simple FS implementation, which will take a 'storage' size and divide it into fixed size blocks, which will form a bitmap of used/free elements. Then I will use random models to test that FS - work with several big files, with a lot of small files and so on, no directories so far. That will show what disk usage patterns are common for different loads, main goal to find a solution which will have the smallest amount of disk seeks, which are extremely slow.
  • start more complex FS implementation with this set of features as a must

/devel/fs :: Link / Comments (0)


New wine and recent happenings.


'Rose d'Anjou. Henry Le Cuvier' - french dry wine - not tasty, no special spirit, almost water. Will not buy.

Wine which helped to crack 2 of 9 Jenkins hash rounds - 'Sunrise Merlot' by 'Vina Concha y Toro' - Chile red dry wine - extremely tasty thing, will buy it again if there will not be any better variants.

I did not write for my climbing trainig progress so far - I do climb three times per week about 2.5-3 hours each, still train alone, sometimes with weights - mostly compelx boulderings and traverses on the second set of shields (1 meter high) - that make me tired as hell, which is exactly the purpose - every piece of body (even head brain some times, although that happens extremely rarely) need to work I think.
Tried (although I did not want to change anything too much) to change life style (as Grange call it) - stopped after initial steps - call me lazy boom slacker.
Found, that I can do only three things after training - to do nothing, to eat and drink and to do somthing I will not tell you. Everything a lot.
Feel myself extreemly good.

Had a talk with my boss and company CEO - such talks never happen when people want to thank you or rise a salary (it rises once per year with smaller than inflation rate), so I got a bunch of shit on my head. And I needed to lie, which I never (hmm, almost) do - but how can I tell them that the most interesting things I did for the last several years were my own projects, which did not affect (almost) my main work - I said, that although I really do not like what I do, it must be done and so on.
Think about doing for my own again - but there are some problems (as usual)...

/life :: Link / Comments (0)


The 2007 Linux Storage and File Systems Workshop.


LWN article and wiki notes.

Interesting notes:

  • extended attributes must have variable lengths to satisfy various usage cases (crypto, ACL)
  • fsck is evil, it can be fixed by changing meta-data location to eliminate seeks
  • disk encryption should not depend on external subsystems like SELinux
  • stackable filesystems work badly when underlying filesystem in the stack is modified directly
  • ShadowFS - way to use some kind of write-anywhere-filesystem-layout - when data is being written into the file, old data is not owerwritten, but instead write happens into different area on disk, so system is never in inconsistent state if there were some errors (power failure, shutdown and so on
  • a short description of several filesystems (NFS, GFS2, OCFS2, DualFS) and eXplode - bug searching tool, which is known to help finding bugs in every Linux filesystem recently
As an interesting short-time task I started to implement a userspace model of the hard drive - it will map the storage and have different access models for read/write/seek disk operations.

/devel/fs :: Link / Comments (0)


Wed, 28 Mar 2007

Breaking Enigma code.


Or cracking Jenkins hash for fun and profit...

One man. Six days. One bottle of wine. One bottle of beer. Enourmous amount of tea cups. Eighteen sheets of paper. Two rounds of Jenkins hash.

As a gain:

  • library for polynomial arifmetic calculations in different Galua fields
  • refreshed knowledge of polynomial algebra
  • refreshed knowledge about Galua fields (CRC, Rid-Solomon codes)
  • new knowledge on boolean algebra transformations
  • some knowledge on differential crypto analysis
History.
Day one - cracking first round of Jenkins hash.
That was pretty easy from calculation point of view (i.e. how to select i'th bit based on knowledge of (i-1)'th or (i+1)'th bit), but it was not a main goal.
I wanted to solve the problem from algorithmistic point of view, i.e. not to find a single solution, but a generation law. I started from simpler task - to solve following equation:
(A + X) XOR X = B
Day two - theory.
There are two possible ways of solving above equation - either to present logical (bitwise) operation in Galua field of 2^1 (GF(2^1)) - XOR - in ariphmetical field (GF(2^32)) or present sum operation in GF(2^32) as a bitwise operation in GF(2^1).
There are different ways to do it.

Day three - sum as a bitwise operation.
It can be presented using quite simple form:
Ai + Bi = Ai xor Bi xor K(i-1), where
Ki = 0, i = 0,
Ki = (Ai and Bi) or (Ai and K(i-1)) or (Bi and K(i-1))

Xi means i'th bit of X
I got recursive formula for simple sum, which was not what I wanted, so I started to solve above logical equation using different models.
First one - polynomial algebra - sum and bitwise operations are quite simple in polynomial algebra, but there is a serious problem - bitwise operations must be applied only to normalized polynomial form, since it looses information when drops overflow of the order. This limitation is nothing for comupter program, since it is possible to normalize polynomial form before doing bitwise operation, but it does not allow to move on in algorithm solvation.
So, next step - transform bitwise operations into GF(2^32) operations.

Day four - boolean polynoms.
Boolean polynom is a polynom which result and arguments can only be either 0, or 1.
For example XOR operation can be presented as following polynomial form:
A XOR B = X*(3-X)/2, where X = 2*a + b
Looks exactly what I wanted - to present all operations in polynomial form, but there is small problem.
After I expressed all logical operations in the simple equation above ((A+X) xor X = B, where each operation (sum in GF(2^32) and XOR) is transformed into boolean polynoms) in polynomial form, I got system of equations of 27 order - it is not what I expected to solve for simple equation, so this step was dropped too.

Day five - give up?
Not so fast, if I can not solve that problem in algorithm form, let's find at least one sulution for all given inputs.
Here I created a library for polynomial sum, subtraction and normalization (which allows to work with numbers of essentually any order as a bonus, i.e. it is a sum and substraction in Galua field of any order), and started with first round of Jenkins hash cracking.
I ended up with code which gets as input two 32 bit values and hash result, and returns third 32 bit value which if being hashed in first round of Jenkins hash produce required hash value.
That was simple, but then I started second round. (Jenkins hash has 9 rounds).
I made the main mistake here - I started to calculate second round value based on initial inputs (two of them are under attacker's control), but for correct solvation I needed to make a trick.

Day six - break second round of Jenkins hash.
Main trick in Jenkins hash is that it uses value calculated on previous round as input parameter for current round, so after some mathematical transformation it is possible to change variables in the hash equation to work not with initial values, but with some initial values and result of the previous round.
Since each round's result only depends on its input values (some of them can be initial ones, others can be calculated on previous rounds), it can be done.
This idea was in my brain from the beginning, but I could not formulate it into something which can be used, and today sitting in the bus, which moved from home to Moscow to my paid work office I finally drew it (the latest sheet of paper).

Now I have a program which calculates two inputs for given input and required hash output for two-round Jenkins hash.
It is quite simple, but the whole process was very interesting itself.

Further cracking is not impossible task, but complexity increasees with each round (although not exponentially).
I absolutely do not regret about time I spent solving this problem - that was fun.

Interested reader can find a set of photos of brainstorming of this problem in gallery.
And a solution:

Two rounds of Jenkins hash crack solution

Example:
old a: 15e28f3a, b: 1cb4ed1c, c: 7edcd3a0, h1: 70bc78e4
new a: 15e28f3a, b: 434c39d0, c: d28fc0ec, h1: 70bc78e4
As you can see, $a parameter is the same, $b and $c are under attacker's control, all three produce the same hash value $h1 of reduced to two rounds Jenkins hash. $b and $c were calculated based on $a and $h1 knowledge (or actually $h1 can be selected randomly and $b and $c can be selected to produce it again and again, salting with random value will not change the picture).

/devel/math/hash :: Link / Comments (0)


Fri, 23 Mar 2007

Jenkins hash analysis and conclusion.


There were several attempts to use Jenkins hash for socket lookup functions instead of old XOR hash, but I was always against that.
My objectinos were based on the tests I ran during netchannels hash selection process.
It especially becomes visible in the following usage case - we use the same constant IP addresses (two 32 bit values) and one constant destination port (16 bits) and variable source port (16 bits), so essentially we hash 3 32 bit values.
It can be done for example using following scheme:

 hash = jhash_3words(saddr, daddr, (dport<<16)|sport, random_seed);
where random_seed is selected early.

Jenkins and XOR hash results

This graph shows chain length on X axis and number of elements with such a lentgs on Y axis.
As you can see, there is a significant artifact with about 3200 elements in some hash chains.

My first (and incorrect) impression was that Jenkins hash has some distribution problems, but further analysis showed that it is not true. I performed distribution checks in full 32 bit field and found, that it has uniform distribution as long as XOR hash.

Problem appeared when I got 16 bit distribution (size of the hash table is 2^16 elements), while XOR hash was still uniform.

Further analysis showed that Jenkins hash itself does not have distribution problems, but they appear when its folded into hash table size boundaries.
This theory was also confirmed by different hash tests (new Jenkins with rotations and RIPEMD).
XOR hash does not have such a problem, because it uses (u32 ^ u16) as one round, which results in the uniform (it is not correct to call that distribution uniform as is, but only getting into account that u16 values used in tests were uniformly distributed) distribution inside F(16), which does not suffer from hash_size boundary folding. Since XOR hash has 3 rounds, only one of them (xor of the final u32 values) will suffer from folding, but tests where is it can be determined for sure use constant addresses, so problem hides again.

So, I was wrong, Jenkins hash does not have problems as is, they appear only because of folding into hash table boundaries, which is unavoidable for general-purpose hash.

/devel/other :: Link / Comments (0)


Thu, 22 Mar 2007

Kernel with multidimensional trie instead of socket tables boots!


Single trie is created for the following protocols:

  • IP (AF_INET) sockets
    • TCP established sockets
    • TCP listen sockets
    • TCP timewait sockets
    • RAW sockets
    • UDP sockets
  • Unix domain sockets
  • Netlink sockets
Lookup methods as long as insertion/deletion ones differ only in key setup part, although insert/delete methods perform some additional per-protocl steps (like cleaning private areas in netlink).

Patch itself contains horrible ifdefs and hacks, but it works (somehow) - my system boots and I can log in over ssh and run tcpdump, although it crashes when I connect from that host.
I think it is a win.

I'm cooking up a patch, which will be sent to netdev@ in a couple of moments, with link to trie design and userspace tests.
If (and only if) network developers will decide that idea worth implementation, I Will proceed and clean things up (and change tons and tons of code all over the place).

/devel/networking :: Link / Comments (0)


Multidimensional trie implementation for Linux sockets.


Ok, I've implemented Unix domain sockets (both bound and private namespace), but kernel still does not boot - udevd does not create appropriate device nodes in /dev, since it never receives any single netlink message - just because I broke netlink broadcasting.
Netlink sockets were hashed y pid and protocol values, but they were also placed into additional per-protocol broadcast lists (mc_list). Since I removed appropriate entry from socket structure, I need to remove appropriate list from netlink table, so right now broadcasting does not work.
I need to think how to implement netlink broadcasting correctly without changes in the socket structure.

Upd: I need to return one liststructure back to implement correct traversal of all sockets - for example for /proc and netlink statistics, I of course can use the same approach current code does - for example TCP statistics code runs over the whole hash table and gets info about each entry, so I can run over all trie too, but I do not like such approach, so I will return one hlist_node in sock_common structure (or will use list_head, which is simple to operate), which will form a list of all sockets for given protocol. Netlink in turn will has per-protocol lists to implement broadcasting.

/devel/networking :: Link / Comments (0)


Wed, 21 Mar 2007

Climbing evening.


In a masochistic^W^W^Wself-improving process of climbing on negative slope with weights on arms and legs I did some progress - basically I tried to move about 4-5 holds up over one trace and then move down over another trace andthen repeate it until I have a power - this ends up with virtually any piece of body aching as I struck several times with a train, or at least with a wall.
Some times I ask myself, why do I do it - why I spent several hours with such loads without significant (or at least somehow largely noticeble) result? It does not look too cool, it does not end up with better climbing results (at leat I can not test it since I climb alone), instead I could seat in a cafe, drink a tea and talk about something...
But I think I know the answer - I sublimate (hmm, loking into dictionary for that word I did not find appropriate translation, I meant somthing like 'change' or 'replace'), and I do it for myself and likely will not change anything at all.
Does above text sound strange? Yes, it is.

/life :: Link / Comments (0)


Unified socket trie and booting.


Ok, it seems to be somehow completed - at least it works for netlink sockets, but my system does not boot, since I use device mapper (LVM), which requires special nodes created in /dev/ by udevd, which does not start (at least in Debian Etch), since Unix sockets are not supported (yet) in my unified cache.
But at least it does not crash (yet).
I will try to implement Unix domain sockets today before I will go climbing, but I'm not sure I will have enough time (I need to complete some paid job work first). When Unix sockets are ready, my system will boot and networking will start, and only then real troubles will appear.

/devel/networking :: Link / Comments (0)


Graphic tablet.


I've just bought a simple Genius G-Pen 450 for my sister.
Whlist it is here, I've managed to make it working in Debian Etch with "wizardpen" driver. Although movement and pressure level is not very convenient in Gimp, I think it is a matter of settings.
Marina uses windows and she is not a professional artist, so this simple model will be quite good for her skills.

/life :: Link / Comments (0)


Tue, 20 Mar 2007

My brain has exploded.


But initial part of the multidimensional trie lookup algorithm has been added into TCP, UDP, RAW and netlink sockets.
They share the same trie with 160 bits search key (128 actual data bits).
Tree does not compile yet, but there are only couple of function missed - raw sockets lookup, time-wait socket insert/delete, and port free function (read: old inet_put_port()). First two tasks (raw socket lookup and timewait sockets) only need to a simple wrappers to create searching key out of appropriate structures, like:

	u32 key[5] = {tw->tw_rcv_saddr, tw->tw_daddr, (tw->sport<<16)|tw->dport, 
		(sk->sk_bound_dev_if & 0x00ffffff) | (IPPROTO_TCP << 24), 0};

	err = mdt_insert(&mdt_root, key, sizeof(key)<<3, sk, GFP_ATOMIC);
	...
	err = mdt_remove(&mdt_root, key, sizeof(key)<<3);
Code has terrible number of the ugliest set of ifdefs, it does not support procfs and netlink statistics (due to absence of the code to get all sockets), btw, the latter traverses over the whole hash table to get socket info, which is way too unoptimal, it is likely too broken, and (as it does not even compile) I have not even booted it, but I think that what has been done is a serious task.

/devel/networking :: Link / Comments (0)


Fake interrupt disable/enable calls on x86.


Linux uses fair sti/cti instructions to disable/enable interrupts on x86, since it is likely one the most frequently used instructions (since they are essential part of irq-aware spinlocks).
PPC64 for example does not use it (as far as I know), but instead they have a flag, that interrupt was fired when it is supposed to be disabled, so subsequent enable will perform some steps (real disable).
It looks like the same technique will be used by x86 in a near future.

/devel/other :: Link / Comments (0)


Socket lookup algorithm implementation.


You know, it is quite simple to create something new, it is even simple to create something new which will be better than something currently being used, but it is way too complex to replace old one with new one.
I've created a patch which replaces established socket hash table with mdt trie, and it somehow works, but then I suddenly found how many things can be moved there too. For example time-wait sockets (I do not even know why they are placed into own hash table, probably to not to pollute established hot path) and other protocols.
So I made a serious decision - I just removed (if special config option is set) hash related members from socket structure. Try to do it in you setup and enjoy amount of work...
Let's see how this will end up and where I will stop this stupid task (read: how strong is my stamina).

P.S. If I will manage to complete it today or _very_ soon, that will be not even cool, but unimaginable cool.

/devel/networking :: Link / Comments (0)


Mon, 19 Mar 2007

Climbing evening.


I've created new type of my training - since I climb alone for some time, I decided to try doing it on negative slope - I climbed up over one trace for about 4-5 meters (where I have permission to climb without insurance), then move down over holds from another trace, then change it at the bottom to the next one, and so on until I have no power. Then I got some rest and continue with different set of traces in negative slope sector.
That was exhausting, but that was a main reason to try.
Next time I will do it with weights.

/life :: Link / Comments (0)


Sat, 17 Mar 2007

Celebration increases.


Alexander and Yuliana have just become a parents - yesterday they wanted to meet with us, and just about one hour ago a small boy was born.

My congratulations!

/life :: Link / Comments (0)


I congratulate you with St. Patric's day!


Although I'm not an Irish, it is indeed a good day.

/life :: Link / Comments (0)


Fri, 16 Mar 2007

Climbing evening.


Ok, since Grange can not climb for some time, I train myself - so to make training a bit more complex last several days I trained with weights on legs and arms - that does make training more complex and interesting - I try to do the same traverses and boulderings I did before, but this time with weights - that is very fun.

Got my "Games theory" and "Graph theory" books - started to read first one.
It even contains some homework tasks, which I will likely post here (with my solutions) in a new section, so if you think games theory and related economic in particular and decision making in general tasks are interesting for you, stay tuned.

Tomorrow we will celebrate missed Wijo and Fedor birthdays - I will meet with old friends, which I expect to be very fun. I did not see some of them about a half of a year.

/life :: Link / Comments (0)


Asyncfd by Davide Libenzi got ring buffer.


Implementation allows to post events into userspace ring buffer.
There are several disadvantages though:

  • it can not be called from interrupt context.
  • there is possibility to lost events when ring is full, since events are posted unconditionally.
  • it has redundant fields in API.
Briefly saying - it goes the way kevent moved, but with additional problems.
For example I do not see how to solve second problem from above list, since kevent has (had) own queue of events, which were copied from the queue when userspace updated indexes, so if userspace ring is full, they would not be lost, but existing mechanism does not have such a queue at all, events are just blindly copied and thus can be lost when ring is full.
First problem must be solved too to support tricky usage (as Davide mentioned in patch description, USB calls might enp up calling it from interrupt context).
One of the problems Davide solves either by using locked userspace memory (which is a big no-no), or by using kernel buffers and read(), which is unconvenient and just breaks the whole idea of ring buffer - having multiple events in userspace without additional (only waiting) syscalls and having syscalls as thread cancellation points without additional structures shared between threads.

Timerfd implementation will likely rise the same questions from Ulrich Drepper as my kevent timers - they are not needed, instead POSIX timers support should be extended (although I always provided two patches - for POSIX timers and own API, maybe Davide will use both too). But POSIX timer events become ready in softirq context, so it is impossible to use above ring buffer implementation for them.
Timerfd code also uses long type in structures shared between userspace and kernelspace, which does not work on x86_64 when userspace is 32bit.

Yes, I know, I'm a bastard moron, since I write only bad things in my blog, but hey, I was not in Cc: list :)
Actually I'm glad people started to work in that area after kevent, that means that my ideas were correct and eventually Linux will adopt a good solution for that problems.
I think Davide will resolve all issues quickly.

/devel/kevent :: Link / Comments (0)


Super optimized multidimensional trie implementation.


After morning tea and quite interesting optimization lookup for 96 bit key on 32 bit arch takes about 90-100 nanoseconds! This basically means that my multidimensional trie implementation outperforms hash table in any workload with any hash table size. But I need to admit, that it is random values (three 32 bit values without zeroes in any byte, the same are used for hash table tests), so things can differ in real life.
Memory overhead is about 97 Mb.
Test was performed in userspace.

Update: and the same speed (upto 110 nanoseconds) and memory overhead (about 98 Mb) can be achieved for 160 bit keys - so they can include bound device index and protocol number.

/devel/other :: Link / Comments (0)


Thu, 15 Mar 2007

Multidimensional trie vs. hash table benchmarking.


This time I generated 2^20 elements and inserted them into trie and hash tables with different sizes and with different hash functions.
Lookup was performed for the same elements.
This test allows to eliminate rand(3) influence in lookup performance and allows to see how badly actual data is distributed in the hash tables and trie.

First data - distribution of elements in different layers of hash/trie - each point shows how many lookups are needed (x axis) to get one of the 2^20 elements.

Lookup level distribution

As you can see, mdt trie can access total majority of elements in 3 lookups, while hash table with 2^20 elements has access 'length' of about 1-3 elements per hash chain. With decreased hash size peak is smoothed and overall length of the chain is increased as expected. Number in filename in the upper right corner (like /tmp/jhash_order_19.dist shows order of the size of the hash table).
I need to say, that mdt trie used in test has maximum access 'length' of 4 by design per 32 bits. This picture shows numbers only for one 32 bit entry lookup.

Next data is hash table/mdt trie scalability, i.e. time per one lookup depending on size of the hash table. MDT trie has constant access time showed for reference.

Lookup scalability depending on hash table size

Conclusions.
As you can see, for 2^20 elements in the system, mdt trie performs better than hash table with 2^17.5 elements and about 2.6 times slower than 2^20 elements.
This tests were run in userspace, so increased memory usage (three tries ate about 285 Mb of memory) and per-4k tlb miss can play significant role in benchmark.

So, I as author of the trie implementation, can confirm, that it sucks compared to properly selected hash table as is. But getting into account its additional benefits like lock-less traversal and constant upper bound of access time, it might worth to have.
I will cook up a patch tomorrow and send it to netdev@ for review with link to this analysis.

/devel/other :: Link / Comments (0)


Tue, 13 Mar 2007

signalfd/timerfd patches by Davide Libenzi.


Ok, some time after I posted first version of eventfs, Davide Libenzi has posted his version for signalfd system call (created originally by Linus Torvalds), which ended up in a big discussion and subsequent three releases, which now includes code for timers too. My implementation is smaller and uses completely different idea - signal file descriptor is processed before signal is posted into signal mask, so it eliminates some issues and simplifies code.
My implementation does not require additional steps to run in your code (like blocking of the signal to be delivered through file descriptor as in Linus' variant), and nevertheless I got zero responses on it although a lot of people including Linus and Davide were in Cc list.
I call it NIH syndrome and it does demotivate me from doing anything related to generic Linux code.

I got that news via LWN:kernel news line.

Eventfs and kevent projects are officially closed. I want to specially thank David Miller, Zach Brown, Johann Borck, Jonathan Corbet and all other people for theirs time.

/devel/eventfs :: Link / Comments (0)


New wine.


My local wine shop does not have products of neither 'Koktebel', nor 'Inkerman' wine factories, so I got something really foreigh - Chile wine.
'Carmenere' by 'Aresti Chile Wine Ltda' factory and 'Sunrise Merlot' by 'Vina Concha y Toro', both are red dry wines.
The former I already opened - well, it has very interesting taste, but I would not say I like it very much - it is too sour dry wine.

It's time to start working...

/life :: Link / Comments (0)


Mon, 12 Mar 2007

Climbing evening.


Climbing in fetters last friday did not get without consequences - today it was quite hard to climb, so I only completed couple of traverses and then tried old traces. Simper one (6a+) I completed without problems, but next one was a bit more complex (6b+), which I helped a bit with wrong hold. Next one I tried 7a, but arms were already tired, so I failed miserably several times. The last one was 6a+ on-sight, which I failed at the end - shame on me. Although it was quite short training, my body aches virtually in every piece.

And then I was told that my style of life is fundamentally broken and something must be changed and so on. I'm thinking...

/life :: Link / Comments (0)


Optimized trie implementation and performance.


I've completed implementation of my optimized trie design and ran several performance tests.
But for the begining let me introduce some test and implementation details.

Trie test one.
1. Allocation.
As I described (link above) in design, each node can cache a value instead of pointer to the next level leaf, so for the maximum performance trie must be designed to have keys smaller or equal to pointer size.
Since IPv4 socket lookup uses three 32bit values (source and destination addresses and value combined of source and destination ports), the best would be to host three tries each of which is indexed with 32 bit key (32 bit x86 world).

2. Lookup.
Lookup is performed in each of three tries one-by-one if previous one succeeded, i.e. we do not need to lookup in destination address trie, if source one does not match.

Test two.
96 bit keys created as a concatenation of the three 32 bit values.

Test data.
Each test inserts 2^20 elements (either into hash table with 2^20 elements size or different (or the same) trie).
Then a lot (about a million) of entries are being searched.
For hash table three random keys are constructed, for the first trie test three runs are performed for different random data in each 32 bit key (if previous search succeeded), for second trie test one 96 bit key is constructed and one lookup is performed.

Each trie test was ran for 4 and 8 bits per node (i.e. 16 and 256 nodes on each level).

Results.

  Name			    | memory usage (Mb)          | time for one lookup (nsecs)
--------------------------------------------------------------------------------------
hash table (size for 2^20)    32 (4+28) or 24 (4+20) **)	300
trie 4 bits, 3 32bit keys		 24			250
trie 4 bits, 1 96bit key		683			390
trie 8 bits, 3 32bit keys		 96			160
trie 8 bits, 1 96bit key   not performed - too big memory overhead (see below) *)
Winner declaration I provide to the interested reader.

Details marked with (*) in the above table.
*) Since each node can only cache 32 bit (on 32 bit platform, on 64 bit platform it can cache 64 bits), all other bits must be hosted in the higher leafs, so for 96 bit keys we need to host 64 bits in different layers, each layer takes 8 bits to store and 1Kb of ram, so we need to have 64/4*1=16Kb per key, given that we have 2^20 keys, it becomes 16Gb, so test was not performed.
**) hash table itself is sizeof(void *)*2^20, which is 4 Mb on x86, each embedded structure must contain at least 96 bits for key (two addresses and two ports), two list manipulation pointers (in that case it becomes 20 bytes, so 20 Mb for 2^20 entries, thus total memory overhead is 24 Mb (4+20)), and other parameters needed for searching - hash value stored (do not recall if it is 100% needed) and bound device index, so structure takes 28 bytes and total overhead becomes 32 Mb (28+4).

Conclusions.
As you can see, having one big index is not always the best way to perform a search lookup.
Properly designed trie (and lookup algorithm) can be about two times faster (with 3.5 times higher memory overhead) or 20% faster with essentially the same memory overhead.

Additional notes.
During such high rates there were actually never found any entry at all (i.e. either hash/trie lookup returned NULL, or one of the three trie lookups returned NULL and lookup for the next level was not performed), in practice things can be better or worse for any algorithm, but recall all pros and cons for dynamic and static data structures and its lookup algorithms.

Results can be very dependend on real data where actual sockets are returned, but not empty searches.

/devel/other :: Link / Comments (0)


New features in the Linux FS world.


For me at least following two ideas seems the most interesting:

  • introduction of the single ->perform_write() callback instead of prepare_write()/commit_write() pair, which becomes very complex to handle if filesystem has tricky locking. Even ext3 has some issues with journalling and prepare_write()/commit_write() sequence, which I discovered in receiving zero-copy project. That issues were fixed by hack which moves a lot of ext3 processing completely into ->commit_write() callback as Alexey Kuznetsov told me.
  • move to 4k sector size. Some time ago I decided to implement my own FS using pages only (i.e. without buffer heads and related splits, especially with introduction of ->perform_write() callback), so it would be a good performance enhancement (if it would be ready).

/devel/fs :: Link / Comments (0)


Sun, 11 Mar 2007

Meanwhile on appartments development side.


Winter is over and I can continue development - today I completed second plaster layer filling where it was required (there is a place where the last - third one would be good though, but it is not absolutely required), so essentially my main ceiling only needs painting right now, or may be I will fill it with the finish plaster with some kind of a random pattern spreaded over the ceiling before.
Hinged ceiling needs to be rubbed and repainted in some places.
To get a paint I need to move to the development shop where I will also buy special screws for my table implementation.

/devel/flat :: Link / Comments (0)


Solve problem in order of appearence.


I've done one of the things I described yesterday as a short term todo - I get some beer and played Open Transport Tycoon Deluxe the whole night (I went sleep about 6 o'clock today's morning) and managed to win against two computer players on the easiest level.
That was quite challenging in that regard, that playing a strategy game in hammock with touchpad only is quite complex task...

There are couple of other things though...

/life :: Link / Comments (0)


Sat, 10 Mar 2007

I'm sick.


I'm very sick.

Crap, I wanted to buy couple of easy reading books - some fantasy, also wanted Pelevin, but instead I ordered "Game theory", "Graph theory" and "Probabily theory. Markov's chains".

Does anyone know how can I cure myself? I'm really ready for serious steps.
Maybe it is a result of hangover?
I was implementing bits of optimization for trie implementation (some anonymous reader pointed that my ideas are very close to judy trie design, sometimes I think that all CS related ideas are already implemented or at least described in details), and suddenly decided that I have serious white areas related to above topics and I needed to fix that. And I fixed.
There is only one problem - I did not complete my appartment development process, I have a lot of interesting CS ideas, I want to play Open Transport Tycoon Deluxe and my trumpet. And I'm an idiot who reads Nietzsche for interest and fun.
Ugh, forgot, I also do not have time.

Something is wrong, I need to think about it...

Update: Grange presented me a bottle of "Aligote Koktebel" 2002 year harvest and it is over already (that was very tasty thingy), I also completed "Massandra Sherry" (but I do not like it too much - it is very dry) - so I have nothing to drink and a lot of things to do - not very fun environment...

/life :: Link / Comments (0)


Fri, 09 Mar 2007

Evening.


This was first time I climbed with weights (1kg) on hands and legs.
That was quite interesting and I needto admit - hard, but that was exactly the purpose of the exercises - 3 hours of climbing in that fetters ended in my complete emptiness - they only thing I could do was to sit in a sauna - magically power returned to me in a several minutes, so I decided to call old friends and move to the pub. I got Fedor and Irin and Wijo (Evgeniy Burnyakov), we spent several hours in "Golden Vobla" pub (which is quite bad imo - when they have alive music, it is impossible to hear each other, theirs beer looks like mixed with alcohol, at least it is the only place after which my head aches).
We 'celebrated' old new yar, women day, new Fedor's car, Wijo's birthday (or we forgot about it...) and other missed events, and of course our meeting itself.
That was great start of the weekend.

/life :: Link / Comments (0)


Multi-dimensional trie implementatin.


I've completed the first phase of the extended trie design, which allows to create non-binary tries. Implementation allows to use different number of bits per node (i.e. usual binary trie (one bit), trie with 2, 4, 8 and so on bits used to select an entry), for example trie with 8 bits per node alows to select an entry in 32 bitfield in 4 lookups. Implementation can work with searching keys of different length (but key length must be divisible by number of bits per node), which allows to use that algo with different areas (like IPv4 and IPv6 networks in the same cache, or several different network protocols in the same tree (UDP, TCP, SCTP, DCCP - they all can share the same trie, but one byte of the searching key must be different and correspond to the protocol number)).
Lookup is fast, although not optimal yet (comparable with hash table lookup speed I tested previously).
Its main problem, as expected, is memory usage - each trie node contains array of pointers, number of elements in which is created according to number of bits selected per node (for 8 bits per node each node contains 256 pointers, which is equal to 1 kb of memory on x86), for 2^20 random elements such trie eats about 1Gb of memory, which is unacceptible.
Exactly the same problem existed before and was solved in second generation of the trie by caching values in the leafs when they do not share bits.
I'm going to create the same optimization for this implementation, which should noticebly reduce memory usage and improve lookup performance a bit.

/devel/other :: Link / Comments (0)


Thu, 08 Mar 2007

Women.


You make our live so complex and unprdictible, so you just might not be not loved.

If you do not understand what happens, then you likely live in a poor country where do not celebrate women day (men cook, make presents and flowers, walk with children and give a sitting place in transport), which happens to be today.
So, start the new tradition just as is.

Your brightest eyes have burnt my soul,
And moved my mind away.
And now I have the only goal:
To make you happy day by day.
That blank verse inspired by the sence of this day.

/life :: Link / Comments (0)


Transport cards.


I was wrong when mentioned that moscow transport cards can be read by visa-like mifare reader - it cant, contactless reader must be used instead, which can be bought relatively easy too though.

/devel/other :: Link / Comments (0)


Wed, 07 Mar 2007

RFID passport was cloned without even opening a package.


Details shows that new USA passport can be cloned without major problems - key, which encrypts data in the RFID chip can be easily guessed by other factors without even looking inside the package containing passport.
If it is so easy, then how easy is russian transport card protection ever? I cracked some bits in Moscow railway transport tickets code couple of years ago (and I know for sure that it was cracked fully by other people) until it was extended (about two times), but I still in doubt about moscow transport card, which encodes information about entering moscow subway transport and moscow railway transport in a single card (visa like type (forget actual name, mifare supports that kind of reading via some devices as far as I recall), which can be read by freely sold readers), likely there is no protection at all...
I recall this again now - that is what I consider the real hack. If I will find some (free) money, I will buy a reader/writer and try to check how things are kept secret. Likely it is not...

/devel/other :: Link / Comments (0)


Linux.conf.au has released presentations.


I've just noticed, that you can download most of the LCA2007 presentation (in form of slides though).
You can also get video and audio records.

Let's see the people...

/devel/other :: Link / Comments (0)


First release of the eventfs.


Eventfs - pseudo FS which allows to bind file descriptors to events.

One can bind signal and other events (currently only signals are supported) and poll them using epoll().

It is heavily based on ideas from kevent project.

I've sent to linux-kernel@ and hackers who participated in kevent discussion.
Let's see where this will end up.

/devel/eventfs :: Link / Comments (0)


Eventfs.


This is a prototype of the pseudo filesystem created to bind different events and epoll().
I will implement only signal biding (and maybe POSIX timers binding), that's all - if there will be no feedback like with kevent, this will be dropped from the start - I will better continue implementation of the scalable socket lookup.

Basic idea is to provide set of syscalls, which will get object id (like signal number) and private data pointer and private kernel interface (used for example by POSIX timers), which will end up allocating new file structure and descriptor with appropriate ->poll() callback, which can be used by epoll().

So far it looks like I've found a race in Linux signal handling, which can lead to the lost signals setup.
Consider following code in do_sigaction() (called from signal() for example):

int do_sigaction(int sig, struct k_sigaction *act, struct k_sigaction *oact)
{
	struct k_sigaction *k;
	sigset_t mask;

	if (!valid_signal(sig) || sig < 1 || (act && sig_kernel_only(sig)))
		return -EINVAL;

	k = ¤t->sighand->action[sig-1];

	spin_lock_irq(¤t->sighand->siglock);
	...
		*k = *act;
	...
	spin_lock_irq(¤t->sighand->siglock);
	return 0;
}
If signal() or sigaction() is called from signal handler just before spin_lock_irq(), it will not take any effect, since the same action will be overwritten in a process after handler is completed. Man page says that both signal() and sigaction() are signal safe functions.

/devel/eventfs :: Link / Comments (0)


Tue, 06 Mar 2007

How expensive is cache miss? Tlb miss?


I've ran some trivial tests on my Intel Core Duo desktop to solve that questions for me.
Tests basically allocate a region in memory and perform reading from different addreses in a loop.
I did them in userspace to determine TLB miss price (each tlb entry in userspace on x86 covers 4k).
Here is a code snippet:

#define rdtscll(val) \
	__asm__ __volatile__("rdtsc" : "=A" (val))

	...
	
	for (i=0; i< num; i++) {
		rdtscll(start);
		asm volatile(	
			"movl %1, %%eax\n"
			"leal (%2, %%eax), %%ecx\n"
			"movl (%%ecx), %0\n"
			: "=r"(data)
			: "g"(area), "g"(i)
			: "eax", "ecx");
		rdtscll(stop);
		idx = (stop-start)/CPUFREQ;
		if (idx >= MAX_RESULTS)
			results[MAX_RESULTS-1]++;
		else
			results[idx]++;
	}
Here is an overall picture with results:

memory access speed

Horizontal axis shows number of cycles needed to fetch one entry from memory.
Vertical axis shows number of elements fetched with the same speed.

But let's see into the end of the picture, where the most expensive fetches live (TLB miss):

TLB miss

Above graph has nanoseconds as horizontal axis.
We have one TLB miss roughly per 4k reads - exactly on 4k tlb entry boundary:

TLB miss details

Analyzing first picture we can find that most of the time fetching of 4 bytes from memory takes about 100 cycles, about 1/8 of that time 4 bytes are fetched for 86 cycles.

Strange results for DDR (or DDR2 I do not remember actually) memory, doesn't it?
That is because they are absolutely incorrect.
Well, not absolutely, but they do not show anything interesting for L1/L2 latencies (only TLB miss test is useful), just because above 100 cycles is exactly overhead of the second rdtsc instruction. I've added additional rdtsc and got peak at 200 cycles. Real memory access is fully hidden behind rdtsc overhead.
This problem is only related to my Core Duo machine:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 6
model name      : Intel(R) Pentium(R) D CPU 3.40GHz
stepping        : 5
cpu MHz         : 3740.058
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 3
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc pni 
monitor ds_cpl est cid cx16 xtpr lahf_lm
bogomips        : 7485.68
Second CPU is the same. Never use that processor and rdtsc instruction - it is implemented extremely slow.

Better check AMD Athlon64 3200 CPU.
Its rdtsc takes 8 cycles, which is very good result.
So, the same memory access test:

memory access speed on amd

As you can see, there are two peaks - around 20 and 65 cycles, getting that rdtsc is 8 cycles and reducing 4 reading instructions were get about 8 cycles for L2 cache hit and about 53 cycles for L2 cache miss. Test performs fetch of 4 bytes aligned to 4 bytes boundary, so getting square under peaks (theirs correlation is about 1/15) we get about 64 byte cache line (4*15) - you know, it is exactly the cache line size on that processor.

TLB cache miss picture for that CPU is essentially the same:

TLB miss details for AMD

except that TLB miss happens on 1k boundary and takes 2 times longer than on Intel (maybe it is x86_64 Linux port which can cover userspace with 1kb tlb entries, I did not check, but getting, that /proc/cpuinfo shows this: "TLB size : 1024 4K pages, it might be true).

That's all, what I wanted to say about it - actually I just did not want to do something serious, so to kill some time I ran that tests.
I hope it will be somehow interesting for you.

/devel/other :: Link / Comments (0)


Mon, 05 Mar 2007

Climbing evening.


That was short training - I only completed several traverses and two traces - one of them new complex one over green holds in the central sector - I have only one complex part in it right now, so I plan to finish it without falls quite soon - although its category is quite high (7a) it is not that complex (or maybe I just easily complete traces on vertical walls).
I also tried couple of starts from recent (highly-qualified) competition spent there - but instructors forbid to climb without insurence higher than three meters, so I only completed several first holds - not that complex like I thought, but likely I would not completed it without fall since I always had problems with physical endurance on walls with negative slope.

/life :: Link / Comments (0)


The fastest way of doing tree traversal. Better algorithm to do a socket lookup.


Enjoy.

 80484a9:	89 e9                	mov    %ebp,%ecx
 80484ab:	89 d6                	mov    %edx,%esi
 80484ad:	89 c3                	mov    %eax,%ebx
 80484af:	83 e1 01             	and    $0x1,%ecx
 80484b2:	8d 54 24 1c          	lea    0x1c(%esp),%edx
 80484b6:	8d 76 00             	lea    0x0(%esi),%esi
 80484b9:	8d bc 27 00 00 00 00 	lea    0x0(%edi),%edi
 80484c0:	8b 14 8a             	mov    (%edx,%ecx,4),%edx
 80484c3:	d1 ed                	shr    %ebp
 80484c5:	89 e9                	mov    %ebp,%ecx
 80484c7:	83 e1 01             	and    $0x1,%ecx
 80484ca:	85 d2                	test   %edx,%edx
 80484cc:	75 f2                	jne    80484c0 
C code:
	struct trie_node
	{
		struct trie_node		*ptr[2], *parent;
	};
	
	...

	int bit = value & 1;

	while (n) {
		n = n->ptr[bit];
		value >>= 1;
		bit = value & 1;
	}
	return n;
6 operations per loop, just two times more than linked list and with only one conditional branch!
Its speed is about 60-80 nanoseconds, which is roughly 1.5 times slower than linked list traversal only because we need to calculate which dimension to take (additional shift and binary 'and' operations). Very fast, but still can not compete with hash table with 4-5 elements in each entry.
So, there must be another algorithm.

I've invented one (well, not exactly invented of course, but will implement at least), which is a bit memory hungry, but allows to get one element out of 32 bitfield in 4 lookups (with only one conditional branch per lookup).
It is some kind of trie/tree extension, where each node becomes not binary one, but containing multiple fields, which in turn becomes just multidimensional instead of two-dimensional tree with above access type.
Stay tuned (I need to complete paid work part first).

/devel/other :: Link / Comments (0)


Sun, 04 Mar 2007

Sunday.


It was full of small things.

Main one - frenchs can make cars (couple of photos). Although they do not expect people with 185 sm height will ever sit in Peugeot 206 XS. They also do not not expect people with 45 foot size will use that car.
So, I several times kicked my head and shoulders, and Grange broke a small spring around cohesion.
After a liter of cogniac, russian sharpness and indecent words spring was fixed and we continued our celebration.
I ended the (next) day reading Friedrich Nietzsche "That said Zaratustra" - sometimes very interesting, but frequently I do not understand him.
Also looked into the book which describes different schemes of tax manipulations used by Khodorkovsky - I live in very interesting country.

/life :: Link / Comments (0)


Kevent.


In case you were confused about this post that I never used any project to advertise kevent and this post about how I used threadlet thread to advertise kevent.
The latter is a very ironic post about the only possibility to get attention is only by entering some other thread (remotely related) and start agressively stating own point of view. If you think otherwise - that is your problems and basicvally you suck.
I indeed did that, but that was not done to show that threadlets suck and kevent is cool.
Absolutely.
I like Ingo's approach of threadlets and agree that it is a very simple and quite non-problematic case to create highly parallized environment, but I just think (based on my expirience with threads in Linux) that threads per IO (or per block) is not a good solution for AIO model. I do not say, that it is worse than kevent (or any other event-driven model), I just want to say, that as is it is not the best solution from performance point of view, and both event driven (kevent or epoll, the former will be declined almost sure, I proposed another idea - having new filesystem for each type of events, i.e. signalfs, timerfs and so one, that approach does have problems and scales worse than kevent (it is bound to file structure, which is quite heavy), but it allows to have epoll as a controlling interface) model.
Although it does requre to write new code both in kernelspace and userspace, and to change existing applications just like kevent, and likely scales worse than kevent, which is there already, it probably will be blessed by developers.
But I do not care actually :)
There are a lot of other interesting things (without politics).
Stay tuned.

/devel/kevent :: Link / Comments (0)


Grange became a man.


He bought himself a car!
Let's see how things will go in his case - I sold my after year and a half after the start, but my bucket with screws was noticebly worse that his one - he is a happy owner of Peugeot 206.
Let's celebrate this!

Update: Ugh, he did it. We even moved over MKAD about 30 km - I need to admit that couple of times I always made a shit - he is quite a hard driver, although really not that bad for the first drive in Moscow (I do not know about others cities a lot, but moscow roads are the most complex I ever saw), but I think that will be changed in a couple of weeks. We ended in his home drinking cogniac. I thought I do not like cogniac, since I drunk quite a few different sorts of it - from usual one to very 'good' and expensive, but never liked it actually, but after a half of a liter "Moskovskiy" one it might be not that bad.
Let's continue celebration...

/life :: Link / Comments (0)


Sat, 03 Mar 2007

Climbing evening or training of sevens. [ND]


That was very interesting evening - I did several traverses and new not so complex traces as warming process and eventually started to do sevens - old and new traces of 7a category - I tried new one on-sight - failed of course, but I think I can do it - I completed most failed parts on the second attempt after some rest on the insurance rope. Old favourite trace over red holds in the central sector was not completed too, but there was another climber on the way, so I tried part of yet another trace (though I did it some time ago too) of the same complexity.
That was definitely interesting training.
There is only one bad moment - I burnt a bit a skin on my 'sitting area' in the sauna - but that can be considered as a lesson to not change a sitting place...
That was an excellent time.

/life :: Link / Comments (0)


Fri, 02 Mar 2007

How things can endup with some trivial initial misunderstanding. [ND]


(you are never ever wrong, and if you are proven wrong on topic A you claim it is an irrelevant topic (without even admitting you were wrong about it) and you point to topic B claiming it's the /real/ topic you talked about all along. And along the way you are slandering other projects like epoll and threadlets, distorting the discussion. This kind of keep-the-ball-moving discussion style is effective in politics but IMO it's a waste of time when developing a kernel.)
Sigh, we started very interesting topic, but ended up with some personal insults and drawing things people never talked about and even remotely based assumptions on - and all just because of some trivial misunderstanding of the simplest context.

That's pity, but well, if things has moved to that point, fuck them, there will be tons of others.
It seems I broke possible good relations with some very interesting kernel hackers, but I never wanted to do any personal or project insults for my own advertisement. If people can not (or do not) understand that, then, well, let's just move further.

/devel/kevent :: Link / Comments (0)


Quotation of the week. [ND]


Ingo Molnar wrote in linux-kernel@ in threadlet/syslet thread:

sure, if we debate its virtualization driven market penetration via self promoting technologies that also drive customer satisfaction, then we'll be able to increase shareholder value by improving the user experience and we'll also succeed in turning this vision into a supply/demand marketplace. Or not?
I'm going to write that mantra in some piece of the memory.

/devel/kevent :: Link / Comments (0)


New wine has arrived. [ND]


Bottle of Massandra Sherry of 2000 year harvest.
Now I need to solve a problematic question: to start it drinking right now or postpone for evening or tomorrow, since I expect a good climbing training later today.

Update: I failed to resist nad opened a bottle. Hmm, tasty but Krymsk's one I liked more. Massandra is more dry and do not have that expressive taste.

/life :: Link / Comments (0)


Thu, 01 Mar 2007

Further trie/hash table perfromance analysis. [ND]


Ok, let's first check, how does it take to traverse 32 pointers, i.e. single linked list, one-by-one.
Optimized asm output produces something like that:

 80484e4:	89 c3                	mov    %eax,%ebx
 80484e6:	89 d6                	mov    %edx,%esi
 80484e8:	89 e9                	mov    %ebp,%ecx
 80484ea:	31 c0                	xor    %eax,%eax
 80484ec:	8d 74 26 00          	lea    0x0(%esi),%esi
 80484f0:	40                   	inc    %eax
 80484f1:	8b 09                	mov    (%ecx),%ecx
 80484f3:	83 f8 20             	cmp    $0x20,%eax
 80484f6:	75 f8                	jne    80484f0 
Blue lines are actual loop:
for (j=0; j<32; ++j)
	n = n->right;
Using rdtsc call I measured that that loop takes about 50 nanoseconds on my 3.7 Ghz Intel Core Duo machine (including some overhead in above asm output marked as black lines to get parameters), this roughly corresponds to (32*4)/50 = 2.6 instructions per nanosecond, which is quite close to theoretical - 3.7Ghz is about 3.7 instructions per nanosecond.
That is about 2 nanoseconds per lookup in optimized linked list.
Ok, let's see how a bit more complex trie traversal algo works.
I've create following loop:
for (j=0; j<32; ++j) {
	bit = (value >> j) & 1;
	if (bit)
		n = n->right;
	else
		n = n->left;
}
This produced following optimized asm:
 80484eb:	31 c9                	xor    %ecx,%ecx
 80484ed:	89 c6                	mov    %eax,%esi
 80484ef:	89 d7                	mov    %edx,%edi
 80484f1:	89 eb                	mov    %ebp,%ebx
 80484f3:	eb 09                	jmp    80484fe 
 80484f5:	41                   	inc    %ecx
 80484f6:	8b 5b 04             	mov    0x4(%ebx),%ebx
 80484f9:	83 f9 20             	cmp    $0x20,%ecx
 80484fc:	74 12                	je     8048510 
 80484fe:	8b 44 24 18          	mov    0x18(%esp),%eax
 8048502:	d3 e8                	shr    %cl,%eax
 8048504:	a8 01                	test   $0x1,%al
 8048506:	75 ed                	jne    80484f5 
 8048508:	41                   	inc    %ecx
 8048509:	8b 1b                	mov    (%ebx),%ebx
 804850b:	83 f9 20             	cmp    $0x20,%ecx
 804850e:	75 ee                	jne    80484fe 
Where blue lines show loop itself, we got about 8 operations per loop plus two branches - that code works about 4-5 times slower (instead of two since number of operations increased twice), which means that some operations take more than one tick.

Code which does not have branches (the same loop as in linked list plus value bit analyze) takes about 100 nanosecods and 8 instructions per loop, which is exactly two times slower as linked list processing.
C loop:
for (j=0; j<32; ++j) {
	bit = (value >> j) & 1;
	n = n->right;
	res += bit;
}
res processing added so that compiler would not optimize value bit analyze away. Asm code is following:
 80484e9:	31 c9                	xor    %ecx,%ecx
 80484eb:	89 c7                	mov    %eax,%edi
 80484ed:	89 d5                	mov    %edx,%ebp
 80484ef:	8d 5c 24 3c          	lea    0x3c(%esp),%ebx
 80484f3:	8d b6 00 00 00 00    	lea    0x0(%esi),%esi
 80484f9:	8d bc 27 00 00 00 00 	lea    0x0(%edi),%edi
 8048500:	8b 44 24 2c          	mov    0x2c(%esp),%eax
 8048504:	8b 5b 04             	mov    0x4(%ebx),%ebx
 8048507:	d3 e8                	shr    %cl,%eax
 8048509:	41                   	inc    %ecx
 804850a:	83 e0 01             	and    $0x1,%eax
 804850d:	01 c6                	add    %eax,%esi
 804850f:	83 f9 20             	cmp    $0x20,%ecx
 8048512:	75 ec                	jne    8048500 
As usual, blue text is loop itself.

So, what do we have?
The most optimized 32-bit trie processing code takes about 7-8 nanoseconds per bit (200-250 nanoseconds per 32 bit), linked list's overhead is 2 nanoseconds per lookup.

Btw, in linked list processing there is no for() loop, but instead check of the current entry if it is equal to the list head, it costs about 3 operations per loop instead of 4.

In real life we have a hash table of sockets, each hash entry usualy contains upto 4-5 nodes in the linked list, number of sockets is about 2^20.
Hash itself takes several instructions (socket XOR hash at least) and I will not get it into account.
So, to create competive trie implementation I need to be able to code 2^20 entries into one lookup of the trie? (one trie lookup, i.e. time per check of one bit of the key, is 4 times slower, than one pointer dereference in linked list according to above analysis).
That is impossible. Theorem has been solved - I was wrong.
Trie topic is closed, I need to think about better algorithm.

I'm posting it right now in the blog for others to think about analysis (which can be incorrect though), I would post it to netdev@ in the related topic, but I need to ge home (quickly or I will miss the latest bus), if I will not mentally find bugs in the analysis, I will post it tomorrow.
If you see something obvious, please drop me a mail.

/devel/networking :: Link / Comments (0)


Sophisticated filesystem testing and benchmarking. [ND]


Filesystems are built on top of physical hardware, which has set of characteristics which can not be changed - like search time, linear read time and so on.
Deeply looking into this set of operations, I came to idea to implement kernel module which would sit between FS and underlying hardware and emulate some delays, errors and so on.
Now I think about its extension in two directions - first one to allow userspace testing of the fs algorithms - like some library which will provide read/write methods, which will then be emulate disk behaviour - like writing to the different sector requires a seek delay, so algorithm, which has highly separated meta and actual data will seek in usual write (to meta data and to real data), another direction is to be used as trivial kernel module on which fs can be based initially.
In theory it can be implemented as block layer scheduler, but simpler is just to implement a block device which will end up with changed timings and redirecting to usual one.
First step is to implement it in userspace though and test only algos there.

/devel/fs :: Link / Comments (0)