|
|
About
TODO
Blog
RSS
Old blog
Projects
Gallery
Notes
Sat, 31 Mar 2007
Mikhail Alexandrovich.
Today we celebrated birth of the new man - Mikhail Alexandrovich Silich.
I met with many old friends, although not everyone were able to meet.
Growth problem were discussed, examples from our lives and education, fun stories.
Mikhail did not keep silence and several times per hour requested
parents attention, we sometimes tried to determine the reason by scream analysis - mostly failed.
He is two weeks old, but already hairy and always tried to get out of the wear. I failed
to determine to whom of the parents he looks like.
Alexander and Yuliana looks happy :)
/life :: Link / Comments (0)
Thu, 29 Mar 2007
FS userspace emulation tool.
Ok, here are features I will implement:
- a library which will provide synchronous read and write calls,
each one will get sector number and amount of bytes (maybe with vectored variants).
Each such call will block until data is really written/read, i.e. it will include
cost for sequential write/read and seek into the place.
- simple FS implementation, which will take a 'storage' size and divide it into
fixed size blocks, which will form a bitmap of used/free elements.
Then I will use random models to test that FS - work with several big files, with a lot of small
files and so on, no directories so far. That will show what disk usage patterns are common
for different loads, main goal to find a solution which will have the smallest amount
of disk seeks, which are extremely slow.
- start more complex FS implementation with this
set of features as a must
/devel/fs :: Link / Comments (0)
New wine and recent happenings.
'Rose d'Anjou. Henry Le Cuvier' - french dry wine - not tasty,
no special spirit, almost water. Will not buy.
Wine which helped to crack 2 of 9 Jenkins hash rounds - 'Sunrise Merlot' by 'Vina Concha y Toro' -
Chile red dry wine - extremely tasty thing, will buy it again
if there will not be any better variants.
I did not write for my climbing trainig progress so far -
I do climb three times per week about 2.5-3 hours each, still train alone,
sometimes with weights - mostly compelx boulderings and traverses
on the second set of shields (1 meter high) - that make me tired as hell, which
is exactly the purpose - every piece of body (even head brain some times, although
that happens extremely rarely) need to work I think.
Tried (although I did not want to change anything too much) to change
life style (as Grange call it) -
stopped after initial steps - call me lazy boom slacker.
Found, that I can do only three things after training - to do nothing, to eat and drink
and to do somthing I will not tell you. Everything a lot.
Feel myself extreemly good.
Had a talk with my boss and company CEO - such talks never happen when people want to thank you
or rise a salary (it rises once per year with smaller than inflation rate), so I got a bunch
of shit on my head. And I needed to lie, which I never (hmm, almost) do - but how can I tell
them that the most interesting things I did for the last several years were my own projects,
which did not affect (almost) my main work - I said, that although I really do not like what I do,
it must be done and so on.
Think about doing for my own again - but there are some problems (as usual)...
/life :: Link / Comments (0)
The 2007 Linux Storage and File Systems Workshop.
LWN article
and wiki notes.
Interesting notes:
- extended attributes must have variable lengths to satisfy various usage cases (crypto, ACL)
- fsck is evil, it can be fixed by changing meta-data location to eliminate seeks
- disk encryption should not depend on external subsystems like SELinux
- stackable filesystems work badly when underlying filesystem in the stack is modified directly
- ShadowFS - way to use some kind of write-anywhere-filesystem-layout - when data is being written
into the file, old data is not owerwritten, but instead write happens into different area on disk,
so system is never in inconsistent state if there were some errors (power failure, shutdown and so on
- a short description of several filesystems (NFS, GFS2, OCFS2, DualFS) and eXplode - bug searching tool,
which is known to help finding bugs in every Linux filesystem recently
As an interesting short-time task I started to implement a userspace model of the hard drive -
it will map the storage and have different access models for read/write/seek disk operations.
/devel/fs :: Link / Comments (0)
Wed, 28 Mar 2007
Breaking Enigma code.
Or cracking Jenkins hash
for fun and profit...
One man. Six days. One bottle of wine. One bottle of beer. Enourmous amount of tea cups.
Eighteen sheets of paper. Two rounds of Jenkins hash.
As a gain:
- library for polynomial arifmetic calculations in different Galua fields
- refreshed knowledge of polynomial algebra
- refreshed knowledge about Galua fields (CRC, Rid-Solomon codes)
- new knowledge on boolean algebra transformations
- some knowledge on differential crypto analysis
History.
Day one - cracking first round of Jenkins hash.
That was pretty easy from calculation point of view (i.e. how to select
i'th bit based on knowledge of (i-1)'th or (i+1)'th bit), but it was not a main goal.
I wanted to solve the problem from algorithmistic point of view, i.e. not to
find a single solution, but a generation law. I started from simpler task -
to solve following equation:
(A + X) XOR X = B
Day two - theory.
There are two possible ways of solving above equation - either to
present logical (bitwise) operation in Galua field of 2^1 (GF(2^1)) - XOR -
in ariphmetical field (GF(2^32)) or present sum operation in GF(2^32)
as a bitwise operation in GF(2^1).
There are different ways to do it.
Day three - sum as a bitwise operation.
It can be presented using quite simple form:
Ai + Bi = Ai xor Bi xor K(i-1), where
Ki = 0, i = 0,
Ki = (Ai and Bi) or (Ai and K(i-1)) or (Bi and K(i-1))
Xi means i'th bit of X
I got recursive formula for simple sum, which was not what I wanted,
so I started to solve above logical equation using different models.
First one - polynomial algebra - sum and bitwise operations are quite simple
in polynomial algebra, but there is a serious problem - bitwise operations
must be applied only to normalized polynomial form, since it looses information
when drops overflow of the order. This limitation is nothing for comupter program, since
it is possible to normalize polynomial form before doing bitwise operation,
but it does not allow to move on in algorithm solvation.
So, next step - transform bitwise operations into GF(2^32) operations.
Day four - boolean polynoms.
Boolean polynom is a polynom which result and arguments can only be either 0, or 1.
For example XOR operation can be presented as following polynomial form:
A XOR B = X*(3-X)/2, where X = 2*a + b
Looks exactly what I wanted - to present all operations in polynomial form, but there is small problem.
After I expressed all logical operations in the simple equation above ((A+X) xor X = B, where each
operation (sum in GF(2^32) and XOR) is transformed into boolean polynoms)
in polynomial form, I got system of equations of 27 order - it is not what I expected to solve for simple
equation, so this step was dropped too.
Day five - give up?
Not so fast, if I can not solve that problem in algorithm form, let's find at least one sulution for all
given inputs.
Here I created a library for polynomial sum, subtraction and normalization (which allows to work with
numbers of essentually any order as a bonus, i.e. it is a sum and substraction in Galua field of any order),
and started with first round of Jenkins hash cracking.
I ended up with code which gets as input two 32 bit values and hash result,
and returns third 32 bit value which if being hashed in first round of Jenkins hash produce required
hash value.
That was simple, but then I started second round. (Jenkins hash has 9 rounds).
I made the main mistake here - I started to calculate second round value based on initial inputs
(two of them are under attacker's control), but for correct solvation I needed to make a trick.
Day six - break second round of Jenkins hash.
Main trick in Jenkins hash is that it uses value calculated on previous round as input parameter for current
round, so after some mathematical transformation it is possible to change variables in the hash equation
to work not with initial values, but with some initial values and result of the previous round.
Since each round's result only depends on its input values (some of them can be initial ones, others can
be calculated on previous rounds), it can be done.
This idea was in my brain from the beginning, but I could not formulate it into something which can be used,
and today sitting in the bus, which moved from home to Moscow to my paid work office I finally drew it
(the latest sheet of paper).
Now I have a program which calculates two inputs for given input and required hash output for two-round Jenkins hash.
It is quite simple, but the whole process was very interesting itself.
Further cracking is not impossible task, but complexity increasees with each round (although not exponentially).
I absolutely do not regret about time I spent solving this problem - that was fun.
Interested reader can find a set of photos of brainstorming of this problem in
gallery.
And a solution:

Example:
old a: 15e28f3a, b: 1cb4ed1c, c: 7edcd3a0, h1: 70bc78e4
new a: 15e28f3a, b: 434c39d0, c: d28fc0ec, h1: 70bc78e4
As you can see, $a parameter is the same, $b and $c are under attacker's control,
all three produce the same hash value $h1 of reduced to two rounds Jenkins hash.
$b and $c were calculated based on $a and $h1 knowledge (or actually $h1 can be selected randomly
and $b and $c can be selected to produce it again and again, salting with random value will not change
the picture).
/devel/math/hash :: Link / Comments (0)
Fri, 23 Mar 2007
Jenkins hash analysis and conclusion.
There were several attempts to use Jenkins hash for socket
lookup functions instead of old XOR hash, but I was always against
that.
My objectinos were based on the tests I ran during
netchannels
hash selection process.
It especially becomes visible in the following usage case - we use
the same constant IP addresses (two 32 bit values) and one constant destination
port (16 bits) and variable source port (16 bits), so essentially we hash 3 32 bit values.
It can be done for example using following scheme:
hash = jhash_3words(saddr, daddr, (dport<<16)|sport, random_seed);
where random_seed is selected early.

This graph shows chain length on X axis and number of elements with such a lentgs on Y axis.
As you can see, there is a significant artifact with about 3200 elements in some hash chains.
My first (and incorrect) impression was that Jenkins hash has some distribution problems, but further analysis
showed that it is not true. I performed distribution checks in full 32 bit field and found,
that it has uniform distribution as long as XOR hash.
Problem appeared when I got 16 bit distribution (size of the hash table is 2^16 elements), while
XOR hash was still uniform.
Further analysis showed that Jenkins hash itself does not have distribution problems, but they
appear when its folded into hash table size boundaries.
This theory was also confirmed by different hash tests (new Jenkins with rotations and RIPEMD).
XOR hash does not have such a problem, because it uses (u32 ^ u16) as one
round, which results in the uniform (it is not correct to call that
distribution uniform as is, but only getting into account that u16 values
used in tests were uniformly distributed) distribution inside F(16), which
does not suffer from hash_size boundary folding. Since XOR hash has 3 rounds,
only one of them (xor of the final u32 values) will suffer from folding, but
tests where is it can be determined for sure use constant addresses, so
problem hides again.
So, I was wrong, Jenkins hash does not have problems as is, they appear only because of folding into
hash table boundaries, which is unavoidable for general-purpose hash.
/devel/other :: Link / Comments (0)
Thu, 22 Mar 2007
Kernel with multidimensional trie instead of socket tables boots!
Single trie is created for the following protocols:
- IP (
AF_INET) sockets
- TCP established sockets
- TCP listen sockets
- TCP timewait sockets
- RAW sockets
- UDP sockets
- Unix domain sockets
- Netlink sockets
Lookup methods as long as insertion/deletion ones differ only in key setup part,
although insert/delete methods perform some additional per-protocl steps
(like cleaning private areas in netlink).
Patch itself contains horrible ifdefs and hacks, but it works (somehow) -
my system boots and I can log in over ssh and run tcpdump,
although it crashes when I connect from that host.
I think it is a win.
I'm cooking up a patch, which will be sent to netdev@ in a couple of moments,
with link to trie design and userspace tests.
If (and only if) network developers will decide that idea worth implementation,
I Will proceed and clean things up (and change tons and tons of code all over the place).
/devel/networking :: Link / Comments (0)
Multidimensional trie implementation for Linux sockets.
Ok, I've implemented Unix domain sockets (both bound and private namespace),
but kernel still does not boot - udevd does not create
appropriate device nodes in /dev, since it never receives
any single netlink message - just because I broke netlink broadcasting.
Netlink sockets were hashed y pid and protocol
values, but they were also placed into additional per-protocol broadcast
lists (mc_list). Since I removed appropriate entry from
socket structure, I need to remove appropriate list from netlink table,
so right now broadcasting does not work.
I need to think how to implement netlink broadcasting correctly without changes
in the socket structure.
Upd: I need to return one liststructure back to implement correct
traversal of all sockets - for example for /proc and netlink
statistics, I of course can use the same approach current code does -
for example TCP statistics code runs over the whole hash table and gets
info about each entry, so I can run over all trie too,
but I do not like such approach, so I will return one hlist_node
in sock_common structure (or will use list_head,
which is simple to operate), which will form a list of all sockets
for given protocol. Netlink in turn will has per-protocol lists
to implement broadcasting.
/devel/networking :: Link / Comments (0)
Wed, 21 Mar 2007
Climbing evening.
In a masochistic^W^W^Wself-improving process of climbing on negative slope
with weights on arms and legs I did some progress - basically I tried
to move about 4-5 holds up over one trace and then move down over another trace
andthen repeate it until I have a power - this ends up with virtually
any piece of body aching as I struck several times with a train, or at least with a wall.
Some times I ask myself, why do I do it - why I spent several hours
with such loads without significant (or at least somehow largely noticeble) result?
It does not look too cool, it does not end up with better climbing results
(at leat I can not test it since I climb alone), instead I could seat in a cafe,
drink a tea and talk about something...
But I think I know the answer - I sublimate (hmm, loking into dictionary for that
word I did not find appropriate translation, I meant somthing like 'change' or 'replace'),
and I do it for myself and likely will not change anything at all.
Does above text sound strange? Yes, it is.
/life :: Link / Comments (0)
Unified socket trie and booting.
Ok, it seems to be somehow completed - at least it works
for netlink sockets, but my system does not boot, since
I use device mapper (LVM), which requires special nodes
created in /dev/ by udevd,
which does not start (at least in Debian Etch), since Unix
sockets are not supported (yet) in my unified cache.
But at least it does not crash (yet).
I will try to implement Unix domain sockets today before I will go climbing,
but I'm not sure I will have enough time (I need to complete
some paid job work first). When Unix sockets are ready, my
system will boot and networking will start, and only then real
troubles will appear.
/devel/networking :: Link / Comments (0)
Graphic tablet.
I've just bought a simple Genius G-Pen 450 for my sister.
Whlist it is here, I've managed to make it working in Debian Etch with
"wizardpen"
driver. Although movement and pressure level is not very convenient in Gimp, I think it is a matter of settings.
Marina uses windows and she is not a professional artist,
so this simple model will be quite good for her skills.
/life :: Link / Comments (0)
Tue, 20 Mar 2007
My brain has exploded.
But initial part of the multidimensional trie lookup algorithm has been added
into TCP, UDP, RAW and netlink sockets.
They share the same trie with 160 bits search key (128 actual data bits).
Tree does not compile yet, but there are only couple of function missed -
raw sockets lookup, time-wait socket insert/delete,
and port free function (read: old inet_put_port()). First two tasks
(raw socket lookup and timewait sockets) only need to a simple wrappers to create
searching key out of appropriate structures, like:
u32 key[5] = {tw->tw_rcv_saddr, tw->tw_daddr, (tw->sport<<16)|tw->dport,
(sk->sk_bound_dev_if & 0x00ffffff) | (IPPROTO_TCP << 24), 0};
err = mdt_insert(&mdt_root, key, sizeof(key)<<3, sk, GFP_ATOMIC);
...
err = mdt_remove(&mdt_root, key, sizeof(key)<<3);
Code has terrible number of the ugliest set of ifdefs,
it does not support procfs and netlink statistics (due to absence of the code to get
all sockets), btw, the latter traverses over the whole hash table to get socket info,
which is way too unoptimal, it is likely too broken, and (as it does not even compile)
I have not even booted it, but I think that what has been done is a serious task.
/devel/networking :: Link / Comments (0)
Fake interrupt disable/enable calls on x86.
Linux uses fair sti/cti instructions to disable/enable
interrupts on x86, since it is likely one the most frequently used instructions
(since they are essential part of irq-aware spinlocks).
PPC64 for example does not use it (as far as I know), but instead
they have a flag, that interrupt was fired when it is supposed to be
disabled, so subsequent enable will perform some steps (real disable).
It looks
like the same technique will be used by x86 in a near future.
/devel/other :: Link / Comments (0)
Socket lookup algorithm implementation.
You know, it is quite simple to create something new,
it is even simple to create something new which will be better than something
currently being used, but it is way too complex to replace
old one with new one.
I've created a patch which replaces established socket hash table with
mdt trie, and it somehow works, but then I suddenly found how many things can be moved there
too. For example time-wait sockets (I do not even know why they
are placed into own hash table, probably to not to pollute established
hot path) and other protocols.
So I made a serious decision - I just removed
(if special config option is set) hash related members from socket structure.
Try to do it in you setup and enjoy amount of work...
Let's see how this will end up and where I will stop this stupid
task (read: how strong is my stamina).
P.S. If I will manage to complete it today or _very_ soon, that will be not even cool,
but unimaginable cool.
/devel/networking :: Link / Comments (0)
Mon, 19 Mar 2007
Climbing evening.
I've created new type of my training - since I climb alone for some time,
I decided to try doing it on negative slope - I climbed up over one trace for
about 4-5 meters (where I have permission to climb without insurance),
then move down over holds from another trace, then change it at the bottom to the next one,
and so on until I have no power. Then I got some rest and continue with different
set of traces in negative slope sector.
That was exhausting, but that was a main reason to try.
Next time I will do it with weights.
/life :: Link / Comments (0)
Sat, 17 Mar 2007
Celebration increases.
Alexander and Yuliana have just become a parents - yesterday they
wanted to meet with us, and just about one hour ago a small boy was born.
My congratulations!
/life :: Link / Comments (0)
I congratulate you with St. Patric's day!
Although I'm not an Irish, it is indeed a good day.
/life :: Link / Comments (0)
Fri, 16 Mar 2007
Climbing evening.
Ok, since Grange can not climb
for some time, I train myself - so to make training a bit more
complex last several days I trained with weights on legs and arms -
that does make training more complex and interesting - I try
to do the same traverses and boulderings I did before,
but this time with weights - that is very fun.
Got my "Games theory" and "Graph theory" books - started to read first one.
It even contains some homework tasks, which I will likely post here (with my solutions)
in a new section, so if you think games theory and related economic in particular
and decision making in general
tasks are interesting for you, stay tuned.
Tomorrow we will celebrate missed Wijo and Fedor birthdays - I will
meet with old friends, which I expect to be very fun. I did not see some of them about
a half of a year.
/life :: Link / Comments (0)
Asyncfd by Davide Libenzi got ring buffer.
Implementation allows to post events into userspace ring buffer.
There are several disadvantages though:
- it can not be called from interrupt context.
- there is possibility to lost events when ring is full, since
events are posted unconditionally.
- it has redundant fields in API.
Briefly saying - it goes the way kevent moved, but with additional
problems.
For example I do not see how to solve second problem from above list,
since kevent has (had) own queue of events, which were copied from the queue when
userspace updated indexes, so if userspace ring is full, they would not be lost,
but existing mechanism does not have such a queue at all,
events are just blindly copied and thus can be lost when ring is full.
First problem must be solved too to support tricky usage (as Davide
mentioned in patch description, USB calls might enp up calling it
from interrupt context).
One of the problems Davide solves either by using locked userspace memory (which is a big no-no),
or by using kernel buffers and read(), which is unconvenient and just
breaks the whole idea of ring buffer - having multiple events in userspace without
additional (only waiting) syscalls and having syscalls as thread cancellation points
without additional structures shared between threads.
Timerfd implementation will likely rise the same questions
from Ulrich Drepper as my kevent timers - they are not
needed, instead POSIX timers support should be
extended (although I always provided two patches - for POSIX timers and own API,
maybe Davide will use both too). But POSIX timer events become ready in softirq context, so it is impossible to use above
ring buffer implementation for them.
Timerfd code also uses long type in structures shared between userspace and kernelspace,
which does not work on x86_64 when userspace is 32bit.
Yes, I know, I'm a bastard moron, since I write only bad things in my blog, but hey, I was
not in Cc: list :)
Actually I'm glad people started to work in that area after kevent, that means
that my ideas were correct and eventually Linux will adopt a good solution for that problems.
I think Davide will resolve all issues quickly.
/devel/kevent :: Link / Comments (0)
Super optimized multidimensional trie implementation.
After morning tea and quite interesting optimization lookup for 96 bit key
on 32 bit arch takes about 90-100 nanoseconds! This basically means that
my multidimensional trie implementation outperforms hash table in any workload
with any hash table size. But I need to admit, that it is random values (three
32 bit values without zeroes in any byte, the same are used for hash table tests),
so things can differ in real life.
Memory overhead is about 97 Mb.
Test was performed in userspace.
Update: and the same speed (upto 110 nanoseconds) and memory overhead (about 98 Mb) can be achieved
for 160 bit keys - so they can include bound device index and protocol number.
/devel/other :: Link / Comments (0)
Thu, 15 Mar 2007
Multidimensional trie vs. hash table benchmarking.
This time I generated 2^20 elements and inserted
them into trie and hash tables with different sizes and with
different hash functions.
Lookup was performed for the same elements.
This test allows to eliminate rand(3) influence in lookup performance
and allows to see how badly actual data is distributed in the hash tables and trie.
First data - distribution of elements in different layers
of hash/trie - each point shows how many lookups are needed
(x axis) to get one of the 2^20 elements.

As you can see, mdt trie can access total majority of elements in 3 lookups,
while hash table with 2^20 elements has access 'length' of about 1-3 elements
per hash chain. With decreased hash size peak is smoothed and overall length of the chain
is increased as expected. Number in filename in the upper right corner (like /tmp/jhash_order_19.dist
shows order of the size of the hash table).
I need to say, that mdt trie used in test has maximum
access 'length' of 4 by design per 32 bits. This picture shows
numbers only for one 32 bit entry lookup.
Next data is hash table/mdt trie scalability,
i.e. time per one lookup depending on size of the hash table.
MDT trie has constant access time showed for reference.

Conclusions.
As you can see, for 2^20 elements in the system, mdt trie performs better
than hash table with 2^17.5 elements and about 2.6 times slower than 2^20 elements.
This tests were run in userspace, so increased memory usage (three tries
ate about 285 Mb of memory) and per-4k tlb miss
can play significant role in benchmark.
So, I as author of the trie implementation, can confirm, that it sucks compared
to properly selected hash table as is. But getting into account its additional benefits
like lock-less traversal and constant upper bound of access time, it might worth to have.
I will cook up a patch tomorrow and send it to netdev@ for review with link
to this analysis.
/devel/other :: Link / Comments (0)
Tue, 13 Mar 2007
signalfd/timerfd patches by Davide Libenzi.
Ok, some time after I posted first
version of eventfs,
Davide Libenzi has posted his version for signalfd system call (created originally by Linus Torvalds),
which ended up in a big discussion and subsequent three releases, which
now includes code for timers too. My implementation is smaller
and uses completely different idea - signal file descriptor is processed before
signal is posted into signal mask, so it eliminates some issues and simplifies code.
My implementation does not require additional steps to run in your code (like blocking of the signal to be delivered
through file descriptor as in Linus' variant), and nevertheless I got zero responses on it although
a lot of people including Linus and Davide were in Cc list.
I call it NIH syndrome and it does demotivate me from doing anything related to generic Linux code.
I got that news via LWN:kernel news line.
Eventfs and
kevent
projects are officially closed. I want to specially thank
David Miller,
Zach Brown, Johann Borck, Jonathan Corbet
and all other people for theirs time.
/devel/eventfs :: Link / Comments (0)
New wine.
My local wine shop does not have products of neither 'Koktebel', nor 'Inkerman'
wine factories, so I got something really foreigh - Chile wine.
'Carmenere' by 'Aresti Chile Wine Ltda' factory and
'Sunrise Merlot' by 'Vina Concha y Toro', both are red dry wines.
The former I already opened - well, it has very interesting taste, but I would not say I like
it very much - it is too sour dry wine.
It's time to start working...
/life :: Link / Comments (0)
Mon, 12 Mar 2007
Climbing evening.
Climbing in fetters last friday
did not get without consequences - today it was quite hard to climb,
so I only completed couple of traverses and then tried old traces.
Simper one (6a+) I completed without problems, but next one was a bit more complex (6b+),
which I helped a bit with wrong hold. Next one I tried 7a, but arms were already tired,
so I failed miserably several times. The last one was 6a+ on-sight, which I failed at the end -
shame on me. Although it was quite short training, my body aches virtually in every piece.
And then I was told that my style of life is fundamentally broken and something must be changed
and so on. I'm thinking...
/life :: Link / Comments (0)
Optimized trie implementation and performance.
I've completed implementation of my optimized trie design
and ran several performance tests.
But for the begining let me introduce some test and implementation details.
Trie test one.
1. Allocation.
As I described (link above) in design, each node can cache a value
instead of pointer to the next level leaf, so for the maximum performance
trie must be designed to have keys smaller or equal to pointer size.
Since IPv4 socket lookup uses three 32bit values (source and destination addresses
and value combined of source and destination ports), the best would be
to host three tries each of which is indexed with 32 bit key (32 bit x86 world).
2. Lookup.
Lookup is performed in each of three tries one-by-one if previous one succeeded,
i.e. we do not need to lookup in destination address trie, if source one does not match.
Test two.
96 bit keys created as a concatenation of the three 32 bit values.
Test data.
Each test inserts 2^20 elements (either into hash table with 2^20 elements size
or different (or the same) trie).
Then a lot (about a million) of entries are being searched.
For hash table three random keys are constructed, for the first trie test
three runs are performed for different random data in each 32 bit key
(if previous search succeeded), for second trie test one 96 bit key
is constructed and one lookup is performed.
Each trie test was ran for 4 and 8 bits per node (i.e. 16 and 256 nodes
on each level).
Results.
Name | memory usage (Mb) | time for one lookup (nsecs)
--------------------------------------------------------------------------------------
hash table (size for 2^20) 32 (4+28) or 24 (4+20) **) 300
trie 4 bits, 3 32bit keys 24 250
trie 4 bits, 1 96bit key 683 390
trie 8 bits, 3 32bit keys 96 160
trie 8 bits, 1 96bit key not performed - too big memory overhead (see below) *)
Winner declaration I provide to the interested reader.
Details marked with (*) in the above table.
*) Since each node can only cache 32 bit (on 32 bit platform, on 64 bit platform it can cache 64 bits),
all other bits must be hosted in the higher leafs, so for 96 bit keys we need
to host 64 bits in different layers, each layer takes 8 bits to store and 1Kb of ram,
so we need to have 64/4*1=16Kb per key, given that we have 2^20 keys, it becomes
16Gb, so test was not performed.
**) hash table itself is sizeof(void *)*2^20, which is 4 Mb on x86,
each embedded structure must contain at least 96 bits for key (two addresses and two ports),
two list manipulation pointers (in that case it becomes 20 bytes, so 20 Mb for 2^20
entries, thus total memory overhead is 24 Mb (4+20)), and other parameters needed
for searching - hash value stored (do not recall if it is 100% needed)
and bound device index, so structure takes 28 bytes and total overhead becomes
32 Mb (28+4).
Conclusions.
As you can see, having one big index is not always the best way to perform a search lookup.
Properly designed trie (and lookup algorithm) can be about two times faster
(with 3.5 times higher memory overhead) or 20% faster with essentially the same memory overhead.
Additional notes.
During such high rates there were actually never found any entry at all (i.e. either
hash/trie lookup returned NULL, or one of the three trie lookups returned NULL
and lookup for the next level was not performed), in practice things can be better or worse
for any algorithm, but recall all pros and cons for dynamic and static data structures and its lookup
algorithms.
Results can be very dependend on real data where actual sockets are returned, but not empty
searches.
/devel/other :: Link / Comments (0)
New features in the Linux FS world.
For me at least following two ideas seems the most interesting:
- introduction
of the single
->perform_write() callback instead of prepare_write()/commit_write()
pair, which becomes very complex to handle if filesystem has tricky locking. Even ext3
has some issues with journalling and prepare_write()/commit_write() sequence,
which I discovered in receiving zero-copy
project. That issues were fixed by hack which moves a lot of ext3 processing
completely into ->commit_write() callback as Alexey Kuznetsov told me.
- move to 4k sector size.
Some time ago I decided
to implement my own FS using pages only (i.e. without buffer heads
and related splits, especially with introduction of
->perform_write() callback),
so it would be a good performance enhancement (if it would be ready).
/devel/fs :: Link / Comments (0)
Sun, 11 Mar 2007
Meanwhile on appartments development side.
Winter is over and I can continue development - today
I completed second plaster layer filling where it was required (there is a place where the last - third one
would be good though, but it is not absolutely required),
so essentially my main ceiling only needs painting right now,
or may be I will fill it with the finish plaster with some kind of
a random pattern spreaded over the ceiling before.
Hinged ceiling needs to be rubbed and repainted in some places.
To get a paint I need to move to the development shop where
I will also buy special screws for my table
implementation.
/devel/flat :: Link / Comments (0)
Solve problem in order of appearence.
I've done one of the things I described
yesterday as a short term todo -
I get some beer and played Open Transport Tycoon Deluxe
the whole night (I went sleep about 6 o'clock today's morning)
and managed to win against two computer players on the easiest level.
That was quite challenging in that regard, that playing a strategy game in hammock
with touchpad only is quite complex task...
There are couple of other things though...
/life :: Link / Comments (0)
Sat, 10 Mar 2007
I'm sick.
I'm very sick.
Crap, I wanted to buy couple of easy reading books - some fantasy,
also wanted Pelevin, but instead I ordered "Game theory",
"Graph theory" and "Probabily theory. Markov's chains".
Does anyone know how can I cure myself? I'm really ready for serious steps.
Maybe it is a result of hangover?
I was implementing bits of optimization for trie implementation
(some anonymous reader pointed that my ideas are very close
to judy trie
design, sometimes I think that all CS related ideas are already implemented
or at least described in details), and suddenly decided that I have serious white areas
related to above topics and I needed to fix that. And I fixed.
There is only one problem - I did not complete my
appartment
development process, I have a lot of interesting CS ideas,
I want to play Open Transport Tycoon Deluxe
and my trumpet. And I'm an idiot who reads Nietzsche for interest and fun.
Ugh, forgot, I also do not have time.
Something is wrong, I need to think about it...
Update: Grange presented me a bottle of
"Aligote Koktebel" 2002 year harvest and it is over already
(that was very tasty thingy), I also completed "Massandra Sherry"
(but I do not like it too much - it is very dry) - so I have nothing to drink
and a lot of things to do - not very fun environment...
/life :: Link / Comments (0)
Fri, 09 Mar 2007
Evening.
This was first time I climbed with weights (1kg) on hands and legs.
That was quite interesting and I needto admit - hard, but that was
exactly the purpose of the exercises - 3 hours of climbing in that
fetters ended in my complete emptiness - they only thing I could
do was to sit in a sauna - magically power returned to me in a several minutes,
so I decided to call old friends and move to the pub. I got Fedor
and Irin and Wijo (Evgeniy Burnyakov), we spent several hours in
"Golden Vobla" pub (which is quite bad imo - when they have alive music,
it is impossible to hear each other, theirs beer looks like mixed
with alcohol, at least it is the only place after which my head
aches).
We 'celebrated' old new yar, women day, new Fedor's car,
Wijo's birthday (or we forgot about it...) and other missed events,
and of course our meeting itself.
That was great start of the weekend.
/life :: Link / Comments (0)
Multi-dimensional trie implementatin.
I've completed the first phase of the
extended trie design,
which allows to create non-binary tries. Implementation allows to use different
number of bits per node (i.e. usual binary trie (one bit), trie with 2, 4, 8 and so on bits
used to select an entry), for example trie with 8 bits per node alows to select
an entry in 32 bitfield in 4 lookups. Implementation can work with searching keys of different
length (but key length must be divisible by number of bits per node), which allows
to use that algo with different areas (like IPv4 and IPv6 networks in the same cache,
or several different network protocols in the same tree (UDP, TCP, SCTP, DCCP - they all
can share the same trie, but one byte of the searching key must be different and correspond
to the protocol number)).
Lookup is fast, although not optimal yet (comparable with hash table lookup speed I tested
previously).
Its main problem, as expected, is memory usage - each trie node contains array of pointers,
number of elements in which is created according to number of bits selected per node (for
8 bits per node each node contains 256 pointers, which is equal to 1 kb of memory on x86),
for 2^20 random elements such trie eats about 1Gb of memory, which is unacceptible.
Exactly the same problem existed before and was solved in
second generation
of the trie by caching values in the leafs when they do not share bits.
I'm going to create the same optimization for this implementation,
which should noticebly reduce memory usage and improve lookup performance a bit.
/devel/other :: Link / Comments (0)
Thu, 08 Mar 2007
Women.
You make our live so complex and unprdictible, so you just might not be not loved.
If you do not understand what happens, then you likely live in a poor country
where do not celebrate women day (men cook, make presents and flowers,
walk with children and give a sitting place in transport), which happens to be today.
So, start the new tradition just as is.
Your brightest eyes have burnt my soul,
And moved my mind away.
And now I have the only goal:
To make you happy day by day.
That blank verse inspired by the sence of this day.
/life :: Link / Comments (0)
Transport cards.
I was wrong
when mentioned that moscow transport cards can be read by visa-like mifare reader -
it cant, contactless reader must be used instead, which can be bought relatively easy too though.
/devel/other :: Link / Comments (0)
Wed, 07 Mar 2007
RFID passport was cloned without even opening a package.
Details
shows that new USA passport can be cloned without major problems - key, which
encrypts data in the RFID chip can be easily guessed by other factors without even looking inside
the package containing passport.
If it is so easy, then how easy is russian transport card protection ever? I
cracked some bits in Moscow railway transport tickets code couple of years ago
(and I know for sure that it was cracked fully by other people) until it was extended
(about two times), but I still in doubt about moscow transport card, which
encodes information about entering moscow subway transport and moscow
railway transport in a single card (visa like type (forget actual name, mifare
supports that kind of reading via some devices as far as I recall), which can be read
by freely sold readers), likely there is no protection at all...
I recall this again now - that is what I consider the real hack. If I will
find some (free) money, I will buy a reader/writer and try to check how
things are kept secret. Likely it is not...
/devel/other :: Link / Comments (0)
Linux.conf.au has released presentations.
I've just noticed, that you can download
most of the LCA2007
presentation (in form of slides though).
You can also get video and audio records.
Let's see the people...
/devel/other :: Link / Comments (0)
First release of the eventfs.
Eventfs -
pseudo FS which allows to bind file descriptors to events.
One can bind signal and other events (currently only signals are supported) and poll them using epoll().
It is heavily based on ideas from kevent
project.
I've sent to linux-kernel@ and hackers who participated in
kevent discussion.
Let's see where this will end up.
/devel/eventfs :: Link / Comments (0)
Eventfs.
This is a prototype of the pseudo filesystem created
to bind different events and epoll().
I will implement only signal biding (and maybe POSIX timers binding),
that's all - if there will be no feedback like with kevent, this will
be dropped from the start - I will better continue implementation of the
scalable socket lookup.
Basic idea is to provide set of syscalls, which will get object id (like signal number) and
private data pointer and private kernel interface (used for example by POSIX timers),
which will end up allocating new file structure and descriptor with
appropriate ->poll() callback, which can be used by epoll().
So far it looks like I've found a race in Linux signal handling,
which can lead to the lost signals setup.
Consider following code in do_sigaction() (called from
signal() for example):
int do_sigaction(int sig, struct k_sigaction *act, struct k_sigaction *oact)
{
struct k_sigaction *k;
sigset_t mask;
if (!valid_signal(sig) || sig < 1 || (act && sig_kernel_only(sig)))
return -EINVAL;
k = ¤t->sighand->action[sig-1];
spin_lock_irq(¤t->sighand->siglock);
...
*k = *act;
...
spin_lock_irq(¤t->sighand->siglock);
return 0;
}
If signal() or sigaction() is called from signal handler just before
spin_lock_irq(), it will not take any effect, since the same action will be overwritten
in a process after handler is completed. Man page says that both signal() and sigaction()
are signal safe functions.
/devel/eventfs :: Link / Comments (0)
Tue, 06 Mar 2007
How expensive is cache miss? Tlb miss?
I've ran some trivial tests on my Intel Core Duo desktop to solve that questions for me.
Tests basically allocate a region in memory and perform reading from different addreses in a loop.
I did them in userspace to determine TLB miss price (each tlb entry in userspace on x86 covers 4k).
Here is a code snippet:
#define rdtscll(val) \
__asm__ __volatile__("rdtsc" : "=A" (val))
...
for (i=0; i< num; i++) {
rdtscll(start);
asm volatile(
"movl %1, %%eax\n"
"leal (%2, %%eax), %%ecx\n"
"movl (%%ecx), %0\n"
: "=r"(data)
: "g"(area), "g"(i)
: "eax", "ecx");
rdtscll(stop);
idx = (stop-start)/CPUFREQ;
if (idx >= MAX_RESULTS)
results[MAX_RESULTS-1]++;
else
results[idx]++;
}
Here is an overall picture with results:

Horizontal axis shows number of cycles needed to fetch one entry from memory.
Vertical axis shows number of elements fetched with the same speed.
But let's see into the end of the picture, where the most expensive fetches live (TLB miss):

Above graph has nanoseconds as horizontal axis.
We have one TLB miss roughly per 4k reads - exactly on 4k tlb entry boundary:

Analyzing first picture we can find that most of the time fetching of 4 bytes
from memory takes about 100 cycles, about 1/8 of that time 4 bytes are fetched
for 86 cycles.
Strange results for DDR (or DDR2 I do not remember actually) memory, doesn't it?
That is because they are absolutely incorrect.
Well, not absolutely, but they do not show anything interesting for L1/L2 latencies (only TLB miss
test is useful), just because above 100 cycles is exactly overhead of the second rdtsc
instruction. I've added additional rdtsc and got peak at 200 cycles.
Real memory access is fully hidden behind rdtsc overhead.
This problem is only related to my Core Duo machine:
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 6
model name : Intel(R) Pentium(R) D CPU 3.40GHz
stepping : 5
cpu MHz : 3740.058
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 3
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc pni
monitor ds_cpl est cid cx16 xtpr lahf_lm
bogomips : 7485.68
Second CPU is the same. Never use that processor and rdtsc instruction -
it is implemented extremely slow.
Better check AMD Athlon64 3200 CPU.
Its rdtsc takes 8 cycles, which is very good result.
So, the same memory access test:

As you can see, there are two peaks - around 20 and 65 cycles, getting that rdtsc
is 8 cycles and reducing 4 reading instructions were get about 8 cycles for L2 cache hit and about 53 cycles
for L2 cache miss. Test performs fetch of 4 bytes aligned to 4 bytes boundary, so getting square under peaks
(theirs correlation is about 1/15) we get about 64 byte cache line (4*15) - you know, it is exactly
the cache line size on that processor.
TLB cache miss picture for that CPU is essentially the same:

except that TLB miss happens on 1k boundary and takes 2 times longer than on Intel
(maybe it is x86_64 Linux port which can cover userspace with 1kb tlb entries, I did not check,
but getting, that /proc/cpuinfo shows this: "TLB size : 1024 4K pages,
it might be true).
That's all, what I wanted to say about it - actually I just did not want to
do something serious, so to kill some time I ran that tests.
I hope it will be somehow interesting for you.
/devel/other :: Link / Comments (0)
Mon, 05 Mar 2007
Climbing evening.
That was short training - I only completed several traverses
and two traces - one of them new complex one over green holds in the central sector - I have
only one complex part in it right now, so I plan to finish it without falls
quite soon - although its category is quite high (7a) it is not
that complex (or maybe I just easily complete traces on vertical walls).
I also tried couple of starts from recent (highly-qualified) competition
spent there - but instructors forbid to climb without insurence
higher than three meters, so I only completed several first holds - not that complex
like I thought, but likely I would not completed it without fall since
I always had problems with physical endurance on walls with negative slope.
/life :: Link / Comments (0)
The fastest way of doing tree traversal. Better algorithm to do a socket lookup.
Enjoy.
80484a9: 89 e9 mov %ebp,%ecx
80484ab: 89 d6 mov %edx,%esi
80484ad: 89 c3 mov %eax,%ebx
80484af: 83 e1 01 and $0x1,%ecx
80484b2: 8d 54 24 1c lea 0x1c(%esp),%edx
80484b6: 8d 76 00 lea 0x0(%esi),%esi
80484b9: 8d bc 27 00 00 00 00 lea 0x0(%edi),%edi
80484c0: 8b 14 8a mov (%edx,%ecx,4),%edx
80484c3: d1 ed shr %ebp
80484c5: 89 e9 mov %ebp,%ecx
80484c7: 83 e1 01 and $0x1,%ecx
80484ca: 85 d2 test %edx,%edx
80484cc: 75 f2 jne 80484c0
C code:
struct trie_node
{
struct trie_node *ptr[2], *parent;
};
...
int bit = value & 1;
while (n) {
n = n->ptr[bit];
value >>= 1;
bit = value & 1;
}
return n;
6 operations per loop, just two times more than linked list and with only one
conditional branch!
Its speed is about 60-80 nanoseconds, which is roughly 1.5 times slower than linked list
traversal only because we need to calculate which dimension to take (additional
shift and binary 'and' operations). Very fast, but still can not compete with
hash table with 4-5 elements in each entry.
So, there must be another algorithm.
I've invented one (well, not exactly invented of course, but will implement at least),
which is a bit memory hungry, but allows to get one element out of 32 bitfield in 4
lookups (with only one conditional branch per lookup).
It is some kind of trie/tree extension, where each node becomes not binary one,
but containing multiple fields, which in turn becomes just multidimensional instead
of two-dimensional tree with above access type.
Stay tuned (I need to complete paid work part first).
/devel/other :: Link / Comments (0)
Sun, 04 Mar 2007
Sunday.
It was full of small things.
Main one - frenchs can make cars (couple of photos).
Although they do not expect people with 185 sm height will ever sit in Peugeot 206 XS.
They also do not not expect people with 45 foot size will
use that car.
So, I several times kicked my head and shoulders, and Grange
broke a small spring around cohesion.
After a liter of cogniac, russian sharpness and indecent words
spring was fixed and we continued our celebration.
I ended the (next) day reading Friedrich Nietzsche "That said Zaratustra" - sometimes
very interesting, but frequently I do not understand him.
Also looked into the book which describes different schemes of tax manipulations
used by Khodorkovsky - I live in very interesting country.
/life :: Link / Comments (0)
Kevent.
In case you were confused about this
post that I never used any project to advertise kevent and
this
post about how I used threadlet thread to advertise kevent.
The latter is a very ironic post about the only possibility to
get attention is only by entering some other thread (remotely related)
and start agressively stating own point of view. If you think otherwise - that is
your problems and basicvally you suck.
I indeed did that, but that was not done to show that threadlets suck and kevent is cool.
Absolutely.
I like Ingo's approach of threadlets and agree that it is a very simple and
quite non-problematic case to create highly parallized environment,
but I just think (based on my expirience with threads in Linux) that
threads per IO (or per block) is not a good solution for AIO model.
I do not say, that it is worse than kevent (or any other event-driven model),
I just want to say, that as is it is not the best solution from performance
point of view, and both event driven (kevent or epoll,
the former will be declined almost sure, I proposed another idea - having
new filesystem for each type of events, i.e. signalfs, timerfs and so one,
that approach does have problems and scales worse than kevent (it is bound to
file structure, which is quite heavy), but it allows to have epoll as a
controlling interface) model.
Although it does requre to write new code both in kernelspace and userspace,
and to change existing applications just like kevent,
and likely scales worse than kevent, which is there already, it probably will
be blessed by developers.
But I do not care actually :)
There are a lot of other interesting things (without politics).
Stay tuned.
/devel/kevent :: Link / Comments (0)
Grange became a man.
He bought himself a car!
Let's see how things will go in his case - I sold my after year and a half after
the start, but my bucket with screws was noticebly worse that his one - he is a happy
owner of Peugeot 206.
Let's celebrate this!
Update: Ugh, he did it. We even moved over MKAD about 30 km - I need to admit
that couple of times I always made a shit - he is quite a hard driver, although
really not that bad for the first drive in Moscow (I do not know about others
cities a lot, but moscow roads are the most complex I ever saw), but I think
that will be changed in a couple of weeks. We ended in his home drinking cogniac.
I thought I do not like cogniac, since I drunk quite a few different sorts of
it - from usual one to very 'good' and expensive, but never liked it actually,
but after a half of a liter "Moskovskiy" one it might be not that bad.
Let's continue celebration...
/life :: Link / Comments (0)
Sat, 03 Mar 2007
Climbing evening or training of sevens. [ND]
That was very interesting evening - I did several
traverses and new not so complex traces as warming process
and eventually started to do sevens - old and new traces
of 7a category - I tried new one on-sight - failed of course,
but I think I can do it - I completed most failed
parts on the second attempt after some rest on the insurance rope.
Old favourite trace over red holds in the central sector was not
completed too, but there was another climber on the way, so I tried
part of yet another trace (though I did it some time ago too) of the
same complexity.
That was definitely interesting training.
There is only one bad moment - I burnt a bit a skin on my 'sitting area'
in the sauna - but that can be considered as a lesson to not change a
sitting place...
That was an excellent time.
/life :: Link / Comments (0)
Fri, 02 Mar 2007
How things can endup with some trivial initial misunderstanding. [ND]
(you are never ever wrong, and if you are proven wrong on topic A you
claim it is an irrelevant topic (without even admitting you were wrong
about it) and you point to topic B claiming it's the /real/ topic you
talked about all along. And along the way you are slandering other
projects like epoll and threadlets, distorting the discussion. This kind
of keep-the-ball-moving discussion style is effective in politics but
IMO it's a waste of time when developing a kernel.)
Sigh, we started very interesting topic, but ended
up with some personal insults and drawing things people never talked about and even
remotely based assumptions on - and all just because of some trivial
misunderstanding of the simplest context.
That's pity, but well, if things has moved to that point, fuck them,
there will be tons of others.
It seems I broke possible good relations with some very interesting
kernel hackers, but I never wanted to do any personal or project insults
for my own advertisement. If people can not (or do not) understand that,
then, well, let's just move further.
/devel/kevent :: Link / Comments (0)
Quotation of the week. [ND]
Ingo Molnar wrote in linux-kernel@ in threadlet/syslet thread:
sure, if we debate its virtualization driven market penetration via self
promoting technologies that also drive customer satisfaction, then we'll
be able to increase shareholder value by improving the user experience
and we'll also succeed in turning this vision into a supply/demand
marketplace. Or not?
I'm going to write that mantra in some piece of the memory.
/devel/kevent :: Link / Comments (0)
New wine has arrived. [ND]
Bottle of Massandra Sherry of 2000 year harvest.
Now I need to solve a problematic question: to start it drinking right now
or postpone for evening or tomorrow, since I expect a good climbing training
later today.
Update: I failed to resist nad opened a bottle. Hmm, tasty but Krymsk's one
I liked more. Massandra is more dry and do not have that expressive
taste.
/life :: Link / Comments (0)
Thu, 01 Mar 2007
Further trie/hash table perfromance analysis. [ND]
Ok, let's first check, how does it take to traverse
32 pointers, i.e. single linked list, one-by-one.
Optimized asm output produces something like that:
80484e4: 89 c3 mov %eax,%ebx
80484e6: 89 d6 mov %edx,%esi
80484e8: 89 e9 mov %ebp,%ecx
80484ea: 31 c0 xor %eax,%eax
80484ec: 8d 74 26 00 lea 0x0(%esi),%esi
80484f0: 40 inc %eax
80484f1: 8b 09 mov (%ecx),%ecx
80484f3: 83 f8 20 cmp $0x20,%eax
80484f6: 75 f8 jne 80484f0
Blue lines are actual loop:
for (j=0; j<32; ++j)
n = n->right;
Using rdtsc call I measured that that loop takes about 50 nanoseconds
on my 3.7 Ghz Intel Core Duo machine (including some overhead in above asm
output marked as black lines to get parameters), this roughly corresponds to
(32*4)/50 = 2.6 instructions per nanosecond, which is quite close to theoretical -
3.7Ghz is about 3.7 instructions per nanosecond.
That is about 2 nanoseconds per lookup in optimized linked list.
Ok, let's see how a bit more complex trie traversal algo works.
I've create following loop:
for (j=0; j<32; ++j) {
bit = (value >> j) & 1;
if (bit)
n = n->right;
else
n = n->left;
}
This produced following optimized asm:
80484eb: 31 c9 xor %ecx,%ecx
80484ed: 89 c6 mov %eax,%esi
80484ef: 89 d7 mov %edx,%edi
80484f1: 89 eb mov %ebp,%ebx
80484f3: eb 09 jmp 80484fe
80484f5: 41 inc %ecx
80484f6: 8b 5b 04 mov 0x4(%ebx),%ebx
80484f9: 83 f9 20 cmp $0x20,%ecx
80484fc: 74 12 je 8048510
80484fe: 8b 44 24 18 mov 0x18(%esp),%eax
8048502: d3 e8 shr %cl,%eax
8048504: a8 01 test $0x1,%al
8048506: 75 ed jne 80484f5
8048508: 41 inc %ecx
8048509: 8b 1b mov (%ebx),%ebx
804850b: 83 f9 20 cmp $0x20,%ecx
804850e: 75 ee jne 80484fe
Where blue lines show loop itself, we got about 8 operations per loop
plus two branches - that code works about 4-5 times slower (instead of two
since number of operations increased twice), which means that some operations
take more than one tick.
Code which does not have branches (the same loop as in linked list plus
value bit analyze) takes about 100 nanosecods and 8 instructions
per loop, which is exactly two times slower as linked list processing.
C loop:
for (j=0; j<32; ++j) {
bit = (value >> j) & 1;
n = n->right;
res += bit;
}
res processing added so that compiler would not optimize value
bit analyze away. Asm code is following:
80484e9: 31 c9 xor %ecx,%ecx
80484eb: 89 c7 mov %eax,%edi
80484ed: 89 d5 mov %edx,%ebp
80484ef: 8d 5c 24 3c lea 0x3c(%esp),%ebx
80484f3: 8d b6 00 00 00 00 lea 0x0(%esi),%esi
80484f9: 8d bc 27 00 00 00 00 lea 0x0(%edi),%edi
8048500: 8b 44 24 2c mov 0x2c(%esp),%eax
8048504: 8b 5b 04 mov 0x4(%ebx),%ebx
8048507: d3 e8 shr %cl,%eax
8048509: 41 inc %ecx
804850a: 83 e0 01 and $0x1,%eax
804850d: 01 c6 add %eax,%esi
804850f: 83 f9 20 cmp $0x20,%ecx
8048512: 75 ec jne 8048500
As usual, blue text is loop itself.
So, what do we have?
The most optimized 32-bit trie processing code takes about 7-8 nanoseconds
per bit (200-250 nanoseconds per 32 bit), linked list's overhead is 2 nanoseconds per lookup.
Btw, in linked list processing there is no for() loop, but instead check of the current
entry if it is equal to the list head, it costs about 3 operations per loop instead of 4.
In real life we have a hash table of sockets, each hash entry usualy contains upto 4-5
nodes in the linked list, number of sockets is about 2^20.
Hash itself takes several instructions (socket XOR hash at least) and I will not get it into account.
So, to create competive trie implementation I need to be able to code 2^20 entries into
one lookup of the trie? (one trie lookup, i.e. time per check of one bit of the key, is 4 times slower,
than one pointer dereference in linked list according to above analysis).
That is impossible. Theorem has been solved - I was wrong.
Trie topic is closed, I need to think about better algorithm.
I'm posting it right now in the blog for others to think about analysis (which can be incorrect though),
I would post it to netdev@ in the related topic, but I need to ge home (quickly or I will miss the latest
bus), if I will not mentally find bugs in the analysis, I will post it tomorrow.
If you see something obvious, please drop me a mail.
/devel/networking :: Link / Comments (0)
Sophisticated filesystem testing and benchmarking. [ND]
Filesystems are built on top of physical hardware, which
has set of characteristics which can not be changed - like
search time, linear read time and so on.
Deeply looking into this set of operations, I
came to idea
to implement kernel module which would sit between FS and underlying hardware
and emulate some delays, errors and so on.
Now I think about its extension in two directions - first one to allow
userspace testing of the fs algorithms - like some library which will provide
read/write methods, which will then be emulate disk behaviour - like
writing to the different sector requires a seek delay, so algorithm, which
has highly separated meta and actual data will seek in usual write (to meta data
and to real data), another direction is to be used as trivial kernel module
on which fs can be based initially.
In theory it can be implemented as block layer scheduler, but simpler is just to implement
a block device which will end up with changed timings and redirecting to
usual one.
First step is to implement it in userspace though and test only algos there.
/devel/fs :: Link / Comments (0)
|