Zbr's days.
April
Sun Mon Tue Wed Thu Fri Sat
         
2007
Months
Apr

About TODO Blog RSS Old blog Projects Gallery Notes

Mon, 30 Apr 2007

Unbelievable milestone in appartment development seems to be reached.


I'm talking about my ceiling - I've done it.
Not completely of course, but tomorrow I will paint second layer of the colour and that will be the last major thing I will do with it. Maybe it will require third one, and I have not yet setup a dotty lights, but it already does not matter - ceiling is ready.
I spent major part of the last two days filling its bits with plaster and polishing its surface, I'm dirty as pig (hmm, its a year of the pig, btw) - I washed contiguous layer of the stucco from my face, when I look to my clothes I frequently can not distinguish if some white and grey parts were there from the beginning or it is dust (although getting into account that most of my clothes is quite dark, it is frequently not a big question), I already forgot colour of my floor and how to walk without shoes. And everything is just about my ceiling.
After I paint it and setup dotty lights I can move further - glue wallpapers, setup floor cover, complete corners and another small parts.
That is a room. Next task is to complete a hall and a checkroom. I will start with painting that ceiling, which is pretty short task, since I polished it today, then I can setup wallpapers and another types of wall cover (I plan to create stone-like surface around enter door). Next one will be kitchen, which is quite simple too - ceiling painting (it does not require major polishing), wallpapers, ceramic tiles, floor cover.
The last one will be bathroom, although it is possible that I will start it before kitchen.

That is a plan.
And it can be done (even quickly), since main task is over.

To (greatly) speed myself up I need to have a table, laptop and internet at home.
After I complete main room, hall and checkroom I will start my table design implementation, likely it will happen this week.

/devel/flat :: Link / Comments (0)


Sun, 29 Apr 2007

Great filesystem contest. Tuned mount options. Postmark results.


I've completed postmark tests for tuned mount options, they mostly include noatime option and additional journal=writeback,extents,nobh options for ext3/ext4 filesystems. That options (mostly journal=writeback I think) completely killed the whole tests - ext3/ext4 becomes a clear winer with extremely huge gap between the next competitor. Frankly, I think there was incorrect setup somewhere, otherwise my consperacy feeling suggests that Linux VFS layer is heavily optimized for ext2/3/4 filesystems only (read: specially unfair for all others). Or something is really broken in design. Or (and likely it is the true) my tests were broken somehow...

I've also started fsbench tests, they will be completed tomorrow.

I will clean results up, prepare nice graphs with as small as possible words so that anyone interested could draw conclusions him/herself.
My goal is just to find an existing maximum performance in different setups.

/devel/fs :: Link / Comments (0)


Recent changes in appartment development.


I hope I finnaly completed plaster filling in the whole appartments - today I fixed some hinged ceiling (I start to slowly hate it) badness as long as some real ceiling problems, I really hope that next time I will only paint second layer of the colour and will be able to draw the line, since I think it is already exact that time when development process should start having some progress, like it was before...

/life :: Link / Comments (0)


Sat, 28 Apr 2007

We took Mephody and Irin home.


Back to Ireland. That was great time with them in Moscow, I hope they will return this summer so we could have another trip in canoes.
Meph has plans to move from Limerick Dell office into London (we all like big cities with stream of life), so he likely will meet Abr and his small family (Hola, Tanya nd small Anton!), otherwise Meph thought of moving back to Moscow (which is quite unlikely though).
Anyway, have a nice time.

/life :: Link / Comments (0)


Great filesystem contest. Dbench talk.


Let's descuss a bit dbench benchmark, it is aimed to determine throughput of the file server accessed by several clients. It is mostly write operations, but there are directory and file reads too (for more details check client.txt file in the distro). Size of each operation varies from several kilobytes to several thousands of kilobytes. This is quite small size compared to another benchmark I ran, so it should obey the same observations, and generally it is true, except the fact that some major filesystems behave extremely poor in such workload, which was not observed in another tests. Tuned/default mount options have upto 30% difference for the winner filesystem, which in turn is upto 100% faster than the next one. Since it is mostly writes you might expect who is a winner...

Stay tuned, I did not yet ran reiser4 tests (hope I will be able to patch 2.6.20-rc tree with reiser4 patches from -mm tree) and did not setup described previously maildir benchmark. More tomorrow.

/devel/fs :: Link / Comments (0)


Great filesystem contest. Continue.


All benchmarks are always completed, but I was asked to include another one which is aimed to show real maildir smtp/pop server behaviour. Yusuf Goolamabbas sent me this link, which mainly similar to postmark bench, but with additional syncs. I will add it into the test today.

So far, let's discuss iozone benchmark. It tests file I/O performance for the following operations: read, write, re-read, re-write, read backwards, read strided, fread, fwrite, random read, pread, mmap, aio_read and aio_write. In my tests pread, mmap and aio operations were not tested.
Details will be released a bit later, but I want to say, that I was surprised by ext4 behaviour. Winners are quite predictible in small file size, but big file (starting from 1 MB) size data is very strange - all filesystems behave roughly the same for read operation but with a bit different data for write operations. My guess about good XFS big-size performance was seriously changed - it is good, but ext2 is noticebly faster in write. I know about journalling, but such a difference is a big surprise for me.

/devel/fs :: Link / Comments (0)


Fri, 27 Apr 2007

Things become boring...


Just boring. Work is being done without any brain efforts, own projects are stagnated due to lack of the enthusiasm, there are no flight of the thoughts and artistic fancy. Boring...
I used to cure melancholy with drinks and physical exercises - last week I climbed two times (though quite easy training, but today I even climb on the wall - I got a partner and climbed couple of times, then insured Mephody and Irin who climbed first (and likely the last :) time, that was fun indeed), and drunk three nights with Mephody - this irish (former russian) becomes a bit crazy when he returns to Moscow, and I like it.

But this time things do not help, so I'm sorry if I did not complete things I promised to finish in time, I will do them of course, but in a couple of moments.

/life :: Link / Comments (0)


Great filesystem contest. Introduction.


What is your favorite one?
So far, only postmark tests completed for default mounted filesystems, and also couple of dbench/iozone for default and tuned mounts, and even those results are interesting in some regard.
Let's speculate a bit about postmark benchmark.
This one has been completed for all filesystems for default mount options so far.
This benchmark performs set of random read/write/create/delete operations, each file is usually quite small (size can be random (as in test) or obey some rules for maximum and minimum size (where postmark has some bug which leads to crash, so I did not use that option)). This benchmark emulates SMTP/NNTP/small-e-commerce workload, where system operates with set of relatively small files.
Of course we know that xfs and jfs are bad choise for such load, and the best choise is likely reiserfs (I did not yet tested Reiser4, but I want to).
But... when amount of files is relatively small (10k files) reiser is not that good as we expect, with 30k files it is a winner, but with small difference (maximum 10% though).
We also expect ext4 to be better or at least not worse. This is not entirely correct too. Looking into graphs I would not say I want to use ext4 instead of ext3.

I removed ext2 results (will lilely redo them), since they are too good to be true - in order of 3-10 times better than the nearest competitor, sometimes its speed is in order of 2 times higher than disk speed, which means excessive usage of the page cache, which never shows in ext3/4, so I dropped - will redo later.

So, stay tuned, I will complete other tests (including tuned mount options, mostly notail, writeback journalling and extents) and present graphs here.

Tests are not aimed to show which filesystem is the best for which workload, but to find existing performance limits as a target.

/devel/fs :: Link / Comments (0)


Tue, 24 Apr 2007

Disk performance emulation.


Benchmark I ran on disk emulator did not include disk seek/read/write time, but now I just use pure loop to perform a delay.
To check how numbers are on the real disk and real filesystem I decided to run yet another filesystem benchmark.
I'm writing set of scripts to automatically run xfs, jfs, ext3/4, reiserfs (maybe reiser4) with set of benchmarks to gather different workload results. I plan to test iozone, postmark and dbench (hopefully I will complete setup and scripts to run it the whole night, but it might not be possible to download archives over my miserable 2kb/s link today).
Hardware is quite small - AMD64 Opteron 3500+ with 1gb of ram, single Maxtor 6E040L0 drive in NFORCE3-250 IDE controller (hdparm shows about 57 MB/s buffered disk read speed).
Stay tuned.

Update1: Postmark from NetApp sucks, I managed to crash it (segfault)... But found a workaround, so will use it.
Update2: Started: postmark for big different sets of file, subdir and transaction numbers, dbench for different number of threads and iozone in automatic mode and with throughput test with different number of threads. All is being logged. All tests are performed for ext4, ext3, ext2, reiserfs, xfs and jfs filesystems. After each test fs is remounted, all caches are dropped. I hope they will be completed this night, so tomorrow I will write bunch of parsing scripts to generate nice graphs and show results here. It is even possible to run such tests in automatic mode to detect regressions in each release/patch.
Ok, now I need to go home and have some sleep...

/devel/fs :: Link / Comments (0)


Simple filesystem file creation benchmark.


See, how broken it is.

Simple FS file creation performance

It is purely design bug - during directory entry allocation filesystem code runs through all inodes linked into single linked list and checks where is a slot to put a new entry with given size.
This must be dynamic structure like tree/trie, it can not be hash table, since it might be a situation, when there will not be a place to allocate new hash table as contiguous region, and having several layers of hashes is just a one of the trie implementations (judy array or, as I called my implementation, multidimensional trie). Likely B* tree is the best case, but for log-structured filesystem tree structures requrie quite major overhead (not only data block must be copied, but also its parent and so on til root), I need to think about it some more - likely I will introduce defragmentation process involved in writing - i.e. each writes should create new tree, which will store not only modified block, but create bigger contiguous region to store there file/dir content.

But enough empty words for now, I need to make some bits for paid work (which will bring me second monitor, btw, which should allow even more productive work, although not that much I expect) - project should be completed in a week, but I have not yet found a place for second monitor, so I want to complete it today to start complex task of restructuring computers on the table.
Then I will continue to work on my networking trie implementation instead of hash tables - I will just remove all ifdefs and turn trie on unconditionally, after it is ready I plan to complete set of benchmarks to show that it is possible to be faster not only in lookup (where trie is already faster by design), but including the cost of allocation/freeing of the nodes.

/devel/fs :: Link / Comments (0)


Mon, 23 Apr 2007

Weekend is over.


That was really great time with lots of old friends, only Abr and Tanya were not able to move here from London, we missed you.

I really want to thank all friends who celebrated my birthday, you are definitely the best part of the whole.

/life :: Link / Comments (0)


Fri, 20 Apr 2007

Friday evening.


I had a short climbing trainig - I scratched yet another piece of skin from another finger, so training was not that pleasant as usual, but I completed several traverses and bouldering traces - all of them were quite old and one or two new starts. It looks like I missed one generation of the new traces already - still I climb alone instructors do not allow me to climb higher than 3-4 meters without insurance, so I can only complete starts, there are only few really complex starts I did not finished yet, so something must be done with it...

Later evening I met with Mephody and Irin - they arrived from Ireland to my birthday celebration, which will start tomorrow - we ended up drinking irish wiskey, talking about life and even politics, discussed future plans and ideas, spent an extremely good time on the whole.
I expect perfect weekend has been organized...

/life :: Link / Comments (0)


Power-of-two allocators.


Eric Dumazet has rised an interesting question about existing power-of-two allocator related to no-mmu implementation of the - is it possible to allocate higher-order page and then return part of it as unused (for example if someone has allocated 10-order page and then return 8-order part).
As far as I can tell (I'm not absolutely sure though) it is impossible with SLAB one - each page can only be 'split' into the same-sized parts, so either 10-order, or two 9-order, or 4 8-order, but not one 8-order and one 10-order minus 8-order.
That was one of the reasons I created network allocator, which I proved does fix such power-of-two overhead in the single page, i.e. blocks are combined when freed to form bigger one, and it is possible to allocate exactly requested block not aligned to power-of-two boundary.
But my allocator did not get enough attention (did I say that already for something unrelated? :), so it was a bit postponed.
Let's see, if there will be some interesting suggestions in the thread.

Update: David Howells of RedHat seems to be sure, that it is possible to allocate an order-10 page, then release part of it (say an order-8 subpage). But from the whole thread it seems that he says about no-mmu case, which can work on top of SLOB allocator in some embedded system.

/devel/networking/nta :: Link / Comments (0)


Lemons.


Bruce Schneier wrote an excellent article about how poor products push away good ones in security market. He provides a link to economics nobel prize article about how it is proved in math - it describes situation, when buyer does not know enough about product and seller does not tell the whole info.
Very good written essay, which conclusions and 'steps to reproduce' can be attached not only to the security market but essentially to any other one.

/other :: Link / Comments (0)


Recent appartment development.


Do you recall my ubercool yellow hinged ceiling?
There is no such ceiling anymore - I've completely repainted the whole ceiling (not only hinged part) into pearl colour - looks really cool, but that colour is very sensitive to even small badness in the surface, so I need to put quite a bit of plaster to fix it. Then I will paint the second layer of colour.
Because of that I needed to polish quite a bit of surface (about 10 square meters on the ceiling), which was slavery work. Eventually I touched my genius part and connected emery paper to the corner-grinding machine (bolgarka, this is a must-have instrument for development as long as hammer drill I think), which allowed to speed things very noticebly, but I still was very time-constrained.
Later evening I started to setup water system, but found that I have only one adjustable spanner, which is not enough to properly fixate all details, so I do not have a proper water system yet. About 1 o'clock this night I found, that there is no water in my flat at all (and that is not my fault).
That was challenging to wash away about 1 millimeter layer of dust from my face using drinking water. Today I used the last drops to wash. That is a reason I did not fixed ceiling bits with plaster.
So, I completed a lot of things, but in a long run did not move any further.
Now I like my ceiling even more.

/devel/flat :: Link / Comments (0)


Thu, 19 Apr 2007

Weekend promises to be fun.


Fuel

That is our company fuel for country-side meeting - we (about 16-18 friends) will move to the cottage I rented for weekend in deep 'nature', although territory does have a lot of entertainment infrastructure (restaurant, carting, paintball, horse riding even balloon with parachutes), so we will end up with sauna, shashlik, vodka and fun.

And right now I decided to move away from office to development shop to get some stuff to prepare for Meph and Ira meeting - they will arrive tomorrow from Ireland and will stay in my place for some time. I need to make my loft looking somehow more close to european standards (for example setup water system) and not like development warehouse.

/life :: Link / Comments (0)


Wed, 18 Apr 2007

Climbing eveninng.


To make some more ubnormality in training I decided to run horizontal traverses using only one finger on each arm - that was challenging, but eventually I found a way to move over holds using only one middle finger. Everything was find until I heavily scratched away couple of square santimeters of the skin on right-hand finger, so I was unable to proceed with that exercises. Actually it was stupid training, it does not force neither training level, nor power endurance, so likely I will not do it again.
Also completed several old boulderings and that was all.
Not bad training, but nothing special, and aching finger as a result, but I do not care.

/life :: Link / Comments (0)


Initial implementaion of the simple filesystem in emulator is ready.


It is not filesystem I will work with in future, it does not contain anything of interesting features I plan to implement in a real on-disk filesystem. So far it is simple ext2-like filesystem, which does not even support object removal (and will not actually).
It is used as reference point for emulator testing.
Here is simple test application output, which creates files and directories in simple filesystem:

Creating file /a
Creating dir /b
Creating file /b/c
Reading dir /
 d .
 d ..
 f a
 d b
Reading dir //b
 d .
 d ..
 f c
There are no timings yet, but each access to disk is delayed for each seek and IO operation by appropriate time. I have not ran any heavy testing yet except creation of quite big number of files. But it looks like each nanosleep() with small number of nanoseconds takes about 4 milliseconds on my machine (because of 250 HZ I think), so it is not possible to use emulator for absolute benchmarks, but it still can be used to compare different design approaches.

/devel/fs :: Link / Comments (0)


Tue, 17 Apr 2007

Debian to Ubuntu via apt.


Did I say that Debian sucks? It does. It very fscking does.
It's Etch distro was quite good, but Iceweasel and gnome-terminal crashed constantly (the former crashed several times per day, the latter about once per week), and I was not told by system to report a bug (bug without backtrace is not interested to developers, especially if they are not able to reproduce it), Fedora Core did not crashed, but had a memory leaks likely somewhere in Xorg internals, eventually I installed unstable Debian branch. After one of the updates locales were broken completely and my bug report was closed with words 'we removed locales package since it can not be unintalled', my words about the fact that it broke all terminals and related applications only resulted in 'it is not a bug'. Ok, if it is not a bug of locales/glibc, then it is a bug of Debian.
Since Debian and Ubuntu use the same 'apt' package management tool (which I like very much and stick with Debian due to it) I decided to completely switch to Ubuntu. So I setup its repository, run 'apt-get update dist-upgrade', downloaded more than 300Mb of new archives (using hole in corporate firewall, I do not feel myself guilty, since our admin does not know Linux and setup squid proxy incorrectly, my suggestions and help offer were dropped, so I have miserable bytes per second speed and sometimes it rises to few kilobytes, all that with 1Mbyte/s total channel).
Then problems started to appear - update frequently wanted to overwrite some files, which were shared between old Debian packages and new Ubuntu ones, so until Debian package is removed, files can not be overwritten, since in subsequent Debian package removal needed file will be removed too. So I manually removed some packages, which in turn changed upgrade tree, so later I needed to intall them manually. Some packages like cron, logrotate, rsync, syslogd and logd were updated, but can not be configured likely due to bugs in post-install scripts. I've fixed it by hacking 'update-rc.d' script.
I hope it will not crash and I will stick with it for a long time.

/other :: Link / Comments (0)


Mon, 16 Apr 2007

Climbing evening.


I did not climb since wednesday, so I was in pretty good shape and decided to make a hard one. Actually it was not that complex training, but with jumps, wild screws, traverses and various boulderings.
It looks like helping to interesting people becomes a usual part of the trainig.
At the end I tried to complete a bit complex traverse on the 1.5-2 meters high and failed quite unsuccessfully, so damaged a foot, which is aching quite hard on some moves, but that does not matter, since in climbing movings it seems to not hurt that much, so I can continue if problem will not dissapear until next trainig in wednesday.
So, it was definitely a good one.

/life :: Link / Comments (0)


Filesystem emulator news.


Ok, birthday celebrations are almost over, so I can continue to work on this - so far I implemented initial on-disk format (which is very broken by design, it is a bit intentional though), which does not support removing without major hacks, but I will not fix that.
This on-disk format is a test case for the naive implementation and expected performance - although it should be good for sequential read/write of unfragmented files (I think anything is good enough for such conditions) and probably even for small directory reading (the same performance note here).
It is similar to ext2 format, but simpler. It does not contain any tree inside, just set of indirect links.
So far on-disk format does not support lookup and file size changes, which will be fixed indeed.

Emulator itself does not have a cache to test exactly on-disk format (and find problematic placese like excessive seeks).

/devel/fs :: Link / Comments (0)


Acrypto is officially closed.


Yes, I close acrypto project after more than two years of maintenance.
The reason is simple - I could not convince Linux kernel community to include acrypto into mainline (neither James Morris - previous crypto maintainer, nor Herbert Xu - current one, did not want to include such massive subsystem into the tree), although I did not tried too hard. I continued to support it as out-of-the-tree package until kernel does not support async API.

This day has come - Herbert Xu proposed a new async interface to synchronous by design Linux cryptoapi.

It looks good (although I do consider it as a hack on top of synchronous software-only crypto api - my opinion has not changed after discussion with Herbert more than a year ago :) and lacks a lot of acrypto features (crypto routing, load balancing, autotuning, prio queues and so on), but something can be added if required.
I decided that dropping acrypto and focus on async cryptoapi is at least some progress in that area.
I plan to start porting HIFN 795x driver from acrypto to cryptoapi soon, it should not take too long to complete.

/devel/acrypto :: Link / Comments (0)


Sun, 15 Apr 2007

+1


I used to celebrate middle of the spring each year, so this is it.
Bottle of a good wine, tasty things, good mood...
Have a nice day!

/life :: Link / Comments (0)


Sat, 14 Apr 2007

Friday evening.


Drove a cart with Grange - that was my first time, but it was very fun and quite a bit of extreme. Although I finished with the worst time, I definitely enjoyed the process.

Later degustated and approved black Blavod vodka - quite good drink, does not contain oils and bad after-taste, so I recommend.
It is indeed black (although with green colour).

Blavod black vodka
Blavod black vodka

/life :: Link / Comments (0)


Fri, 13 Apr 2007

Updated TODO list.


I've added 'dropped projects' item, which contains projects which were started and/or completed, but dropped due to various reasons (like kevent).

Also updated 'completed' and 'to be done' parts.
To be done part now includes links to filesystem development, hash analysis and some ideas about robot implementation. Yes, I plan to create a robot. A human killing machine.

Interested reader can checkout my TODO list and provide ideas and links.

/devel/other :: Link / Comments (0)


Friday 13.


It's time for black vodka.

Black blavod vodka and Captain Morgan Rum

My small cellar, likely will be twice reduced today.

/life :: Link / Comments (0)


Thu, 12 Apr 2007

Cache in emulator.


My implementation sucks and make life much more complex than it is supposed to be, so I decided to completely remove directory entry cache and always read content from disk just like I do for usual file content. This means to drop about a half of my code, but that does not matter, things mecome as clear as in the child's tear: filesystem provides set of callbacks to lookup/create/remove/read directory and open/create/read/write/remove file. Directory lookup callback will return an inode, which corresponds to the requested path.
So essentially open_file('/a/b/c/d') will end up with a loop, where several inodes for directories will be looked up on disk, and then file will be opened (it's inode will be stored in the 'process' bitmap of opened files to allow file descriptor access).
Nothing (except inode for file) will cached, so subsequent access to the same directory (for example for different file in the same dir) will require whole lookup again. In theory I can open directories the same way files are opened and thus provide file descriptors for dirctories and store them in 'process' bitmap of free/used files, but I will consider this as an further extension. To test read directory operations and read from/write into multiple files I can use root directory, so no lookups will be performed to open files.
Simple and serves main goal of the emulator - test on-disk format.

/devel/fs :: Link / Comments (0)


Wed, 11 Apr 2007

Climbing evening.


I found new trace (actually new start, since I climb alone and I'm not allowed to climb higher than 3 meters) - it is quite interesting 8a start (I can not climb traces of such complexity, but can complete about 5-6 first holds), which got almost all my attention this day. I also completed several traverses and old boulderings, but that start was extremely interesting itself.
Trace is started on negative slope on the left sector over vinous holds, it is called "sketch in vinous colours".

/life :: Link / Comments (0)


Labradors.


Collegue has a labrador (female) dog, which gave birth to 6 puppies (3 boys and 3 girls), one can find family photos here (direct link to puppies photos).
I'm even thinking about getting one of them, which is very unlikely though getting into account my style of life...

/life :: Link / Comments (0)


Tue, 10 Apr 2007

Different tree/hash algos in filesystems.


#linuxfs irc channel (at irc.oftc.net) discusses performances of different tree/trie speeds - b-tree, radix tree, rb-tree, trie and hash tries.
So far simple algo like trie with higher than 2 bits per node was not discussed, although it is likely the fastest algorithm (at least from what I know and tested) - end node can be accessed in no more than logN(L) lookups, where N is number of bits per node and L is a key length.
Such trie has another huge advantage of not requiring balancing, so it can be completely lockless for traversal and modification ecept cases when node is simultaneously accessed and removed, so RCU mechanism must be used.

To determine on-disk trie usage let's check possible scenarios:

  • readdir() - it traverses all directory entries one-by-one, so esentially it does not matter what algorithm is used - it can be list or tree or whatever.
  • get inode by its number - 64 bit key is used (or 128 bit if we want really big filesystem), inode number can be split into subkeys to be used in trie (or can be used in any other binary or whatever tree or hash table), but much better case is to use on-disk block number as inode number, so inode can be accessed with single seek just into place.
  • file access - read/write blocks inside fragmented file. File position must be transferred into on-disk block number. If each block number would be put into own tree node, it would require about O(log2(N)) seeks to get needed block (where N is number of blocks in file), but balancing data on disk is very heavy operations, so usual binary trees like rb-tree can not be used, since each addon/remove (write/truncate) of the data requires rebalancing. B-tree can be used, so each contiguous range of blocks on disk would correspond to range of file positions. Essentially trie and b-tree are similar, but the former requires very small rebalancing, which does not affect neighbour entries. For exact match setup it is perfect (besides some storage overhead), but for range setup it is trickier.
    Let's consider following situation: we need to find a block number of 0x1234 file position.
    Let's say we have 8 bits per node trie, so second level can be accessed in position 0x34, and atual data block can be obtained from position 0x12 in second layer, referenced from 0x34 position. So far so good, but what if we have contiguous block of 0xff00 bytes, so file position 0x1234 is inside that block, which can be addressed by single lookup (start at 0x0 and size 0xff00 plus offset 0x1234), so in abose setup we just wasted a lot of space. This needs more thoughs.
  • directory lookup like open('/aaaa/b/cccc/defg'). Each directory entry is an array of chars, so radix tree would be very good to store entries, but trie with 8bits per node will be even better (except that it again has some storage overhead - it will use 256 entries to store next pointers, if directory contains only one entry, last 255 of them will be vasted). So with radix tree number of lookups is equal to number of bytes in the directory entry, in trie it can be smaller (down to one) and equal to number shared bits divided by number of bits per node between several directory entries (i.e. for strings '/test' and '/test1' and 8 bits per node there will be 4 lookups in '/' dir with 255*4 entries overhead, while with radix tree there would be 4 or 5 lookups with zero overhead, for case of '/test' and '/qwe', trie would allow to get directory entry in one lookup, while radix tree requires 4 or 3 ones).

/devel/fs :: Link / Comments (0)


2007 Filesystem workshop slides.


Here I posted LWN and wiki liks.
One can find actual presentations (slides) here (Usenix).

/devel/fs :: Link / Comments (0)


Mon, 09 Apr 2007

Climbing evening.


THis time it was simple - I usually do not try to have hard training each monday - mainly because weekends are quite light in food regard (I still do not cook at home) and frequently are a bit drunk.
So, this one was simple - I completed a lot traverses (without weights) and several boulderings. Teached interesting people a bit again.
That was a good time.
At the end tried to spent more time in sauna with cool water in between - it was quite good, but later I felt myself much worse - tired and got alot of body aching, so I can say I do not like such bathing...
Next day (I write it thuesday evening) it aches even more, so the best result of sauna (at least for me) is to spent there about 10 minutes (with more than 90 degrees Centigrade) and move under cool shower without next turn.

/life :: Link / Comments (0)


Filesystem emulator.


I've somehow completed on-disk format and created simple filsystem implementation.
It is quite broken, since lookup is performed in a list-like manner - pointer to the first group of blocks is only accessible from higher layer, so to get tail of the file, which contains of the several data inodes, all tails must be checked one-by-one. That is quite wrong.
On-disk format does not support easy deletion (actually code does not even exist due to that) - directory entries are placed one-by-one, so there is no way to remove one from the middle of the set - that can be solved by introducing start/len pairs for dentry, but it is not done.
I do not like on-disk format even without any tests, but that will be a starting point - something worse that that just is not allowed to exist.
Implementation lacks proper lookup actually - mainly because of the VFS layer implementation limitations - it is only possible to lookup through in-memory inodes, so if something was removed from memory (file closed), it can not be found anymore.
Besides the fact, that it is very broken even without tests, it is quite interesting.
Main tasks to complete are:

  • on-disk lookup mechanism
  • correct on-disk removal
  • testing, testing, testing (there is no code yet)
Not too much work, actually...

/devel/fs :: Link / Comments (0)


Sun, 08 Apr 2007

Filesystem emulator.


I've create directory entry cache instead of inode cache, so file inodes do not carry unused info anymore.
Also started on-disk format implementation - it is pretty simple and similar to extended filesystem format - there are two types of structures on disk- inode and dentry, inode is a data containing object, it has either set of data block "pointers" (block number inside storage) with appropriate block sizes (curently 64 block pointers per inode) for file inode or set of dentry structures for directory inode.
Dentry structure is a structure which describes object inside parent directory, so it contains name, its length and some flags (like inode type).

Inode itself also contains additional attributes (like access time, create time), "pointer" (block number) for the next inode in the chain (if block numbers inside one inode is not enough to cover all data), the rest of the inode (it is aligned to filesystem block size (equal to 4k) on disk) is not used, but will be - I plan to put there file's data (if file is small enough). Directory inode is aligned to the same filesystem block size, but after common data it has dentries instead of block pointers and file's data.

Such design allows to group together directory content (only dentries actually, but they can be extended to mirror (or exclusively contain) parts of the inode information), which will (in theory yet) increase directory reading.
I create filesystem interfaces to be easily extendible to allocate data not into blocks, but extents (which will decrease seek time greatly). Although I was against them, my emulator is simple enough to not contain easy mechanisms for delayed allocation (mainly it does not have page cache), so I will use extents for now.
Extents can also be used to group dentries.

Filesystem is pure 64bit, but I have plans to extend it to 128 bits (hmm, just because I can), it is endian-safe (code in emulator does not convert data, but I use __le* types to remind myself where it is needed).
So far, on-disk format is not completed and I work on initial fs implementation in emulator.

/devel/fs :: Link / Comments (0)


Fri, 06 Apr 2007

Climbing evening.


THat was interesting one - although I did not climb for too much, but I helped other interesting people to start with theirs traces.
My training contained moves with weights on arms and legs: a lot of traverses and some repeated 4-6 holds parts of the vertical trace - I usually completed about 3 times of 5 holds up and down, so I think I could complete the full trace with upper rope insurance - maybe not on-sight, but eventually after couple of starts - that was starts on negative slope upto first insurance point for the bottom rope.

/life :: Link / Comments (0)


Opera.


Am I alone who completely do not hear any rhyme in opera?
I'm listening for Beethoven's Fidelio and it looks like the main goal of human/chorus voices is to create a special sound which is in a perfect harmony with orchestra. And although it is in German (which I happily forgot several years ago), I frequently fail to extract exact phrases and words, but only voice music.
I actually never visited an opera and only listened records, so this moment is quite new for me.

/life :: Link / Comments (0)


File system emulator.


Ok, I've completed operation, VFS and storage layers (layer design picture).

I started to think about simple filesystem on-disk format. Filesystem must implement only four callbacks in my emulator:

  • create() - which allocates and fills new inode of given type. It also should put on-disk representation of the inode into storage.
  • remove() - remove on-disk representation of the inode from disk, mark appropriate space as free and release all resources attached to inode.
  • read() - read data from storage into provided buffer for given inode.
  • write() - write data to storage from provided buffer for given inode.
As you can see - it does not differentiate between directories and files, but files are referenced as file descriptors from operation layer, while directories can not be accessed from that layer (except dir_create(char *path) and dir_remove(char *path) operations). This means that each inode which represents a file has excessive hash table (used to form directory cache for directory inodes). If I will find in tests that this does hurt performance (and mainly if it will not allow to clearly detect problematic cases for on-disk format) I can split it into two types of structures - directory entry (struct dentry in Linux kernel) and usual files, both will have inode structure, but without hash table, which will be moved into directory entry.

VFS file/directory create operation imlemented in recursive way, i.e. when file_create("/a/b/c/d") is called, and end directories in that path do not exist, they will be created.

/devel/fs :: Link / Comments (0)


Joke - game of words in russian.


Rumors say that english-speaking programmers are very curious why russian programmers say "your bunny wrote" each time they strike against some problem.
Ask your neighbour russian-speaking person about it (note: it has quite undecent meaning).

/other :: Link / Comments (0)


Thu, 05 Apr 2007

Filesystem userspace emulator.


Here one can find a design picture of my filesystem emulator.
It works on top of so called storage, which can be programmed to emulate any kind of storages. Currently it works on top of usual memory with configurable delays for appropriate operations. Memory actually should be non-swappable and mapped via hugetlb, but it is not done yet, it is not a complex task to add a several lines into storage initialization path, so it will be postponed a bit.
VFS layer should be non-swappable too - main goal of the emulator is to find perfect on-disk format, so each read/write should go directly into storage.

Currently only storage layer exist and I work on VFS and operation layer, when they are ready I will start to implemt simple filesystem, eventually it will be extended to something I will work with in kernel and real disks. Main features of the filesystem were highlighted already, here is a brief set:

  • log-structured filesystem (absence of the journal, write anywhere filesystem layout)
  • snapshots
  • checksums (with possible recovery using Reed-Solomon codes - think about RAID functionality in your single-disk laptop without big overhead)
  • high performance (delayed allocation, as small as possible number of seeks to get both file data and directory content, directory readahead (directories must be packet closely to each other - the same as files), self-defragmentation
  • plugin layer (encryption, compression, networking support) - optional
  • distributed facilities (thinking about):
    • FS should be able to be part of the remote storage with invisible to user updates and syncs (think about laptop connected to server, the former shares some data with the latter, while offline laptop works with own copy, which then can be synced (merged by 'hands') with server when online)
    • in case of Reed-Solomon (or other) data encoding, checksums and recovery information could be stored on different machine
    • data can be stored in distributed manner on several machines (think about RAID5 on 5 machines, one can be turned off and data can be recovered in real-time from others, or any level RAID in datacenter)
Quite ambitious project (hmm, the more I work, the higher I want to move), but I'm quite sure I can complete it (main features are log-structured and snapshots part and distributed facilities with excessive encoding) - there is only time constraints.
After I complete userspace emulator (and simple fs) - I plan to finish it this week - I will continue to work on network part - mainly unified lookup storage for different socket types (netdev@ thread, although it was met quite cold, I think idea is interesting, so I will continue) and network allocator, which I recall again and again with power-of-two allocation overhead in existing SLAB allocator.
Then I can start real fielsystem implementation.

/devel/fs :: Link / Comments (0)


Wed, 04 Apr 2007

Climbin evening.


This training was initially started with weights, but eventually instructor asked them back, so I ended with boulderings.
Jumps on negative slope after couple of wild screwings made the day - I like them a lot, even although sometimes strike fingers and other parts of the body against the wall and big holds.
That was excellent training, although qute short.

/life :: Link / Comments (0)


Tue, 03 Apr 2007

I want a new year!


I want to ask grandfather Frost (aka Ded Moroz also known as Santa Claus) to bring me this ARM-based OKI H5003 header board with 32Mb SDRAM, 8Mb flash, USB, GPIO and other goodies (no ethernet though, so I'm not 100% sure).
I would also ask to give me ability to stop main universe timer, so that I could complete my projects.

Meanwhile, here is a picture of that board:
OKI H5003 board

The best variant would be either LPC-L2294 with 8Mb SDRAM, CS-EP9302, but there are no such boards in russian resellers.

Update: No new year this time - I will wait until ethernet power board will be available in russian resellers and then order one to perform various interesting things with it - namely 1-wire (w1 in Linux kernel) simulator and GPIO based bus master, and various wireless related projects (RFID, GSM).

/devel/other :: Link / Comments (0)


Mon, 02 Apr 2007

Climbing evening.


Since I spent most of the time in total lazyness, I decided to move and have a good climbing training at least.
So, this one should be simple enough, since friday climbing with weights still influenced, so I decided to invent several new boulderings and have technically good traverses - i.e. without too heavy tireness. In theory it sounded good, but in practice I ended up with jumping boulderings on negative slope - that was very fun movings, wild screwings and jumps made the day.
It was excellent trainig, and I managed not to be tired as usual.

/life :: Link / Comments (0)


Sun, 01 Apr 2007

Hashes.


#define f1(x,y,z)   (z ^ (x & (y ^ z)))		/* x ? y : z */

#define K1  0x5A827999L			/* Rounds  0-19: sqrt(2) * 2^30 */]

	for (i = 0; i < 20; i++) {
		t = f1(b, c, d) + K1 + rol32(a, 5) + e + W[i];
		e = d; d = c; c = rol32(b, 30); b = a; a = t;
	}
Where $a, $b, $c, $d and $e are parts of the digest, W[i] is i'th word of the input. Does it look similar to this one:
{ \
	a -= b; a -= c; a ^= (c>>13); \
	b -= c; b -= a; b ^= (a<<8); \
	c -= a; c -= b; c ^= (b>>13); \
	a -= b; a -= c; a ^= (c>>12);  \
	b -= c; b -= a; b ^= (a<<16); \
	c -= a; c -= b; c ^= (b>>5); \
	a -= b; a -= c; a ^= (c>>3);  \
	b -= c; b -= a; b ^= (a<<10); \
	c -= a; c -= b; c ^= (b>>15); \
}
For me it does - it is roughly the same GF(2) and GF(2^32) transformations, but with different arguments and operations, but the ground is the same.
So, I have quite ambitious goal to write a similar to this code for new type of the hash. But I'm not sure if I will have a time, but at least it is interesting. Quoted above code snipet is part of the well-known hash, other 3 parts are essentially the same with different f?() function.

/devel/math/hash :: Link / Comments (0)