Zbr's days.

Sat, 30 Jun 2007

Wine cellar.


My cellar now contains only a bottle of Martini - yesterday the last bottle of black vodka was finished.
And I was told (even suggested/recommended/asked) to drink it not with the usual company, but with completely different people.
Doubts rarely visit me, but right now they have come.
Thinking...

/life :: Link / Comments (0)


Distributed storage niche.


After drinking a bit of vodka with Grange, we discussed the niche of my distributed storage.
Briefly (what it will be): this storage is a block device which connects several disks over the net. It will allow implementing different redundancy mechanisms, giving it self-healing abilities when underlying storages die.
That is it.
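
To make the self-healing idea concrete, here is a minimal sketch of the simplest redundancy mechanism, XOR parity as used by RAID5. It is only an illustration in Python, not the storage's actual code: the function names and the three example "node" blocks are assumptions made up for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same size")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Parity block stored on an extra node (hypothetical helper)."""
    return xor_blocks(data_blocks)


def recover(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three data nodes plus one parity node; node B dies and is rebuilt.
nodes = [b"node-A..", b"node-B..", b"node-C.."]
parity = make_parity(nodes)
rebuilt = recover([nodes[0], nodes[2]], parity)
```

This scheme survives exactly one dead node per group, which is the RAID5 limit; anything stronger needs different codes.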
Now, who are the competitors?

  • First and the simplest one is the network block device. It allows connecting a remote storage, but one device can handle only one remote peer, so to build a storage one needs to group the peers connected via network block devices using, for example, LVM. Device mapper on top of network block devices allows using existing redundancy mechanisms like software RAID. This looks simple and good, but there are some problems: one device per remote node, which essentially does not scale to really big storages, since the number of devices is limited; static configuration (the local node can not determine the size of the remote storage, that data is provided during configuration before startup and can not be changed after the device was started); a special userspace process is needed, on behalf of which all transactions are performed; the network block device uses an excessive acknowledge protocol, which creates a huge overhead under a flood of requests; global locking.
  • High-end SCSI storage with hardware redundancy. This one is limited by the price - a two-unit server can only contain about 14 disks and its price is already too high; if servers are connected via a special interconnect, the price increases even more, so for a petabyte (just an example) storage the price will be about the same as the number of bytes. Hardware RAIDs are complex to work with, especially when disks start dying. There are no hardware RAIDs with more than RAID6 redundancy, which tolerates just two dead disks per array. RAID6 price is too high. More common RAID5 controllers only tolerate one dead disk per array.
  • High-end storages with different interconnects (like iSCSI), which actually can be cheaper, but still require LVM/Device mapper/whatever to build a single storage on top of them. They do not scale to really huge volumes because of the price, just like usual storage systems.
So, essentially, distributed storage allows building a huge-volume storage on existing network infrastructure without any special expensive components, as well as on additional high-performance low-latency interconnects, while providing a set of features which existing systems either do not have, or offer only with a setup that is neither simple nor high-performance.
The approach I use is quite different from the best known and widely used Lustre storage by Cluster Filesystem Inc (where I was asked to work some time ago :)). There are a number of very similar designs out there, which I was privately pointed to a couple of times. One of the differences is that distributed storage does not require any special API to work with it: it looks just like a usual huge storage connected locally to the node.

I was told that I will not be able to complete such a task, that there is no need for it, that there are better variants and so on, but frankly I do not care - I like the process, so there are no problems for me.

/devel/dst :: Link / Comments (0)


Fri, 29 Jun 2007

Climbing evening.


That was a good training - I completed several new traces and a set of old ones, then tried a couple that are a bit more complex than usual, but failed. One of the traces goes over passive yellow holds in the central sector - I do not understand who gave it a 7a category, since it is much more complex than another 7a I tried before, so it was not a surprise that I failed. Then I tried a new 6b trace over blue holds in the central sector, part of which goes over a horizontal negative slope - I generally perform quite badly on such parts due to insufficient power endurance, and it was the very end of the training, so I failed a couple of times on that part, but completed the vertical wall part of the trace without any problems. I'm sure I would have completed that trace if I had started it earlier, without the previous complex runs.
Anyway, that was a good time there.

/life :: Link / Comments (0)


Thu, 28 Jun 2007

Device mapper sucks or recent changes in distributed storage development.


And thus will not be used.
It has a serious limitation in its table format - each line in the table must form a contiguous region, and it is not allowed to change the limits (start and size) in the constructor, which effectively breaks dynamic configuration of the storage.
So, no device mapper - I will use a raw block device and create a proper binary protocol to configure the device from userspace instead of the string-based one used by device mapper.

So far I have only added a couple of bits to dynamically add/remove storages and encoding algorithms, which can be revisited later if needed.
Each algorithm is an entity which sends/receives data to/from nodes (and reads/writes via the usual block IO path if the node is local, which is supported too); it determines whether several nodes need to be read/written (for example to recover a degraded array or to update a checksum) and performs all the actions.
The initial trivial algorithm will use a round-robin mechanism to write sequential data to several nodes.

/devel/dst :: Link / Comments (0)


Device mapper stack.


Let's see how block IO requests get remapped in the device mapper stack.
Here is a trace chain for usual IO operation:

  • suspend/resume ioctls of the appropriate block device -> __flush_deferred_io()-> *)
  • generic_make_request() for appropriate block device ->make_request_fn()->dm_request() -> *)
*) ->__split_bio()->__clone_and_map()->__map_bio()->target's map callback.

First codepath is used to flush all in-flight BIOs on demand and is not a common path.
Second one is a common way. It is performed in process context (either via ->readpage() and do_generic_mapping_read() or via pdflush during writing) and thus can sleep, there is no serialization lock anywhere in the path, but block layer ensures that only one make_request_fn() callback runs at a time for given queue. In case when make_request_fn() is already running, new BIO will be queued.
This allows not to implement any kind of locking in target's mapping function, but there is an issue with dispatching work for different remote storages.
Let's assume we have two remote devices, one of which is connected via a very slow link (I seriously doubt one can build a distributed storage on top of slow links, but it is an example - a node can be overloaded or down and the effect will be the same), so sending/receiving data to/from that device will take a lot of time. If the processing function sleeps, requests for the fast device get stuck behind it, since only one make_request_fn() runs at a time per queue; this essentially means that the target's map callback should not be allowed to sleep and thus will not perform any processing itself.
Having async AIO or at least in-kernel event dispatching would be great here, but I closed kevent, which would fit perfectly here, so I will implement some ->poll() based dispatching/awakening mechanism to process non-blocking requests on behalf of a special thread.
So, the target's mapping callback will issue a non-blocking request without queueing (checking ->poll() first is a good idea); if it is not fully completed, the rest will be done by the dispatching state machine in a dedicated thread.
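
The "try once without blocking, finish the rest elsewhere" scheme can be sketched in userspace. This is only an analogy in Python, assuming a socket pair in place of the remote node and the standard selectors module in place of ->poll(); the Dispatcher class and its method names are made up for the example. In the real design run_once() would be the body of the dedicated thread's loop; here it is called inline to keep the sketch deterministic.

```python
import selectors
import socket


class Dispatcher:
    """Finishes partially sent requests when the socket becomes writable."""

    def __init__(self):
        self.selector = selectors.DefaultSelector()
        self.pending = {}  # socket -> bytes that still have to be sent

    @staticmethod
    def _try_send(sock, data):
        """Non-blocking send; returns the number of bytes accepted."""
        try:
            return sock.send(data)
        except BlockingIOError:
            return 0

    def submit(self, sock, payload):
        """The 'map callback': one non-blocking attempt, never sleeps."""
        if sock in self.pending:
            # Keep ordering: queue behind the unfinished request.
            self.pending[sock] += payload
            return False
        sent = self._try_send(sock, payload)
        if sent == len(payload):
            return True
        self.pending[sock] = payload[sent:]
        self.selector.register(sock, selectors.EVENT_WRITE)
        return False

    def run_once(self, timeout=0.1):
        """One step of the dispatching state machine."""
        for key, _ in self.selector.select(timeout):
            sock = key.fileobj
            data = self.pending[sock]
            sent = self._try_send(sock, data)
            if sent == len(data):
                del self.pending[sock]
                self.selector.unregister(sock)
            else:
                self.pending[sock] = data[sent:]


local, remote = socket.socketpair()
local.setblocking(False)
remote.setblocking(False)

payload = bytes(range(256)) * 4096  # 1 MiB, usually more than a socket buffer
dispatcher = Dispatcher()
dispatcher.submit(local, payload)

received = bytearray()
while len(received) < len(payload):
    dispatcher.run_once(timeout=0.01)
    try:
        received += remote.recv(65536)
    except BlockingIOError:
        pass  # nothing to read yet

local.close()
remote.close()
```

The point of the split is that submit() returns immediately even when the peer is slow, so a stalled node does not hold up requests for the others.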

/devel/dst :: Link / Comments (0)


Job status.


It looks like companies do not want me to work with them on a part-time/contract basis, so things will be exactly like they were before - free time after my pain^Wpaid work will be devoted to hacking.
Right now I do not have that much time, unfortunately, but I will definitely make some progress.

/devel/other :: Link / Comments (0)


Wed, 27 Jun 2007

Climbing evening.


Lazy-boom-slacker Grange was not able to get his ass out of a traffic jam, so I climbed alone. As a result there were no new traces, but a number of interesting boulderings and some old traverses. As a next-level result I would mention a damaged shoulder and leg after wild jumping, and some interesting new starts (with a couple of heavy falls onto my back from 3-4 meters high).
Since I did not climb high, there was plenty of time to talk with other climbers on the floor...
It was fun training.

/life :: Link / Comments (0)


Distributed storing dilemma.


I have a question about the way data should be organized in distributed storage.
There are two possible ways:

  • sequential data is stored linearly - the first storage is filled until it is full, then the next one and so on. Of course, using error recovery codes requires other storages to be used too, but the amount of CRC data is usually noticeably smaller than the amount of actual data.
  • sequential data is distributed among several storages according to some algorithm (like round-robin).
The former increases random access speed, the latter sequential access speed. Taking into account that even a 1Gbit network has about two times faster bulk transfer speed than modern disks, the priority of this issue increases.
Likely round-robin writing is a good candidate for the first implementation.
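
The two layouts differ only in how a logical block number is turned into a (node, offset) pair. A minimal sketch, assuming equally sized nodes and fixed-size blocks (the function names and sizes are made up for illustration):

```python
def linear_placement(block, blocks_per_node):
    """Fill node 0 completely, then node 1, and so on."""
    node, offset = divmod(block, blocks_per_node)
    return node, offset


def round_robin_placement(block, num_nodes):
    """Spread consecutive blocks over all nodes in turn."""
    offset, node = divmod(block, num_nodes)
    return node, offset


# Where do the first six sequential blocks land?
linear = [linear_placement(b, blocks_per_node=1000) for b in range(6)]
striped = [round_robin_placement(b, num_nodes=3) for b in range(6)]
```

With the linear layout all six blocks sit on node 0, so a sequential read is served by a single disk; with round-robin they alternate over the three nodes, so the same read can be served by all of them in parallel.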

/devel/dst :: Link / Comments (0)


Tue, 26 Jun 2007

First results in distributed storage development.


Pretty trivial: I've created a device mapper target which sends data pages over the net. It can be configured to work with any kind of network media (including various sockets). Well, it is similar to the network block device, but does not require a special process. BIOs remapped in the distributed storage target operate on pages, so with an appropriate jumbo frame setup there should be no copies at all (the code uses ->sendpage()).
Right now the code only sends data and does not receive it at all, so an appropriate protocol should be designed.
There is also an open question about how BIOs/pages should be remapped.
Device mapper supports three return codes for the map callback (which are not documented actually):

  • DM_MAPIO_SUBMITTED - this means that the target code will process the BIO by itself - either put it and thus call the end_io() callbacks, or submit it to other layers.
  • DM_MAPIO_REMAPPED - this must be returned when the target remapped all fields in place and will not work with the given block IO request anymore, so generic code can call generic_make_request(), process that request further and eventually put the BIO.
  • DM_MAPIO_REQUEUE - supposed to be used when the BIO should be requeued at end_io() time. It is used only in the multipath target.
There will be a map of the pages for local and remote nodes, according to the appropriate redundancy self-healing algorithm, so local BIOs will be moved directly to the next layer via generic_make_request(), and BIOs/pages for remote nodes will be processed accordingly. In both cases the DM_MAPIO_SUBMITTED path will be used (like now).
There might be a problem when the same BIO contains pages for both local and remote nodes; in that case the BIO vector will be changed to only contain pages for the local node.

Another problem I see is how to dispatch reading and writing requests without locking the channel, so that during a single write request it would be possible to read other ones.

Thinking...

/devel/dst :: Link / Comments (0)


UK visa.


I got a call from the UK embassy about 30 minutes ago, though they did not speak to me: the conversation was with the accounts department, which did not know about my plans and did not answer the embassy's questions, besides confirming that I do work there.
I want to think that it will not be a show-stopper.

/devel/other :: Link / Comments (0)


fsblocks and buffer heads.


Nick Piggin from SuSE announced several days ago his rework of the buffer_head interface, which is used as a layer between the block layer and filesystems. The main goal of that interface is to obtain a memory region which directly represents the content of the storage when read, or a memory region which will be written to the given position on the storage. Buffer heads have a number of disadvantages, mainly high overhead and the possibility of deadlocking writeout (i.e. writing a page to disk in order to free it might require an additional allocation). The interface is not that good either.
Fsblocks are supposed to fix that. Although they already have a set of advantages over buffer_heads, not everyone is happy with the approach - namely the XFS guys, who want to be able to map arbitrary blocks to storage, namely extents, so a better name was suggested - 'buffer_heads on steroids' - only due to the existing limits of both buffer_heads and fsblocks. So far only the minix filesystem has been converted to fsblocks, so there is quite a lot of work yet to be done.

This discussion made me recall my fs design notes. One of the main things I wanted (and still want) is to avoid using buffer_heads at all (well, all that is needed is to map a page, without the ugly requirement of having a buffer_head for each block which can reside in the given page). I.e. each object on the storage is always aligned to the page size (PAGE_CACHE_SIZE actually, which is 4k): each inode will have about 100-200 bytes of control information, and the rest of the page will be filled with the file's data or directory entries. That would greatly speed up small file lookup and processing, as well as directory reading (including fairly trivial directory readahead, the absence of which is a serious limitation of the ext2/3/4 filesystems and leads to major directory reading performance degradation when a directory contains a decent number of files/subdirs).
Such an approach can be described as something which takes the good parts of both extents and delayed allocation.
So far it is a silent sleeping idea, maybe I will discuss it on KS.

/devel/fs :: Link / Comments (0)


Distributed storage blog tag.


One can track this tag for information about my distributed storage development.

So, to recall how to write in C (I have not done that for about a week or two), I'm writing an initial implementation for the multiple device stack (the same one that supports software raid and dm-crypt). It obviously will not be the final revision, will not support the interesting configuration techniques I am thinking about, and will not have special failure-aware encoding (like Reed-Solomon or WEAVER); it will only be used to create a storage on top of two devices - a local block device and a remote one.
The initial goal is not even to achieve a sequential reading speed equal to the sum of the speeds of both devices - only to make it really distributed and to recall what the block layer is. The last time I worked with it was about 4-5 years ago, when I created an asynchronous block device to test the acrypto async crypto layer.
Taking into account that I ported dm-crypt to acrypto fairly quickly, that should not be that complex a task. Expect some news tomorrow.
Switched off...

/devel/dst :: Link / Comments (0)


Mon, 25 Jun 2007

Climbing evening.


That was great training - several new traces, bunch of old interesting ones, jumpings at the end - just bloody perfect.
My physical form seems to be in shape now, so I can start new complex traces and work hard on non-physical tasks.
Excellent mood.

/life :: Link / Comments (0)


I've returned.


My sister had a graduation at school this weekend.
Those were fun days there, although a bit nervous. Marina has grown into quite a striking woman, but she is still a child.
My congratulations and happy way into new life!

/life :: Link / Comments (0)


Fri, 22 Jun 2007

A good suit only becomes better with time, like cognac.

/life :: Link / Comments (0)


Thu, 21 Jun 2007

UK visa and british embassy.


Today I got to the office at about 7:30 and then went to the British embassy to try to hand in the documents again. I was at the visa center at 9:30 (it opens at 9:00) and got number 133 in the line. Fortunately there were only about 25 people in front of me with a case like mine, so I spent only about 1.5 hours and started the process. First they asked me to rewrite the form (I only tried to hand in documents); I smiled nicely and said something to the operator girl, so I was allowed not to stand in the line again, and quickly rewriting the form took about 7 minutes. I do not think that the new letters are more readable, but that is the rule.
In 5 days I will know whether I got the visa or not, so I will either order tickets and book a hotel or stay at home.

/life :: Link / Comments (0)


Wed, 20 Jun 2007

Climbing evening.


Excellent training - I started with warm-up traverses, then continued with simple on-sight traces chained together in twos and threes. Eventually I moved to very interesting, although already quite old traces - there were a lot of people, so there was no easy way to select a trace.
The end of the training was the culmination - there is a complex enough trace on the serious negative slope in the climbing zone, with a set of holds which are supposed to be used by jumping between them. The dynamic jump is not that big, but on a negative slope it becomes a problem, so I ended up fixing a rope above the place and jumped, wildly screaming (likely I was telling myself that I need to fly to the given hold no matter how, in some strange language known only to me and my alter-me). Eventually I completed that place, which now contains an additional hold, since no one wanted to jump there.
An excellent finish to an absolutely stupid day.

/life :: Link / Comments (0)


British embassy.


I spent more than an hour in a queue at the embassy, and during that time only two people moved forward. I had a choice: either wait there until another 70 people finish, or have dinner. I think the choice is obvious, so tomorrow I will go there again, this time early in the morning.

/life :: Link / Comments (0)


Tue, 19 Jun 2007

CRC performance depending on the word size.


To select the perfect CRC coding technique for the distributed storage I'm about to start designing, I've run a set of tests to determine how CRC speed depends on the size of the word used in the calculation. I use a simple XOR CRC, which just performs the XOR operation over the provided data block several times and then reports its performance (i.e. the number of operations per timeframe).

Here is CRC performance for the same blocksize and number of iterations (i.e. number of CRC sums performed over the same block):

size: 4096, num: 100000, bits: 64, speed: 459.630646 kop/sec.
size: 4096, num: 100000, bits: 32, speed: 348.913483 kop/sec.
size: 4096, num: 100000, bits: 16, speed: 176.067184 kop/sec.
As expected, CRC speed improves as the word size increases; unfortunately, for Galois field multiplication the speed increase is not enough.
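
For reference, the kind of XOR sum being measured can be sketched as follows. This is an illustration in Python, not the test program that produced the numbers above (Python timings would not be comparable); little-endian word order and the helper names are assumptions of this sketch.

```python
def xor_checksum(block, bits):
    """XOR all words of the given width (16, 32 or 64 bits) in the block."""
    if bits not in (16, 32, 64):
        raise ValueError("unsupported word size")
    step = bits // 8
    if len(block) % step:
        raise ValueError("block size must be a multiple of the word size")
    acc = 0
    for offset in range(0, len(block), step):
        acc ^= int.from_bytes(block[offset:offset + step], "little")
    return acc


def fold(value, bits):
    """Fold a 2*bits wide XOR sum into bits by XORing its two halves."""
    mask = (1 << bits) - 1
    return (value & mask) ^ (value >> bits)


block = bytes((i * 7 + 3) % 256 for i in range(4096))
sums = {bits: xor_checksum(block, bits) for bits in (64, 32, 16)}
```

A 4096 byte block needs 512 XOR operations with 64 bit words but 2048 with 16 bit words, which is where the speed difference comes from. The wider sum also contains the narrower one: folding the 64 bit result in half gives the 32 bit result.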
Let's discuss Reed-Solomon coding, which requires Galois field multiplication of checksummed (or raw) data words.
For a 16 bit CRC sum, Galois field multiplication can be performed using a 2^16 lookup table (2 lookups per multiplication or division operation), but for a 32 bit CRC, Galois field multiplication in GF(2^32) can not be performed using a table, since it would be too big. To perform Galois field multiplication one essentially needs to solve a system of equations, where the number of equations is equal to the size of the values being multiplied, i.e. 32 in our case. Even the fastest method of solving a system of equations will be much, much slower than a 16 bit CRC plus its 2 lookups. Even using Cauchy matrices instead of a Vandermonde one for Reed-Solomon encoding with Galois multiplication results in much slower code than table lookup.
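
One common way to do table-based multiplication in GF(2^16) is with log/antilog tables, sketched below in Python. This is an assumption about the table layout, not necessarily the one meant above (the exact number of lookups depends on the layout), and the primitive polynomial 0x1100B is simply a widely used choice for 16 bit fields. gf_mul_slow() is a bit-by-bit reference used only to check the tables.

```python
POLY = 0x1100B            # x^16 + x^12 + x^3 + x + 1 (assumed polynomial)
ORDER = (1 << 16) - 1     # number of non-zero field elements


def build_tables():
    """Build antilog (EXP) and log (LOG) tables for generator x."""
    exp = [0] * (2 * ORDER)   # doubled so that index sums need no modulo
    log = [0] * (1 << 16)
    x = 1
    for i in range(ORDER):
        exp[i] = x
        log[x] = i
        x <<= 1
        if x & 0x10000:
            x ^= POLY
    for i in range(ORDER, 2 * ORDER):
        exp[i] = exp[i - ORDER]
    return exp, log


EXP, LOG = build_tables()


def gf_mul(a, b):
    """Multiply via tables: a * b = antilog(log a + log b)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf_div(a, b):
    """Divide via tables: a / b = antilog(log a - log b)."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^16)")
    if a == 0:
        return 0
    return EXP[LOG[a] - LOG[b] + ORDER]


def gf_mul_slow(a, b):
    """Reference shift-and-XOR multiplication with polynomial reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10000:
            a ^= POLY
    return result
```

The two tables hold 2^16 entries each, which is small enough to stay in memory; the same layout for GF(2^32) would need 2^32 entries per table, which is why the table approach stops at 16 bits.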

So, for maximum performance we either need to limit the Reed-Solomon system to a 16 bit CRC (Galois field multiplication is used in RAID coding, for example), or use a different coding method.
I'm studying WEAVER codes, which are faster by design (and use only XOR operations without complex Galois field multiplications), but they have their own limits, so I'm trying to understand how a distributed system with such erasure coding would suffer from using it.

/devel/math/codes :: Link / Comments (0)


Network batched xmit testing.


I've run several simple tests with a desktop e1000 adapter I managed to find.

The test machine is an amd athlon64 3500+ with 1gb of ram. The other end is a desktop core duo 3.4 ghz with 2 gb of ram and the sky2 driver.

The simple test included test -> desktop and vice versa traffic with 128 and 4096 block sizes in a netperf-2.4.3 setup.

The test machine runs 2.6.22-rc5-batch and the mainline tree (there is also a test with 2.6.22-rc4, and there is a noticeable performance win in the latest git compared to that tree; likely tcp congestion changes resulted in better utilisation). Batched xmit has better numbers.

Results:

2.6.20-ubuntu -> test machine

2.6.22-rc4-batch:
128: 725.43 724.14
4096: 698.63 712.77

2.6.22-rc5-mainline:
128: 760.91 762.04
4096: 784.32 788.53

2.6.22-rc5-batch:
128: 766.70
4096: 788.24


test machine -> 2.6.20-ubuntu

2.6.22-rc5-mainline:
128: 558.16 (Desired confidence was not achieved within the specified iterations.)
4096: 814.01

2.6.22-rc5-batch:
128: 569.14 (Desired confidence was not achieved within the specified iterations.)
4096: 822.72
The batch patchset's design is quite simple: instead of setting up a DMA transfer for each new skb, it queues them, and if there are several packets in the queue it sets up DMA once for the whole batched queue (not more than a threshold though).
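
The accounting behind that idea can be modelled in a few lines. This is a toy Python model under stated assumptions, not the patchset's code: drain() and its threshold parameter are hypothetical names, and a "DMA setup" is just counted, not performed.

```python
def drain(packets, threshold):
    """Split queued packets into batches of at most `threshold` packets.

    Each returned batch stands for one DMA setup.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    return [packets[i:i + threshold] for i in range(0, len(packets), threshold)]


queued = [f"skb{i}" for i in range(10)]

per_packet = drain(queued, threshold=1)   # no batching: one setup per skb
batched = drain(queued, threshold=4)      # batches of 4, 4 and 2 packets
```

Ten queued packets cost ten setups without batching and three with a threshold of four; the win only exists when packets actually accumulate in the queue.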

Update: here is a result for pktgen UDP testing:
batch (batch_min: 0):   241319pps 115Mb/sec
batch (batch_min: 400): 182607pps 87Mb/sec
mainline:               497520pps 238Mb/sec

/devel/other :: Link / Comments (0)


Recent developments.


I was too lazy for the last week and actually did not do anything interesting. I only completed testing and bugfixing of Jens Axboe's network splice (the bug happened to be in the generic splice code, but was triggered only by the network splice patchset; likely it will be included in the next release), and some work on SLAB page reference counting (that was even interesting - it was groundwork for the initial network splice mechanism, but eventually I found that it is not enough and the simpler way is to just reference an skb and not free it until the splice pipe is empty). Right now I'm setting up a testbed for Jamal's network batching (although I did not want to participate initially - first because I did not see a big margin in the idea, probably because I did not understand it fully, and second because work with Jamal was not that productive - after his request for libevent support for kevent he disappeared, and frequently talks ended not with code but with empty discussions, which I do not like). Anyway, I decided to answer, since I was in the Cc: list (hint: if I am at the kernel summit and in a good mood, I will raise a request to always answer if a person is in the Cc list: the original sender added people on purpose, and if someone has nothing to say, just say 'it is not interesting for me', 'I have no time', 'I will look at it later' or 'it is in my backlog, I remember this', but never fucking disappear like into a black hole).
So, I found an old e1000 adapter and will test Jamal's network batching code with a simple netperf setup.

Another testing task is to run btrfs tests in my automatic filesystem testbed (well, semi-automatic, graph generation scripts require a manual run so far, maybe I will extend it later). I will start when Chris Mason releases version 0.3 (hopefully quite soon).

But frankly, I'm too bored with paid work testing (I need to write bug reports for someone else's automatic testing system, which can not start shell scripts), with splice testing and with batching testing.

I want my own project to continue, so expect something new soon.

/devel/other :: Link / Comments (0)


Mon, 18 Jun 2007

Climbing evening.


That was a really good training - I climbed high on the walls, not only the old, already boring traverses and boulderings. There was quite a big gap in my training (about a week or so), so it was a bit hard to start, but nevertheless I completed one new trace and a couple of more complex old ones. Eventually I decided to try one really complex trace (7a), but I was already too tired, so it was not completed - my arms were aching so much that sometimes I could not even make a single movement.
Since from now on I will climb on the walls and shin up frequently (the last time I regularly climbed high on the walls was almost half a year ago!), I expect serious progress. But even if nothing like that happens, it is good as is, since there are a lot of new and interesting traces I will climb.

/life :: Link / Comments (0)


Sun, 17 Jun 2007

Night view.


One can see Ostankinskaya tower (a bit left from the center) and Vorobjovy Gory (right from the center - red lights and farther).

Night view

I wish I had a tripod to make a panorama and a better view; with a long shutter time it is quite hard to keep the camera from moving even a bit.

/life :: Link / Comments (0)


Sailing yacht.


Instead of doing interesting things I spent the last two days reading internet forums about sailing yachts. I do understand that it is a stupid way to spend the time, but I can not resist.
Actually there are at least two problems with it for me: first, I need to understand what type of yacht I like - something smaller for local waters (there are a number of water reservoirs near Moscow) or something bigger for the sea. And the complete absence of time, of course - that is the main problem.
There are several other problems, like the complete absence of money and experience, but they can be solved without major pain.

Actually that is the last time I speak about it - until something changes, there is absolutely no possibility for me to own a yacht, not because of money, but because of time.

/life :: Link / Comments (0)


Trumpet.


I still can not play it.

Trumpet

/life :: Link / Comments (0)


Sat, 16 Jun 2007

I have a dream.


Class 'micro':



Class '25 feet'



I can even buy the former, but likely it is not a good idea: I have no time to work with yacht and do not have experience. Actually the latter is a matter of time, but there is no time.
So, for the nearest future let it be a dream...

/life :: Link / Comments (0)


First unofficial BTRFS performance tests.


Florian D. ran small FIO tests with the following configs:

  • sequential read
  • random writes
  • sequential read again
Here are results with filesize:300MB, bs:4K:
   btrfs                  reiserfs               ext3
   usr% sys% bw   sec.    usr% sys% bw   sec.    usr% sys% bw   sec.
1  5    51   68.3 4.6     1    17   67.4 4.6     5    24   68.0 4.6
2  0    1    0.7  431     2    21   29.8 10.5    3    18   29.0 10.8
3  0    1    2.3  133     1    19   70.5 4.4     5    24   68.6 4.5
Btrfs performance is quite bad in that setup.

/devel/fs :: Link / Comments (0)


Fri, 15 Jun 2007

NILFS by AMAGAI Yoshiji at Nippon Telegraph and Telephone Corporation.


NILFS is a log-structured filesystem by a Japanese telecom company, which supports snapshots and garbage collection over them.
It looks interesting, but it has a fundamental flaw - it writes all metadata updates as a log in a special file, which means heavy fragmentation when the oldest updates are removed. It also supports (at least it looks so) copy-on-write (and it looks like only for data), which allows forgetting about slow recovery after a crash and can improve performance.
NILFS supports checksums for segment data and related metadata. Superblock and management blocks maintain own checksums. Currently it is only crc32. There are no checksums for individual data blocks.
NILFS uses simple B-tree structure with 64-bit keys.
Like Btrfs it does not support sync, direct and mmapped processing and quotas.
It also does not allow to have writable snapshots.

The first papers about this filesystem appeared two years ago.
There are no benchmarks on the official site.

It looks like log-structured filesystems are becoming popular (taking into account that when I first wrote about such filesystems in this blog there were zero projects in that direction).

/devel/fs :: Link / Comments (0)


BTRFS by Chris Mason at Oracle.


There are a number of features which are currently not implemented and which limit the design:

  • Absence of delayed allocation. This means that fragmentation will kill a filesystem, even if it is extent based (which is just a block size increase). Currently btrfs allocates a new on-disk chunk for each 8k write.
  • Bad tree locking, which does not allow multithreaded writing.
  • Some minor nits like the absence of sync, direct and mapped processing, which is unlikely to be a design problem though.
Actually, I plan to watch this project, but currently I limit my filesystem interests to the block layer - I still want to implement distributed storage with raid-like functionality (not exactly raid, since for higher order checksums (like 32 bit crc) it requires slow Galois-field multiplications, but something like that), which could be put under the filesystem and allow putting usual filesystems on top of huge arrays distributed over the net.

/devel/fs :: Link / Comments (0)


Thu, 14 Jun 2007

I've visited Yandex.


That was really interesting - its office building, tables on the street under the roof, the interior - it looks like a good place to work.
My main objection against any job except my current one is that right now I frequently have enough free time after the paid work projects to do my own. If I move to some other place, there will be interesting projects of course, but practice shows that eventually they are completed and boring work starts, and it takes all the time, while right now I have quite simple tasks, which I can complete quickly and then work on my own projects.
That was until I heard (in rough translation): "You will do whatever projects you like; if they are interesting for us (Yandex), it will be a plus."

But there is a small problem with my current place - there are several projects which will suffer a bit without my attention (read: if I go away (I have two weeks by Russian law), they will suck), and thus some people will have troubles (besides the fact that my bosses will think I'm a moron who dropped them in such a situation).
So, I have some kind of moral contract.

/devel/other :: Link / Comments (0)


Tue, 12 Jun 2007

New filesystem by Oracle.


Chris Mason has announced new filesystem, which has following feature set:

  • Extent based file storage (2^64 max file size)
  • Space efficient packing of small files
  • Space efficient indexed directories
  • Dynamic inode allocation
  • Writable snapshots
  • Subvolumes (separate internal filesystem roots)
  • Object level mirroring and striping (not ready)
  • Checksums on data and metadata (multiple algorithms available)
  • Strong integration with device mapper for multiple device support (not ready)
  • Online filesystem check (not ready)
  • Very fast offline filesystem check
  • Efficient incremental backup and FS mirroring (not ready)
Quick design notes:
  • One large Btree per subvolume
  • Copy on write logging for all data and metadata
  • Reference count snapshots are the basis of the transaction system. A transaction is just a snapshot where the old root is immediately deleted on commit
  • Subvolumes can be snapshotted any number of times
  • Snapshots are read/write and can be snapshotted again
  • Directories are doubly indexed to improve readdir speeds
Impressive - that is a good thing.
Taking into account that it is exactly what I wanted to implement, I do not see any point in continuing that project. And that is a bad thing.

/devel/fs :: Link / Comments (0)


I'm back from canoe trip.


That was bloody excellent!
A short story with photos about the trip.

Go!

/life :: Link / Comments (0)


Fri, 08 Jun 2007

I've moved to the canoe trip.


If I do not sink (and I will not, no element can kill a seaman in my mind), I will be back this Tuesday.

/life :: Link / Comments (0)


Linus and Andrew talk about current Linux kernel state.


Link.
For me the most interesting was the filesystem part - Linux does need a new filesystem, which must be simple and fast. None of the existing ones satisfies both requirements even partially - each one is complex and slow in some or all patterns.
I want to change that by developing my own filesystem, but due to a total lack of time, progress is minimal. Maybe the approach I decided to take will even turn out to be totally broken, but I want to find that out myself.
We will see...

/devel/other :: Link / Comments (0)


Ok, after some work on network splice, it somehow works.


Although the received data is not valid (the file sometimes contains several repeated chunks, and sometimes earlier pieces of the original file, likely the result of incorrect handling of page-boundary crossing), the kernel does not crash.
That forced me to change the page releasing code in fs/splice.c a bit, since I think it is not correct that a page can be blindly freed there, and to clone the skb for each page splice request. That is likely too big an overhead, but on the receive path the fast clone is frequently unused, so maybe there is some gain there.

/devel/other :: Link / Comments (0)


Playing with splice and networking.


Splice is an in-kernel mechanism which allows zero-copy transfer of pages between different users: it is possible to 'move' data between userspace (vmsplice) and/or files (file descriptors). For example, sendfile can be implemented via a splice call (and it is for some file types). Splicing on the receive side, on the other hand, is not supported.
There were several attempts to implement receive zero-copy, I recall at least three: my work, a patch by Alexey Kuznetsov and work by Intel folks (the latter is very similar to what Alexey proposed, but was more generic, since it was the first splice work, while Alexey's work and mine were purely receive zero-copy (Alexey implemented a single-copy approach for unaligned data, while I changed the driver to always properly align data)).
A couple of days ago Jens Axboe from Oracle posted his variant, which used SLAB pages (those pages are allocated using the kmalloc() function and contain network data if the driver does not use pages as fragments), but it was quite broken, since SLAB pages do not have reference counting (the only page which has a non-zero reference counter is the first page in the combined set - SLAB uses 0 and higher-order pages to store objects), and SLAB never changes the reference counter when storing data in those pages. So, it is impossible to just increase a reference counter for any SLAB page, since that will end badly when the page is reclaimed by SLAB. I tried to fix those issues and eventually completed reference counting for SLAB pages, based heavily on Jens' work, but here comes another problem.
While a SLAB page is not freed, it can be reused, and thus the same address inside the page can store different data at different times. So, if the skb which holds a network packet is freed before splice has finished with the given page, the freed pointer can be returned by a subsequent allocation, and the data will be overwritten by the next packet. When splice finishes its work (for example dumps the page to disk), incorrect data will be there.
The right way is to prevent the skb from being freed while the page its data refers to is being used by splice. It seems simple, but it is not - the same page can contain quite a lot of packets, so the page must hold a reference for every skb whose data is placed into it, and that is not easy - there are no unused members in the page structure.
While I was writing this post, Jens posted a patch which implements exactly the same idea, but by introducing a private field in the splice private structure.
Let's check this out.

/devel/other :: Link / Comments (0)


I was invited to work at Yandex - a small Russian Google.


I declined (as with Google), but they insisted on meeting (not as with Google :), so I will go and see how they work. Actually, if I ever worked at Google or Yandex, I would definitely like to create an automatic tracking system over their maps, which would allow one to put marks on the map and select the shortest way between the points, taking into account information about traffic jams and so on. There are such systems all over the world, but they are heavily limited to specially crafted vectorized maps, while I would start with plain pixmaps.
But working in such a company (no matter whether it is Yandex, Google, SWSoft or anything else) requires devoting much of one's time to it, while I prefer my own projects (without any gain though), so that would eventually end up with the cancellation of my own ideas. So, no, at least right now.

/devel/other :: Link / Comments (0)


OpenBSD hackathon.


Hackroom teardown and second climbing day.

/devel/other :: Link / Comments (0)


Thu, 07 Jun 2007

Putin killed princess Diana.


I could not resist linking to the absolutely true article in "The Independent", a UK newspaper - the only straight and completely unbiased newspaper in the world.
Citation:

In the last months of her life, Diana was severely opposed to Putin's efforts to become the new dictator of the Russian Federation

I think the world goes stupid faster than a snowball, and I'm sitting and watching.

/other :: Link / Comments (0)


Wed, 06 Jun 2007

Climbing evening.


This one was quite interesting, although I did not climb high on the walls, but spent most of the time at the bottom doing various boulder problems, traverses and starts. There were a couple of new interesting boulder problems, but mostly it was the same as usual training.
I've found a training routine which allows me to warm up quickly, but it is too easy to cool down, and then the feet start aching when I try to push them into the climbing shoes (minus 3 sizes), so the training must be quite aggressive with shorter rests. But, frankly, it is unfair to climb at the bottom - you think that you are doing something serious and interesting, but as soon as your arms start aching, you drop down and wait, while on the wall there is no way to come down, since after that you must start from the very beginning, which is not interesting, so you squeeze the last bits of power out of the body until either the route is completed, or you can not lift your arm at all.

/life :: Link / Comments (0)


Locking in SuperH.


Shame on me, I did not mention the tas.b instruction, which does exactly what is needed for a spinlock implementation - it reads a byte at the address specified in a register, then sets a bit in the value read and writes it back. If the value read is zero, a special bit (the T bit) is set. All operations are performed with the bus locked, so it is atomic.
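How a spinlock falls out of such a test-and-set can be shown with a small model. This is only a sketch in Python, not SuperH code: the class names are hypothetical, a threading.Lock stands in for the locked bus, and the return value of tas() plays the role of the T bit.

```python
import threading
import time


class TasByte:
    """Model of one memory byte with a tas.b-like operation (hypothetical)."""

    def __init__(self):
        self.value = 0
        self._bus = threading.Lock()  # assumption: stands in for the locked bus

    def tas(self):
        """Read the byte, set bit 7, write it back; True if it was zero."""
        with self._bus:
            was_zero = self.value == 0
            self.value |= 0x80
            return was_zero

    def clear(self):
        self.value = 0


class SpinLock:
    """Spinlock built only on the test-and-set above."""

    def __init__(self):
        self._byte = TasByte()

    def acquire(self):
        # Spin until we are the one who saw zero.
        while not self._byte.tas():
            time.sleep(0)  # yield to other Python threads while spinning

    def release(self):
        self._byte.clear()


def run_counter(threads=4, rounds=500):
    """Increment a shared counter under the spinlock from several threads."""
    lock = SpinLock()
    state = {"count": 0}

    def worker():
        for _ in range(rounds):
            lock.acquire()
            state["count"] += 1
            lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return state["count"]
```

Only the thread that reads zero gets True back, so exactly one thread at a time enters the critical section; everyone else keeps seeing the byte already set.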
I read the description of that instruction many times, but failed to translate or understand it. But, whatever it was, spinlocks are perfectly valid on the SH-4 family, so it would be possible to create an SMP system, since I even found multiprocessor serial communication in the 7750R, which allows transferring data between multiple CPUs, where each receiver is addressed via a unique ID.

/devel/sh :: Link / Comments (0)


Tue, 05 Jun 2007

I will go to the 2007 Linux Kernel Summit.


At least I've started to collect the documents needed to obtain a UK visa.
The summit will be hosted in Cambridge, UK, September 4-6.

/devel/other :: Link / Comments (0)


SuperH 7751 (sh-4) limitations.


The main one is the complete absence of locking instructions (or instructions which automatically lock access to a memory region, like ppc lwarx/stwcx or the lock prefix on x86), which means that any atomic operation must contain irq disable/enable code. Furthermore, it is impossible to create a spinlock which can serialize parallel execution on several CPUs.
It is possible to work around the problem using locked PIO access, but that is very slow.
Likely SuperH SMP will only support the newer CPU family. SuperH SMP support is scheduled for the 2.6.23 timeframe.

/devel/sh :: Link / Comments (0)


Problems with HIFN driver development for cryptoapi.


Actually it is a more serious problem.
The existing cryptoapi is very software-centric, i.e. it does not have any hardware context and thus does not need to initialize one per request. In hardware (at least in HIFN) each crypto request must be programmed into the device's ROM and registers, so when a user creates a new crypto request (read: a new crypto user requests operations by creating a crypto TFM structure), it is only assigned a software context, which contains, besides other fields, the crypto functions (encrypt/decrypt/setkey), but does not contain a pointer to the crypto device (since software does not need it, it can store and initialize itself at setkey time).

Other crypto devices (padlock and geode) do not have such problems, since padlock just uses the same software approach, but with different asm instructions, and geode has only one device, which has a global pointer accessed from the crypto processing functions.

/devel/acrypto/hifn :: Link / Comments (0)


New CARP - Common Address Redundancy Protocol - release.


There are no changes since the previous release, since the API was not changed. It is a sign that I do care about this project, although if I started it from scratch, I would change some bits; it is possible to do that right now, but I do not have feature requests for that.

Anyway, here is kernel CARP homepage.

/devel/other :: Link / Comments (0)


Development.


A bit noisy though. Again no photoshop or gimp. Small HDR processing.

Development

/life :: Link / Comments (0)


Sunrise.


View from my window.

Sunrise

Early morning shots - no photoshop or gimp, pure math :)

/life :: Link / Comments (0)


Hard time, summer in the city.


I have a camera, a sleeping bag and a liter of black vodka.
And I'm going on a canoe trip next weekend (previous trip story).
I wish myself seven feet under the keel.

/life :: Link / Comments (0)


A night sky.

Night sky

/life :: Link / Comments (0)


Mon, 04 Jun 2007

Climbing evening.


That was a really good training - I even climbed high: two sets, each of two routes without rest in between. The first set was two on-sight routes of low complexity (interesting routes, but not so much as to repeat them), which I did not complete fully - two falls on the second route, although it is not that complex, but I am not in perfect shape yet. The second set was two old routes - the first had the same complexity as the first on-sight (6a+) and the second was a bit more complex (6c), but it was old, so it was not too hard, and nevertheless I failed.
Give me two weeks of regular training, and I will complete all routes on the vertical walls which are less than 7a in complexity (and actually there are a couple of interesting 7a routes I tried and partially completed).
I want to climb!
The rest of the training was devoted to old traverses and starts - there are several new routes (starts) I have not completed yet.

/life :: Link / Comments (0)


Second release of the HIFN crypto driver for 2.6 cryptoapi.


Added a lot of cryptoapi bindings.
The driver now supports AES, 3DES and DES ciphers with all four crypto modes (ECB, CBC, CFB and OFB). My cat pissed on my slippers when he saw that. And believe me, he does know how to tell good code from the other kind.

Well, I do not have a cat, but if I had one, he would definitely piss on my slippers.

Anyway, the patch can be found in the archive.

/devel/acrypto/hifn :: Link / Comments (0)


I've become a Nikon D40 owner.


Nikon D40

Stay tuned.

/life :: Link / Comments (0)


Sun, 03 Jun 2007

Google.


Interesting: why do Google HR people ask me about people I would recommend in Moscow, and then never send them any mail?

/other :: Link / Comments (0)


Boot loader source code.


The interested reader can find the bootloader code, which resides in the MBR and reads the kernel from an offset of 1 sector using the standard SuperH IPL (aka BIOS), in the archive.
The size of the image it reads should be less than 1179648 bytes, otherwise you will need to increase that parameter (see comments). One can also change the offset from which the kernel image is read. There is also a compilation and CF writing script, which is easy to understand.
To write the kernel image (with default parameters in the bootloader) you just need to run the following command:

dd if=arch/sh/boot/zImage of=/dev/sdb obs=512 seek=1

where /dev/sdb should be replaced with the device node for the compact flash attached to your host system.
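The same write, with the size limit checked before the flash is touched, can be sketched in Python. This is only an illustration: the helper name write_image and its parameters are hypothetical, the defaults mirror the dd command above (512-byte sectors, seek of 1), and 1179648 is 18 << 16.

```python
SECTOR = 512
MAX_IMAGE = 18 << 16  # 1179648 bytes, the default loader limit named in the post


def write_image(image_path, device_path, seek_sectors=1, max_size=MAX_IMAGE):
    """Write the kernel image to the device, skipping the first sector(s).

    Hypothetical helper: refuses images the loader would not read fully.
    Returns the number of bytes written.
    """
    with open(image_path, "rb") as f:
        image = f.read()
    if len(image) >= max_size:
        raise ValueError(
            "image is %d bytes, loader limit is %d" % (len(image), max_size)
        )
    # "r+b" opens for update without truncating, so the MBR stays intact.
    with open(device_path, "r+b") as dev:
        dev.seek(seek_sectors * SECTOR)
        dev.write(image)
    return len(image)
```

For example, write_image("arch/sh/boot/zImage", "/dev/sdb") would correspond to the dd invocation above, assuming /dev/sdb really is the compact flash.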

Remember that it does not support anything besides booting (i.e. no partition tables, no command line).

/devel/sh :: Link / Comments (0)


I've just booted 2.6.22-rc2 Linux kernel on SuperH 7751R CPU (LANDISK board).


Dmesg:

SH IPL+g version 0.9, Copyright (C) 2000 Free Software Foundation, Inc.

This software comes with ABSOLUTELY NO WARRANTY; for details type `w'.
This is free software, and you are welcome to redistribute it under
certain conditions; type `l' for details.

2002/09/09 Making.  2004/09/08 I-O DATA NSU Update.
266:133:33 on base clock 22.22MHz and SDRAM 4 burst. CF boot.

PCIC initialization done.
MASTER:48bit LBA mode non support
Disk drive detected: LEXAR ATA FLASH V1.00 11014102039199095066 
LBA: 001EBF10
DiskSize: 1031675904Byte
PIO MODE1
Set Transfer Mode result: 50 
> b
Set Transfer Mode result: 50 
Initialize Device Parameters result: 50 
IDLE result: 50 
Starting from MBR
Error
00000000
Jumping to second stage
checksum: 0x4f222fe6 
input_len: 0x0010baf4 
input_data: 0x8c80300c 
Uncompressing Linux... 

insize: 0x00000000 
inptr: 0x00000000 

insize: 0x0010baf4 
inptr: 0x00000001 Ok, booting the kernel.
cၲͥrrjj:ͪͪj"Bjɕ
B:ͥrrJсRչҚҒ
ISHV$&'H.]X_pfn = 0xc000, low = 0xe000
Zone PFN ranges:
  Normal      49152 ->    57344
early_node_map[1] active PFN ranges
    0:    49152 ->    57344
Node 0: mem_map starts at 8c1eb000
I-O DATA DEVICE, INC. "LANDISK Series" support.
Built 1 zonelists.  Total pages: 8128
Kernel command line: console=ttySC1,9600 console=tty0 mem=32M
Setting GDB trap vector to 0x80000100
PID hash table entries: 128 (order: 7, 512 bytes)
Using tmu for system timer
Using 8.333 MHz high precision timer.
Console: colour dummy device 80x25
Dentry cache hash table entries: 4096 (order: 2, 16384 bytes)
Inode-cache hash table entries: 2048 (order: 1, 8192 bytes)
Memory: 30520k/30520k available (1345k kernel code, 2248k reserved, 438k data, 72k init)
PVR=04050005 CVR=20480000 PRR=00000113
I-cache : n_ways=2 n_sets=256 way_incr=8192
I-cache : entry_mask=0x00001fe0 alias_mask=0x00001000 n_aliases=2
D-cache : n_ways=2 n_sets=512 way_incr=16384
D-cache : entry_mask=0x00003fe0 alias_mask=0x00003000 n_aliases=4
Mount-cache hash table entries: 512
CPU: SH7751R
NET: Registered protocol family 16
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
Autoconfig PCI channel 0x8c1bacc0
Scanning bus 00, I/O 0xfe240000:0xfe280000, Mem 0xfd000000:0xfe000000
00:00.0 Class 0200: 10ec:8139 (rev 20)
        I/O at 0xfe240000 [size=0x100]
        Mem at 0xfd000000 [size=0x100]
00:02.0 Class 0c03: 1033:0035 (rev 43)
        Mem at 0xfd001000 [size=0x1000]
00:02.1 Class 0c03: 1033:0035 (rev 43)
        Mem at 0xfd002000 [size=0x1000]
00:02.2 Class 0c03: 1033:00e0 (rev 04)
        Mem at 0xfd003000 [size=0x100]
PCI: Using configuration type 1
Time: SuperH clocksource has been installed.
NET: Registered protocol family 2
IP route cache hash table entries: 1024 (order: 0, 4096 bytes)
TCP established hash table entries: 1024 (order: 1, 8192 bytes)
TCP bind hash table entries: 1024 (order: 0, 4096 bytes)
TCP: Hash tables configured (established 1024 bind 1024)
TCP reno registered
gio: driver initialized
io scheduler noop registered
io scheduler anticipatory registered (default)
io scheduler deadline registered
io scheduler cfq registered
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
SuperH SCI(F) driver initialized
sh-sci: ttySC0 at MMIO 0xffe00000 (irq = 25) is a sci
sh-sci: ttySC1 at MMIO 0xffe80000 (irq = 43) is a scif
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
loop: module loaded
8139too Fast Ethernet driver 0.9.28
8139too 0000:00:00.0: This (id 10ec:8139 rev 20) is an enhanced 8139C+ chip
8139too 0000:00:00.0: Use the "8139cp" driver for improved performance and stability.
eth0: RealTek RTL8139 at 0xfd000000, 00:a0:b0:6c:d0:eb, IRQ 5
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ehci_hcd 0000:00:02.2: EHCI Host Controller
ehci_hcd 0000:00:02.2: new USB bus registered, assigned bus number 1
ehci_hcd 0000:00:02.2: irq 5, io mem 0xfd003000
ehci_hcd 0000:00:02.2: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 5 ports detected
ohci_hcd 0000:00:02.0: OHCI Host Controller
ohci_hcd 0000:00:02.0: new USB bus registered, assigned bus number 2
ohci_hcd 0000:00:02.0: irq 7, io mem 0xfd001000
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 3 ports detected
ohci_hcd 0000:00:02.1: OHCI Host Controller
ohci_hcd 0000:00:02.1: new USB bus registered, assigned bus number 3
ohci_hcd 0000:00:02.1: irq 8, io mem 0xfd002000
usb usb3: configuration #1 chosen from 1 choice
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
usbcore: registered new interface driver libusual
heartbeat: version 0.1.0 loaded
TCP cubic registered
NET: Registered protocol family 1
NET: Registered protocol family 17
VFS: Cannot open root device "" or unknown-block(0,0)
Please append a correct "root=" boot option; here are the available partitions:
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
I'm now accepting congratulations.

Although Linux is well supported on that board, and actually it was preinstalled there (until an OpenBSD developer's hands murdered it on the compact flash), this is a serious step for me.
It is the second non-x86 CPU I managed to start Linux on. The first one was ppc32 (AMCC 405 GPr): I ported Linux to our company's own board about two years ago (although the initial loader was not written by me there either).
So, I only wrote an initial loader in SuperH asm, which resides in the CF's MBR; it reads the Linux image from a predefined offset on the CF into RAM and jumps there. It does not initialize caches, MMU and other parameters, since that was done by the IPL (Initial Program Loader), which resides on NAND flash; the board actually boots from that flash, and only later does that code jump into the CF's MBR.
The loader does not even initialize the command line, but it can compute a checksum of the image, which can be tested against the checksum of the image written to the compact flash.

I will play a bit with this board until Grange returns from the OpenBSD hackathon. I will try to set up a partition table and filesystem, so that userspace could be started there.
Actually, when he returns, I will ask for the Xscale board their company develops. They run Linux there without serious problems (it is based on ixp425(?)), but Linux in general sucks on that CPU, since it does not have support for scheduling domains (up to 16), which allow rescheduling latency to be greatly improved (or they are not turned on, or whatever, I was quite surprised by that news actually). OpenBSD has such support, and there is a patch for Linux 2.6, but it is quite broken (for example shared memory does not work).

/devel/sh :: Link / Comments (0)


Ancient times.


All bottles of alcohol in my loft look really ancient - you know, all those dusty, cobweb-covered bottles with some cloudy liquid. That is what I have (had until yesterday). Looks cool - that is a positive side of living in the developed loft.

/life :: Link / Comments (0)


Linus Torvalds talk about GIT.




Actually the only reason I started to watch it was to find out whether I can understand Linus' and Andrew's speech (btw, does Linus look a bit nervous there?). I can, although not always.
I have talked with someone in English only once in my life (not counting English lessons at school and university).
And I'm going to go to the Linux Kernel Summit (not 100% sure, need to sort out technical details) to talk with core kernel hackers.. Ugh... You might get quite a bit of trouble talking with me, although for me the only problem is a very small vocabulary of indecent words.
You have been warned.

/devel/other :: Link / Comments (0)


Sat, 02 Jun 2007

I've declined to work in Google again.


And now I'm wondering whether I'm stupid or not.
And also thinking about how to bribe my budget to get some money to buy a camera and to travel to Cambridge this September for the Kernel Summit.

I want to get a DSLR camera; although I'm far from being a good photographer, I like the process of taking shots with DSLR/SLR cameras much more than with automatic digital cameras. Likely I would start with something simple like a Nikon D40 with the kit lens, and later I would add a long-focus lens (I like to photograph remote objects, mostly industrial buildings).

/devel/other :: Link / Comments (0)


I do not know why, but I do like SuperH asm.


Although I have known it for only about a week, I like it.

Btw, the code below can be optimized a bit, but that was not the main goal.

/devel/sh :: Link / Comments (0)


Problem with broken linux image has been found.


As I posted recently, I managed to start booting the kernel, but the image can not be decompressed and thus does not boot beyond the initial code. I expected the problem to be in the code which reads data from compact flash into memory - it uses IPL (BIOS) code to copy data, which is undocumented and likely broken.
Today I started to play with the data in RAM and decided to compute a simple checksum of the image I copy and compare it with the checksum of the on-flash image. I used a pretty simple checksum - a 4-byte xor over the whole image, computed by the following SuperH asm code:

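checksum:
	mov	#18, r1		! bytes to process: 18 << 16 = 1179648
	shll16	r1

	mov	r13, r2		! initial address
	mov	#4, r6		! step (one 32-bit word)
	xor	r5, r5		! checksum
l:
	mov.l	@r2+, r4	! load next word, advance address
	xor	r4, r5		! fold it into the checksum

	sub	r6, r1		! 4 bytes less to go

	tst	r1, r1		! T = 1 when nothing is left
	bf	l		! loop while T = 0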
Then the checksum (which resides in r5) was printed to the serial console via an IPL (BIOS) call:
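print_32:
	mov	#8, r6		! max (hex digits to print)
	mov	#1, r2		! step
	mov	r5, r3		! checksum
	mov	#-4, r10	! bits to shift (negative: shift right)
pl:
	mov	r3, r0

	and	#0xf, r0	! lowest nibble
	mov	r0, r11
	
	mova	hex, r0		! address of the digit table
	mov	r0, r4
	add	r11, r4		! r4 = &hex[nibble]
	mov	#0, r0
	mov	#1, r5		! presumably the length: one character
	trapa	#0x3f		! IPL (BIOS) call

	shad	r10, r3		! move the next nibble down
	sub	r2, r6
	
	tst	r6, r6
	bf	pl
	...
hex:
	.string "0123456789abcdef"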
Although it prints the hex number reversed, it was enough to find that after 2^17 bytes had been read the checksum stopped changing (xor with zero leaves the value intact, so most likely nothing was written to RAM past that limit), and 2^17 was the boundary up to which the checksum was equal to the one calculated over the compact flash image on the host system.
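The host-side half of that comparison can be sketched in a few lines of Python. All names here are hypothetical, and little-endian word order is an assumption about how the board is configured. reversed_hex() gives the digits in the order print_32 emits them, and first_bad_prefix() walks the image in steps, looking for the first prefix whose checksum differs from the one seen in RAM.

```python
def xor32(data, byteorder="little"):
    """XOR of all 32-bit words in data (zero-padded to a multiple of 4)."""
    if len(data) % 4:
        data = data + bytes(4 - len(data) % 4)
    acc = 0
    for i in range(0, len(data), 4):
        acc ^= int.from_bytes(data[i:i + 4], byteorder)
    return acc


def reversed_hex(value):
    """Eight hex digits, lowest nibble first, as print_32 prints them."""
    return format(value, "08x")[::-1]


def first_bad_prefix(good, suspect, step=1 << 16):
    """Smallest prefix length (a multiple of step) where checksums differ.

    Assumes both buffers have the same length; returns None if every
    prefix checksum matches.
    """
    for end in range(step, len(good) + step, step):
        if xor32(good[:end]) != xor32(suspect[:end]):
            return end
    return None
```

A matching xor does not prove the data is identical (equal words cancel each other out), so this only narrows down where to look, which was all that was needed here.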
So, my guess about wrong code, which reads data from compact flash into RAM, is correct; now I need to think (or decipher the sh-lilo second stage loader's code) about how to correctly read about one megabyte of data from flash. The main problem is of course LBA/CHS addressing and some unknown feature called 'device number' in a comment to the reading code in sh-lilo.
Back to the drawing board...

/devel/sh :: Link / Comments (0)


Fri, 01 Jun 2007

Climbing evening.


Since it was Friday, it was general physical endurance training - a lot of exercises with small weights or without them. This time I started late and only had about one hour for them, so I only completed 5 rounds of 7 types, with 10 repetitions per set. So instead of 10 rounds of 10 repetitions for each type I only completed half, but also added several traverses, mostly warm-up ones, but one was a bit complex, although way too old.
During the exercises something was displaced in the right shoulder again, but since it is not the first time, I have got used to the aching parts.

The ancient Latins and Greeks finished their summits with fights, which frequently were the last argument in a dispute, so if I go to the Linux Kernel Summit (and likely I will), I need to add punching bag exercises to my set - Linus and most of the developers are quite big, although quite slow I think :)

/life :: Link / Comments (0)


Hardcore Linux Kernel debugging.


In this image (500kb) I see a linux kernel, a broken loader and a blonde in a red dress.
So far it is the only way to determine what I put into RAM and where the inflating code fails.

/devel/sh :: Link / Comments (0)


OpenBSD hackathon moves to the countryside.


Or climbing in Canada.

/devel/other :: Link / Comments (0)