Zbr's days.

Sat, 31 May 2008

Ole-ole-ole-ole, kvanti chempion!

Match of the century: 24 hours of football at my alma mater.

Today the Department of Quantum and Physical Electronics (which I graduated from I do not even remember when, although I started studying at MIPT 10 years ago) plays against the "axes", also known as the Department of General and Applied Physics.
About halfway through the match we are winning by 18 goals (31:13).

This happens once a year, and I usually try to get to MIPT and watch part of the game, as I did this year. Tomorrow I will go there again, of course, to meet old friends and celebrate the win!

/life :: Link / Comments (0)


Fri, 30 May 2008

Local filesystem random read/write performance. POHMELFS parallel testing.

I promised yesterday to publish POHMELFS parallel processing results, even if they are miserable. Unfortunately there are no interesting results at all. The released version of POHMELFS is 32-bit only: it does not have a special ->open() callback that forces files to be opened with the O_LARGEFILE flag to support sizes above 4 GB (actually only 2 GB, since the kernel uses a signed size type, which leaves only 31 bits), and the superblock maximum file size is set to 32 bits. So the 32-bit results are not very interesting: a 2 GB/s random read speed is a meaningless number, since all reads were served from the cache.

While results with more than 2Gb are... Let me first show you how XFS and Ext3 behave in case of random writes.

A short preface.
Hardware used in testing: 4-way Intel E7520 system (two logical and two physical CPUs), 3 GHz 32-bit Xeons with 8 GB of RAM, Adaptec AIC7902 Ultra320 SCSI adapter with a SEAGATE ST3300007LC 10k rpm 300 GB test disk.
Its linear reading speed is about 90 MB/s. Dmesg:

scsi0 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 3.0
        <Adaptec AIC7902 Ultra320 SCSI adapter>
        aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 67-100Mhz, 512 SCBs
scsi 0:0:2:0: Direct-Access     SEAGATE  ST3300007LC      0003 PQ: 0 ANSI: 3
 target0:0:2: asynchronous
scsi0:A:2:0: Tagged Queuing enabled.  Depth 32
 target0:0:2: Beginning Domain Validation
 target0:0:2: wide asynchronous
 target0:0:2: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS RDSTRM RTI WRFLOW PCOMP (6.25 ns, offset 63)
 target0:0:2: Ending Domain Validation
Kernel version is 2.6.25 (and 2.6.24 for the first ext3 test).

I used two such machines as servers for iozone read/reread, write/rewrite and random read/write testing. The file size is limited to 8 GB, since that is the only interesting fair case; the record size varies from 8 KB to 1 MB.

Before starting the 8 GB POHMELFS testing, I decided to check how local filesystems behave in such a scenario.
XFS was tuned this way: (mkfs.xfs -d agcount=75 -l size=64m /dev/sdc1; mount -o logbufs=8,nobarrier,noatime,nodiratime,osyncisdsync /dev/sdc1 /mnt/)
Ext3 was created and mounted with default options, though on a machine with only 4 GB of RAM.

So, testing.
Here is the results table from iozone (up to the point where I interrupted it) with read/reread, write/rewrite and random read/write tests for XFS (either default or tuned as described above). Throughput numbers are in KB/s.
                                                   random  random
     KB  reclen   write rewrite    read    reread    read   write
8388608       8   73671   64052    77565    80107   35281    5085
8388608      16   74437   66095    77611    80065   66854    8808
8388608      32   74683   66780    77564    80202  121442   14576
8388608      64   74936   66908    77537    80372  215377   22583
8388608     128   74928   68598    77542    80247  339304   32280
8388608     256   73609   69615    77534    80143  365081   40571
8388608     512   73763   69830    77547    80317  420704   48501
8388608    1024   73940   69474    77602    80065  406266   47295
I.e. 5 MB/s random write speed for 8 KB records!
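
To put the random write column in perspective, the throughput can be converted into write operations per second and into the time one full pass over the file takes. This is only a rough sketch of the arithmetic, assuming the iozone numbers above are in KB/s (the function names are made up for illustration):

```python
# Random write throughput from the XFS table above: record size (KB) -> KB/s.
# Assumption: iozone reports throughput in KB/s.
xfs_random_write = {8: 5085, 16: 8808, 32: 14576, 64: 22583,
                    128: 32280, 256: 40571, 512: 48501, 1024: 47295}

FILE_KB = 8388608  # the 8 GB test file


def ops_per_second(record_kb, throughput_kb_s):
    """Number of records written per second at the given throughput."""
    return throughput_kb_s / record_kb


def seconds_to_write(file_kb, throughput_kb_s):
    """Time needed to write the whole file once at the given throughput."""
    return file_kb / throughput_kb_s


for record_kb, kb_s in xfs_random_write.items():
    ops = ops_per_second(record_kb, kb_s)
    minutes = seconds_to_write(FILE_KB, kb_s) / 60
    print(f"{record_kb:5d} KB records: {ops:7.1f} writes/s, {minutes:5.1f} min per pass")
```

With 8 KB records this works out to roughly 636 writes per second, so a single random write pass over the 8 GB file takes about 27 minutes.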

Do you really want to know the ext3 speed? Pregnant kids and women should skip the next paragraph.
I interrupted the test after almost 2 (!) hours of random writing of an 8 GB file with 8 KB records on default ext3. The test did not complete, so I do not really know its performance (note that this machine has only 4 GB of RAM; other hardware details were described above), but it will be less than 1 MB/s.
Ext4 behaves much better in this respect (mount options: rw,noatime,data=writeback,extents):
                                                   random  random
     KB  reclen   write rewrite    read    reread    read   write
8388608       8   69593   74200    77324    81340   35538    5088
8388608      16   66745   70038    73676    77271   65715    8704
8388608      32   68253   70320    73652    77258  121690   14469
8388608      64   68421   71291    73653    77042  209629   22005
8388608     128   68438   71340    73658    76988  332021   30381
8388608     256   68921   71254    73651    76912  435586   40683
8388608     512   69079   71728    73551    76815  549136   49298
8388608    1024   66611   71217    73683    76581  552459   49220
POHMELFS results are coming...

/devel/fs :: Link / Comments (0)


Wed, 28 May 2008

POHMELFS got read balancing between multiple servers and simultaneous writes to them.

I hate laziness, but sometimes I fall into that hole... So for the last couple of days I just stupidly wasted my time (well, I read about Lisp and failed to find a GTK binding for CLISP, wrote some code and a kernel bug fix, but that does not count). Today laziness became really boring, so I made some small progress in POHMELFS parallel processing.

It got the ability to send transactions to multiple servers by default and to balance reading between them (so far it always reads from the first server and switches to the second in case of error, but that is trivial to change). This was implemented via special routes for each transaction, which are stored per network state, so if one of the servers did not answer, we would not resend data to the others. It also makes the trees smaller, which should allow faster reading when there are lots of pending write transactions.
The code is currently in the testing stage. I will complete read balancing tomorrow and test it against multiple servers on different machines, with data placed on disk so that random access is slow. With two servers I expect to get a linear speed increase. If the test turns out to be disk IO bound, it is possible to add multiple servers on the same machine, so that each server runs on its own disk (I have two reasonably fast SCSI disks on each testing machine).
Results will be published here of course (well, even if they are miserable :).
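
The per-transaction routes described above can be pictured with a small sketch. It is not the POHMELFS code: the class, the method names and the round-robin helper are hypothetical, and it only shows the idea that each transaction remembers which servers have already acknowledged it, so that a resend goes only to the silent ones.

```python
class Transaction:
    """Hypothetical transaction with one route per target server."""

    def __init__(self, tid, payload, servers):
        self.tid = tid
        self.payload = payload
        # One route per server; the flag becomes True once that server acked.
        self.acked = {server: False for server in servers}

    def ack(self, server):
        self.acked[server] = True

    def pending_servers(self):
        """Servers that did not answer yet and still need a (re)send."""
        return [server for server, done in self.acked.items() if not done]

    def committed(self):
        """The write is complete only when every server acknowledged it."""
        return all(self.acked.values())


def pick_reader(servers, counter):
    """Round-robin read balancing: the n-th read goes to server n mod N."""
    return servers[counter % len(servers)]
```

For example, after `ack("s1")` on a transaction routed to `["s1", "s2"]`, only `"s2"` is left in `pending_servers()`, so the data is not sent to the first server again.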

/devel/fs :: Link / Comments (0)


Sun, 25 May 2008

Every lisper did that.

#!/usr/bin/clisp
(defun f (m)
  (do ((k 0 (1+ k))
       (c 0 n)
       (n 1 (+ c n)))
    ((eql k m)
     (format t "~r" c))))
(f 317)

Guess the result: seven hundred and ninety-three vigintillion, five hundred and ninety-one novemdecillion, four hundred and seven octodecillion, eight hundred and four septendecillion, one hundred and fifty-one sexdecillion, nine hundred and twenty-six quindecillion, five hundred and ninety-three quattuordecillion, seven hundred and ninety-three tredecillion, forty-two duodecillion, one hundred and twenty-six undecillion, eight hundred and ninety-one decillion, one hundred and twenty-eight nonillion, eight hundred and nineteen octillion, six hundred and ten septillion, seven hundred and ten sextillion, one hundred and forty quintillion, one hundred and forty-five quadrillion, thirty-seven trillion, nine hundred and fifty-eight billion, two hundred and seventy-three million, seven hundred and seventy-seven thousand, three hundred and ninety-seven

/devel/other :: Link / Comments (4)


New POHMELFS release. Full transaction support. Data and metadata cache coherency.

Irish Tullamore Dew helped this POHMELFS release to see the light.

Short changelog:

  • Full transaction support for all operations (object creation/removal, data reading and writing). Data reading transactions are not optimal yet and will be improved in the next release (although fast).
  • Data and metadata cache coherency support. More details on how this is implemented can be found in the appropriate section.
  • Transaction timeout based resending. If a transaction does not receive a reply within the specified timeout, it will be resent (possibly to a different server).
  • Switched writepage path to ->sendpage() which improved performance and robustness of the writing.
  • Preliminary support for parallel data processing. Code to write data to multiple servers in parallel and balance reading between them was imported, but is not used right now.
  • Fair number of bugfixes.
Next release is scheduled for the beginning of the next month, and will likely include following features:
  • Improved reading transactions.
  • Server redundancy extensions (ability to store data in multiple locations according to regexp rules, like '*.txt' in /root1 and '*.jpg' in /root1 and /root2).
  • Client parallel extensions: ability to write to multiple servers and balance reading between them. Code was imported to the current version, but not enabled yet.
  • Client dynamic server reconfiguration: ability to add/remove servers from the working set by server command and from userspace.
  • Start generic server distribution development.
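
The planned redundancy rules from the list above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the rule table, the default root and the function name are hypothetical, and since the example patterns look like shell globs, the sketch matches them with fnmatch rather than with true regular expressions.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table: glob pattern -> server roots that store a copy.
RULES = [
    ("*.txt", ["/root1"]),
    ("*.jpg", ["/root1", "/root2"]),
]
DEFAULT_ROOTS = ["/root1"]  # assumption: fallback when no rule matches


def roots_for(name, rules=RULES):
    """Return the roots where an object with this name should be stored."""
    for pattern, roots in rules:
        if fnmatchcase(name, pattern):
            return roots
    return DEFAULT_ROOTS
```

With these rules `roots_for("photo.jpg")` yields both roots, while a text file is stored only in /root1.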
As usual one can grab the latest source from archive or GIT tree.

/devel/fs :: Link / Comments (0)


Sat, 24 May 2008

This was supposed to be POHMELFS release day.

But no, it is scheduled for tomorrow because of the very interesting way I decided to implement reading transactions. The way it works right now is quite miserable, so I want to clean things up and make a really good patch.

The page reading code will create a single transaction for a bunch of pages and, instead of waiting for that transaction to complete, will schedule the next one if pages have not yet been received, waiting only at the very end (if needed). With the addition of an async copy from the receiving kernel thread into the reading userspace process via copy_to_user() (in the todo), this will become the fastest possible way of reading over the net, I think.

So far changelog contains following items:

  • Full transaction support for all operations (object creation/removal, data reading and writing). Data reading transactions are not optimal yet and will be improved in the next release.
  • Data and metadata cache coherency support. More details on how this is implemented can be found in the devel section.
  • Transaction timeout based resending. If a transaction does not receive a reply within the specified timeout, it will be resent (possibly to a different server).
  • Switched writepage path to ->sendpage() which improved performance and robustness of the writing.
  • Fair number of bugfixes.

/devel/fs :: Link / Comments (0)


Wed, 21 May 2008

iput() locking in POHMELFS.

iput() is a very tricky call in the Linux VFS: besides dropping the inode when its reference counter reaches zero, it also waits until all associated pages are flushed to storage.
POHMELFS uses a single thread per network state (network connection structure), which only reads async replies from the server. So it is possible that a reply which requires iput() (for example a create command reply) arrives in parallel with object removal, so the inode is already deleted but not yet freed. When the reply is received and iput() is called, it will try to free the inode and wait until all pages associated with its mapping are synced. But page sync happens on the reply to another command (consider for example several writeback transactions), which cannot be processed, since the thread is waiting for them to complete. This problem cannot be fixed by introducing multiple threads, since each of them can be in exactly the same situation simultaneously.

Therefore we should not grab an inode and free it in the receiving path. This is ok for writeback transactions, since the inode cannot be freed until its pages are synced, so just holding the pages is enough to avoid locking up. But object creation for empty files or directories has no pages attached, so they have to be synced with a special transaction. There can still be a problem with an empty file though: some pages can be attached and it can be removed while the system waits for the creation transaction to complete. But actually we do not need to know about that: we should not grab the inode at all, since the transaction already contains all the needed info, namely the inode number, so we can look up the inode (if it still exists) and mark it as created without the need for a lock-prone grab/put.

This bit took me the last three days, during which POHMELFS moved to non-blocking receiving and timeout-based sending (and back again), got a scanning 'watchdog' which resends transactions if they were not acked after some time and eventually drops them if they still do not get a reply, and got a couple of new operations supported, and likely something else on top of the set of features implemented to date (full transaction support for all operations and the data and metadata coherency protocol were added for the next release).
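
The idea of completing a creation reply by inode number, without taking and dropping a reference, can be sketched in a few lines. This is a userspace toy and not kernel code: the table, the class names and the `created` flag are hypothetical, and the locking a real implementation would need is omitted.

```python
class Inode:
    def __init__(self, ino):
        self.ino = ino
        self.created = False  # set once the server confirmed creation


class InodeTable:
    """Hypothetical table mapping inode numbers to live inodes."""

    def __init__(self):
        self.by_number = {}

    def add(self, inode):
        self.by_number[inode.ino] = inode

    def remove(self, ino):
        self.by_number.pop(ino, None)

    def complete_create(self, ino):
        """Called from the receiving thread when a create reply arrives.

        No reference is taken and nothing is released here, so the
        receiving thread never ends up waiting for this inode's pages.
        """
        inode = self.by_number.get(ino)
        if inode is None:
            return False  # object was removed meanwhile, nothing to do
        inode.created = True
        return True
```

If the object was removed before the reply arrived, the lookup simply fails and the reply is ignored.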
New release is scheduled for the end of the week, and there is no readpage transaction support yet...
So, stay tuned!

/devel/fs :: Link / Comments (3)


Things getting worse...

$ clisp 
  i i i i i i i       ooooo    o        ooooooo   ooooo   ooooo
  I I I I I I I      8     8   8           8     8     o  8    8
  I  \ `+' /  I      8         8           8     8        8    8
   \  `-+-'  /       8         8           8      ooooo   8oooo
    `-__|__-'        8         8           8           8  8
        |            8     o   8           8     o     8  8
  ------+------       ooooo    8oooooo  ooo8ooo   ooooo   8

Welcome to GNU CLISP 2.42 (2007-10-16) 

Copyright (c) Bruno Haible, Michael Stoll 1992, 1993
Copyright (c) Bruno Haible, Marcus Daniels 1994-1997
Copyright (c) Bruno Haible, Pierpaolo Bernardi, Sam Steingold 1998
Copyright (c) Bruno Haible, Sam Steingold 1999-2000
Copyright (c) Sam Steingold, Bruno Haible 2001-2007

Type :h and hit Enter for context help.

[1]> (defun test-func () (format t "It's a test func"))
TEST-FUNC
[2]> (test-func) 
It's a test func
NIL
[3]> (exit)
Bye.
This one has, imho, the least ugly command line... And I'm against SLIME and Emacs. I also tried SBCL, GNU CL and something else, but likely CLISP will stay.

Instead of sleeping (it will soon be time to wake up in the Moscow slums) or at least catching POHMELFS bugs (the last several days were devoted solely to this task, and a fair number of them were fixed, as well as some interesting features introduced (probably new), so the new release will likely see the light later this week), I'm drinking some beer and making my first steps into Lisp. So far it looks quite new and probably interesting, but every introductory article about it I have read said that if you are older than 25, it is likely impossible to change anything in your perception. I am older, but I think it will be fun and will probably become a really good tool for me.

The more I think about it, the more interesting tasks I find (as well as those I'm already thinking about, like CAPTCHA)...

/devel/other :: Link / Comments (5)


Mon, 19 May 2008

Russia - Canada 5:4

Yesterday Russia became the hockey world champion, for the first time in 15 years!

Final goal!

/other :: Link / Comments (2)


Sat, 17 May 2008

POHMELFS got full data and metadata cache coherency support. Transaction support for majority of the commands.

linux-2.6.pohmelfs$ git-diff-tree -r --stat 21549d0a101 master
 fs/pohmelfs/dir.c   |  108 ++++++--------------
 fs/pohmelfs/inode.c |  279 ++++++++++++++++++++++++++++++++++++--------------
 fs/pohmelfs/net.c   |  216 ++++++++++++++++++++++++++++++---------
 fs/pohmelfs/netfs.h |   43 +++++++-
 fs/pohmelfs/trans.c |   55 +++++++++-
 5 files changed, 484 insertions(+), 217 deletions(-)
It was a rather simple task thanks to async event processing support.
Each time a client creates, reads or writes an object on the server, information about its interest is stored on the server. When any other client updates the same object (changes attributes or writes data), all interested clients get notifications with the new data (new attributes or, in case of writing, possibly a new size and a flag telling which page has to be fetched from the server, since it is not valid anymore). Writing happens during writeback as before, so a command like "echo Some_message > /mnt/file" immediately syncs the size of the file to zero and writes the actual data some time later, when the system decides to start writeback.
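
The interest tracking described above boils down to a map from objects to the clients that touched them. The following sketch shows only that bookkeeping; the class and method names are hypothetical and the "network" is just a per-client list of pending notifications.

```python
from collections import defaultdict


class CoherencyServer:
    """Toy model of server-side interest tracking (names are made up)."""

    def __init__(self):
        self.interested = defaultdict(set)  # object path -> client ids
        self.inbox = defaultdict(list)      # client id -> pending notifications

    def access(self, client, path):
        """Any create/read/write registers the client's interest."""
        self.interested[path].add(client)

    def update(self, client, path, **changes):
        """Record an update and notify every other interested client."""
        self.access(client, path)
        for other in self.interested[path]:
            if other != client:
                self.inbox[other].append((path, changes))
```

A client that never accessed the object gets no notification, and the client that made the change is not notified about its own update.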

I also ported all commands but one to the transaction mechanism, which means they will all be resent if the currently active network connection goes down. Most of the commands are not synchronous, and thus will not be resent after a timeout, but this can be trivially changed if there is major demand for that.

Only reading has not yet been ported to the transaction model, which is the next task to complete. These transactions have to be synchronous, since we do want to read data, while we do not actually care about full directory content.

These changes have to be seriously tested and all problematic places resolved. For example, they slow down metadata operations noticeably, since the system now sends a message each time a new object is created: untarring the kernel archive now takes about 5 seconds against the previous 2-3 (including sync, on a 4-way machine with 8 GB of RAM). That is still not comparable to the 30+ seconds for async NFS, but it has to be investigated further.

After full move to transaction model and cache coherency testing (that model may be not complete for some usage, since locks are not yet supported), POHMELFS will make its first steps into distributed area...

Stay tuned!

/devel/fs :: Link / Comments (0)


Fri, 16 May 2008

Metadata cache coherency support in POHMELFS.

Client:

$ ls -lai /mnt/test
3 -rw-r--r--  1 root root 94208 2008-05-16 22:27 test
$ sudo chown zbr.zbr /mnt/test 
$ ls -lain /mnt/test
3 -rw-r--r-- 1 2319 1002 94208 2008-05-16 22:27 /mnt/test
Server:
fserver_get_client_data: thread: 3085847440, cmd: 8, id: 0, start: 2, size: 94, ext: 0.
fserver_transaction: thread: 3085847440, trans: 0, size: 94, sub: cmd: 10, id: 3, start: 0, size: 70, ext: 6.
fserver_inode_info: path: '/test', size: 94208, mode: 100644, uid: 2319, gid: 1002.
So the server now has all metadata information about the object updated on the client. pohmelfs_setattr() is synchronous for remotely read inodes and for already synced inodes that were originally created locally. It does nothing if the object is not yet synced to the server, since syncing will provide that info itself.

The only missing thing is to asynchronously broadcast that data to other clients, which requires a cache of objects that are interesting to a given client. Each client will be automatically added to the group of interest when it looks up an object, so when an attribute of that object is set, the update will be sent to interested parties. A client will be dropped from the group of interest when it drops the corresponding inode locally (which will force sending a special message).

/devel/fs :: Link / Comments (0)


Thu, 15 May 2008

Meanwhile on the apartment development side.

I installed the water system for the shower and thought of installing the whole cabin, but found (as usual) that I do not have drill bits for ceramic tiles. So that is postponed for a while.
I also expect the glue for ceramic tiles to be delivered today (as well as the brick tiles), so that I can start covering the hall with granite. Although I'm a bit tired after the water system installation, which took the major part of the day.
It is actually a simple task, but only when you have easy access to all parts. Now imagine a 10 cm thick wall in which you managed to drill two holes, each about 2 cm in diameter (less than two fingers thick). A meter below and to the left there is a bigger hole for the sanitary system (about 15x15 cm). The water system hatch is located 2.5 meters to the right of this.
The task is to run thin water tubes from the water hatch to the two small holes, with the splitter installed near the bigger sanitary hole. Without direct access to any tube (you can only feel it, not see it) you have to connect them via different connectors using spanners (I should also mention that it is quite hard to put both hands into the bigger hole for the sanitary system).

I've completed the task, although I'm not sure it is really safe. It was challenging and exhausting, so I will probably just slack this evening and hack some bits of captcha. I will also cover my table with the last coat of paint (yes, yes, it is still not done) and/or apply a second layer of varnish to the x-shelves (they look really cool after mordant and varnish)...

/devel/flat :: Link / Comments (0)


POHMELFS distributed plans.

After the healthy discussion that followed my announcement of the second POHMELFS release, it's time to highlight the main ideas settled in the thread.

First, POHMELFS will move towards parallel distributed filesystems, while still being very good as a network filesystem. In particular, that will include the ability to read data from any of the connected servers (not necessarily from the currently active one, as is done right now), while writing will happen to all connected servers simultaneously (and a transaction will be committed after all servers have returned a completion acknowledgement).

The protocol will be extended to support dynamic addition and removal of servers to/from the currently connected group. There will probably be some kind of status messages for servers (i.e. 'going offline, do not send me data', or 'I'm becoming slow, do not read from me' and so on). This will be done in addition to cache coherency messages (which I have yet to implement; because of other tasks this was postponed a bit, probably to the weekend), which will include two types of requests: page invalidation and inode update (that also means that POHMELFS will start supporting attributes (maybe even extended ones), right now it doesn't :). Such a cache coherency protocol should scale better than classical MOSI (and its derivatives), and particularly better than what the pNFS spec provides (leases on operations for some servers), since it is still possible to work on the same file in parallel, especially without any overhead if data processing does not cross client boundaries, but it has to be tested in practice.

The POHMELFS server will be extended to support distributed facilities. Very likely it will be some kind of PAXOS algorithm, although probably in a very limited mode to begin with. At first it will be really simple, so that I can touch all its corner cases and find the optimal development strategy.

The client extensions are not that complex, although not always trivial, so they should not take too much time, and you will probably get something interesting soon.
Server extensions will be a bit slower, since I will essentially start from the ground floor of distributed systems and gradually move upstairs.

/devel/fs :: Link / Comments (0)


Tue, 13 May 2008

New POHMELFS release. Transactions, performance, failover.

Irish John Jameson (6 years of experience, really good stuff) brings us this new POHMELFS release.

Main features include:

  • Fast transactions. The system wraps all writes into transactions, which will be resent to a different (or the same) server in case of failure.
  • Failover. It is now possible to provide a number of servers to be used in round-robin fashion when one of them dies. The system will automatically reconnect to the others and send transactions to them.
  • Performance. Super fast (close to wire limit) metadata operations over the network. Courtesy of the writeback cache and transactions, the whole kernel archive can be untarred in 2-3 seconds (including sync) over a GigE link (wire limit! Not comparable to NFS).
The nearest roadmap includes:
  • Full transaction support for all operations (only writeback is guarded by transactions currently, default network state just reconnects to the same server).
  • Data and metadata coherency extensions (in addition to existing commented object creation/removal messages).
  • Server redundancy.
One can check out POHMELFS homepage for more details. You can download latest release (against 2.6.25 kernel tree) from archive or GIT tree.

/devel/fs :: Link / Comments (0)


Mon, 12 May 2008

Meanwhile on the apartment development side.

I went to a home improvement store and got zillions of things there, including various paints for the kitchen ceiling and the room's ceiling plinth, ordered brick-like tiles for the kitchen (about one third of the walls there will be covered with bricks), got some tools (like a rubber hammer for the tiles), ordered glue for the ceramic granite for the hall, and also got a shower (yummy, my shower cabin was delivered today too) and related stuff for the water system installation.

By the original plan I wanted to install the shower cabin today, but taking the current time into account, it is too late for loud work, so I will proceed with my table instead. It will be completed today, or call me a ... whatever you like (out of curiosity, is there an English dictionary of indecent words? I know a Russian one exists).
If things move fast, I will also varnish my X-shelves, and will probably take some photos...

/devel/flat :: Link / Comments (2)


Fast POHMELFS transactions.

With the new transactions and the new waiting mechanism (see below), the system now untars the whole kernel tree in less than 3 seconds over a GigE link (including the subsequent sync, which always takes less than a second), while async NFS (the remote side is tmpfs in both cases) does that in a bit more than 30 seconds. In addition, POHMELFS write speed is 125 MB/s (wire limit) vs. less than 90 MB/s for NFS (dd from /dev/zero with 1 MB block size and 1000 blocks).

That's what I call a good result.

The transaction mechanism invoked in the writeback path is now completely async too, i.e. it does not wait until the remote side confirms that a transaction was received and processed. However, writeback does not drop a transaction after the sending function returns; instead it stores it in the in-flight storage and proceeds with the next one. A transaction can accumulate up to 90 pages in a single frame.
When a reply is received, the async thread searches for the given transaction and completes it (unlocks the pages, although that could be done in writeback since the page is being copied, clears the writeback bits, drops the transaction from the appropriate radix tree and drops the reference counter). If a transaction was not sent due to some error, or if the server returned an error, it will be resent to different servers. Since the original writeback path no longer knows about in-flight transactions, any timeout has to be checked by a dedicated thread (or workqueue), which will detect transactions that are too old (by simply checking them from the beginning, since each new transaction has an increased id) and resend them to remote servers.
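
The timeout scan can be sketched as follows. This is a simplified model, not the kernel implementation: the class, its fields and the retry limit are hypothetical, a plain dict stands in for the in-flight storage, and the age of a transaction is counted from its creation so that ids and ages stay ordered the same way.

```python
class InFlight:
    """Hypothetical in-flight storage; dicts keep insertion (= id) order."""

    def __init__(self, timeout, max_retries):
        self.timeout = timeout
        self.max_retries = max_retries
        self.next_id = 0
        self.pending = {}  # id -> {"created_at", "retries", "payload"}

    def send(self, payload, now):
        tid = self.next_id
        self.next_id += 1
        self.pending[tid] = {"created_at": now, "retries": 0, "payload": payload}
        return tid

    def complete(self, tid):
        """A reply was received: drop the transaction from the storage."""
        self.pending.pop(tid, None)

    def scan(self, now):
        """One watchdog pass; returns (resent ids, dropped ids)."""
        resent, dropped = [], []
        for tid, entry in list(self.pending.items()):
            if now - entry["created_at"] < self.timeout:
                break  # ids grow with time, so everything after is newer
            if entry["retries"] >= self.max_retries:
                del self.pending[tid]  # still no reply: give up
                dropped.append(tid)
            else:
                entry["retries"] += 1  # the caller resends this one
                resent.append(tid)
        return resent, dropped
```

Because the scan stops at the first transaction that is still young enough, a pass costs only as much as the number of expired transactions, assuming it is run about once per timeout period.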

There is a small problem though: if the object is bigger than a single transaction can accumulate (90 pages), it will be split into several transactions, where the first one contains the object creation command and some data to be written, while the others contain only data. If the server runs multiple threads per client (the default is one though), it is possible that a transaction other than the first will be processed first, so the server will write some data into a non-existent file and the transaction will fail. There are two ways to fix this issue: either wait in writeback on the client until the creation transaction is completed and then send all the others as described above, or add the creation command to every subsequent transaction until the object is created on the server (a special bit is set on the local inode in that case). The latter is likely the better option.
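
The second option, carrying the creation command in every transaction until the object exists on the server, can be illustrated like this. The dictionary layout and the function name are made up for the example; only the 90 page limit comes from the text.

```python
MAX_PAGES = 90  # pages a single transaction can accumulate


def build_transactions(total_pages, created_on_server=False):
    """Split an object into transactions of at most MAX_PAGES pages.

    Every transaction carries the creation command until the server has
    confirmed that the object exists, so the order in which the server
    processes them no longer matters.
    """
    transactions = []
    for first in range(0, total_pages, MAX_PAGES):
        transactions.append({
            "create": not created_on_server,
            "first_page": first,
            "pages": min(MAX_PAGES, total_pages - first),
        })
    return transactions
```

A 200 page object becomes three transactions of 90, 90 and 20 pages, each of which is able to create the file on its own.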

/devel/fs :: Link / Comments (0)


Wed, 07 May 2008

Fast transactions in POHMELFS.

POHMELFS just switched to faster transactions, allocated one-by-one with even smaller overhead (although it does not use kernel_sendpage() for page sending yet; it copies data).
The system does not yet wait only after all transactions are completed (it waits after each one), but with the new transaction allocation it is 1.5 times faster: 98 MB/s vs. 64 MB/s. Note that without waiting for transaction completion it gets the full wire speed of 125 MB/s with a 1500 byte MTU. And that is with highmem pages and thus a slow kmap() of each one, and an unmap after completion. I do not use ->sendpage() since it would force splitting a proper set of iovecs into mixed calls of kernel_sendmsg() and kernel_sendpage(), which I want to avoid so far. Now it is (again) faster than NFS, but I want to move further.
So the solution is rather trivial: wait until several transactions are completed. The whole infrastructure is already there: in-flight transaction storage, per-transaction completion and destruction callbacks, proper reference counting and async completion.
Still, only writing transactions are used (i.e. reading/lookup and others will not be redirected to different servers).
There are some bugs of course, but that's the first development version after all.

/devel/fs :: Link / Comments (0)


Tue, 06 May 2008

New captcha solving problem.

Just in case you notice some delay in filesystem or network development, the reason is simple: I decided to devote some time to a new captcha cracking problem, namely this one:

Captcha problem

I want to test my captcha breaking ideas on something real. I was also frustrated by their abuse team, which was not able to fix the spam filter based on the messages I sent them (bounce and original, just as requested).
It is pretty unlikely that anything will appear anytime soon, but I do want to test some ideas...

/devel/captcha :: Link / Comments (0)


Mon, 05 May 2008

POHMELFS transaction support. Failover (re)connection to different servers.

POHMELFS just got full transaction support. So far it is only used in the ->writepages() callback, which is invoked by the writeback mechanism. POHMELFS uses lazy transaction support, namely it waits after each transaction, which includes a header and data to be written for at most 14 pages. 14 is a magic number of pages which corresponds to the struct pagevec size used by generic writeback; the transaction size itself is limited by a mount option and is 32 pages by default. Performance dropped from 125 MB/s to 64 MB/s, which is not acceptable. The main problem is of course waiting for each transaction to complete (i.e. for the completion message from the server). There should not be per-transaction waiting; instead writeback has to allocate as many transactions as needed and process them one after another, and only start waiting for them when there are no more pages to be written. This is the next task.

The transaction mechanism allows quite simple reconnection to different master servers in case of failure, and rollback of the failed transaction. For example, one can provide a number of main servers (which have to be in sync with each other and able to synchronize among themselves, or can just use shared storage), and the POHMELFS client will switch between them if the current one fails. The system will detect the failure and reconnect; if the reconnect fails, the next server will be used and the whole transaction will be resent there.
It is also possible to write a transaction to a different server on demand (it may or may not be connected already, but it has to have an address structure, which so far is only obtained during pre-mount configuration), which is a prerequisite for parallel data processing. One can create a simple patch to write transactions one after another to servers in round-robin fashion.
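
The failover behaviour described above, trying the next server and resending the whole transaction there, can be sketched as a simple loop. The `send` callable and the function name are hypothetical; the sketch assumes a failed send or reconnect shows up as a ConnectionError.

```python
def send_with_failover(transaction, servers, send, start=0):
    """Try servers in round-robin order, beginning with index `start`.

    `send(server, transaction)` is a hypothetical callable that raises
    ConnectionError on failure. Returns the index of the server that
    accepted the transaction, to be used as the new current server.
    """
    for attempt in range(len(servers)):
        index = (start + attempt) % len(servers)
        try:
            send(servers[index], transaction)
            return index
        except ConnectionError:
            continue  # this server is down: resend to the next one
    raise ConnectionError("all servers failed")
```

Keeping the returned index as the next `start` makes the client stick to the server that last worked instead of retrying the dead one on every transaction.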

Right now only write transactions are used (and can be combined with object creation if needed); read transactions are pending, as well as multiple parallel transactions (which are not complex, the main question is how to wait for them all to complete; very similar code is used in pohmelfs_aio_read()).

There is also a pending task of cache coherency support (server-originated messages to clients that use the same pages another client is writing into, including metadata coherency messages like uid/gid/inode size and other changes). It is not that complex a task, and mostly requires server modifications.

Stay tuned!

/devel/fs :: Link / Comments (0)


Sun, 04 May 2008

Tanks in the city!

Wanted to visit Moscow and see how we play the balalaika, drink vodka and walk with bears?
Not now, we drive our tanks instead.



The Victory Day parade rehearsal, the night of April 29.

/other :: Link / Comments (3)


Fri, 02 May 2008

Design of the POHMELFS transaction model.

It is heavily based on how netlink is implemented in the Linux kernel. Although it is likely the most ugly and complex protocol among the communication models supported by the kernel, it is also the most effective, extensible and feature-rich one.
This model is based on attributes, which are embedded into the message. Each attribute has a header, which includes the size of the attached data. So one can put an effectively unlimited amount of data into any message (limited only by the size field and practical considerations of the communication), and it is possible to create a message which contains any number of different attributes.
The main problem of netlink is the ugliness of its padding and alignment. The protocol tries to squeeze every bit out of the communication, so there is a huge amount of very hairy stuff there.
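
To make the attribute idea concrete, here is a small sketch of length-prefixed attributes with netlink-style 4-byte alignment. It is modelled on netlink's attribute header (a 16-bit length followed by a 16-bit type), but the byte order, the function names and the attribute types are assumptions for the example, and it is not the POHMELFS wire format.

```python
import struct

HEADER = struct.Struct("<HH")  # assumed layout: 16-bit length, 16-bit type
ALIGN = 4                      # netlink aligns attributes to 4 bytes


def pack_attr(attr_type, payload):
    """Encode one attribute: header, payload, zero padding up to ALIGN."""
    length = HEADER.size + len(payload)  # the length excludes the padding
    padding = -length % ALIGN
    return HEADER.pack(length, attr_type) + payload + b"\0" * padding


def parse_attrs(message):
    """Walk a message and return a list of (type, payload) pairs."""
    attrs, offset = [], 0
    while offset + HEADER.size <= len(message):
        length, attr_type = HEADER.unpack_from(message, offset)
        if length < HEADER.size or offset + length > len(message):
            raise ValueError("truncated or corrupt attribute")
        attrs.append((attr_type, message[offset + HEADER.size:offset + length]))
        offset += (length + ALIGN - 1) & ~(ALIGN - 1)  # skip the padding too
    return attrs
```

A parser written this way can step over an attribute with a type it has never seen, because the length in the header is enough to find the next one. That is what makes such a protocol extensible without breaking old peers.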

I like to drink, and (un)fortunately I sometimes get pretty bad quality drinks, but I'm absolutely sure that when Alexey Kuznetsov designed the netlink attribute alignment policies he had a really bad hangover after likely the worst crap he ever drank.

So, netlink attributes are very ugly, but you can extend them however you like.
The same applies to POHMELFS transactions.

You can put any new attribute into a transaction in a very trivial manner (I worked with netlink a lot, and even created the kernel connector to simplify the kernel development side, so I know that taste). Although the transaction size is limited, it is controlled only by a mount option (the default is 32 IO vectors, each of PAGE_SIZE (4k on x86), in one transaction).

Thus one can easily implement, for example, any protocol security labeling: just add a new per-packet attribute.

So, it is easily possible to infinitely extend communication protocol with full backward compatibility.

/devel/fs :: Link / Comments (0)