Zbr's days.

Sun, 06 Apr 2008

There is only one way: asynchronous.

This is a new motto for POHMELFS. It is a completely new filesystem now.

POHMELFS got new page processing code (the sending side: commands and data) and a new lookup, which is based on the Linux VFS inode cache instead of reinventing the wheel (a comment says it is very SMP-friendly, although I do not quite understand how that is possible with the global inode_lock). It also got a completely new object creation and referencing path: it is possible to create a huge path (up to 4k, which can easily be extended if there is demand) with multiple objects in it using only a single network command.
But the main feature of the new POHMELFS is its name cache. I did not find a way to hook into the VFS dentry cache, so I invented my own. It is fast to traverse from a child to the highest-level parent, which is actively used in the POHMELFS writeback path. It is not the best possible storage, just a simple RB-tree (which thus requires an SMP-unfriendly mutex), but the whole idea already shows its gains. Eventually it will be replaced with a faster and more scalable approach protected by RCU (even a properly sized hash table would scale better, although dynamic resizing of hash tables prevents RCU usage), but I started from the simplest ground.
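To illustrate why a child-to-parent walk is handy for writeback, here is a minimal userspace sketch, not the actual POHMELFS code. It assumes a hypothetical `struct name_entry` that stores one path component and a pointer to its parent, and a hypothetical helper `entry_path()` that assembles the full path from the end of the buffer backwards. The RB-tree indexing and the mutex are left out on purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cache entry: each object knows its name and its parent. */
struct name_entry {
    const char *name;            /* one path component, no slashes */
    struct name_entry *parent;   /* NULL for the root of the mount */
};

/*
 * Build the full path of 'e' in 'buf' by walking child -> parent.
 * Components are copied from the end of the buffer backwards, so no
 * recursion or string reversal is needed. Returns a pointer to the
 * NUL-terminated path inside 'buf', or NULL if 'buf' is too small.
 */
static char *entry_path(const struct name_entry *e, char *buf, size_t len)
{
    size_t pos = len;

    assert(e != NULL && buf != NULL);
    if (len < 2)
        return NULL;             /* need room for "/" and the NUL */
    buf[--pos] = '\0';

    /* The root itself is just "/". */
    if (e->parent == NULL) {
        buf[--pos] = '/';
        return buf + pos;
    }

    /* Prepend "/name" for every entry below the root. */
    for (; e->parent != NULL; e = e->parent) {
        size_t n = strlen(e->name);

        if (pos < n + 1)
            return NULL;         /* path does not fit */
        pos -= n;
        memcpy(buf + pos, e->name, n);
        buf[--pos] = '/';
    }
    return buf + pos;
}
```

With entries for `usr`, `src` and `linux` chained to a root entry, calling `entry_path()` on the `linux` entry yields `/usr/src/linux`. The cost is proportional to the depth of the path, not to the number of cached names.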

POHMELFS already outperforms async NFS when untarring and completely saturates my Xen test domains (both network and disk speed), while NFS is almost two times slower. The test machines have 256 MB of RAM and a maximum interconnect speed of 3 MB/s (something is likely broken in the Xen setup, since it is supposed to be 100 Mbit/s and there is no high load). This is very unfriendly for POHMELFS (read: in such a scenario POHMELFS shows its worst results), but it is fast nevertheless.

It became not only much faster, but also simpler. Its userspace server has half as many lines of code (816 vs. 1613), and the kernel side is smaller and simpler too: mainly, there are no more zillions of different trees indexed by every possible key. So far there is only a per-inode tree of child names for readdir and a per-superblock path entry cache.

There are drawbacks of course: there is no receiving code (at all). It will be a dedicated thread that asynchronously processes all incoming packets (mostly async readdir replies, read page content and cache-coherency messages). The first two are really simple. The last one will be implemented as a full MOSI/MSI library for inode content. It will likely be possible to use it in my other projects.
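For reference, the plain MSI part of such a protocol boils down to a small state machine. The sketch below is the textbook MSI transition table, not the planned library: the names `msi_state`, `msi_event` and `msi_next()` are made up for illustration, the Owned state of MOSI is left out, and no messages are actually sent. The `writeback` flag only reports when a modified copy would have to be flushed before the state changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State of the locally cached copy of some inode content. */
enum msi_state { MSI_INVALID, MSI_SHARED, MSI_MODIFIED };

/* Local accesses and coherency messages caused by other clients. */
enum msi_event {
    MSI_LOCAL_READ,
    MSI_LOCAL_WRITE,
    MSI_REMOTE_READ,
    MSI_REMOTE_WRITE
};

/*
 * Return the next state of the local copy for the given event.
 * '*writeback' is set when dirty data must be flushed first.
 */
static enum msi_state msi_next(enum msi_state cur, enum msi_event ev,
                               bool *writeback)
{
    assert(writeback != NULL);
    *writeback = false;

    switch (ev) {
    case MSI_LOCAL_READ:
        /* A read needs at least a shared copy; S and M stay as they are. */
        return cur == MSI_INVALID ? MSI_SHARED : cur;
    case MSI_LOCAL_WRITE:
        /* A write needs exclusive ownership (others get invalidated). */
        return MSI_MODIFIED;
    case MSI_REMOTE_READ:
        /* Another client reads: a dirty copy is flushed and downgraded. */
        if (cur == MSI_MODIFIED) {
            *writeback = true;
            return MSI_SHARED;
        }
        return cur;
    case MSI_REMOTE_WRITE:
        /* Another client writes: the local copy becomes invalid. */
        if (cur == MSI_MODIFIED)
            *writeback = true;
        return MSI_INVALID;
    }
    return cur;
}
```

The receiving thread would feed remote events into such a function, while the local read and write paths would feed in the local ones.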

P.S. I frequently think that I'm a very good vapourware seller :)
Stay tuned!
