Zbr's days.

Thu, 15 Nov 2007

Ground points of the filesystem development.

1. Data read/write rebalance in the filesystem.
When storages can be added to or removed from the system, there is an obvious question about their utilisation. First, when your data is spread over different nodes/storages, reading can be faster, since it can be performed in parallel.
From another point of view, this can lead to heavy data fragmentation if done incorrectly (for example, when data that was tightly packed in the first place gets spread out, which then requires heavy write/update overhead).
So, this is a good solution for read-mostly setups, but a bad choice for write-mostly cases.
The cleanest solution for this issue that I see is to use copy-on-write semantics, which imply that each new write is placed at a new location. Thus when a new storage is added to the filesystem, it is readily utilized for new writes, which in turn can work with delayed allocation and extents, heavily reducing fragmentation.
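To illustrate why copy-on-write makes a newly added storage useful right away, here is a minimal sketch in Python. Everything in it is a hypothetical model and not part of any real filesystem: the `Storage` and `CowFilesystem` classes, the bump allocator, and the "pick the storage with the most free blocks" policy are assumptions chosen only to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Storage:
    """One storage device with a trivial bump allocator (hypothetical model)."""
    name: str
    capacity: int        # total number of blocks
    next_free: int = 0   # index of the next unallocated block

    def free(self) -> int:
        return self.capacity - self.next_free

    def allocate(self) -> int:
        block = self.next_free
        self.next_free += 1
        return block


@dataclass
class CowFilesystem:
    storages: list = field(default_factory=list)
    # logical block number -> (storage name, physical block)
    block_map: dict = field(default_factory=dict)

    def add_storage(self, storage: Storage) -> None:
        self.storages.append(storage)

    def write(self, logical_block: int) -> tuple:
        # Copy-on-write: never overwrite in place. Assumed policy: always
        # allocate on the storage that currently has the most free blocks.
        target = max(self.storages, key=lambda s: s.free())
        if target.free() == 0:
            raise OSError("no space left on any storage")
        location = (target.name, target.allocate())
        # The previous location of this logical block simply becomes garbage.
        self.block_map[logical_block] = location
        return location


fs = CowFilesystem()
fs.add_storage(Storage("disk0", capacity=100))
for lb in range(60):
    fs.write(lb)                  # disk0 is the only choice so far

fs.add_storage(Storage("disk1", capacity=100))
print(fs.write(0))                # the rewrite of block 0 lands on disk1
```

No data has to be migrated when `disk1` appears: the next write, even an update of an old block, is simply placed there. A real allocator would hand out extents instead of single blocks and would delay allocation until flush time, which this sketch leaves out.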
Reading is a bit trickier: ideally data should be spread over the new storage, but having large contiguous regions for the same file is a huge win because of read-ahead logic and the way disks work, so only fragmented files have to be moved around. Here we enter defragmentation land, which is very small and easy in a copy-on-write design: the file should be read and written back to get a new contiguous region, or a special operation should be introduced to do essentially the same without writing to the data (for example, doing it on sync or flush).
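The "read it and write it back" defragmentation can be sketched roughly as follows. This is only an illustration under stated assumptions: a file is modelled as a list of `(start, length)` physical extents in file order, free space is a single region starting at `free_start`, and the function names and the `max_fragments` threshold are made up for the example.

```python
def count_fragments(extents):
    """Count physically contiguous runs in a file's extent list.

    `extents` is a list of (start, length) pairs in file order.
    """
    fragments = 0
    expected_next = None
    for start, length in extents:
        if start != expected_next:
            fragments += 1            # this extent does not continue the previous one
        expected_next = start + length
    return fragments


def defragment(extents, free_start, max_fragments=1):
    """Rewrite a fragmented file into one contiguous region.

    Returns (new_extents, new_free_start). Files that are already
    contiguous enough are left where they are.
    """
    if count_fragments(extents) <= max_fragments:
        return extents, free_start
    total = sum(length for _, length in extents)
    # Copy-on-write rewrite: the whole file goes to a fresh contiguous region.
    return [(free_start, total)], free_start + total


fragmented = [(10, 4), (50, 4), (14, 2)]   # three separate runs on disk
contiguous = [(10, 4), (14, 2)]            # one run split into two extents

print(count_fragments(fragmented))
print(defragment(fragmented, free_start=1000))
print(defragment(contiguous, free_start=1000))
```

Only the fragmented file is moved; the contiguous one stays in place, which matches the point that large contiguous regions should be kept as they are.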

So, to summarise my ideas, the only thing needed for high-performance reads and writes with multiple (or extendible) storages is copy-on-write semantics behind the IO logic, with correctly implemented balancing algorithms (like proper delayed allocation and extent usage).
This is the first base point of my filesystem design.

2. Locking.
Obviously, the fewer locks you have, the less time you will spend in busy loops (zero in the perfect case).
Thus the main design principle is to allow multiple simultaneous IO (reads and writes) and metadata (file creation/deletion and so on) operations.
While multiple readers are handled just fine in the Linux kernel via generic_file_aio_read(), all writers are stuck on inode->i_mutex in generic_file_aio_write(), which effectively blocks multithreaded writing to the same file. But inode->i_mutex should really only guard metadata updates, not the writing itself, so this issue has to be resolved in any filesystem aimed at high-performance applications (no filesystem in the Linux kernel currently tries to avoid grabbing inode->i_mutex for writes). Taking into account the number of hacks I implemented for networking without touching a lot of core code, I'm pretty sure I will be able to do this within my own filesystem only.
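One possible way to let writers to the same file proceed in parallel is to lock byte ranges instead of the whole inode. The sketch below is a user-space analogy in Python, not kernel code and not the design of this filesystem: `RangeLock` and `File` are hypothetical names, the file size is fixed so that no metadata update is needed, and waiting writers sleep on a condition variable.

```python
import threading


class RangeLock:
    """Lock over half-open byte ranges [start, end); overlapping ranges exclude each other."""

    def __init__(self):
        self._cond = threading.Condition()
        self._held = []   # ranges currently locked by some writer

    def _overlaps(self, start, end):
        return any(s < end and start < e for s, e in self._held)

    def acquire(self, start, end):
        with self._cond:
            # Wait only while somebody holds an overlapping range.
            while self._overlaps(start, end):
                self._cond.wait()
            self._held.append((start, end))

    def release(self, start, end):
        with self._cond:
            self._held.remove((start, end))
            self._cond.notify_all()


class File:
    """Fixed-size file whose writers are serialized per byte range, not per file."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.ranges = RangeLock()

    def write(self, offset, payload):
        end = offset + len(payload)
        self.ranges.acquire(offset, end)
        try:
            self.data[offset:end] = payload
        finally:
            self.ranges.release(offset, end)


f = File(8)
threads = [
    threading.Thread(target=f.write, args=(0, b"AAAA")),
    threading.Thread(target=f.write, args=(4, b"BBBB")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(bytes(f.data))
```

The two writers touch disjoint ranges, so neither waits for the other. A write that extends the file would still have to take a separate metadata lock to update the size, which is the one job the text above leaves to inode->i_mutex.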

3. Motivation.
I strongly believe that it is impossible to make really good things when you are forced to do them. So my idealism tells me that when you are paid to do the work, it will not be completed in the best way. Do not confuse this with getting money for things you do for yourself or on your own initiative; those are completely different approaches.

4. Fun.
It has to be fun. If a project starts draining your energy without good feedback, it has to be completed to the next milestone and frozen. If something is not interesting, it should be avoided.

Those were my rules for a successful filesystem project; the last two items obviously apply to any other project.

Stay tuned :)

/devel/fs


CEPH distributed storage.

It was announced on LWN and kerneltrap recently.
I already wrote about this filesystem; after that I found out (from a discussion with Zach Brown) that it does not have byte-range locking, and when a number of threads write to the same file, their writes become synchronous (i.e. no cache coherence protocols are involved). I'm also not sure what this is about: "I/O workloads should be done with the client cache off because the writeback is too non-deterministic."

Those were my envious comments :), now the good news.
First, Sage Weil (the author) works full-time on this project and funds it from his own web hosting company, so it is possible to attract developers with money (he even hired someone to write a kernel client instead of the FUSE one). Second, it has a completed design and a working implementation (although some design decisions are questionable).
So it is likely a good choice to look at if you are searching for a solution that should be ready shortly.

/devel/dst