Zbr's days.


Thu, 15 Nov 2007

Ground points of the filesystem development.

1. Data read/write rebalance in the filesystem.
When it is possible to add/remove storages from the system, there is a clear question about their utilisation. First, when you have your data spread over different nodes/storages, reading is generally faster, since it can be performed in parallel.
On the other hand, this can lead to heavy data fragmentation if done incorrectly (for example, when the data was tightly packed in the first place, so that after spreading it out every write/update carries heavy overhead).
So this is a good solution for read-mostly setups, but a bad choice for write-mostly cases.
The cleanest solution for this issue I see is to use copy-on-write semantics, which implies that each new write is placed at a new location. Thus, when new storage is added to the filesystem, it will be readily utilized for new writes, which in turn can work with delayed allocation and extents, heavily reducing fragmentation.
Reading is a bit trickier: ideally data should be spread over the new storage, but having large contiguous regions for the same file is a huge win because of read-ahead logic and the way disks work, so only fragmented files have to be moved around. Here we enter defragmentation land, which is very small and easy in a copy-on-write design: the file should be read and written back to get a new contiguous region, or a special operation should be introduced to do essentially the same without writing to the data (for example, doing it on sync or flush).

So, to summarise my ideas, the only thing needed for high-performance reads and writes with multiple (or extendible) storages is copy-on-write semantics behind the IO logic, with correctly implemented balancing algorithms (like proper delayed allocation and extent usage).
This is the first base point of my filesystem design.
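
To make the idea a bit more concrete, here is a minimal user-space sketch in Python. It is only an illustration under my own assumptions, not the actual filesystem code: the CowStore class, its method names and the "pick the emptiest device" policy are all hypothetical. Every write is mapped to fresh blocks, a newly added device attracts new writes, and defragmentation is just "read and write back" into one contiguous region.

```python
class CowStore:
    """Toy copy-on-write block store spanning several devices (hypothetical)."""

    def __init__(self):
        self.free_start = {}   # device name -> first never-used block
        self.capacity = {}     # device name -> total number of blocks
        self.files = {}        # file name -> {logical block: (device, physical block)}

    def add_device(self, name, blocks):
        self.free_start[name] = 0
        self.capacity[name] = blocks

    def _free(self, name):
        return self.capacity[name] - self.free_start[name]

    def _allocate(self, length):
        # Assumed policy: pick the emptiest device, so that freshly added
        # storage is used for new writes right away.
        name = max(self.capacity, key=self._free)
        if self._free(name) < length:
            raise OSError("no device has a large enough contiguous free region")
        start = self.free_start[name]
        self.free_start[name] += length
        return name, start

    def write(self, fname, offset, length):
        # Copy-on-write: never overwrite in place, always map to new blocks.
        # Old blocks are simply abandoned; a real design would reclaim them.
        name, start = self._allocate(length)
        mapping = self.files.setdefault(fname, {})
        for i in range(length):
            mapping[offset + i] = (name, start + i)
        return name

    def extent_count(self, fname):
        # Number of physically contiguous runs when reading the file in order.
        count, prev = 0, None
        for logical in sorted(self.files[fname]):
            dev, phys = self.files[fname][logical]
            if prev != (dev, phys - 1):
                count += 1
            prev = (dev, phys)
        return count

    def defragment(self, fname):
        # "Read and write back": remap all blocks into one new contiguous region.
        mapping = self.files[fname]
        name, start = self._allocate(len(mapping))
        for i, logical in enumerate(sorted(mapping)):
            mapping[logical] = (name, start + i)


store = CowStore()
store.add_device("a", 100)
store.write("f", 0, 4)
store.write("g", 0, 4)          # another file lands between f's two pieces
store.write("f", 4, 4)          # f now consists of 2 extents on "a"
store.add_device("b", 1000)     # new, empty storage
store.write("f", 0, 2)          # the rewrite goes to "b": f has 3 extents
store.defragment("f")           # f becomes a single extent again
```

The sketch ignores free-space reclaim, delayed allocation and sparse files; it only shows why new storage gets used without any explicit rebalancing step.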

2. Locking.
Obviously, the fewer locks you have, the less time you will spend in busy loops (zero in the perfect case).
Thus the main design principle is to allow multiple simultaneous IO (reads and writes) and metadata (file creation/deletion and so on) operations.
While multiple readers are handled just fine in the Linux kernel via generic_file_aio_read(), all writers are stuck on inode->i_mutex in generic_file_aio_write(), which effectively blocks multithreaded writing to the same file. But inode->i_mutex should actually guard only metadata updates, not the writing itself, so this issue has to be resolved in any filesystem aimed at high-performance applications (currently no filesystem in the Linux kernel tries to avoid grabbing inode->i_mutex for writes). Taking into account the number of hacks I have implemented for networking without touching a lot of core code, I'm pretty sure I will be able to do the same within my own filesystem only.
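
One possible way to get there is per-range locking: writers to non-overlapping regions of the same file proceed in parallel, and a short lock is held only around the metadata update. Below is a rough user-space sketch in Python under that assumption. It is not kernel code and not necessarily what the filesystem will do; RangeLockedFile and its fields are hypothetical names, and the buffer is preallocated to keep the example short.

```python
import threading


class RangeLockedFile:
    """Toy file: metadata lock held briefly, data copied outside of it."""

    def __init__(self, capacity):
        self._meta = threading.Lock()                 # plays the role of i_mutex
        self._cond = threading.Condition(self._meta)
        self._busy = []                               # (start, end) ranges being written
        self.data = bytearray(capacity)               # preallocated for simplicity
        self.size = 0                                 # metadata: current file size

    def _overlaps(self, start, end):
        return any(s < end and start < e for s, e in self._busy)

    def write(self, offset, payload):
        start, end = offset, offset + len(payload)
        if end > len(self.data):
            raise ValueError("write beyond preallocated capacity")
        with self._cond:
            # Wait only for writers that touch the same byte range.
            while self._overlaps(start, end):
                self._cond.wait()
            self._busy.append((start, end))
            self.size = max(self.size, end)           # metadata update, under the lock
        try:
            # The data copy itself runs without the metadata lock held.
            self.data[start:end] = payload
        finally:
            with self._cond:
                self._busy.remove((start, end))
                self._cond.notify_all()


f = RangeLockedFile(4000)
writers = [
    threading.Thread(target=f.write, args=(i * 1000, bytes([65 + i]) * 1000))
    for i in range(4)
]
for t in writers:
    t.start()
for t in writers:
    t.join()
```

The four writers above never wait for each other, since their ranges do not overlap; two writers to the same range would still be serialised.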

3. Motivation.
I strongly believe that it is impossible to make really good things when you are forced to do them. So my idealism tells me that when you are paid to do the work, it will not be completed in the best way. Do not confuse this with getting money for things you do for yourself or on your own initiative; these are completely different approaches.

4. Fun.
It has to be fun. If the project starts sucking energy without good feedback, it has to be completed to the next milestone and frozen. If something is not interesting, it should be avoided.

Those were my rules for a successful filesystem project; the last two items obviously apply to any other project.

Stay tuned :)
