|
|
About
TODO
Blog
RSS
Old blog
Projects
Gallery
Notes
Fri, 14 Mar 2008
Why binary trees are bad. New cache structure for pohmelfs.
I already found experimentally that write-through cache scales very badly,
even noticebly worse than without cache at all for some workloads, so an ideal
solution
does not involve any kind of write-through operations notably no synchronous commands,
which require immediate response.
This means that inode numbers will differ on client and server, so there should be
some kind of tracked dependency between them so that operations on different machines
can be done in sync. Initial though was to use binary tree to store pointers to appropriate
inodes, which would be indexed on server and clients by combination of hashes of inode (direntry)
and its parent data. Even embedded systems can easily have millions of inodes, so choice
was thought to be correct from the first point of view. Now I think different since there
is a serious problem with indexing of such a tree.
Since the only information common to both client and server is object name it should be used
as a key, maybe not name directly, but its hash, that does not matter at this point. Here comes a
problem with binary tree choice: in binary tree there is no connection between real parent in the
filesystem and parent in the binary tree, so there will be serious problems when we will put two
different object with the same name into binary tree - there will be a conflict. To solve
this problem we should use some information about where this object is placed, i.e. information
about its parent directory. Using parent name hash as a part of the key in the binary tree
does not solve problem too, since there might exist multiple directories with the same name and
the same object in it. We could solve the problem by putting into key hash of the object's name
and hash of parents key (which in turn is hash of the name and hash of its parent key), so this
recursive hashing would end up at the highest level (i.e. root directory). This works, but there
might be scalability problem with the following issues:
- server has to either cache opened directories or reopen it one-by-one when accessing an object
- when object is moved/renamed all keys of its children and parent has to be changed
which is unacceptible. So new solution was thought of.
So far I have two ideas:
- kind of radix tree
- multi-layer hash tables indexed by double name hash
While the former is kind of obvious, the latter is quite interesting but very simple idea. Consider
that each directory has a hash table of its children, it is indexed by double hash of child's name.
We need double hash to remove possibility of collision (I can not prove mathematically (maybe only yet)
that there are two hashes which will not allow simultaneous collision in both, but feel quite strongly
that such hash pairs exist) and to use them in network commands. Commands can be optimised either
to use full path if it is short enough (just sent a path string during writeback or readpage as a
path to where data belongs) or use an array of hashes of the path elements instead of '/' separated
names. Hash tables actually have to be changed to different data structure capable of hosting not only
small hash values, but full 32 or 64 bit hashes. It can be a binary tree or judy array, something similar
to what was used in
unified socket storage. The former looks a bit excessive.
Using such approach it is possible to lookup an object with O(k) operations where 'k' is number of directories
in a path, very usually it is smaller than 10, which for binary tree corresponds to as much as 1024 inodes,
which is too small for the real system.
This approach (especially when full path is being sent) allows to eliminate mentioned above scalability problems.
Implementation start is scheduled for today, but I have to think about details first.
/devel/fs :: Link / Comments (4)
|