Zbr's days.
July
Sun Mon Tue Wed Thu Fri Sat
   
22
   
2008
Months
Jul
Nov Dec

About :: TODO :: Blog :: RSS :: Old blog :: Projects :: GIT :: Gallery :: Notes

Tue, 22 Jul 2008

POHMELFS distributed facilities design notes.

Since I'm quite busy with VISA/hotel/tickets and overall preparations for Kernel Summit, there is no development progress, but it should be completed very soon I think, and so I will write here some design notes I have in mind about how POHMELFS server will be designed. It is not a finished draft, but somewhat a rough direction paint.

POHMELFS will utilize distributed hash table approach, i.e. storage will support ability to get an obect based on some key attached to it. In a local filessytem we already work with hash table: directory lookup is no more than lookup for inode object based on its name, i.e. lookup for the value based on attached key. And although key in this case is not created based on object itself (like hash of the content or some other function), it still is a (turn on your imagination here) table lookup.

Cloud of POHMELFS servers will utilize similar approach. Consider a single server in the system. When it joins the cloud (I ommit this proccess for now, and will describe it below) first time, it is empty, so it gets some unique id, either via administrator steps or randomly, or it just waits in the queue to be filled with new data, so it will get id at that time, it does not matter for now how it gets its id, but this id is propagated to some cloud of its neighbours (or if it would be a bittorrent or napster to the main server).
There are two ideas on how to treat this ID: either as a part of the filename, or as a nameless pointer in the abstract namespace, I will show below that actually it does not matter.

Now, let's check what will happen when user wants to perform some IO on given file.
Every file access actually happen to inode, stored on disk. In our case it can be stored somewhere we do not know yet where, so we need to perform a lookup to get address of the node in cluster which contains our data. In existing schemas like bittorrent or Lustre there is a server (or small cloud of servers) which contain mapping information about where this or that object is placed in data cloud, so simple lookup to this server(s) return needed info. This approach does not scale to really lots of nodes and is failure-prone.
Instead I consider completely distributed metadata storage. Let's check how system will lookup the whole path in our case.

Each path starts from the root directory, which is '/', which in turn is a id in the global namespace (or hash from this string or whatever else mapping), so we first need to lookup a node, which is responsible to content of this directory. Each node contains routes only to the very limited set neighbour nodes (in various designs this number varys, but idea lays in the fact, that node, performing lookup, does not know which node contains needed info). Gnutella system just broadcasted this lookup request to all of its neighbours, so each one broadcasted it to its neighbours and so on until one of the system replied, that it contains needed info. Amount of unneded broadcasts killed Gnutella next day after Napster was closed.
So, this approach does not scale, and instead we need to map needed directory into node address in a more intelligent way. There are at least two the most appealing design choices: ring-based structure implemted in CHORD and multidimensional torus implemented in CAN.
Right now it does not matter, let's assume that we found a node, which has information about content of the needed directory. When we have that data, we can find next node (or this info can be cached on 'parent' directory node) and so on until get node, which is resposible for storing content of the needed object.

When new node joins the cloud it connects to one or another known node (provided either in public service or by administrator) and sends there information about its available space, gets ID and just waits until some client connects to it and start writing a data.
When node joins with some content, which was written to it by the system before, or written by local users bypassing distributed mechanism, node has to tell this information to the node, which holds parent directory. This information should be stored in each directory it exports, or it can be provided by administrator, for example this node exports dir '/zbr' which is actually a subdir of '/home', so node will lookup '/home' directory content owner and update its records, that now it contains new dir. There is a problem here: what if there is already another node, which also claims to have dir '/zbr' in '/home'? This can be handled via attached to each object extended attribute, which will tell us the last modification date, so system can select either the last modified '/zbr' dir or that node, which contains dir with the biggest number of the same replicas. It can be setup by administrator.

Main advantage of this joining scheme is the fact, that we actually do not need to know content of any object in the exported directory, we publish only high-level object, which may or may not contain some inner file or dir. Thus we do not need to hash millions of files in the exported directory and publish them one by one, we do not need to store information about each inner object, no need attach full path to each object and so on.

When we will decide to split the same object between multiple node, we will need to introduce not only name based lookup, but also extend it to the offset inside the object. This can be done by introducing ssytem wide 'block size', so each file is actually set of blocks of given size, so when we found a node, resposible for storing information about directory, where it is located, this node can also contain information where each part of the object was stored.

Looks quite simple, but... Devil is in the details.
I obviously missed some bits in the design (and I created it in mind during talk being under 'impression' of the greece spirit while talking with asm@, who suggested to look at Kademlia project), like redundancy management of the nodes, splitting of the node content between multiple nodes and other bits, but it is one of the first drafts, so things can be changed if needed.

Stay tuned, I will be very soon back to development process (DST first :), since paper work for kernel summit travel seems to reach its end.

/devel/fs :: Link / Comments (0)