|
|
About ::
TODO ::
Blog ::
RSS ::
Old blog ::
Projects ::
GIT ::
Gallery ::
Notes
Tue, 22 Jul 2008
POHMELFS distributed facilities design notes.
Since I'm quite busy with VISA/hotel/tickets and overall preparations
for Kernel Summit, there is no development progress, but it should be
completed very soon I think, and so I will write here some design notes
I have in mind about how POHMELFS server will be designed. It is not a
finished draft, but somewhat a rough direction paint.
POHMELFS will utilize distributed hash table approach, i.e. storage
will support ability to get an obect based on some key attached to it.
In a local filessytem we already work with hash table: directory
lookup is no more than lookup for inode object based on its name, i.e.
lookup for the value based on attached key. And although key in this
case is not created based on object itself (like hash of the content or
some other function), it still is a (turn on your imagination here) table lookup.
Cloud of POHMELFS servers will utilize similar approach. Consider a single
server in the system. When it joins the cloud (I ommit this proccess for now,
and will describe it below) first time, it is empty, so it gets some unique
id, either via administrator steps or randomly, or it just waits in the queue
to be filled with new data, so it will get id at that time, it does not matter
for now how it gets its id, but this id is propagated to some cloud of its
neighbours (or if it would be a bittorrent or napster to the main server).
There are two ideas on how to treat this ID: either as a part of the filename,
or as a nameless pointer in the abstract namespace, I will show below that actually
it does not matter.
Now, let's check what will happen when user wants to perform some IO on given file.
Every file access actually happen to inode, stored on disk. In our case it can be stored
somewhere we do not know yet where, so we need to perform a lookup to get address
of the node in cluster which contains our data. In existing schemas like bittorrent
or Lustre there is a server (or small cloud of servers) which contain mapping information
about where this or that object is placed in data cloud, so simple lookup to this server(s)
return needed info. This approach does not scale to really lots of nodes and is failure-prone.
Instead I consider completely distributed metadata storage. Let's check how system will lookup
the whole path in our case.
Each path starts from the root directory, which is '/', which in turn is a id in the global
namespace (or hash from this string or whatever else mapping), so we first need to lookup
a node, which is responsible to content of this directory. Each node contains routes only
to the very limited set neighbour nodes (in various designs this number varys, but idea
lays in the fact, that node, performing lookup, does not know which node contains needed info).
Gnutella system just broadcasted this lookup request to all of its neighbours, so each one
broadcasted it to its neighbours and so on until one of the system replied, that it contains
needed info. Amount of unneded broadcasts killed Gnutella next day after Napster was closed.
So, this approach does not scale, and instead we need to map needed directory into node address
in a more intelligent way. There are at least two the most appealing design choices: ring-based
structure implemted in CHORD and multidimensional torus implemented in CAN.
Right now it does not matter, let's assume that we found a node, which has information
about content of the needed directory. When we have that data, we can find next node (or this
info can be cached on 'parent' directory node) and so on until get node, which is resposible for
storing content of the needed object.
When new node joins the cloud it connects to one or another known node (provided either in public
service or by administrator) and sends there information about its available space, gets ID
and just waits until some client connects to it and start writing a data.
When node joins with some content, which was written to it by the system before, or written by
local users bypassing distributed mechanism, node has to tell this information to the node, which
holds parent directory. This information should be stored in each directory it exports, or it
can be provided by administrator, for example this node exports dir '/zbr' which is actually a subdir
of '/home', so node will lookup '/home' directory content owner and update its records, that now
it contains new dir. There is a problem here: what if there is already another node, which also
claims to have dir '/zbr' in '/home'? This can be handled via attached to each object extended attribute,
which will tell us the last modification date, so system can select either the last modified '/zbr'
dir or that node, which contains dir with the biggest number of the same replicas. It can be setup by
administrator.
Main advantage of this joining scheme is the fact, that we actually do not need to know content of any
object in the exported directory, we publish only high-level object, which may or may not contain some
inner file or dir. Thus we do not need to hash millions of files in the exported directory and publish
them one by one, we do not need to store information about each inner object,
no need attach full path to each object and so on.
When we will decide to split the same object between multiple node, we will need to introduce not only
name based lookup, but also extend it to the offset inside the object. This can be done by introducing
ssytem wide 'block size', so each file is actually set of blocks of given size, so when we found a node,
resposible for storing information about directory, where it is located, this node can also contain
information where each part of the object was stored.
Looks quite simple, but... Devil is in the details.
I obviously missed some bits in the design (and I created it in mind during talk being
under 'impression' of the greece spirit while talking with asm@, who suggested to look
at Kademlia project), like redundancy management of the nodes, splitting of the node content between
multiple nodes and other bits, but it is one of the first drafts, so things can be changed if needed.
Stay tuned, I will be very soon back to development process
(DST first :), since paper work for kernel summit travel
seems to reach its end.
/devel/fs :: Link / Comments ()
|