Zbr's days.
September
Sun Mon Tue Wed Thu Fri Sat
           
29
           
2007
Months
Sep

About :: TODO :: Blog :: RSS :: Old blog :: Projects :: GIT :: Gallery :: Notes

Sat, 29 Sep 2007

Design of the WEAVER redundancy code module in distributed storage.

Here I will briefly describe ideas of the WEAVER implementation for DST.

First, it requires some extension for configuration (although it is possible with existing commands to send all related information, it is a bit ugly), so I will introduce private algorithm's command which will allow to transfer all needed information. This will not even break backward compatibility.

Next is WEAVER implementation itself.
I will create 50% efficiency codes, which means half of each node will contain data (first half, so that later it could be used for linear algorithm for recovery for example, but that is quite ugly usage) and another half will contain parity information (and optionally checksum of the whole node or data/parity checksums separately). Reading is trivial (just like in case of usual mirroring) and will not be described here. With such nodes layout each write to the distributed storage easily allows to find a node where data should be written, private structure for each node will contain a list of nodes, which parity blocks have to be updated when given node is being written to. Each private structure will have the same per-block bitmap like mirror algo maintains, which will track clean/dirty blocks of the data. So, when number of blocks are being written to node A, number of parity blocks will be updated on the appropriate nodes. Parity update is quite trivial - it requires a read of the given parity block, xor with old data block (which will be read just before writing is performed) and xor with new data block, then parity block has to be written back. Dirty bitmap for data and parity will be updated after data and parity are written.

Complex part starts when number of nodes is turned off.
Let's first describe the case, when one node is turned off.

In this case things are just the same like above, but data is not written (and parity calculated, when write happens for other nodes, linked to given one by algorithm), so appropriate bitmap will contain not uptodate blocks.
When node is turned on, resync starts.
Just like in mirror algorithm, it will be performed only when some operation happens (which probalby should be extended to be done even if there are no operations in-flight) - I will share as much code as possible.
Each data block can be constructed by using some parity and data blocks from different nodes, so instead of reading block from remote node and writing it to given one (like in mirror), system will read number of data and parity blocks from different nodes, when they are ready, new data block will be reconstructed and written to not-synced node. When data block is reconstructed, all nodes, which parity depends on this block, will be updated accordingly.

For one failed node this looks quite simple, in theory multiple failed node is just the same, since until number of failed nodes reached fault tolerance number used in algorithm calculations, there is always number of 'good' nodes, which can be used to reconstruct at least one failed node, with newly recovered node it is possible to reconstruct missed parity bits needed to recover another node and so on. But that is theory, practice, I'm sure, will show me big number of surprises and complexities, but that is exactly why I started the project.

Right now I'm thinking about userspace implementation, I hope to complete it quite soon (maybe next week), so that kernel part would not be that hard to be done (mostly because kernel debugging is usually much longer because of compilations and reboots). When it is ready, I will mark project as completed.
I'm not sure I will push it into the kernel at that point, I will see.
After it is ready I will likely start to recall filesystem bits I collected and probably will implement simple (read: initially trivial) distributed filesystem.
Or maybe not, I think this project will flush me enough, so I might start something completely different, like process migration over the network, Markov chains text analysis, letter describing language for captcha solving or network protocol over laser link or something completely new.

Stay tuned and you will not miss interesting things :)

/devel/dst :: Link / Comments (1)