|
|
About ::
TODO ::
Blog ::
RSS ::
Old blog ::
Projects ::
GIT ::
Gallery ::
Notes
Sat, 29 Sep 2007
Design of the WEAVER redundancy code module in distributed storage.
Here I will briefly describe ideas of the WEAVER implementation for DST.
First, it requires some extension for configuration (although it is possible
with existing commands to send all related information, it is a bit ugly),
so I will introduce private algorithm's command which will allow to transfer
all needed information. This will not even break backward compatibility.
Next is WEAVER implementation itself.
I will create 50% efficiency codes, which means half of each node will contain
data (first half, so that later it could be used for linear algorithm
for recovery for example, but that is quite ugly usage) and another half will contain
parity information (and optionally checksum of the whole node or data/parity checksums
separately). Reading is trivial (just like in case of usual mirroring) and will not be described
here. With such nodes layout each write to the distributed storage easily allows to find
a node where data should be written, private structure for each node will contain
a list of nodes, which parity blocks have to be updated when given node is being written to.
Each private structure will have the same per-block bitmap like mirror algo maintains, which will
track clean/dirty blocks of the data. So, when number of blocks are being written to node A,
number of parity blocks will be updated on the appropriate nodes. Parity update
is quite trivial - it requires a read of the given parity block, xor with old data block
(which will be read just before writing is performed) and xor with new data block,
then parity block has to be written back. Dirty bitmap for data and parity will be updated
after data and parity are written.
Complex part starts when number of nodes is turned off.
Let's first describe the case, when one node is turned off.
In this case things are just the same like above, but data is not written (and parity
calculated, when write happens for other nodes, linked to given one by algorithm),
so appropriate bitmap will contain not uptodate blocks.
When node is turned on, resync starts.
Just like in mirror algorithm, it will be performed only when some operation happens (which probalby
should be extended to be done even if there are no operations in-flight) - I will share as much code
as possible.
Each data block can be constructed by using some parity and data blocks from different nodes,
so instead of reading block from remote node and writing it to given one (like in mirror), system
will read number of data and parity blocks from different nodes, when they are ready, new data block
will be reconstructed and written to not-synced node. When data block is reconstructed, all nodes,
which parity depends on this block, will be updated accordingly.
For one failed node this looks quite simple, in theory multiple failed node is just the same, since
until number of failed nodes reached fault tolerance number used in algorithm calculations,
there is always number of 'good' nodes, which can be used to reconstruct at least one failed node,
with newly recovered node it is possible to reconstruct missed parity bits needed to recover another
node and so on. But that is theory, practice, I'm sure, will show me big number of surprises and complexities,
but that is exactly why I started the project.
Right now I'm thinking about userspace implementation, I hope to complete it quite soon (maybe next week),
so that kernel part would not be that hard to be done (mostly because kernel debugging is usually
much longer because of compilations and reboots). When it is ready, I will mark project as completed.
I'm not sure I will push it into the kernel at that point, I will see.
After it is ready I will likely start to recall filesystem bits I
collected and probably will
implement simple (read: initially trivial) distributed filesystem.
Or maybe not, I think this project will flush me enough, so I might start something completely
different, like process migration over the network,
Markov chains text analysis,
letter describing language for captcha solving
or network protocol over laser link
or something completely new.
Stay tuned and you will not miss interesting things :)
/devel/dst :: Link / Comments (1)
Please solve this captcha to be allowed to post (need to reload in a minute): 25 - 8
|
KernelPanic wrote at 2007-10-10 15:30: