Fri, 13 Jul 2007
Friday 13. But it works.
I'm talking about distributed storage.
An initial implementation of distributed storage with linear
(remote) device mapping (details below) now works in Linux.
Linear mapping is no more than a trivial concatenation of several (remote)
nodes into a single local one. There is no redundancy, failover or anything of the kind;
there is only proof-of-concept code, which allows forming a local device
on top of several (remote) nodes. I created a single ext3 filesystem
on top of it (part of the filesystem is on one device, part on another, but the filesystem
does not know about it; it also does not know about any special operations that would need
to be performed for it to work, like those NFS needs to operate correctly).
It is quite trivial, but works ok.
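To make the concatenation concrete, here is a minimal sketch in plain userspace C (not the kernel code). All names in it (dst_node, dst_linear_layout, dst_linear_map) are made up for illustration and are not taken from the actual implementation; it only assumes that each node reports its size in sectors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node descriptor, not the real distributed storage structure. */
struct dst_node {
	uint64_t start;	/* first sector of this node inside the combined device */
	uint64_t size;	/* number of sectors the node exports */
};

/* Place the nodes back to back and return the size of the combined device. */
static uint64_t dst_linear_layout(struct dst_node *nodes, size_t count)
{
	uint64_t next = 0;

	for (size_t i = 0; i < count; i++) {
		nodes[i].start = next;
		next += nodes[i].size;
	}
	return next;
}

/*
 * Find the node that holds 'sector' of the combined device and store the
 * node-local sector in *local. Returns the node index, or -1 if the sector
 * lies beyond the end of the device.
 */
static int dst_linear_map(const struct dst_node *nodes, size_t count,
			  uint64_t sector, uint64_t *local)
{
	assert(local != NULL);

	for (size_t i = 0; i < count; i++) {
		if (sector >= nodes[i].start &&
		    sector - nodes[i].start < nodes[i].size) {
			*local = sector - nodes[i].start;
			return (int)i;
		}
	}
	return -1;
}
```

With two nodes of 100 and 50 sectors, sector 99 of the combined device lands on the first node and sector 100 becomes sector 0 of the second one. The filesystem on top only ever sees the combined sector numbers.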
Actually this is no more than a usual device mapper, but with the ability
to combine local and remote nodes. Due to limitations of the device mapper config
I ended up creating a third system for combining devices into one (after device mapper
and md (multiple device), now distributed storage). I will definitely create
a device mapper module to form remote nodes regardless of its table file limitations,
since it is usually simpler for users to work with an existing system.
There are unsolved problems of course:
- there is no block IO split, i.e. if a single block request crosses the boundary
between devices, it will not be split in two, but completed with errors.
This definitely must be fixed, even though the vast majority of requests are page-sized only.
- quite bad userspace support - the configuration utility is ugly (it uses the same ugly char device
with ioctl commands instead of netlink).
- local export was not tested - there are only userspace remote nodes, which work
on top of a regular file.
But nevertheless this is a big step forward.
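The missing block IO split from the list above could be approached roughly like this. This is only a userspace sketch under the assumption that node sizes are known in sectors; dst_chunk and dst_split are hypothetical names, and real kernel code would have to split the bio itself rather than fill an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the part of a request that goes to one node. */
struct dst_chunk {
	size_t   node;	/* index of the node */
	uint64_t local;	/* first sector inside that node */
	uint64_t len;	/* number of sectors */
};

/*
 * Split the range [sector, sector + len) of the combined device into
 * per-node chunks. sizes[i] is the size of node i in sectors.
 * Returns the number of chunks stored in 'out', or -1 if the range does
 * not fit into the device or 'out' has fewer than the needed 'max' slots.
 */
static int dst_split(const uint64_t *sizes, size_t count,
		     uint64_t sector, uint64_t len,
		     struct dst_chunk *out, size_t max)
{
	uint64_t start = 0;	/* first combined sector of the current node */
	size_t n = 0;

	assert(out != NULL);

	for (size_t i = 0; i < count && len > 0; i++) {
		uint64_t end = start + sizes[i];

		if (sector < end) {
			/* Take as much as this node can hold. */
			uint64_t part = end - sector;

			if (part > len)
				part = len;
			if (n == max)
				return -1;
			out[n].node = i;
			out[n].local = sector - start;
			out[n].len = part;
			n++;
			sector += part;
			len -= part;
		}
		start = end;
	}
	return len == 0 ? (int)n : -1;
}
```

A request that stays inside one node produces a single chunk, so the common page-sized case costs nothing extra; only a request crossing the boundary produces two (or more) chunks.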
There are things to think about:
- Synchronization. Currently each block request is handled in the order it arrived, i.e.
no new requests will be sent to the remote device until the current one is completed. There are
a number of pros and cons to this approach, and likely things should be changed to add a per-page tag
showing which request the page belongs to (or better, just put in the global offset of the request). This is
similar to how iSCSI works, where each request has its own tag, so that it is possible to have
several of them in flight. This is likely a good idea, but needs further investigation.
Actually, talking about iSCSI - I doubt it will work well (at least on small systems)
if the main sending function works on top of a non-blocking socket with the following loop,
without any polling or sleeping:
static void iscsi_xmitworker(struct work_struct *work)
{
	struct iscsi_conn *conn = container_of(work, struct iscsi_conn, xmitwork);
	int rc;

	/*
	 * serialize Xmit worker on a per-connection basis.
	 */
	mutex_lock(&conn->xmitmutex);
	do {
		rc = iscsi_data_xmit(conn);
	} while (rc >= 0 || rc == -EAGAIN);
	mutex_unlock(&conn->xmitmutex);
}
And that is not even mentioning the atomic-only allocations in the iSCSI stack.
I do not say distributed storage is a perfect solution,
but I designed it with quite a few such issues in mind instead of following
a theoretical-only design (which is what several people proposed to do
before actually starting to code and finding all the problematic places)...
- Failure mode. Currently if one node is broken, the whole device stops working - this must be handled
by a per-algorithm decision (i.e. a redundancy algo will just update the required nodes instead of failing).
But the existing linear algorithm requires that property, so it is not that bad for now.
Another issue with failure mode is recovery - if some node was replaced, some information has to be written to it
during the recovery phase. Doing that via the main node (like all existing systems - they sync
via the main device) is not optimal in a distributed scenario; it would be better to tell
some remote node to send the info to the new node directly instead of via the main node,
but that depends on the algorithm being used, so it should be postponed until
some interesting redundancy mechanism is developed (like WEAVER codes).
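Going back to the synchronization item: the per-request tag idea could look roughly like the sketch below. It is plain userspace C under stated assumptions; the fixed window size and the names dst_window, dst_submit and dst_complete are hypothetical, and nothing here is taken from iSCSI or the current code. The point is only that a reply carrying a tag can be matched to its request even when replies arrive out of order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DST_MAX_INFLIGHT 4	/* assumed number of requests allowed in flight */

/* Hypothetical bookkeeping for one outstanding request. */
struct dst_request {
	uint64_t offset;	/* global offset of the request */
	int      busy;		/* non-zero while waiting for the reply */
};

struct dst_window {
	struct dst_request slot[DST_MAX_INFLIGHT];
};

/*
 * Reserve a slot for a new request. Returns its tag (the slot index),
 * which would be sent along with the data, or -1 if the window is full
 * and the caller has to wait for a completion first.
 */
static int dst_submit(struct dst_window *w, uint64_t offset)
{
	for (int tag = 0; tag < DST_MAX_INFLIGHT; tag++) {
		if (!w->slot[tag].busy) {
			w->slot[tag].busy = 1;
			w->slot[tag].offset = offset;
			return tag;
		}
	}
	return -1;
}

/*
 * Match a reply to its request by tag and free the slot. Returns 0 and
 * stores the request offset in *offset, or -1 for an unknown or stale tag.
 */
static int dst_complete(struct dst_window *w, int tag, uint64_t *offset)
{
	assert(offset != NULL);

	if (tag < 0 || tag >= DST_MAX_INFLIGHT || !w->slot[tag].busy)
		return -1;
	*offset = w->slot[tag].offset;
	w->slot[tag].busy = 0;
	return 0;
}
```

With a window of one slot this degenerates into the current behaviour, where nothing new is sent until the previous request completes.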
But overall, the first distributed storage device has been formed on top of several remote nodes
(with trivial autoconfig - remote device sizes are requested by the distributed storage core
and that data is used accordingly to form the device).
In short, this is quite a small, but definite success.