Zbr's days.

Fri, 13 Jul 2007

Friday 13. But it works.

I'm talking about distributed storage.

An initial implementation of distributed storage with linear (remote) device mapping (details below) now works in Linux.

Linear mapping is no more than a trivial concatenation of several (remote) nodes into a single local one. There is no redundancy, failover or anything like that; it is only proof-of-concept code, which allows a local device to be formed on top of several (remote) nodes. I created a single ext3 filesystem on top of it: part of the filesystem is on one device and part on another, but the filesystem does not know about that, nor does it know about any special operations that would have to be performed for it to work (like those NFS needs to operate correctly). It is quite trivial, but works ok.
Actually this is no more than a usual device mapper, but with the ability to combine local and remote nodes. Due to limitations of the device mapper config I ended up creating a third system for combining devices into one: there is device mapper, md (multiple device) and now distributed storage. I will definitely create a device mapper module to form remote nodes regardless of its table file limitations, since it is usually simpler for users to work with an existing system.
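
To make the linear mapping concrete, here is a minimal userspace sketch of the address translation. It is not the actual distributed storage code: struct dst_node, dst_linear_setup() and dst_linear_map() are hypothetical names, and sizes are assumed to be counted in sectors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names: not taken from the actual distributed storage code. */
struct dst_node {
	uint64_t start;	/* first global sector served by this node */
	uint64_t size;	/* node size in sectors, as reported by the node */
};

/*
 * Lay the nodes out back to back: node i starts where node i - 1 ends.
 * Returns the total size of the combined device in sectors.
 */
uint64_t dst_linear_setup(struct dst_node *nodes, size_t num)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < num; i++) {
		nodes[i].start = total;
		total += nodes[i].size;
	}
	return total;
}

/*
 * Translate a global sector into a node index and a sector local to
 * that node. Returns -1 if the sector is beyond the end of the device.
 */
int dst_linear_map(const struct dst_node *nodes, size_t num,
		   uint64_t sector, uint64_t *local)
{
	size_t i;

	for (i = 0; i < num; i++) {
		if (sector >= nodes[i].start &&
		    sector < nodes[i].start + nodes[i].size) {
			*local = sector - nodes[i].start;
			return (int)i;
		}
	}
	return -1;
}
```

With two nodes of 100 and 50 sectors, global sector 100 lands on the second node at local sector 0, and anything from sector 150 onwards is rejected.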

There are unsolved problems of course:

  • there is no block IO split, i.e. if a single block request crosses the boundary between devices, it will not be split in two, but completed with errors. This definitely must be fixed, even though the vast majority of requests are page-size only.
  • userspace support is quite bad - the configuration utility is ugly (it uses the same ugly char device with ioctl commands instead of netlink).
  • local export was not tested - there are only userspace remote nodes, which work on top of a regular file.
But nevertheless this is a big step forward.
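
The missing block IO split from the list above could be handled roughly like this. It is only a sketch of the arithmetic, not of real bio handling in the kernel: struct dst_range and dst_split() are hypothetical names, and the boundary is assumed to be the global sector where the next node begins.

```c
#include <assert.h>
#include <stdint.h>

/* A contiguous run of sectors; all names here are hypothetical. */
struct dst_range {
	uint64_t start;	/* first sector */
	uint64_t len;	/* number of sectors */
};

/*
 * Split request 'req' at 'boundary' (the global sector where the next
 * node begins). Writes the pieces into 'out' and returns how many there
 * are: 1 if the request lies entirely on one side, 2 if it crosses.
 */
int dst_split(struct dst_range req, uint64_t boundary,
	      struct dst_range out[2])
{
	uint64_t end = req.start + req.len;	/* one past the last sector */

	if (req.start >= boundary || end <= boundary) {
		out[0] = req;
		return 1;
	}

	/* The part up to the boundary stays on the first node... */
	out[0].start = req.start;
	out[0].len = boundary - req.start;
	/* ...and the rest goes to the node that starts at the boundary. */
	out[1].start = boundary;
	out[1].len = end - boundary;
	return 2;
}
```

A request for 8 sectors starting at sector 96 with a boundary at sector 100 becomes two requests of 4 sectors each, one per node, instead of failing.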

There are things to think about:
  • Synchronization. Currently each block request is handled in the order it arrived - i.e. no new requests will be sent to the remote device until the current one is completed. There are a number of pros and cons to this approach, and likely things should be changed to allow a per-page tag to show which request a page belongs to (or better, just put in the global offset of the request) - this is similar to how iSCSI works, where each request has its tag, so that it is possible to send several of them. This is likely a good idea, but needs further investigation.
    Actually, talking about iSCSI - I doubt it will work well (at least on small systems) if the main sending function works on top of a non-blocking socket with the following loop without any polling or sleeping:
    static void iscsi_xmitworker(struct work_struct *work)
    {
    	struct iscsi_conn *conn = container_of(work, struct iscsi_conn, xmitwork);
    	int rc;
    
    	/*
     	 * serialize Xmit worker on a per-connection basis.
    	 */
    	mutex_lock(&conn->xmitmutex);
    	do {
    		rc = iscsi_data_xmit(conn);
    	} while (rc >= 0 || rc == -EAGAIN);
    	mutex_unlock(&conn->xmitmutex);
    }
    And that is not to mention the atomic-only allocations in the iSCSI stack.
    I do not say distributed storage is a perfect solution, but I designed it with quite a few issues in mind instead of following a theoretical-only design (which is what several people proposed to do before actually starting to code and finding all the problematic places)...
  • Failure mode. Currently if one node is broken, the whole device stops working - this must be handled by a per-algorithm decision (i.e. a redundancy algo will just update the required nodes instead of failing). But the existing linear algorithm requires that property, so it is not that bad for now.
    Another issue with failure mode is recovery - if some node was replaced, some information has to be written into it during the recovery phase. Doing that via the main node (like all existing systems do - they sync via the main device) is not optimal in a distributed scenario; it would be better to tell some remote node to send the info to the new node directly instead of via the main node. But that depends on the algorithm being used, so it should be postponed until some interesting redundancy mechanism is developed (like WEAVER codes).
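
Coming back to the synchronization point: the per-request tag idea could look something like the sketch below. Nothing like this exists in the code yet, so all names (struct dst_queue, dst_submit(), dst_complete(), DST_MAX_INFLIGHT) are made up for illustration, and locking is left out entirely.

```c
#include <assert.h>
#include <stdint.h>

#define DST_MAX_INFLIGHT 4	/* hypothetical limit on outstanding requests */

struct dst_request {
	uint64_t offset;	/* global offset of the request */
	int busy;		/* slot currently holds an outstanding request */
};

struct dst_queue {
	struct dst_request slots[DST_MAX_INFLIGHT];
};

/*
 * Record a new outstanding request and return its tag (the slot index),
 * or -1 if every slot is taken and the caller has to wait.
 */
int dst_submit(struct dst_queue *q, uint64_t offset)
{
	int tag;

	for (tag = 0; tag < DST_MAX_INFLIGHT; tag++) {
		if (!q->slots[tag].busy) {
			q->slots[tag].offset = offset;
			q->slots[tag].busy = 1;
			return tag;
		}
	}
	return -1;
}

/*
 * Handle a reply carrying 'tag'. Replies may arrive in any order.
 * Returns 0 and stores the request offset, or -1 for an unknown tag.
 */
int dst_complete(struct dst_queue *q, int tag, uint64_t *offset)
{
	if (tag < 0 || tag >= DST_MAX_INFLIGHT || !q->slots[tag].busy)
		return -1;

	*offset = q->slots[tag].offset;
	q->slots[tag].busy = 0;
	return 0;
}
```

The tag travels with the request to the remote node and comes back with the reply, so several requests can be on the wire at once and a reply for the second one can be processed before the first one completes.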
But overall, the first distributed storage has been formed on top of several remote nodes (with trivial autoconfig - remote device sizes are requested by the distributed storage core and that data is used accordingly to form the device).

In short, this is quite small, but definitely a success.
