Zbr's days.

Thu, 28 Jun 2007

Device mapper sucks, or recent changes in distributed storage development.


And thus it will not be used.
It has a serious limitation on the table format: each line in the table must describe a contiguous region, and the constructor is not allowed to change the limits (start and size), which effectively breaks dynamic configuration of the storage.
So, no device mapper: I will use a raw block device and create a proper binary protocol to configure the device from userspace, instead of the string-based one used by device mapper.
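To make the difference from a string-based table line concrete, here is a minimal sketch of what a fixed-size binary configuration message could look like. The post does not define the protocol, so struct dst_ctl, the command codes and the field layout are all hypothetical; the point is only that the receiving side decodes fixed-width, big-endian fields instead of parsing text.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical command codes; the real protocol is not described. */
enum dst_cmd { DST_ADD_NODE = 1, DST_DEL_NODE = 2 };

/* One fixed-size configuration message: no text parsing needed. */
struct dst_ctl {
	uint32_t cmd;      /* enum dst_cmd */
	uint64_t start;    /* first sector covered by the node */
	uint64_t size;     /* number of sectors */
	char name[32];     /* node/device name, NUL-padded */
};

#define DST_CTL_WIRE_SIZE (4 + 8 + 8 + 32)

/* Store the low 'bytes' bytes of v in big-endian order. */
static void put_be(uint8_t *p, uint64_t v, int bytes)
{
	for (int i = bytes - 1; i >= 0; i--) {
		p[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
}

/* Read a big-endian integer of 'bytes' bytes. */
static uint64_t get_be(const uint8_t *p, int bytes)
{
	uint64_t v = 0;
	for (int i = 0; i < bytes; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Serialize with explicit widths and byte order, so both sides agree
 * regardless of struct padding or host endianness. */
void dst_ctl_encode(const struct dst_ctl *c, uint8_t out[DST_CTL_WIRE_SIZE])
{
	put_be(out, c->cmd, 4);
	put_be(out + 4, c->start, 8);
	put_be(out + 12, c->size, 8);
	memcpy(out + 20, c->name, sizeof(c->name));
}

/* Returns 0 on success, -1 if the command code is unknown. */
int dst_ctl_decode(const uint8_t in[DST_CTL_WIRE_SIZE], struct dst_ctl *c)
{
	c->cmd = (uint32_t)get_be(in, 4);
	c->start = get_be(in + 4, 8);
	c->size = get_be(in + 12, 8);
	memcpy(c->name, in + 20, sizeof(c->name));
	c->name[sizeof(c->name) - 1] = '\0';   /* never trust the terminator */
	if (c->cmd != DST_ADD_NODE && c->cmd != DST_DEL_NODE)
		return -1;
	return 0;
}
```

A malformed message is rejected by a simple range check on the command code, with no tokenizer involved.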

So far I have only added a couple of bits to dynamically add/remove storages and encoding algorithms; this can be revisited later if needed.
Each algorithm is an entity which sends/receives data to/from nodes (and reads/writes it via the usual block IO path if the node is local, which is supported too). It determines whether several nodes need to be read or written (for example, to recover a degraded array or to update a checksum) and performs all the required actions.
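A minimal sketch of how such pluggable algorithms could be added and removed at runtime, assuming a simple ops structure with a transfer callback kept in a linked list. All names (dst_alg_ops, dst_register_alg() and friends) are hypothetical, and a real in-kernel version would need a lock around the list.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical interface for one encoding algorithm: it decides which
 * nodes a request touches and performs the IO. */
struct dst_alg_ops {
	const char *name;
	/* Handle one request; returns 0 or a negative error. */
	int (*transfer)(void *priv, unsigned long long sector,
			unsigned int count, int write);
	struct dst_alg_ops *next;
};

static struct dst_alg_ops *dst_algs;   /* singly linked registry, no locking */

struct dst_alg_ops *dst_find_alg(const char *name)
{
	for (struct dst_alg_ops *a = dst_algs; a; a = a->next)
		if (strcmp(a->name, name) == 0)
			return a;
	return NULL;
}

/* Returns 0, or -1 if an algorithm with that name already exists. */
int dst_register_alg(struct dst_alg_ops *ops)
{
	if (dst_find_alg(ops->name))
		return -1;
	ops->next = dst_algs;
	dst_algs = ops;
	return 0;
}

/* Returns 0, or -1 if the algorithm was not registered. */
int dst_unregister_alg(struct dst_alg_ops *ops)
{
	for (struct dst_alg_ops **p = &dst_algs; *p; p = &(*p)->next) {
		if (*p == ops) {
			*p = ops->next;
			ops->next = NULL;
			return 0;
		}
	}
	return -1;
}
```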
The initial trivial algorithm will use a round-robin mechanism to write sequential data to several nodes.
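One way to read round-robin over sequential data is classic striping: cut the device into fixed-size chunks and send consecutive chunks to consecutive nodes. The sketch below assumes that interpretation and a fixed chunk_sectors parameter, neither of which the post specifies; dst_rr_map() is a hypothetical name.

```c
#include <stdint.h>

/* Where one sector of the virtual device lives. */
struct dst_loc {
	unsigned int node;    /* index of the storage node */
	uint64_t sector;      /* sector on that node */
};

/* Round-robin striping: consecutive chunks of chunk_sectors go to
 * consecutive nodes, wrapping around after the last node. */
struct dst_loc dst_rr_map(uint64_t sector, unsigned int nr_nodes,
			  unsigned int chunk_sectors)
{
	uint64_t chunk = sector / chunk_sectors;
	struct dst_loc loc;

	loc.node = (unsigned int)(chunk % nr_nodes);
	/* full stripes already placed on this node, plus offset in chunk */
	loc.sector = (chunk / nr_nodes) * chunk_sectors + sector % chunk_sectors;
	return loc;
}
```

For example, with 3 nodes and 8-sector chunks, sector 17 lands on node 2 at sector 1, and sector 24 wraps around to node 0 at sector 8.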

/devel/dst


Device mapper stack.


Let's see how block IO requests get remapped in the device mapper stack.
Here is the call chain for a usual IO operation:

  • suspend/resume ioctls of the appropriate block device -> __flush_deferred_io() -> *)
  • generic_make_request() for the appropriate block device -> queue's ->make_request_fn() (dm_request()) -> *)
*) -> __split_bio() -> __clone_and_map() -> __map_bio() -> target's map callback.
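Conceptually, the __split_bio() -> __clone_and_map() step walks the request, clips it at target boundaries and hands each piece to the owning target's map callback. Here is a userspace model of just that clipping logic; struct dst_target and dst_split() are hypothetical, and bio cloning, per-target IO size limits and reference counting are ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified table entry: a target covers
 * [start, start + len) of the mapped device. */
struct dst_target {
	uint64_t start;
	uint64_t len;
	/* map callback: gets the clipped piece as an offset inside the target */
	void (*map)(const struct dst_target *t, uint64_t off, uint64_t count);
};

/* Clip the request [sector, sector + count) at target boundaries and pass
 * each piece to its target. Targets must not overlap. Returns the number
 * of pieces, or -1 if part of the request is not covered by the table. */
int dst_split(const struct dst_target *tbl, int n, uint64_t sector, uint64_t count)
{
	int pieces = 0;

	while (count) {
		const struct dst_target *t = NULL;

		for (int i = 0; i < n; i++) {
			if (sector >= tbl[i].start && sector < tbl[i].start + tbl[i].len) {
				t = &tbl[i];
				break;
			}
		}
		if (!t)
			return -1;

		uint64_t avail = t->start + t->len - sector;   /* room left in target */
		uint64_t len = count < avail ? count : avail;

		if (t->map)
			t->map(t, sector - t->start, len);
		sector += len;
		count -= len;
		pieces++;
	}
	return pieces;
}
```

With targets covering sectors 0-99 and 100-149, a 20-sector request at sector 90 becomes two pieces of 10 sectors each.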

The first codepath is used to flush all deferred BIOs on demand and is not a common one.
The second one is the common path. It runs in process context (either via ->readpage() and do_generic_mapping_read() or via pdflush during writeback) and thus can sleep. There is no serialization lock anywhere in the path, but the block layer ensures that only one make_request_fn() callback runs at a time for a given queue: if make_request_fn() is already running, a new BIO is queued.
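The queueing behaviour described above can be modelled in a few lines: a flag marks that a callback is running, a nested submission is appended to a list instead of recursing, and the outermost caller drains that list. This is a simplified single-threaded sketch with hypothetical names (submit_ctx, submit(), demo_make_request()), not the block layer's actual code.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified request and submission context. */
struct req {
	int id;
	struct req *next;
};

struct submit_ctx {
	int active;                 /* a make_request-like callback is running */
	struct req *head, **tail;   /* requests submitted while active */
	void (*make_request)(struct submit_ctx *ctx, struct req *r);
};

void submit_init(struct submit_ctx *ctx,
		 void (*fn)(struct submit_ctx *, struct req *))
{
	ctx->active = 0;
	ctx->head = NULL;
	ctx->tail = &ctx->head;
	ctx->make_request = fn;
}

/* If the callback is already running, queue the request instead of
 * recursing; the outermost caller drains the queue, so only one callback
 * invocation is ever active. */
void submit(struct submit_ctx *ctx, struct req *r)
{
	r->next = NULL;
	if (ctx->active) {
		*ctx->tail = r;
		ctx->tail = &r->next;
		return;
	}

	ctx->active = 1;
	ctx->make_request(ctx, r);
	while (ctx->head) {
		struct req *q = ctx->head;

		ctx->head = q->next;
		if (!ctx->head)
			ctx->tail = &ctx->head;
		ctx->make_request(ctx, q);
	}
	ctx->active = 0;
}

/* Demo callback: records call order and nesting depth; request 1 submits
 * request 2 from inside the callback, as a stacked driver would when it
 * remaps a BIO to the underlying device. */
int call_order[8], ncalls, max_depth;
static int depth;
static struct req lower = { 2, NULL };

void demo_make_request(struct submit_ctx *ctx, struct req *r)
{
	depth++;
	if (depth > max_depth)
		max_depth = depth;
	call_order[ncalls++] = r->id;
	if (r->id == 1)
		submit(ctx, &lower);   /* queued, not run recursively */
	depth--;
}
```

Submitting request 1 calls demo_make_request() for 1 and then for 2, but never with both invocations on the stack at once (max_depth stays 1).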
This means the target's mapping function does not need any locking of its own, but it creates an issue with dispatching work to different remote storages.
Let's assume we have two remote devices, one of which is connected via a very slow link (I seriously doubt one can build distributed storage on top of slow links, but it is just an example: a node can be overloaded or down, and the effect will be the same). Sending/receiving data to/from that device will take a lot of time, and if the processing function sleeps, requests for the other device will wait behind it. This essentially means that the target's map callback should not be allowed to sleep, and thus it will not perform any processing itself.
Having AIO, or at least in-kernel event dispatching, would be great here, but I have closed kevent, which fits perfectly here, so I will implement a ->poll()-based dispatching/wakeup mechanism to process non-blocking requests in a special thread.
So the target's mapping callback will issue a non-blocking request without queueing (checking ->poll() first is a good idea); if it is not fully completed, the rest will be done by a dispatching state machine in a dedicated thread.
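Here is a userspace sketch of that split between the fast path and the dispatcher, using a non-blocking socket and poll(2) as a stand-in for the in-kernel ->poll() mechanism. dst_try_send() plays the role of the map callback: it pushes as much as fits without sleeping and remembers the offset. dst_dispatch() plays the dedicated thread: it sleeps in poll() until the socket is writable and resumes the request. The names are hypothetical, and the remote node is simulated by the other end of a socketpair drained in the same loop, so the example needs no second thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical request: a buffer that must be pushed to one remote node. */
struct dst_req {
	int fd;           /* non-blocking socket connected to the node */
	const char *buf;
	size_t len;
	size_t off;       /* bytes already sent */
};

/* Put a socket into non-blocking mode so send() never sleeps. */
int dst_set_nonblock(int fd)
{
	int flags = fcntl(fd, F_GETFL);
	return flags < 0 ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Fast path (the map callback's job): send what fits without sleeping.
 * Returns 1 when complete, 0 when the rest must go to the dispatcher,
 * -1 on error. */
int dst_try_send(struct dst_req *r)
{
	while (r->off < r->len) {
		ssize_t n = send(r->fd, r->buf + r->off, r->len - r->off, 0);

		if (n > 0)
			r->off += (size_t)n;
		else if (n < 0 && errno == EINTR)
			continue;
		else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
			return 0;
		else
			return -1;
	}
	return 1;
}

/* Slow path (the dedicated thread's job): sleep in poll() until the socket
 * is writable and resume where the fast path stopped. 'peer' stands in for
 * the remote node and is drained in the same loop. Returns the number of
 * bytes the peer received, or -1 on error or timeout. */
long dst_dispatch(struct dst_req *r, int peer)
{
	char sink[4096];
	long received = 0;
	int done = (r->off == r->len);

	while (!done || received < (long)r->len) {
		struct pollfd pfd[2];

		pfd[0].fd = r->fd;
		pfd[0].events = done ? 0 : POLLOUT;
		pfd[1].fd = peer;
		pfd[1].events = POLLIN;

		if (poll(pfd, 2, 1000) <= 0)
			return -1;
		if (pfd[1].revents & POLLIN) {
			ssize_t n = read(peer, sink, sizeof(sink));
			if (n <= 0)
				return -1;
			received += (long)n;
		}
		if (!done && (pfd[0].revents & POLLOUT)) {
			int ret = dst_try_send(r);
			if (ret < 0)
				return -1;
			done = ret;
		}
	}
	return received;
}
```

Typical use: create a socketpair(AF_UNIX, SOCK_STREAM, 0, sv), call dst_set_nonblock(sv[0]), try the fast path and, if it returns 0, hand the request to dst_dispatch(). A buffer larger than the socket buffer exercises the deferred path.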

/devel/dst


Job status.


It looks like companies do not want me to work with them part-time or on a contract basis, so things will stay exactly as they were before: free time after my pain^Wpaid work will be devoted to hacking.
Right now I unfortunately do not have much time, but I will definitely make some progress.

/devel/other