Sun, 15 Jul 2007
Distributed storage suspend mode or live data migration.
I've just thought of a feature I should add to this:
suspend mode, or live migration.
Let's say you want to change a remote node: either temporarily suspend
all IO for a given node (for example, to replace a local disk)
or completely replace one node with another (for example, to
switch to a different remote machine). During the data migration
or disk replacement, all block requests that would be completed on that node
are frozen until the node is ready, while requests to other nodes
continue without stopping.
Actually, I would be surprised if such functionality did not already exist in the block
layer hotplug code, but I do not even know how to test whether it is there:
the documentation sucks and there is no feature list (at least none I know of),
so I will reinvent the wheel (again).
There is something like queue plugging and unplugging, but as usual, the specs suck. I will
check LWN's kernel pages; I recall
Jonathan Corbet wrote about it. But even if it does exist, it cannot help
in distributed storage: the whole storage is a single device for the block layer
and thus has only one queue, and stopping all IO requests because of one node
is not politically correct, I think. Such a decision should be made by the algorithm, of course,
since redundancy might require several nodes to be updated for a single block IO,
or, even more, writing to some other node may require the algorithm to read data from the suspended node.
So the algorithm is the only place that knows which IO must be frozen.
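One way to freeze IO per node rather than per queue is to give each node its own deferred-request list that the algorithm consults when it routes a request. Below is a minimal single-threaded sketch, assuming hypothetical types (struct req, struct node) and helpers (node_submit, node_suspend, node_resume). It is not DST code. Real code would need locking and would send the request to the node instead of just marking it completed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request and node types for illustration; not the DST code. */
struct req {
	unsigned long long sector;
	int completed;
	struct req *next;          /* link in a node's deferred list */
};

struct node {
	int suspended;             /* set while a disk is swapped or data migrates */
	struct req *deferred_head; /* FIFO of requests frozen on this node */
	struct req *deferred_tail;
};

/* Stand-in for actually sending the request to the node. */
static void node_complete(struct node *n, struct req *r)
{
	(void)n;
	r->completed = 1;
}

/* Called by the algorithm once for each node a request must touch. */
static void node_submit(struct node *n, struct req *r)
{
	assert(n != NULL && r != NULL);
	if (n->suspended) {
		/* Freeze only this node's IO: append to its deferred FIFO. */
		r->next = NULL;
		if (n->deferred_tail)
			n->deferred_tail->next = r;
		else
			n->deferred_head = r;
		n->deferred_tail = r;
		return;
	}
	node_complete(n, r);
}

static void node_suspend(struct node *n)
{
	n->suspended = 1;
}

/* Resume the node and replay frozen requests in submission order. */
static void node_resume(struct node *n)
{
	struct req *r = n->deferred_head;

	n->suspended = 0;
	n->deferred_head = n->deferred_tail = NULL;
	while (r) {
		struct req *next = r->next;
		node_complete(n, r);
		r = next;
	}
}
```

Because the algorithm calls node_submit() per node, a redundant write touching one live and one suspended node completes on the first and waits on the second. Whether the whole block request should wait remains the algorithm's decision, as argued above.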
I'm wondering whether I should release an alpha version right now (modulo the testing I
need to perform for local mode) or implement some other tasty things and show distributed
storage only after that... Pros and cons?
/devel/dst
Distributed storage system.
I've added sysfs support, so the device tree looks like this
(a storage named 'storage' created with two remote nodes):
/sys/devices/storage/
/sys/devices/storage/alg : alg_linear
/sys/devices/storage/n-800/type : R: 192.168.4.80:1025
/sys/devices/storage/n-800/size : 800
/sys/devices/storage/n-800/start : 800
/sys/devices/storage/n-0/type : R: 192.168.4.81:1025
/sys/devices/storage/n-0/size : 800
/sys/devices/storage/n-0/start : 0
/sys/devices/storage/remove_all_nodes
/sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
/sys/devices/storage/name : storage
As you can see, there are two nodes in the linear algorithm:
the first one starts at sector 0 and is 800 sectors long,
the second one starts at sector 800 and is 800 sectors long too.
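The start/size attributes in the sysfs listing describe a linear layout: each node serves array sectors in [start, start + size). Below is a minimal sketch of the lookup, assuming nothing about DST's real structures; struct lin_node and linear_map are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the per-node sysfs attributes (start, size). */
struct lin_node {
	unsigned long long start; /* first array sector served by this node */
	unsigned long long size;  /* number of sectors on this node */
};

/*
 * Map an array sector to (node index, sector within that node).
 * Returns -1 if the sector lies past the end of the array.
 */
static int linear_map(const struct lin_node *nodes, size_t nr,
		      unsigned long long sector, unsigned long long *node_sector)
{
	size_t i;

	assert(node_sector != NULL);
	for (i = 0; i < nr; i++) {
		if (sector >= nodes[i].start &&
		    sector < nodes[i].start + nodes[i].size) {
			*node_sector = sector - nodes[i].start;
			return (int)i;
		}
	}
	return -1;
}
```

With the two nodes above, sector 799 maps to n-0 at offset 799 and sector 800 maps to n-800 at offset 0. A request crossing that boundary would have to be split between the nodes, which this sketch does not attempt.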
Implemented an initial failover mechanism: if there is a recoverable error
(i.e. anything other than -ENOMEM), the appropriate algorithm's callback
is invoked. Right now it does not perform any action,
but it could, for example, reconnect to the remote node and resend the block request.
To implement this I need to refactor the code a bit.
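Below is a sketch of the error routing just described. All names (struct alg_ops, struct dst_req, req_complete, retry_error) are hypothetical. Following the text, it treats every error except -ENOMEM as recoverable. The retry limit of 3 is an arbitrary illustrative choice.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct dst_req;

/* Hypothetical per-algorithm operations; names are illustrative only. */
struct alg_ops {
	/* Invoked on recoverable errors; returns 0 if the request was resent. */
	int (*error)(struct dst_req *req, int err);
};

struct dst_req {
	const struct alg_ops *ops;
	int retries;
	int done;
};

/* Completion path: route recoverable errors to the algorithm's callback. */
static int req_complete(struct dst_req *req, int err)
{
	assert(req != NULL);
	if (err == 0) {
		req->done = 1;
		return 0;
	}
	/* -ENOMEM is treated as unrecoverable: fail the request outright. */
	if (err == -ENOMEM || !req->ops || !req->ops->error)
		return err;
	return req->ops->error(req, err);
}

/* Example callback: pretend to reconnect and resend, at most 3 times. */
static int retry_error(struct dst_req *req, int err)
{
	if (req->retries >= 3)
		return err;    /* give up and propagate the error */
	req->retries++;
	/* A real implementation would reconnect and resend the request here. */
	return 0;
}
```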
Extended userspace support. To set up the above array, one just needs to run the following commands:
# ./dst -n storage -A alg_linear -f /dev/dst -a kano -p 1025
# ./dst -n storage -A alg_linear -f /dev/dst -a via -p 1025 -R
To remove an array:
# ./dst -n storage -A alg_linear -f /dev/dst -D
Here is a short help for the userspace options:
Usage: ./dst -n storage_name -A algorithm -b backlog -f device_path
-s start -S size -d local_disk -a addr -p port -r <remove>
-R <start array> -D <del array> -h <help>
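The option letters in the usage text map naturally onto POSIX getopt(3). Below is a sketch of such a parser. From the examples above, it assumes that -r, -R, -D and -h take no argument while the remaining options do. struct dst_cfg and dst_parse are hypothetical names, not the real tool's code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical config struct; fields follow the usage text. */
struct dst_cfg {
	const char *name, *alg, *dev, *local, *addr;
	int backlog, port;
	unsigned long long start, size;
	int remove, start_array, del_array, help;
};

/* Returns 0 on success, -1 on an unknown option or a missing argument. */
static int dst_parse(int argc, char *argv[], struct dst_cfg *cfg)
{
	int ch;

	assert(cfg != NULL);
	memset(cfg, 0, sizeof(*cfg));
	optind = 1;  /* POSIX way to restart scanning on a new argv */
	opterr = 0;  /* report errors through the return value instead */
	while ((ch = getopt(argc, argv, "n:A:b:f:s:S:d:a:p:rRDh")) != -1) {
		switch (ch) {
		case 'n': cfg->name = optarg; break;
		case 'A': cfg->alg = optarg; break;
		case 'b': cfg->backlog = atoi(optarg); break;
		case 'f': cfg->dev = optarg; break;
		case 's': cfg->start = strtoull(optarg, NULL, 0); break;
		case 'S': cfg->size = strtoull(optarg, NULL, 0); break;
		case 'd': cfg->local = optarg; break;
		case 'a': cfg->addr = optarg; break;
		case 'p': cfg->port = atoi(optarg); break;
		case 'r': cfg->remove = 1; break;
		case 'R': cfg->start_array = 1; break;
		case 'D': cfg->del_array = 1; break;
		case 'h': cfg->help = 1; break;
		default: return -1;  /* '?' from getopt */
		}
	}
	return 0;
}
```

For the second setup command above, this would yield name "storage", addr "via", port 1025 and start_array set.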
So, to be ready for the alpha release I need to test the local export and
local (local block device) targets. So far I have only tested the userspace remote peer,
which works on top of a regular file (which can be a device file, though).
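Below is a minimal sketch of a userspace peer serving block requests from a regular (or device) file. It uses 512-byte sectors, as the Linux block layer does. The request layout (struct peer_req, PEER_READ/PEER_WRITE) and the helpers peer_open and peer_handle are hypothetical illustrations. They are not DST's wire protocol.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define SECTOR_SHIFT 9  /* 512-byte sectors */

/* Hypothetical request; DST's real protocol may differ. */
enum peer_cmd { PEER_READ, PEER_WRITE };

struct peer_req {
	enum peer_cmd cmd;
	uint64_t sector;  /* first sector to access */
	uint32_t size;    /* length in bytes */
};

/* Open the backing store: a regular file or a block device node. */
static int peer_open(const char *path)
{
	return open(path, O_RDWR | O_CREAT, 0600);
}

/* Serve one request with positional IO; returns 0 or a negative errno. */
static int peer_handle(int fd, const struct peer_req *req, void *buf)
{
	off_t off = (off_t)(req->sector << SECTOR_SHIFT);
	size_t done = 0;

	assert(buf != NULL);
	while (done < req->size) {
		ssize_t n;

		if (req->cmd == PEER_WRITE)
			n = pwrite(fd, (char *)buf + done, req->size - done, off + (off_t)done);
		else
			n = pread(fd, (char *)buf + done, req->size - done, off + (off_t)done);
		if (n < 0) {
			if (errno == EINTR)
				continue;  /* interrupted: retry the same chunk */
			return -errno;
		}
		if (n == 0)
			return -EIO;  /* read past end of file, or no write progress */
		done += (size_t)n;
	}
	return 0;
}
```

pread() and pwrite() take an explicit offset, so concurrent requests would not race on a shared file position.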
Also watched three parts of "Lethal Weapon" to keep my brain from exploding
or flowing out of my ears; that was an excellent time.
Stay tuned.
/devel/dst
Interesting note about device mapper.
It never checks the value returned by its allocations.
For the hot path it uses either memory pools or biosets
(which are in turn memory pools too), and allocation
always happens in process context (where sleeping is allowed),
so neither of them can fail: when sleeping is allowed in the context,
the memory pool API internally loops forever, sleeping
until the requested element can be obtained.
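The guarantee comes from a preallocated reserve combined with sleeping. Below is a userspace model loosely following the kernel's mempool_alloc()/mempool_free(). It uses hypothetical names (struct mpool, mpool_init, mpool_alloc, mpool_free, failing_alloc), and pthreads stand in for kernel wait queues. It is a sketch of the technique, not the kernel implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* A reserve of min_nr preallocated elements backs up a normal allocator. */
struct mpool {
	pthread_mutex_t lock;
	pthread_cond_t wait;       /* signalled when an element is returned */
	void **reserve;
	int nr, min_nr;
	size_t elem_size;
	void *(*backend)(size_t);  /* normal allocator; may return NULL */
};

/* Simulates memory pressure: the backend allocator always fails. */
static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

/* Fill the reserve up front, while failing is still allowed. */
static int mpool_init(struct mpool *p, int min_nr, size_t elem_size,
		      void *(*backend)(size_t))
{
	p->reserve = malloc(sizeof(void *) * (size_t)min_nr);
	if (!p->reserve)
		return -1;
	p->min_nr = min_nr;
	p->elem_size = elem_size;
	p->backend = backend;
	for (p->nr = 0; p->nr < min_nr; p->nr++) {
		p->reserve[p->nr] = malloc(elem_size);
		if (!p->reserve[p->nr])
			return -1;  /* cleanup omitted for brevity */
	}
	pthread_mutex_init(&p->lock, NULL);
	pthread_cond_init(&p->wait, NULL);
	return 0;
}

/* Never returns NULL: it sleeps until an element becomes available. */
static void *mpool_alloc(struct mpool *p)
{
	for (;;) {
		void *e = p->backend(p->elem_size);
		if (e)
			return e;
		pthread_mutex_lock(&p->lock);
		if (p->nr > 0) {
			e = p->reserve[--p->nr];
			pthread_mutex_unlock(&p->lock);
			return e;
		}
		/* Reserve empty: sleep until an element is freed, then retry. */
		pthread_cond_wait(&p->wait, &p->lock);
		pthread_mutex_unlock(&p->lock);
	}
}

/* Refill the reserve first; only surplus elements go back to free(). */
static void mpool_free(struct mpool *p, void *e)
{
	assert(e != NULL);
	pthread_mutex_lock(&p->lock);
	if (p->nr < p->min_nr) {
		p->reserve[p->nr++] = e;
		pthread_cond_signal(&p->wait);
		pthread_mutex_unlock(&p->lock);
		return;
	}
	pthread_mutex_unlock(&p->lock);
	free(e);
}
```

The pattern is only safe when allocated elements are reliably returned to the pool, as with block IO that eventually completes. Otherwise mpool_alloc() can sleep indefinitely, which is why skipping the NULL check works in that setting.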
/devel/other