Zbr's days.


Sun, 15 Jul 2007

Distributed storage suspend mode or live data migration.

I've just thought of a feature I should add to this: a suspend mode, or live migration.
Say you want to change a remote node: either temporarily suspend all IO to a given node (for example, to replace a local disk), or replace one node with another completely (for example, to switch to a different remote machine). Until the disk is replaced or the data has been migrated from one node to the other, all block requests that would be completed on that node are frozen until the node is ready. Requests to the other nodes should continue without stalls.
Actually, I would be surprised if such functionality did not already exist in the block layer hotplug code, but I do not even know how to test whether it is there: documentation sucks and there is no feature list (at least none that I know of), so I will reinvent the wheel (again).
There is something like queue plug and unplug, but as usual the specs suck. I will check LWN's kernel coverage; I recall Jonathan Corbet wrote about it. But even if it does exist, it cannot help in distributed storage: the storage is a single device for the block layer and thus has only one queue, and stopping all IO requests because of one node is not politically correct, I think. The decision should of course be made by the algorithm, since redundancy might require several nodes to be updated for a single block IO, or, even more, writing to some other node may require the algorithm to read data from the suspended node first. So the algorithm is the only place that knows which IO must be frozen.
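
To make the idea concrete, here is a minimal userspace sketch of per-node freezing. It is a toy model, not kernel code and not the actual implementation: the `Node` class, its method names, and the string "requests" are all hypothetical. It only shows that requests for a suspended node are held and replayed in order, while requests to other nodes go through.

```python
from collections import deque


class Node:
    """Hypothetical model of one storage node with its own suspend flag."""

    def __init__(self, name):
        self.name = name
        self.suspended = False
        self.frozen = deque()   # requests held while the node is suspended
        self.completed = []     # requests that reached the node

    def submit(self, request):
        # Freeze the request instead of failing it while the node is suspended.
        if self.suspended:
            self.frozen.append(request)
        else:
            self.completed.append(request)

    def suspend(self):
        self.suspended = True

    def resume(self):
        # Replay frozen requests in arrival order once the node is ready again.
        self.suspended = False
        while self.frozen:
            self.completed.append(self.frozen.popleft())


def demo():
    a, b = Node("a"), Node("b")
    a.suspend()
    a.submit("write 0")
    b.submit("write 800")   # not affected by the suspended node
    a.submit("write 8")
    before = (list(a.completed), list(b.completed))
    a.resume()
    return before, list(a.completed)
```

In this model the freeze decision lives in the node object; in the design described above it would have to live in the algorithm, since only the algorithm knows which other nodes a given request depends on.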

I'm wondering whether I should release an alpha version right now (modulo the testing I need to perform for local mode) or implement some other tasty things and show distributed storage only after that... Pros and cons?

/devel/dst


Distributed storage system.

I've added sysfs support, so the device tree looks like this (a storage named 'storage', created with two remote nodes):

/sys/devices/storage/
/sys/devices/storage/alg : alg_linear
/sys/devices/storage/n-800/type : R: 192.168.4.80:1025
/sys/devices/storage/n-800/size : 800
/sys/devices/storage/n-800/start : 800
/sys/devices/storage/n-0/type : R: 192.168.4.81:1025
/sys/devices/storage/n-0/size : 800
/sys/devices/storage/n-0/start : 0
/sys/devices/storage/remove_all_nodes
/sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
/sys/devices/storage/name : storage
As you can see, there are two nodes in the linear algorithm: the first one starts at sector 0 and is 800 sectors long, the second one starts at sector 800 and is 800 sectors long too.
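
As an illustration of what a linear layout like this implies, the sketch below maps an absolute sector to a node and an offset inside it. The node table is copied from the sysfs listing; the function name `map_sector` and the assumption that the offset inside a node is simply `sector - start` are mine, not taken from the kernel code.

```python
import bisect

# (start, size, label) taken from the sysfs listing above, in sectors.
NODES = [
    (0, 800, "192.168.4.81:1025"),
    (800, 800, "192.168.4.80:1025"),
]


def map_sector(nodes, sector):
    """Return (label, offset within node) for an absolute sector number."""
    starts = [start for start, _size, _label in nodes]  # must be sorted
    # Find the last node whose start is <= sector.
    idx = bisect.bisect_right(starts, sector) - 1
    if idx < 0:
        raise ValueError("sector %d is before the first node" % sector)
    start, size, label = nodes[idx]
    if sector >= start + size:
        raise ValueError("sector %d is beyond the mapped range" % sector)
    return label, sector - start
```

For example, `map_sector(NODES, 800)` lands on the first sector of the node at 192.168.4.80, and sector 1600 is rejected because the two nodes together cover only sectors 0 to 1599.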

Implemented an initial failover mechanism: if there is a recoverable error (i.e. not -ENOMEM), the appropriate algorithm callback is invoked. Right now it does not perform any action, but it could, for example, reconnect to the remote node and resend the block request. To implement this I need to refactor the code a bit.
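
A rough userspace model of that control flow might look like the following. Everything here is hypothetical: the function names, the retry limit, and the rule that every error other than -ENOMEM counts as recoverable (the only thing stated above is that -ENOMEM does not).

```python
import errno


def is_recoverable(err):
    # Only -ENOMEM is named as non-recoverable; treating every other
    # negative errno as recoverable is an assumption of this sketch.
    return err < 0 and err != -errno.ENOMEM


def submit_with_failover(send, on_error, request, max_retries=3):
    """Call send(request); on a recoverable error ask on_error to fix it up.

    send returns 0 on success or a negative errno value.
    on_error(err, request) returns True if the request should be resent.
    """
    err = send(request)
    retries = 0
    while err != 0 and is_recoverable(err) and retries < max_retries:
        if not on_error(err, request):
            break
        retries += 1
        err = send(request)
    return err


def demo():
    # A fake transport that drops the connection once, then succeeds.
    results = [-errno.ECONNRESET, 0]
    log = []

    def send(request):
        return results.pop(0)

    def reconnect_and_retry(err, request):
        log.append(("reconnect", err))
        return True

    return submit_with_failover(send, reconnect_and_retry, "read 0"), log
```

The callback is passed in by the caller, which mirrors the idea that each algorithm supplies its own recovery policy.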

Extended userspace support. To set up the above array one just needs to run the following commands:
# ./dst -n storage -A alg_linear -f /dev/dst -a kano -p 1025
# ./dst -n storage -A alg_linear -f /dev/dst -a via -p 1025 -R
To remove an array:
# ./dst -n storage -A alg_linear -f /dev/dst -D
Here is the short help for the userspace options:
Usage: ./dst -n storage_name -A algorithm -b backlog -f device_path 
	-s start -S size -d local_disk -a addr -p port -r <remove> 
	-R <start array> -D <del array> -h <help>
So, to be ready for the alpha release I need to test local export and local targets (local block devices). So far I have only tested the userspace remote peer, which works on top of a regular file (it can be a device file, though).

Also watched three parts of the "Lethal Weapon" films to keep my brain from exploding or flowing out of my ears. That was an excellent time.
Stay tuned.

/devel/dst


Interesting note about device mapper.

It never checks the return value of its allocations.

On the hot path it uses either memory pools or biosets (which are in turn built on memory pools), and allocation always happens in process context (where sleeping is allowed), so neither of them can fail: the memory pool API internally loops forever (if sleeping is allowed in the context) until the requested element can be obtained.
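
The behaviour can be modelled in userspace roughly as follows. This is a toy, not the kernel mempool API: the class `MemPool`, its methods, and the use of a condition variable with a timeout are all assumptions made for illustration. It shows the three steps: try the normal allocator, fall back to a preallocated reserve, and otherwise sleep until someone frees an element.

```python
import threading


class MemPool:
    """Toy userspace model of a reserve-backed pool (not the kernel API)."""

    def __init__(self, min_nr, alloc_fn):
        self.alloc_fn = alloc_fn                # may return None on failure
        self.reserve = [object() for _ in range(min_nr)]
        self.cond = threading.Condition()

    def alloc(self):
        while True:
            element = self.alloc_fn()           # 1. try the normal allocator
            if element is not None:
                return element
            with self.cond:
                if self.reserve:                # 2. fall back to the reserve
                    return self.reserve.pop()
                # 3. sleep until free() refills the reserve, then retry
                self.cond.wait(timeout=0.1)

    def free(self, element):
        # Simplification: always put the element back into the reserve.
        with self.cond:
            self.reserve.append(element)
            self.cond.notify()


def demo():
    pool = MemPool(min_nr=1, alloc_fn=lambda: None)  # allocator always fails
    first = pool.alloc()                             # served from the reserve
    # Return the element a little later, as a completing IO would.
    timer = threading.Timer(0.05, pool.free, args=(first,))
    timer.start()
    second = pool.alloc()                            # sleeps, never gets None
    timer.join()
    return first, second
```

Note that `alloc()` has no failure path at all, which is why a caller that is allowed to sleep has nothing to check. The guarantee only holds as long as elements are eventually freed back to the pool.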

/devel/other