Zbr's days.


Mon, 13 Oct 2008

Massive documentation update for the distributed storage. New release.

Andrew Morton expressed (somewhat angrily imho :) the lack of documentation for DST as a review-stopper, so I cleaned up some simple stuff he reported (like style changes, kcalloc() instead of kzalloc(), config dependency and other such things) and wrote about 500 lines of code documentation. Not that much, but it is a bit more than 10% of the whole DST project:

$ git commit -a -m "Documentation update."
Created commit 4886f36: Documentation update.
 7 files changed, 476 insertions(+), 18 deletions(-)

$ git-diff-tree -r --stat origin master
warning: refname 'origin' is ambiguous.
 drivers/block/Kconfig           |    2 +
 drivers/block/Makefile          |    2 +
 drivers/block/dst/Kconfig       |   14 +
 drivers/block/dst/Makefile      |    3 +
 drivers/block/dst/crypto.c      |  731 +++++++++++++++++++++++++++++
 drivers/block/dst/dcore.c       |  963 +++++++++++++++++++++++++++++++++++++++
 drivers/block/dst/export.c      |  662 ++++++++++++++++++++++++++
 drivers/block/dst/state.c       |  838 ++++++++++++++++++++++++++++++++++
 drivers/block/dst/thread_pool.c |  345 ++++++++++++++
 drivers/block/dst/trans.c       |  335 ++++++++++++++
 include/linux/connector.h       |    4 +-
 include/linux/dst.h             |  572 +++++++++++++++++++++++
12 files changed, 4470 insertions(+), 1 deletions(-)
As usual one can grab new release from the archive or via GIT tree.

/devel/dst :: Link / Comments ()


Mon, 06 Oct 2008

New distributed storage release.

New DST release contains following changes:

  • Keepalive messages to early detect failed nodes, which are sent if there is no traffic between the nodes.
  • Listening socket reuses address now, which speeds up stop/start sequence.
  • Fixed bug with wrong debug option, which could read uninitialized memory.
  • Change module name from dst.ko to nst.ko, since the former is used by dvb card.
  • Whitespace cleanup.
As usual patch is available from archive or via GIT tree.
Enjoy!

Asked for inclusion again. Let's make bets on number of comments for the patch :)



Wed, 24 Sep 2008

New DST release.

This is a maintenance release, which contains following changes:

  • Use idr to manage minor numbers. Now create/remove/create sequence does not produce new minor, but uses previous one, which is now freed.
  • Added cache name to the node. It is possible to have freed node still being alive while we register new node with the same name, so its cache name should be different.
  • Wait during node removal until there are no pending transactions, so the node is freed in process context and not in the receiving thread itself.
  • Warn user if there is no security permission config file during export node initialization. No client will be allowed to connect without explicit security association.
  • Tune default size of the page pool for crypto processing a bit.
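The minor number reuse from the first item can be pictured with a small userspace sketch. It is not the kernel idr API; it only assumes that the allocator always hands out the lowest free number, which is the behaviour described above. The names sk_minor_map, sk_minor_alloc() and sk_minor_free() and the limit of 64 minors are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SK_MAX_MINORS 64 /* hypothetical limit for this sketch */

/* One bit per minor number; a set bit means "in use". */
struct sk_minor_map {
	uint64_t used;
};

/* Return the lowest free minor and mark it used, or -1 if none is left. */
int sk_minor_alloc(struct sk_minor_map *m)
{
	for (int i = 0; i < SK_MAX_MINORS; i++) {
		if (!(m->used & (UINT64_C(1) << i))) {
			m->used |= UINT64_C(1) << i;
			return i;
		}
	}
	return -1;
}

/* Release a minor so that the next allocation can hand it out again. */
void sk_minor_free(struct sk_minor_map *m, int minor)
{
	assert(minor >= 0 && minor < SK_MAX_MINORS);
	m->used &= ~(UINT64_C(1) << minor);
}
```

With such a map a create/remove/create sequence gets the same minor back, because the freed number is again the lowest free one.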
I want to thank Remy Ritchen (remy.ritchen_gmail.com) for his excellent tests and analysis.

As usual, DST is available from archive and via git tree.



Sat, 13 Sep 2008

New distributed storage release.

This is maintenance only release of the DST, which brings us following changes:

  • Fixed memory leak in crypto thread initialization error path. Noticed by Sven Wegener (sven.wegener_stealer.net).
  • Unprotected tree access (exceptionally stupid bug, I was made blind by the electronic equipment), and tricky bug_on catch in scsi code caused by incorrect bio flag initialization in the exporting node. 64bit alignment fix. Bugs reported by Rémy Ritchen(
  • Couple of bogus compilation warnings about uninitialized variables caught by a different compiler.
  • Allow both read and write permission, not only read or write, in security config.
Patch can be found in git tree or archive.

The most tricky bug is scsi's BUG_ON(), whose trace did not even contain any DST related calls.
It was caught at drivers/scsi/scsi_lib.c:1175:
  kernel BUG at drivers/scsi/scsi_lib.c:1175!
  RIP: 0010:[]  [] scsi_setup_fs_cmnd+0x64/0x70
  ...
  [] ? sd_prep_fn+0xa8/0x9b0
  [] ? __cfq_slice_expired+0x59/0xb0
  [] ? cfq_dispatch_requests+0x8d/0x330
  [] ? elv_next_request+0x119/0x250
  [] ? scsi_request_fn+0x6b/0x3c0
  [] ? generic_unplug_device+0x24/0x30
  [] ? blk_unplug_work+0x41/0x80
Which is the following code:
int scsi_setup_fs_cmnd(struct scsi_device *sdev, struct request *req)
{
	struct scsi_cmnd *cmd;
	int ret = scsi_prep_state_check(sdev, req);

	if (ret != BLKPREP_OK)
		return ret;
	/*
	 * Filesystem requests must transfer data.
	 */
	BUG_ON(!req->nr_phys_segments);
Which means that the request structure did not contain any segment to process. Originally I thought it was because of some tricky elevator steps which selected the wrong request queue, because all debug showed that a sync bio (block IO request with BIO_RW_SYNC bit set) is handled differently compared to the same request without this flag. But experiments with various flags showed that the bug occurs no matter what, just in a completely unpredictable place.

Fortunately I managed to catch it in a debug trap in the block IO merging path, which showed me that block IO requests with very strange read/write and flags fields were the cause of this error. Looking more closely at the block queue allocation path, I found that its default initialization is not correct and that my setup happens before it, so the queue did not contain the right parameters for the maximum request sizes (hw and phys sectors). This also showed that one block IO request in the export node had clone and other local-only fields set, which is very wrong for a bio about to be submitted, and which actually resulted in the seen bug. Those fields were set by the client bio and should not be transferred to the remote one, so I limited the flag fields to only show that the bio is uptodate and has the blockable IO bit.
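The flag fix can be sketched like this. The bit names and values below are invented for the illustration and are not the kernel's bio flags; the only point is that the export node keeps a fixed whitelist of bits and drops everything the client set for its own local use.

```c
#include <stdint.h>

/* Hypothetical flag bits for this sketch; the real kernel bio flags differ. */
enum {
	SK_BIO_UPTODATE = 1u << 0, /* data in the bio is valid */
	SK_BIO_RW_BLOCK = 1u << 1, /* IO is allowed to block */
	SK_BIO_CLONED   = 1u << 2, /* local-only: bio is a clone */
	SK_BIO_BOUNCED  = 1u << 3, /* local-only: bio uses bounce pages */
};

/* The only bits the export node accepts from the wire. */
#define SK_EXPORT_FLAG_MASK (SK_BIO_UPTODATE | SK_BIO_RW_BLOCK)

/* Drop every client-side flag that has no meaning on the remote node. */
uint32_t sk_export_sanitize_flags(uint32_t client_flags)
{
	return client_flags & SK_EXPORT_FLAG_MASK;
}
```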

That's the story about how things were hacked this day (it's the middle of the night actually, while I'm waiting for the taxi to the airport), so the POHMELFS locking algorithm was not implemented today and is likely postponed to the next weekend when I return. I got a group theory book and made some prints about number theory (after I finished reading Vinogradov's book), so I will have something to read in all four planes (two in each direction) if I do not fall asleep, and likely I will not have much time in Portland: we will need to talk/listen to other people and check local pubs (people suggested some cocktail places, but I prefer beer).

See you in Portland.



Tue, 09 Sep 2008

New distributed storage release: "There is no spoon, black and white".

This is a very minor DST update, which contains following changes:

  • sector_t compilation warnings removed.
  • Debug, init, alloc, whatever cleanups noted by Sven Wegener (sven.wegener_stealer.net).
  • Some checkpatch.pl masturbation
  • New name: "Linux benevolent dictator said: there is no spoon, black and white"
Actually I fixed only a small amount of the crap returned by checkpatch.pl, particularly I did not fix cases of long lines, when it is actually a comment added after some variable, or things like
for (i=0; i<n; ++i) and
struct some_name
{
...
}
when checkpatch.pl wants
for (i = 0; i < n; ++i) and
struct some_name {
...
}
But tried to remove more than 80-characters code strings, trailing spaces and couple of other warnings.

Now I will concentrate on POHMELFS locking and then distributed facilities. Stay tuned, new version will be extremely cool in this regard!



Mon, 08 Sep 2008

New distributed storage release.

It brings us following changes:

  • Permission checks in export node. Read-only connections.
  • Remove DST node from the global table not only when it is freed, but also on demand with node del command.
I think project is completed. I added inclusion request (with grammar error of course, how else) into announcement mail.

Check it out!



Wed, 27 Aug 2008

Completely new Distributed STorage (DST) release.

DST is a block layer network device, which among others has following features:

  • Kernel-side client and server. No need for any special tools for data processing (like special userspace applications) except for configuration.
  • Bullet-proof memory allocations via memory pools for all temporary objects (transaction and so on).
  • Zero-copy sending (except header) if supported by device using sendpage().
  • Failover recovery in case of broken link (reconnection if remote node is down).
  • Full transaction support (resending of the failed transactions on timeout or after reconnect to failed node).
  • Dynamically resizeable pool of threads used for data receiving and crypto processing.
  • Initial autoconfiguration. Ability to extend it with additional attributes if needed.
  • Support for any kind of network media (not limited to tcp or inet protocols) above the MAC layer (socket layer). Out of the box kernel-side IPv6 support (needs to extend configuration utility, check how it was done in POHMELFS).
  • Security attributes for local export nodes (list of allowed to connect addresses with permissions). Not used currently though.
  • Ability to use any supported cryptographically strong checksums. Ability to encrypt data channel.
Distributed storage was completely rewritten from scratch recently. I dropped the features that essentially mirrored the device mapper in favour of more robust block io processing and an effective protocol.

One can grab sources (various configuration examples can be found in 'userspace' dir) from archive, or via kernel and userspace GIT trees.



Sat, 23 Aug 2008

Distributed storage.

Here we go, DST got all problems with reference counters fixed. There is a somewhat new observation I made for myself: a block device has to provide open and release callbacks in its block device operations structure, which have to increase and decrease the appropriate reference counters of the underlying object. Otherwise it is possible to remove the device with proper del_gendisk(), blk_cleanup_queue() and put_disk() calls while some references still exist in the mapping (like in the block device info structure), so a subsequent sync will crash the machine. Also tested lots of reconnection stuff, transaction resending, timeouts and so on.
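A minimal userspace model of that observation, with made-up names (sk_node, sk_open(), sk_release(), sk_node_del()) and a plain int where the kernel code would need an atomic counter: every open takes a reference, every release drops one, and removing the node only drops the reference held by the registry.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the object behind the block device. */
struct sk_node {
	int  refcnt; /* references held by openers and by the registry */
	bool freed;  /* set once the last reference is dropped */
};

void sk_node_get(struct sk_node *n)
{
	n->refcnt++;
}

void sk_node_put(struct sk_node *n)
{
	if (--n->refcnt == 0)
		n->freed = true; /* the real code would free the object here */
}

/* open callback: pin the node for as long as the device stays open. */
int sk_open(struct sk_node *n)
{
	sk_node_get(n);
	return 0;
}

/* release callback: drop the reference taken in sk_open(). */
void sk_release(struct sk_node *n)
{
	sk_node_put(n);
}

/* Node removal drops only the reference owned by the registry. */
void sk_node_del(struct sk_node *n)
{
	sk_node_put(n);
}
```

If the device is still open when the node is deleted, the object survives until the last release, so a later sync still finds valid memory.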

Actually I would make a new release, but decided to test crypto stuff first. It was copied from POHMELFS and should work out of the box, but this requires an additional check of course.
Since tomorrow I will have an almost minute free fall from the several kilometers high if weather permits, checks, bug fixes and release are postponed for the start of the week. Obviously if there will be no 'issues' with landing...

Stay tuned!



Tue, 19 Aug 2008

Completed DST protocol implementation.

I did not yet test crypto processing (and there is no crypto autonegotiation yet, I will extend automatic configuration protocol to allow nested attributes, so it could be extended in the future if there will be any need for new parameters to be synced between client and server). Also server side does not check security attributes during connection (like read/write per-address permissions).

Because of excessive logging there was no possibility to check performance issues, and it in turn resulted in too frequent stale transaction timer fires, excessive resending and so on, so I introduced a maximum amount of work to be done in each scan. With this change I was able to successfully create an ext3 filesystem on 8Gb storage connected over a 3 MB/s link to the remote node.
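The per-scan work limit can be sketched as follows. The structure, the function name and the re-arm interval of 10 time units are assumptions made for the example, not DST code; the scan simply stops after max_work resends and leaves the rest for the next timer fire.

```c
#include <stddef.h>

/* Hypothetical transaction record for this sketch. */
struct sk_trans {
	long deadline; /* time after which the transaction is stale */
	int  resent;   /* how many times it has been resent */
};

/*
 * Resend stale transactions, but at most max_work of them per scan.
 * Returns the number of transactions resent during this scan.
 */
size_t sk_scan_stale(struct sk_trans *t, size_t n, long now, size_t max_work)
{
	size_t done = 0;

	for (size_t i = 0; i < n && done < max_work; i++) {
		if (t[i].deadline <= now) {
			t[i].resent++;
			t[i].deadline = now + 10; /* assumed re-arm interval */
			done++;
		}
	}
	return done;
}
```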

There is also an issue with broken connection: system tries to reconnect to the server and does not allow to unload module if there are pending block requests, each transaction has maximum number of retries, so system waits until each one reaches zero, which may take too long. This may or may not be a good idea actually, but I think I will implement transaction flushing during module unloading. Server node has the same issues if there are pending blocks requested by client and yet not sent.

And although it sounds like a lot of work, it actually is not. I just need not to digress, which is the most complex part :)



Sun, 17 Aug 2008

DST. BIO. Barriers. Alignment.

I actually was wrong when I talked about the problems distributed storage may have with non-page-aligned vectors inside a single block request. Actually neither client nor server should know how the other peer works with a given block request. The only thing which should be transferred between the peers is the start of the request and its size. And of course flags and operation mode (i.e. automatic sync/barrier support and read/write operation). That's it. The server will allocate as many pages for the request as needed for its own page size, and the client will process them just like a contiguous flow of bytes coming out of the network pipe, which are then placed into the pages the client was asked to read or write. Simple.

BIO can not have holes inside it, but it can have multiple pages to be partially filled. And this information should not be shared between the nodes at all.
Which basically means that I do not need to even think about how to handle these problems and can just complete the protocol between the peers. Stay tuned!
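As a rough picture of what goes over the wire under this scheme: the command carries only start, size and flags, and each side derives the number of pages from its own page size. The structure layout and field widths below are an assumption for illustration, not the real DST command format.

```c
#include <stdint.h>

/* Hypothetical wire command: nothing about pages or vectors is shared. */
struct sk_cmd {
	uint64_t start; /* start of the request on the storage */
	uint32_t size;  /* total size of the request in bytes */
	uint32_t flags; /* read/write, sync/barrier and so on */
};

/* Each peer computes how many of its own pages the payload needs. */
uint32_t sk_pages_needed(const struct sk_cmd *cmd, uint32_t page_size)
{
	/* 64-bit arithmetic so that rounding up cannot overflow. */
	return (uint32_t)(((uint64_t)cmd->size + page_size - 1) / page_size);
}
```

A 10000 byte request needs three pages on a node with 4096 byte pages and a single page on a node with 65536 byte pages, and neither peer has to know which case the other one is in.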



Thu, 14 Aug 2008

Distributed storage debugging.

DST testing revealed number of bugs, which could be easily fixed if I would not need to debug essentially two separated subsystems of the DST: client and server. Both share lots of code, but it is quite problematic to find who broke the protocol, when one of them starts complaining.

So far peers can connect and start initial data exchange, but there is a major problem if the page size differs on the nodes. A block IO request (bio structure) operates on pages (stored in the bio_vec structure), and has a size and offset attached to each page. If the page size differs and the server node has the smaller page, then it should somehow store information about how to split its own set of pages allocated for the given bio size into chunks expected by the client (since we need to transfer a size/offset pair for each page). There is no such mechanism right now. It is possible to implement a naive approach, where the server node allocates bio pages with sizes requested by the client, but this will break after a short time, since the only guaranteed kernel allocation in Linux VM is a single page.
Another approach is to allocate the whole block request (bio) on the server for each page of client data (bio_vec structure), but this will have too big overhead on sequential access and in common case, when page size is equal on both sides of the network channel.

Network block device does not have this problem, since its server lives in userspace and can allocate arbitrary amount of ram, which will be contiguous in virtual memory. Using virtual memory is very slow, although it is possible to just allocate needed buffers using vmalloc(). iSCSI uses single command per block request.

So far I plan to implement the following scheme for the read command (which is the only one which has the described problem): the client will iterate over all blocks in each bio it is about to send, and will send as many commands as there are non-contiguous blocks in the given block request. The server will receive those blocks as separate subcommands, and will allocate a new bio for each such request. The client will need to increment the transaction reference counter to the number of such commands, since the server can reply to them in arbitrary order.
In the common case this actually should not happen, and I did not see it in practice either, since most reading bios come either from readahead (where they are contiguous) or single block requests (which if bigger than page size will also be contiguous), but nevertheless in theory such bios, where there is number of non-contiguous blocks, can exist and DST should be ready for them.
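One way to read "number of non-contiguous blocks" is sketched below. It assumes that a vector continues the previous one only when the previous vector ends exactly at a page boundary and the new one starts at offset 0; that definition, the sk_vec structure and sk_count_runs() are assumptions for the example, not the DST implementation. The returned count would be both the number of subcommands to send and the number of references to take on the transaction.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-page descriptor, like a bio_vec without the page pointer. */
struct sk_vec {
	uint32_t len;    /* bytes used in this page */
	uint32_t offset; /* offset of the data inside the page */
};

/* Count runs of vectors that can travel as one contiguous subcommand. */
size_t sk_count_runs(const struct sk_vec *v, size_t n, uint32_t page_size)
{
	size_t runs = 0;

	for (size_t i = 0; i < n; i++) {
		/* Continues the previous run only across a full page edge. */
		int continues = i > 0 &&
			v[i - 1].offset + v[i - 1].len == page_size &&
			v[i].offset == 0;

		if (!continues)
			runs++;
	}
	return runs;
}
```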



Tue, 12 Aug 2008

Distributed storage testing revealed first bugs. Also some Japanese notes.

So, we need to celebrate its birth.

Suntory Old Whiskey

Japan can produce a very tasty whiskey! I recommend the 'Suntory Old Whisky' label, although I got the last bottle in my favourite shop, so I would not be surprised if it is not that popular a drink.



Wed, 06 Aug 2008

Distributed storage, POHMELFS, netchannels development.

While a lot of action around filesystems arose recently, I took a short break there and concentrated on the lower block layer: DST.
Distributed storage essentially got export capabilities, i.e. data receiving, crypto processing, block layer request allocation and submitting, reply generation and so on, although it is more like a proof-of-concept right now, since it requires lots and lots of testing. There are also plans for some additional features, but it is not that much work. So project completion is very close.

POHMELFS priorities have been switched a bit. After number of talks with people I decided first to implement the right locking semantics (probably will be turned on/off by mount option), which would allow simultaneous read/write to be performed the way people expect from local filesystem. Currently it uses a bit tricky cache coherency protocol, which in some cases can end up with different results than expected from local filesystem.
Next will be distributed server-side hash table development.

Netchannels will also get a new release very soon. It will be simplified and some unneeded functionality (like netchannels NAT) will be removed.
I will also run some new tests with userspace network stack, namely latency measurements.



Wed, 30 Jul 2008

Distributed storage development progress report.

DST got full transaction support (resending, timeout completion, error recovery, memory pool allocation for all kinds of transactions, single transaction allocation per IO request), socket processing (initialization of the connected and listened sockets, failover recovery of the connection, receiving thread, network helpers), crypto processing of the requests (thread pool utilization for crypto operations, cipher/hash initialization, cached pages for sending crypto processing).
Thinking of moving receiving and listen/accepted sockets processing to the thread pool too, likely it is a way to go, right now they have own threads.

Missing bits include the actual data sending/receiving and client accepting by the listening socket (and appropriate initialization of all the needed infrastructure). This is quite a major part, but likely it will be completed sooner rather than later.



Mon, 28 Jul 2008

Distributed storage development progress. Thread pools.

Today I implemented a simple thread pool subsystem, which allows to create a set of threads, to add/remove them from this set at run-time, and to schedule work to be done by them. Work is specified as two functions: setup(), which is called when the system has selected a thread for execution, so the caller can set up the needed data, and action(), which is called by the thread itself and has access to the data provided at initialization time.
Work scheduling has a timeout parameter, which corresponds to time system will wait for free thread, otherwise error is returned.
System is generic enough not to contain any notion about DST or crypto, only two new data types: struct thread_pool and struct thread_pool_worker, only the former is visible to the user.
API looks like this:

void thread_pool_del_worker(struct thread_pool *p);
struct thread_pool_worker *thread_pool_add_worker(struct thread_pool *p,
	char *name,
	int (* init)(void *private),
	void (* cleanup)(void *private),
	void *private);

void thread_pool_destroy(struct thread_pool *p);
struct thread_pool *thread_pool_create(int num, char *name,
	int (* init)(void *private),
	void (* cleanup)(void *private),
	void *private);

int thread_pool_schedule(struct thread_pool *p,
	int (* setup)(void *private, void *data),
	int (* action)(void *private),
	void *data, long timeout);
init() and cleanup() callbacks above are used after new thread is created, so that user could initialize per-thread data, for example it is used to allocate some cached pages and initialize crypto algorithms.
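The setup()/action() contract can be modelled in userspace without any real threads. The sketch below is only such a model: it is synchronous, has no timeout, and the names sk_pool, sk_worker and sk_pool_schedule() are invented; the real thread_pool_schedule() wakes a kernel thread which then runs action() itself.

```c
#include <stddef.h>

/* Hypothetical worker: a busy flag plus the per-thread private data. */
struct sk_worker {
	int   busy;
	void *private;
};

struct sk_pool {
	struct sk_worker *w;
	size_t            num;
};

/*
 * Pick a free worker, let the caller attach its data via setup(), then
 * run action() on the worker's private data. Here both run inline.
 */
int sk_pool_schedule(struct sk_pool *p,
	int (*setup)(void *private, void *data),
	int (*action)(void *private),
	void *data)
{
	for (size_t i = 0; i < p->num; i++) {
		struct sk_worker *w = &p->w[i];
		int err;

		if (w->busy)
			continue;
		w->busy = 1;
		err = setup(w->private, data);
		if (!err)
			err = action(w->private);
		w->busy = 0;
		return err;
	}
	return -1; /* no free worker; the real code waits up to the timeout */
}

/* Example callbacks: the per-thread context points at the caller's data. */
struct sk_ctx {
	int *value;
};

int sk_setup(void *private, void *data)
{
	((struct sk_ctx *)private)->value = data;
	return 0;
}

int sk_action(void *private)
{
	*((struct sk_ctx *)private)->value *= 2;
	return 0;
}
```

The split into two callbacks keeps the caller's part (copying what the worker needs) separate from the part that runs in the worker's context and may take a long time, such as crypto processing.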

This thread pool system is used by the crypto processing code in the distributed subsystem: when block io request is about to be sent, or when system has received reply for the read request, it schedules crypto processing work to the pool, initialized at DST node setup time.

Crypto processing does not yet work in DST, as well as some other bits. So far I only played a bit with its initialization sequence, so it was split into network, crypto and security initializations and node start, which registers the new storage in the block layer subsystem. These steps allow to introduce additional initialization steps later if needed without breaking backward compatibility.

Next steps include proper network initialization and processing and transaction management helpers. Then I will combine all existing code and make a first renewed release.
Stay tuned!



Sat, 19 Jul 2008

Distributed storage is dead, long live the Distributed storage!

As you may know, the DST project was an attempt to implement a redundant, failover resistant, flexible block level storage subsystem. Among other features it supported the ability to map multiple remote nodes via linear or mirroring algorithms to a single node, reconnect to a failed node, read balancing and parallel writing to multiple nodes (in case of mirroring) and so on.

Now it has gone. There is no more distributed storage you knew before, instead there is completely new project being developed, which main goal is to provide a transport layer for the block requests only. Consider it as Network Block Device on huge steroids. Consider it as iSCSI on huge steroids. Consider it as ATA-over-Ethernet on even more huge steroids.
It is just an example of what all those protocols should have. And only that.
And if that does not sound very ambitious: previous DST versions already supported lots of features which never existed (and in some cases were impossible to add) in other block level network storages.
DST moves further.

There will be no mirroring and no overall ability to map multiple devices into a single one. Instead one should use Device Mapper for this goal, since its features were simply mirrored (although I tried to optimize them sometimes) in DST, and the number of targets was noticeably smaller.

Now DST is just a simple block device which operates on top of a network connection. With just a single exception: it's done right.

Features planned for the new Distributed Storage:

  • kernelspace client and server
  • initial autoconfiguration between client and server nodes
  • automatic reconnect to failed target
  • transaction model: resending, timeout error completion, full rollback of the failed transaction
  • wire speed performance
  • data channel encryption, strong checksumming
  • cryptographic authentication
  • ability to work on top of any network protocol
  • barriers support (when, if any, Device Mapper will start support them, DST will not need to be changed)
  • flexible protocol with simple ability to extend it to needed functionality
  • trivial configuration
Project is being written from scratch, but it is actually very simple, and should be quite small, so expect its first release quite soon.
It will be pushed upstream when ready.



Fri, 18 Jul 2008

Completed distributed storage redesign.

I also managed to play second octave F# and sometimes the whole chromatic scale down to small (minor?) octave F on my trumpet, and I believe I started to understand overall trumpet kung-fu, but I expect it is not what you wanted to read under the DST tag.

So, DST becomes smaller, cleaner and simpler. Notably, I decided to drop userspace target completely for now.
Kernel part now operates on transaction entity, which holds a reference to the node, where data should be sent/received. There can be at most two such nodes if block IO request spans the boundary. In case of mirroring (which will be dropped for the first release) list of nodes to mirror this data to will be maintained by the first node, so transaction will not need to know about them.
In theory block request can be as much as BIO_MAX_PAGES pages, which is 256 for now, but I decided to limit minimum node size to be not smaller than above bio limit, so there will be always at most two nodes per request.
Each node has either block device behind it (so it will just call generic_make_request() with different block device for given bio), or network state machine.
Network state will have two threads: RX and TX. Receive one is used to get replies for the read/write messages, search appropriate transaction and complete it. In case of DST server it will also handle read/write requests and generate replies, but the whole processing will be exactly the same, client node will have a switch to process read/write requests from the network, but they should be only received by server.
Sending thread is tricky. It is used as fallback for non-blocking sockets, which are used first at generic_make_request() time, i.e. when higher level user performed read or write, if block was not fully sent, then it is queued to this thread and it will try to send the rest of the data when polling allows. ->make_request_fn() function returns in this case and higher layer can proceed with own operations.
Transaction is not freed until reply is received from the remote side or resending retry count fires.
Transaction is always allocated (from the appropriate memory pool) and that is actually all the allocations in DST itself. In case it works with block devices, it is possible to clone a bio when it crosses the boundaries (or even always, I have to check it, but it is essentially what device mapper does, with lots of own additional allocations), but it should be a very rare condition.
Network stack will allocate data itself too.
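The "at most two nodes per request" rule follows from simple arithmetic, sketched here under the simplifying assumption that all nodes have the same size (the text only requires every node to be at least as large as the biggest request). The names sk_span and sk_split() are made up; start, length and node size are in the same unit, for example sectors.

```c
#include <assert.h>
#include <stdint.h>

/* How a request is divided between the node it starts in and the next one. */
struct sk_span {
	uint64_t first_len;  /* part served by the node holding 'start' */
	uint64_t second_len; /* part spilling into the next node, may be 0 */
};

struct sk_span sk_split(uint64_t start, uint64_t len, uint64_t node_size)
{
	/* A request is never bigger than a node, so it crosses one edge at most. */
	assert(node_size > 0 && len <= node_size);

	uint64_t room = node_size - start % node_size; /* space left in first node */
	struct sk_span s;

	s.first_len = len < room ? len : room;
	s.second_len = len - s.first_len;
	return s;
}
```

Because len never exceeds node_size, the second part always fits into the next node, so a third node is never needed.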

That was a theory. Practice tells me, that essentially 90% of the code should be rewritten from scratch, so I recloned the tree and so far implemented generic bits of registering block device, creating various sysfs files and directories and other similar trivial bits. I still plan to finish it this weekend (without mirroring), but things may turn to me a different side though...



Tue, 15 Jul 2008

Distributed storage development roadmap.

Yes, DST project is alive and will beat out the crap very soon, since I decided to change its underlying architecture, and switch to transaction model just like POHMELFS. This basically means that as long as system has enough RAM writing operations will be extremely fast, reading can be balanced between multiple nodes (in mirror), transactions can be resent, failover mechanism becomes much simpler, and system overall will be much more robust to failures.

Transaction model also means that the system requires an explicit acknowledge from the remote side, and there are two possibilities here: to handle the implicit ack which comes with TCP ack packets like I experimented with before, or to send an explicit ack from the server for each client's request.
The former approach, although it has smaller performance overhead, still suffers from the fact that pages sent via DST are always stateless, i.e. at this layer there is no knowledge about who sends this page. We can determine the inode the page belongs to, and can even get a socket when the page is about to be released after the ack has been received, but we can not know from exactly which pipe it was submitted into the given socket, so when multiple threads send the same page via multiple sendfile() calls we do not know when and how the page will be released. We can put the pipes this page belongs to into a single-linked list (since the page has only two pointers unused at this point: the LRU list head, and one of them is used to determine that this page belongs to the sendfile()/splice codepath), and likely traversing this list will not hurt usual users, but a malicious one can create a local DoS with this approach. After some experiments with the splice code today I decided to drop the implementation of this idea for now.
There is a strong argument in favour of explicit acks from the server: this allows to make asynchronous transaction processing (with implicit acks we can not hook into the processing path, since we do not know where exactly the skb with our pages is chained), and this does not hurt performance (which was proven by POHMELFS benchmarks).

So, overall plan to develop DST is to switch to transaction model and perform async processing of all events (there are only two actually: reading and writing of the given pages to given locations).
This task is not that complex, so I expect some new results later this week. Stay tuned!



Sun, 30 Mar 2008

Continue DST roadmap.

So, I have to admit that I rethought my opinion about mirroring/redundancy at filesystem layer - it is useful for lots of cases, and modulo bugs in DST mirroring (mostly a leak, which I can not find in my lab, and network/block layer race, which exists in sendfile() for years and just strikes DST a lot, which has a workaround though) I decided to rewrite mirroring algorithm in a way it could be used in other projects.

There is also an idea of how to fix abovementioned network/block layer race in a very non-disturbing manner, which was privately called soft DST barriers. Idea is to replace skb destructor with private one, which will commit that pages are no longer used (for example call bio_endio() or release splice buffer), this callback will be installed only for special sockets, which provide it (like DST, sendfile() or any other ->sendpage() users like samba). Idea was not killed on its roots, which is a good start sign.



Thu, 27 Mar 2008

Distributed storage roadmap.

DST project was quiet for a while, but actually it is not.
There is a bug in the mirror algorithm, which I am considering rewriting. Not because of this bug, but because it will be used in a special setup, where its extension is required.
Consider a highly available *SQL cluster with multiple storage nodes combined into a mirror and several main systems, which operate with database software. Unfortunately only a single main system works with queries, the others have to be turned on when the first one fails. The task is to create a system which will automatically switch between main nodes and recover if either main nodes or storage nodes become unavailable, so that the whole system does not stop if something wrong happened with the machines. It has to scale to tens of nodes as a must and later to hundreds without problems.

This is not a performance scalability solution - so far only single node should be able to collect multiple data nodes into storage, and if that node fails it has to be switched, but so far I do not know any working and free solution for the problem. But solution created for the main node switching can be used in cases when any server (for example metadata server in cluster) failed and has to be switched.

It will also force me to finally implement barriers in DST.

As a possible helper for availability messages I consider abandoned CARP-like protocol (in userspace).



Thu, 31 Jan 2008

BTRFS subvolumes.

Chris Mason created a short specification for the subvolumes in BTRFS. Subvolumes allow the filesystem to allocate blocks on several devices and use tricky algorithms to distribute the load between storages.
Overall this is an excellent idea, but the specification raises some questions and I believe it is too heavily tied to ZFS design.

I will drop my thoughts here, which may be completely wrong though.

Here are some features btrfs will support with subvolume implementation: Mirrored metadata, configurable up to N mirrors (where N > 2); Mirrored data extents; Checksum failure resolution by using a mirrored copy; Striped data extents and others.

They are clear targets for block layer, but there are following notes on why it is not:

If Btrfs were to rely on device mapper or MD for mirroring, it would not be able to resolve checksum failures by checking the mirrored copy. The lower layers don't know the checksum or granularity of the filesystem blocks, and so they are not able to verify the data they return.
Well, that's not entirely correct, since the checksum has to be checked not against the other mirror, but against the data itself (i.e. it has to be recalculated after read), since data can be damaged during transfer and it is not that rare a condition. Thus checksums from different mirrors can both be wrong, but equal, which without recalculating would signal that everything is ok, while it is not.
Recalculating block checksum can be faster for smaller blocks than reading it from other disk.
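The difference between the two checks can be shown with a small sketch. FNV-1a is used here only as a stand-in for whatever checksum the filesystem stores, and the function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a, used only as a placeholder for the filesystem checksum. */
uint32_t sk_csum(const unsigned char *buf, size_t len)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= buf[i];
		h *= 16777619u;
	}
	return h;
}

/* Comparing mirrors only tells us that both copies are the same. */
int sk_mirrors_agree(const unsigned char *a, const unsigned char *b, size_t len)
{
	return memcmp(a, b, len) == 0;
}

/* Recalculating after the read tells us the data is what was written. */
int sk_block_valid(const unsigned char *buf, size_t len, uint32_t stored)
{
	return sk_csum(buf, len) == stored;
}
```

Two mirrors damaged in the same way pass sk_mirrors_agree() but fail sk_block_valid(), which is why the recalculation can not be skipped.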

If Btrfs were to rely on device mapper for aggregating all of the physical devices into a single big address space, it would not have sufficient information to allocate mirrored copies on different devices. Keeping this information in sync between Btrfs and the device mapper would be difficult and error prone.
Actually it is very simple. DST supports such interaction for example.

Instead I propose and will use following scheme for subvolumes (I like the name) in local filesystem: there is pool of devices, and there are allocation policies for each one in the following form (just an example): files with '*.jpg' pattern are allocated from device 1, '*.log' from device 2, metadata is stored on device 3, small files are allocated on device 4, and so on. Then each device has own policy on mirroring its data to needed number of storages.
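Such a policy table could look roughly like this. It is only a sketch of the name-pattern part of the idea, built on POSIX fnmatch(); the structure, the function name and the device numbers are taken from the example above or made up, and size-based or metadata rules are left out.

```c
#include <fnmatch.h>
#include <stddef.h>

/* Hypothetical allocation rule: files matching 'pattern' go to 'device'. */
struct sk_policy {
	const char *pattern;
	int         device;
};

/* Return the device of the first matching rule, or 'fallback' if none match. */
int sk_pick_device(const struct sk_policy *p, size_t n,
	const char *name, int fallback)
{
	for (size_t i = 0; i < n; i++) {
		if (fnmatch(p[i].pattern, name, 0) == 0)
			return p[i].device;
	}
	return fallback;
}
```

Rules are checked in order, so more specific patterns should come first; each selected device would then apply its own mirroring policy.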

And, a side note, it looks like Chris Mason uses Mac OSX for development or at least for writing documentation, since a screenshot of high-level design clearly has Mac's shadows and fonts :)



Tue, 22 Jan 2008

New DST release: Succumbed to live ant.

This is a maintenance release only and it contains only following change:

  • do not allocate big enough address structure on the stack during local export node initialization
Great thanks to Serge Leschinsky and Konstantin Kalin for testing.

As usual one can get the latest version from project homepage or via git tree.
