Zbr's days.

Mon, 06 Oct 2008

New distributed storage release.

New DST release contains the following changes:

  • Keepalive messages for early detection of failed nodes. They are sent when there is no traffic between the nodes.
  • Listening socket now reuses its address, which speeds up the stop/start sequence.
  • Fixed a bug with a wrong debug option, which could read uninitialized memory.
  • Changed module name from dst.ko to nst.ko, since the former is already used by a DVB card driver.
  • Whitespace cleanup.
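
The keepalive item above could be sketched roughly as follows. This is a userspace illustration only, not the DST code: the structure, its field names and the two timeouts are assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical per-connection state; none of these names come from DST. */
struct demo_peer {
	long last_tx;			/* time (seconds) we last sent anything */
	long last_rx;			/* time (seconds) we last received anything */
	long keepalive_interval;	/* send a keepalive after this much TX silence */
	long dead_timeout;		/* declare the peer failed after this much RX silence */
};

enum demo_action { DEMO_IDLE, DEMO_SEND_KEEPALIVE, DEMO_PEER_FAILED };

/* Decide what to do for one peer at time 'now'. */
enum demo_action demo_peer_check(const struct demo_peer *p, long now)
{
	assert(p->keepalive_interval > 0);
	assert(p->dead_timeout > p->keepalive_interval);

	if (now - p->last_rx >= p->dead_timeout)
		return DEMO_PEER_FAILED;
	if (now - p->last_tx >= p->keepalive_interval)
		return DEMO_SEND_KEEPALIVE;
	return DEMO_IDLE;
}
```

Regular traffic would update both timestamps, so in this sketch keepalives only go out on an otherwise silent link.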
As usual patch is available from archive or via GIT tree.
Enjoy!

Asked for inclusion again. Let's make bets on number of comments for the patch :)

/devel/dst :: Link / Comments (4)


Wed, 24 Sep 2008

New DST release.

This is a maintenance release, which contains following changes:

  • Use idr to manage minor numbers. Now a create/remove/create sequence does not produce a new minor, but reuses the previous one, which has been freed.
  • Added cache name to the node. A freed node can still be alive while we register a new node with the same name, so its cache name should be different.
  • Wait during node removal until there are no pending transactions, so the node is freed in process context and not in the receiving thread itself.
  • Warn the user if there is no security permission config file during export node initialization. No client will be allowed to connect without an explicit security association.
  • Tune default size of the page pool for crypto processing a bit.
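
The minor number reuse from the first item can be illustrated with a small userspace sketch. It does not use the kernel idr API; a 64-bit bitmap stands in for it, and all names here are made up for the example.

```c
#include <assert.h>

#define DEMO_MAX_MINORS 64

static unsigned long long demo_minor_map;	/* bit i set: minor i is in use */

/* Return the lowest free minor number, or -1 if all are taken. */
int demo_minor_alloc(void)
{
	int i;

	for (i = 0; i < DEMO_MAX_MINORS; i++) {
		if (!(demo_minor_map & (1ULL << i))) {
			demo_minor_map |= 1ULL << i;
			return i;
		}
	}
	return -1;
}

/* Release a minor so that the next allocation can reuse it. */
void demo_minor_free(int minor)
{
	assert(minor >= 0 && minor < DEMO_MAX_MINORS);
	demo_minor_map &= ~(1ULL << minor);
}
```

With this scheme a create/remove/create sequence gets the same number back, instead of an ever growing counter.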
I want to thank Remy Ritchen (remy.ritchen_gmail.com) for his excellent tests and analysis.

As usual, DST is available from archive and via git tree.

/devel/dst :: Link / Comments (0)


Sat, 13 Sep 2008

New distributed storage release.

This is maintenance only release of the DST, which brings us following changes:

  • Fixed memory leak in crypto thread initialization error path. Noticed by Sven Wegener (sven.wegener_stealer.net).
  • Unprotected tree access (exceptionally stupid bug, I was made blind by the electronic equipment), and tricky bug_on catch in scsi code caused by incorrect bio flag initialization in the exporting node. 64bit alignment fix. Bugs reported by Rémy Ritchen(
  • Couple of bogus compilation warnings about uninitialized variables caught by a different compiler.
  • Allow both read and write permission, not only read or write in security config.
Patch can be found in git tree or archive.

The trickiest bug is the SCSI BUG_ON(), whose trace did not even contain any DST related calls.
It was caught at drivers/scsi/scsi_lib.c:1175:
  kernel BUG at drivers/scsi/scsi_lib.c:1175!
  RIP: 0010:[]  [] scsi_setup_fs_cmnd+0x64/0x70
  ...
  [] ? sd_prep_fn+0xa8/0x9b0
  [] ? __cfq_slice_expired+0x59/0xb0
  [] ? cfq_dispatch_requests+0x8d/0x330
  [] ? elv_next_request+0x119/0x250
  [] ? scsi_request_fn+0x6b/0x3c0
  [] ? generic_unplug_device+0x24/0x30
  [] ? blk_unplug_work+0x41/0x80
Which is the following code:
int scsi_setup_fs_cmnd(struct scsi_device *sdev, struct request *req)
{
	struct scsi_cmnd *cmd;
	int ret = scsi_prep_state_check(sdev, req);

	if (ret != BLKPREP_OK)
		return ret;
	/*
	 * Filesystem requests must transfer data.
	 */
	BUG_ON(!req->nr_phys_segments);
This means that the request structure did not contain any segments to process. Originally I thought it was caused by some tricky elevator steps which selected the wrong request queue, because all debugging showed that a sync bio (block IO request with the BIO_RW_SYNC bit set) is handled differently from the same request without this flag. But experiments with various flags showed that the bug occurs no matter which flags are set, just in a completely unpredictable place.

Fortunately I managed to catch it in a debug trap in the block IO merging path, which showed me that block IO requests with very strange read/write and flags fields were the cause of this error. Looking more closely at the block queue allocation path, I found that its default initialization is not correct, and my setup happens before it, so the queue did not contain the right parameters for the maximum request sizes (hw and phys sectors). This also showed that one block IO request in the export node had the clone and other local-only flags set, which is very wrong for a bio that is about to be submitted, and which actually resulted in the observed bug. Those flags were set by the client bio and should not be transferred to the remote one, so I limited the flags field to only show that the bio is uptodate and has the blockable IO bit.
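
The fix described above amounts to whitelisting flags on the export node. A minimal sketch follows, assuming a plain bitmask; the DEMO_* names and bit positions are invented for illustration and are not the kernel's BIO_* values, and whether the real patch masks or simply sets the flags is also an assumption.

```c
#include <assert.h>

/* Illustrative flag bits; these are NOT the kernel's BIO_* values. */
#define DEMO_BIO_UPTODATE	(1UL << 0)
#define DEMO_BIO_BLOCKABLE	(1UL << 1)
#define DEMO_BIO_CLONED		(1UL << 2)
#define DEMO_BIO_LOCAL_ONLY	(1UL << 3)

/* Flags the export node is willing to take over from the client. */
#define DEMO_EXPORT_FLAGS	(DEMO_BIO_UPTODATE | DEMO_BIO_BLOCKABLE)

static_assert((DEMO_EXPORT_FLAGS & DEMO_BIO_CLONED) == 0,
	      "the cloned flag must never cross the wire");

/* Drop every client-side flag that has no meaning on the remote node. */
unsigned long demo_export_flags(unsigned long client_flags)
{
	return client_flags & DEMO_EXPORT_FLAGS;
}
```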

That's the story of how things were hacked today (it's the middle of the night actually, and I'm waiting for the taxi to the airport). So the POHMELFS locking algorithm was not implemented today, and is likely postponed to the next weekend, when I return. I got a group theory book and made some printouts on number theory (after finishing Vinogradov's book), so I will have something to read on all four planes (two in each direction), if I do not fall asleep. I will likely not have much time in Portland: we will need to talk and listen to other people and check the local pubs (people suggested some cocktail places, but I prefer beer).

See you in Portland.

/devel/dst :: Link / Comments (0)


Tue, 09 Sep 2008

New distributed storage release: "There is no spoon, black and white".

This is a very minor DST update, which contains following changes:

  • sector_t compilation warnings removed.
  • Debug, init, alloc, whatever cleanups noted by Sven Wegener (sven.wegener_stealer.net).
  • Some checkpatch.pl masturbation
  • New name: "Linux benevolent dictator said: there is no spoon, black and white"
Actually I fixed only a small amount of the crap returned by checkpatch.pl. In particular, I did not fix long lines where the excess is actually a comment added after some variable, or things like
for (i=0; i<n; ++i) and
struct some_name
{
...
}
when checkpatch.pl wants
for (i = 0; i < n; ++i) and
struct some_name {
...
}
But I tried to remove code lines longer than 80 characters, trailing spaces and a couple of other warnings.

Now I will concentrate on POHMELFS locking and then distributed facilities. Stay tuned, new version will be extremely cool in this regard!

/devel/dst :: Link / Comments (1)


Mon, 08 Sep 2008

New distributed storage release.

It brings us following changes:

  • Permission checks in export node. Read-only connections.
  • Remove DST node from the global table not only when it is freed, but also on demand with node del command.
I think the project is completed. I added an inclusion request (with a grammar error of course, how else) to the announcement mail.

Check it out!

/devel/dst :: Link / Comments (0)


Wed, 27 Aug 2008

Completely new Distributed STorage (DST) release.

DST is a block layer network device, which among others has following features:

  • Kernel-side client and server. No need for any special tools for data processing (like special userspace applications) except for configuration.
  • Bullet-proof memory allocations via memory pools for all temporary objects (transaction and so on).
  • Zero-copy sending (except header) if supported by device using sendpage().
  • Failover recovery in case of broken link (reconnection if remote node is down).
  • Full transaction support (resending of the failed transactions on timeout or after reconnect to failed node).
  • Dynamically resizeable pool of threads used for data receiving and crypto processing.
  • Initial autoconfiguration. Ability to extend it with additional attributes if needed.
  • Support for any kind of network media (not limited to tcp or inet protocols) above the MAC layer (socket layer). Out of the box kernel-side IPv6 support (needs to extend configuration utility, check how it was done in POHMELFS).
  • Security attributes for local export nodes (list of allowed to connect addresses with permissions). Not used currently though.
  • Ability to use any supported cryptographically strong checksums. Ability to encrypt data channel.
Distributed storage was completely rewritten from scratch recently. I dropped the features that essentially mirrored the device mapper in favour of more robust block IO processing and a more effective protocol.

One can grab sources (various configuration examples can be found in 'userspace' dir) from archive, or via kernel and userspace GIT trees.

/devel/dst :: Link / Comments (0)


Sat, 23 Aug 2008

Distributed storage.

Here we go, DST got all problems with reference counters fixed. There is a somewhat new observation I made for myself: a block device has to provide open and release callbacks in its block device operations structure, and they have to increase and decrease the appropriate reference counters of the underlying object. Otherwise it is possible to remove the object with proper del_gendisk(), blk_cleanup_queue() and put_disk() calls while some references still exist in the mapping (like in the block device info structure), so a subsequent sync will crash the machine. I also tested lots of reconnection stuff, transaction resending, timeouts and so on.
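
The reference counting rule above can be shown with a tiny userspace model. It is only a sketch under assumptions: struct demo_node and the demo_* functions are hypothetical stand-ins, not the DST structures or the kernel block device operations.

```c
#include <assert.h>

/* Hypothetical stand-in for the object behind the block device. */
struct demo_node {
	int refcnt;	/* starts at 1: the reference held by the registration */
	int freed;	/* set instead of calling free(), so the sketch is testable */
};

static void demo_node_get(struct demo_node *n)
{
	n->refcnt++;
}

static void demo_node_put(struct demo_node *n)
{
	assert(n->refcnt > 0);
	if (--n->refcnt == 0)
		n->freed = 1;	/* real code would tear down and free here */
}

/* Counterparts of the open/release block device callbacks. */
int demo_open(struct demo_node *n)
{
	demo_node_get(n);
	return 0;
}

void demo_release(struct demo_node *n)
{
	demo_node_put(n);
}

/* Removal only drops the registration reference. */
void demo_node_del(struct demo_node *n)
{
	demo_node_put(n);
}
```

If the device is still open when it is removed, the object survives until the last release, so a later sync does not touch freed memory.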

Actually I would make a new release, but decided to test crypto stuff first. It was copied from POHMELFS and should work out of the box, but this requires an additional check of course.
Since tomorrow I will have an almost minute-long free fall from several kilometers up, if weather permits, the checks, bug fixes and release are postponed until the start of the week. Obviously only if there are no 'issues' with landing...

Stay tuned!

/devel/dst :: Link / Comments (0)


Tue, 19 Aug 2008

Completed DST protocol implementation.

I did not yet test crypto processing (and there is no crypto autonegotiation yet, I will extend automatic configuration protocol to allow nested attributes, so it could be extended in the future if there will be any need for new parameters to be synced between client and server). Also server side does not check security attributes during connection (like read/write per-address permissions).

Because of excessive logging there was no possibility to check performance issues. The logging in turn resulted in too frequent stale transaction timer fires, excessive resending and so on, so I introduced a maximum amount of work to be done in each scan. With this change I was able to successfully create an ext3 filesystem on 8Gb storage connected over a 3 MB/s link to the remote node.
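
Limiting the work done per scan could look something like the sketch below. It is an assumption-laden illustration: transactions live in a plain array, 'budget' is the maximum number of resends per scan, and a cursor remembers where the next scan should continue. None of these names come from DST.

```c
#include <assert.h>
#include <stddef.h>

struct demo_trans {
	long deadline;	/* resend when now >= deadline */
	long timeout;	/* added to 'now' after each resend */
	int completed;	/* reply already received */
};

/*
 * Visit each transaction at most once, starting at *cursor, and stop as
 * soon as 'budget' stale transactions have been resent.
 * Returns the number of resends performed in this scan.
 */
size_t demo_scan(struct demo_trans *t, size_t n, size_t *cursor,
		 size_t budget, long now)
{
	size_t visited = 0, resent = 0;

	assert(n == 0 || *cursor < n);

	while (visited < n && resent < budget) {
		struct demo_trans *cur = &t[*cursor];

		*cursor = (*cursor + 1) % n;
		visited++;

		if (cur->completed || cur->deadline > now)
			continue;

		/* Stands in for the actual resend. */
		cur->deadline = now + cur->timeout;
		resent++;
	}
	return resent;
}
```

The remaining stale transactions are simply picked up by the next timer fire, so a single scan can not monopolize the CPU.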

There is also an issue with broken connection: system tries to reconnect to the server and does not allow to unload module if there are pending block requests, each transaction has maximum number of retries, so system waits until each one reaches zero, which may take too long. This may or may not be a good idea actually, but I think I will implement transaction flushing during module unloading. Server node has the same issues if there are pending blocks requested by client and yet not sent.

And although it sounds like a lot of work, it actually is not. I just need not to digress, which is the most complex part :)

/devel/dst :: Link / Comments (2)


Sun, 17 Aug 2008

DST. BIO. Barriers. Alignment.

I actually was wrong when I talked about the problems distributed storage may have with non-page-aligned vectors inside a single block request. Actually neither client nor server should know how the other peer works with a given block request. The only thing which should be transferred between the peers is the start of the request and its size. And of course flags and operation mode (i.e. automatic sync/barrier support and read/write operation). That's it. The server will allocate as many pages for the request as needed for its own page size, and the client will also process the data just like a contiguous flow of bytes coming out of the network pipe, which are then placed into the pages the client was asked to read or write. Simple.

BIO can not have holes inside it, but it can have multiple pages to be partially filled. And this information should not be shared between the nodes at all.
Which basically means that I do not need to even think about how to handle these problems and can just complete the protocol between the peers. Stay tuned!
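
A request header carrying only start, size, flags and operation mode might be sketched like this. The structure, its field widths and the big-endian wire layout are assumptions for the example, not the actual DST protocol.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical command header; field names and layout are illustrative only. */
struct demo_cmd {
	uint64_t sector;	/* start of the request */
	uint32_t size;		/* size of the request in bytes */
	uint32_t flags;		/* e.g. sync/barrier bits */
	uint32_t rw;		/* 0 = read, 1 = write */
};

#define DEMO_CMD_WIRE_SIZE 20

/* Store the low 'bytes' bytes of v at p, most significant byte first. */
static void demo_put_be(uint8_t *p, uint64_t v, int bytes)
{
	int i;

	for (i = bytes - 1; i >= 0; i--) {
		p[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
}

static uint64_t demo_get_be(const uint8_t *p, int bytes)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < bytes; i++)
		v = (v << 8) | p[i];
	return v;
}

void demo_cmd_pack(const struct demo_cmd *c, uint8_t *buf)
{
	demo_put_be(buf, c->sector, 8);
	demo_put_be(buf + 8, c->size, 4);
	demo_put_be(buf + 12, c->flags, 4);
	demo_put_be(buf + 16, c->rw, 4);
}

void demo_cmd_unpack(struct demo_cmd *c, const uint8_t *buf)
{
	c->sector = demo_get_be(buf, 8);
	c->size = (uint32_t)demo_get_be(buf + 8, 4);
	c->flags = (uint32_t)demo_get_be(buf + 12, 4);
	c->rw = (uint32_t)demo_get_be(buf + 16, 4);
}

/* Pages each side needs for 'size' bytes, given its own page size. */
uint32_t demo_pages_needed(uint32_t size, uint32_t page_size)
{
	assert(page_size > 0);
	return (uint32_t)(((uint64_t)size + page_size - 1) / page_size);
}
```

Nothing about pages appears in the header: a 16384 byte request needs four pages on a peer with 4096 byte pages and one page on a peer with 65536 byte pages, and neither side has to know that about the other.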

/devel/dst :: Link / Comments (0)


Thu, 14 Aug 2008

Distributed storage debugging.

DST testing revealed a number of bugs, which could be easily fixed if I did not need to debug essentially two separate subsystems of the DST: client and server. Both share lots of code, but it is quite problematic to find who broke the protocol when one of them starts complaining.

So far peers can connect and start the initial data exchange, but there is a major problem if the page size differs between the nodes. A block IO request (bio structure) operates on pages (stored in the bio_vec structure), and has a size and offset attached to each page. If the page size differs and the server node has the smaller page, then it should somehow store information about how to split its own set of pages, allocated for the given bio size, into chunks expected by the client (since we need to transfer a size/offset pair for each page). There is no such mechanism right now. It is possible to implement a naive approach, where the server node allocates bio pages with the sizes requested by the client, but this will break after a short time, since the only guaranteed kernel allocation in Linux VM is a single page.
Another approach is to allocate the whole block request (bio) on the server for each page of client data (bio_vec structure), but this will have too big overhead on sequential access and in common case, when page size is equal on both sides of the network channel.

Network block device does not have this problem, since its server lives in userspace and can allocate arbitrary amount of ram, which will be contiguous in virtual memory. Using virtual memory is very slow, although it is possible to just allocate needed buffers using vmalloc(). iSCSI uses single command per block request.

So far I plan to implement the following scheme for the read command (the only one which has the described problem): the client will iterate over all blocks in each bio it is about to send, and will send as many commands as there are non-contiguous blocks in the given block request. The server will receive those blocks as separate subcommands, and will allocate a new bio for each such request. The client will need to raise the transaction reference counter to the number of such commands, since the server can reply to them in arbitrary order.
In the common case this actually should not happen, and I did not see it in practice either, since most reading bios come either from readahead (where they are contiguous) or single block requests (which if bigger than page size will also be contiguous), but nevertheless in theory such bios, where there is number of non-contiguous blocks, can exist and DST should be ready for them.
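
Counting the subcommands for such a bio boils down to counting contiguous runs. Here is a minimal sketch, assuming each block is described by a start offset and a length in bytes; struct demo_seg is a made-up stand-in, not the kernel bio_vec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one block inside a request. */
struct demo_seg {
	uint64_t start;	/* byte offset on the storage */
	uint32_t len;	/* length in bytes */
};

/*
 * Number of contiguous runs in the request. In the scheme above this is
 * the number of subcommands to send, and also the value the transaction
 * reference counter has to reach.
 */
size_t demo_count_runs(const struct demo_seg *s, size_t n)
{
	size_t i, runs = 0;

	assert(s != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (i == 0 || s[i - 1].start + s[i - 1].len != s[i].start)
			runs++;
	}
	return runs;
}
```

For the common readahead case, where all blocks follow each other, this returns 1 and nothing changes compared to a single command.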

/devel/dst :: Link / Comments (0)


Tue, 12 Aug 2008

Distributed storage testing revealed first bugs. Also some Japanese notes.

So, we need to celebrate its birth.

Suntory Old Whiskey

Japan can produce a very tasty whisky! I recommend the 'Suntory Old Whisky' label, although I got the last bottle in my favourite shop, so I would not be surprised if it is not that popular a drink.

/devel/dst :: Link / Comments (0)


Wed, 06 Aug 2008

Distributed storage, POHMELFS, netchannels development.

While there has been a lot of action around filesystems recently, I took a short break there and concentrated on the lower block layer: DST.
Distributed storage essentially got export capabilities, i.e. data receiving, crypto processing, block layer request allocation and submitting, reply generation and so on, although it is more like a proof-of-concept right now, since it requires lots and lots of testing. There are also plans for some additional features, but that is not a lot of work. So project completion is very close.

POHMELFS priorities have been switched a bit. After number of talks with people I decided first to implement the right locking semantics (probably will be turned on/off by mount option), which would allow simultaneous read/write to be performed the way people expect from local filesystem. Currently it uses a bit tricky cache coherency protocol, which in some cases can end up with different results than expected from local filesystem.
Next will be distributed server-side hash table development.

Netchannels will also get a new release very soon. It will be simplified and some unneeded functionality (like netchannels NAT) will be removed.
I will also run some new tests with userspace network stack, namely latency measurements.

/devel/dst :: Link / Comments (0)


Wed, 30 Jul 2008

Distributed storage development progress report.

DST got full transaction support (resending, timeout completion, error recovery, memory pool allocation for all kinds of transactions, single transaction allocation per IO request), socket processing (initialization of the connected and listened sockets, failover recovery of the connection, receiving thread, network helpers), crypto processing of the requests (thread pool utilization for crypto operations, cipher/hash initialization, cached pages for sending crypto processing).
Thinking of moving receiving and listen/accepted sockets processing to the thread pool too, likely it is a way to go, right now they have own threads.

Missing bits include the actual data sending/receiving and client accepting by the listening socket (and appropriate initialization of all the needed infrastructure). This is quite a major part, but likely it will be completed sooner rather than later.

/devel/dst :: Link / Comments (0)


Mon, 28 Jul 2008

Distributed storage development progress. Thread pools.

Today I implemented a simple thread pool subsystem, which allows to create a set of threads, to add/remove them from this set at run-time, and to schedule work to be done by them. Work is specified as two functions: setup(), which is called when the system has selected a thread for execution, so the caller can set up the needed data, and action(), which is called by the thread itself and has access to the data provided at initialization time.
Work scheduling has a timeout parameter, which corresponds to time system will wait for free thread, otherwise error is returned.
System is generic enough not to contain any notion about DST or crypto, only two new data types: struct thread_pool and struct thread_pool_worker, only the former is visible to the user.
API looks like this:

void thread_pool_del_worker(struct thread_pool *p);
struct thread_pool_worker *thread_pool_add_worker(struct thread_pool *p,
	char *name,
	int (* init)(void *private),
	void (* cleanup)(void *private),
	void *private);

void thread_pool_destroy(struct thread_pool *p);
struct thread_pool *thread_pool_create(int num, char *name,
	int (* init)(void *private),
	void (* cleanup)(void *private),
	void *private);

int thread_pool_schedule(struct thread_pool *p,
	int (* setup)(void *private, void *data),
	int (* action)(void *private),
	void *data, long timeout);
init() and cleanup() callbacks above are used after new thread is created, so that user could initialize per-thread data, for example it is used to allocate some cached pages and initialize crypto algorithms.

This thread pool system is used by the crypto processing code in the distributed subsystem: when block io request is about to be sent, or when system has received reply for the read request, it schedules crypto processing work to the pool, initialized at DST node setup time.

Crypto processing does not yet work in DST, as well as some other bits. So far I only played a bit with its initialization sequence, so it was split into network, crypto and security initializations and node start, which registers the new storage in the block layer subsystem. These steps allow additional initialization steps to be introduced later if needed without breaking backward compatibility.

Next steps include proper network initialization and processing and transaction management helpers. Then I will combine all existing code and make a first renewed release.
Stay tuned!

/devel/dst :: Link / Comments (0)


Sat, 19 Jul 2008

Distributed storage is dead, long live the Distributed storage!

As you may know, the DST project was an attempt to implement a redundant, failover resistant, flexible block level storage subsystem. Among other features it supported the ability to map multiple remote nodes via linear or mirroring algorithms to a single node, reconnect to a failed node, read balancing and parallel writing to multiple nodes (in case of mirroring) and so on.

Now it has gone. There is no more distributed storage you knew before, instead there is completely new project being developed, which main goal is to provide a transport layer for the block requests only. Consider it as Network Block Device on huge steroids. Consider it as iSCSI on huge steroids. Consider it as ATA-over-Ethernet on even more huge steroids.
It is just an example of what all those protocols should have. And only that.
And if that does not sound very ambitious, consider that previous DST versions already supported lots of features which never existed (and in some cases were impossible to add) in other block level network storages.
DST moves further.

There will be no mirroring and no overall ability to map multiple devices into a single one. Instead one should use Device Mapper for this goal, since its features were simply mirrored (although I tried to optimize them sometimes) in DST, and the number of targets was noticeably smaller.

Now DST is just a simple block device which operates on top of a network connection. With just a single exception: it's done right.

Features planned for the new Distributed Storage:

  • kernelspace client and server
  • initial autoconfiguration between client and server nodes
  • automatic reconnect to failed target
  • transaction model: resending, timeout error completion, full rollback of the failed transaction
  • wire speed performance
  • data channel encryption, strong checksumming
  • cryptographic authentication
  • ability to work on top of any network protocol
  • barriers support (when, if ever, Device Mapper starts to support them, DST will not need to be changed)
  • flexible protocol with simple ability to extend it to needed functionality
  • trivial configuration
Project is being written from scratch, but it is actually very simple, and should be quite small, so expect its first release quite soon.
It will be pushed upstream when ready.

/devel/dst :: Link / Comments (8)


Fri, 18 Jul 2008

Completed distributed storage redesign.

I also managed to play second octave F# and sometimes the whole chromatic scale down to small (minor?) octave F on my trumpet, and I believe I started to understand overall trumpet kung-fu, but I expect it is not what you wanted to read under DST tag.

So, DST becomes smaller, cleaner and simpler. Notably, I decided to drop userspace target completely for now.
Kernel part now operates on transaction entity, which holds a reference to the node, where data should be sent/received. There can be at most two such nodes if block IO request spans the boundary. In case of mirroring (which will be dropped for the first release) list of nodes to mirror this data to will be maintained by the first node, so transaction will not need to know about them.
In theory block request can be as much as BIO_MAX_PAGES pages, which is 256 for now, but I decided to limit minimum node size to be not smaller than above bio limit, so there will be always at most two nodes per request.
Each node has either block device behind it (so it will just call generic_make_request() with different block device for given bio), or network state machine.
Network state will have two threads: RX and TX. Receive one is used to get replies for the read/write messages, search appropriate transaction and complete it. In case of DST server it will also handle read/write requests and generate replies, but the whole processing will be exactly the same, client node will have a switch to process read/write requests from the network, but they should be only received by server.
Sending thread is tricky. It is used as fallback for non-blocking sockets, which are used first at generic_make_request() time, i.e. when higher level user performed read or write, if block was not fully sent, then it is queued to this thread and it will try to send the rest of the data when polling allows. ->make_request_fn() function returns in this case and higher layer can proceed with own operations.
Transaction is not freed until reply is received from the remote side or resending retry count fires.
Transaction is always allocated (from the appropriate memory pool) and those are actually all the allocations in DST itself. In case it works with block devices, it is possible to clone a bio when it crosses the boundaries (or even always, I have to check it, but it is essentially what device mapper does, with lots of its own additional allocations), but it should be a very rare condition.
Network stack will allocate data itself too.
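
The "at most two nodes per request" argument above can be checked with a little arithmetic. The sketch below assumes nodes laid out one after another and sizes given in 512 byte sectors; with 4096 byte pages, 256 pages are 2048 sectors. The function name and the layout are assumptions for the example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count how many nodes a request [start, start + len) touches when the
 * nodes are laid out one after another. All values are in sectors.
 */
size_t demo_nodes_spanned(const uint64_t *node_size, size_t n,
			  uint64_t start, uint64_t len)
{
	uint64_t node_start = 0, end = start + len;
	size_t i, spanned = 0;

	assert(len > 0);

	for (i = 0; i < n; i++) {
		uint64_t node_end = node_start + node_size[i];

		/* Half-open intervals overlap when each starts before the other ends. */
		if (start < node_end && end > node_start)
			spanned++;
		node_start = node_end;
	}
	return spanned;
}
```

As long as no node is smaller than the largest possible request, a request can cross at most one node boundary, so the result never exceeds two.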

That was the theory. Practice tells me that essentially 90% of the code should be rewritten from scratch, so I recloned the tree and so far implemented the generic bits of registering a block device, creating various sysfs files and directories and other similar trivial bits. I still plan to finish it this weekend (without mirroring), though things may turn out differently...

/devel/dst :: Link / Comments (0)


Tue, 15 Jul 2008

Distributed storage development roadmap.

Yes, DST project is alive and will beat out the crap very soon, since I decided to change its underlying architecture, and switch to transaction model just like POHMELFS. This basically means that as long as system has enough RAM writing operations will be extremely fast, reading can be balanced between multiple nodes (in mirror), transactions can be resent, failover mechanism becomes much simpler, and system overall will be much more robust to failures.

Transaction model also means that the system requires an explicit acknowledgement from the remote side, and there are two possibilities here: to handle the implicit ack which comes with TCP ack packets, like I experimented with before, or to send an explicit ack from the server for each client request.
The former approach, although it has smaller performance overhead, still suffers from the fact that pages sent via DST are always stateless, i.e. at this layer there is no knowledge about who sends the page. We can determine the inode the page belongs to, and can even get the socket when the page is about to be released after the ack has been received, but we can not know from exactly which pipe it was submitted into the given socket, so when multiple threads send the same page via multiple sendfile() calls we do not know when and how the page will be released. We can put the pipes this page belongs to into a singly-linked list (since the page has only two pointers unused at this point, the LRU list head, and one of them is used to determine that this page belongs to the sendfile()/splice codepath), and likely traversing this list will not hurt usual users, but a malicious one can create a local DoS with this approach. After some experiments with the splice code today I decided to drop the implementation of this idea for now.
There is a strong argument in favour of explicit acks from the server: they allow asynchronous transaction processing (with implicit acks we can not hook into the processing path, since we do not know where exactly the skb with our pages is chained), and they do not hurt performance (which was proven by POHMELFS benchmarks).

So, overall plan to develop DST is to switch to transaction model and perform async processing of all events (there are only two actually: reading and writing of the given pages to given locations).
This task is not that complex, so I expect some new results later this week. Stay tuned!

/devel/dst :: Link / Comments (5)


Sun, 30 Mar 2008

Continue DST roadmap.

So, I have to admit that I rethought my opinion about mirroring/redundancy at filesystem layer - it is useful for lots of cases, and modulo bugs in DST mirroring (mostly a leak, which I can not find in my lab, and network/block layer race, which exists in sendfile() for years and just strikes DST a lot, which has a workaround though) I decided to rewrite mirroring algorithm in a way it could be used in other projects.

There is also an idea of how to fix abovementioned network/block layer race in a very non-disturbing manner, which was privately called soft DST barriers. Idea is to replace skb destructor with private one, which will commit that pages are no longer used (for example call bio_endio() or release splice buffer), this callback will be installed only for special sockets, which provide it (like DST, sendfile() or any other ->sendpage() users like samba). Idea was not killed on its roots, which is a good start sign.

/devel/dst :: Link / Comments (8)


Thu, 27 Mar 2008

Distributed storage roadmap.

DST project was quiet for a while, but actually it is not.
There is a bug in the mirror algorithm, which I am considering rewriting. Not because of this bug, but because it will be used in a special setup, where its extension is required.
Consider a highly available *SQL cluster with multiple storage nodes combined into a mirror and several main systems, which operate with database software. Unfortunately only a single main system works with queries, the others have to be turned on when the first one fails. The task is to create a system which will automatically switch between main nodes and recover if either main nodes or storage nodes become unavailable, so that the whole system does not stop if something wrong happened with the machines. It has to scale to tens of nodes as a must and later to hundreds without problems.

This is not a performance scalability solution - so far only single node should be able to collect multiple data nodes into storage, and if that node fails it has to be switched, but so far I do not know any working and free solution for the problem. But solution created for the main node switching can be used in cases when any server (for example metadata server in cluster) failed and has to be switched.

It will also force me to finally implement barriers in DST.

As a possible helper for availability messages I consider abandoned CARP-like protocol (in userspace).

/devel/dst :: Link / Comments (0)


Thu, 31 Jan 2008

BTRFS subvolumes.

Chris Mason created a short specification for the subvolumes in BTRFS. Subvolumes allow the filesystem to allocate blocks on several devices and use tricky algorithms to distribute the load between storages.
Overall this is an excellent idea, but the specification raises some questions and I believe it is too heavily tied to the ZFS design.

I will drop my thoughts here, which may be completely wrong though.

Here are some features btrfs will support with subvolume implementation: Mirrored metadata, configurable up to N mirrors (where N > 2); Mirrored data extents; Checksum failure resolution by using a mirrored copy; Striped data extents and others.

They are clear targets for the block layer, but there are the following notes on why it is not used:

If Btrfs were to rely on device mapper or MD for mirroring, it would not be able to resolve checksum failures by checking the mirrored copy. The lower layers don't know the checksum or granularity of the filesystem blocks, and so they are not able to verify the data they return.
Well, that's not entirely correct, since the checksum has to be checked not against the other mirror, but against the data itself (i.e. it has to be recalculated after read), since data can be damaged during transfer, and that is not such a rare condition. Thus checksums from different mirrors can both be wrong but equal, which without recalculation would signal that everything is ok, while it is not.
Recalculating block checksum can be faster for smaller blocks than reading it from other disk.
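
The point about recalculating the checksum after the read can be illustrated as follows. This is only a sketch: FNV-1a is used as a simple stand-in for a real block checksum, mirrors are plain in-memory buffers, and all names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a, used here only as a simple stand-in for a real block checksum. */
uint32_t demo_csum(const uint8_t *data, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Recalculate the checksum of each mirror copy and compare it with the
 * stored value. Returns the index of the first good copy, or -1 if none
 * matches. Copies are never compared with each other.
 */
int demo_read_verified(const uint8_t *const *mirrors, size_t nmirrors,
		       size_t len, uint32_t expected)
{
	size_t i;

	assert(mirrors != NULL || nmirrors == 0);

	for (i = 0; i < nmirrors; i++) {
		if (demo_csum(mirrors[i], len) == expected)
			return (int)i;
	}
	return -1;
}
```

Two copies that are damaged in the same way are identical to each other, but both are still rejected here, because each is checked against the stored checksum and not against its twin.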

If Btrfs were to rely on device mapper for aggregating all of the physical devices into a single big address space, it would not have sufficient information to allocate mirrored copies on different devices. Keeping this information in sync between Btrfs and the device mapper would be difficult and error prone.
Actually it is very simple. DST supports such interaction, for example.

Instead I propose and will use the following scheme for subvolumes (I like the name) in a local filesystem: there is a pool of devices, and there are allocation policies for each one in the following form (just an example): files matching the '*.jpg' pattern are allocated from device 1, '*.log' from device 2, metadata is stored on device 3, small files are allocated on device 4, and so on. Then each device has its own policy for mirroring its data to the needed number of storages.
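The policy example above could be sketched roughly like this. The small-file threshold, the fallback device 0, the order in which rules are checked and the number of mirror copies per device are all assumptions made for the sketch; none of them comes from an existing implementation.

```python
from fnmatch import fnmatch

SMALL_FILE_LIMIT = 4096          # assumed threshold for "small files", in bytes

NAME_RULES = [("*.jpg", 1), ("*.log", 2)]   # pattern -> device number
METADATA_DEVICE = 3
SMALL_FILE_DEVICE = 4
DEFAULT_DEVICE = 0               # hypothetical fallback, not in the example

# Hypothetical per-device mirroring policy: how many copies each device keeps.
MIRROR_COPIES = {0: 1, 1: 2, 2: 1, 3: 3, 4: 2}


def pick_device(name: str, size: int, is_metadata: bool = False) -> int:
    """Return the device a new object should be allocated from."""
    if is_metadata:
        return METADATA_DEVICE
    for pattern, device in NAME_RULES:
        if fnmatch(name, pattern):
            return device
    if size < SMALL_FILE_LIMIT:
        return SMALL_FILE_DEVICE
    return DEFAULT_DEVICE


device = pick_device("photo.jpg", 2_000_000)
print(device, MIRROR_COPIES[device])   # 1 2
```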

And, a side note, it looks like Chris Mason uses Mac OSX for development or at least for writing documentation, since a screenshot of high-level design clearly has Mac's shadows and fonts :)

/devel/dst :: Link / Comments (0)


Tue, 22 Jan 2008

New DST release: Succumbed to live ant.

This is a maintenance-only release and it contains only the following change:

  • do not allocate a rather big address structure on the stack during local export node initialization
Great thanks to Serge Leschinsky and Konstantin Kalin for testing.

As usual one can get the latest version from project homepage or via git tree.

/devel/dst :: Link / Comments (8)


Wed, 26 Dec 2007

New release of the distributed storage: Groundhogs strike back: no New Year for humans!

Short changelog:

  • mirroring algorithm improvements
  • debug cleanups
  • extended mirroring initialization
  • documentation update
  • name is 'Groundhogs strike back: no New Year for humans' now
As usual, one can get patch or pull changes from the project homepage.

/devel/dst :: Link / Comments (2)


Mon, 17 Dec 2007

New release of the distributed storage: Dancing with the smoked neutrino.

Short changelog:

  • new improved mirroring algorithm.
    This algorithm uses sliding window approach for full resync and write log for partial resync.
  • fixed a number of typos and did debug cleanups
  • update inode size when linear algorithm changes the size of the storage at run time
  • extended number of sysfs files and documentation for them
  • fixed leak in local export node setup
  • name is 'Dancing with the smoked neutrino' now
Overall list of features of the DST can be found on project's homepage.

DST is also exported as a git tree available for clone and pull from here.

Interested reader can test DST with 2.6.23 tree too (it should compile fine, but was not tested).

/devel/dst :: Link / Comments (4)


New distributed storage mirroring algorithm.

Resync logic - sliding window algorithm.

At startup the system checks the age (a unique cookie) of each node, and if it does not match that of the first node, it resyncs all data from the first node in the mirror to the others (the non-synced nodes). Each non-synced node has a window, which slides from the start of the node to the end. During resync all requests that enter the window are queued, so the window has to be sufficiently small. When the window has been synced from the other nodes, the queued requests are written and the window moves forward, so the next resync step starts only when the previous window is fully completed. When the window reaches the end of the node, the node is marked as synchronized.
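A toy model of the sliding window, with Python lists standing in for the nodes. The function name, the `incoming_writes` map and the handling of writes that land outside the window (they go straight through, and are mirrored only if the area is already synced) are assumptions of this sketch, not DST code.

```python
def full_resync(source, target, window_size, incoming_writes):
    """Copy source to target window by window.

    incoming_writes maps a window start offset to a list of (offset, value)
    writes that arrive while that window is being synced.
    """
    size = len(source)
    start = 0
    while start < size:
        end = min(start + window_size, size)
        queued = []
        for offset, value in incoming_writes.get(start, []):
            if start <= offset < end:
                queued.append((offset, value))   # inside the window: hold it
            else:
                source[offset] = value           # outside: goes through
                if offset < start:
                    target[offset] = value       # area already synced
        target[start:end] = source[start:end]    # sync the window itself
        for offset, value in queued:             # replay the held requests
            source[offset] = value
            target[offset] = value
        start = end                              # slide the window forward
    return True                                  # node is now synchronized


source = list(range(10))
target = [None] * 10
writes = {0: [(1, "a"), (7, "b")], 4: [(2, "c"), (5, "d")]}
synced = full_resync(source, target, 4, writes)
print(synced, target == source)   # True True
```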

If the age of the node matches that of the first one, but its log contains a different number of write log entries compared to the first node (the first node is always treated as clean), then a partial resync is scheduled. A partial resync will also be scheduled when the log entry pointed to by the resync index of the node contains an error.
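The decision between full, partial and no resync can be written down as a small function. This is only a sketch: the `Node` fields and the representation of the write log as a list of booleans are hypothetical simplifications.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    age: int                                        # unique cookie of the node
    log: List[bool] = field(default_factory=list)   # True = write ok, False = error
    resync_index: int = 0


def choose_resync(first: Node, node: Node) -> str:
    """Return 'full', 'partial' or 'none' for a node compared to the first one."""
    if node.age != first.age:
        return "full"                    # ages differ: resync everything
    if len(node.log) != len(first.log):
        return "partial"                 # different number of log entries
    if node.resync_index < len(node.log) and not node.log[node.resync_index]:
        return "partial"                 # entry at resync index holds an error
    return "none"


first = Node(age=42, log=[True, True, True])
print(choose_resync(first, Node(age=7)))                          # full
print(choose_resync(first, Node(age=42, log=[True, True])))       # partial
print(choose_resync(first, Node(age=42, log=[True, False, True],
                                resync_index=1)))                 # partial
print(choose_resync(first, Node(age=42, log=[True, True, True]))) # none
```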

The mechanism of this resync type is the following: the system selects a synced node (by checking each node's flags), fetches the log entry pointed to by the resync index of the given node, and resyncs that data from the other nodes to the given one. Then it checks the rest of the write log for other failed writes, so that the next resync block can be fetched for them.

The mirroring log is used to store write request information. It is allocated on disk and in memory (sync happens each time the resync work queue fires), and eats about 1% of free RAM or disk (whichever is less). Each write updates the log, so when a node goes offline, its log is updated with error values, and these entries can be resynced when the node is back online. When the number of failed writes becomes equal to the number of entries in the write log, recovery becomes impossible (since old log entries were overwritten) and a full resync is scheduled.
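A small model of such a log as a fixed-size ring of entries. The class and method names are invented for the sketch, and it assumes that every write to an offline node is logged as failed, so a failed entry is never overwritten before the failure counter reaches the log size.

```python
def log_size_bytes(free_ram: int, disk_size: int) -> int:
    """About 1% of free RAM or of the disk, whichever is less."""
    return min(free_ram, disk_size) // 100


class WriteLog:
    """Fixed-size ring of (sector, ok) entries."""

    def __init__(self, entries: int):
        self.slots = [None] * entries
        self.head = 0
        self.failed = 0

    def record(self, sector: int, ok: bool) -> None:
        # Each write updates the log; the oldest entry is overwritten.
        self.slots[self.head] = (sector, ok)
        self.head = (self.head + 1) % len(self.slots)
        if not ok:
            self.failed += 1

    def needs_full_resync(self) -> bool:
        # As many failures as entries: old entries are gone, log is useless.
        return self.failed >= len(self.slots)

    def failed_sectors(self):
        return [s[0] for s in self.slots if s is not None and not s[1]]


log = WriteLog(3)
log.record(10, True)
log.record(20, False)
print(log.failed_sectors(), log.needs_full_resync())   # [20] False
log.record(30, False)
log.record(40, False)
print(log.needs_full_resync())                          # True
```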

This does not work well when there are multiple writes to the same location: they are considered different writes and thus will be resynced multiple times. The right solution is to check the log for each write, preferably with the log being a tree rather than an array.
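One way to avoid resyncing the same location several times is to keep the failed regions sorted and merge a new write into whatever it overlaps. The sketch below uses two sorted Python lists with bisect as a stand-in for a tree; the class name and the half-open [start, end) convention are assumptions.

```python
import bisect


class RegionLog:
    """Sorted, non-overlapping [start, end) regions that need resync."""

    def __init__(self):
        self.starts = []
        self.ends = []

    def add(self, start: int, length: int) -> None:
        end = start + length
        # First region that ends at or after the new start.
        i = bisect.bisect_left(self.ends, start)
        # One past the last region that starts at or before the new end.
        j = bisect.bisect_right(self.starts, end)
        if i < j:
            # Regions i..j-1 overlap or touch the new one: merge them all.
            start = min(start, self.starts[i])
            end = max(end, self.ends[j - 1])
        self.starts[i:j] = [start]
        self.ends[i:j] = [end]

    def regions(self):
        return list(zip(self.starts, self.ends))


log = RegionLog()
log.add(10, 5)
log.add(10, 5)      # same location again: no second entry
log.add(12, 8)      # overlaps: extended to one region
log.add(100, 1)
print(log.regions())   # [(10, 20), (100, 101)]
```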

/devel/dst :: Link / Comments (0)


Fri, 14 Dec 2007

Linux Test Project on top of DST storage.

# pwd
/mnt/ltp-full-20071130

# ./runltp -p -f fs -d `pwd`/tmp
...
# cat /mnt/ltp-full-20071130/results/results.2007-12-14.11.21.41.17106 
Test Start Time: Fri Dec 14 11:21:41 2007
-----------------------------------------
Testcase                       Result     Exit Value
--------                       ------     ----------
gf01                           PASS       0    
gf02                           PASS       0    
gf03                           PASS       0    
gf04                           PASS       0    
gf05                           PASS       0    
gf06                           PASS       0    
gf07                           PASS       0    

-----------------------------------------------
Total Tests: 57
Total Failures: 0
Kernel Version: 2.6.22-rc5-dst
Machine Architecture: x86_64
Hostname: uganda

# mount | grep mnt
/dev/dst-storage-32 on /mnt type xfs (rw)

# cat /sys/devices/storage/n-0-ffff*/type
R: 192.168.4.81:1025
R: 192.168.4.81:1026
All 'fs' tests completed successfully, although I saw following dump in dmesg:
[ 8398.605691] BUG: MAX_LOCK_DEPTH too low!
[ 8398.609641] turning off the locking correctness validator.
which is an XFS bug.

Since DST is quite a dumb device, those tests will not find tricky places, but they are good for generating high load on top of the given block device.

/devel/dst :: Link / Comments (0)


New mirroring module in the distributed storage.

$ git-diff-index --stat HEAD drivers/block/dst/alg_mirror.c
 drivers/block/dst/alg_mirror.c |  745 ++++++++++++++++++++--------------------
 1 files changed, 364 insertions(+), 381 deletions(-)
It is cool and works well in my environment, but (like the previous one) it forces a total mirror resync after the main storage node reboots or crashes (if it is required, for example when the array was not already in sync and the main node rebooted).
I want to extend the DST mirroring algorithm so that it does not force a full resync but stores a log of the writes on each node. When a new array starts, it would check not only the age of the nodes (a unique id stored at the end of each node; if it does not match, a total resync starts) but also the write log, so that if the latter does not match, only the selected regions would be synchronized.

Stay tuned...

/devel/dst :: Link / Comments (0)


Thu, 13 Dec 2007

Why pushing project into the kernel is not a main goal?..

One has to have some courage and not be afraid to throw something out and create new things in place of the old, even if it requires a lot of effort and causes some problems in the short term.
So I've just erased the mirroring algorithm from DST and will rewrite it mostly from scratch, since I have a very interesting sync algorithm in mind, which will not require a clean/dirty bitmap.
Having DST in the kernel would not allow me such flexibility...

/devel/dst :: Link / Comments (0)


Wed, 12 Dec 2007

I was a bit pessimistic about DST design bugs.

Things are only bad when resync of the mirror node is in place...
I fixed both issues, but will spend additional time debugging and testing them, since I do not like how it was done. I think I will rewrite the mirroring resync logic.

Subrata Modak of IBM suggested using the Linux Test Project, which I found to have interesting benchmarks; while they are very useful for filesystem development, they can still find some bugs in DST.

/devel/dst :: Link / Comments (0)


Shame on me or how complex are design bugs...

I have to admit that mirroring in DST is not currently well supported.
First, because of a bug I made at an early development stage. In DST there are two objects which represent a part of the storage. The first one is a node: this object contains information about the type of the storage and pointers to the structure which represents the low-level device itself (like a block device or a network connection). A network connection in turn is represented as a state structure, which contains the socket, the state machine for the transferred data and so on. Nodes are used when a block I/O request comes from the higher layer, and states are used when data is transferred via the network. Nodes use fine-grained reference counters: when a node is being operated on (a request is being processed), its reference counter is increased; if the operation becomes asynchronous (for example the sending queue is full and thus the block can not be sent right now), the block request is queued into the state's request list and the reference counter for the node is dropped. If it reaches zero, the node is freed, which in turn calls the exit callback for the state, which flushes the queue of requests.
Things seem simple and correct, but the devil is in the details: the async processing thread can enter the game at any point and process the state too, which leads to bugs.
Second, DST mirroring can eat all your memory during resync, since it does not check the amount of free RAM in the system and tries to allocate new pages until all memory is used. This is already fixed in the private tree though.
And the last (known) problem is the mirror bitmap: it uses a single bit for every sector of the device, and although it is allocated with vmalloc(), it still takes too much RAM.
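To put a number on the bitmap problem, here is a quick calculation, assuming 512-byte sectors (the sector size is not stated above) and a 1 TiB device picked purely as an example.

```python
SECTOR_SIZE = 512   # bytes; assumed sector size


def bitmap_bytes(device_bytes: int, sector_size: int = SECTOR_SIZE) -> int:
    """RAM needed for a bitmap that keeps one bit per sector."""
    sectors = device_bytes // sector_size
    return (sectors + 7) // 8      # round up to whole bytes


TIB = 1 << 40
MIB = 1 << 20
print(bitmap_bytes(TIB) // MIB)    # 256 (MiB of bitmap for a 1 TiB device)
```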

Back to fixing.

/devel/dst :: Link / Comments (0)


Mon, 10 Dec 2007

New distributed storage release: Gamardjoba, genacvale!

Short changelog:

  • wake up the state when mirror detected an error, to speed up reconnect
  • if connecting in csum mode to no-csum server, do not enable csums
  • do not clean queue until all users are removed
  • allow increasing the size of the storage in the linear add callback (with this change it is possible to add nodes to a linear array at run time without stopping the storage; the filesystem has to be prepared for the case when the underlying device changes its size; adding mirror nodes at run time is also supported)
  • allow to delete gendisk only after device was started
  • dst debug config option
  • Name: Gamardjoba, genacvale! ('Hi friend' in Georgian)
Great thanks to Matthew Hodgson (matthew_mxtelecom.com) for debugging!

As usual, one can get new release from the project homepage.

/devel/dst :: Link / Comments (0)


Fri, 07 Dec 2007

Strong checksums in DST rock.

Great thanks to the person who suggested that I implement them, and to Zach Brown, who showed that the Castagnoli CRC is a better one than Adler.

I've debugged a setup where the system failed to mount an XFS filesystem on top of distributed storage, and after strong checksums were turned on, the system detected that they were wrong, so some corruption happened during filesystem setup.
Turning off TSO, RX and TX offload on the e1000 NICs of the machines which form the storage fixed the problem.
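For readers who want to compare the two checksums, here is a bit-by-bit CRC32C (Castagnoli polynomial) next to Adler-32 from Python's zlib. This is a slow reference sketch, not the kernel's implementation. The Adler-32 result for a one-byte input shows how little the value gets mixed on short data, which is a commonly cited weakness.

```python
import zlib

CASTAGNOLI_POLY = 0x82F63B78   # bit-reflected form of polynomial 0x1EDC6F41


def crc32c(data: bytes) -> int:
    """Bitwise CRC32C; slow, but easy to check by hand."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CASTAGNOLI_POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


print(hex(crc32c(b"123456789")))   # 0xe3069283, the standard check value
print(hex(zlib.adler32(b"a")))     # 0x620062: both halves are just 1 + ord('a')
```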

Strong checksums rock!

/devel/dst :: Link / Comments (3)


Distributed storage and long distances.

I've just completed some tests of a distributed system created on top of ordinary internet links between machines located in Moscow, Russia and London, UK.
A remote target was set up, then an XFS filesystem was created and mounted, and some tests were run.
One of the machines (the main storage server) is located behind at least one NAT firewall.

/devel/dst :: Link / Comments (4)


Thu, 06 Dec 2007

A simple way to crash machine using XFS and DST.

Let's suppose you want to create an XFS filesystem on top of a DST array. If you mistakenly run mkfs.xfs /dev/sda1 (let's suppose you want to create the DST storage on top of the /dev/sda1 device) and then start DST on top of /dev/sda1:

./dst -n storage -A alg_mirror -d /dev/sda1 -R -s0 -S0
this will overwrite the last sector of /dev/sda1, where XFS stores its metadata. Mounting XFS after that will almost certainly crash the machine on 2.6.22 kernels because of some bugs in XFS, which appear when XFS reads corrupted metadata from the last sector.

To work with DST you have to operate with /dev/dst-$storage-$num devices (i.e. run mkfs.xfs /dev/dst-$storage-$num), and not with underlying ones.

/devel/dst :: Link / Comments (0)


Wed, 05 Dec 2007

Storage hotplugging in DST.

For the interested reader: yes, it is possible to add disks to DST storage on the fly, but be sure that your filesystem supports that (in the case of a linear setup); mirroring is fairly transparent.
The command to add another node to a mirror setup is pretty simple:

./dst -n storage -A alg_mirror -S0 -s0 -a kano -p 1026
This is just like adding a usual node to the storage before it was started.

Please note that when adding a node which is smaller than the current device size, the device size will be reduced and this can damage your filesystem!
The same applies to linear setup.

/devel/dst :: Link / Comments (0)


Tue, 04 Dec 2007

DST FAQ.

The most frequently asked question about DST is:

Can you give us a summary of how this differs from using device mapper with NBD or iSCSI?
The answer is quite simple:
From the higher point of view it does not, but it operates quite differently: it processes requests asynchronously and thus does not block, it has a different protocol with smaller overhead, it supports strong checksums, and it has an in-kernel export server which supports simple security attributes (i.e. permission to connect, to read or to write). It uses a smaller amount of memory (zero additional allocations in the common path for linear mapping, not including network allocations, and a smaller number of additional allocations in the mirroring case). DST supports failure recovery in case of a dropped connection (the core will reconnect to the remote node when it is ready), so it is possible to turn remote nodes off and on without special administration steps. DST has simple autoconfiguration at startup time (checksum support and storage size autonegotiation). It is possible to turn one of the mirror nodes off and use it as an offline backup: a DST mirror node stores its own data at the end of the storage, so the node can be mounted locally.

/devel/dst :: Link / Comments (0)


New distributed storage subsystem release.

This is a maintenance release and includes bug fixes and simple feature extensions only.

Short changelog:

  • fixed bug with XFS metadata update (it can provide slab pages to the DST, so it is not allowed to transfer them using ->sendpage())
  • fixed async error completion path
  • extended netlink communication channel to report errors back to userspace
  • DST name is now "The 10'th dynasty of smuggled slothes"
  • number of fixes for userspace DST target
Great thanks to Matthew Hodgson (matthew_mxtelecom.com) for debugging and fixes for userspace DST target and preliminary netlink extension patches.

As usual you can download this release from the homepage.

If you want to try distributed storage this release is a really good candidate to start with.

Enjoy!

Update: This release includes bug fixes for all bugs described here, including uninterruptible sync read operations.

/devel/dst :: Link / Comments (2)


Thu, 29 Nov 2007

Astonishingly screwed tapeworm.

New release of the distributed storage subsystem. This is a maintenance release and includes bug fixes only.

Short changelog:

  • use node's size in sectors instead of bytes
  • fixed old/new ages for the first node. Error spotted by Matthew Hodgson (matthew_mxtelecom.com)
  • fixed debug printk declaration
  • new name
Overall list of features of the DST can be found on project's homepage.

/devel/dst :: Link / Comments (4)


Tue, 20 Nov 2007

Maintenance release of the distributed storage subsystem.

It contains only the following bug fix:

  • Cleanup sysfs files on error path. Patch by Chris Madden (chris_reflexsecurity.com)
You can find the latest release on the project homepage.

/devel/dst :: Link / Comments (0)


Thu, 15 Nov 2007

CEPH distributed storage.

It was announced on LWN and kerneltrap recently.
I already wrote about this filesystem; after that I found out (from a discussion with Zach Brown) that this filesystem does not have byte-range locking, and when a number of threads write to the same file, the writes become synchronous (i.e. no cache coherence protocols are involved). I'm also not sure what this is about: I/O workloads should be done with the client cache off because the writeback is too non-deterministic.

Those were my envious comments :), now the good news.
First, Sage Weil (the author) works full-time on this project and funds it from his own web hosting company, so it is possible to attract developers with money (he even hired someone to write a kernel client instead of the FUSE one). Second, it has a completed design and a working implementation (although some design decisions are questionable).
So it is likely a good choice to look at if you are searching for a solution which should be ready shortly.

/devel/dst :: Link / Comments (0)


Mon, 05 Nov 2007

Squizzed black-out of the dancing back-aching hippo.

This is the name of the 7th DST release.
The changelog is quite big:

  • added strong checksum support (Castagnoli crc)
  • extended autoconfiguration (added ability to request if remote side supports strong checksum and turn it on if needed)
  • documentation addition: sysfs files
  • added clean/dirty sysfs files which allow marking a node as clean (in sync) or dirty (not in sync)
  • fair number of bug fixes (including really tricky bastards, which are unlikely to be found in real setups, but which were still bugs)
  • and the main one - added release name (it clearly shows my condition)
This one is a really good release, check it out quickly and enjoy the beast!

/devel/dst :: Link / Comments (15)

