Mon, 13 Oct 2008
Massive documentation update for the distributed storage. New release.
Andrew Morton expressed (somewhat angrily imho :) the lack of documentation
for the DST
as a review-stopper, so I cleaned up some simple stuff he reported (like
style changes, kcalloc() instead of kzalloc(),
config dependency and other such things) and wrote about 500 lines of code documentation.
Not that much, but it is a bit more than 10% of the whole DST project:
$ git commit -a -m "Documentation update."
Created commit 4886f36: Documentation update.
7 files changed, 476 insertions(+), 18 deletions(-)
$ git-diff-tree -r --stat origin master
warning: refname 'origin' is ambiguous.
drivers/block/Kconfig | 2 +
drivers/block/Makefile | 2 +
drivers/block/dst/Kconfig | 14 +
drivers/block/dst/Makefile | 3 +
drivers/block/dst/crypto.c | 731 +++++++++++++++++++++++++++++
drivers/block/dst/dcore.c | 963 +++++++++++++++++++++++++++++++++++++++
drivers/block/dst/export.c | 662 ++++++++++++++++++++++++++
drivers/block/dst/state.c | 838 ++++++++++++++++++++++++++++++++++
drivers/block/dst/thread_pool.c | 345 ++++++++++++++
drivers/block/dst/trans.c | 335 ++++++++++++++
include/linux/connector.h | 4 +-
include/linux/dst.h | 572 +++++++++++++++++++++++
12 files changed, 4470 insertions(+), 1 deletions(-)
As usual, one can grab the new release from the
archive or via the GIT
tree.
/devel/dst :: Link / Comments ()
Mon, 06 Oct 2008
New distributed storage release.
The new DST release contains the following changes:
- Keepalive messages, sent when there is no traffic between the nodes, to detect failed nodes early.
- The listening socket now reuses its address, which speeds up the stop/start sequence.
- Fixed a bug with a wrong debug option, which could read uninitialized memory.
- Changed the module name from dst.ko to nst.ko, since the former is used by a DVB card.
- Whitespace cleanup.
As usual, the patch is available from the
archive
or via the GIT tree.
Enjoy!
Asked for inclusion again. Let's make bets on the number of comments for the patch :)
/devel/dst :: Link / Comments ()
Wed, 24 Sep 2008
New DST release.
This is a maintenance release, which contains the following changes:
- Use idr to manage minor numbers. Now a create/remove/create sequence does not
produce a new minor, but reuses the previous one, which has been freed.
- Added a cache name to the node. A freed node may still
be alive while we register a new node with the same name, so its cache name has to be different.
- Wait during node removal until there are no pending transactions, so that the node is
freed in process context and not in the receiving thread itself.
- Warn the user if there is no security permission config file during
export node initialization. No client will be allowed to connect
without an explicit security association.
- Tune the default size of the page pool for crypto processing a bit.
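The minor number reuse from the first item can be illustrated with a small userspace sketch. It is not the kernel idr API, just a model of the "lowest free ID" behaviour under stated assumptions: the names demo_minor_map, demo_minor_get() and demo_minor_put() and the limit DEMO_MAX_MINORS are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_MINORS 64	/* hypothetical limit, for illustration only */

struct demo_minor_map {
	bool used[DEMO_MAX_MINORS];
};

/* Return the lowest free minor and mark it used, or -1 if none is left. */
int demo_minor_get(struct demo_minor_map *m)
{
	for (int i = 0; i < DEMO_MAX_MINORS; ++i) {
		if (!m->used[i]) {
			m->used[i] = true;
			return i;
		}
	}
	return -1;
}

/* Release a minor so that the next allocation can hand it out again. */
void demo_minor_put(struct demo_minor_map *m, int minor)
{
	assert(minor >= 0 && minor < DEMO_MAX_MINORS);
	m->used[minor] = false;
}
```

With this model a create/remove/create sequence gets the same minor back instead of an ever-growing number.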
I want to thank Remy Ritchen (remy.ritchen_gmail.com) for his excellent tests and analysis.
As usual, DST
is available from archive and via
git tree.
/devel/dst :: Link / Comments ()
Sat, 13 Sep 2008
New distributed storage release.
This is a maintenance-only release of the
DST, which brings us the following changes:
- Fixed a memory leak in the crypto thread initialization error path. Noticed by Sven Wegener (sven.wegener_stealer.net).
- Unprotected tree access (an exceptionally stupid bug, I was made blind by the electronic equipment), and a tricky BUG_ON catch in scsi
code caused by incorrect bio flag initialization in the exporting node. 64bit alignment fix.
Bugs reported by Rémy Ritchen(
- A couple of bogus compilation warnings about uninitialized variables caught by a different compiler.
- Allow both read and write permission, not only read or write, in the security config.
The patch can be found in the git tree
or archive.
The trickiest bug is scsi's BUG_ON(), whose trace did not even contain any DST related calls.
It was caught at drivers/scsi/scsi_lib.c:1175:
kernel BUG at drivers/scsi/scsi_lib.c:1175!
RIP: 0010:[] [] scsi_setup_fs_cmnd+0x64/0x70
...
[] ? sd_prep_fn+0xa8/0x9b0
[] ? __cfq_slice_expired+0x59/0xb0
[] ? cfq_dispatch_requests+0x8d/0x330
[] ? elv_next_request+0x119/0x250
[] ? scsi_request_fn+0x6b/0x3c0
[] ? generic_unplug_device+0x24/0x30
[] ? blk_unplug_work+0x41/0x80
Which is the following code:
int scsi_setup_fs_cmnd(struct scsi_device *sdev, struct request *req)
{
        struct scsi_cmnd *cmd;
        int ret = scsi_prep_state_check(sdev, req);
        if (ret != BLKPREP_OK)
                return ret;
        /*
         * Filesystem requests must transfer data.
         */
        BUG_ON(!req->nr_phys_segments);
This means that the request structure did not contain any segments to process. Originally
I thought that it was caused by some tricky elevator steps, which selected the wrong request queue,
because all debug output showed that a sync bio (block IO request with the BIO_RW_SYNC bit set)
is handled differently from the same request without this flag. But experiments with various
flags showed that the bug occurs regardless of them, just in a completely unpredictable place.
Fortunately I managed to catch it in a debug trap in the block IO merging path, which showed me that
block IO requests with very strange read/write and flags fields were the cause of this error. Looking more
closely at the block queue allocation path, I found that its default initialization is not correct
and that my setup happens before it, so the queue did not contain the right parameters for the maximum request sizes
(hw and phys sectors). This also showed that one block IO request in the export node had the clone and other
local-only fields set, which is very wrong for a bio about to be submitted, and which actually resulted in the observed bug.
Those fields were set by the client bio and should not be transferred to the remote one, so I limited the flags field
to say only that the bio is uptodate and has the blockable IO bit.
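The flag fix can be pictured as a whitelist applied to whatever arrives from the client. The sketch below is only an illustration: the DEMO_* bit values are made up and are not the kernel's BIO flag values, and masking (rather than setting a fixed value) is an assumption about how the limiting is done.

```c
#include <assert.h>

/* Hypothetical flag bits named after the ideas in the text, not kernel values. */
enum {
	DEMO_BIO_UPTODATE  = 1u << 0,	/* data in the bio is valid */
	DEMO_BIO_BLOCKABLE = 1u << 1,	/* submitter is allowed to block */
	DEMO_BIO_CLONED    = 1u << 2,	/* local-only: bio is a clone */
	DEMO_BIO_BOUNCED   = 1u << 3,	/* local-only: bounce pages are in use */
};

/* The only flags that make sense on the exporting node. */
#define DEMO_WIRE_FLAGS	(DEMO_BIO_UPTODATE | DEMO_BIO_BLOCKABLE)

/* Drop every client-local flag before the bio is built on the remote side. */
unsigned long demo_sanitize_flags(unsigned long client_flags)
{
	unsigned long flags = client_flags & DEMO_WIRE_FLAGS;

	assert((flags & ~(unsigned long)DEMO_WIRE_FLAGS) == 0);
	return flags;
}
```

A client bio marked as cloned would thus reach the export node without that bit, while its uptodate bit survives.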
That's the story of how things were hacked this day (it's the middle of the night actually, while I'm waiting
for the taxi to the airport), so the
POHMELFS locking algorithm
was not implemented today, and is likely postponed to the next weekend when I return, since I got a
group theory book and made some printouts about number theory (after finishing Vinogradov's book),
so I will have something to read on all four planes (two in each direction) if I do not fall asleep,
and likely I will not have much time in Portland:
we will need to talk/listen to other people and check local pubs (people suggested some cocktail places, but I prefer
beer).
See you in Portland.
/devel/dst :: Link / Comments ()
Tue, 09 Sep 2008
New distributed storage release: "There is no spoon, black and white".
This is a very minor
DST update,
which contains the following changes:
- sector_t compilation warnings removed.
- Debug, init, alloc, whatever cleanups noted by Sven Wegener (sven.wegener_stealer.net).
- Some checkpatch.pl masturbation
- New name: "Linux benevolent dictator said: there is no spoon, black and white"
Actually I fixed only a small amount of the crap returned by checkpatch.pl,
in particular I did not fix cases of long lines where it is actually a comment added after
some variable, or things like
for (i=0; i<n; ++i) and
struct some_name
{
...
}
when checkpatch.pl wants
for (i = 0; i < n; ++i) and
struct some_name {
...
}
But I tried to remove code lines longer than 80 characters, trailing spaces and
a couple of other warnings.
Now I will concentrate on POHMELFS
locking and then distributed facilities. Stay tuned, new version will be extremely cool in this regard!
/devel/dst :: Link / Comments ()
Mon, 08 Sep 2008
New distributed storage release.
It brings us the following changes:
- Permission checks in the export node. Read-only connections.
- Remove the DST node from the global table not only when it is freed,
but also on demand with the node del command.
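A permission check of this kind can be sketched as a lookup in a table of allowed addresses. This is a minimal userspace model, not the DST code: the structure demo_secure_entry, the DEMO_PERM_* bits and the IPv4-only address are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { DEMO_PERM_READ = 1, DEMO_PERM_WRITE = 2 };

struct demo_secure_entry {
	uint32_t addr;	/* allowed client address (IPv4, host byte order) */
	int perm;	/* bitmask of DEMO_PERM_* */
};

/* Return nonzero if 'addr' is listed and holds every bit in 'wanted'. */
int demo_permitted(const struct demo_secure_entry *tbl, size_t n,
		   uint32_t addr, int wanted)
{
	assert(tbl != NULL || n == 0);
	for (size_t i = 0; i < n; ++i)
		if (tbl[i].addr == addr)
			return (tbl[i].perm & wanted) == wanted;
	return 0;	/* no explicit security association: reject */
}
```

A read-only connection is then simply an entry with only DEMO_PERM_READ set: reads pass, writes are refused, and unknown addresses are rejected outright.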
I think the project is completed. I added an inclusion request
(with a grammar error of course, how else) into the announcement mail.
Check it out!
/devel/dst :: Link / Comments ()
Wed, 27 Aug 2008
Completely new Distributed STorage (DST) release.
DST is a block layer
network device, which among others has the following features:
- Kernel-side client and server. No need for any special tools for data processing (like special userspace applications) except for configuration.
- Bullet-proof memory allocations via memory pools for all temporary objects (transactions and so on).
- Zero-copy sending (except for the header), if supported by the device, using
sendpage().
- Failover recovery in case of a broken link (reconnection if the remote node is down).
- Full transaction support (resending of failed transactions on timeout or after reconnecting to the failed node).
- Dynamically resizable pool of threads used for data receiving and crypto processing.
- Initial autoconfiguration. Ability to extend it with additional attributes if needed.
- Support for any kind of network media (not limited to tcp or inet protocols) above the MAC layer (socket layer).
Out of the box kernel-side IPv6 support (the configuration utility needs to be extended, check how it was done in
POHMELFS).
- Security attributes for local export nodes (list of addresses allowed to connect, with permissions). Not used currently though.
- Ability to use any supported cryptographically strong checksum. Ability to encrypt the data channel.
Distributed storage was completely rewritten from scratch recently. I dropped the features that essentially
mirrored the device mapper in favour of more robust block IO processing
and an effective protocol.
One can grab sources (various configuration examples can be found in 'userspace' dir) from
archive,
or via
kernel and
userspace
GIT trees.
/devel/dst :: Link / Comments ()
Sat, 23 Aug 2008
Distributed storage.
Here we go,
DST
got all problems with reference counters fixed. There is a somewhat new observation
I made for myself: a block device has to provide open and release callbacks in its block device
operations structure, which have to increase and decrease the appropriate reference counters
of the underlying object, since otherwise it is possible to remove it
with proper del_gendisk(), blk_cleanup_queue() and put_disk(),
while some references still exist in the mapping (like in the block device info structure),
so a subsequent sync will crash the machine. I also tested lots of reconnection stuff, transaction
resending, timeouts and so on.
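The observation about open and release callbacks boils down to plain reference counting. Below is a single-threaded userspace sketch of the idea; the names demo_node, demo_open() and demo_release() are hypothetical, and a real driver would use atomic counters and the kernel's own object lifetime helpers.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_node {
	int refcnt;	/* one reference for the owner plus one per opener */
	int *freed;	/* demo only: set to 1 when the node is really freed */
};

struct demo_node *demo_node_create(int *freed_flag)
{
	struct demo_node *n = malloc(sizeof(*n));

	if (!n)
		return NULL;
	n->refcnt = 1;
	n->freed = freed_flag;
	return n;
}

/* Drop one reference; the last one frees the object. */
void demo_node_put(struct demo_node *n)
{
	assert(n->refcnt > 0);
	if (--n->refcnt == 0) {
		*n->freed = 1;
		free(n);
	}
}

/* open() callback: pin the node while the device is held open. */
void demo_open(struct demo_node *n)
{
	n->refcnt++;
}

/* release() callback: drop the opener's pin. */
void demo_release(struct demo_node *n)
{
	demo_node_put(n);
}
```

If the owner removes the device while it is still open, the node survives until the last release, so a late sync never touches freed memory.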
Actually I would make a new release, but decided to test crypto stuff first. It was copied
from POHMELFS
and should work out of the box, but this requires an additional check of course.
Since tomorrow I will have an almost minute-long free fall from several kilometers up,
weather permitting, the checks, bug fixes and release
are postponed until the start of the week.
Obviously only if there are no 'issues' with landing...
Stay tuned!
/devel/dst :: Link / Comments ()
Tue, 19 Aug 2008
Completed DST protocol implementation.
I have not yet tested crypto processing (and there is no crypto autonegotiation
yet; I will extend the automatic configuration protocol to allow nested attributes,
so that it can be extended in the future if any new parameters need to be
synced between client and server). Also, the server side does not check security attributes
during connection (like read/write per-address permissions).
Because of excessive logging there was no possibility to check performance issues,
which in turn resulted in too frequent firing of the stale transaction timer, excessive resending
and so on, so I introduced a maximum amount of work to be done in each scan. With this
change I was able to successfully create an ext3 filesystem on 8Gb storage connected
over a 3 MB/s link to the remote node.
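The "maximum amount of work per scan" idea can be sketched as a budgeted walk over the pending transactions. This is a simplified model under assumptions: an array instead of the real transaction tree, and the names demo_trans and demo_scan() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct demo_trans {
	long deadline;	/* time after which the transaction is stale */
	int retries;	/* resends left before giving up */
	int done;	/* set once completed with an error */
	int resent;	/* demo only: how many times the scan resent it */
};

/*
 * Scan pending transactions and resend the stale ones, but handle
 * at most 'budget' stale entries per call. Returns how many were handled.
 */
size_t demo_scan(struct demo_trans *t, size_t n, long now,
		 long timeout, size_t budget)
{
	size_t handled = 0;

	assert(t != NULL || n == 0);
	for (size_t i = 0; i < n && handled < budget; ++i) {
		if (t[i].done || t[i].deadline > now)
			continue;
		if (t[i].retries > 0) {
			t[i].retries--;
			t[i].resent++;
			t[i].deadline = now + timeout;	/* rearm the timer */
		} else {
			t[i].done = 1;	/* out of retries: complete with error */
		}
		handled++;
	}
	return handled;
}
```

Entries that do not fit into the budget are simply picked up by the next scan, which keeps one timer run short even when many transactions are stale at once.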
There is also an issue with a broken connection: the system tries to reconnect to the server
and does not allow the module to be unloaded while there are pending block requests.
Each transaction has a maximum number of retries, so the system waits until each counter reaches zero,
which may take too long. This may or may not be a good idea actually, but I think I will
implement transaction flushing during module unloading. The server node has the same issue if there
are pending blocks requested by the client and not yet sent.
And although it sounds like a lot of work, it actually is not. I just need not to digress,
which is the most complex part :)
/devel/dst :: Link / Comments ()
Sun, 17 Aug 2008
DST. BIO. Barriers. Alignment.
I actually was wrong when I
talked
about the problems distributed storage
may have with non-page-aligned vectors inside a single block request. Actually neither client
nor server should know how the other peer works with a given block request. The only
thing which should be transferred between the peers is the start of the request and its size.
And of course flags and operation mode (i.e. automatic sync/barrier support and read/write operation).
That's it. The server will allocate as many pages for the request as needed for its own page size;
the client will likewise process them as a contiguous flow of bytes coming out of the network pipe,
which is then placed into the pages the client was asked to read or write. Simple.
A BIO can not have holes inside it, but it can have multiple pages
that are partially filled. And this information should not be shared between the nodes at all.
Which basically means that I do not even need to think about how to handle these problems and can just
complete the protocol between the peers. Stay tuned!
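To make the point concrete: given only a start and a size, each peer can cut the byte stream into pages of its own size. The sketch below is a userspace illustration; demo_seg and demo_split() are hypothetical names, and start_off stands for the offset of the request inside its first local page.

```c
#include <assert.h>
#include <stddef.h>

struct demo_seg {
	size_t page;	/* index of the local page */
	size_t off;	/* offset inside that page */
	size_t len;	/* bytes placed into that page */
};

/*
 * Split 'size' contiguous bytes, beginning 'start_off' bytes into the
 * local page numbering, into per-page segments for a peer whose page
 * size is 'page_size'. Returns the number of segments, or 0 if 'max'
 * segments are not enough.
 */
size_t demo_split(size_t start_off, size_t size, size_t page_size,
		  struct demo_seg *out, size_t max)
{
	size_t n = 0;

	assert(page_size > 0);
	size_t page = start_off / page_size;
	size_t off = start_off % page_size;

	while (size > 0) {
		size_t len = page_size - off;

		if (len > size)
			len = size;
		if (n == max)
			return 0;
		out[n++] = (struct demo_seg){ page, off, len };
		size -= len;
		page++;
		off = 0;
	}
	return n;
}
```

A 10000 byte request becomes three segments on a peer with 4096 byte pages and two on a peer with 8192 byte pages, and neither side has to know about the other's layout.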
/devel/dst :: Link / Comments ()
Thu, 14 Aug 2008
Distributed storage debugging.
DST
testing revealed a number of bugs, which could be easily fixed if I did not need
to debug essentially two separate subsystems of the DST: client and server. Both
share lots of code, but it is quite problematic to find out who broke the protocol
when one of them starts complaining.
So far peers can connect and start initial data exchange, but there is a major problem
if the page size differs between the nodes. A block IO request (bio structure) operates on pages
(stored in the bio_vec structure), and has a size and offset attached to each page.
If the page size differs, and the server node has the smaller page, then it should somehow store information
about how to split its own set of pages allocated for the given bio size into chunks expected by the
client (since we need to transfer a size/offset pair for each page).
There is no such mechanism right now. It is possible to implement a naive approach, where the
server node allocates bio pages with the sizes requested by the client, but this will break
after just a short time, since the only guaranteed kernel allocation in the Linux VM is a single page.
Another approach is to allocate a whole block request (bio) on the server for each page
of client data (bio_vec structure), but this has too much overhead for sequential
access and in the common case, when the page size is equal on both sides of the network channel.
Network block device does not have this problem, since its server lives in userspace and can allocate
an arbitrary amount of RAM, which will be contiguous in virtual memory. In the kernel it is possible to
just allocate the needed buffers using vmalloc(), but using virtual memory is very slow. iSCSI uses a single
command per block request.
So far I plan to implement the following scheme for the read command (the only one which has the described problem):
the client will iterate over all blocks in each bio it is about to send, and will send as many commands
as there are non-contiguous blocks in the given block request. The server will receive those blocks as separate subcommands,
and will allocate a new bio for each such request. The client will need to set the transaction reference counter
to the number of such commands, since the server can reply to them in arbitrary order.
In the common case this actually should not happen, and I did not see it in practice either, since most
read bios come either from readahead (where they are contiguous) or from single block requests (which, if bigger
than the page size, will also be contiguous), but nevertheless in theory bios with a number of non-contiguous blocks
can exist, and DST should be ready for them.
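The number of commands (and thus the initial value of the transaction reference counter) could be computed roughly as below. This is a sketch under an assumption: "contiguous" is taken to mean that one block ends exactly where the next one starts in a flat byte numbering. The names demo_vec and demo_count_commands() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct demo_vec {
	unsigned long long start;	/* first byte covered by this block */
	unsigned int len;		/* number of bytes in this block */
};

/*
 * Count the runs of contiguous blocks: every gap between two
 * neighbouring blocks starts a new command.
 */
size_t demo_count_commands(const struct demo_vec *v, size_t n)
{
	size_t cmds = 0;

	assert(v != NULL || n == 0);
	for (size_t i = 0; i < n; ++i)
		if (i == 0 || v[i - 1].start + v[i - 1].len != v[i].start)
			cmds++;
	return cmds;
}
```

In the common readahead case every block follows the previous one, so the result is 1 and the bio travels as a single command.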
/devel/dst :: Link / Comments ()
Tue, 12 Aug 2008
Distributed storage testing revealed first bugs. Also some Japanese notes.
So, we need to celebrate its birth.

Japan can indeed produce a very tasty whisky! I recommend the 'Suntory Old Whisky' label, although I got the last bottle
in my favourite shop, so I would not be surprised if it is not that popular a drink.
/devel/dst :: Link / Comments ()
Wed, 06 Aug 2008
Distributed storage, POHMELFS, netchannels development.
While a lot of action around filesystems has arisen recently, I took a short break
there and concentrated on the lower block layer:
DST.
Distributed storage essentially got export capabilities, i.e. data receiving, crypto processing,
block layer request allocation and submission, reply generation and so on,
although it is more like a proof-of-concept right now, since it requires lots and lots
of testing. There are also plans for some additional features, but that is not a lot of work.
So project completion is very close.
POHMELFS
priorities have been switched a bit. After a number of talks with people I decided
to first implement the right locking semantics (probably switchable on/off by a mount option),
which would allow simultaneous reads/writes to be performed the way people expect from a local filesystem.
Currently it uses a somewhat tricky cache coherency protocol, which in some cases can end up with different
results than expected from a local filesystem.
Next will be distributed server-side hash table development.
Netchannels will also
get a new release very soon. It will be simplified and some unneeded functionality (like netchannels NAT) will be removed.
I will also run some new tests with the userspace network stack,
namely latency measurements.
/devel/dst :: Link / Comments ()
Wed, 30 Jul 2008
Distributed storage development progress report.
DST
got full transaction support (resending, timeout completion, error recovery,
memory pool allocation for all kinds of transactions, single transaction
allocation per IO request),
socket processing (initialization of the connected and listening sockets,
failover recovery of the connection, receiving thread, network helpers),
crypto processing of the requests (thread pool utilization for crypto operations,
cipher/hash initialization, cached pages for sending crypto processing).
I am thinking of moving the processing of receiving and listening/accepted sockets to the thread pool too;
likely it is the way to go, right now they have their own threads.
Missing bits include the actual data sending/receiving and accepting clients on the
listening socket (and appropriate initialization of all the needed infrastructure).
This is quite a major part, but likely it will be completed sooner rather than later.
/devel/dst :: Link / Comments ()
Mon, 28 Jul 2008
Distributed storage development progress. Thread pools.
Today I implemented a simple thread pool subsystem, which allows one
to create a set of threads, to add/remove them from this set
at run-time, and to schedule work to be done by them. Work
is specified as two functions: setup() is called when the
system has selected a thread for execution, so the caller can
set up the needed data, and action() is called by the thread itself;
it has access to the data provided at initialization time.
Work scheduling has a timeout parameter, which is the
time the system will wait for a free thread; after that an error is returned.
The system is generic enough not to contain any notion of DST or crypto,
only two new data types: struct thread_pool and
struct thread_pool_worker, of which only the former is visible to the user.
API looks like this:
void thread_pool_del_worker(struct thread_pool *p);
struct thread_pool_worker *thread_pool_add_worker(struct thread_pool *p,
                char *name,
                int (* init)(void *private),
                void (* cleanup)(void *private),
                void *private);
void thread_pool_destroy(struct thread_pool *p);
struct thread_pool *thread_pool_create(int num, char *name,
                int (* init)(void *private),
                void (* cleanup)(void *private),
                void *private);
int thread_pool_schedule(struct thread_pool *p,
                int (* setup)(void *private, void *data),
                int (* action)(void *private),
                void *data, long timeout);
The init() and cleanup() callbacks above are used after a
new thread is created, so that the user can initialize per-thread data;
for example, they are used to allocate some cached pages and initialize
crypto algorithms.
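How setup() and action() cooperate can be shown with a deliberately single-threaded model: no real threads, no timeout, just the hand-off of work to a free worker. Everything prefixed with demo_ is hypothetical and only mirrors the shape of the API above; it is not the DST implementation.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_WORKERS 2

struct demo_worker {
	int busy;
	void *private;			/* per-worker data from init time */
	int (*action)(void *private);	/* pending work, run by the worker */
};

struct demo_pool {
	struct demo_worker w[DEMO_WORKERS];
};

/*
 * Pick a free worker, let the caller prepare its data via setup() and
 * queue action(). Returns -1 when all workers are busy (the real API
 * would wait up to 'timeout' for a free thread before failing).
 */
int demo_schedule(struct demo_pool *p,
		  int (*setup)(void *private, void *data),
		  int (*action)(void *private), void *data)
{
	for (size_t i = 0; i < DEMO_WORKERS; ++i) {
		struct demo_worker *w = &p->w[i];

		if (w->busy)
			continue;
		int err = setup(w->private, data);
		if (err)
			return err;
		w->action = action;
		w->busy = 1;
		return 0;
	}
	return -1;
}

/* What a worker thread would do once it is woken up. */
int demo_worker_run(struct demo_worker *w)
{
	assert(w->busy && w->action);
	int ret = w->action(w->private);

	w->action = NULL;
	w->busy = 0;
	return ret;
}

/* Example callbacks: setup copies an int into the worker's slot, action doubles it. */
int demo_setup(void *private, void *data)
{
	*(int *)private = *(int *)data;
	return 0;
}

int demo_action(void *private)
{
	*(int *)private *= 2;
	return 0;
}
```

The split into two callbacks means the caller touches the worker's private data only after a worker has been reserved, and the heavy part runs later in the worker's own context.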
This thread pool system is used by the crypto processing code in
the distributed storage subsystem: when a block IO request is about to be sent,
or when the system has received a reply to a read request, it schedules
crypto processing work to the pool, which is initialized at DST node setup time.
Crypto processing does not yet work in DST, nor do some other bits;
so far I only played a bit with its initialization sequence, so it was
split into network, crypto and security initializations and node start, which
registers the new storage in the block layer subsystem. These steps allow
additional initialization steps to be introduced later if needed without breaking backward
compatibility.
Next steps include proper network initialization and processing and transaction
management helpers. Then I will combine all existing code and make a first
renewed release.
Stay tuned!
/devel/dst :: Link / Comments ()
Sat, 19 Jul 2008
Distributed storage is dead, long live the Distributed storage!
As you may know, the DST
project was an attempt to implement a redundant, failover-resistant, flexible block level storage
subsystem. Among other features it supported the ability to map multiple remote nodes via linear
or mirroring algorithms to a single node, reconnection to a failed node, read balancing and
parallel writing to multiple nodes (in case of mirroring) and
so on.
Now it is gone. The distributed storage you knew before is no more; instead there is a
completely new project being developed, whose main goal is to provide a transport layer for
block requests only. Consider it as Network Block Device on huge steroids. Consider it
as iSCSI on huge steroids. Consider it as ATA-over-Ethernet on even more huge steroids.
It is just an example of what all those protocols should have. And only that.
And if it does not sound very ambitious: previous DST versions already supported lots of features
which never existed (and in some cases were impossible to add) in other block level
network storages.
DST moves further.
There will be no mirroring and no ability at all to map multiple devices into a single one;
instead one should use Device Mapper for this goal, since its features were simply mirrored
(although I tried to optimize them sometimes) in DST, and the number of targets was noticeably smaller.
Now DST is just a simple block device which operates on top of a network connection. With just a
single exception: it's done right.
Features planned for the new Distributed Storage:
- kernelspace client and server
- initial autoconfiguration between client and server nodes
- automatic reconnect to failed target
- transaction model: resending, timeout error completion, full rollback of the failed transaction
- wire speed performance
- data channel encryption, strong checksumming
- cryptographic authentication
- ability to work on top of any network protocol
- barrier support (if and when Device Mapper starts supporting them, DST will not need to be changed)
- flexible protocol with simple ability to extend it to needed functionality
- trivial configuration
Project is being written from scratch, but it is actually very simple,
and should be quite small, so expect its first release quite soon.
It will be pushed upstream when ready.
/devel/dst :: Link / Comments ()
Fri, 18 Jul 2008
Completed distributed storage redesign.
I also managed to play second octave F# and sometimes the whole chromatic scale
down to small (minor?) octave F on my trumpet, and I believe I have started to understand
overall trumpet kung-fu, but I expect it is not what you wanted to read under the
DST tag.
So, DST becomes smaller, cleaner and simpler. Notably, I decided to drop the userspace
target completely for now.
The kernel part now operates on a transaction entity, which holds a reference to the node
where data should be sent/received. There can be at most two such nodes, if a block IO
request spans a node boundary. In case of mirroring (which will be dropped for the first release)
the list of nodes to mirror this data to will be maintained by the first node, so the transaction
will not need to know about them.
In theory a block request can be as big as BIO_MAX_PAGES pages,
which is 256 for now, but I decided to require the minimum node size to be not smaller than
the above bio limit, so there will always be at most two nodes per request.
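The "at most two nodes" argument is easy to check with a little arithmetic. The sketch assumes a linear mapping of equally sized nodes, which is a simplification made only for this example; demo_span and demo_map_request() are hypothetical names and all sizes are in bytes.

```c
#include <assert.h>

struct demo_span {
	unsigned long long first_node;	/* index of the node where the request starts */
	unsigned long long first_len;	/* bytes that go to that node */
	unsigned long long second_len;	/* bytes that spill into the next node, or 0 */
};

/*
 * Map a request onto linearly concatenated nodes of 'node_size' bytes.
 * Since a request is never bigger than one node, it can cross at most
 * one boundary and therefore touches at most two nodes.
 */
struct demo_span demo_map_request(unsigned long long start,
				  unsigned long long size,
				  unsigned long long node_size)
{
	assert(node_size > 0 && size <= node_size);

	struct demo_span s;
	unsigned long long off = start % node_size;
	unsigned long long room = node_size - off;

	s.first_node = start / node_size;
	s.first_len = room < size ? room : size;
	s.second_len = size - s.first_len;
	return s;
}
```

For example, with 1 MB nodes (256 pages of 4096 bytes) an 8192 byte request that starts 4096 bytes before a boundary is split into two 4096 byte halves, one for each of the two neighbouring nodes.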
Each node has either a block device behind it (so it will just call generic_make_request()
with a different block device for the given bio), or a network state machine.
The network state will have two threads: RX and TX. The receiving one is used to get replies to the
read/write messages, find the appropriate transaction and complete it.
In case of a DST server it will also handle read/write requests and generate replies, but the whole
processing will be exactly the same; the client node will have a switch to process read/write requests from
the network, but they should only be received by the server.
The sending thread is tricky.
It is used as a fallback for non-blocking sockets, which are tried first at generic_make_request()
time, i.e. when the higher level user performs a read or write. If the block was not fully sent,
it is queued to this thread, which will try to send the rest of the data when
polling allows. The ->make_request_fn() function returns in this case and the higher
layer can proceed with its own operations.
A transaction is not freed until a reply is received from the remote side or the resending retry
count is exhausted.
A transaction is always allocated (from the appropriate memory pool) and that is actually
the only allocation in DST itself. When it works with block devices, it is possible to clone a bio
when it crosses the boundaries (or even always, I have to check it, but it is essentially
what device mapper does, with lots of its own additional allocations), but it should be a very rare condition.
The network stack will allocate data itself too.
That was the theory. Practice tells me that essentially 90% of the code should be rewritten
from scratch, so I recloned the tree and so far implemented the generic bits of registering
a block device, creating various sysfs files and directories and other similar trivial bits.
I still plan to finish it this weekend (without mirroring), but things may turn out differently...
/devel/dst :: Link / Comments ()
Tue, 15 Jul 2008
Distributed storage development roadmap.
Yes, the DST
project is alive and will beat out the crap very soon, since I decided to change its
underlying architecture and switch to a transaction model just like
POHMELFS.
This basically means that as long as the system has enough RAM, write operations will be
extremely fast, reads can be balanced between multiple nodes (in a mirror), transactions
can be resent, the failover mechanism becomes much simpler,
and the system overall will be much more robust to failures.
The transaction model also means that the system requires an acknowledgement from the remote side,
and there are two possibilities here: to handle the implicit ack which comes with TCP ack
packets, like I experimented with
before, or to send an explicit ack from the server for each client request.
The former approach, although it has a smaller performance overhead, still suffers from
the fact that pages sent via DST are always stateless, i.e. at this layer there is
no knowledge about who sends the page. We can determine the inode a page belongs to, and can
even get the socket when the page is about to be released after the ack has been received,
but we can not know from exactly which pipe it was submitted into the given socket,
so when multiple threads send the same page via multiple sendfile()
calls we do not know when and how the page will be released. We can put the pipes this page belongs
to into a singly-linked list (since the page has only two pointers unused at this point: the LRU
list head, and one of them is used to determine that this page belongs to the sendfile()/splice codepath),
and likely traversing this list will not hurt usual users, but a malicious one can
create a local DoS with this approach. After some experiments with the splice code
today I decided to drop the implementation of this idea for now.
There is a strong argument in favour of explicit acks from the server: they allow asynchronous transaction
processing (with implicit acks we can not hook into the processing path, since we do not know where exactly
the skb with our pages is chained), and they do not hurt performance (which was proven by
POHMELFS benchmarks).
So, the overall plan for DST development is to switch to the transaction model and perform async processing
of all events (there are only two actually: reading and writing of the given pages to given
locations).
This task is not that complex, so I expect some new results later this week. Stay tuned!
/devel/dst :: Link / Comments ()
Sun, 30 Mar 2008
Continue DST roadmap.
So, I have to admit that I have reconsidered my
opinion
about mirroring/redundancy at the filesystem layer - it is useful in lots of cases,
and modulo bugs in
DST
mirroring (mostly a leak, which I can not find in my lab,
and a network/block layer race,
which has existed in sendfile() for years and just strikes DST a lot,
though it has a workaround) I decided to rewrite the mirroring algorithm in a way that
it could be used in other projects.
There is also an idea of how to fix the abovementioned network/block layer race in a
very non-disturbing manner, which was privately called soft
DST barriers.
The idea is to replace the skb destructor with a private one, which will signal that the
pages are no longer used (for example call bio_endio() or
release the splice buffer); this callback will be installed only for special sockets
which provide it (like DST, sendfile() or any other
->sendpage() users like samba). The idea was
not killed at its roots,
which is a good starting sign.
/devel/dst :: Link / Comments ()
Thu, 27 Mar 2008
Distributed storage roadmap.
DST
project was quiet for a while, but actually it is not.
There is a bug in the mirror algorithm, which I am considering rewriting. Not because of
this bug, but because it will be used in a special setup, where an extension is required.
Consider a highly available *SQL cluster with multiple storage nodes combined into a
mirror and several main systems, which operate with database software. Unfortunately
only a single main system works with queries; another has to be turned on when the first one fails.
The task is to create a system which will automatically switch between main nodes and
recover if either main nodes or storage nodes become unavailable, so that the whole
system does not stop if something wrong happens with the machines. It has to scale
to tens of nodes as a must and later to hundreds without problems.
This is not a performance scalability solution - so far only a single node should be able to
collect multiple data nodes into storage, and if that node fails it has to be switched,
but so far I do not know of any working and free solution for the problem. The solution created
for main node switching can also be used in cases when any server (for example a metadata server
in a cluster) has failed and has to be switched.
It will also force me to finally implement barriers in DST.
As a possible helper for availability messages
I am considering the abandoned CARP-like
protocol (in userspace).
/devel/dst :: Link / Comments ()
Thu, 31 Jan 2008
BTRFS subvolumes.
Chris Mason created a short specification
for the subvolumes in BTRFS. Subvolumes allow the filesystem to allocate blocks
on several devices and use tricky algorithms to distribute the load
between storages.
Overall this is an excellent idea, but the specification raises some questions and I believe
it is too heavily tied to the ZFS design.
I will drop my thoughts here, which may be completely wrong though.
Here are some features btrfs will support with the subvolume implementation:
Mirrored metadata, configurable up to N mirrors (where N > 2); Mirrored data extents;
Checksum failure resolution by using a mirrored copy; Striped data extents and others.
They are clear targets for the block layer, but there are the following notes on why it is not so:
If Btrfs were to rely on device mapper or MD for mirroring, it would not be able to resolve checksum
failures by checking the mirrored copy. The lower layers don't know the checksum or granularity of
the filesystem blocks, and so they are not able to verify the data they return.
Well, that's not entirely correct, since the checksum has to be checked not against the other mirror, but
against the data itself (i.e. it has to be recalculated after the read), since data can
be damaged during transfer, and that is not such a rare condition. Thus checksums from different mirrors can both
be wrong but equal, which without recalculation would signal that everything is ok while it is not.
Recalculating a block checksum can be faster for smaller blocks than reading it from another disk.
If Btrfs were to rely on device mapper for aggregating all of the physical devices into a single big
address space, it would not have sufficient information to allocate mirrored copies on different devices.
Keeping this information in sync between Btrfs and the device mapper would be difficult and error prone.
Actually it is very simple. DST
supports such interaction, for example.
Instead I propose, and will use, the following scheme for subvolumes (I like the name) in a local filesystem:
there is a pool of devices, and there are allocation policies for each one in the following form (just an example):
files matching the '*.jpg' pattern are allocated from device 1, '*.log' from device 2, metadata is stored on device 3,
small files are allocated on device 4, and so on. Then each device has its own policy for mirroring its data to the needed
number of storages.
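The name-based part of such a policy could look like the sketch below. It uses the POSIX fnmatch() call for the pattern matching; the table layout, the names demo_policy and demo_pick_device() and the fallback device are assumptions of the example, and the metadata and small-file rules are left out.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

struct demo_policy {
	const char *pattern;	/* shell-style pattern for the file name */
	int device;		/* device to allocate matching files from */
};

/* Return the device of the first matching policy, or 'fallback' if none matches. */
int demo_pick_device(const struct demo_policy *p, size_t n,
		     const char *name, int fallback)
{
	assert(name != NULL);
	for (size_t i = 0; i < n; ++i)
		if (fnmatch(p[i].pattern, name, 0) == 0)
			return p[i].device;
	return fallback;
}
```

The first match wins, so more specific patterns should be listed before more general ones.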
And, as a side note, it looks like Chris Mason uses Mac OS X for development or at least for writing documentation,
since the screenshot of the high-level design
clearly has Mac's shadows and fonts :)
/devel/dst :: Link / Comments ()
Tue, 22 Jan 2008
New DST release: Succumbed to live ant.
This is a maintenance-only release and it contains
only the following change:
- do not allocate a fairly big address structure on the stack
during local export node initialization
Great thanks to Serge Leschinsky and Konstantin Kalin for testing.
As usual one can get the latest version from
project homepage
or via git tree.
/devel/dst :: Link / Comments ()