Mon, 13 Oct 2008
Massive documentation update for the distributed storage. New release.
Andrew Morton named (somewhat angrily, imho :) the lack of documentation
for the DST
as a review-stopper, so I cleaned up some simple stuff he reported (like
style changes, kcalloc() instead of kzalloc(),
config dependency and other such things) and wrote about 500 lines of code documentation.
Not that much, but it is a bit more than 10% of the whole DST project:
$ git commit -a -m "Documentation update."
Created commit 4886f36: Documentation update.
7 files changed, 476 insertions(+), 18 deletions(-)
$ git-diff-tree -r --stat origin master
warning: refname 'origin' is ambiguous.
drivers/block/Kconfig | 2 +
drivers/block/Makefile | 2 +
drivers/block/dst/Kconfig | 14 +
drivers/block/dst/Makefile | 3 +
drivers/block/dst/crypto.c | 731 +++++++++++++++++++++++++++++
drivers/block/dst/dcore.c | 963 +++++++++++++++++++++++++++++++++++++++
drivers/block/dst/export.c | 662 ++++++++++++++++++++++++++
drivers/block/dst/state.c | 838 ++++++++++++++++++++++++++++++++++
drivers/block/dst/thread_pool.c | 345 ++++++++++++++
drivers/block/dst/trans.c | 335 ++++++++++++++
include/linux/connector.h | 4 +-
include/linux/dst.h | 572 +++++++++++++++++++++++
12 files changed, 4470 insertions(+), 1 deletions(-)
As usual, one can grab the new release from the archive or via the GIT tree.
/devel/dst :: Link / Comments ()
Mon, 06 Oct 2008
New distributed storage release.
New DST release contains following changes:
- Keepalive messages for early detection of failed nodes; they are sent if there is no traffic between the nodes.
- Listening socket reuses address now, which speeds up stop/start sequence.
- Fixed bug with wrong debug option, which could read uninitialized memory.
- Changed module name from dst.ko to nst.ko, since the former is used by a DVB card.
- Whitespace cleanup.
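The keepalive rule from the list above (probe only when the link has been idle) can be sketched in userspace C as follows. This is a minimal illustration, not DST code: `struct link_state`, both helpers and `KEEPALIVE_IDLE_SEC` are hypothetical names, and the 5 second interval is an assumption, since the post does not state the real value.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical idle interval; the real DST value is not stated in the post. */
#define KEEPALIVE_IDLE_SEC 5

/* Hypothetical per-connection state: only the time of the last traffic. */
struct link_state {
	time_t last_traffic;
};

/* Call whenever any real data is sent or received on the link. */
static void link_note_traffic(struct link_state *l, time_t now)
{
	l->last_traffic = now;
}

/* A keepalive is due only if the link was silent for the whole interval. */
static bool link_keepalive_due(const struct link_state *l, time_t now)
{
	assert(now >= l->last_traffic);	/* the sketch assumes a monotonic clock */
	return now - l->last_traffic >= KEEPALIVE_IDLE_SEC;
}
```

Any regular request or reply resets the idle timer, so a busy link never carries keepalive messages.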
As usual, the patch is available from the archive or via the GIT tree.
Enjoy!
Asked for inclusion again. Let's make bets on number of comments for the patch :)
Wed, 24 Sep 2008
New DST release.
This is a maintenance release, which contains following changes:
- Use idr to manage minor numbers. Now create/remove/create sequence does not
produce new minor, but uses previous one, which is now freed.
- Added cache name to the node. It is possible to have freed node still
being alive while we register new node with the same name, so its cache name should be different.
- Wait during node removal until there are no pending transactions, so the node will be
freed in process context and not in the receiving thread itself.
- Warn user if there is no security permission config file during
export node initialization. No client will be allowed to connect
without explicit security association.
- Tune default size of the page pool for crypto processing a bit.
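The minor number behaviour from the first item (a create/remove/create sequence gets the previous minor back) is what a lowest-free-ID allocator provides. Below is a userspace sketch using a plain bitmap. It is not the kernel idr API, and `struct minor_map`, `minor_alloc()`, `minor_free()` and `MAX_MINORS` are hypothetical names chosen for the illustration.

```c
#include <assert.h>

#define MAX_MINORS 64	/* assumption for the sketch, not a DST limit */

struct minor_map {
	unsigned long long used;	/* bit i set means minor i is taken */
};

/* Return the lowest free minor and mark it used, or -1 if none is left. */
static int minor_alloc(struct minor_map *m)
{
	for (int i = 0; i < MAX_MINORS; i++) {
		if (!(m->used & (1ULL << i))) {
			m->used |= 1ULL << i;
			return i;
		}
	}
	return -1;
}

/* Release a minor so that the next allocation can hand it out again. */
static void minor_free(struct minor_map *m, int minor)
{
	assert(minor >= 0 && minor < MAX_MINORS);
	m->used &= ~(1ULL << minor);
}
```

With this scheme a node that is removed and created again ends up with the same minor, as long as no lower number was freed in between.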
I want to thank Remy Ritchen (remy.ritchen_gmail.com) for his excellent tests and analysis.
As usual, DST is available from the archive and via the git tree.
Sat, 13 Sep 2008
New distributed storage release.
This is a maintenance-only release of the DST, which brings us the following changes:
- Fixed memory leak in crypto thread initialization error path. Noticed by Sven Wegener (sven.wegener_stealer.net).
- Unprotected tree access (exceptionally stupid bug, I was made blind by the electronic equipment), and a tricky BUG_ON() catch in scsi
code caused by incorrect bio flag initialization in the exporting node. 64bit alignment fix.
Bugs reported by Rémy Ritchen.
- Couple of bogus compilation warnings about uninitialized variables caught by a different compiler.
- Allow both read and write permission, not only read or write, in security config.
The patch can be found in the git tree or the archive.
The most tricky bug is scsi's BUG_ON(), which did not even contain any DST related calls.
It was caught at drivers/scsi/scsi_lib.c:1175:
kernel BUG at drivers/scsi/scsi_lib.c:1175!
RIP: 0010:[] [] scsi_setup_fs_cmnd+0x64/0x70
...
[] ? sd_prep_fn+0xa8/0x9b0
[] ? __cfq_slice_expired+0x59/0xb0
[] ? cfq_dispatch_requests+0x8d/0x330
[] ? elv_next_request+0x119/0x250
[] ? scsi_request_fn+0x6b/0x3c0
[] ? generic_unplug_device+0x24/0x30
[] ? blk_unplug_work+0x41/0x80
Which is the following code:
int scsi_setup_fs_cmnd(struct scsi_device *sdev, struct request *req)
{
        struct scsi_cmnd *cmd;
        int ret = scsi_prep_state_check(sdev, req);
        if (ret != BLKPREP_OK)
                return ret;
        /*
         * Filesystem requests must transfer data.
         */
        BUG_ON(!req->nr_phys_segments);
Which means that the request structure did not contain any segment to process. Originally
I thought that it was because of some tricky elevator steps, which selected the wrong request queue,
because all debug showed that a sync bio (block IO request with BIO_RW_SYNC bit set)
is handled differently compared to the same request without this flag. But experiments with various
flags showed that the bug occurs no matter what, just in a completely unpredictable place.
Fortunately I managed to catch it in a debug trap in the block IO merging path, which showed me that
block IO requests with very strange read/write and flags fields were the cause of this error. Looking more
closely at the block queue allocation path, I found that its default initialization is not correct,
and my setup happens before it, so the queue did not contain the right parameters for the maximum request sizes
(hw and phys sectors). This also showed that one block IO request in the export node had clone and other
local-only fields set, which is very wrong for a bio about to be submitted, and which actually resulted in the seen bug.
Those fields were set by the client bio and should not be transferred to the remote one, so I limited the flag field
to show only that the bio is uptodate and has the blockable IO bit.
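The fix described above amounts to masking the flags received from the client down to a small whitelist before the bio is submitted on the export node. A minimal sketch, assuming made-up bit names and positions (the `XBIO_*` constants are hypothetical and do not reproduce the real kernel `BIO_*` values):

```c
#include <assert.h>

/* Hypothetical bit positions for illustration; real kernel values differ. */
#define XBIO_UPTODATE	(1UL << 0)	/* data in the bio is valid */
#define XBIO_RW_BLOCK	(1UL << 1)	/* request is allowed to block */
#define XBIO_CLONED	(1UL << 2)	/* local-only: bio is a clone */
#define XBIO_BOUNCED	(1UL << 3)	/* local-only: bio uses bounce pages */

/* Only these bits are meaningful on the remote (export) side. */
#define XBIO_WIRE_MASK	(XBIO_UPTODATE | XBIO_RW_BLOCK)

/* Strip every local-only bit from the flags received from the client. */
static unsigned long wire_flags(unsigned long client_flags)
{
	unsigned long flags = client_flags & XBIO_WIRE_MASK;

	assert((flags & ~XBIO_WIRE_MASK) == 0);
	return flags;
}
```

A whitelist is used rather than a blacklist so that any local-only bit added later is dropped by default.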
That's the story of how things were hacked this day (it's the middle of the night actually, while I'm waiting
for the taxi to take me to the airport), so the POHMELFS locking algorithm
was not implemented today, and is likely postponed to the next weekend when I return, since I got a
group theory book and made some printouts about number theory (after I finished reading Vinogradov's book),
so I will have something to read on all four planes (two in each direction) if I do not fall asleep,
and likely I will not have much time in Portland:
we will need to talk/listen to other people and check local pubs (people suggested some cocktail places, but I prefer
beer).
See you in Portland.
Tue, 09 Sep 2008
New distributed storage release: "There is no spoon, black and white".
This is a very minor DST update, which contains the following changes:
- sector_t compilation warnings removed.
- Debug, init, alloc, whatever cleanups noted by Sven Wegener (sven.wegener_stealer.net).
- Some checkpatch.pl masturbation
- New name: "Linux benevolent dictator said: there is no spoon, black and white"
Actually I fixed only a small amount of the crap returned by checkpatch.pl,
in particular I did not fix cases of long lines, when it is actually a comment added after
some variable, or things like
for (i=0; i<n; ++i) and
struct some_name
{
...
}
when checkpatch.pl wants
for (i = 0; i < n; ++i) and
struct some_name {
...
}
But I tried to remove code lines longer than 80 characters, trailing spaces and
a couple of other warnings.
Now I will concentrate on POHMELFS
locking and then distributed facilities. Stay tuned, new version will be extremely cool in this regard!
Mon, 08 Sep 2008
New distributed storage release.
It brings us following changes:
- Permission checks in export node. Read-only connections.
- Remove DST node from the global table not only when it is freed,
but also on demand with node del command.
I think project is completed. I added inclusion request
(with grammar error of course, how else) into announcement mail.
Check it out!
Wed, 27 Aug 2008
Completely new Distributed STorage (DST) release.
DST is a block layer
network device, which among others has following features:
- Kernel-side client and server. No need for any special tools for data processing (like special userspace applications) except for configuration.
- Bullet-proof memory allocations via memory pools for all temporary objects (transaction and so on).
- Zero-copy sending (except header) if supported by device, using sendpage().
- Failover recovery in case of broken link (reconnection if remote node is down).
- Full transaction support (resending of the failed transactions on timeout or after reconnect to failed node).
- Dynamically resizeable pool of threads used for data receiving and crypto processing.
- Initial autoconfiguration. Ability to extend it with additional attributes if needed.
- Support for any kind of network media (not limited to tcp or inet protocols) above the MAC layer (socket layer).
Out of the box kernel-side IPv6 support (needs to extend configuration utility, check how it was done in POHMELFS).
- Security attributes for local export nodes (list of addresses allowed to connect, with permissions). Not used currently though.
- Ability to use any supported cryptographically strong checksums. Ability to encrypt data channel.
Distributed storage was completely rewritten from scratch recently. I dropped features that essentially
mirrored the device mapper in favour of more robust block io processing
and an effective protocol.
One can grab sources (various configuration examples can be found in 'userspace' dir) from
the archive, or via the kernel and userspace GIT trees.
Sat, 23 Aug 2008
Distributed storage.
Here we go, DST
got all problems with reference counters fixed. There is a somewhat new observation
I made for myself: a block device has to provide open and release callbacks in its block device
operation structure, which have to increase and decrease the appropriate reference counters
of the underlying object, since otherwise it is possible to remove it
with proper del_gendisk(), blk_cleanup_queue() and put_disk(),
but some references will still exist in the mapping (like in the block device info structure),
so a subsequent sync will crash the machine. Also tested lots of reconnection stuff, transaction
resending, timeouts and so on.
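The reference counting rule above can be modelled in a few lines of userspace C. This is only a sketch of the idea, with a hypothetical `struct node` standing in for the DST object. The real callbacks live in the kernel block device operations structure and have different signatures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a DST node; not the kernel structure. */
struct node {
	int refcnt;	/* one reference is held by the creator */
	bool freed;	/* set once the last reference is dropped */
};

static void node_init(struct node *n)
{
	n->refcnt = 1;
	n->freed = false;
}

static void node_get(struct node *n)
{
	assert(!n->freed);
	n->refcnt++;
}

/* Drop one reference; the node is torn down only when none remain. */
static void node_put(struct node *n)
{
	assert(n->refcnt > 0);
	if (--n->refcnt == 0)
		n->freed = true;	/* real code would free disk and queue here */
}

/* What the block device open/release callbacks boil down to. */
static int node_open(struct node *n) { node_get(n); return 0; }
static void node_release(struct node *n) { node_put(n); }
```

Removing the node while it is still open only drops the creator's reference, so the object survives until the last release and a later sync still has something valid to touch.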
Actually I would make a new release, but decided to test crypto stuff first. It was copied
from POHMELFS and should work out of the box, but this requires an additional check of course.
Since tomorrow I will have an almost minute-long free fall from several kilometers up,
if weather permits, checks, bug fixes and release
are postponed to the start of the week.
Obviously only if there are no 'issues' with landing...
Stay tuned!
Tue, 19 Aug 2008
Completed DST protocol implementation.
I did not yet test crypto processing (and there is no crypto autonegotiation
yet, I will extend automatic configuration protocol to allow nested attributes,
so it could be extended in the future if there will be any need for new parameters to be
synced between client and server). Also server side does not check security attributes
during connection (like read/write per-address permissions).
Because of excessive logging there was no possibility to check performance issues,
which in turn resulted in too frequent stale transaction timer fires, excessive resending
and so on, so I introduced a maximum amount of work to be done in each scan. With this
change I was able to successfully create an ext3 filesystem on 8Gb storage connected
over a 3 MB/s link to the remote node.
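The "maximum amount of work per scan" idea can be sketched like this. It is a userspace illustration under stated assumptions: `struct trans`, its fields and `scan_stale()` are hypothetical, and updating the timestamp stands in for the actual resend.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical transaction record for the sketch. */
struct trans {
	long sent_at;	/* time of the last (re)send */
	int retries;	/* remaining resend attempts */
};

/*
 * Resend stale transactions, but stop after `budget` resends so that one
 * timer run cannot take unbounded time. Returns the number of resends done.
 */
static int scan_stale(struct trans *t, size_t n, long now, long timeout, int budget)
{
	int done = 0;

	assert(budget >= 0);
	for (size_t i = 0; i < n && done < budget; i++) {
		if (now - t[i].sent_at < timeout || t[i].retries == 0)
			continue;
		t[i].sent_at = now;	/* stands in for the actual resend */
		t[i].retries--;
		done++;
	}
	return done;
}
```

Transactions left over when the budget is exhausted are simply picked up by the next scan.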
There is also an issue with broken connection: system tries to reconnect to the server
and does not allow to unload module if there are pending block requests,
each transaction has maximum number of retries, so system waits until each one reaches zero,
which may take too long. This may or may not be a good idea actually, but I think I will
implement transaction flushing during module unloading. Server node has the same issues if there
are pending blocks requested by client and yet not sent.
And although it sounds like a lot of work, it actually is not. I just need not to digress,
which is the most complex part :)
Sun, 17 Aug 2008
DST. BIO. Barriers. Alignment.
I actually was wrong when I talked
about problems distributed storage
may have with non-page-aligned vectors inside a single block request. Actually neither client
nor server should know how the other peer works with a given block request. The only
thing which should be transferred between the peers is the start of the request and its size.
And of course flags and operation mode (i.e. automatic sync/barrier support and read/write operation).
That's it. The server will allocate as many pages for the request as needed for its own page size,
the client will also process them just like a contiguous flow of bytes coming out of the network pipe,
which are then placed into the number of pages the client was asked to read or write. Simple.
A BIO can not have holes inside it, but it can have multiple pages
to be partially filled. And this information should not be shared between the nodes at all.
Which basically means that I do not need to even think about how to handle these problems and can just
complete the protocol between the peers. Stay tuned!
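A sketch of what this means for the wire format: the header carries only start, size and flags, and each peer works out its own page count locally. `struct req_hdr` and `pages_needed()` are hypothetical names and the field widths are assumptions, since the post does not show the real DST header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wire header: nothing about the sender's page layout is sent. */
struct req_hdr {
	uint64_t start;	/* where the request begins on the storage */
	uint32_t size;	/* total length in bytes */
	uint32_t flags;	/* read/write, sync/barrier */
};

/* Number of local pages a peer needs to hold `size` contiguous bytes. */
static uint32_t pages_needed(uint32_t size, uint32_t page_size)
{
	assert(page_size != 0);
	return size / page_size + (size % page_size != 0);
}
```

For a 10000 byte request a peer with 4096 byte pages allocates three pages and a peer with 65536 byte pages allocates one, and neither has to know what the other did.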
Thu, 14 Aug 2008
Distributed storage debugging.
DST testing revealed a number of bugs, which could be easily fixed if I did not need
to debug essentially two separated subsystems of the DST: client and server. Both
share lots of code, but it is quite problematic to find who broke the protocol,
when one of them starts complaining.
So far peers can connect and start initial data exchange, but there is a major problem,
if page size differs on the nodes. Block IO request (bio structure) operates on pages
(stored in the bio_vec structure), and has a size and offset attached to each page.
If page size differs, and server node has smaller page, then it should somehow store information
about how to split own set of pages allocated for given bio size into chunks expected by the
client (since we need to transfer a size/offset pair for each page).
There is no such mechanism right now. It is possible to implement naive approach, when
server node will allocate bio pages with sizes requested by the client, but this will break
just after short time, since the only guaranteed kernel allocation in Linux VM is single page.
Another approach is to allocate the whole block request (bio) on the server for each page
of client data (bio_vec structure), but this will have too big overhead on sequential
access and in common case, when page size is equal on both sides of the network channel.
Network block device does not have this problem, since its server lives in userspace and can allocate
arbitrary amount of ram, which will be contiguous in virtual memory. Using virtual memory is very slow,
although it is possible to just allocate needed buffers using vmalloc(). iSCSI uses single
command per block request.
So far I plan to implement the following scheme for the read command (which is the only one that has the described problem):
the client will iterate over all block requests in each bio it is about to send, and will send as many commands
as there are non-contiguous blocks in the given block request. The server will receive those blocks as separate subcommands,
and will allocate a new bio for each such request. The client will need to set the transaction reference counter
to the number of such commands, since the server can reply to them in arbitrary order.
In the common case this actually should not happen, and I did not see it in practice either, since most
reading bios come either from readahead (where they are contiguous) or single block requests (which if bigger
than page size will also be contiguous), but nevertheless in theory such bios, where there is number of non-contiguous blocks,
can exist and DST should be ready for them.
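Counting the subcommands for such a read is a single pass over the blocks. The sketch below assumes a hypothetical `struct blk` with a start offset and a length. It illustrates the planned scheme and is not taken from the DST sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one block in a read request. */
struct blk {
	uint64_t start;	/* offset on the storage, in bytes */
	uint32_t len;	/* length in bytes */
};

/*
 * Count runs of contiguous blocks. Each run would become one read
 * subcommand, and the transaction reference counter would be set to
 * the returned value.
 */
static size_t count_runs(const struct blk *b, size_t n)
{
	size_t runs = 0;

	assert(b != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		/* A new run starts unless this block directly follows the previous one. */
		if (i == 0 || b[i - 1].start + b[i - 1].len != b[i].start)
			runs++;
	}
	return runs;
}
```

In the common fully contiguous case the function returns 1, so the scheme costs nothing extra for readahead and single block requests.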
Tue, 12 Aug 2008
Distributed storage testing revealed first bugs. Also some Japanese notes.
So, we need to celebrate its birth.

Japan can indeed produce a very tasty whiskey! I recommend the 'Suntory Old Whisky' label, although I got the last bottle
in my favourite shop, so I would not be surprised if it is not that popular a drink.
Wed, 06 Aug 2008
Distributed storage, POHMELFS, netchannels development.
While a lot of action around filesystems has arisen recently, I made a short pause
there and concentrated on the lower block layer: DST.
Distributed storage essentially got export capabilities, i.e. data receiving, crypto processing,
block layer request allocation and submitting, reply generation and so on,
although it is more like a proof-of-concept right now, since it requires lots and lots
of testing. There are also plans for some additional features, but that is not much work.
So project completion is very close.
POHMELFS priorities have been switched a bit. After a number of talks with people I decided
first to implement the right locking semantics (probably will be turned on/off by mount option),
which would allow simultaneous read/write to be performed the way people expect from local filesystem.
Currently it uses a bit tricky cache coherency protocol, which in some cases can end up with different
results than expected from local filesystem.
Next will be distributed server-side hash table development.
Netchannels will also
get a new release very soon. It will be simplified and some unneeded functionality (like netchannels NAT) will be removed.
I will also run some new tests with the userspace network stack,
namely latency measurements.
Wed, 30 Jul 2008
Distributed storage development progress report.
DST got full transaction support (resending, timeout completion, error recovery,
memory pool allocation for all kinds of transactions, single transaction
allocation per IO request),
socket processing (initialization of the connected and listened sockets,
failover recovery of the connection, receiving thread, network helpers),
crypto processing of the requests (thread pool utilization for crypto operations,
cipher/hash initialization, cached pages for sending crypto processing).
Thinking of moving receiving and listen/accepted sockets processing to the thread pool too,
likely it is a way to go, right now they have own threads.
Missing bits include the actual data sending/receiving and client accepting by the
listening socket (and appropriate initialization of all the needed infrastructure).
This is quite a major part, but likely it will be completed sooner rather than later.
Mon, 28 Jul 2008
Distributed storage development progress. Thread pools.
Today I implemented a simple thread pool subsystem, which allows
to create a set of threads, to add/remove them from this set
at run-time, and to schedule work to be done by them. Work
is specified as two functions: setup() - it is called when the
system has selected a thread for execution, so the caller can
set up needed data, and action() - it is called by the thread itself,
and it has access to the data provided at initialization time.
Work scheduling has a timeout parameter, which corresponds to the
time the system will wait for a free thread, otherwise an error is returned.
The system is generic enough not to contain any notion of DST or crypto,
only two new data types: struct thread_pool and
struct thread_pool_worker, of which only the former is visible to the user.
API looks like this:
void thread_pool_del_worker(struct thread_pool *p);
struct thread_pool_worker *thread_pool_add_worker(struct thread_pool *p,
        char *name,
        int (* init)(void *private),
        void (* cleanup)(void *private),
        void *private);
void thread_pool_destroy(struct thread_pool *p);
struct thread_pool *thread_pool_create(int num, char *name,
        int (* init)(void *private),
        void (* cleanup)(void *private),
        void *private);
int thread_pool_schedule(struct thread_pool *p,
        int (* setup)(void *private, void *data),
        int (* action)(void *private),
        void *data, long timeout);
init() and cleanup() callbacks above are used after
new thread is created, so that user could initialize per-thread data,
for example it is used to allocate some cached pages and initialize
crypto algorithms.
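To make the setup()/action() split concrete, here is a deliberately single-threaded userspace model of the scheduling contract. It is a sketch under loud assumptions. No threads are created, `pool_run()` stands in for the worker threads, the timeout is not modelled (a busy pool fails at once), and `struct pool`, `struct worker`, `pool_schedule()`, `copy_in()` and `double_it()` are hypothetical names, not the API above.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 2	/* assumption for the sketch */

/* Single-threaded stand-in for a pool worker; no real threads are created. */
struct worker {
	int busy;			/* work has been set up but not yet run */
	void *private;			/* per-worker data, as prepared by init() */
	int (*action)(void *private);	/* what the worker will execute */
};

struct pool {
	struct worker w[POOL_SIZE];
};

/*
 * Pick a free worker, let the caller prepare its data via setup(), and
 * queue action() for it. Returns -1 when every worker is busy; the real
 * API would instead wait for up to `timeout` before failing.
 */
static int pool_schedule(struct pool *p,
		int (*setup)(void *private, void *data),
		int (*action)(void *private), void *data)
{
	assert(setup && action);
	for (size_t i = 0; i < POOL_SIZE; i++) {
		struct worker *w = &p->w[i];
		int err;

		if (w->busy)
			continue;
		err = setup(w->private, data);
		if (err)
			return err;
		w->action = action;
		w->busy = 1;
		return 0;
	}
	return -1;
}

/* Stand-in for the worker threads: run each queued action once. */
static int pool_run(struct pool *p)
{
	int ran = 0;

	for (size_t i = 0; i < POOL_SIZE; i++) {
		struct worker *w = &p->w[i];

		if (!w->busy)
			continue;
		w->action(w->private);
		w->busy = 0;
		ran++;
	}
	return ran;
}

/* Example callbacks: setup() copies the input, action() processes it. */
static int copy_in(void *private, void *data)
{
	*(int *)private = *(int *)data;
	return 0;
}

static int double_it(void *private)
{
	*(int *)private *= 2;
	return 0;
}
```

The point of the split is that setup() runs in the caller's context while the caller's data is still valid, and action() later touches only the per-worker private area.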
This thread pool system is used by the crypto processing code in
the distributed subsystem: when block io request is about to be sent,
or when system has received reply for the read request, it schedules
crypto processing work to the pool, initialized at DST node setup time.
Crypto processing does not yet work in DST, as well as some other bits,
so far I only played a bit with its initialization sequence, so it was
split into network, crypto, security initializations and node start, which
registers the new storage in the block layer subsystem. These steps allow to introduce
additional initialization steps later if needed without breaking backward
compatibility.
Next steps include proper network initialization and processing and transaction
management helpers. Then I will combine all existing code and make a first
renewed release.
Stay tuned!
Sat, 19 Jul 2008
Distributed storage is dead, long live the Distributed storage!
As you may know, the DST
project was an attempt to implement a redundant, failover resistant, flexible block level storage
subsystem. Among other features it supported the ability to map multiple remote nodes via linear
or mirroring algorithms to a single node, reconnect to a failed node, read balancing and
parallel writing to multiple nodes (in case of mirroring) and
so on.
Now it has gone. There is no more distributed storage you knew before, instead there is
a completely new project being developed, whose main goal is to provide a transport layer for
the block requests only. Consider it as Network Block Device on huge steroids. Consider it
as iSCSI on huge steroids. Consider it as ATA-over-Ethernet on even more huge steroids.
It is just an example of what all those protocols should have. And only that.
And it does not sound very ambitious, since previous DST versions already supported lots of features,
which never existed (and in some cases were impossible to add) in other block level
network storages.
DST moves further.
There will be no mirroring and overall ability to map multiple devices into a single one,
instead one should use Device Mapper for this goal, since its features were simply mirrored
(although I tried to optimize them sometimes) in DST, and the number of targets was noticeably smaller.
Now DST is just a simple block device which operates on top of a network connection. With just a
single exception: it's done right.
Features planned for the new Distributed Storage:
- kernelspace client and server
- initial autoconfiguration between client and server nodes
- automatic reconnect to failed target
- transaction model: resending, timeout error completion, full rollback of the failed transaction
- wire speed performance
- data channel encryption, strong checksumming
- cryptographic authentication
- ability to work on top of any network protocol
- barriers support (when, if ever, Device Mapper starts to support them, DST will not need to be changed)
- flexible protocol with simple ability to extend it to needed functionality
- trivial configuration
Project is being written from scratch, but it is actually very simple,
and should be quite small, so expect its first release quite soon.
It will be pushed upstream when ready.
Fri, 18 Jul 2008
Completed distributed storage redesign.
I also managed to play second octave F# and sometimes the whole chromatic scale
down to small (minor?) octave F on my trumpet, and I believe I started to understand
overall trumpet kung-fu, but I expect it is not what you wanted to read under
the DST tag.
So, DST becomes smaller, cleaner and simpler. Notably, I decided to drop userspace
target completely for now.
Kernel part now operates on transaction entity, which holds a reference to the node,
where data should be sent/received. There can be at most two such nodes if block IO
request spans the boundary. In case of mirroring (which will be dropped for the first release)
list of nodes to mirror this data to will be maintained by the first node, so transaction
will not need to know about them.
In theory block request can be as much as BIO_MAX_PAGES pages,
which is 256 for now, but I decided to limit minimum node size to be not smaller than
above bio limit, so there will be always at most two nodes per request.
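The "at most two nodes" claim follows from simple arithmetic, shown here for the simplified case where all nodes have the same size. That equal size is an assumption of this sketch only (the post requires just a minimum node size), and `nodes_spanned()` is a hypothetical helper with all values counted in pages.

```c
#include <assert.h>
#include <stdint.h>

/* Number of fixed-size nodes touched by the request [start, start + len). */
static uint64_t nodes_spanned(uint64_t start, uint64_t len, uint64_t node_size)
{
	assert(len > 0 && node_size > 0);
	return (start + len - 1) / node_size - start / node_size + 1;
}
```

As long as `len` does not exceed `node_size`, the last page of the request lies less than one node past the first one, so the result can only be 1 or 2.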
Each node has either block device behind it (so it will just call generic_make_request()
with different block device for given bio), or network state machine.
Network state will have two threads: RX and TX. Receive one is used to get replies for the
read/write messages, search appropriate transaction and complete it.
In case of DST server it will also handle read/write requests and generate replies, but the whole
processing will be exactly the same, client node will have a switch to process read/write requests from
the network, but they should be only received by server.
Sending thread is tricky.
It is used as fallback for non-blocking sockets, which are used first at generic_make_request()
time, i.e. when higher level user performed read or write, if block was not fully sent,
then it is queued to this thread and it will try to send the rest of the data when
polling allows. ->make_request_fn() function returns in this case and higher
layer can proceed with own operations.
Transaction is not freed until reply is received from the remote side or resending retry
count fires.
Transaction is always allocated (from the appropriate memory pool) and those are actually
all the allocations in DST itself. In case it works with block devices, it is possible to clone a bio
when it crosses the boundaries (or even always, I have to check it, but it is essentially
what device mapper does, with lots of its own additional allocations), but it should be a very rare condition.
Network stack will allocate data itself too.
That was a theory. Practice tells me, that essentially 90% of the code should be rewritten
from scratch, so I recloned the tree and so far implemented generic bits of registering
block device, creating various sysfs files and directories and other similar trivial bits.
I still plan to finish it this weekend (without mirroring), but things may turn out differently though...
Tue, 15 Jul 2008
Distributed storage development roadmap.
Yes, the DST
project is alive and will beat out the crap very soon, since I decided to change its
underlying architecture, and switch to a transaction model just like POHMELFS.
This basically means that as long as system has enough RAM writing operations will be
extremely fast, reading can be balanced between multiple nodes (in mirror), transactions
can be resent, failover mechanism becomes much simpler,
and system overall will be much more robust to failures.
Transaction model also means that the system requires an explicit acknowledgement from the remote side,
and there are two possibilities here: to handle the implicit ack which comes with TCP ack
packets like I experimented with
before, or to send an explicit ack from the server for each client's request.
The former approach, although it has smaller performance overhead, still suffers from
the fact that pages sent via DST are always stateless, i.e. at this layer there is
no knowledge about who sends this page. We can determine the inode the page belongs to, can
even get a socket when the page is about to be released after the ack has been received,
but we can not know from exactly which pipe it was submitted into the given socket,
so when multiple threads send the same page via multiple sendfile()
calls we do not know when and how the page will be released. We can put the pipes this page belongs
to into a singly-linked list (since the page has only two pointers unused at this point: the LRU
list head, and one of them is used to determine that this page belongs to the sendfile()/splice codepath),
and likely traversing this list will not hurt usual users, but a malicious one can
create a local DoS with this approach. After some experiments with the splice code
today I decided to drop the implementation of this idea for now.
There is a strong argument in favour of explicit acks from the server: this allows to make transaction
processing asynchronous (with implicit acks we can not hook into the processing path, since we do not know where exactly
the skb with our pages is chained), and this does not hurt performance (which was proven by
POHMELFS benchmarks).
So, overall plan to develop DST is to switch to transaction model and perform async processing
of all events (there are only two actually: reading and writing of the given pages to given
locations).
This task is not that complex, so I expect some new results later this week. Stay tuned!
Sun, 30 Mar 2008
Continue DST roadmap.
So, I have to admit that I rethought my opinion
about mirroring/redundancy at filesystem layer - it is useful for lots of cases,
and modulo bugs in DST
mirroring (mostly a leak, which I can not find in my lab,
and a network/block layer race,
which has existed in sendfile() for years and just strikes DST a lot,
though it has a workaround) I decided to rewrite the mirroring algorithm in a way
it could be used in other projects.
There is also an idea of how to fix the above-mentioned network/block layer race in a
very non-disturbing manner, which was privately called soft DST barriers.
The idea is to replace the skb destructor with a private one, which will confirm that
pages are no longer used (for example call bio_endio() or
release the splice buffer); this callback will be installed only for special sockets
which provide it (like DST, sendfile() or any other
->sendpage() users like samba). The idea was not killed at its roots,
which is a good starting sign.
Thu, 27 Mar 2008
Distributed storage roadmap.
The DST
project was quiet for a while, but actually it is not dead.
There is a bug in the mirror algorithm, which I am considering rewriting. Not because of
this bug, but because it will be used in a special setup, where its extension is required.
Consider a high-available *SQL cluster with multiple storage nodes combined into
mirror and several main systems, which operate with database software. Unfortunately
only single main system works with queries, other has to be turned on when first one fails.
Task is to create a system, which will automatically switch between main nodes and
recover if either main nodes or storage nodes become unavailable, so that the whole
system does not stop if something wrong happened with the machines. It has to scale
to tens of nodes as a must and later hundreds without problems.
This is not a performance scalability solution - so far only single node should be able to
collect multiple data nodes into storage, and if that node fails it has to be switched,
but so far I do not know any working and free solution for the problem. But solution created
for the main node switching can be used in cases when any server (for example metadata server
in cluster) failed and has to be switched.
It will also force me to finally implement barriers in DST.
As a possible helper for availability messages
I am considering the abandoned CARP-like protocol (in userspace).
Thu, 31 Jan 2008
BTRFS subvolumes.
Chris Mason created a short specification
for the subvolumes in BTRFS. Subvolumes allow the filesystem to allocate blocks
on several devices and use tricky algorithms to distribute the load
between storages.
Overall this is an excellent idea, but the specification raises some questions and I believe
it is too heavily tied to the ZFS design.
I will drop my thoughts here, which may be completely wrong though.
Here are some features btrfs will support with subvolume implementation:
Mirrored metadata, configurable up to N mirrors (where N > 2); Mirrored data extents;
Checksum failure resolution by using a mirrored copy; Striped data extents and others.
They are clear targets for the block layer, but the specification gives the following notes on why they are not implemented there:
If Btrfs were to rely on device mapper or MD for mirroring, it would not be able to resolve checksum
failures by checking the mirrored copy. The lower layers don't know the checksum or granularity of
the filesystem blocks, and so they are not able to verify the data they return.
Well, that's not entirely correct, since the checksum has to be checked not against the other mirror, but
against the data itself (i.e. it has to be recalculated after the read), since data can be damaged
during transfer, and that is not such a rare condition. Checksums from different mirrors can both
be wrong but equal, which without recalculation would signal that everything is ok while it is not.
Recalculating a block checksum can also be faster for smaller blocks than reading it from another disk.
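A minimal sketch of this argument, assuming CRC32 from Python's zlib as a stand-in checksum (the function names are hypothetical): comparing the stored checksums of two mirrors says nothing about the data that was actually read, while recalculating the checksum over that data does.

```python
import zlib


def checksum(data: bytes) -> int:
    return zlib.crc32(data) & 0xFFFFFFFF


def mirrors_agree(stored_a: int, stored_b: int) -> bool:
    # Only compares the checksums kept on two mirrors.
    return stored_a == stored_b


def verify_block(data: bytes, stored: int) -> bool:
    # Recalculates the checksum over the data that was actually read.
    return checksum(data) == stored


original = b"important filesystem block"
stored = checksum(original)

# One byte was damaged in flight; both mirrors still hold the same checksum.
received = b"important filesystem clock"

print(mirrors_agree(stored, stored))   # True: looks fine, but proves nothing
print(verify_block(received, stored))  # False: the damage is detected
```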
If Btrfs were to rely on device mapper for aggregating all of the physical devices into a single big
address space, it would not have sufficient information to allocate mirrored copies on different devices.
Keeping this information in sync between Btrfs and the device mapper would be difficult and error prone.
Actually it is very simple. DST
supports such interaction, for example.
Instead I propose and will use the following scheme for subvolumes (I like the name) in a local filesystem:
there is a pool of devices, and there are allocation policies for each one in the following form (just an example):
files matching the '*.jpg' pattern are allocated from device 1, '*.log' from device 2, metadata is stored on device 3,
small files are allocated on device 4, and so on. Then each device has its own policy on mirroring its data to the needed
number of storages.
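The policy scheme above can be sketched as a small lookup table. Everything here is an assumption made for illustration: the table layout, the 4096-byte threshold for "small" files, the default device 0 and the per-device mirror counts are not part of any real filesystem.

```python
from fnmatch import fnmatchcase

# Hypothetical policy table, checked in order: (kind, argument, device).
POLICIES = [
    ("pattern", "*.jpg", 1),
    ("pattern", "*.log", 2),
    ("small", 4096, 4),      # files of up to 4096 bytes
]
METADATA_DEVICE = 3
DEFAULT_DEVICE = 0

# Hypothetical per-device mirroring policy: device -> number of copies.
MIRROR_COPIES = {0: 1, 1: 2, 2: 1, 3: 3, 4: 2}


def pick_device(name, size, is_metadata=False):
    """Return the device a new allocation should come from."""
    if is_metadata:
        return METADATA_DEVICE
    for kind, arg, device in POLICIES:
        if kind == "pattern" and fnmatchcase(name, arg):
            return device
        if kind == "small" and size <= arg:
            return device
    return DEFAULT_DEVICE


dev = pick_device("photo.jpg", 2_000_000)
print(dev, MIRROR_COPIES[dev])  # 1 2
```

Because the table is checked in order, a small '*.log' file still goes to device 2; putting the size rule first would change that.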
And, as a side note, it looks like Chris Mason uses Mac OS X for development or at least for writing documentation,
since a screenshot of high-level design
clearly has Mac's shadows and fonts :)
/devel/dst :: Link / Comments ()
Tue, 22 Jan 2008
New DST release: Succumbed to live ant.
This is a maintenance release only and it contains
only the following change:
- do not allocate the rather big address structure on the stack
during local export node initialization
Great thanks to Serge Leschinsky and Konstantin Kalin for testing.
As usual one can get the latest version from
project homepage
or via git tree.
/devel/dst :: Link / Comments ()
Wed, 26 Dec 2007
New release of the distributed storage: Groundhogs strike back: no New Year for humans!
Short changelog:
- mirroring algorithm improvements
- debug cleanups
- extended mirroring initialization
- documentation update
- name is 'Groundhogs strike back: no New Year for humans' now
As usual, one can get patch or pull changes from the project
homepage.
/devel/dst :: Link / Comments ()
Mon, 17 Dec 2007
New release of the distributed storage: Dancing with the smoked neutrino.
Short changelog:
- new improved mirroring algorithm.
This algorithm uses a sliding window approach for full resync
and a write log for partial resync.
- fixed a number of typos and debug cleanups
- update inode size when the linear algorithm changes the size of the
storage at run time
- extended number of sysfs files and documentation for them
- fixed leak in local export node setup
- name is 'Dancing with the smoked neutrino' now
Overall list of features of the DST can be found on project's
homepage.
DST is also exported as a git tree available for clone and pull from
here.
An interested reader can test DST with the 2.6.23 tree too
(it should compile fine, but was not tested).
/devel/dst :: Link / Comments ()
New distributed storage mirroring algorithm.
Resync logic - sliding window algorithm.
At startup the system checks the age (unique cookie) of each node, and if it
does not match that of the first node, it resyncs all data from the first node in
the mirror to the others (non-synced nodes). Each non-synced node has a
window, which slides from the start of the node to the end.
During resync all requests which enter the window are queued, thus the
window has to be sufficiently small. When the window is synced from the
other nodes, queued requests are written and the window moves forward,
thus the next resync is started only when the previous window is fully completed.
When the window reaches the end of the node, the node is marked as synchronized.
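The sliding window can be sketched in a few lines, with Python lists of blocks standing in for the nodes. This is only a model under stated assumptions, not the DST code: the `incoming` argument (writes arriving while a given window is being synced) is hypothetical, and so is the handling of writes outside the window (behind it they go to both nodes, ahead of it only to the source, since that region is copied later).

```python
def resync(source, target, window, incoming=None):
    """Copy source to target window by window; True when both are equal.

    incoming maps a window start to the (block, value) writes that
    arrive while that window is being synced.
    """
    incoming = incoming or {}
    size = len(source)
    start = 0
    while start < size:
        end = min(start + window, size)
        queued = []
        for block, value in incoming.get(start, []):
            if start <= block < end:
                queued.append((block, value))  # inside the window: hold back
            elif block < start:
                source[block] = value          # already synced: write to both
                target[block] = value
            else:
                source[block] = value          # not reached yet: copied later
        target[start:end] = source[start:end]  # sync the window itself
        for block, value in queued:            # replay held-back writes
            source[block] = value
            target[block] = value
        start = end                            # slide the window forward
    return source == target


source = list(range(10))
target = [None] * 10
writes = {0: [(1, "a"), (7, "b")], 4: [(0, "c"), (5, "d")]}
print(resync(source, target, window=4, incoming=writes))  # True
```

A smaller window means fewer requests are held back at any moment, at the price of more sync rounds.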
If the age of the node matches the first one, but its log contains a different
number of write log entries compared to the first node (the first node always
stands as the clean one), then a partial resync is scheduled.
A partial resync will also be scheduled when the log entry pointed to by the resync
index of the node contains an error.
The mechanism of this resync type is the following: the system selects a sync node
(checking each node's flags), fetches the log entry pointed to by the resync
index of the given node and resyncs data from the other nodes to the given one.
Then it scans the rest of the write log for
other failed writes, so that the next resync block will be fetched for
them.
The mirroring log is used to store write request information.
It is allocated on disk and in memory (sync happens each time the
resync work queue fires), and eats about 1% of free RAM or disk
(whichever is less). Each write updates the log, so when a node goes offline,
its log will be updated with error values, so that these entries
can be resynced when the node is back online. When the number of
failed writes becomes equal to the number of entries in the write log,
recovery becomes impossible (since old log entries were overwritten)
and a full resync is scheduled.
This does not work well when there are multiple
writes to the same locations: they are considered different
writes and thus will be resynced multiple times.
The right solution is to check the log on each write, which would be better if the log
were not an array, but a tree.
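To illustrate the difference, here is a sketch of both variants. The class names are hypothetical, a Python set stands in for the tree, and dropping a location from the keyed log after a later successful write is an assumption of this sketch rather than something DST does.

```python
from collections import deque


class ArrayLog:
    """Fixed-size write log; old entries are overwritten by new ones."""

    def __init__(self, capacity):
        self.entries = deque(maxlen=capacity)

    def record(self, sector, ok):
        self.entries.append((sector, ok))

    def plan(self):
        failed = [s for s, ok in self.entries if not ok]
        if len(failed) == self.entries.maxlen:
            return "full", []          # log is exhausted: full resync
        return "partial", failed       # duplicates are resynced repeatedly


class KeyedLog:
    """Log keyed by location, so each location is resynced once."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.failed = set()

    def record(self, sector, ok):
        if ok:
            self.failed.discard(sector)
        else:
            self.failed.add(sector)

    def plan(self):
        if len(self.failed) >= self.capacity:
            return "full", []
        return "partial", sorted(self.failed)


array_log, keyed_log = ArrayLog(8), KeyedLog(8)
for sector in (5, 5, 9, 5):            # node is offline, all writes fail
    array_log.record(sector, ok=False)
    keyed_log.record(sector, ok=False)

print(array_log.plan())  # ('partial', [5, 5, 9, 5])
print(keyed_log.plan())  # ('partial', [5, 9])
```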
/devel/dst :: Link / Comments ()
Fri, 14 Dec 2007
Linux Test Project on top of DST storage.
# pwd
/mnt/ltp-full-20071130
# ./runltp -p -f fs -d `pwd`/tmp
...
# cat /mnt/ltp-full-20071130/results/results.2007-12-14.11.21.41.17106
Test Start Time: Fri Dec 14 11:21:41 2007
-----------------------------------------
Testcase Result Exit Value
-------- ------ ----------
gf01 PASS 0
gf02 PASS 0
gf03 PASS 0
gf04 PASS 0
gf05 PASS 0
gf06 PASS 0
gf07 PASS 0
-----------------------------------------------
Total Tests: 57
Total Failures: 0
Kernel Version: 2.6.22-rc5-dst
Machine Architecture: x86_64
Hostname: uganda
# mount | grep mnt
/dev/dst-storage-32 on /mnt type xfs (rw)
# cat /sys/devices/storage/n-0-ffff*/type
R: 192.168.4.81:1025
R: 192.168.4.81:1026
All 'fs' tests completed successfully, although I saw the following dump in dmesg:
[ 8398.605691] BUG: MAX_LOCK_DEPTH too low!
[ 8398.609641] turning off the locking correctness validator.
which is an XFS bug.
Since DST is quite a dumb device, these tests will not find tricky places, but they are good
for generating high load on top of the given block device.
/devel/dst :: Link / Comments ()
New mirroring module in the distributed storage.
$ git-diff-index --stat HEAD drivers/block/dst/alg_mirror.c
drivers/block/dst/alg_mirror.c | 745 ++++++++++++++++++++--------------------
1 files changed, 364 insertions(+), 381 deletions(-)
It is cool and works well in my environment, but (like the previous one) it
forces a total mirror resync after the main storage node reboots or crashes (if it is
required, for example when the array was not in sync already and the main node rebooted).
I want to extend the DST mirroring algorithm not to force a full resync, but to store a log
of the writes on each node, so when a new array starts, it would check not only
the age of the nodes (a unique id stored at the end of each node; if it does not match,
a total resync starts), but also the write log, so that if the latter does not match, only
the selected regions would be synchronized.
Stay tuned...
/devel/dst :: Link / Comments ()
Thu, 13 Dec 2007
Why pushing project into the kernel is not a main goal?..
One has to have some courage and not be afraid to throw something
out and create new things instead of old ones, even if it requires a lot
of effort and causes some problems in the short term.
So I've just erased the mirroring algorithm from DST and will rewrite it mostly
from scratch, since I have a very interesting sync algorithm in mind,
which will not require a clean/dirty bitmap.
Having DST in the kernel would not allow me such flexibility...
/devel/dst :: Link / Comments ()
Wed, 12 Dec 2007
I was a bit pessimistic about DST design bugs.
Things are only bad when resync of the mirror node is in progress...
I fixed both issues, but will spend additional time debugging and testing
them, since I do not like how it was done. I think I will rewrite the mirroring
resync logic.
Subrata Modak of IBM suggested using the
Linux Test Project, which I found to
have interesting benchmarks, which, while being most useful for filesystem
development, can still find some bugs in DST.
/devel/dst :: Link / Comments ()
Shame on me or how complex are design bugs...
I have to admit that mirroring in DST is currently not well supported.
First, because of a bug I made in the early development stage: in DST there
are two objects which represent a part of the storage. The first one is a node;
this object contains information about the type of the storage and pointers to the
structure which represents the low-level device itself (like a block device or a network
connection). A network connection in turn is represented as a state structure,
which contains the socket, the state machine for transferred data and so on.
Nodes are used when a block io request comes from the higher layer and
states are used when data is transferred via the network. The former uses
fine-grained reference counters: when a node is being operated on (a request is processed),
its reference counter is increased; if operations become asynchronous
(for example the sending queue is full and thus the block can not be sent right now),
then the block request is queued into the state's request list and the reference counter for
the node is dropped. If it reaches zero, the node is freed, which in turn
calls the exit callback for the state, which flushes the queue of requests.
Things seem simple and correct, but the devil is in the details: the async processing thread
can enter the game at any point and process the state too, which leads to bugs.
Second, DST mirroring can eat all your memory during resync, since it does not check the
amount of free RAM in the system and tries to allocate new pages until all memory is used.
This is already fixed in the private tree though.
And the last (known) problem is the mirror bitmap: it uses a single bit per sector
of the device, and although it uses vmalloc(), it still takes too much RAM.
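A quick back-of-the-envelope check of the bitmap cost, assuming 512-byte sectors (the sector size and the 1 TiB device are assumptions made for the example):

```python
SECTOR_SIZE = 512  # bytes per sector, assumed


def bitmap_bytes(device_bytes, sector_size=SECTOR_SIZE):
    """RAM needed for a bitmap with one bit per sector."""
    sectors = device_bytes // sector_size
    return (sectors + 7) // 8  # round up to whole bytes


TIB = 1 << 40
MIB = 1 << 20
print(bitmap_bytes(TIB) // MIB)  # 256
```

So every terabyte of mirrored storage costs 256 MiB of RAM for the bitmap alone.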
Back to fixing.
/devel/dst :: Link / Comments ()
Mon, 10 Dec 2007
New distributed storage release: Gamardjoba, genacvale!
Short changelog:
- wakeup state when mirror detected error to speed up reconnect
- if connecting in csum mode to no-csum server, do not enable csums
- do not clean queue until all users are removed
- allow to increase size of the storage in linear add callback
(with this change it is possible to add nodes into linear array
in real time without stopping storage. Filesystem has to be prepared
for the case when underlying device has changed its size.
Real-time addition of mirror nodes is also supported)
- allow to delete gendisk only after device was started
- dst debug config option
- Name: Gamardjoba, genacvale! ('Hi friend' in Georgian)
Great thanks to Matthew Hodgson (matthew_mxtelecom.com) for debugging!
As usual, one can get new release from the project homepage.
/devel/dst :: Link / Comments ()
Fri, 07 Dec 2007
Strong checksums in DST rock.
Great thanks to the person who suggested that I
implement them, and to Zach Brown, who showed that
the Castagnoli crc is better than Adler.
I've debugged a setup where the system failed to mount an XFS filesystem on top of distributed storage,
and after I turned on strong checksums, the system detected they were wrong, so some corruption
happened during filesystem setup.
Turning off TSO, RX and TX offload on the e1000 NICs of the machines which form the storage fixed the problem.
Strong checksums rock!
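For reference, a bit-by-bit sketch of the Castagnoli CRC (CRC32C) next to Adler-32 from Python's zlib. This is a slow illustrative implementation, not the kernel's code; the 512-byte block and the flipped bit are made up to mimic corruption on the wire.

```python
import zlib

CRC32C_POLY = 0x82F63B78  # Castagnoli polynomial, bit-reflected form


def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# Standard check strings for both algorithms.
print(hex(crc32c(b"123456789")))        # 0xe3069283
print(hex(zlib.adler32(b"123456789")))  # 0x91e01de

# A single bit flipped in flight is caught on the receiving side.
block = bytes(512)
damaged = bytearray(block)
damaged[100] ^= 0x01
print(crc32c(block) != crc32c(bytes(damaged)))  # True
```

Note how the upper half of the Adler-32 value stays small for a short input, which is one commonly cited weakness of Adler on small blocks.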
/devel/dst :: Link / Comments ()
Distributed storage and long distances.
I've just completed some tests of the distributed system
created on top of usual internet links between machines
located in Moscow, Russia and London, UK.
A remote target was set up, then an XFS filesystem was created and mounted,
and some tests were run.
One of the machines (the main storage server) is located behind at least
one NAT firewall.
/devel/dst :: Link / Comments ()
Thu, 06 Dec 2007
A simple way to crash machine using XFS and DST.
Let's suppose you want to create an XFS filesystem on top of a DST array.
If you mistakenly run mkfs.xfs /dev/sda1 (let's suppose
you want to create the DST storage on top of the /dev/sda1 device)
and then start DST on top of /dev/sda1:
./dst -n storage -A alg_mirror -d /dev/sda1 -R -s0 -S0
this will overwrite the last sector of /dev/sda1,
where XFS stores its metadata. Mounting XFS after that will almost
certainly crash the machine on 2.6.22 kernels because of some
bugs in XFS, which appear when XFS reads corrupted metadata from the
last sector.
To work with DST you have to operate on the /dev/dst-$storage-$num
devices (i.e. run mkfs.xfs /dev/dst-$storage-$num), and not on the
underlying ones.
/devel/dst :: Link / Comments ()
Wed, 05 Dec 2007
Storage hotplugging in DST.
For the interested reader: yes, it is possible to add disks
into DST storage on the fly, but be sure that your filesystem supports that
(in the case of a linear setup); mirroring is fairly transparent.
The command to add another node into a mirror setup is pretty simple:
./dst -n storage -A alg_mirror -S0 -s0 -a kano -p 1026
Just like adding a usual node into the storage before it was started.
Please note that when adding a node which is smaller than the current device size,
the device size will be reduced and this can damage your filesystem!
The same applies to the linear setup.
/devel/dst :: Link / Comments ()
Tue, 04 Dec 2007
DST FAQ.
The most frequently asked question about DST is:
Can you give us a summary of how this differs from using device mapper with NBD or iSCSI?
The answer is quite simple:
From a high-level point of view it does not, but it operates quite differently:
it has async processing of the requests, thus it does not block; it has a
different protocol with smaller overhead, supports strong checksums, and has an
in-kernel export server, which supports simple security attributes (i.e.
permission to connect, to read or to write). It uses a smaller amount of memory
(zero additional allocations in the common path for linear mapping,
not including network allocations, and a smaller number of additional
allocations for the mirroring case).
DST supports failure recovery in case of a dropped connection (the core will
reconnect to the remote node when it is ready), thus it is possible to
turn remote nodes off and on without special administration steps. DST
has simple autoconfiguration at startup time (checksum support and
storage size autonegotiation). It is possible to turn one of the mirror
nodes off and use it as an offline backup, since a dst mirror node stores
its data at the end of the storage, so it can be mounted locally.
/devel/dst :: Link / Comments ()
New distributed storage subsystem release.
This is a maintenance release and includes
bug fixes and simple feature extensions only.
Short changelog:
- fixed bug with XFS metadata update (it can provide slab pages to the
DST, so it is not allowed to transfer them using
->sendpage())
- fixed async error completion path
- extended netlink communication channel to report errors back to userspace
- DST name is now "The 10'th dynasty of smuggled slothes"
- a number of fixes for the userspace DST target
Great thanks to Matthew Hodgson (matthew_mxtelecom.com) for debugging and
fixes for userspace DST target and preliminary netlink extension patches.
As usual you can download this release from the homepage.
If you want to try distributed storage this release is a really good candidate to start with.
Enjoy!
Update: This release includes bug fixes for all bugs described
here,
including uninterruptible sync read operations.
/devel/dst :: Link / Comments ()
Thu, 29 Nov 2007
Astonishingly screwed tapeworm.
New release of the distributed storage
subsystem. This is a maintenance release and includes bug fixes only.
Short changelog:
- use node's size in sectors instead of bytes
- fixed old/new ages for the first node. Error spotted by Matthew Hodgson (matthew_mxtelecom.com)
- fixed debug printk declaration
- new name
Overall list of features of the DST can be found on project's
homepage.
/devel/dst :: Link / Comments ()
Tue, 20 Nov 2007
Maintenance release of the distributed storage subsystem.
It contains only the following bug fix:
- Cleanup sysfs files on error path. Patch by Chris Madden (chris_reflexsecurity.com)
You can find the latest release on the project homepage.
/devel/dst :: Link / Comments ()
Thu, 15 Nov 2007
CEPH distributed storage.
It was announced on LWN and kerneltrap recently.
I already wrote
about this filesystem; after that I found out
(from a discussion with Zach Brown)
that this filesystem does not have byte-range locking, and when a number
of threads write to the same file, their writes become sync writes
(i.e. no cache coherence protocols are involved). I'm also not
sure what this is about: I/O workloads should be done with the client
cache off because the writeback is too non-deterministic.
Those were my envious comments :), now the good news.
First, Sage Weil (an author) works full-time on this project
and funds it from his own web hosting company, so it is possible to attract
developers with money (he even hired someone to write a kernel client
instead of the FUSE one). Second, it has a completed design and a working
implementation (although some design issues are questionable).
So it is likely a good choice for you to take a look at, if you are searching
for a solution which should be ready shortly.
/devel/dst :: Link / Comments ()