Sat, 19 Jul 2008
Distributed storage is dead, long live the distributed storage!
As you may know, the DST
project was an attempt to implement a redundant, failover-resistant, flexible block-level storage
subsystem. Among other features, it could map multiple remote nodes into a single node via linear
or mirroring algorithms, reconnect to a failed node, balance reads,
write to multiple nodes in parallel (in the mirroring case), and
so on.
Now it is gone. The distributed storage you knew before is no more; instead, a
completely new project is being developed, whose main goal is to provide a transport layer for
block requests only. Consider it a Network Block Device on huge steroids. Consider it
iSCSI on huge steroids. Consider it ATA-over-Ethernet on even more huge steroids.
It is just an example of what all those protocols should have. And only that.
And if that does not sound very ambitious: previous DST versions already supported lots of features
which never existed (and in some cases were impossible to add) in other block-level
network storage systems.
DST moves further.
There will be no mirroring, and no ability to map multiple devices into a single one at all;
one should use Device Mapper for that instead, since DST simply duplicated its features
(although I sometimes tried to optimize them), and the number of targets was noticeably smaller.
Now DST is just a simple block device which operates on top of a network connection. With just a
single exception: it's done right.
Features planned for the new Distributed Storage:
- kernelspace client and server
- initial autoconfiguration between client and server nodes
- automatic reconnect to failed target
- transaction model: resending, timeout error completion, full rollback of the failed transaction
- wire speed performance
- data channel encryption, strong checksumming
- cryptographic authentication
- ability to work on top of any network protocol
- barrier support (if and when Device Mapper starts supporting them, DST will not need to be changed)
- flexible protocol with simple ability to extend it to needed functionality
- trivial configuration
Project is being written from scratch, but it is actually very simple,
and should be quite small, so expect its first release quite soon.
It will be pushed upstream when ready.
/devel/dst :: Link / Comments (8)
Fri, 18 Jul 2008
Completed distributed storage redesign.
I also managed to play second octave F# and sometimes the whole chromatic scale
down to small (minor?) octave F on my trumpet, and I believe I started to understand
overall trumpet kung-fu, but I expect it is not what you wanted to read under
DST tag.
So, DST becomes smaller, cleaner and simpler. Notably, I decided to drop userspace
target completely for now.
Kernel part now operates on transaction entity, which holds a reference to the node,
where data should be sent/received. There can be at most two such nodes if block IO
request spans the boundary. In case of mirroring (which will be dropped for the first release)
list of nodes to mirror this data to will be maintained by the first node, so transaction
will not need to know about them.
In theory block request can be as much as BIO_MAX_PAGES pages,
which is 256 for now, but I decided to limit minimum node size to be not smaller than
above bio limit, so there will be always at most two nodes per request.
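The "at most two nodes" claim can be checked with a little arithmetic. Below is a minimal sketch, assuming 4 KiB pages and a plain linear layout where nodes are laid end to end; `nodes_for_request` and its arguments are hypothetical names for illustration, not part of DST.

```python
from bisect import bisect_right

PAGE_SIZE = 4096                          # assumption: 4 KiB pages
BIO_MAX_PAGES = 256                       # value quoted in the post
MAX_REQUEST = BIO_MAX_PAGES * PAGE_SIZE   # largest request, in bytes


def nodes_for_request(node_sizes, offset, length):
    """Return indexes of the linear nodes touched by [offset, offset + length)."""
    if length <= 0 or length > MAX_REQUEST:
        raise ValueError("request length out of range")
    if any(size < MAX_REQUEST for size in node_sizes):
        raise ValueError("node smaller than the maximum request size")
    # starts[i] is the first byte of node i in the linear address space
    starts = [0]
    for size in node_sizes[:-1]:
        starts.append(starts[-1] + size)
    total = starts[-1] + node_sizes[-1]
    if offset < 0 or offset + length > total:
        raise ValueError("request outside of the storage")
    first = bisect_right(starts, offset) - 1
    last = bisect_right(starts, offset + length - 1) - 1
    return list(range(first, last + 1))
```

Since a request is never longer than the smallest node, its last byte cannot get past the node that follows the one holding its first byte, so the returned list has one or two entries.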
Each node has either block device behind it (so it will just call generic_make_request()
with different block device for given bio), or network state machine.
Network state will have two threads: RX and TX. Receive one is used to get replies for the
read/write messages, search appropriate transaction and complete it.
In case of DST server it will also handle read/write requests and generate replies, but the whole
processing will be exactly the same, client node will have a switch to process read/write requests from
the network, but they should be only received by server.
Sending thread is tricky.
It is used as fallback for non-blocking sockets, which are used first at generic_make_request()
time, i.e. when higher level user performed read or write, if block was not fully sent,
then it is queued to this thread and it will try to send the rest of the data when
polling allows. ->make_request_fn() function returns in this case and higher
layer can proceed with own operations.
Transaction is not freed until reply is received from the remote side or resending retry
count fires.
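The transaction lifetime described here (kept until a reply arrives or the resend count runs out) can be sketched in userspace as follows. This is only an illustration under my own assumptions: `Transaction`, `TransactionTable` and the `send` callback are hypothetical names, and the real kernel code works with bios, sockets and memory pools rather than Python objects.

```python
class Transaction:
    def __init__(self, tid, payload, max_retries=3):
        self.tid = tid
        self.payload = payload
        self.retries_left = max_retries
        self.result = None      # None while in flight, then "ok" or "error"


class TransactionTable:
    def __init__(self, send):
        self.send = send        # callable that puts (tid, payload) on the wire
        self.pending = {}       # tid -> transaction awaiting a reply

    def submit(self, trans):
        self.pending[trans.tid] = trans
        self.send(trans.tid, trans.payload)

    def on_reply(self, tid):
        # Replies for unknown (already completed) transactions are ignored
        trans = self.pending.pop(tid, None)
        if trans is not None:
            trans.result = "ok"
        return trans

    def on_timeout(self, tid):
        trans = self.pending.get(tid)
        if trans is None:
            return None
        if trans.retries_left == 0:
            # Retry count exhausted: complete the transaction with an error
            del self.pending[tid]
            trans.result = "error"
        else:
            trans.retries_left -= 1
            self.send(trans.tid, trans.payload)
        return trans
```

A transaction leaves the table in exactly two ways: a reply completes it, or the last allowed resend times out.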
A transaction is always allocated (from the appropriate memory pool), and those are actually
the only allocations in DST itself. When it works with block devices, it may have to clone a bio
when the request crosses a node boundary (or maybe even always, I have to check; this is essentially
what device mapper does, with lots of additional allocations of its own), but that should be a very rare condition.
The network stack will allocate data itself too.
That was a theory. Practice tells me, that essentially 90% of the code should be rewritten
from scratch, so I recloned the tree and so far implemented generic bits of registering
block device, creating various sysfs files and directories and other similar trivial bits.
I still plan to finish it this weekend (without mirroring), but things may turn out differently...
/devel/dst :: Link / Comments (0)
Tue, 15 Jul 2008
Distributed storage development roadmap.
Yes, DST
project is alive and will beat out the crap very soon, since I decided to change its
underlying architecture, and switch to transaction model just like
POHMELFS.
This basically means that as long as system has enough RAM writing operations will be
extremely fast, reading can be balanced between multiple nodes (in mirror), transactions
can be resent, failover mechanism becomes much simpler,
and system overall will be much more robust to failures.
The transaction model also means that the system requires an explicit acknowledgement from the remote side,
and there are two possibilities here: to handle the implicit ack which comes with TCP ack
packets, like I experimented with
before, or to send an explicit ack from the server for each client request.
The former approach, although it has a smaller performance overhead, still suffers from
the fact that pages sent via DST are always stateless, i.e. at this layer there is
no knowledge about who sent the page. We can determine the inode the page belongs to, and can
even get the socket when the page is about to be released after the ack has been received,
but we cannot know exactly which pipe it was submitted from into the given socket,
so when multiple threads send the same page via multiple sendfile()
calls we do not know when and how the page will be released. We could put the pipes this page belongs
to into a singly linked list (since the page has only two pointers unused at this point, the LRU
list head, and one of them is used to determine that the page belongs to the sendfile()/splice codepath),
and traversing this list would likely not hurt usual users, but a malicious one could
create a local DoS with this approach. After some experiments with the splice code
today I decided to drop the implementation of this idea for now.
There is a strong argument in favour of explicit acks from the server: they allow asynchronous transaction
processing (with implicit acks we cannot hook into the processing path, since we do not know where exactly
the skb with our pages is chained), and they do not hurt performance (which was proven by
POHMELFS benchmarks).
So, overall plan to develop DST is to switch to transaction model and perform async processing
of all events (there are only two actually: reading and writing of the given pages to given
locations).
This task is not that complex, so I expect some new results later this week. Stay tuned!
/devel/dst :: Link / Comments (5)
Sun, 30 Mar 2008
Continue DST roadmap.
So, I have to admit that I rethought my
opinion
about mirroring/redundancy at filesystem layer - it is useful for lots of cases,
and modulo bugs in
DST
mirroring (mostly a leak, which I can not find in my lab,
and network/block layer race,
which exists in sendfile() for years and just strikes DST a lot,
which has a workaround though) I decided to rewrite mirroring algorithm in a way
it could be used in other projects.
There is also an idea of how to fix abovementioned network/block layer race in a
very non-disturbing manner, which was privately called soft
DST barriers.
Idea is to replace skb destructor with private one, which will commit that
pages are no longer used (for example call bio_endio() or
release splice buffer), this callback will be installed only for special sockets,
which provide it (like DST, sendfile() or any other
->sendpage() users like samba). Idea was
not killed on its roots,
which is a good start sign.
/devel/dst :: Link / Comments (8)
Thu, 27 Mar 2008
Distributed storage roadmap.
DST
project was quiet for a while, but actually it is not.
There is a bug in the mirror algorithm, which I am considering rewriting. Not because of
this bug, but because it will be used in a special setup which requires extending it.
Consider a highly available *SQL cluster with multiple storage nodes combined into a
mirror and several main systems which run the database software. Unfortunately,
only a single main system serves queries; another has to be turned on when the first one fails.
The task is to create a system which will automatically switch between main nodes and
recover if either main nodes or storage nodes become unavailable, so that the whole
system does not stop if something goes wrong with the machines. It has to scale
to tens of nodes as a must, and later to hundreds, without problems.
This is not a performance scalability solution - so far only single node should be able to
collect multiple data nodes into storage, and if that node fails it has to be switched,
but so far I do not know any working and free solution for the problem. But solution created
for the main node switching can be used in cases when any server (for example metadata server
in cluster) failed and has to be switched.
It will also force me to finally implement barriers in DST.
As a possible helper for availability messages
I consider abandoned CARP-like
protocol (in userspace).
/devel/dst :: Link / Comments (0)
Thu, 31 Jan 2008
BTRFS subvolumes.
Chris Mason created a short specification
for the subvolumes in BTRFS. Subvolumes allow the filesystem to allocate blocks
on several devices and use tricky algorithms to distribute the load
between storages.
Overall this is an excellent idea, but the specification raises some questions and I believe
it is too heavily tied to the ZFS design.
I will drop my thoughts here, which may be completely wrong though.
Here are some features btrfs will support with subvolume implementation:
Mirrored metadata, configurable up to N mirrors (where N > 2); Mirrored data extents;
Checksum failure resolution by using a mirrored copy; Striped data extents and others.
They are clear targets for block layer, but there are following notes on why it is not:
If Btrfs were to rely on device mapper or MD for mirroring, it would not be able to resolve checksum
failures by checking the mirrored copy. The lower layers don't know the checksum or granularity of
the filesystem blocks, and so they are not able to verify the data they return.
Well, that's not entirely correct, since the checksum has to be checked not against the other mirror, but
against the data itself (i.e. it has to be recalculated after the read), since data can
be damaged during transfer, and that is not that rare a condition. Thus checksums from different mirrors can
both be wrong but equal, which without recalculation would signal that everything is ok when it is not.
Recalculating a block checksum can be faster for smaller blocks than reading it from another disk.
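The point about recalculating can be illustrated with a small sketch. It assumes the expected checksum is known to the reader (for example, kept in filesystem metadata) and uses `zlib.crc32` purely as a stand-in checksum; `read_verified` and the dict-based "mirrors" are hypothetical.

```python
import zlib


def read_verified(mirrors, block, expected_csum):
    """Return (mirror index, data) for the first mirror whose data verifies."""
    for index, mirror in enumerate(mirrors):
        data = mirror[block]
        # Recalculate over the data actually returned by this read,
        # instead of comparing stored checksums between mirrors.
        if zlib.crc32(data) == expected_csum:
            return index, data
    raise IOError("no mirror returned data matching the checksum")
```

Two mirrors that hold the same damaged block would agree with each other, but neither would pass this check.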
If Btrfs were to rely on device mapper for aggregating all of the physical devices into a single big
address space, it would not have sufficient information to allocate mirrored copies on different devices.
Keeping this information in sync between Btrfs and the device mapper would be difficult and error prone.
Actually it is very simple. DST
supports such interaction, for example.
Instead I propose and will use following scheme for subvolumes (I like the name) in local filesystem:
there is pool of devices, and there are allocation policies for each one in the following form (just an example):
files with '*.jpg' pattern are allocated from device 1, '*.log' from device 2, metadata is stored on device 3,
small files are allocated on device 4, and so on. Then each device has own policy on mirroring its data to needed
number of storages.
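As a rough illustration of such an allocation policy, here is a sketch of the example rules above. The device numbers come from the example; the 4096-byte threshold for "small files", the default device 0 and the `pick_device` name are my assumptions for illustration only.

```python
from fnmatch import fnmatchcase

SMALL_FILE_LIMIT = 4096     # assumption: the post does not define "small"
DEFAULT_DEVICE = 0          # hypothetical fallback, not part of the example


def pick_device(name, size, is_metadata=False):
    """Pick the device number an object should be allocated from."""
    if is_metadata:
        return 3
    if fnmatchcase(name, "*.jpg"):
        return 1
    if fnmatchcase(name, "*.log"):
        return 2
    if size < SMALL_FILE_LIMIT:
        return 4
    return DEFAULT_DEVICE
```

Mirroring would then be a separate, per-device setting, independent of how objects are routed to devices.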
And, a side note, it looks like Chris Mason uses Mac OSX for development or at least for writing documentation,
since a screenshot of high-level design
clearly has Mac's shadows and fonts :)
/devel/dst :: Link / Comments (0)
Tue, 22 Jan 2008
New DST release: Succumbed to live ant.
This is a maintenance release only and it contains
only following change:
- do not allocate big enough address structure on the stack
during local export node initialization
Great thanks to Serge Leschinsky and Konstantin Kalin for testing.
As usual one can get the latest version from
project homepage
or via git tree.
/devel/dst :: Link / Comments (8)
Wed, 26 Dec 2007
New release of the distributed storage: Groundhogs strike back: no New Year for humans!
Short changelog:
- mirroring algorithm improvements
- debug cleanups
- extended mirroring initialization
- documentation update
- name is 'Groundhogs strike back: no New Year for humans' now
As usual, one can get patch or pull changes from the project
homepage.
/devel/dst :: Link / Comments (2)
Mon, 17 Dec 2007
New release of the distributed storage: Dancing with the smoked neutrino.
Short changelog:
- new improved mirroring algorithm.
This algorithm uses sliding window approach for full resync
and write log for partial resync.
- fixed number of typos and debug cleanups
- update inode size when linear algorithm changes the size of the
storage in run time
- extended number of sysfs files and documentation for them
- fixed leak in local export node setup
- name is 'Dancing with the smoked neutrino' now
Overall list of features of the DST can be found on project's
homepage.
DST is also exported as a git tree available for clone and pull from
here.
Interested reader can test DST with 2.6.23 tree too
(it should compile fine, but was not tested).
/devel/dst :: Link / Comments (4)
New distributed storage mirroring algorithm.
Resync logic - sliding window algorithm.
At startup system checks age (unique cookie) of the node and if it
does not match first node it resyncs all data from the first node in
the mirror to others (non-sync nodes), each non-synced node has a
window, which slides from the start of the node to the end.
During resync all requests, which enter the window are queued, thus
window has to be sufficiently small. When window is synced from the
other nodes, queued requests are written and window moves forward,
thus subsequent resync is started when previous window is fully completed.
When window reaches end of the node, it is marked as synchronized.
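A toy simulation of the sliding window, under my own simplifying assumptions: nodes are Python lists of blocks, writes that arrive while a window is being copied are given up front, and writes outside the current window go straight through (the text only says that writes entering the window are queued). `full_resync` and `writes_during` are hypothetical names.

```python
def full_resync(source, target, window, writes_during=None):
    """Resync `target` from `source`, `window` blocks at a time.

    `writes_during` maps a window start to the (block, value) writes
    that arrive while that window is being copied.
    """
    writes_during = writes_during or {}
    for start in range(0, len(source), window):
        end = min(start + window, len(source))
        queued = []
        for block, value in writes_during.get(start, []):
            if start <= block < end:
                queued.append((block, value))   # inside the window: wait
            else:
                source[block] = value           # outside: goes through
                if block < start:
                    target[block] = value       # already synced region
        target[start:end] = source[start:end]   # sync the window itself
        for block, value in queued:             # flush the queued writes
            source[block] = value
            target[block] = value
    return target
```

Writes beyond the window only touch the source here, because the window will copy them when it gets there. A smaller window means fewer writes are stalled at any moment.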
If age of the node matches the first one, but log contains different
number of write log entries compared to the first node (first node always
stands as a clean), then partial resync is scheduled.
Partial resync will also be scheduled when log entry pointed by resync
index of the node contains error.
Mechanism of this resync type is following: system selects a sync node
(checking each node's flags) and fetches a log entry pointed by resync
index of the given node and resync data from other nodes to given one.
Then it checks the rest of the write log and checks if there are
another failed writes, so that next resync block would be fetched for
them.
The mirroring log is used to store write request information.
It is allocated on disk and in memory (a sync happens each time
the resync work queue fires), and eats about 1% of free RAM or disk
(whichever is less). Each write updates the log, so when a node goes offline,
its log will be updated with error values, so that these entries
can be resynced when the node is back online. When the number of
failed writes becomes equal to the number of entries in the write log,
recovery becomes impossible (since old log entries were overwritten)
and a full resync is scheduled.
This does not work well with the situation, when there are multiple
writes to the same locations - they are considered as different
writes and thus will be resynced multiple times.
The right solution is to check log for each write, better if log
would be not array, but tree.
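One way to picture a log that checks for existing entries on each write is a structure keyed by location. The sketch below uses a dict keyed by start sector as a stand-in for the tree; it does not merge overlapping ranges, and `WriteLog` with its methods is hypothetical, not DST code.

```python
class WriteLog:
    def __init__(self, capacity):
        self.capacity = capacity
        self.failed = {}                # start sector -> length in sectors
        self.need_full_resync = False

    def record_failure(self, sector, count):
        """Remember a failed write; repeated writes to one place share an entry."""
        if sector not in self.failed and len(self.failed) >= self.capacity:
            # Log is full: partial recovery is no longer possible
            self.need_full_resync = True
            return
        self.failed[sector] = max(count, self.failed.get(sector, 0))

    def regions_to_resync(self):
        return sorted(self.failed.items())
```

With this layout, three failed writes to the same sector produce a single region to resync instead of three.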
/devel/dst :: Link / Comments (0)
Fri, 14 Dec 2007
Linux Test Project on top of DST storage.
# pwd
/mnt/ltp-full-20071130
# ./runltp -p -f fs -d `pwd`/tmp
...
# cat /mnt/ltp-full-20071130/results/results.2007-12-14.11.21.41.17106
Test Start Time: Fri Dec 14 11:21:41 2007
-----------------------------------------
Testcase Result Exit Value
-------- ------ ----------
gf01 PASS 0
gf02 PASS 0
gf03 PASS 0
gf04 PASS 0
gf05 PASS 0
gf06 PASS 0
gf07 PASS 0
-----------------------------------------------
Total Tests: 57
Total Failures: 0
Kernel Version: 2.6.22-rc5-dst
Machine Architecture: x86_64
Hostname: uganda
# mount | grep mnt
/dev/dst-storage-32 on /mnt type xfs (rw)
# cat /sys/devices/storage/n-0-ffff*/type
R: 192.168.4.81:1025
R: 192.168.4.81:1026
All 'fs' tests completed successfully, although I saw following dump in dmesg:
[ 8398.605691] BUG: MAX_LOCK_DEPTH too low!
[ 8398.609641] turning off the locking correctness validator.
which is an XFS bug.
Since DST is quite a dumb device, these tests will not find tricky places, but they are good
at generating high load on top of the given block device.
/devel/dst :: Link / Comments (0)
New mirroring module in the distributed storage.
$ git-diff-index --stat HEAD drivers/block/dst/alg_mirror.c
drivers/block/dst/alg_mirror.c | 745 ++++++++++++++++++++--------------------
1 files changed, 364 insertions(+), 381 deletions(-)
It is cool and works well in my environment, but (like the previous one) it
forces a total mirror resync after a main storage node reboot or crash (if it is
required, for example when the array was not yet in sync when the main node rebooted).
I want to extend the DST mirroring algorithm so that it does not force a full resync, but stores a log
of the writes on each node, so when a new array starts, it would check not only
the age of the nodes (a unique id stored at the end of each node; if it does not match,
a total resync starts), but also the write log, so that if the latter does not match, only
the affected regions would be synchronized.
Stay tuned...
/devel/dst :: Link / Comments (0)
Thu, 13 Dec 2007
Why pushing project into the kernel is not a main goal?..
One has to have some courage and not be afraid to throw something
out and create new things instead of old ones, even if it requires a lot
of effort and causes some problems in the short term.
So I've just erased the mirroring algorithm from DST and will rewrite it mostly
from scratch, since I have a very interesting sync algorithm in mind,
which will not require a clean/dirty bitmap.
Having DST in the kernel would not allow me to have such flexibility...
/devel/dst :: Link / Comments (0)
Wed, 12 Dec 2007
I was a bit pessimistic about DST design bugs.
Things are only bad when resync of the mirror node is in place...
I fixed both issues, but will spend additional time debugging and testing
them, since I do not like how it was done. I think I will rewrite the mirroring
resync logic.
Subrata Modak of IBM suggested to use
Linux Test Project, which I found to
have interesting benchmarks, which while being very useful for filesystem
development, still can find some bugs in DST.
/devel/dst :: Link / Comments (0)
Shame on me or how complex are design bugs...
I have to admit, that mirroring in DST is not currently well supported.
First, because of a bug I made in the early development stage: in DST there
are two objects, which represent a part of the storage, first one is a node,
this object contains information about type of the storage and pointers to
structure, which represents low level device itself (like block device or network
connection). Network connection in turn is represented as a state structure,
which contains socket, state machine for transferred data and so on.
Nodes are used when a block IO request comes from the higher layer, and
states are used when data is transferred via the network. The former use
fine-grained reference counters: when a node is being operated on (a request is being processed),
its reference counter is increased; if the operation becomes asynchronous
(for example, the sending queue is full and thus the block cannot be sent right now),
then the block request is queued into the state's request list and the reference counter for
the node is dropped. If it reaches zero, the node is freed, which in turn
calls the exit callback for the state, which flushes the queue of requests.
Things seem simple and correct, but the devil is in the details: the async processing thread
can enter the game at any point and process the state too, which leads to bugs.
Second, DST mirroring can eat all your memory during resync, since it does not check
the amount of free RAM in the system and tries to allocate new pages until all memory is used.
This is already fixed in the private tree though.
And the last (known) problem is the mirror bitmap: it uses a single bit for every sector
of the device, and although it uses vmalloc(), that is still too much RAM.
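To put a number on "too much RAM", here is the arithmetic as a short sketch, assuming 512-byte sectors. The `bitmap_bytes` helper is hypothetical, and the `bytes_per_bit` parameter is only there to compare granularities; it is not something the current code offers.

```python
SECTOR_SIZE = 512   # assumption: 512-byte sectors


def bitmap_bytes(device_bytes, bytes_per_bit=SECTOR_SIZE):
    """Size of a bitmap that tracks `device_bytes` with one bit per chunk."""
    bits = -(-device_bytes // bytes_per_bit)    # ceiling division
    return -(-bits // 8)


TIB = 1 << 40
MIB = 1 << 20
print(bitmap_bytes(TIB) // MIB)                 # one bit per sector, 1 TiB: 256 (MiB)
print(bitmap_bytes(TIB, bytes_per_bit=MIB))     # one bit per MiB instead: 131072 (bytes)
```

So a 1 TiB device needs a 256 MiB bitmap at one bit per sector.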
Back to fixing.
/devel/dst :: Link / Comments (0)
Mon, 10 Dec 2007
New distributed storage release: Gamardjoba, genacvale!
Short changelog:
- wakeup state when mirror detected error to speed up reconnect
- if connecting in csum mode to no-csum server, do not enable csums
- do not clean queue until all users are removed
- allow to increase size of the storage in linear add callback
(with this change it is possible to add nodes into linear array
in real time without stopping storage. Filesystem has to be prepared
for the case when underlying device has changed its size.
Real-time addon of mirror nodes is also supported)
- allow to delete gendisk only after device was started
- dst debug config option
- Name: Gamardjoba, genacvale! ('Hi friend' in Georgian)
Great thanks to Matthew Hodgson (matthew_mxtelecom.com) for debugging!
As usual, one can get new release from the project homepage.
/devel/dst :: Link / Comments (0)
Fri, 07 Dec 2007
Strong checksums in DST rock.
Great thanks to person, who suggested me
to implement them and Zach Brown, who showed, that
Castagnoli crc is a better one than Adler.
I've debugged a setup where the system failed to mount an XFS filesystem on top of distributed storage,
and after I turned on strong checksums, the system detected that they were wrong, so some corruption
happened during filesystem setup.
Turning off TSO and RX and TX offload on the e1000 NICs of the machines which form the storage fixed the problem.
Strong checksums rock!
/devel/dst :: Link / Comments (3)
Distributed storage and long distances.
I've just completed some tests over the distributed system,
created on top of usual internet links between machines,
located in Moscow, Russia and London, UK.
Remote target was setup, then XFS filesystem created, mounted
and some tests ran.
One of the machines (main storage server) is located behind at least
one NAT firewall.
/devel/dst :: Link / Comments (4)
Thu, 06 Dec 2007
A simple way to crash machine using XFS and DST.
Let's suppose you want to create an XFS on top of DST array.
If you mistakenly run mkfs.xfs /dev/sda1 (let's suppose
you want to create DST storage on top of /dev/sda1 device)
and then start DST on top of /dev/sda1:
./dst -n storage -A alg_mirror -d /dev/sda1 -R -s0 -S0
this will overwrite the last sector of the /dev/sda1,
where XFS stores its metadata. Mounting XFS after that will lead
to almost 100% crash of the machine on 2.6.22 kernels because of some
bugs in XFS, which appear when XFS reads corrupted metadata from the
last sector.
To work with DST you have to operate with /dev/dst-$storage-$num
devices (i.e. run mkfs.xfs /dev/dst-$storage-$num), and not with
underlying ones.
/devel/dst :: Link / Comments (0)
Wed, 05 Dec 2007
Storage hotplugging in DST.
For the interested reader: yes, it is possible to add disks
into DST storage on the fly, but be sure that your filesystem supports that
(in case of linear setup), mirroring is fairly transparent.
Command to add another node into mirror setup is pretty simple:
./dst -n storage -A alg_mirror -S0 -s0 -a kano -p 1026
Just like adding usual node into the storage before it was started.
Please note, that when adding node which is smaller than current device size,
device size will be reduced and this can damage your filesystem!
The same applies to linear setup.
/devel/dst :: Link / Comments (0)
Tue, 04 Dec 2007
DST FAQ.
The most frequently asked question about DST is:
Can you give us a summary of how this differs from using device mapper with NBD or iSCSI?
Answer is quite simple:
From the higher point of view it does not, but it operates quite differently:
it has async processing of the requests, thus not blocking, it has
different protocol with smaller overhead, supports strong checksums, has
in-kernel export server, which supports simple security attributes (i.e.
allow to connect, to read or write). It uses smaller amount of memory
(zero additional allocations in the common path for linear mapping,
not including network allocations, it uses smaller amount of additional
allocations for mirroring case).
DST supports failure recovery in case of dropped connection (core will
reconnect to the remote node when it is ready), thus it is possible to
turn off and on remote nodes without special administration steps. DST
has simple autoconfiguration at the startup time (support checksums and
storage size autonegotiation). It is possible to turn one of the mirror
nodes off and use it as an offline backup, since dst mirror node stores
data at the end of the storage, so it can be mounted locally.
/devel/dst :: Link / Comments (0)
New distributed storage subsystem release.
This is a maintenance release and includes
bug fixes and simple feature extensions only.
Short changelog:
- fixed bug with XFS metadata update (it can provide slab pages to the
DST, so it is not allowed to transfer them using
->sendpage())
- fixed async error completion path
- extended netlink communication channel to report errors back to userspace
- DST name is now "The 10'th dynasty of smuggled slothes"
- number of fixes for userspace DST target
Great thanks to Matthew Hodgson (matthew_mxtelecom.com) for debugging and
fixes for userspace DST target and preliminary netlink extension patches.
As usual you can download this release from the homepage.
If you want to try distributed storage this release is a really good candidate to start with.
Enjoy!
Update: This release includes bug fixes for all bugs described
here,
including uninterruptible sync read operations.
/devel/dst :: Link / Comments (2)
Thu, 29 Nov 2007
Astonishingly screwed tapeworm.
New release of the distributed storage
subsystem. This is maintenance release and includes bug fixes only.
Short changelog:
- use node's size in sectors instead of bytes
- fixed old/new ages for the first node. Error spotted by Matthew Hodgson (matthew_mxtelecom.com)
- fixed debug printk declaration
- new name
Overall list of features of the DST can be found on project's
homepage.
/devel/dst :: Link / Comments (4)
Tue, 20 Nov 2007
Maintenance release of the distributed storage subsystem.
It contains only following bug fix:
- Cleanup sysfs files on error path. Patch by Chris Madden (chris_reflexsecurity.com)
You can find the latest release on the project homepage.
/devel/dst :: Link / Comments (0)
Thu, 15 Nov 2007
CEPH distributed storage.
It was announced on LWN and kerneltrap recently.
I already wrote
about this filesystem, after that I found
(from discussion with Zach Brown)
that this filesystem does not have a byte-range locking and when number
of threads write to the same file, they become sync writes
(i.e. no cache coherence protocols involved). I'm also not
sure what this is about: I/O workloads should be done with the client
cache off because the writeback is too non-deterministic.
Those were my envious comments :), now the good news.
First, Sage Weil (an author) works full-time on this project
and funds it from own web hosting company, so it is possible to attract
developers for money (he even hired someone to write kernel client
instead of FUSE one). Second, it has completed design and working
implementation (although some design issues are questionable).
So, likely it is a good choice to take a look for you, if you are searching
for the solution which should be ready shortly.
/devel/dst :: Link / Comments (0)
Mon, 05 Nov 2007
Squizzed black-out of the dancing back-aching hippo.
This is a name of the 7'th DST
release.
Changelog is quite big:
- added strong checksum support (Castagnoli crc)
- extended autoconfiguration (added ability to request if remote side supports
strong checksum and turn it on if needed)
- documentation addon - sysfs files
- added clean/dirty sysfs files which allow marking a node as clean (in sync) or dirty (not in sync)
- fair number of bug fixes (including really tricky bastards, which are unlikely
to be found in real setups, but which were still bugs)
- and the main one - added release name (it clearly shows my condition)
This one is really good release, check this out quickly and enjoy the beast!
/devel/dst :: Link / Comments (15)
Thu, 01 Nov 2007
DST status.
Checksumming turned out to be not that trivial a problem: I implemented
it as a sync operation, which might become a bottleneck in case of reading
(the current implementation, after sending a read request, sleeps
and waits for the reply header with the checksum instead of processing
other requests), so I'm about to add another case to the network state machine:
header, data and checksum if needed.
Also fixed a number of bugs which I previously did not notice because
of insufficiently thorough testing (like a problem with single-node
mirroring, or a single-client userspace server which already has a client).
Also found an interesting moment with network ->sendpage()
method and buffers - it is not allowed to provide there SLAB pages (i.e.
those with zero reference counter), although I'm pretty sure it was allowed some time ago.
So right now there is no way to zerocopy a header via ->sendpage().
Operation where this trick was used is initial mirror setup when node reads
its age from the storage, which can be remotely exported node.
There is another issue with the protocol: the block layer requires a
per-page header (actually per bio vector), since it is possible to have
a non-contiguous block request (or is it? I never saw a single request
consisting of several bio vectors where each one, or at least one
other than the first and the last, was not a full page). Actually,
if such a setup is not allowed (i.e. all pages in a bio form a contiguous region
on the device), it would be possible to noticeably reduce the size of the bio vector array
and have a better protocol for distributed storage.
After state machine extension got implemented, I will make a new release,
very likely this will happen tomorrow.
This one will be really good release, so I will likely add a name for it.
Stay tuned!
/devel/dst :: Link / Comments (4)
Wed, 31 Oct 2007
Distributed storage checksums.
I've changed the code to use the crc32c (Castagnoli) checksum instead
of Adler; it is now used in both the kernel and the userspace target.
Code is being tested in various interesting configurations
(like linear array is being exported to remote
node, where it is added to mirror setup with another remote
node with userspace target) and bug was fixed
especially in unusual cases (like I described yesterday when
mirroring with only one node failed to setup).
Checksumming also is being tested by injecting errors after data
was written by test client.
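For reference, a bitwise CRC32C is small enough to show in full. This is a plain, slow sketch using the reflected Castagnoli polynomial 0x82F63B78; the kernel and any real target would use a table-driven or hardware-assisted version, and the function name here is just for illustration.

```python
def crc32c(data, crc=0):
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


print(hex(crc32c(b"123456789")))    # standard check value: 0xe3069283
```

Error injection then amounts to flipping a bit in the written data and confirming that the recalculated checksum no longer matches the one stored with it.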
I would like to finish this today but have to go...
/devel/dst :: Link / Comments (0)
Tue, 30 Oct 2007
Simplified DST protocol.
Also added flags for future extensions, fixed a bug
in mirror setup with only a single node, added more aggressive
protocol checks and did overall cleanups.
This is a good release candidate, but I want more testing before
the final commit and new release, especially because of the strong checksums,
which require protocol changes. I also expect to have the Windows target tomorrow.
Although tomorrow will be a somewhat shorter day, I expect to make a new release.
Stay tuned.
/devel/dst :: Link / Comments (0)
Mon, 29 Oct 2007
Strong checksumming in DST.
I've implemented strong checksumming
(I use the Fletcher algorithm from
RFC 1146,
which is the TCP alternate checksum options RFC) in DST and also fixed a number of bugs,
so there is going to be a new release tomorrow.
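As an illustration of the idea, here is a minimal Fletcher checksum with two 8-bit accumulators in Python. It is a sketch of the general algorithm, assuming modulo-255 arithmetic and a (B << 8) | A result layout; it is not claimed to match the exact variant or wire format used in DST.

```python
def fletcher16(data: bytes) -> int:
    """Fletcher checksum with two 8-bit accumulators (modulo 255)."""
    a = 0  # running sum of the bytes
    b = 0  # running sum of the running sums, which makes it position-sensitive
    for byte in data:
        a = (a + byte) % 255
        b = (b + a) % 255
    return (b << 8) | a
```

Unlike a plain byte sum, the second accumulator makes the result depend on byte order, so `fletcher16(b"ab")` and `fletcher16(b"ba")` differ.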
I also hope that the Windows DST target will be completed tomorrow (this does
not depend on me - I'm not the author).
Stay tuned.
/devel/dst :: Link / Comments (2)
Sat, 27 Oct 2007
DST merging plans.
Andrew Morton asked
about the status of the distributed storage
and noted that there is actually no reason not to think about merging it.
He is concerned about the quite active development visible via my blog, but DST itself
is essentially complete.
It will likely require some additional features once I start distributed filesystem
development, but right now it does not. Maybe I will add optional strong checksumming
of the transferred data, though.
/devel/dst :: Link / Comments (7)
Tue, 23 Oct 2007
DST as shared disk storage.
Yes, it is possible. It is a transport layer
for a high-performance parallel filesystem, and everything
is already completed. Consider the case when multiple filesystem nodes,
i.e. nodes which do not contain data but only metadata, connect
to the same storage nodes (which contain the storage itself): it is possible
to connect several remote nodes to a single local export node and perform
concurrent read/write access. Similar storage is the base for likely all
computing cluster filesystems (for example GPFS,
PVFS, GFS).
This of course requires a higher-layer filesystem to manage locks
and concurrency in order to preserve filesystem state, but the storage layer itself is
already fully implemented in DST.
/devel/dst :: Link / Comments (0)
Fri, 19 Oct 2007
Sub-release of DST with debug turned off.
The latest release
has debug turned on in the mirroring algorithm, so I've released a 'nodebug' version
and put it into the archive too.
Enjoy!
/devel/dst :: Link / Comments (0)
New article at kerneltrap.org about DST.
I have serious problems with English articles...
/devel/dst :: Link / Comments (0)
Thu, 18 Oct 2007
Added ability to store the age of a node on disk in the mirror algorithm. New DST release.
This is used for cases when a mirror node was updated (disks changed
or something like that), so that the media of the failed node no longer contains the data
that was there previously. In this case the dst
core will read the 'age' of the failed node (a unique id stored at the end of
the node, which is assigned to the whole storage at initialization time),
and if it does not match the current one, the whole node will be marked
as dirty, which forces a total resync.
The same applies to the initial setup: if the id of the second or any subsequent node
does not match the id of the first one, those nodes will be marked as dirty and will resync eventually.
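The age check can be sketched as follows. This is an illustrative Python model, not the kernel code: the `MirrorNode` class, its fields and `check_ages()` are hypothetical names, and reading the id from the end of the media is replaced by a plain field.

```python
from dataclasses import dataclass


@dataclass
class MirrorNode:
    name: str
    stored_age: int      # id as read from the end of the node's media
    dirty: bool = False  # True means the node needs a total resync


def check_ages(nodes):
    """Mark every node whose stored id differs from the first node's id.

    Returns the names of the nodes that were found to be dirty.
    """
    storage_age = nodes[0].stored_age
    for node in nodes[1:]:
        if node.stored_age != storage_age:
            node.dirty = True
    return [node.name for node in nodes if node.dirty]


nodes = [MirrorNode("a", 7), MirrorNode("b", 7), MirrorNode("c", 3)]
dirty_names = check_ages(nodes)
```

Here node "c" carries a different id (for example because its disk was replaced), so it is the only one marked dirty and scheduled for resync.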
This is a good step forward, I think. The only missing bit I'm thinking about right now
is on-demand resync, i.e. a resync started as soon as a node is found to be dirty. Right now
resync only happens when there are operations on top of the storage. This is quite a minor
priority though, as is the new redundancy algorithm.
/devel/dst :: Link / Comments (10)
Thu, 11 Oct 2007
New release of the distributed storage subsystem.
This is a maintenance release, which includes a small number
of bug fixes that were sitting in the tree but had not yet been released.
Check the homepage
to get the latest release.
/devel/dst :: Link / Comments (2)
Sat, 29 Sep 2007
Design of the WEAVER redundancy code module in distributed storage.
Here I will briefly describe the ideas behind the WEAVER implementation for DST.
First, it requires some extension of the configuration (although it is possible
to send all the related information with the existing commands, it is a bit ugly),
so I will introduce a private per-algorithm command which will allow all the
needed information to be transferred. This will not even break backward compatibility.
Next is the WEAVER implementation itself.
I will create 50% efficiency codes, which means that half of each node will contain
data (the first half, so that it could later be used by the linear algorithm,
for example for recovery, although that is quite an ugly usage) and the other half will contain
parity information (and optionally a checksum of the whole node, or separate data and parity
checksums). Reading is trivial (just like in the case of usual mirroring) and will not be described
here. With such a node layout, each write to the distributed storage easily allows finding
the node where the data should be written; the private structure for each node will contain
a list of nodes whose parity blocks have to be updated when the given node is written to.
Each private structure will have the same per-block bitmap as the mirror algorithm maintains,
which will track clean/dirty blocks of the data. So, when a number of blocks are written to node A,
a number of parity blocks will be updated on the appropriate nodes. The parity update
is quite trivial: it requires reading the given parity block, XORing it with the old data block
(which will be read just before the write is performed) and with the new data block,
and then writing the parity block back. The dirty bitmap for data and parity will be updated
after the data and parity are written.
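The parity update described above boils down to new_parity = old_parity XOR old_data XOR new_data. A minimal Python sketch, with hypothetical helper names and tiny 4-byte blocks instead of sectors:

```python
def xor_blocks(x: bytes, y: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    assert len(x) == len(y)
    return bytes(a ^ b for a, b in zip(x, y))


def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Remove the old data block from the parity and add the new one."""
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)


# Parity block covering three data blocks from different nodes.
d0, d1, d2 = b"\x11" * 4, b"\x22" * 4, b"\x44" * 4
parity = xor_blocks(xor_blocks(d0, d1), d2)

# Overwrite d1: only the old data, the new data and the old parity are needed.
new_d1 = b"\xa5" * 4
parity = update_parity(parity, d1, new_d1)
```

The point of the read-modify-write form is that the other data blocks covered by the parity (d0 and d2 here) do not have to be read at all.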
The complex part starts when a number of nodes are turned off.
Let's first describe the case when one node is turned off.
In this case things are just the same as above, but data is not written to it (nor is its
parity updated when writes happen on other nodes linked to it by the algorithm),
so the corresponding bitmap will contain out-of-date blocks.
When the node is turned back on, resync starts.
Just like in the mirror algorithm, it will be performed only when some operation happens (which probably
should be extended so that it is done even if there are no operations in flight); I will share as much code
as possible.
Each data block can be reconstructed from some parity and data blocks on different nodes,
so instead of reading a block from a remote node and writing it to the given one (as in the mirror), the system
will read a number of data and parity blocks from different nodes; when they are ready, the new data block
will be reconstructed and written to the not-synced node. When the data block is reconstructed, all nodes
whose parity depends on this block will be updated accordingly.
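For a single XOR parity block, the reconstruction step can be sketched like this. It is a simplified Python illustration with hypothetical names: it assumes one parity block that covers the lost data block plus some surviving data blocks, and it does not model the actual WEAVER parity defining sets.

```python
from functools import reduce


def xor_blocks(x: bytes, y: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    assert len(x) == len(y)
    return bytes(a ^ b for a, b in zip(x, y))


def reconstruct_block(parity: bytes, surviving: list) -> bytes:
    """Rebuild the missing data block from a parity block and the
    surviving data blocks that participate in the same parity."""
    return reduce(xor_blocks, surviving, parity)


d0, d1, d2 = b"\x01\x02", b"\x30\x40", b"\x05\x06"
parity = reduce(xor_blocks, [d0, d1, d2])

# The node holding d1 was off; read d0, d2 and the parity from other nodes.
rebuilt = reconstruct_block(parity, [d0, d2])
```

Once `rebuilt` has been written to the not-synced node, any parity blocks that depend on it can be brought up to date in the same way as for an ordinary write.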
For one failed node this looks quite simple. In theory multiple failed nodes are just the same:
as long as the number of failed nodes does not exceed the fault tolerance number used in the
algorithm's calculations, there is always a number of 'good' nodes which can be used to reconstruct
at least one failed node; with the newly recovered node it is possible to reconstruct the missing
parity bits needed to recover another node, and so on. But that is theory; practice, I'm sure, will
show me a big number of surprises and complexities, and that is exactly why I started the project.
Right now I'm thinking about a userspace implementation, which I hope to complete quite soon (maybe next week),
so that the kernel part will not be that hard to do (mostly because kernel debugging usually
takes much longer because of compilations and reboots). When it is ready, I will mark the project as completed.
I'm not sure I will push it into the kernel at that point; I will see.
After it is ready I will likely start to recall the filesystem bits I
collected and will probably
implement a simple (read: initially trivial) distributed filesystem.
Or maybe not: I think this project will exhaust me enough, so I might start something completely
different, like process migration over the network,
Markov chains text analysis,
a letter-describing language for captcha solving,
a network protocol over a laser link
or something completely new.
Stay tuned and you will not miss interesting things :)
/devel/dst :: Link / Comments (1)
Tue, 25 Sep 2007
WEAVER redundancy code.
I've started an initial userspace implementation of the
WEAVER codes.
This is a very simple XOR-based code and thus it is possible
to miss a double-bit error, so I will extend it with a strong
checksum. Essentially the system splits the whole pool
of available nodes (storages) into so-called strips,
which contain a number of data and parity blocks (I will use sector-sized chunks).
The latter are just a XOR of some data blocks; which data blocks
participate in a given parity block is defined by the parity defining sets of the WEAVER
code, which depend on the selected fault tolerance, the number of nodes used
and the selected step.
So far I plan to implement the simple case of fault tolerance 3 (i.e. any 3 nodes
can be turned off) in userspace, in a way which will allow constructing
storages with different fault tolerance numbers.
Stay tuned.
/devel/dst :: Link / Comments (0)
Sun, 23 Sep 2007
Windows DST target.
There is a port of the userspace DST target to Windows, which
works with usual files (working with device nodes is on the TODO list) in userspace,
so I've started implementing a stress testing utility
which will randomly read and write blocks over the net. After
this testing is completed I will test this utility with a real
DST system and put it on the homepage.
The Windows DST target was not written by me.
/devel/dst :: Link / Comments (0)
Tue, 18 Sep 2007
New article about distributed storage on kerneltrap.
The article
describes a short debate about the place of DST in the storage stack, its relation
to the filesystem and various ideas for creating distributed systems, which I had with
Jeff Garzik (SATA and network drivers maintainer), Andreas Dilger (Lustre/Clusterfs
principal software engineer) and other people.
There were two points of view: that all existing distributed systems suck (too bloated,
too oriented towards very expensive storage hardware, badly scaling and so on), and
that the existing one (Lustre) is very good, so there is no need to reinvent the wheel, since
a distributed FS is quite complex and will likely not succeed, whereas a block device
is simple and thus appealing.
Maybe that is right, we will see...
/devel/dst :: Link / Comments (4)