Wed, 08 Oct 2008
This day has come.
One month of waiting.
One week of real work.
Seven releases of different projects.
One idea.
One implementation.
One project.
This day has come: the new, completely rewritten locking subsystem in
POHMELFS.
The release day!
The following changes were made:
- The new distributed locking subsystem. Locks were prepared to be byte-range,
but since all Linux filesystems lock the whole inode, it was decided to lock the whole
object during writing. The actual messages sent by the locking/cache coherency protocol
are byte-range, but because the whole inode is locked and the lock is cached, the range
is actually equal to inode->i_size. One can simultaneously write into the same page
at different offsets from different clients, and the file will always be coherent on all
clients which do it and on the server itself.
- Documentation update. Fixed by Adam Langley (agl_imperialviolet.org)
- Add/del/show commands patch from Varun Chandramohan (varunc_linux.vnet.ibm.com)
- Bug fixes and cleanups.
Get the latest version from the archive or via the GIT tree.
Enjoy!
/devel/fs :: Link / Comments ()
Tue, 07 Oct 2008
POHMELFS locking testing.
So far it has produced results that are not bad, but not good either.
I see locking messages, they are in the right order and the file content
is not damaged, but clients frequently give up on timeout while waiting for
the lock to be granted. Since the lock release process requires the inode to be
unlocked (so it can be found and locked by the thread which received the
network packet), this indeed may take too long on slow media and disks,
since locking has to wait until data is written, for example wait for writeback
completion or for page reading, if the pages were not in the cache yet.
I tested POHMELFS locks in Xen domains, where network speed is limited
to 3 MB/s, and writing one million (or ten million, that may be the point)
8-byte entries at different offsets (sequential step of 128 bytes) took more
than 50 seconds, so the default 5-second lock timeout may not be enough.
That's the theory; in practice I need to test different timeouts and actually
run on real machines, but here comes another problem. I have three quite fast
SMP machines with lots of RAM connected over gigabit ethernet, which I can use
however I like. But...
The first one was essentially killed by tbench regression testing. They all have
a long history of problems with disks or SCSI controllers, and now it has happened again:
the first machine boots only with a single 2.6.22 Debian kernel,
anything else (including vanilla 2.6.22) fails to read data from the software raid
partition, although the disks are detected correctly.
Another machine is actively used by the aforementioned tbench regression testing,
and it takes quite a long time to boot it and run tests, so things are slow enough.
And the last one is used to control IPMI, since it is the only way I have to reboot them,
so when I managed to freeze all three, I needed to contact people who needed to contact people,
who needed to hard reset machines in datacenter and put them into BIOS, since existing
KVM switches are stupid enough not to respond to keyboard when machine died.
So, I'm a bit forced to spread efforts in several different directions,
but nevertheless there is a little bit of time for the new things:
$ ./elliptics -c ./elliptics.conf
2008-10-07 01:03:39.430198 12778 Logging has been started.
2008-10-07 01:03:39.430559 12778 Successfully initialized 'sha1' hash.
2008-10-07 01:03:39.430641 12778 Node id: b551803fd74ff5590ed38f6ce8a10a2e577b2a9e
2008-10-07 01:03:39.431076 12778 Server is now listening at 127.0.0.1:1025.
$ cat elliptics.conf
#
# This is a simple config file for the elliptics network.
# Note, that spaces are skipped before and after the '=' delimiter.
#
log = /dev/stdout
hash = sha1
id = This is id string
#numeric_id = 1234567890abcdefffffffffffffffffffffffa
root = /tmp
addr = 127.0.0.1:1025:2
#addr = ::1:1025:10
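The format shown above is simple enough to parse in a few lines. Here is a minimal sketch in Python, assuming only what the comments in the file state ('#' starts a comment line, '=' is the delimiter, spaces around it are skipped). The function name parse_config is hypothetical and this is not the actual elliptics parser; the meaning of the third field in addr is not explained in the post, so the value is kept as an opaque string.

```python
def parse_config(text):
    """Parse 'key = value' lines; lines starting with '#' are comments."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so a value may itself contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a 'key = value' line, ignore it
        options[key.strip()] = value.strip()
    return options


SAMPLE = """\
#
# This is a simple config file for the elliptics network.
#
log = /dev/stdout
hash = sha1
id = This is id string
#numeric_id = 1234567890abcdefffffffffffffffffffffffa
root = /tmp
addr = 127.0.0.1:1025:2
"""

config = parse_config(SAMPLE)
```

Spaces inside a value (as in the id string) survive, only the ones around the delimiter are dropped.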
That will be an excellent project (maybe even my best one to date :),
which will be used in... More details when things are ready.
I like the idea, so maybe it will give a name for my new site, like
noelliptics.net. Not yet though.
/devel/fs :: Link / Comments ()
Thu, 02 Oct 2008
POHMELFS got new locking subsystem.
I've completed a small rewrite of the distributed locks in POHMELFS.
They can be byte-range, but since Linux VFS locks the whole inode
during writing, I decided to implement a simpler approach first,
so although clients send byte-range locks, the server locks the whole
object.
If there is a simultaneous writing to the object, only one writer is allowed
at a time. Write locks are grabbed at write time, read locks at read time. Writing
is still handled via writeback, so all caching facilities persist. Locks are 'cached',
i.e. if inode was locked and no one else tried to update it, no new lock messages
are sent between server and client. Lock release message (initiated by another client,
who wants to start writing into the same file) forces inode writeback on the current
lock owner.
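The behaviour described above (one writer at a time, cached locks, forced writeback on release) can be modelled in userspace. This is a toy sketch under stated assumptions, not POHMELFS code: the class names, the message counter and the in-memory "writeback" list are all made up for illustration.

```python
class LockServer:
    """Toy whole-object lock server which counts lock messages."""

    def __init__(self):
        self.owner = {}      # object name -> client currently holding the lock
        self.messages = 0    # lock messages exchanged with clients

    def grab(self, client, obj):
        self.messages += 1                  # grab request from the client
        current = self.owner.get(obj)
        if current is not None and current is not client:
            self.messages += 1              # release request to the owner
            current.release(obj)            # owner flushes and drops the lock
        self.owner[obj] = client


class Client:
    def __init__(self, name, server):
        self.name = name
        self.server = server
        self.cached = set()   # objects whose lock this client holds
        self.dirty = {}       # object -> data not yet written back
        self.flushed = []     # (object, data) pairs sent to the server

    def write(self, obj, data):
        if obj not in self.cached:          # lock is not cached: ask the server
            self.server.grab(self, obj)
            self.cached.add(obj)
        # Data stays in the local cache until writeback.
        self.dirty[obj] = self.dirty.get(obj, b"") + data

    def release(self, obj):
        # Forced writeback before the lock is given away.
        if obj in self.dirty:
            self.flushed.append((obj, self.dirty.pop(obj)))
        self.cached.discard(obj)


server = LockServer()
a, b = Client("a", server), Client("b", server)
a.write("file", b"x")
a.write("file", b"y")        # lock is cached, no new messages are sent
after_a = server.messages
b.write("file", b"z")        # forces writeback on client a
```

The second write from client a costs no messages, while the first write from client b costs two (its own grab and the release sent to a) and pushes a's dirty data out.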
I've started a testing process, so far quite trivial, but I plan to write a simple
application, which will simultaneously write into the same file from different clients
into different offsets (like first client writes each second byte, second client writes
each third byte and so on) and check the result. If everything is ok, I will release a new
version this weekend and start implementation of the really cool distributed facilities
I plan to have in POHMELFS. It will be first implemented as a library, so that anyone could
use it to create a distributed storage without patching a kernel (but with own API though,
I do not want to mess with FUSE).
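The checking side of the test application planned above could look roughly like this. It is only a sketch under assumptions of mine: each client marks the bytes it writes with its own stride value (the post does not say what is written), and at offsets hit by several clients any of their values is accepted, since the order of writers is not defined.

```python
def allowed_bytes(offset, strides):
    """Values a coherent file may hold at the given offset."""
    writers = {s for s in strides if offset % s == 0}
    # A client with stride s writes the byte value s; 0 means never written.
    return writers or {0}


def check(content, strides):
    """True if every byte could have been produced by the writers."""
    return all(b in allowed_bytes(i, strides) for i, b in enumerate(content))


def simulate(size, strides):
    """Local stand-in for the clients: apply all writes one after another."""
    data = bytearray(size)
    for s in strides:
        for offset in range(0, size, s):
            data[offset] = s
    return bytes(data)


content = simulate(64, [2, 3])
ok = check(content, [2, 3])
```

In a real run the content would be read back from the server and from every client instead of being produced by simulate().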
/devel/fs :: Link / Comments ()
Sat, 27 Sep 2008
POHMELFS cache coherency protocol.
Finally it looks like there are no killer bugs
or noticeably bad features in the distributed storage.
Yesterday I pushed a change to drop a wrong debug statement, which might have resulted in a crash; a couple
of comment cleanups are also waiting to be pushed, and likely that's it. It will be the last release,
if there are no new feature requests or bugs found.
So, I switched back to POHMELFS
development from DST.
To be really cool at handling cache coherency collisions, POHMELFS requires a new locking/coherency mechanism, which
I am implementing similarly to the MOESI cache coherency protocol.
This basically means a floating lock for a given object,
which may be owned by only one client at a time, not counting readers: they just receive a message that their
data is not valid anymore.
First, I changed userspace management of the inode cache: now there is only a single tree of all objects
which were ever opened by any client. When a client disconnects or drops an inode locally, it is removed from the
server's cache too.
Next, there will be a special command to grab/release a lock, which is only sent by writers.
When a writer starts its dirty job of damaging shared data, it sends a lock grab message to the server with
the requested range, which in turn is broadcast to the other writers; only a single writer is allowed to own a given area.
Then the server proceeds with its usual tasks of cooking or waiting for IO. Eventually the owner of the lock
decides to release it, either on its own or after the above message from the server, in which case it can flush data to the server
and send a lock release message. The server then checks whether the given area is now free and sends a lock
completion message to the requester. The new owner receives the message, marks the inode as owned and starts writing there.
Any subsequent write, if the inode is marked as owned, does not result in an additional lock message.
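The server-side bookkeeping for the grab/release/completion flow above can be sketched as a table of owned byte ranges. This is an illustration under assumptions, not the POHMELFS implementation: the class and method names are hypothetical, the broadcast to other writers is left out, and ranges are half-open [start, end).

```python
class RangeLocks:
    """Toy server-side table of byte ranges owned by writers."""

    def __init__(self):
        self.owned = []     # list of (start, end, client), end is exclusive
        self.waiting = []   # grab requests which could not be granted yet

    def _conflicts(self, start, end, client):
        # Two ranges overlap if each one starts before the other ends.
        return [o for o in self.owned
                if o[2] != client and o[0] < end and start < o[1]]

    def grab(self, client, start, end):
        """Return True if granted right away, otherwise queue the request."""
        if self._conflicts(start, end, client):
            self.waiting.append((start, end, client))
            return False
        self.owned.append((start, end, client))
        return True

    def release(self, client, start, end):
        """Drop a range; return clients who now get a completion message."""
        self.owned.remove((start, end, client))
        granted, still_waiting = [], []
        for request in self.waiting:
            s, e, c = request
            if self._conflicts(s, e, c):
                still_waiting.append(request)
            else:
                self.owned.append(request)
                granted.append(c)
        self.waiting = still_waiting
        return granted


locks = RangeLocks()
first = locks.grab("a", 0, 100)       # area is free
second = locks.grab("b", 50, 150)     # overlaps with a, has to wait
third = locks.grab("c", 200, 300)     # different area, granted
completed = locks.release("a", 0, 100)
```

With whole-inode locking every request simply covers the range from 0 to the file size, so any two writers conflict.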
So far it looks doable, but I have only completed what is called 'first' above :)
If there are no major problems with the other project, I plan to complete this part quickly and move forward.
/devel/fs :: Link / Comments ()
Thu, 11 Sep 2008
POHMELFS development process.
I completed the design (without implementation yet :) of the new
locking (or cache coherency mechanism, it does not really matter)
for the shared objects in POHMELFS.
It is somewhat close to the MOSI (even MOESI) cache coherency protocol
used in modern CPUs, although it also differs a bit because of the nature
of the POHMELFS server. It can provide byte-range locking for any object,
but so far I will only implement per-file locking (i.e. the whole file will
be 'locked' or 'owned' when a client performs a write, even if another client
could write to a different location in the same object), and if scalability
is not good enough, it can be extended (which is not that complex). Since
all in-kernel filesystems lock the whole inode when performing a write,
this should not be a big problem.
This approach requires changing the POHMELFS server's directory cache, but I
never liked the existing one, since it looks a bit overengineered.
If things go smoothly, I will complete it tomorrow before flying to the kernel summit
(Saturday early morning), since the idea is really not very complex, and I expect
the implementation not to be either.
Meanwhile, DST
got a fix for the incredibly stupid bug I made. I do not even want to call this a
'bug', it is more likely a tricky blindness created by the electrons moving things around in my
monitor. They forced me not to see an obvious place to lock access to, which resulted
in nasty oopses. The patch is already in the
git tree.
There is another one though: when some SCSI device is being exported and a client performs a
write, the request somehow has a zero req->nr_phys_segments field (which should
be initialized from the block IO request), which triggers a BUG_ON() in the SCSI
code. I'm working on it right now.
/devel/fs :: Link / Comments ()
Mon, 25 Aug 2008
POHMELFS configuration extension.
I've committed changes from Varun Chandramohan (varunc_linux.vnet.ibm.com)
which extend POHMELFS
to support ADD/REMOVE/SHOW operations on configuration groups.
A configuration group is a global object inside the pohmelfs core, which contains information about
the servers to work with and various configuration parameters. When an administrator mounts a new pohmelfs
filesystem, he or she has to set up an appropriate configuration group and use its index as a mount
option parameter. There is a special configuration utility for this purpose inside the
POHMELFS userspace package.
Now it is possible not only to add or remove groups, but also to show them to the administrator.
I've pushed the changes into the kernel
and userspace
GIT trees.
/devel/fs :: Link / Comments ()
Sat, 26 Jul 2008
New POHMELFS release.
This release was made entirely by other developers. Thanks a lot for your work.
I only updated some trivial bits and fixed a bug in the server.
Short changelog:
- Documentation update by Adam Langley (agl_imperialviolet.org).
Now one can read properly spelled POHMELFS design.
- Server and configuration utility IPv6 support by
Varun Chandramohan (varunc_linux.vnet.ibm.com). The kernel client
does not need these changes, since it supports any protocol.
Now one can create a POHMELFS cluster over IPv6.
- Server bug fix and small documentation update by me.
One can get more details about POHMELFS at its
homepage.
Sources can be downloaded from archive
or via GIT tree.
/devel/fs :: Link / Comments ()
Fri, 25 Jul 2008
This was supposed to be a new POHMELFS release day.
I accumulated patches from Varun Chandramohan of IBM Linux center,
which add IPv6 support to the POHMELFS
server and configuration utility. Kernel client does not need it, since it works
with any kind of addresses (by design).
I also wanted to add documentation update from Adam Langley, but apparently
I accidentally deleted his patches, so release is being postponed a bit.
Meanwhile I made a little progress on the DST
development side. I added trivial configuration bits and started to develop the cryptography part,
mainly configuration (which I will copy from POHMELFS) and the thread pool subsystem.
The latter is a rather simple patch, which allows one to create a thread pool, to add/remove
threads on demand and to queue work to the pool. In theory this can be a generic
enough patch to be used by other users (I even saw some kind of topic proposal for the
kernel summit), but so far I'm not going to push it separately from DST. The main goal
of this system is crypto processing of the BIOs for the distributed storage.
/devel/fs :: Link / Comments ()
Tue, 22 Jul 2008
POHMELFS distributed facilities design notes.
Since I'm quite busy with visa/hotel/tickets and overall preparations
for the Kernel Summit, there is no development progress, but the preparations should be
completed very soon I think, so I will write down here some design notes
I have in mind about how the POHMELFS server will be designed. It is not a
finished draft, but rather a rough sketch of the direction.
POHMELFS will utilize the distributed hash table approach, i.e. the storage
will support the ability to get an object based on some key attached to it.
In a local filesystem we already work with a hash table: a directory
lookup is no more than a lookup of an inode object based on its name, i.e.
a lookup of the value based on the attached key. And although the key in this
case is not created from the object itself (like a hash of the content or
some other function), it still is a (turn on your imagination here) table lookup.
The cloud of POHMELFS servers will utilize a similar approach. Consider a single
server in the system. When it joins the cloud for the first time (I omit this process for now
and will describe it below), it is empty, so it gets some unique
id, either via administrator steps or randomly, or it just waits in the queue
to be filled with new data and gets its id at that time. It does not matter
for now how it gets its id, but this id is propagated to some cloud of its
neighbours (or, if it were bittorrent or napster, to the main server).
There are two ideas on how to treat this ID: either as a part of the filename,
or as a nameless pointer in the abstract namespace. I will show below that it
actually does not matter.
Now, let's check what happens when a user wants to perform some IO on a given file.
Every file access actually happens to an inode stored on disk. In our case it can be stored
somewhere we do not know yet, so we need to perform a lookup to get the address
of the node in the cluster which contains our data. In existing schemes like bittorrent
or Lustre there is a server (or a small cloud of servers) which contains mapping information
about where this or that object is placed in the data cloud, so a simple lookup to this server(s)
returns the needed info. This approach does not scale to really large numbers of nodes and is failure-prone.
Instead I consider completely distributed metadata storage. Let's check how the system will look up
the whole path in our case.
Each path starts from the root directory, which is '/', which in turn is an id in the global
namespace (or a hash of this string or whatever other mapping), so we first need to look up
the node which is responsible for the content of this directory. Each node contains routes only
to a very limited set of neighbour nodes (in various designs this number varies, but the idea
lies in the fact that the node performing the lookup does not know which node contains the needed info).
The Gnutella system just broadcasted this lookup request to all of its neighbours, each of which
broadcasted it to its neighbours and so on until one of the systems replied that it contains
the needed info. The amount of unneeded broadcasts killed Gnutella the day after Napster was closed.
So, this approach does not scale, and instead we need to map the needed directory into a node address
in a more intelligent way. There are at least two most appealing design choices: the ring-based
structure implemented in CHORD and the multidimensional torus implemented in CAN.
Right now it does not matter, let's assume that we found a node which has information
about the content of the needed directory. When we have that data, we can find the next node (or this
info can be cached on the 'parent' directory node) and so on until we get to the node which is responsible for
storing the content of the needed object.
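The path walk described above can be illustrated with a ring of node ids. This is a simplified sketch, assuming SHA1 of the directory path as the key and a ring where a key belongs to the first node with an id not smaller than it. Unlike real CHORD, the Ring class here knows all nodes at once instead of routing through a limited set of neighbours; all names are hypothetical.

```python
import bisect
import hashlib


def key_of(name):
    """Map a string to a 160-bit integer id with SHA1."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")


class Ring:
    def __init__(self, node_names):
        self.names = {key_of(n): n for n in node_names}
        self.ids = sorted(self.names)

    def owner(self, key):
        """First node whose id is >= key, wrapping around the ring."""
        i = bisect.bisect_left(self.ids, key)
        return self.names[self.ids[i % len(self.ids)]]


def resolve(ring, path):
    """Return (directory, responsible node) for every step down the path."""
    hops = [("/", ring.owner(key_of("/")))]
    prefix = ""
    for part in filter(None, path.split("/")):
        prefix += "/" + part
        hops.append((prefix, ring.owner(key_of(prefix))))
    return hops


ring = Ring(["node-a", "node-b", "node-c"])
hops = resolve(ring, "/home/zbr")
```

Each hop is an independent lookup, which is why caching the result on the 'parent' directory node saves work.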
When a new node joins the cloud it connects to one or another known node (provided either by a public
service or by the administrator), sends there information about its available space, gets an ID
and just waits until some client connects to it and starts writing data.
When a node joins with some content, which was written to it by the system before or written by
local users bypassing the distributed mechanism, the node has to tell this information to the node which
holds the parent directory. This information should be stored in each directory it exports, or it
can be provided by the administrator. For example, this node exports dir '/zbr' which is actually a subdir
of '/home', so the node will look up the owner of the '/home' directory content and update its records, so that now
it contains the new dir. There is a problem here: what if there is already another node which also
claims to have dir '/zbr' in '/home'? This can be handled via an extended attribute attached to each object,
which will tell us the last modification date, so the system can select either the last modified '/zbr'
dir or the node which contains the dir with the biggest number of identical replicas. This can be set up by the
administrator.
The main advantage of this joining scheme is the fact that we actually do not need to know the content of any
object in the exported directory: we publish only the high-level object, which may or may not contain some
inner file or dir. Thus we do not need to hash millions of files in the exported directory and publish
them one by one, we do not need to store information about each inner object,
we do not need to attach a full path to each object and so on.
When we decide to split the same object between multiple nodes, we will need not only a
name-based lookup, but also to extend it to the offset inside the object. This can be done by introducing a
system-wide 'block size', so each file is actually a set of blocks of the given size, and when we have found the node
responsible for storing information about the directory where the file is located, this node can also contain
information about where each part of the object was stored.
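The block-based extension of the lookup can be sketched as follows. This is only an illustration of the idea, with a hypothetical block size of 4096 bytes and a hypothetical DirectoryNode class which maps (name, block index) to the storage node.

```python
BLOCK_SIZE = 4096  # hypothetical system-wide block size


def blocks_for(offset, length, block_size=BLOCK_SIZE):
    """Indexes of all blocks touched by an IO of `length` bytes at `offset`."""
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))


class DirectoryNode:
    """Node responsible for a directory; also tracks where blocks live."""

    def __init__(self):
        self.block_map = {}   # (file name, block index) -> storage node

    def place(self, name, index, node):
        self.block_map[(name, index)] = node

    def locate(self, name, offset, length):
        """Return (block index, storage node or None) for the IO range."""
        return [(i, self.block_map.get((name, i)))
                for i in blocks_for(offset, length)]


home = DirectoryNode()
home.place("file", 0, "node-a")
home.place("file", 1, "node-b")
where = home.locate("file", 4000, 200)   # crosses the first block boundary
```

A None in the result means that the block has not been stored anywhere yet.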
Looks quite simple, but... The devil is in the details.
I obviously missed some bits in the design (I created it in my mind
under the 'impression' of the Greek spirit while talking with asm@, who suggested looking
at the Kademlia project), like redundancy management of the nodes, splitting of the node content between
multiple nodes and other bits, but it is one of the first drafts, so things can be changed if needed.
Stay tuned, I will be back to the development process very soon
(DST first :), since the paperwork for the kernel summit travel
seems to be reaching its end.
/devel/fs :: Link / Comments ()
Mon, 07 Jul 2008
New POHMELFS release.
Irish 'Clontarf' and Scotch 'Grant's' helped to roll this release out.
Features of this POHMELFS release
include:
- Strong cryptography support. One can encrypt whole data channel (except headers) and/or hash/digest it.
System will try to autoconfigure itself and if server does not support requested algorithms, mount will either
fail (if special mount option is specified) or disable appropriate algorithm usage.
- Bug fixes.
Cryptography support is an essential addition to the POHMELFS core. It was implemented with performance
in mind, so that processing speed would not drop noticeably even in case of very CPU-hungry operations
(one can check the performance graphs).
POHMELFS utilizes a pool of crypto threads (their number can be specified via a mount option), which perform data crypto
processing and submit the data either to the network or to the VFS layer.
Now I will concentrate mostly on userspace server features, mainly its distributed facilities. The current ability
to write data to multiple servers and balance reading among them is not enough for POHMELFS, but it will be an
essential building block of the fully distributed fault-tolerant parallel filesystem.
If this development requires some changes on the kernel side (namely a network protocol extension), they will be
done in the upcoming releases along with fixes for any bugs found.
As usual, you can grab sources from
archive or via
GIT tree.
You can also check POHMELFS homepage
to get more details on its design and supported features.
P.S. I think I will take a rest from this project for several days, which will allow me to concentrate on the
main POHMELFS features and work out rough edges. I will switch to DST
and netchannels (mainly to make new releases)
and then will devote some time to captcha cracking algorithms.
/devel/fs :: Link / Comments ()
POHMELFS crypto processing performance.
If you expected a miracle, it did not happen, so I just present a picture, where
I compared plain async in-kernel NFS server (no encryption, no checksumming)
versus POHMELFS, which performed SHA1 hashing and AES-128-CBC encryption of the whole
data channel.
Block size used in iozone test is 8KB, filesize - 8GB, 1GB of RAM.
/devel/fs :: Link / Comments ()
Sun, 06 Jul 2008
Multithreaded POHMELFS crypto processing.
Meanwhile, having a rest from various celebrations, I managed
to complete multithreaded crypto processing for the receiving path
in POHMELFS.
So far it was only tested in a debug environment (i.e. zillions
of logs and overall miserable performance), but it shows that
different threads pick up the work, both in the sending and receiving
directions.
There is a limitation though: the same crypto threads are used both
for the receive and transmit paths, so it is possible to saturate them
all with, for example, receiving, so that sending will stall. If there are
insufficient crypto threads, waiting for RX crypto processing can take
too long, so the transmit watchdog scanner will fire and complete transactions
with errors. One can work around this by specifying a big enough number of
crypto threads or a long enough transaction scanning timeout; both are provided
via mount options.
I would like to test it in a more production-like environment and perform various
stress tests on it, but I'm far from my workplace, so I cannot do it right now.
This means the release will be postponed until tomorrow (if testing does not show
regressions or bugs).
This will not be the last feature release though: for example POHMELFS does not support
extended attributes and ACLs, there is no header checksum (although there is a reserved
32-bit field), and there may be some features in different areas too,
but I am in no hurry to implement them, since I need something to put into future
POHMELFS changelogs. I think sending the same kernel patch with different words
about userspace server changes is not the way to go, so there should be some kernel
changes too :)
I will draw up some design notes on how I plan to implement POHMELFS server, and namely
how distributed facilities will be done, so far I have quite clear picture in mind,
but it needs to be worked out 'on paper' to find rough corners.
Stay tuned!
/devel/fs :: Link / Comments ()
Thu, 03 Jul 2008
POHMELFS crypto support has been completed.
kernel$ git commit -a
Created commit b07e3ed: Added crypto support.
9 files changed, 1534 insertions(+), 221 deletions(-)
create mode 100644 fs/pohmelfs/crypto.c
fserver$ git commit -a -m "Aded crypto support."
Created commit f916b2f: Aded crypto support.
3 files changed, 788 insertions(+), 94 deletions(-)
I implemented a pool of crypto processing threads (their number
is a mount option parameter), each of which has a pool of pages to
encrypt data into. A crypto thread is not released until the server
acknowledges that the data was successfully written, so one
should tune the number of threads and the page pool (the number of pages
in each thread is the maximum number of pages per transaction,
and this limit has its own mount option too) according to the desired behaviour.
Testing shows that write performance was noticeably reduced with this approach:
with 4 encryption threads and 4 receiving threads in the server,
performance dropped by around 30%, from 65+ MB/s down to 46+ MB/s,
but I think it can be improved with a larger number of encryption threads.
During the iozone write/rewrite test each of the 4 crypto threads ate about 20-30%
of CPU, while the server ate about 130% (4 threads in total). In all previous iozone tests,
the larger the number of userspace threads used, the worse the results were
(this is somewhat expected, since iozone is a single-threaded benchmark,
so a larger number of threads only leads to performance degradation),
so I will test different setups (namely a larger number of crypto threads
and a smaller number of server threads).
But this behaviour is not a problem, and I expect it to be tuned; the real
problem is read performance. Right now there is only a single thread
which reads from one socket: this was done intentionally, since reading
data from a socket is a longer operation than searching for a page in the radix tree
or any other operation performed by that thread, so there is no way
to saturate its capabilities. That is, until we start encryption, which is slow,
so any subsequent reading of data from the socket cannot be done in parallel
with crypto processing, and overall read performance drops to the ground.
This problem has to be fixed, so I plan to use the same crypto
processing threads to decrypt and/or perform the hash check for received data
and push it up to the VFS stack.
/devel/fs :: Link / Comments ()
Wed, 02 Jul 2008
POHMELFS crypto: feel incredibly stupid.
First, POHMELFS
does need to have encryption, because I plan to use the
distributed hash table approach in the server (well, consider the POHMELFS
kernel client as a kind of bittorrent filesystem client), and as in any
non-centralized system, content transferred via uncontrolled data channels
has to be encrypted.
But... I'm incredibly stupid: I implemented encryption and decryption in place,
i.e. the VFS page is encrypted prior to being written to the servers, so
a subsequent read leads to... Yes, it reads encrypted content.
To fix this issue I plan to encrypt data into different pages and send those,
leaving the VFS ones as is. There are two approaches I consider:
- allocate and send pages at writeback time: we want to send 5 pages, so allocate
5 pages, encrypt data into them and broadcast them to all needed servers.
- allocate a (potentially large) pool of pages at mount time per crypto thread
and encrypt data into them. This will have about zero run-time overhead for VFS,
except for write completion being slightly delayed by encryption.
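The in-place problem and the fix with separate pages are easy to reproduce in a toy model. This sketch uses a XOR with a constant as a stand-in for the real cipher and a hypothetical Page class as a stand-in for a page in the VFS cache; it only shows why the cached copy must not be touched.

```python
def toy_encrypt(data, key=0x5A):
    """Stand-in for a real cipher: XOR every byte with a constant."""
    return bytes(b ^ key for b in data)


class Page:
    """Stand-in for a page cached by VFS."""

    def __init__(self, data):
        self.data = bytearray(data)


def writeback_in_place(page):
    """Buggy variant: the cached page itself is overwritten by ciphertext."""
    page.data[:] = toy_encrypt(page.data)
    return bytes(page.data)           # this is what goes to the network


def writeback_with_copy(page):
    """Fixed variant: ciphertext goes into a separate buffer."""
    return toy_encrypt(page.data)     # the cached page is left as is


plain = b"hello"
bad, good = Page(plain), Page(plain)
sent_bad = writeback_in_place(bad)
sent_good = writeback_with_copy(good)
```

Both variants send the same bytes to the server, but only the second one leaves readable data in the cache for a subsequent local read.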
/devel/fs :: Link / Comments ()
Mon, 30 Jun 2008
Filesystem development rumors.
Rumor number one. SWsoft
aka Parallels actively searches for Linux kernel hackers in
leading Moscow universities, namely MSU and MIPT. I saw their
posters, where among other (wanted) requirements there is
distributed filesystem knowledge.
Rumor number two. Alexey Kuznetsov (if you do not know,
he is the guy who wrote the major part of the Linux network stack,
namely the TCP/UDP/IP and socket implementations, and although
there have been lots of changes in the stack since then, I think it will not
be an exaggeration to call him the author), who also worked
on Virtuozzo and OpenVZ (and its interesting VFS parts, which
AFAICS are not in the kernel, maybe yet), works on some
filesystem too. The last time we 'confronted' each other was a couple
of years ago, when I first implemented
netchannels
and tried to convince the network community (namely Alexey Kuznetsov
and David Miller)
that the netchannel idea is worth further investigation and implementation.
IIRC I did not succeed, although the results were very
impressive.
Let's see what will happen with filesystems :)
Rumor number three. SWsoft recently started to actively search
for a kernel hacker for a 'new interesting open source project'. They
always searched for kernel programmers, but never told anything
about the projects; now something has changed.
Rumor number four. OpenVZ and Virtuozzo have serious problems with NFS
(especially when the server dies), probably because of the very ugly NFS protocol
(yes it is), so it's hard to properly virtualize it (or not?). There are
no alternatives to NFS right now in major production setups, but you all know about
POHMELFS
which right now can be used as a really good replacement.
Rumor number five. SWsoft has a long history of PhD defences (at least in MIPT) based on a
theoretical FS called TorFS (namely Tormasov FileSystem). A year ago it was still
not a very alive project in practice,
but I heard that it was very impressive in theory. This rumor has existed
for really many years.
So, I have a quite clear picture that SWsoft has started development of a new
distributed filesystem, which is aimed first at replacing NFS in virtualized
environments. I can also imagine very interesting distributed parallel facilities
needed for virtualized systems. And they are trying to attract lots of people to the
project as well as really heavy artillery like Alexey Kuznetsov.
Which basically means that sooner or later my development will meet strong
competition from this company, which has lots of really good professionals.
And that's very interesting and cool :)
P.S. or it may be a complete bullshit and delirium of my fevered consciousness.
And one fact about
POHMELFS:
today I finished client support for padded crypto processing of all requests
and started to work out server bits, I expect to finish it in a day or around,
so new release is very close.
/devel/fs :: Link / Comments ()
Sat, 28 Jun 2008
Need to rethink POHMELFS crypto a bit.
1. Because of the encryption problem, data to be encrypted has to be
blocksize aligned, so some information about padding has to
be added into the network command as well as the crypto data size.
2. IV generation. I decided to extend the network command and put a
64-bit IV for the given packet there. Using a simple sequence number is enough
to protect against replay attacks.
3. Encryption/hashing of data. I decided not to encrypt/hash network headers,
and only do it for transmitted data. If a transaction contains several
commands, data for all commands will be encrypted/hashed; in case of a hash, a
single digest/hmac will be generated and placed into the transaction header.
4. It is possible that I will add a strong header checksum, which will be generated
only for the header and placed into a special field. It will be calculated
assuming the checksum field is zero. This step is optional so far, but the network header
has 32 reserved bits, which can be used for it.
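Points 1, 2 and 4 above can be put together in a small sketch. Everything about the layout here is an assumption made for illustration: the header fields and their order are hypothetical and are not the POHMELFS network command, zero bytes are used for padding, and CRC32 from zlib is only a stand-in for whatever "strong" checksum would actually fit into the 32 reserved bits.

```python
import struct
import zlib

BLOCK = 16  # cipher block size in bytes (as for AES)


def pad(data, block=BLOCK):
    """Zero-pad data to a multiple of the block size.

    Returns (padded data, number of padding bytes), so that the receiver
    can be told how much to cut off after decryption.
    """
    padding = (-len(data)) % block
    return data + b"\0" * padding, padding


# Hypothetical header: command, data size, padding, 64-bit IV, 32-bit checksum.
HEADER = struct.Struct("<IIIQI")


def build_header(cmd, size, padding, seq):
    """The sequence number is used as IV; checksum covers the zeroed field."""
    zeroed = HEADER.pack(cmd, size, padding, seq, 0)
    return HEADER.pack(cmd, size, padding, seq, zlib.crc32(zeroed))


def verify_header(raw):
    cmd, size, padding, seq, csum = HEADER.unpack(raw)
    zeroed = HEADER.pack(cmd, size, padding, seq, 0)
    return zlib.crc32(zeroed) == csum


payload, padding = pad(b"abc")
header = build_header(1, len(payload), padding, seq=7)
```

Since the checksum is calculated with its own field set to zero, the receiver can repeat exactly the same calculation on the header it got.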
Right now hashing and encryption work, but the results are not checked on the server (although generated);
because of the crypto alignment ugliness I decided to rethink the approach a bit.
Evolution process in action...
/devel/fs :: Link / Comments ()
Thu, 26 Jun 2008
POHMELFS server got initial crypto processing capabilities.
The POHMELFS server is able to handshake hash/cipher names and operation
modes, to initialize the appropriate algorithms and perform basic operations
(like a more generic hash_update() instead of different
functions with different arguments used to hash data depending on the operation mode,
either simple digest or hmac: EVP_DigestUpdate()/HMAC_Update()).
I'm working on the right way of doing crypto processing without serious changes in the code,
since how it is done right now is a bit hairy.
I already hate the OpenSSL API: EVP_get_cipherbyname(), EVP_MD_CTX, EVP_DigestFinal_ex().
It looks like the above functions were written by three different people who
never actually talked to each other about how to make them look similar... But it is
a minor issue of course.
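The idea of a generic hash_update() which hides the operation mode can be shown in a few lines. This is a Python analogy built on the standard hashlib and hmac modules, not the server's C code and not the OpenSSL API; the Hasher class and its method names are hypothetical.

```python
import hashlib
import hmac


class Hasher:
    """One interface for both operation modes: plain digest or HMAC."""

    def __init__(self, name, key=None):
        if key is None:
            self._ctx = hashlib.new(name)               # simple digest
        else:
            self._ctx = hmac.new(key, digestmod=name)   # keyed HMAC

    def update(self, data):
        """The caller does not care which mode was negotiated."""
        self._ctx.update(data)

    def final(self):
        return self._ctx.digest()


plain = Hasher("sha1")
plain.update(b"ab")
plain.update(b"c")

keyed = Hasher("sha1", key=b"secret")
keyed.update(b"abc")
```

The mode is selected once, at handshake time, and the data path only ever calls update() and final().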
So, when things are settled down, I will make a new release, likely it will see the light this week.
/devel/fs :: Link / Comments ()
Wed, 25 Jun 2008
POHMELFS input crypto processing engine is ready for testing.
But testing cannot be done without appropriate server support, which
is now the main task. POHMELFS uses a lazy crypto engine: each network state
(it represents a connection between the client and one server) contains
a number of fields used exclusively for semi-lockless input data processing
(it locks the state when it performs actual reading, but does not
hold that lock when processing incoming messages, since it is the only
path which receives data). Now it also has crypto information about
how to manage reply messages (they include the read page reply for example),
so it does not queue work to be done by crypto threads, but does it itself
instead. It may or may not be the bottleneck of the input path, tests will
provide the facts. So far I do not have plans to change it, but it can be done
of course if performance sucks.
After I finish crypto processing in both the client (it has been written, but requires lots
of testing with the server) and the server (I have just started to recall how to work with
OpenSSL. Well, I've read how HMAC works in OpenSSL, found it to be simple enough
and then started to read how to parse binary data in LISP :)
But anything which is interesting for me now ends up in good results for all other
projects), I will switch to something different for a while.
Some voices in the brain ask it to be spread in lots of interesting directions :)
/devel/fs :: Link / Comments ()
POHMELFS crypto performance.
I ran read/reread and write/rewrite tests as described
in previous run,
now with HMAC(SHA1) of all outgoing transactions (note that reading response data is not yet
encrypted and does not contain a digital signature; the server also supports neither operation).
Essentially only writing should be affected by this, but I also ran reading tests for completeness.
Results show zero performance overhead of the full data SHA1 hashing, but note that quite fast
machines were used (two 3Ghz Xeons (2 physical and 2 logical CPUs, HT enabled) with 1 GB of RAM). All the time only
two crypto threads were actively hashing data, since there are only two pdflush threads on this machine.


Writing is even faster with hashing, but results drifted around, so essentially performance is the same.
/devel/fs :: Link / Comments ()
Mon, 23 Jun 2008
POHMELFS client got initial part of multithreaded crypto/checksum processing.
So far it only includes encryption and hash calculation for outgoing
transactions. The system has a number of threads per superblock (a mount option),
which are responsible for encryption/hashing (each thread has its own crypto structure,
so there are no additional allocations in the fast path, although I think
they would not harm performance, since they should be a small enough
fraction on top of crypto processing overhead) and subsequent data sending,
so the original caller (like writeback/readahead code) will not block if there
are ready threads; otherwise it will wait until some thread finishes its current crypto work.
I decided to implement a kind of continuation for such transactions: network sending
code (which is supposed to be started after crypto processing) will be invoked from the threads
which performed crypto operations, without returning back to the original caller context.
For massively multiqueue NICs that should be a benefit, but so far I did not test its performance.
Next step is receiving crypto support and userspace changes.
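A rough userspace sketch of that continuation idea, in Python rather than kernel code. The crypto_worker and run_pool names, the queue-based hand-off and the send callback are all assumptions made for the illustration; the real client uses kernel threads and its own per-thread crypto structures.

```python
import hmac
import queue
import threading


def crypto_worker(work, send, key):
    """Hash each transaction, then send it from this same thread."""
    while True:
        data = work.get()
        if data is None:          # shutdown marker
            break
        digest = hmac.new(key, data, "sha1").digest()
        send(data, digest)        # continuation: no hand-back to the submitter


def run_pool(transactions, nthreads, key, send):
    # Bounded queue: the submitter blocks once nthreads items are waiting.
    work = queue.Queue(maxsize=nthreads)
    threads = [threading.Thread(target=crypto_worker, args=(work, send, key))
               for _ in range(nthreads)]
    for t in threads:
        t.start()
    for data in transactions:
        work.put(data)
    for _ in threads:
        work.put(None)            # one shutdown marker per worker
    for t in threads:
        t.join()
```

The submitter only queues data and goes on; hashing and sending both happen in the worker, so the sending order between different workers is not guaranteed.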
/devel/fs :: Link / Comments ()
Crypto processing in POHMELFS. OpenSSL vs GNU TLS.
If I did not miss something,
GNU TLS (I never worked with it)
supports a very limited number of ciphers and hashes, so it is not appropriate for
a filesystem data protection layer.
According to its
documentation
GNU TLS only supports AES, RC4 and 3DES ciphers and SHA1 and MD5 hashes. There is also only CBC
chaining mode and several hash/cipher schemes.
So, POHMELFS server will use OpenSSL for data protection. Sooner or later OpenSSL
will get hardware crypto support on Linux too (well, Linux crypto stack should first
implement userspace API, which does not exist yet, although there is a
work
by Loc Ho from AMCC to add such support).
So far I decided to implement the following protection scheme: checksum or encryption
will cover full transaction data, but will be applied by chunks:
- Transaction 'first-level' data, i.e. header and data immediately placed after transaction
header. For all commands except page writing that will be all.
- For write pages command, each header is generated dynamically and does not exist
until data is really being sent, so crypto code will run over all pages and update the checksum,
processing headers and data pages separately. Checksum update should be simple enough, since
there are crypto helpers to update and finalize checksum, but encryption is more complex:
it requires all chunks to be set up in advance in a single scatterlist chain, and with dynamic header
generation that is too big an overhead (it requires not only scatterlist allocation, but also
header allocation just for encryption), so encryption will be done separately for headers and pages,
and I will have to create some IV propagation scheme (like last bytes of previous unencrypted chunk
will become IV for the next chunk, or something like that). I understand that it may be not a very
secure approach though.
- Reading data back from server is simpler, since there are no transactions,
and data will be encrypted/checksummed like in the first step above. It is possible that this will
force me to increase the network header structure a bit (32 or 16 bits to store the size of the attached checksum).
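The checksum half of the scheme above (update per chunk, finalize once) can be sketched as follows. This is a Python illustration with the standard hmac module, not the kernel crypto helpers; sign_write_transaction and its arguments are hypothetical names, and the encryption/IV part is deliberately left out.

```python
import hmac


def sign_write_transaction(key, first_level, pages):
    """pages is a list of (header, page_data) pairs generated at send time."""
    ctx = hmac.new(key, digestmod="sha1")
    ctx.update(first_level)        # transaction header + inline data
    for header, page in pages:
        ctx.update(header)         # header exists only while sending
        ctx.update(page)
    return ctx.digest()
```

Since a hash is updated incrementally, the result is the same as hashing one big buffer with all headers and pages concatenated, so nothing has to be allocated in advance just for the checksum.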
/devel/fs :: Link / Comments ()
Thu, 19 Jun 2008
POHMELFS and HMAC/crypto operations.
As I found with
distributed storage
project, any communication channel which involves a huge amount of data transfer
has to have an additional strong checksum embedded in the protocol, since the TCP one is not
enough in some cases. There are some options, like TCP MD5 signatures or IPsec transformations,
but they are not always available.
POHMELFS
will include the ability to encrypt the whole data channel and/or only digitally
sign all messages. This will be implemented on transaction level, so no higher layer code
(like reading/writing data functions) will ever be affected.
POHMELFS will also have mount time self-configuration, i.e. the client will send to the server
information about supported capabilities, requested by the administrator, and if the server does not
support some of them (for example it can only do HMAC and not encryption, and both operations were
requested at mount time), they will be dropped (and, optionally, the mount will fail).
In the future it will be possible to extend it with additional flags if needed.
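A minimal sketch of such a negotiation, assuming capabilities are exchanged as a bitmask. The flag names and values, the negotiate function and its strict switch are invented for this example; the actual POHMELFS handshake format is not described here.

```python
CAP_HASH = 1 << 0      # hypothetical flag values
CAP_CIPHER = 1 << 1


def negotiate(requested, server_supported, strict=False):
    """Return the capability set both sides can use.

    With strict=True a missing capability fails the mount instead of
    being silently dropped.
    """
    agreed = requested & server_supported
    dropped = requested & ~server_supported
    if dropped and strict:
        raise RuntimeError("mount failed: server lacks capabilities 0x%x" % dropped)
    return agreed
```

Adding a new capability later only means defining one more bit; old servers simply never report it as supported.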
mount is not a very convenient command to transfer crypto information (like binary keys)
to the kernel, so I use the same infrastructure as for initial server group initialization (i.e. the
existing POHMELFS configuration utility).
Support for HMAC and encryption will force the server to depend on OpenSSL,
but I do not think it is a problem. At some future time I can write autoconfiguration, which will
allow compiling the server without crypto support (so that it does not accept encrypted clients and
does not check signatures) if there is no OpenSSL.
After crypto operations are implemented (I expect it to be finished this week), I will release, as promised, a
new netchannel
version (and will remove unneeded functionality like NAT), and add some interesting bits (like async
processing) into distributed storage,
so expect its new release soon too.
Stay tuned!
/devel/fs :: Link / Comments ()
POHMELFS, NFS, Ext4 and XFS in iozone benchmark. Graphs.
Hardware used in testing: 4-way Intel E7520 system (two logical and two physical CPUs)
3Ghz 32 bit Xeons with 1gb of ram, Adaptec AIC7902 Ultra320 SCSI adapter with SEAGATE
ST3300007LC 10k rpm 300 Gb testing disk. Its linear reading speed is about 90 MB/s.
Software used in testing: 2.6.25 kernels (on server and client), in-kernel async NFS server,
userspace POHMELFS server.
Tests were performed with 8gb files (amount of ram was reduced to 1gb to eliminate caching
influence) with different record sizes (from 8 to 1024 KB). I ran write/rewrite, read/reread and
random read and write tests.


/devel/fs :: Link / Comments ()
CRFS got metadata cache coherency support.
Zach Brown has
committed
cache coherency support into CRFS repository.
Cache coherency protocol works by broadcasting special messages from
server, and each client invalidates appropriate inodes (and dentries if needed)
before sending back a reply.
POHMELFS
uses a slightly different mechanism: the client does not send acks back to the server,
so all such messages are kind of advisory-only, but I did not yet complete (well,
I did not even think about this problem this week) locking design, so it can change.
The main problem with sync cache coherency support is its absolute non-scalability.
While a number of usage cases might require such behaviour, I expect that a noticeable, if not
major, part of users do not want performance degradation as a price for
posix-like coherency expectations. This approach is worse than a write-through cache,
since there is a whole round-trip of the cache coherency request instead of just
sending data during its writing. Single direction sending is faster than sending+waiting,
so for me it is still a questionable approach.
I will think a lot about this problem later this week(end), so that the solution would
satisfy both high-performance and safety camps (although only to some degree, I think).
/devel/fs :: Link / Comments ()
Fri, 13 Jun 2008
The latest iozone benchmark of POHMELFS, NFS, XFS and Ext4.
1Gb of RAM, 8Gb files. SEAGATE ST3300007LC 10k rpm 300 Gb on Adaptec AIC7902 Ultra320 SCSI adapter.
Performance in KB/s.
NFS:
KB       reclen    write  rewrite     read   reread  random read  random write
8388608       8    53210    57769    24304    24448         1360          4775
8388608      16    54577    57481    23871    24080         2592          7937
8388608      32    54736    56203    24015    24114         4738         12637
8388608      64    52075    54051    23653    23555         7610         18475
8388608     128    52307    54636    23305    23375        13017         26584
8388608     256    52189    53030    23585    23531        15615         34390
8388608     512    52938    54063    23709    23882        17524         42781
8388608    1024    57458    57006    24187    24292        29701         43892
POHMELFS:
KB       reclen    write  rewrite     read   reread  random read  random write
8388608       8    66473    63721    74232    74288         1103          4953
8388608      16    52604    62339    73423    74259         2001          8438
8388608      32    53278    62283    73497    74115         3360         13849
8388608      64    56931    61370    73135    74077         5076         21063
8388608     128    59419    62743    72736    74122         8068         30279
8388608     256    60861    63094    73284    74554        10848         38869
8388608     512    59438    62081    73329    74441        17290         48722
8388608    1024    62790    62130    73322    74100        27741         46470
POHMELFS write speed is about 10% faster, read speed is 3-3.5 times faster
(essentially the disk/local fs IO limit, see below).
POHMELFS random read speed is lower, and that is the task with the highest priority now,
especially compared to local FS results. POHMELFS random write is slightly faster than NFS.
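The write and read ratios quoted above can be rechecked from the two tables. A small Python sketch, assuming the comparison is a ratio of mean throughput over all record sizes (the speedup helper is just an illustration, not how the numbers were originally derived):

```python
# Sequential write/read throughput in KB/s, copied from the NFS and
# POHMELFS tables above (one value per record size, 8 KB to 1024 KB).
nfs_write = [53210, 54577, 54736, 52075, 52307, 52189, 52938, 57458]
pohmelfs_write = [66473, 52604, 53278, 56931, 59419, 60861, 59438, 62790]
nfs_read = [24304, 23871, 24015, 23653, 23305, 23585, 23709, 24187]
pohmelfs_read = [74232, 73423, 73497, 73135, 72736, 73284, 73329, 73322]


def speedup(new, old):
    """Ratio of mean throughputs over all record sizes."""
    return (sum(new) / len(new)) / (sum(old) / len(old))


write_ratio = speedup(pohmelfs_write, nfs_write)
read_ratio = speedup(pohmelfs_read, nfs_read)
print("write: %.2fx, read: %.2fx" % (write_ratio, read_ratio))
```

With these numbers the mean ratios come out at about 1.10 for writing and about 3.08 for reading.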
For comparison, local filesystem, used for tests.
mkfs.xfs -d agcount=75 -l size=64m /dev/sdc1;
mount -o logbufs=8,nobarrier,noatime,nodiratime,osyncisdsync /dev/sdc1 /mnt/:
KB       reclen    write  rewrite     read   reread  random read  random write
8388608       8    75124    60560    77672    77797         1860          5059
8388608      16    75044    60036    77754    77775         3601          8772
8388608      32    75958    62038    77593    77765         6821         14781
8388608      64    74728    59384    77688    77782        12475         23228
8388608     128    74889    59676    77731    77736        21734         32241
8388608     256    75022    59285    77676    77718        28833         40324
8388608     512    74885    59187    77653    77713        40013         48057
8388608    1024    74838    64217    77796    77765        55100         46104
And Ext4 to the group (mount options: rw,noatime,data=writeback,extents):
KB       reclen    write  rewrite     read   reread  random read  random write
8388608       8    72107    73017    77276    77335         1849          5015
8388608      16    72276    73849    77304    77287         3577          8666
8388608      32    72680    73647    77284    77326         6755         14394
8388608      64    71965    74287    77327    77288        12366         22513
8388608     128    72660    73864    77207    77343        21617         31160
8388608     256    72813    74058    77296    77338        28652         42003
8388608     512    72985    73317    77284    77343        40572         50619
8388608    1024    72184    74131    77264    77250        55649         50365
Nice graphs will be done when I write a Lisp (no less :) parser for it.
Stay tuned!
/devel/fs :: Link / Comments ()
New POHMELFS release: doing it wrong fast is at least better than doing it wrong slowly.
Via Ashleigh Brilliant and bits of Tullamore Dew.
Here we go, short changelog for this release:
- Read requests (data read, directory listing, lookup requests) balancing between multiple servers.
- Write requests are sent to multiple servers and completed only when all of them sent an ack.
- Ability to add and/or remove servers from the working set at run-time from userspace (via netlink;
the same command could be processed from the real network too, but since the server does not support it
yet, I dropped the network part).
- Documentation (overall view and protocol commands)!
- Rename command (oops, forgot it in previous releases :)
- Several new mount options to control client behaviour instead of hardcoded numbers.
- Bug fixes.
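The "completed only when all of them sent an ack" rule from the changelog above can be pictured with a tiny bookkeeping structure. This is a hedged Python sketch; WriteTransaction and its fields are hypothetical and do not mirror the kernel client's data structures.

```python
class WriteTransaction:
    def __init__(self, tid, servers):
        self.tid = tid
        self.pending = set(servers)   # servers that have not acked yet

    def ack(self, server):
        """Record one ack; return True once every server has acked."""
        self.pending.discard(server)
        return not self.pending
```

For example, a transaction sent to two servers stays pending after the first ack and is completed by the second one; a duplicate ack changes nothing.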
I will complete documentation in a few moments and send this release to the mail lists.
Very likely it is the last non-bug-fixing release of the kernel client side; the next release will incorporate
features needed for distributed parallel data processing (like the ability to add new servers via a network
command from other servers), so most of the work will be devoted to server code.
/devel/fs :: Link / Comments ()
Wed, 11 Jun 2008
Preparing for the next (last non-bug-fixing?) release.
Essentially that's it, I believe most of the features I wanted
from a network distributed parallel filesystem, which should live
in the client, are already implemented in POHMELFS.
The client has the following features (if I did not forget something interesting;
only those interesting from the parallel point of view are listed):
- Automatic failover reconnect to the same server.
- Run-time addition/removal of the servers from the working set
(only via userspace command, since server does not support that yet,
but addition is trivial).
- Coherent data and metadata cache
- Transactions support. Full failover for all operations. Resending transactions to different servers on timeout or error.
- Load balancing of reading (directory reading and lookups inclusive) requests and
simultaneous writing to all servers in current working set.
It is damn fast (but remember that random reading
is not yet optimal enough, and in
the last tests it was slower than NFS).
Meanwhile the userspace server does not support lots of features it has to support
to be called a complete parallel distributed solution, and the main work should now
be concentrated on it.
Main missing (and the most complex) features are:
- Distributed data coherency protocol like PAXOS for server data, stored on multiple machines.
- Ability to mirror data itself on multiple machines.
So, the release will likely see the light tomorrow or Friday.
/devel/fs :: Link / Comments ()
Fri, 06 Jun 2008
POHMELFS development status.
POHMELFS
got the ability to add/remove servers at run-time (not via a network command yet,
since I do not know how to test it, but via the netlink interface). The same
message can be passed via network though, so it will be simple to extend.
Also, POHMELFS got readahead support via ->readpages()
callback. I removed AIO reading from POHMELFS in favour of readahead
and got an excellent result in sequential reading: 3-3.5 times faster than NFS
and essentially reaching disk IO bandwidth (a bit less though),
but random reading dropped to miserable numbers.
Also, the rewritten reading method should provide better balancing between multiple servers,
but it will not show any benefit in the single-threaded
iozone benchmark, since it reads data via a single call to read(),
which gets sequential data access, which in turn is faster than network bandwidth.
So multithreaded load should greatly benefit from read balancing, but I did not
yet test that.
I ran sequential read/reread, write/rewrite and random read/write tests for
XFS, Ext4, NFS (over XFS) and POHMELFS (over XFS) with 1Gb of RAM and 8Gb
of test files (to eliminate VFS caching influence) with 8Kb to 1Mb record size.
Results exist in text files in standard iozone output format, but since I'm learning
LISP I decided to write a graph generator (via gnuplot) using my very basic
knowledge of this language, so nice graph results can take a while...
Also, tomorrow morning I will fly away to my friend's wedding and will only
return on Monday the 9th. I will not have internet access there, only lots of fun.
/devel/fs :: Link / Comments ()
Wed, 04 Jun 2008
Optimized POHMELFS transactions.
Now they eat less memory, and a single writing transaction can accumulate
up to 1024 pages. This can be further tuned, especially for small requests
mixed with sync. Currently a writing transaction is allocated for its maximum
size, and then page pointers are written to the allocated area, so
if the number of dirty pages requiring writeback is small, quite a lot of
space will be wasted.
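To get a feeling for the waste, here is a back-of-the-envelope Python sketch. The 1024-page limit is from the text above; the 4-byte pointer size (32-bit machines) and the assumption that only the page pointer array is over-allocated are mine.

```python
MAX_PAGES = 1024          # pages per writing transaction, as stated above


def wasted_bytes(dirty_pages, pointer_size=4):
    """Pointer-array bytes allocated but unused; pointer_size is assumed."""
    if not 0 <= dirty_pages <= MAX_PAGES:
        raise ValueError("dirty_pages out of range")
    return (MAX_PAGES - dirty_pages) * pointer_size
```

Under these assumptions a sync that flushes only 16 dirty pages leaves 4032 of the 4096 pointer bytes unused.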
It is a task for the next optimization; nevertheless, currently sequential
writing is only limited by disk throughput, or by network bandwidth in case of
multiple servers, since the link
is shared between machines, so effective bandwidth becomes equal
to GigE/number of servers, or about 60 MB/s in my environment with two servers
and a single client.
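The GigE/number of servers estimate is simple arithmetic; a short Python sketch, assuming the raw 1 Gbit/s line rate with no protocol overhead (the function name is made up for the example):

```python
def per_server_bandwidth(link_mb_s, servers):
    """Client link bandwidth left for each copy of the written data."""
    if servers < 1:
        raise ValueError("need at least one server")
    return link_mb_s / servers


GIGE_MB_S = 1000 / 8      # 1 Gbit/s raw, before protocol overhead
print(per_server_bandwidth(GIGE_MB_S, 2))
```

For two servers this gives 62.5 MB/s before overhead, in line with the "about 60 MB/s" figure.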
Also, the reading path was not changed at all (only transaction
internals) - there is still no readahead
and a new transaction is allocated for each page to be read. Nevertheless,
see how reading was improved: POHMELFS not only outperformed NFS again,
but reached the disk bandwidth limit already for 16Kb requests (almost two
times faster than NFS). The table shows IO throughput in KB/s.
KB       reclen    write  rewrite     read   reread  random read  random write
8388608       8    74058    68392    40130    79509        43588          4818
8388608      16    62332    66978    73714   122074        42160          8434
8388608      32    64775    67073   109357   171139       145416         14183
8388608      64    66962    66602   147350   217323       227962         22257
8388608     128    67724    67133   185574   266855       321060         32681
8388608     256    68233    67922   201591   283567       474657         40944
8388608     512    68339    66514   213513   295995       646897         50303
8388608    1024    67744    67384   220858   297748       676582         48796
I will create nice graphs out of these tables and will also include
optimized reading tests (tomorrow likely) and two data server results.
What also should be done is testing with either bigger files or a smaller
amount of RAM and thus a smaller VFS cache size. As you saw in all tests, when
lots of reads start to hit the cache, the picture becomes completely non-informative
for filesystem behaviour. So I want to limit all three testing machines
to 1Gb of RAM (booting with mem=1G parameter) and perform the same iozone
bench for an 8Gb file. Results should be more realistic.
In parallel I will implement the userspace run-time server addition/removal
command, which will also be used as-is for a network message from one
or another server, connected before. With optimized reading transactions
it will be a good ground for the next POHMELFS release. So I plan to schedule
it for Thursday or the middle of the next week, since I will be on a small vacation
Jun 6-9.
/devel/fs :: Link / Comments ()
Mon, 02 Jun 2008
AppArmor and path-based security approaches vs object bound policies.
- So again, can you offer an alternative?
- Just give up on this dumb idea completely.
It is not about AppArmor in general (although maybe about it too), but about security hooks which provide
path information to inode callbacks. There are pros and cons to this decision,
but it looks like path based security hooks will not be accepted.
There is a really trivial way to fix it. No kidding, it is simple: create your own
name cache and do not bind it to dentries, but instead index it by inode number.
This allows you to have whatever callbacks and information you want in strictly
bound VFS operations. Need to have path info in ->inode_create()?
Put it into your own tree indexed by inode number for the parent inode, look up that data in
the security hook and make a decision. Yes, it is slower, but active security was never
a fast solution. It is still against the rules others created for security based
systems, but formally it is still within all the boundaries of the created (maybe ugly
for someone) interfaces.
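A toy model of such a name cache, as a Python sketch rather than kernel code. The NameCache class and its methods are hypothetical, and it keeps a single name per inode, so hard links are ignored.

```python
class NameCache:
    """Maps inode number -> (parent inode number, name)."""

    def __init__(self, root_ino):
        self.root = root_ino
        self.entries = {}

    def add(self, ino, parent_ino, name):
        self.entries[ino] = (parent_ino, name)

    def remove(self, ino):
        self.entries.pop(ino, None)

    def path(self, ino):
        """Rebuild the full path, or return None if some link is unknown."""
        parts = []
        seen = set()
        while ino != self.root:
            entry = self.entries.get(ino)
            if entry is None or ino in seen:   # unknown inode or a loop
                return None
            seen.add(ino)
            ino, name = entry
            parts.append(name)
        return "/" + "/".join(reversed(parts))
```

A security hook that only gets an inode (or a parent inode plus a name) could then call path() to obtain the full name and make its decision, at the price of one extra lookup per path component.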
And I will not point to a project which already uses such an approach in a different area
though :)
It is interesting to implement your ideas not by breaking something (although sometimes
it is needed, but that's likely an exception or when you are hacking a deeply internal kernel
part), but instead by hacking around existing limitations.
/devel/fs :: Link / Comments ()
As promised, let's see shadowed miserable POHMELFS results.
Usually you will not see bad benchmark results for a developing
technology, but any such result is actually a _very_ good result
for a work-in-progress and not yet completed system. It allows one
to see how new proof-of-concept code compares
with an already completed, tuned and optimized system.
Conclusions from such tests result in really superior decisions.
Let's compare iozone read/reread, write/rewrite and random
read and write for POHMELFS and NFS with 8Gb test files and
different record sizes (from 8Kb to 1Mb) on XFS over the GigE link.
I described hardware and local iozone benchmark results in detail
previously.
Now it's time for network tests.
Async NFS in-kernel server results.
KB       reclen    write  rewrite     read   reread  random read  random write
8388608       8    60969    57743    39705    97031       464898          5160
8388608      16    59925    57402    39045    98269       641388          8827
8388608      32    58094    55263    39075    94654       775064         14389
8388608      64    58168    57156    40306    98639       868796         22360
8388608     128    58908    56573    40392   100018       941509         33211
8388608     256    59444    56446    40842   102503      1030451         41576
8388608     512    60280    57686    39835    97879      1042570         49858
8388608    1024    60817    57886    40886    96646       851175         47993
And now POHMELFS results.
KB       reclen    write  rewrite     read   reread  random read  random write
8388608       8    70073    64232    12518    14817        40334          5079
8388608      16    63984    67948    31976    19106        41462          8702
8388608      32    67250    63440    47506    38657        75908         14357
8388608      64    69970    66198    41899    29566       136294         21385
8388608     128    69838    68523    76232    33971       222909         30946
8388608     256    70012    66439    69125    58223       330886         40685
8388608     512    70946    68291    76460    58738       428881         51001
8388608    1024    70985    64958    76317    59561       421973         48531
Sequential writing is 10-15% faster for POHMELFS (and limited by underlying
fs speed), while random writing
is essentially the same and is limited by disk speed. But sequential reading
is _much_ worse for small requests. The reason is simple: POHMELFS does not support readahead,
since it does not have the ->readpages() callback, so any
sequential access ends up with a set of ->readpage() callbacks,
each of which waits for its completion, which is slow, so currently readahead
is not invoked from the reading path.
I could not resist highlighting that big
sized requests are 1.5-2 times faster for POHMELFS than NFS :)
and are also limited by the underlying filesystem.
One can note that
NFS random reading results are actually better than local filesystem behaviour,
and very noticeably so. Why does a local filesystem behave worse than
when mounted via NFS in random reading?
I believe that's because in the network case we actually have double buffering:
on the client, where the most active pages are in RAM, and on the server, where
readahead populated pages, which are not active (since active pages are being
read from the client's cache, so they will be evicted from the server's page cache,
since the client will not try to read them from the server), but those server pages
which are not active currently will be accessed soon by the client, when it reads
the next portion of the random data, and it will be a very fast access to RAM.
So we have a really good caching scheme, where the most actively used pages are
in client RAM, and they are flushed to disk on the server, and instead the server populated
other less active pages via readahead.
This reading behaviour is just a result of the not yet completed VFS callback implementation
of POHMELFS. With ->readpages() in place it will be faster than
NFS even in this bench. Also POHMELFS has multiple-server parallel read balancing and
simultaneous writing to them, but there are no results yet.
I already created a mental model of the optimized read and write transactions (based
on memory pools for the maximum OOM-robustness and small memory usage overhead), so
in a day or two it will be implemented in code.
Stay tuned, now it's time for excellent POHMELFS results!
/devel/fs :: Link / Comments ()
Fri, 30 May 2008
Local filesystem random read/write performance. POHMELFS parallel testing.
I promised to publish POHMELFS parallel processing results yesterday,
even if they are miserable. Unfortunately there are no interesting results
at all. In the released version POHMELFS is 32bit only, since it does
not have a special ->open() callback which forces files to be opened
with the O_LARGEFILE flag to support sizes of more than 4Gb (actually only 2Gb,
since the kernel uses signed size_t, which is only 31 bits large), and the
superblock maximum size is set to 32 bits,
so all 32 bit results are not very interesting: having 2Gb/s random
read speed is a really stupid statement, since all reading happened from the cache.
While results with more than 2Gb are... Let me first show you how XFS and Ext3 behave
in case of random writes.
A short preface.
Hardware used in testing: 4-way Intel E7520 system (two logical and two physical CPUs)
3Ghz 32 bit Xeons with 8gb of ram, Adaptec AIC7902 Ultra320 SCSI adapter with
SEAGATE ST3300007LC 10k rpm 300 Gb testing disk. Its linear reading speed
is about 90 MB/s. Dmesg:
scsi0 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 3.0
<Adaptec AIC7902 Ultra320 SCSI adapter>
aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 67-100Mhz, 512 SCBs
scsi 0:0:2:0: Direct-Access SEAGATE ST3300007LC 0003 PQ: 0 ANSI: 3
target0:0:2: asynchronous
scsi0:A:2:0: Tagged Queuing enabled. Depth 32
target0:0:2: Beginning Domain Validation
target0:0:2: wide asynchronous
target0:0:2: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS RDSTRM RTI WRFLOW PCOMP (6.25 ns, offset 63)
target0:0:2: Ending Domain Validation
Kernel version is 2.6.25 (and 2.6.24 for the first ext3 test).
I used two such machines as servers for iozone
read/reread, write/rewrite and random read/write testing. File size is limited to 8Gb only,
since it is the only interesting fair case, record size varies from 8Kb to 1Mb.
Before I started 8Gb POHMELFS testing, I decided to check how local filesystems behave in such a scenario.
XFS was tuned this way: (mkfs.xfs -d agcount=75 -l size=64m /dev/sdc1;
mount -o logbufs=8,nobarrier,noatime,nodiratime,osyncisdsync /dev/sdc1 /mnt/)
Ext3 was created and mounted with default options, on a machine with only 4Gb of RAM though.
So, testing.
Here is the results table from iozone (before I interrupted it) with read/reread, write/rewrite
and random read/write tests for XFS (either default, or tuned like on the link above).
KB       reclen    write  rewrite     read   reread  random read  random write
8388608       8    73671    64052    77565    80107        35281          5085
8388608      16    74437    66095    77611    80065        66854          8808
8388608      32    74683    66780    77564    80202       121442         14576
8388608      64    74936    66908    77537    80372       215377         22583
8388608     128    74928    68598    77542    80247       339304         32280
8388608     256    73609    69615    77534    80143       365081         40571
8388608     512    73763    69830    77547    80317       420704         48501
8388608    1024    73940    69474    77602    80065       406266         47295
I.e. 5 MB/s random write speed for 8kb record!
Do you really want to know ext3 speed? Pregnant kids and women should skip next paragraph.
I interrupted the test after almost 2 (!) hours of random writing
of an 8Gb file with 8Kb records on default ext3. The test was not completed and I do not really
know its performance (note that this machine has only 4Gb of RAM; other hardware details were
described above), but it will be less than 1 MB/s.
Ext4 behaves much better in this aspect (mount options: rw,noatime,data=writeback,extents):
KB       reclen    write  rewrite     read   reread  random read  random write
8388608       8    69593    74200    77324    81340        35538          5088
8388608      16    66745    70038    73676    77271        65715          8704
8388608      32    68253    70320    73652    77258       121690         14469
8388608      64    68421    71291    73653    77042       209629         22005
8388608     128    68438    71340    73658    76988       332021         30381
8388608     256    68921    71254    73651    76912       435586         40683
8388608     512    69079    71728    73551    76815       549136         49298
8388608    1024    66611    71217    73683    76581       552459         49220
POHMELFS results are coming...
/devel/fs :: Link / Comments ()
Wed, 28 May 2008
POHMELFS got read balancing between multiple servers and simultaneous write to them.
I hate laziness, but sometimes drop into that hole... So the last couple of days
I just stupidly wasted my time (well, I read Lisp and failed to find a GTK binding for CLISP,
made some code and a kernel bug fix, but that does not count).
Today laziness started to be really boring, so I made some small progress in
POHMELFS
parallel processing.
It got the ability to send transactions to multiple servers by default and balance reading
between them (so far it always reads from the first server, and in case of error it switches
to the second, but it is trivial to change). This was implemented via special routes for each
transaction, which are stored per network state, so if one of the servers did not answer,
we would not resend data to others. It also makes trees smaller, which should allow faster
reading in case of lots of pending writing transactions.
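One way the "trivial to change" part could look is sketched below in Python: plain round-robin between servers with failover to the next one on error. This is an assumption about a possible policy, not the code in the client; ReadBalancer is a made-up name and servers are modelled as plain callables that raise OSError when they are down.

```python
class ReadBalancer:
    def __init__(self, servers):
        self.servers = list(servers)   # callables: request -> reply
        self.next = 0                  # index of the server to try first

    def read(self, request):
        """Try servers round-robin; move on to the next one on error."""
        n = len(self.servers)
        for attempt in range(n):
            server = self.servers[(self.next + attempt) % n]
            try:
                result = server(request)
            except OSError:
                continue
            self.next = (self.next + attempt + 1) % n
            return result
        raise OSError("all servers failed")
```

Each successful read advances the starting point, so consecutive reads are spread over the working set and a dead server is simply skipped.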
The code is in the testing stage currently; I will complete read balancing tomorrow and test it against
multiple servers on different machines, when data is placed on disk, so that random access
would be slow. Having two servers I expect to get a linear speed increase. If the test is disk
IO bound, it is possible to add multiple servers on the same machine, so that each server would
run on its own disk (I have two reasonably fast SCSI disks on each testing machine).
Results will be published here of course (well, even if they are miserable :).
/devel/fs :: Link / Comments ()
Sun, 25 May 2008
New POHMELFS release. Full transaction support. Data and metadata cache coherency.
Irish Tullamore Dew helped this
POHMELFS
release to see the light.
Short changelog:
- Full transaction support for all operations (object creation/removal, data reading and writing).
Data reading transactions are not optimal yet and will be improved in the next release (although fast).
- Data and metadata cache coherency support. More details on how this is implemented
one can find in appropriate
section.
- Transaction timeout based resending. If given transaction did not receive reply after specified
timeout, transaction will be resent (possibly to different server).
- Switched writepage path to
->sendpage() which improved performance and robustness
of the writing.
- Preliminary support for parallel data processing. Code to write data to multiple servers in parallel
and balance reading between them was imported, but is not used right now.
- Fair number of bugfixes.
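The timeout based resending item from the changelog above could be modelled like this. A Python sketch only: PendingTransactions, its methods and the explicit now argument are assumptions made for the illustration, and picking the next server in the list is just one possible policy.

```python
class PendingTransactions:
    def __init__(self, servers, timeout):
        self.servers = list(servers)
        self.timeout = timeout
        self.pending = {}          # tid -> [deadline, server index]

    def sent(self, tid, now, server_idx=0):
        self.pending[tid] = [now + self.timeout, server_idx]

    def completed(self, tid):
        self.pending.pop(tid, None)

    def expire(self, now):
        """Return (tid, server) pairs that must be resent at time `now`."""
        resend = []
        for tid, entry in self.pending.items():
            if entry[0] <= now:
                # No reply in time: switch to the next server and rearm.
                entry[1] = (entry[1] + 1) % len(self.servers)
                entry[0] = now + self.timeout
                resend.append((tid, self.servers[entry[1]]))
        return resend
```

A periodic timer would call expire() and resend whatever it returns; a reply calls completed() and removes the transaction from the table.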
Next release is scheduled for the beginning of the next month, and will likely include following features:
- Improved reading transactions.
- Server redundancy extensions (ability to store data in multiple locations according to regexp rules,
like '*.txt' in /root1 and '*.jpg' in /root1 and /root2).
- Client parallel extensions: ability to write to multiple servers and balance reading between them.
Code was imported to the current version, but not enabled yet.
- Client dynamical server reconfiguration: ability to add/remove servers from working set by server command
and from userspace.
- Start generic server distribution development.
As usual one can grab the latest source from
archive or
GIT tree.
/devel/fs :: Link / Comments ()
Sat, 24 May 2008
This was supposed to be POHMELFS release day.
But no, it is scheduled for tomorrow because of the very interesting way I decided
to implement reading transactions. The way it works right now is quite miserable,
so I want to clean things up and make a really good patch.
Page reading code will create a single transaction for a bunch of pages and will schedule
the next one if pages are not yet received, instead of waiting for the transaction to be completed,
and only wait at the very end (if needed). With the addition of
async copy
from the receiving kernel thread into the reading userspace via copy_to_user() (in todo),
this will become the fastest possible way of doing reading over the net, I think.
So far changelog contains following items:
- Full transaction support for all operations (object creation/removal, data reading and writing).
Data reading transactions are not optimal yet and will be improved in the next release.
- Data and metadata cache coherency support. More details on how this is implemented
one can find in devel
section.
- Transaction timeout based resending. If given transaction did not receive reply after specified
timeout, transaction will be resent (possibly to different server).
- Switched writepage path to
->sendpage() which improved performance and robustness
of the writing.
- Fair number of bugfixes.
/devel/fs :: Link / Comments ()
Wed, 21 May 2008
iput() locking in POHMELFS.
iput() is a very tricky call in Linux VFS,
besides the fact that it drops inode when its reference counter
reached zero, it also waits until all associated pages are
flushed to storage too.
POHMELFS uses a single thread per network state (network connection structure),
which only reads async replies from the server, so it is possible
that a reply which requires iput() (for example a create command
reply) will happen in parallel with object removal, so the inode will be deleted
but not yet freed. When the reply is received and iput() is called,
it will try to free the inode and wait until all pages associated with its mapping
are synced. But page sync happens on reply to another command (consider for
example several writeback transactions), which can not be processed, since the thread
is waiting for them to be completed. This problem can not be fixed by introducing
multiple threads, since each one can be in exactly the same situation simultaneously.
So we should not grab the inode and free it in the receiving path.
This is ok for writeback transactions, since inode can not be freed until pages are synced,
so just by holding pages we are able not to lock, but object creation for empty files
or directories does not have pages attached, so they have to be synced with special
transaction. There still can be a problem with empty file though - some pages can be
attached and it can be removed while system waits for creation transaction complete,
but actually we do not need to know about that - we should not grab inode at all,
since transaction already contains all needed info, namely inode number, so we can lookup
inode (if it still exists) and mark it as created without need for lock-prone grab/put.
This bit took me last three days, during which POHMELFS moved to non-blocking receiving and
timeout-based sending (and returned back), it got scanning 'watchdog' which resends transactions
if they were not acked after some time and eventually drops them if they still do not get
a reply, POHMELFS got couple of new operations supported and likely something else to existing set
of features implemented to date (full transaction support for all operations
and data and metadata coherency protocol were added for the next release).
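The scanning 'watchdog' mentioned above can be sketched in a few lines. This is only an illustration in Python, not the kernel code: the class name InFlight, the max_resends limit and the explicit 'now' argument are assumptions made for the example, the post only says that unacked transactions are resent after some time and eventually dropped.

```python
import itertools


class InFlight:
    """Toy in-flight transaction table scanned by a watchdog (illustrative only)."""

    def __init__(self, timeout, max_resends):
        self.timeout = timeout          # seconds without a reply before resending
        self.max_resends = max_resends  # assumed limit, not taken from POHMELFS
        self.pending = {}               # transaction id -> [last_sent_time, resend_count]
        self._ids = itertools.count(1)  # each new transaction gets a larger id

    def send(self, now):
        tid = next(self._ids)
        self.pending[tid] = [now, 0]
        return tid

    def ack(self, tid):
        # Reply arrived: the transaction is complete and can be forgotten.
        self.pending.pop(tid, None)

    def scan(self, now):
        """One watchdog pass; returns (resent_ids, dropped_ids)."""
        resent, dropped = [], []
        for tid in sorted(self.pending):        # oldest ids first
            sent_at, count = self.pending[tid]
            if now - sent_at < self.timeout:
                continue                        # still within its timeout
            if count >= self.max_resends:
                del self.pending[tid]           # give up on this transaction
                dropped.append(tid)
            else:
                self.pending[tid] = [now, count + 1]
                resent.append(tid)
        return resent, dropped
```

A caller would invoke scan() periodically from a dedicated thread and actually resend (possibly to a different server) every id returned in the first list.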
New release is scheduled for the end of the week, and there is no readpage transaction support yet...
So, stay tuned!
/devel/fs :: Link / Comments ()
Sat, 17 May 2008
POHMELFS got full data and metadata cache coherency support. Transaction support for majority of the commands.
linux-2.6.pohmelfs$ git-diff-tree -r --stat 21549d0a101 master
fs/pohmelfs/dir.c | 108 ++++++--------------
fs/pohmelfs/inode.c | 279 ++++++++++++++++++++++++++++++++++++--------------
fs/pohmelfs/net.c | 216 ++++++++++++++++++++++++++++++---------
fs/pohmelfs/netfs.h | 43 +++++++-
fs/pohmelfs/trans.c | 55 +++++++++-
5 files changed, 484 insertions(+), 217 deletions(-)
It was rather simple task due to async event processing support.
Each time client creates, reads or writes object to server, information about
its interest is stored on server. When any other client updates the same
object (like changing attributes or writes data), all interested clients
get notifications with new data (new attributes, or in case of writing
possibly new size and flag, which page has to be fetched from the server,
since it is not valid anymore). Writing happens during writeback as before,
so commands like "echo Some_message > /mnt/file" immediately
sync size of the file to zero and after some time write there actual data,
when system decides to start writeback.
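The server-side bookkeeping described above can be sketched as follows. It is a simplified Python model, not the real server: the InterestRegistry name, the string client ids and the dictionary used as a notification are all made up for illustration.

```python
from collections import defaultdict


class InterestRegistry:
    """Toy model of per-object interest tracking on the server (illustrative only)."""

    def __init__(self):
        self.interested = defaultdict(set)   # object path -> set of client ids

    def touch(self, client, path):
        # Client created, read or wrote the object: remember its interest.
        self.interested[path].add(client)

    def forget(self, client, path):
        # Client dropped the inode locally and is no longer interested.
        self.interested[path].discard(client)

    def update(self, writer, path, change):
        """Record the writer's interest and return notifications for everyone else."""
        self.touch(writer, path)
        return {c: (path, change) for c in self.interested[path] if c != writer}
```

For example, after clients c1 and c2 both touched '/test', update('c1', '/test', {'size': 0}) returns a single notification addressed to c2.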
Also ported all commands but one to transaction mechanism, which means
they all will be resent if currently active network connection goes down.
Although most of the commands are not synchronous, and thus will not be resent after
timeout, this can be trivially changed if there will be major demand on that.
Only reading has not yet been ported to transaction model, which is a next task
to complete. These transactions have to be synchronous, since we do want to read
data, while do not actually care about full directory content.
These changes have to be seriously tested and all problematic places resolved.
For example, they slow metadata operations noticeably, since now system
sends a message each time new object is created: kernel archive
untarring now takes about 5 seconds against previous 2-3 including sync
on 4-way machine with 8 GB of RAM. It is still not comparable to 30+ seconds
for async NFS, but it has to be investigated further.
After full move to transaction model and cache coherency testing (that model
may be not complete for some usage, since locks are not yet supported),
POHMELFS
will make its first steps into distributed area...
Stay tuned!
/devel/fs :: Link / Comments ()
Fri, 16 May 2008
Metadata cache coherency support in POHMELFS.
Client:
$ ls -lai /mnt/test
3 -rw-r--r-- 1 root root 94208 2008-05-16 22:27 test
$ sudo chown zbr.zbr /mnt/test
$ ls -lain /mnt/test
3 -rw-r--r-- 1 2319 1002 94208 2008-05-16 22:27 /mnt/test
Server:
fserver_get_client_data: thread: 3085847440, cmd: 8, id: 0, start: 2, size: 94, ext: 0.
fserver_transaction: thread: 3085847440, trans: 0, size: 94, sub: cmd: 10, id: 3, start: 0, size: 70, ext: 6.
fserver_inode_info: path: '/test', size: 94208, mode: 100644, uid: 2319, gid: 1002.
So, server now contains all metadata information about updated object on client,
pohmelfs_setattr() is synchronous for remotely read inodes
and for already synced inodes, created originally locally. It does nothing,
if object is not yet synced to server, since syncing will provide that info
itself.
The only missing thing is to asynchronously broadcast that data to other clients, which requires
creating a cache of objects a given client is interested in. Each client will be automatically
added into group of interests when it looks up an object, so when attribute for given object is being
set, update will be sent to interested parties. Client will be dropped from group of interests when
it drops appropriate inode locally (which will force sending a special message).
/devel/fs :: Link / Comments ()
Thu, 15 May 2008
POHMELFS distributed plans.
After healthy discussion
started after my announcement of the second POHMELFS release,
it's time to highlight main ideas settled in the thread.
First, POHMELFS will be moved into parallel distributed filesystems, but still
being very good as network filesystem. In particular, that will include ability
to read data from one of the connected servers (not necessarily from the currently active one,
as it is done right now), writing will happen to all connected servers simultaneously
(and transaction will be committed after all servers returned completion acknowledge).
Protocol will be extended to support dynamic addition and removal of the servers to/from
currently connected group. Probably there will be some kind of a status messages for servers
(i.e. going offline, do not send me data, or I'm becoming slow, do not read from me
and so on). It will be done in addition to cache coherency messages (I'm yet to implement,
but because of other tasks, this was a bit postponed, probably to weekend), which
will include two types of requests: page invalidation and inode update (that will
also mean that POHMELFS will start supporting attributes (maybe even extended),
right now it doesn't :). Such cache coherency protocol should scale better
than classical MOSI (and its derivatives) and particularly better than what pNFS spec
provides (leases to operations for some servers), since it is still possible to work in
parallel with the same file, especially without any overhead if data processing
does not cross different client boundaries, but it has to be tested in practice.
POHMELFS server will be extended to support distributed facilities. Very likely it will
be some kind of PAXOS algorithm, although probably in its very limited mode for the beginning.
So far it will be really simple, so that I could touch all its corner cases and find
the optimal development strategy.
All client extensions are rather not that complex, although not always trivial,
so that should not take too much time, so probably you will get something interesting
soon.
Server extensions will be a bit slower, since I will start essentially from the distributed
system ground and gradually move upstairs.
/devel/fs :: Link / Comments ()
Tue, 13 May 2008
New POHMELFS release. Transactions, performance, failover.
Irish Jon Jameson (6 years of experience, really good stuff)
brings us this new POHMELFS release.
Main features include:
- Fast transactions. System will wrap all writings into transactions, which will
be resent to different (or the same) server in case of failure.
- Failover. It is now possible to provide number of servers to be used in round-robin
fashion when one of them dies. System will automatically reconnect to others and send
transactions to them.
- Performance. Super fast (close to wire limit) metadata operations over the network.
By courtesy of writeback cache and transactions the whole kernel archive can be untarred in 2-3 seconds
(including sync) over GigE link (wire limit! Not comparable to NFS).
The nearest roadmap includes:
- Full transaction support for all operations (only writeback is guarded by transactions currently,
default network state just reconnects to the same server).
- Data and metadata coherency extensions (in addition to existing commented object creation/removal messages).
- Server redundancy.
One can check out POHMELFS homepage
for more details. You can download latest release (against 2.6.25 kernel tree) from
archive or
GIT tree.
/devel/fs :: Link / Comments ()
Mon, 12 May 2008
Fast POHMELFS transactions.
With new transactions and new waiting mechanism (see below)
system now untars the whole kernel tree in less than 3 seconds
over the GigE link (including subsequent sync, which
takes less than second always), while async NFS (remote side is tmpfs in both cases)
performs that in a bit more than 30 seconds.
In addition POHMELFS write speed is 125 MB/s (wire limit) vs. less
than 90 MB/s in NFS (dd from /dev/zero
with 1 MB block size and 1000 blocks).
That's what I call a good result.
Transaction mechanism invoked in writeback path is now completely
async too, i.e. it does not wait until remote side confirms that
transaction was received and processed, but writeback does not drop
transactions after sending function returned, instead it stores it
in the in-flight storage and proceeds with the next one.
Transaction can accumulate up to 90 pages in a single frame.
When reply is received, async thread searches for given transaction and
completes it (unlocks page, although it can be done in writeback,
since page is being copied, cleans up writeback bits, drops it from
appropriate radix tree and drops reference counter). If transaction
was not sent due to some error, system will try to send it to different
servers; if some error was returned from the server, it will be resent
to different ones. Since original writeback path does not know about
transactions in-flight anymore, any timeout has to be checked by
dedicated thread (or workqueue), which will detect too old transactions
(by simply checking them from the beginning, since each new transaction has
increased id) and resend them to remote servers.
There is a small problem though - if object size is more than single
transaction can accumulate (90 pages), it will be split into several
transactions, where first one will contain object creation command
and some data to be written, while others will contain only data.
If server runs multiple threads per client (default is one though),
it is possible that not first transaction will be processed first,
so server will write some data into non-existent file, so transaction
will fail. There are two ways to fix this issue: either wait in writeback
on client while creation transaction is completed, and then send all others
like described above, or add creation command into every subsequent transactions
until object is created on the server (special bit is set on local inode
in that case). Likely the latter is better case.
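The second option (creation command in every transaction until the object exists on the server) can be sketched like this. It is a Python illustration only: the function name, the dictionary layout and the created_on_server flag are assumptions, only the 90 page limit comes from the text above.

```python
MAX_PAGES = 90  # per-transaction page limit quoted above


def split_into_transactions(num_pages, created_on_server=False, max_pages=MAX_PAGES):
    """Split an object of num_pages pages into transaction descriptors.

    Every descriptor carries a 'create' flag until the object is known to
    exist on the server, so the server can process them in any order.
    """
    transactions = []
    for start in range(0, num_pages, max_pages):
        count = min(max_pages, num_pages - start)
        transactions.append({
            "create": not created_on_server,  # repeat creation until acked
            "first_page": start,
            "pages": count,
        })
    return transactions
```

With this scheme a 200 page object becomes three transactions of 90, 90 and 20 pages, each one able to create the file if it happens to be processed first.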
/devel/fs :: Link / Comments ()
Wed, 07 May 2008
Fast transactions in POHMELFS.
POHMELFS
just switched to faster transactions allocated one-by-one with
even smaller overhead (although it does not use kernel_sendpage()
for page sending yet, it copies data).
System does not serialize after all transactions are completed
(it waits after each one), but with
new transaction allocation it is 1.5 times faster: 98MB/s vs. 64MB/s,
note that without waiting for transaction completion it gets full wire speed of 125MB/s
with 1500 byte MTU. And it is with highmem pages and thus slow kmap()
of each one, and unmap after completion. I do not use ->sendpage()
since it will force to split proper set of iovecs into mixed
calls of kernel_sendmsg() and kernel_sendpage(),
which I want to avoid so far. Now it is (again) faster than NFS, but I want to move further.
So, solution is rather trivial: wait until several transactions
are completed. There is the whole infrastructure already there - in-flight transaction
storage, per-transaction completion and destruction callbacks, proper reference counting
and async completion.
Still only writing transactions are used (i.e. reading/lookup and others will not
be redirected to different servers).
There are some bugs of course, but that's the first development version after all.
/devel/fs :: Link / Comments ()
Mon, 05 May 2008
POHMELFS transaction support. Failover (re)connection to different servers.
POHMELFS
just got full transaction support. So far it is only used in ->writepages()
callback, which is invoked by writeback mechanism. POHMELFS uses lazy transaction support,
namely it waits after each transaction, which includes header and data to be written for at most
14 pages, 14 is a magic number of pages, which corresponds to struct pagevec size,
used by generic writeback, transaction size is limited by mount option and is 32 pages by default.
Performance dropped from 125 MB/s down to 64 MB/s, which is not acceptable.
Main problem is of course waiting for transaction to be completed (i.e. completion message from server).
There should not be per transaction waiting, instead writeback has to allocate as many transactions as
needed and proceed one after another, and only start waiting for them, when there are no more
pages to be written. This is the next task.
Transaction mechanism allows quite simple reconnection to different master servers in case of failure,
and rollback of the failed transaction. For example one can provide different number of main
servers (which have to be in sync with each other and be able to be synchronized themselves,
or they just can use shared storage), so POHMELFS client will switch between them if current
one has failed. System will detect it and reconnect, if reconnect fails, next server will be used
and the whole transaction will be resent there.
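The failover logic above can be sketched as a simple loop. This is a Python illustration under assumptions: send_with_failover and the 'send' callback are hypothetical names, and a failed send or reconnect is modelled as ConnectionError.

```python
def send_with_failover(servers, transaction, send, start=0):
    """Try servers in round-robin order, beginning with the current one.

    'send' is a caller-supplied function (hypothetical here) that raises
    ConnectionError when a server can not be reached. Returns the index of
    the server that accepted the whole transaction.
    """
    for step in range(len(servers)):
        index = (start + step) % len(servers)
        try:
            send(servers[index], transaction)   # resend the whole transaction
            return index
        except ConnectionError:
            continue                            # reconnect failed, try the next one
    raise ConnectionError("no server accepted the transaction")
```

The returned index would become the new 'current' server for subsequent transactions.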
It is also possible to write transaction to different server on demand (it may or may not to be connected
already, but it has to have address structure, so far it is only obtained during pre-mount configuration),
which is a prerequisite for parallel data processing. One can create a simple patch to write transactions
one after another to servers in round-robin fashion.
Right now only write transactions are used (and can be combined with object creation if needed), read ones are pending
as well as multiple parallel transactions (which is not complex, but main task is how to wait for them all to be
completed, very similar code is used in pohmelfs_aio_read()).
There is also pending task of cache coherency support (server side originated messages
to clients, which used the same pages, which another client is writing into,
also including metadata coherency messages like uid/gid/inode size and other changes),
it is not that complex task, and mostly requires server modifications.
Stay tuned!
/devel/fs :: Link / Comments ()
Fri, 02 May 2008
Design of the POHMELFS transaction model.
It is heavily based on how netlink is implemented in Linux kernel.
Besides the fact that it is likely the most ugly and complex protocol
among communication models supported by the kernel, it is exactly the
most effective, extendible and feature rich one.
This model is based on the attributes, which are embedded into
the message. Each attribute has header, which includes size
of the attached data. So, one can put
effectively unlimited amount of data into any message (limited only by
size field and practical assumptions of the communication), and it is possible
to create message, which will contain any number of different attributes.
The main problem of the netlink is its padding and alignment ugliness.
Protocol tries to get the every bit out of the communication, so there is huge
amount of very hairy things there.
I like to drink and (un)fortunately I got pretty bad quality drinks some times,
but I'm absolutely sure, when Alexey Kuznetsov designed netlink attribute alignment
policies he had really bad hangover after likely the worst crap he ever drank.
So, netlink attributes are very ugly, but you can extend it how you like.
The same applies to POHMELFS transactions.
You can put any new attribute into the transaction in a very trivial manner (I worked
with netlink a lot, even created
kernel connector
to simplify kernel development side, so I know that taste), although transaction size is limited,
it is controlled only by mount option (default is 32 IO vectors each one
of PAGE_SIZE (4k on x86) in one transaction).
Thus one can easily implement for example any protocol security labeling,
just add new per-packet attribute.
So, it is easily possible to infinitely extend communication protocol with full backward
compatibility.
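The attribute scheme described in this post can be sketched as a plain type/size/payload encoding. This is a Python illustration, not the POHMELFS wire format: the header layout (16-bit type, 32-bit size, network byte order) and the function names are assumptions, and netlink-style padding and alignment are deliberately left out.

```python
import struct

# Hypothetical attribute header: 16-bit type followed by 32-bit payload size.
HEADER = struct.Struct("!HI")


def pack_attrs(attrs):
    """Serialize (type, payload_bytes) pairs into one message."""
    out = bytearray()
    for attr_type, payload in attrs:
        out += HEADER.pack(attr_type, len(payload)) + payload
    return bytes(out)


def unpack_attrs(message, known_types=None):
    """Parse a message; attributes with unknown types are skipped, not rejected."""
    attrs, offset = [], 0
    while offset < len(message):
        attr_type, size = HEADER.unpack_from(message, offset)
        offset += HEADER.size
        payload = message[offset:offset + size]
        if len(payload) != size:
            raise ValueError("truncated attribute")
        offset += size                      # the size field lets us step over any payload
        if known_types is None or attr_type in known_types:
            attrs.append((attr_type, payload))
    return attrs
```

Because every attribute carries its own size, an old receiver can step over a new attribute (for example a security label) it does not understand, which is where the backward compatibility comes from.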
/devel/fs :: Link / Comments ()
Tue, 29 Apr 2008
POHMELFS transactions and ACID.
POHMELFS
just got initial transactions support and ability to connect to multiple master servers.
Master servers are those, which will say, where data is placed. Essentially
they are the same servers which may provide that data, but main server addresses are
provided during pre-mount configuration time, and data server addresses will be provided
by main servers (if main ones will not want to return data) in run-time.
Also main servers can be used to request data in parallel or to switch between them,
when currently active one has failed.
So far it is a theory, practice is rather miserable: POHMELFS client connects to
multiple servers, but works with only one. Errors are detected, and switch to the next
server can happen, but it is not done. Since there is a serious problem with this
approach: neither server nor client support
ACID for data being written.
Here we come to transaction introduction: it is multiple commands wrapped into
single atomic operation. In case of error during transaction
write, the whole one will be resent to different server (or the same one after reconnect).
This is rather simple (although transactions are not supported by server and client
does not wrap any command into it yet), but it still does not solve ACID problem.
Since POHMELFS has writeback cache, all its writes never reach server, instead writeback
is scheduled by the system, and it starts writing pages to the server. Current POHMELFS implementation
uses only ->writepage() method, which is invoked for each page.
It does not require server to return explicit acknowledge, that page was written,
instead it relies on underlying transport protocol (like TCP) to handle guaranteed delivery,
so data can be queued somewhere when connection was dropped, so POHMELFS client
does not know if data was really written or not. Having per-page acknowledge can fix
ACID problem really trivially, but that may (or may not) end up with severe performance
degradation. As a better solution I consider own ->writepages()
implementation, where each transaction will contain multiple pages to be written
and thus smaller amount of explicit acks from server to be received, and thus smaller performance
degradation. In case of failure whole transaction has to be resent to different server of
course.
Server does not support data mirroring to multiple root directories yet, so actually
not too much is implemented from above description, but transactions and multiple
server connections exist and soon client will get support for reconnection and proper
transaction processing.
/devel/fs :: Link / Comments ()
Sun, 27 Apr 2008
Detailed POHMELFS roadmap.
Transaction support will be added into kernel client.
It is possible that it will be exported to userspace (thus
it will be synchronous write-through operations).
Also kernel client will get locking support (fcntl()
ones first, then more fine-grained ones), this is different from
byte-range
read/write locking, which will be done on server. It is possible to export
it to client too (and will be part of POHMELFS locking API actually, which will
be used for fcntl() too).
The simplest case is data invalidation in client's cache (i.e. if one client
issued a writeback for given page, it has to be marked as not up-to-date on other
clients). Likely it will be done at the beginning of the next week. So far it
will be the last cache coherency item. Task is really simple because of
asynchronous processing of all data in kernel client. Server will have
to store not only index of directories to watch for object changes there,
but also per-object set of pages, read by client, so that appropriate
users could be notified, that page is no longer up-to-date and has to
be refreshed.
Userspace server will get parallel and distributed facilities. Parallel processing
will be done first by allowing lookup and readdir callbacks to return information
about objects, which will contain address of the server where object is actually
located, so that server could read, write or check status there. So far the whole
file will be stored on a server, i.e. for the first implementation there will not
be a possibility to store half of the file on one server and another half on different
one. Then it can be extended.
Server will get ability to store data on different root directories (so that client
was not able to see shadow copies). There will be simple regexp policies for data storing,
for example '*.jpg' has to be stored in root1 and root2, '*.txt' only in root1 and so
on. Each root directory can be local or remote mounted one, userspace does not care
about these issues.
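Such storing policies could look roughly like this. It is a Python sketch with shell-style patterns: the POLICIES table mirrors the example above, while the function name and the default root for unmatched names are assumptions, the post does not say what happens to files matching no rule.

```python
from fnmatch import fnmatchcase

# Pattern -> root directories, mirroring the example in the text.
POLICIES = [
    ("*.jpg", ["root1", "root2"]),
    ("*.txt", ["root1"]),
]


def roots_for(name, policies=POLICIES, default=("root1",)):
    """Return the root directories an object with this name is stored in.

    The first matching pattern wins; 'default' is an assumed fallback.
    """
    for pattern, roots in policies:
        if fnmatchcase(name, pattern):
            return list(roots)
    return list(default)
```

The server would then write the object into every returned root, whether it is a local directory or a remotely mounted one.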
Main part is already completed: I have a vision of what system has to provide and how
it will look like, so with good design of the low-level mechanisms it becomes
a doable task for the predictable timeframe.
Stay tuned!
/devel/fs :: Link / Comments ()
Fri, 25 Apr 2008
POHMELFS release.
Vodka and beer together are glad to provide a new POHMELFS release for you.
POHMELFS stands for
Parallel Optimized Host Message Exchange Layered File System.
This is a high performance network filesystem with local coherent cache of data and metadata.
Its main goal is distributed parallel processing of data. Network filesystem is a client transport.
POHMELFS protocol was proven
to be superior to NFS in lots (if not all, then it is in a roadmap) operations.
Basic POHMELFS features:
- Local coherent (notes 1 and
2) cache for data and metadata.
- Completely async processing of all events (hard and symlinks are the only exceptions) including object creation
and data reading.
- Flexible object architecture optimized for network processing. Ability to create long paths to object and remove arbitrary
huge directories in single network command.
- High performance is one of the main design goals.
- Very fast and scalable multithreaded userspace server. Being in userspace it works with any underlying filesystem
and still is much faster than async in-kernel NFS one.
Roadmap includes:
- Server extension to allow storing data on multiple devices (like creating mirroring), first by saving data in several
local directories (think about server, which mounted remote dirs over POHMELFS or NFS, and local dirs).
- Client/server extension to report lookup and readdir requests not only for local destination, but also to different
addresses, so that reading/writing could be done from different nodes in parallel.
- Strong authentication and possible data encryption in network channel.
- Extend client to be able to switch between different servers (if one goes down,
client automatically reconnects to second and so on).
- Async writing of the data from receiving kernel thread into userspace pages via copy_to_user() (check development tracking
blog for results).
One can grab sources from archive
or check a homepage.
Enjoy!
P.S. Moved to listen blues and drink a beer.
/devel/fs :: Link / Comments ()
Thu, 24 Apr 2008
Second POHMELFS release.
Is scheduled for tomorrow, today I have to prepare myself for it.
The whole idea and implementation started during fun new year vacations,
so I have to repeat process at least at some degree...
This release will not include direct writing to userspace from async thread,
since this approach happened to be really non-trivial. What I
described
for the page fault handling works only for the first fault, when page is populated into
the table, it can be referenced and written into and things just work. Problem
happens when the same page used for the second read (i.e. new try from the userspace,
for example if to increase size of written data to more than two pages, 'cat'
will use the same two pages to read data). With the second write from the kernel there will be
page fault again, although page exists in table, and fault can not be handled
(at least its reason will not be removed, since it will happen again and again), since
page table entry looks really good for the system, but not for the CPU.
I checked two cases: usual copy_to_user() from kernel on behalf of
userspace thread invoked a read syscall, and the same code, but copy was performed
from the different thread. Page table entry (pte) looks very similar in both cases
(in regards of all flags at least), but fault happens for the second write into the same
page always, when thread's mm context was changed to point to original userspace one.
This does not change if userspace thread was or was not scheduled away from its CPU.
Difference from get_user_pages() in this part is mainly the fact, that resulted page is locked
in the kernel (by increasing its reference counter at least), but I still want to produce the same
behaviour as usual page fault during copy on behalf of userspace thread.
So, I am stuck with this problem, but since it is very interesting I will find a solution.
Meanwhile, this release will include following things:
- POHMELFS client. Full client side caching. Async operations for all major events
(not including
copy_to_user() hack described previously, but just async
notifications and copy on behalf of original userspace thread).
Support for usual files and directories only, special files like
device files or pipes are not interesting at this point, and are quite simple to implement, but
so far there is no need for that. Client has support for object creation/removal
cache coherency messages.
- POHMELFS userspace server. Object creation/removal cache coherency message broadcasting will
be commented out, no locking.
Stay tuned!
/devel/fs :: Link / Comments ()
Tue, 22 Apr 2008
Cache coherency in POHMELFS. Continue.
While moving home I thought a lot about cache coherency issues.
While we believe that NFS has coherent cache, since it is somewhat
write-through, its cache actually is not synchronous, since between
object creation and moment when other clients see new object really a lot
of time can pass, for example when client, which creates an object, has
slow link... So, object creation and removal should not be synced to other
clients during writeback on one of them, instead clients which are interested
in object perform a lookup, which may or may not return object, this is not a
race or cache non-coherency, this is usual multithreaded environment without
client's synchronization.
What we really care about, is data consistency on the server. When we have
multipage write, which overlaps with another write from different client,
we should not read data back from the middle of the transactions. Locking the
whole file is not an issue, instead proper byte-range (page-range actually)
locking has to be implemented. I already have a
prototype,
but have to check it in real life.
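To make the page-range idea concrete, here is a minimal sketch in Python. It is not the prototype mentioned above: the PageRangeLocks class, the inclusive page ranges and the simple linear scan are all assumptions chosen to keep the example short.

```python
class PageRangeLocks:
    """Toy page-range lock table for one file (illustrative only)."""

    def __init__(self):
        self.held = []   # (owner, first_page, last_page), both ends inclusive

    def try_lock(self, owner, first, last):
        """Grant the range unless it overlaps a range held by another owner."""
        for other, lo, hi in self.held:
            if other != owner and first <= hi and lo <= last:
                return False   # overlapping write from a different client
        self.held.append((owner, first, last))
        return True

    def unlock(self, owner, first, last):
        self.held.remove((owner, first, last))
```

Two clients writing pages 0-3 and 4-5 of the same file proceed in parallel, while a write to pages 3-5 has to wait until the first range is released.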
So, other competing projects may or may not follow my way and drop
creation/removal/stat coherency from the TODO list (afaics, no one implemented
that yet :) based on my analysis and concentrate on server read/write locking.
And I will start some bits of VM hacking: plan is to implement generic enough
(well, working on x86 for start :)
mechanism to copy data from different (i.e. not that one which
started a syscall) thread to userspace, while original one sleeps in syscall,
via copy_to_user(). Likely it will be somewhat similar to what
I did for zero-copy userspace sniffer
and how get_user_pages() work.
Result, which has to be as fast as usual copy_to_user(), otherwise it is not
interesting solution, will be used in POHMELFS client and its async reading.
/devel/fs :: Link / Comments ()
Mon, 21 Apr 2008
Cache coherency in POHMELFS.
Example:
Client 1 Client 2
# ls -a /mnt/
. ..
ls -a /mnt
. ..
echo qwe > /mnt/asdasd
sync
ls -a /mnt/
. .. asdasd
rm -f /mnt/asdasd
sync
ls -a /mnt/
. ..
dmesg | tail -n1
pohmelfs_remove_response: parent: 2, path: '//asdasd'.
ls -a /mnt
. .. asdasd
As you might have noticed, when one client creates an object and it is written back
to server (during writeback), it is broadcast to all clients, which read the same
directory before. This information is stored on server in binary tree, so it takes
(M-1)*O(log(N)) time, where M is total number of clients and N is number of directories
they read. This can be further optimized though.
Objects are not removed from clients, when one of them removes it (and this is synced
to server via writeback), since so far I can not call sys_unlink() directly
from module, and I have not yet written code to deal with dentry cache (that will be simple),
instead you can see in dmesg, that other clients received a command and just need to drop
inode and dentry.
Also inode information is not broadcasted yet (for example when file size increases
or access rights are changed), so new files have always zero size. This information should be
broadcasted during writing, and since server is heavily multithreaded, this should not
hurt performance.
There is different opinion though: we do not need cache coherency at all, since the last writer
will overwrite data anyway, and when we open new object, we first look it up on server,
so if it was created there, it will be opened, but if it exists only in cache on some other client, we
do not know about it anyway. We can broadcast above messages during object creation on clients,
but this will be effectively write-through cache, since we can create object on server that time.
Anyway, I will proceed with either remove/stat messages, or with ability to copy data to userspace
from different thread. The latter looks like very interesting hack.
/devel/fs :: Link / Comments ()
Sun, 20 Apr 2008
Real Jedi does not use kernel.
He writes new or extends existing, but it is from different series.
This one will tell you how one will be able to build a distributed
and then parallel filesystem using POHMELFS.
The headline says it all: the POHMELFS server will not be placed into the kernel
so far, since it is already very fast (compared to the in-kernel async NFS server)
and userspace programming is a bit easier. Mostly, though, it is because there is no
need to wait about 10 minutes while servers come up after an IPMI reboot:
they are located somewhere I do not know, there is no possibility
to quickly reboot them by hand, and they have lots of things to do to bring themselves
up even if something was really screwed: network boot, SCSI probing,
a possible fsck, the initial BIOS memtest (8GB)...
So, planned POHMELFS server updates:
- PMCC - poor man's cache coherency protocol. Scheduled for the first half of the next week, btw.
- server extension to allow storing data on multiple devices (like creating mirroring),
first by saving data in several local directories (think about server, which mounted remote
dirs over POHMELFS or NFS, and local dirs).
- client/server extension to report lookup and readdir requests not only for local destination,
but also to different addresses, so that reading/writing could be done from different nodes
in parallel.
Somewhere at the beginning there is also a task to extend client to be able to
switch between different servers (if one goes down, client automatically reconnects to second
and so on).
And the most complex task is server parallelization, i.e. ability to have multiple
servers, which handle the same metadata, to work in parallel and be coherent. AFAIK, there
are no such (at least open) solutions, neither Lustre, nor PVFS2, nor Ceph,
nor glusterfs, nor whatever.
There are solutions with a master-slave setup (IIRC, Lustre works that way), and Ceph has the ability
to spread metadata between multiple servers, but they do not handle the same sets of objects,
so there is no metadata server redundancy.
So far I consider this the most complex part, and I have not yet come up with a solution.
/devel/fs :: Link / Comments ()
Fri, 18 Apr 2008
Poor man's cache coherency protocol design for POHMELFS.
As you might know,
POHMELFS is a network
filesystem with a client-side cache of data and metadata. Any place with a cache has to
provide a cache-coherency algorithm to sync data with other users.
There are two common cases when caches become non-coherent:
- a client created/removed/modified an object which is not shared with other clients (i.e. this
object does not exist in their caches and no object with the same name was created on different
clients)
- an object being handled by one client exists in other caches
The poor man's solution to the above problems is quite simple: a client flushes its changes
to whatever objects it wants during local writeback, and these changes are then propagated to all
other clients which worked with the parent object (this information is stored on the server
each time a client reads a directory or performs a lookup). In the first non-coherent case above, a client
will just receive a new object from the server, which will be easily imported into the existing tree
(because of the async nature of POHMELFS this is a trivial task, which right now works out of the box,
although only on the client). In the latter case there might be a problem if the local object was modified:
we can either replace its content with the new data, or (better) rename the local object to
something different (like old name plus sync time), so that the user can merge the data manually.
So far there will be no locks; they will be implemented next.
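The "rename on collision" option can be illustrated with a small Python sketch. It is only a model under assumptions: the cache is a plain dictionary, the dirty flag and the "name.sync_time" naming scheme are hypothetical, and none of it is taken from POHMELFS itself.

```python
def import_remote_object(local_cache, name, remote_data, sync_time):
    """Import an object broadcast by the server into a client's cache.

    local_cache maps name -> (data, dirty). Returns the name under which a
    locally modified copy was kept, or None if there was no collision.
    """
    kept_as = None
    entry = local_cache.get(name)
    if entry is not None and entry[1]:
        # Locally modified: keep the local copy as "old name plus sync time"
        # so the user can merge the two versions by hand.
        kept_as = f"{name}.{sync_time}"
        local_cache[kept_as] = entry
    # The server's version always takes the original name.
    local_cache[name] = (remote_data, False)
    return kept_as


cache = {"a.txt": (b"local edit", True)}
kept = import_remote_object(cache, "a.txt", b"remote", 1208500000)
```

After the call the cache holds the server's data under a.txt and the local edit under a.txt.1208500000.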
/devel/fs :: Link / Comments ()
POHMELFS AIO reading benchmark vs async NFS.
After I spent two days implementing real AIO for POHMELFS, the following things happened:
- Implemented 3 different AIO schemes, two of which could be zero-copy. Here is a brief description of them.
First, the POHMELFS ->aio_read() callback schedules a number of pages to be read from the server
(if a page is already up-to-date, it is copied to userspace, otherwise a network request is sent), then
it waits...
- when async data is received from the remote side, the appropriate inode and pages are found, then the (physical)
userspace page is locked in memory and data is either received into that page, or received into the VFS
cache page and then copied into the userspace one. Then the userspace page is unlocked.
- when async data is received (note that it is received completely asynchronously in a different thread) into
the VFS cache page, the receiving thread copies data into userspace via
copy_to_user(). Since the receiver
thread has a completely different virtual memory layout, it can not simply copy data to the provided userspace address:
first it has to set up page tables to be equal to the userspace thread layout. In theory setting the CR3 register
on x86 should be enough, but that's only theory. I was not able to fully complete this method, since eventually
the thread crashed (obviously: the userspace thread could still be active on a different CPU, so installing the same CR3 value
on different CPUs pointing to the same page tables leads to crappy things). This interesting hack can be finished though.
- when async data is received, pages are marked as ready and placed into a list, so the userspace thread can copy
them back via
copy_to_user(). The simplest method. And it works great (graphs below).
- found a bug in 2.6.25-rc7 shmem when removing a 1 GB file from it:
Bad page state in process 'rm'
page:c49948c0 flags:0xf7d4a600 mapping:00000000 mapcount:0 count:0
Trying to fix it up, but a reboot is needed
Backtrace:
Pid: 9454, comm: rm Not tainted 2.6.25-rc7 #11
[] bad_page+0x52/0x7a
[] free_hot_cold_page+0x5e/0x15a
[] __pagevec_free+0x18/0x22
[] release_pages+0xfb/0x142
[] __pagevec_release+0x15/0x1d
[] truncate_inode_pages_range+0xea/0x29f
[] __link_path_walk+0xa7e/0xb28
[] truncate_inode_pages+0x9/0xc
[] shmem_delete_inode+0x26/0xac
[] shmem_delete_inode+0x0/0xac
[] generic_delete_inode+0x88/0xec
[] iput+0x60/0x62
[] do_unlinkat+0xb7/0xf9
[] do_page_fault+0x2b6/0x6c2
[] do_page_fault+0x31e/0x6c2
[] sys_ioctl+0x2c/0x43
[] sysenter_past_esp+0x5f/0x85
[] pci_scan_single_device+0x377/0x446
Did not try to investigate (this is my testing server, not tainted with POHMELFS code).
- Ran multiple tests...
Test details for the second round of POHMELFS vs NFS fight.
The hardware and software were already described in the first round;
I need to note that the server (2.6.25-rc7) has all debugging options turned off.
Tests performed: kernel tree reading
(find linux-2.6.24.4 -type f | xargs cat > /dev/null)
from disk over the net (XFS filesystem, cold server and client caches) and big file reading
from tmpfs (to eliminate server disk latencies). The graph was added to the previous round results.

Note that async NFS and POHMELFS behave very similarly in operations which involve reading from the disk;
that is because of disk latencies (although the 10krpm SCSI disk used allows about 80 MB/s sequential read,
XFS behaves quite badly with lots of small files). The tmpfs comparison shows the advantages of the
POHMELFS network protocol.
Reading from a huge remote tmpfs file is about 2 times faster for POHMELFS because of its AIO implementation,
although that is not the main reason: the server was almost always capable of handling requests from the POHMELFS client
one-by-one using one thread, which saturated about 70% of the bandwidth (add here all debug options turned on on the client).
One of the main factors, I think, is readahead being turned off: sync readahead has zero advantage in an asynchronous
network filesystem, since while it waits for readahead to complete it could schedule new requests, but the
->readpage() method used in readahead waits until the page is transferred, and only then does the
readahead code schedule a new request. One can implement ->readpages() though.
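The argument about synchronous readahead can be made concrete with a toy latency model. This is only a back-of-the-envelope sketch: the formulas ignore server and disk time, and the numbers (round trip, per-page transfer time, page count) are made up for illustration, not measured on the test machines.

```python
def sync_read_time(pages, rtt_us, page_transfer_us):
    # One request in flight at a time: every page pays a full round trip
    # plus its own transfer before the next request is scheduled.
    return pages * (rtt_us + page_transfer_us)


def pipelined_read_time(pages, rtt_us, page_transfer_us):
    # All requests are scheduled up front: one round trip of latency,
    # then the pages arrive back-to-back.
    return rtt_us + pages * page_transfer_us


# Hypothetical numbers: 1000 pages, 200 us round trip, 40 us to transfer a page.
sync_total = sync_read_time(1000, 200, 40)
async_total = pipelined_read_time(1000, 200, 40)
```

With these assumed numbers the page-by-page scheme spends most of its time waiting on round trips, which is the effect described above.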
A kernel tree reading micro-benchmark was also performed: POHMELFS has a 2-times win because of its network protocol, which
batches server replies (via TCP_CORK only though; I think I need to implement a better directory reading command).
Another solution is to correctly implement the transactional model, which is the next task now.
/devel/fs :: Link / Comments ()
Wed, 16 Apr 2008
Massively multithreaded POHMELFS server.
Because of the completely asynchronous nature of POHMELFS,
it is possible to implement a multithreaded server, where not only requests from
different clients are processed in parallel, but also async requests from the
same user are handled simultaneously by a pool of threads.
Such multithreading requires introducing a transactional model of communication.
Take object creation followed by writing data, for example: right now this race is handled
by sending a reply after creation, so the whole writeback sleeps waiting for it,
which drops performance (to the NFS level). A transaction, on the contrary, will contain both operations,
which will be processed by the same thread without a race. It can also handle
other problematic places with multiple server threads.
So far the userspace server can run one or several processing threads per client,
but no transactions are implemented. I just started the
AIO
reading implementation, which should provide a great speedup for any reading
workload.
Stay tuned!
/devel/fs :: Link / Comments ()
Mon, 14 Apr 2008
Initial network filesystem benchmark. POHMELFS vs NFS. Round 1.
Hardware (both client and server have the same hardware).
4-way (2 logical (HT) + 2 physical cpus) 3.00 Xeon (32 bits with PAE :), 8 GB of RAM,
Intel 82541GI gbit adapters, Seagate ST3300007LC 10k rpm scsi disk on
Adaptec AIC7902 PCI-X Ultra320 SCSI adapter.
Software.
Server: 2.6.25-rc7 kernel, in-kernel NFS server, userspace POHMELFS server.
Client: 2.6.25-rc8 kernel, in-kernel clients.
Both have all kernel debugging turned on.
Round 1. Huge directory (linux-2.6.24.4.tar archive) untarring over the network.
Picture shows it all.

Notice that there is no test for POHMELFS reading (that is why it is only the first round),
since it is miserable. And I know the reason: I'm lazy, so I use the generic reading function
(generic_file_aio_read()), but Linux does not actually have AIO reading from regular files,
so it is very synchronous and has to read data page-by-page, so we have a pretty
broken system with regard to network performance.
Since reading is not async, I will reimplement generic_file_aio_read() as
pohmelfs_aio_read(), which will be a real AIO reading function. That will be the second round,
where POHMELFS will win.
But it can not win the game. Because things are changing. Today I learned that
if a filesystem has only 20 users in the world, then it
should not be
merged, since the burden
of changing something generic in VFS (and thus propagating it to filesystems)
is too high.
What has happened? Have Linux kernel maintainers become afraid of changes?
Afraid of more code? Afraid of something new they do not want?..
Eh, and they say they want more developers... They want monkeys who will do only what
they are asked to do.
POHMELFS will be sent for review of course, but it is highly unlikely
I will push it upstream.
/devel/fs :: Link / Comments ()
Fri, 11 Apr 2008
Unhashed inodes can not be synced during writeback. Debunked.
The problem happened to be quite simple: writeback happens for
inodes in the sb->s_io superblock list. They are placed
there from the sb->s_dirty list, which contains dirty inodes.
Dirty inodes are placed into that list via mark_inode_dirty(),
which checks whether the inode is hashed; if it is not, it will not be placed into the dirty list.
Hashed has a synonym in comments: valid...
There is an sb->s_op->dirty_inode() superblock operation callback, which is invoked
first, so one can still implement one's own inode cache, not use the inode hash tables, not
hash inodes and still put inodes into the dirty list and thus be able to run writeback on them.
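The control flow described above can be modelled in a few lines of Python. This is a toy model of the logic as the post describes it, not kernel code: the classes, the boolean hashed flag and the fs_dirty_inode callback are all illustrative stand-ins.

```python
class Inode:
    def __init__(self, ino, hashed):
        self.ino = ino
        self.hashed = hashed  # stand-in for "inode is in the inode hash table"


class SuperBlock:
    def __init__(self, dirty_inode=None):
        self.s_dirty = []               # stand-in for the sb->s_dirty list
        self.dirty_inode = dirty_inode  # stand-in for sb->s_op->dirty_inode()


def mark_inode_dirty(sb, inode):
    # The filesystem callback is invoked first...
    if sb.dirty_inode is not None:
        sb.dirty_inode(sb, inode)
    # ...and only then does the generic code skip unhashed inodes.
    if not inode.hashed:
        return
    if inode not in sb.s_dirty:
        sb.s_dirty.append(inode)


def fs_dirty_inode(sb, inode):
    # Hypothetical filesystem callback that queues the inode itself.
    if inode not in sb.s_dirty:
        sb.s_dirty.append(inode)


unhashed = Inode(1, hashed=False)
plain_sb = SuperBlock()
mark_inode_dirty(plain_sb, unhashed)          # never reaches the dirty list
custom_sb = SuperBlock(dirty_inode=fs_dirty_inode)
mark_inode_dirty(custom_sb, unhashed)         # queued by the callback
```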
/devel/fs :: Link / Comments ()
Thu, 10 Apr 2008
Busy inodes after unmount.
VFS: Busy inodes after unmount of pohmel. Self-destruct in 5 seconds. Have a nice day...
After removing the private cache of inodes I found that objects which were
sent by the server and which were never attached to a directory entry (dentry)
will never be freed.
So, essentially this does not work with Linux VFS:
iget()/iget_locked()
...
umount
Inodes, created by iget()/iget_locked() will be placed into at least three
different lists:
inode_in_use - global list of ever created inodes, which have i_count and i_nlink
more than 0
s_inodes - per superblock list, which contains every inode, created for this superblock
inode_hashtable - hash table indexed by inode number. If you want to
work with writeback,
your inodes have to be there. Did not yet investigate why.
So, essentially all inodes which you created are accessible by VFS and will be checked
during umount via generic_shutdown_super()->invalidate_inodes(),
where the system will notice that an inode in the s_inodes list has a non-zero reference
counter (of course, otherwise it would have already been freed by the filesystem), so this inode
can not be freed. Thus we have a leak.
The above lists can only be accessed under the global inode lock, so it is not a good idea to destroy inodes
by traversing them in, for example, the ->put_super() callback or in any other filesystem callback,
so I had to add a list of all inodes into the POHMELFS superblock. Ugly.
/devel/fs :: Link / Comments ()
POHMELFS development status.
It has developed very rapidly over the last couple of days,
so essentially I rewrote it. I think it is ready for the next
release, which I will announce in a day or so.
Right now all the first-milestone features which I planned, except cache-coherency (see below),
are completed (although maybe not always in the most
optimal way).
Because of the name cache usage it is now possible to create huge paths
with multiple directories via a single command. The same applies to directory
removal,
although that is because of a different design issue.
It would be possible to rewrite generic read/write helpers and provide
set of pages into POHMELFS network stack (which is page
based for data now), but I decided that for the first
step it is not needed.
POHMELFS now has fully async processing of all operations except link creation
(I just decided that it is a bit simpler to make them write-through;
it was done because of laziness and not some fundamental arch problems).
It was achieved by serious (read: from scratch) changes in the arch,
which had its own problematic places, namely error reporting. Because of this
move it becomes really simple to implement any kind of protocol, if it obeys
the async rules: sending a message never requires a sync reply,
and where a reply is needed, it comes as an independent incoming message,
which is processed asynchronously from the waiter and via a common state machine.
Such an arch allows a simple cache coherency algorithm, where the server just sends
missed entries or commands to remove some objects, and the client's core handles that just
fine since its receiving code does not depend on the sending one. This is not a
100% correct way to handle collisions (collisions thus become new objects
in the filesystem tree, like old name plus some suffix), but it is what lots
of users need, even though it is not real cache-coherency.
A writeback cache does not play very well with cache-coherency, since every metadata
change (like object creation or removal)
has to be checked against the server state, since different clients can do the same with
the same object. The level of paranoia has to be thought out in advance.
The first cache-coherency step is the implementation of a trivial scheme, where
every object is synced during its writeback time and changes are broadcast by the server
to other clients. If another client has the same object being processed,
it can either be renamed to a collision name or just overwritten. Having locks
and thus real states is the next step.
Also, POHMELFS does not have authentication and strong checksums right now,
and although this is a simple task to implement, its priority is questionable.
There is also the possibility of implementing cryptographically strong encryption of the
communication channels.
So, lots of ideas, but main part is ready - async data processing design was
definitely a right choice to implement, so all other features become very simple
to complete.
New release will be announced very soon, stay tuned!
/devel/fs :: Link / Comments ()
Sun, 06 Apr 2008
There is only one way: asynchronous.
This is a new motto for POHMELFS.
It is a completely new filesystem now.
POHMELFS got new page processing code (sending side: commands and data), new lookup,
which is based on the Linux VFS inode cache without reinventing the wheel (comment
says it is very smp-friendly, although I do not quite understand how
it is possible with global inode_lock), it also got
completely new object creation and referencing path. It is possible
to create a huge path (up to 4k, but can be easily extended if there will be such demand)
with multiple objects in it with only single network command.
But the main feature of the new POHMELFS is its name cache. I did not find
how to hook into the VFS dentry cache, so I invented my own. It is fast
to traverse from a child to the highest level parent, which is actively
used in the POHMELFS writeback path. Although it is not 100% the best
storage, just a simple RB-tree (and thus requires an smp-unfriendly mutex), the whole
idea shows its gains already. Eventually it will be replaced with a
faster and more scalable approach protected by RCU (even a properly sized hash
table will show better scalability, although dynamic resizing of hash tables
prevents RCU usage), but I started from the simplest ground.
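The child-to-parent traversal used on the writeback path can be sketched as follows. This is a simplified illustration in Python: the NameEntry class and full_path helper are hypothetical, and the RB-tree that indexes the entries in the real name cache is left out entirely.

```python
class NameEntry:
    """Hypothetical name cache entry: a name plus a pointer to its parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def full_path(entry):
    # Walk from the child up to the highest level parent, collecting names,
    # then reverse them to get the path that writeback would send.
    parts = []
    while entry is not None and entry.parent is not None:
        parts.append(entry.name)
        entry = entry.parent
    return "/" + "/".join(reversed(parts))


root = NameEntry("")
a = NameEntry("a", root)
b = NameEntry("b", a)
leaf = NameEntry("file", b)
path = full_path(leaf)
```

The walk costs one step per path component, independent of how many entries the cache holds.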
POHMELFS already outperforms async NFS during untarring and completely saturates
my testing Xen domains (both network and disk speed), while NFS is almost two
times slower. The testing machines have 256 Mb of RAM and at most 3 MB/s interconnect speed
(something is likely broken in the Xen setup, since it is supposed to be 100 mbit/s
and there is no high load), which is very unfriendly to POHMELFS (read: in such a scenario POHMELFS
will show its worst results), but nevertheless it is fast.
It became not only much faster, but also simpler. Its userspace server has
half as many lines of code (816 vs. 1613), and the kernel side is smaller and simpler too:
mainly there are no zillions of different trees indexed by every possible key,
so far only a per-inode tree of child names for readdir and a per-superblock path
entry cache.
There are drawbacks of course: there is no receiving code (at all). It will be a dedicated
thread, which will asynchronously process all incoming packets (mostly
readdir async return, read page content and cache-coherency messages). First
two are really simple. The last one will be implemented as a full MOSI/MSI
library for inode content. Likely it will be possible to use in my
other projects.
P.S. I frequently think that I'm very good vapourware seller :)
Stay tuned!
/devel/fs :: Link / Comments ()
Wed, 02 Apr 2008
Unhashed inodes can not be synced during writeback.
So essentially there is no way to implement one's own inode
cache tied to the system's writeback mechanism, which is bad
news. POHMELFS in its current reincarnation does not use the
system's inode cache and all its inodes are unhashed, which
means that they are never synced, since the writeback
mechanism just does not see them.
So I will fall back to hashed inodes, which will be used just for that,
and writeback for a single inode will end up creating the directory structure
for all the upper layer objects.
Another idea is to implement my own writeback, which would be scheduled from the
main one or after memory notifications. This approach actually has lots of
advantages, but let's first complete the simpler part with hashed inodes.
This is called a learning curve: I'm essentially where I was before,
but with extended baggage of knowledge.
/devel/fs :: Link / Comments ()
Sun, 30 Mar 2008
To SSD or not to SSD.
A couple of days ago I talked with a person who ordered 4 high-end 128G SSD disks
to create a RAID for testing purposes; the seek time for those devices is 0.1ms.
Each one costs about $4k. His main workload is databases, i.e. random reads and writes,
so we calculated that theoretically it has to be about 14 times faster than
high-end SCSI disks with 3.5 ms seek latency and about 100Mb/s sequential access speed
in the given
workload of processing random data in 8-16kb chunks (the usual 'page' in SQL servers).
Besides the fact that putting 14 disks into a mirror will
be as fast as a single SSD disk (theoretically), will be 14 times more reliable
and will likely have a smaller price,
the main workload is to replace RAM with SSD, not disks with SSD.
My prognosis is that the SSD will be at most 2-3 times faster (if it will be faster
at all, since its theoretical performance advantages can be killed by the FS)
than a SCSI disk for the
given workload, and as is, it is not a breakthrough technology.
If I'm wrong (it will likely be tested next week with the
sysbench read-write benchmark),
I will buy a good bottle of whiskey for us, otherwise...
/devel/fs :: Link / Comments ()
Thu, 27 Mar 2008
Filesystem as a database or database in filesystem.
I actually do not understand what prevents filesystem writers from implementing
a trivial interface and access library for metadata manipulations,
which would allow not only path lookup,
but also lookup by various keys, for example ones stored in extended attributes.
Yes, it requires filesystem changes, but I can not believe it is impossible
or even too complex.
Need to think...
/devel/fs :: Link / Comments ()
Wed, 26 Mar 2008
Added maildir benchmark results.
The simulation works on each filesystem in the following stages:
- The empty filesystem is created and mounted.
- The directory structure is created, with no files.
- A single delivery simulator and retrieval simulator are run
simultaneously. The script waits for each of the simulators to finish,
and then runs the sync command before proceeding to the next
step.
- The above step is repeated with 2, 4, 8, and then 16 delivery simulators.
Delivery Simulator.
The delivery simulator does actual maildir deliveries to the given directory:
- It writes a file with a unique file name to the tmp subdirectory.
- It fsyncs the newly written file.
- It renames the file into the new subdirectory.
- It fsyncs the new subdirectory (to ensure that
directory is actually on disk, as most Linux filesystems don't
automatically perform this action during the rename).
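The four delivery steps above can be sketched in Python. This is an illustration only, not the benchmark's own perl script: the deliver function and the message file name are made up, and it assumes a POSIX system where fsync() on a directory descriptor is supported by the filesystem.

```python
import os
import tempfile


def deliver(maildir, unique_name, message):
    """Deliver one message maildir-style and return its final path."""
    tmp_path = os.path.join(maildir, "tmp", unique_name)
    new_dir = os.path.join(maildir, "new")
    new_path = os.path.join(new_dir, unique_name)

    # Steps 1 and 2: write the file into tmp/ and fsync it.
    with open(tmp_path, "wb") as f:
        f.write(message)
        f.flush()
        os.fsync(f.fileno())

    # Step 3: rename it into new/.
    os.rename(tmp_path, new_path)

    # Step 4: fsync the new/ directory so the rename itself reaches the disk.
    dir_fd = os.open(new_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return new_path


maildir = tempfile.mkdtemp()
for sub in ("tmp", "new", "cur"):
    os.mkdir(os.path.join(maildir, sub))
delivered = deliver(maildir, "1206500000.1.host", b"Subject: test\n\nhello\n")
```

The directory fsync in step 4 is the call that can fail with "Invalid argument" on filesystems that do not support it.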
More details are on the original page.
Briefly, it is a multithreaded maildir simulation.
And the results
are quite different compared to, for example, postmark: very good results from xfs, jfs and reiserfs.
There are no ext2 and btrfs results, since perl's fsync says that
the file descriptor opened there is invalid:
Invalid argument at /root/fs_bench/maildir_fsbench/fsbench/fake-deliver line 38.
An interested reader can check the sources and show me the problem, but ext2 worked just fine with
the 2.6.20 kernel and whatever glibc/perl/etc. was in Debian at that date.
Anyway, results can be found at the contest
homepage.
Now all testing is over.
Main conclusion: things got worse compared to 2.6.20 and there was no major breakthrough in filesystem development, at least
from the performance point of view.
/devel/fs :: Link / Comments ()
Additional XFS test with slightly different mount/mkfs options.
mkfs: -d agcount=75 -l size=64m
mount: logbufs=8,nobarrier,noatime,nodiratime,osyncisdsync
Postmark results:


Results are slightly better than the
previous
xfs run, but barriers are turned off here, which I believe to be the main reason. The other
filesystems also did not have directory atime turned off.
Anyway, even with these results XFS is still much worse than any other FS (except reiserfs)
for this workload.
/devel/fs :: Link / Comments ()
Tue, 25 Mar 2008
Filesystem contest results.
The interested reader can check out the results
of the ext2/3/4, reiserfs, reiser4, jfs, xfs and btrfs fight for the first prizes
in dbench,
iozone,
postmark,
maildir performance bench
and simple file creation micro-benchmark.
It does not contain the maildir benchmark yet (I will add it tomorrow or later today),
xfs has not completed yet, and there are no graphs.
As a conclusion: nothing major has changed since the
previous contest;
the new btrfs filesystem behaves not that badly in some cases,
but is quite slow in others... Nothing changed.
Does it mean that we need something new?
/devel/fs :: Link / Comments ()
POHMELFS status.
I've started mostly from scratch. I think it is a good sign
when a project can be rewritten without any pain to implement really
interesting ideas instead of having multiple crutches all over the
place. This also means that it is not that complex, so I do not regret
the dropped code.
Now it is still very much in a testing stage without any network protocol at all,
but I am testing a new paradigm in pohmelfs: its inodes will not be hashed
into the global hash table, but instead will be placed into a local
trie-like structure, which (optionally) will allow RCU-fied lookup.
Something similar to the data structure created for the
multidimensional trie
used for the unified socket lookup patch.
I really like the
two-hash
approach, but since there is no proof (yet) that it will work for all possible cases,
I will first implement a radix-like tree to store object names. The network
protocol will also operate on full-length paths, which actually can be
a bad idea, I will see.
Another uber cool feature of the full-path approach
is the ability to create a number of directories, which form a path to the given
object, in a single command, i.e. when a client sends a network command
to create the object /a/b/c/d/file, there is no need to send
separate commands to create /a, /a/b and so on;
it can be done automatically by the server. This requires sending not only
the path though, but also information about permissions for each subdir.
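A rough sketch of what such a single "create with parents" command could do on the server side, in Python. The function name, its arguments and the list of per-directory modes are assumptions made for illustration; the real command format is not described here.

```python
import os
import stat
import tempfile


def create_with_parents(root, path, dir_modes, data=b""):
    """Create every missing directory along `path` under `root`, then the file.

    dir_modes carries one permission mode per intermediate directory,
    which is the extra information the command has to include.
    """
    parts = path.strip("/").split("/")
    dirs, filename = parts[:-1], parts[-1]
    if len(dir_modes) != len(dirs):
        raise ValueError("need one mode per intermediate directory")
    current = root
    for name, mode in zip(dirs, dir_modes):
        current = os.path.join(current, name)
        if not os.path.isdir(current):
            os.mkdir(current)
            os.chmod(current, mode)  # explicit chmod so umask does not interfere
    target = os.path.join(current, filename)
    with open(target, "wb") as f:
        f.write(data)
    return target


export_root = tempfile.mkdtemp()
created = create_with_parents(
    export_root, "/a/b/c/d/file", [0o755, 0o750, 0o700, 0o755], b"payload"
)
```

One request here replaces the five separate create commands (/a, /a/b, /a/b/c, /a/b/c/d and the file) that a per-object protocol would need.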
/devel/fs :: Link / Comments ()
Second filesystem contest is over.
Although I plan to run a couple of additional tests for
btrfs,
namely all tests with the nodatacow option and without the ssd option,
which will likely take part of the day. But all the others have
already completed, so expect nice graphs tomorrow.
There were a number of surprises during this testing. For example,
reiser4 constantly freezes the test box in the dbench workload
with 150-200 threads. There are no messages in dmesg, but nothing
is turned on in the kernel hacking section of the config. Both
btrfs and reiser4 are very slow at creating and writing into
lots of small (4k) files. Reiser4 is two times faster than btrfs;
the latter creates/writes/syncs/closes about 10 files per second
on average when 10k-30k files are created one-by-one.
Ext4 is also slower than any other filesystem (except the above two)
in this microbenchmark.
Something strange happened between the 2.6.20 and 2.6.24 kernels: the above file
creation microbenchmark produced much worse results for all
filesystems (an order of magnitude in some cases) compared to the previous
contest.
Maybe the sync code is now implemented correctly, I do not know...
I will likely drop the maildir
benchmark results, since the perl script which works there constantly tells me
that fsync() has an invalid parameter...
So, wait about 12 hours (I have to get some sleep: do not mix
absinthe with different red wines and beer; when I did that last
night it was quite tasty, but not so this morning)
/devel/fs :: Link / Comments ()
Mon, 24 Mar 2008
BTRFS got subvolumes support.
Subvolumes
are block devices on top of which btrfs
can be created. This is the first known filesystem in Linux which can be built on
top of multiple block devices. Chris Mason renamed his unstable branch to
really-really unstable because of that. It is possible to put devices into
mirror or striping mode, although that is far from clear from the short
mail description.
Although support for mirroring and striping in a filesystem is a questionable feature,
the ability to create a filesystem on top of multiple block devices with per-device
allocation policies is a huge step in Linux filesystem development.
/devel/fs :: Link / Comments ()
Thu, 20 Mar 2008
Second filesystem contest has been started.
So far I have removed the maildir
test and the file creation benchmark: the former requires a manual start in my
scripts, the latter requires some filesystems to be removed from the run,
namely Reiser4 and BTRFS, since both are very slow at creating and writing into lots of small
(4k) files. XFS is probably also a candidate, although with the optimizations described below
it behaves much better than with default options and the 2.6.21 tree.
So, we have dbench,
iozone and
postmark queued...
Testing is being performed with 2.6.24.3 tree, Reiser4 was ported from the latest
breakout of -mm tree (requires lots of manual patching to be started on recent kernels).
BTRFS was taken from the unstable
branch, since it is the same as 0.13 AFAICS. All other filesystems were taken from the
vanilla tree.
There are following optimisations for the filesystems:
- XFS: mkfs:
-d agcount=1 -l size=128m,version=2, mount: noatime,logbsize=256k,
as suggested by Dave Chinner
- EXT4: mkfs: none, mount:
data=writeback,noatime,extents
- EXT3: mkfs: none, mount:
data=writeback,noatime
- EXT2: mkfs: none, mount:
noatime
- JFS: mkfs: none, mount:
noatime
- REISER4: mkfs: none, mount:
noatime
- REISER3 aka REISERFS: mkfs:
--format 3.6, mount: noatime
- BTRFS: mkfs:
-l 4k -n 4k, mount: noatime,nodatasum, for postmark also added ssd option,
as suggested by Chris Mason
First results are expected to be ready tomorrow evening or even (past)weekend... Although all runs
are performed automatically, generating nice graphs
requires a manual start. Then I will proceed with the
maildir
test and the file creation benchmark.
/devel/fs :: Link / Comments ()
Fri, 14 Mar 2008
Why binary trees are bad. New cache structure for pohmelfs.
I already found experimentally that a write-through cache scales very badly,
even noticeably worse than no cache at all for some workloads, so an ideal
solution
does not involve any kind of write-through operations, notably no synchronous commands
which require an immediate response.
This means that inode numbers will differ on client and server, so there should be
some kind of tracked dependency between them so that operations on different machines
can be done in sync. The initial thought was to use a binary tree to store pointers to the appropriate
inodes, which would be indexed on the server and clients by a combination of hashes of the inode (direntry)
and its parent data. Even embedded systems can easily have millions of inodes, so the choice
seemed correct at first glance. Now I think differently, since there
is a serious problem with indexing such a tree.
Since the only information common to both client and server is the object name, it should be used
as a key, maybe not the name directly but its hash; that does not matter at this point. Here comes a
problem with the binary tree choice: in a binary tree there is no connection between the real parent in the
filesystem and the parent in the binary tree, so there will be serious problems when we put two
different objects with the same name into the binary tree: there will be a conflict. To solve
this problem we should use some information about where the object is placed, i.e. information
about its parent directory. Using the parent name hash as a part of the key in the binary tree
does not solve the problem either, since there might exist multiple directories with the same name and
the same object in them. We could solve the problem by putting into the key the hash of the object's name
and the hash of the parent's key (which in turn is the hash of its name and the hash of its parent's key), so this
recursive hashing would end at the highest level (i.e. the root directory). This works, but there
might be scalability problems because of the following issues:
- the server has to either cache opened directories or reopen them one-by-one when accessing an object
- when an object is moved/renamed, all keys of its children and parent have to be changed
which is unacceptable. So a new solution was thought of.
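The recursive key and its rename problem can be shown with a short Python sketch. It is one possible reading of the scheme, under assumptions: SHA-1 is an arbitrary stand-in for the hash, and the way the name hash and the parent's key are combined is made up for illustration.

```python
import hashlib


def name_hash(data):
    # SHA-1 is an arbitrary stand-in; the post does not name a hash function.
    return hashlib.sha1(data).digest()


def object_key(path):
    """Key = hash(hash(name) + parent's key), recursing up to the root."""
    key = name_hash(b"/")  # key of the root directory
    for component in path.strip("/").split("/"):
        if component:
            key = name_hash(name_hash(component.encode()) + key)
    return key


# The same name under same-named directories with different ancestors
# yields different keys, so the conflict is gone.
key_in_a = object_key("/a/x/file")
key_in_b = object_key("/b/x/file")

# The price: renaming /a to /a2 changes the key of everything below it.
key_after_rename = object_key("/a2/x/file")
```

Every descendant's key depends on every ancestor's name, which is exactly why a move or rename forces a rekeying of the whole subtree.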
So far I have two ideas:
- kind of radix tree
- multi-layer hash tables indexed by double name hash
While the former is kind of obvious, the latter is a quite interesting but very simple idea. Consider
that each directory has a hash table of its children, indexed by a double hash of the child's name.
We need a double hash to remove the possibility of collision (I cannot prove mathematically, at least not yet,
that there are two hashes which will not allow a simultaneous collision in both, but I feel quite strongly
that such hash pairs exist) and to use them in network commands. Commands can be optimised either
to use the full path if it is short enough (just send a path string during writeback or readpage as a
path to where the data belongs) or to use an array of hashes of the path elements instead of '/' separated
names. The hash tables actually have to be replaced by a different data structure capable of hosting not only
small hash values, but full 32 or 64 bit hashes. It can be a binary tree or a Judy array, something similar
to what was used in
unified socket storage. The former looks a bit excessive.
With such an approach it is possible to look up an object in O(k) operations, where 'k' is the number of directories
in the path; it is usually smaller than 10, which for a binary tree corresponds to only 1024 inodes,
which is too small for a real system.
This approach (especially when the full path is being sent) eliminates the scalability problems mentioned above.
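Here is a rough Python sketch of the multi-layer idea, with one table per directory and one table lookup per path element. The two hash functions (CRC32 and a truncated MD5) and the names `double_hash`, `Dir` and `lookup` are picked only for illustration; a plain dict stands in for the tree or Judy array mentioned above.

```python
import hashlib
import zlib

def double_hash(name):
    """Two independent 32-bit hashes of a name, used together as one key."""
    data = name.encode()
    h1 = zlib.crc32(data)
    h2 = int.from_bytes(hashlib.md5(data).digest()[:4], "big")
    return h1, h2

class Dir:
    def __init__(self):
        self.children = {}  # (h1, h2) -> Dir or file payload

    def add(self, name, obj):
        self.children[double_hash(name)] = obj
        return obj

def lookup(root, path):
    """Walk one per-directory table per path element: O(k) for k elements."""
    # This array of hash pairs is what could be sent instead of the path string.
    keys = [double_hash(n) for n in path.strip("/").split("/") if n]
    node = root
    for key in keys:
        if not isinstance(node, Dir) or key not in node.children:
            return None
        node = node.children[key]
    return node
```

Moving a directory in this model only touches one entry in the old parent's table and one in the new parent's; nothing below the moved directory has to be rehashed.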
Implementation start is scheduled for today, but I have to think about details first.
/devel/fs :: Link / Comments ()
Wed, 12 Mar 2008
(Cache) Coherent Remote File System sources are available now.
Zach Brown has
announced
the opening of the CRFS source code.
CRFS is a network filesystem which works
with remote BTRFS volume and supports
cache on clients.
Here is a brief set of features CRFS supports:
- the user space server exports a private BTRFS volume
- the network protocol operates on ranges of BTRFS disk items
- the kernel client provides posix semantics by operating on items
- the server can grant and revoke client caches of data and metadata
The CRFS protocol is tightly tied to how BTRFS is organized. For example, there is natural
batching of some commands like recursive delete, since btrfs keys
are placed one after another, so there is no need to send an additional command; instead
the first one can be extended to cover a wider key range.
As you might notice, pohmelfs
was started as a competitor to the crfs project, because the latter is interesting and was closed. Right
now pohmelfs has a set of very interesting features crfs does not and likely will not support (like offline
working and support for different server filesystems), and its todo list has plenty of very interesting
stuff, so it will not be closed. Instead I plan to continue the competition (which is a bit
complex for me, since it is the first filesystem I write and essentially I did not know what an inode was
before) and fully complete pohmelfs. Although I subscribed to crfs-devel :)
My new shiny servers will be installed today, so tomorrow I will start (re)implementation of
the ground ideas of
pohmelfs.
Stay tuned!
/devel/fs :: Link / Comments ()
Thu, 06 Mar 2008
POHMELFS: was done just wrong!
So, the last several days, devoted mostly to thinking about things and some
experiments with them, led me to the headline conclusion: pohmelfs was done
just wrong!
Its network ping-pong protocol is wrong, its inode resync logic and overall
need for inode number change is wrong, its writeback logic is wrong (btw, why
Linux VFS calls writeback for inode after it calls writeback for inode's pages?
This leads to the inode number resync code duplication and fair number of problems),
its userspace server cache is wrong (well, its userspace server is a braindamage,
but that does not prevent it from being wrong too), and the most important: it becomes complex,
so I frequently have to read my own code multiple times to understand what I meant here or
there.
That just has to be changed (mostly just removed)!
Thinking about all that crap led me to a more philosophical conclusion: any network
protocol which requires a precise acknowledgement for each packet is broken. Period.
TCP is not broken, since it can send acks for multiple packets. TCP can aggregate on both
sides of the connection (which can lead to the huge
performance increase
as was observed in userspace network
stack over netchannels),
so it is a stream, not a ping-pong, although its policy for ack generation is not always the best decision.
Out of curiosity, why original ping and traceroute commands were not implemented as TCP applications
which would catch ack/rst packets?
So, anything ping-pong like is just broken. Never ever use that logic at all, since it breaks performance
and ability to extend. More to the game, it breaks ability to create real duplex communication,
since while you expect an ack you can get data from the other peer for different command.
So, the brilliant idea (yes, I sometimes get them from the deep abyss of the mindless) is to convert the POHMELFS
protocol into two real streams: one from client to server and a completely independent one from server to client.
It has zillions of benefits, but let's see how it is going to be implemented and what will be fully broken in the filesystem.
First, there will not be resync logic. At all. Each inode (and its number) on the client will not correspond
to any inode object on the server, so local inodes will never be synced with the server one. Instead cache of the objects
on the server side will be indexed by special keys containing name, length and other parameters needed for unique number generation.
Client inode number will never be sent to the server, so object creation will have only single direction: just send a packet.
If there is unrecoverable error, connection can be broken, so subsequent command sending would reconnect or make some
changes. Things like permissions will be guarded by the client, there might be no space problem though.
Second, commands, which require feedback from the server, like reading directory content will become completely
asynchronous, so feedback from the server will not be exactly a sync reply for given command, instead
we can wait until directory content was populated and start providing it back to VFS.
Third, and the main one, there is a possibility of command streams both from client and server. Since client commands
now do not require a sync ack/reply, they can be batched for maximum performance, but that is not the main feature;
really interesting is the ability to receive a stream of commands from the server, so each of them can be parsed
independently of the original client command state. This allows implementing the cache coherency protocol without major
pain and having a high performance stream of data from server to client.
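To make the "stream, not ping-pong" idea a bit more concrete, here is a small Python sketch of self-describing commands that a receiver can parse one by one, without knowing what the other side is waiting for. The header layout (16-bit command id, 32-bit payload length) and the names `pack_command` and `parse_stream` are purely hypothetical, not the real POHMELFS wire format.

```python
import struct

# Hypothetical header: command id (u16) and payload length (u32), network order.
HEADER = struct.Struct("!HI")

def pack_command(cmd, payload):
    """Serialize one command; the sender never waits for a reply."""
    return HEADER.pack(cmd, len(payload)) + payload

def parse_stream(buf):
    """Return (list of complete (cmd, payload) pairs, unconsumed tail)."""
    out = []
    off = 0
    while len(buf) - off >= HEADER.size:
        cmd, size = HEADER.unpack_from(buf, off)
        if len(buf) - off - HEADER.size < size:
            break  # payload not fully received yet, keep it for the next call
        start = off + HEADER.size
        out.append((cmd, buf[start:start + size]))
        off = start + size
    return out, buf[off:]
```

A sender can concatenate any number of `pack_command()` results into one write, and the receiver feeds whatever bytes arrive into `parse_stream()`, carrying the returned tail over to the next read.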
More to the game is ->sendpage()/sendfile(), which are
broken
without proper acknowledge, so to fix the issue I plan to submit a socket extension patch, which will call
appropriate registered callback when page reference counter is about to be dropped, which automatically means
data was received on the remote side. This kind of acknowledge does not break connection down more than
simple unidirectional bulk transfer, so it is fast.
So, I started deleting lots of code and implementing the needed bits; the near future will show how broken my approach is.
This raises a question about design vs. evolution... I actually prefer the former, but frequently end up with the
latter (like this decision about network protocol, which is a design, but only after several evolution steps
in wrong direction). This reminds me kernel evolution
topic, which does not actually show anything good for the kernel: there are lots of dead-end evolutional branches which
believe they are the top of the progress, maybe mankind is one of them...
That was a lyrical digression, so back to business!
/devel/fs :: Link / Comments ()
Sun, 02 Mar 2008
Removing arbitrary size directory with single network command in POHMELFS.
All operations in pohmelfs
are made locally and are populated back to the server during writeback time (or via cache coherency
algorithm, which is not implemented fully yet). POHMELFS uses
writeback cache in all its power, which allows to remove directory of arbitrary size
using only single network command.
During unlink/rmdir the local object is removed and potentially destroyed, while a short reference
to what it was is stored in a sync list of the parent, which is marked as dirty. So, when writeback
hits the parent directory of the just removed object, it sends all information about the removed objects to the server.
So, when a directory with an arbitrary number
of subdirs and other objects is recursively removed locally, information is not sent, but added to the appropriate
parent subdirs, which are removed in their own turn, so when the whole subdir is removed, only a single object
becomes dirty - the parent of the just-removed directory, which contains information about the removed
dir. A message about this will be sent later (on writeback or because of the cache coherency protocol), which will
force the server to remove the whole subdir recursively. This is much faster than sending information about
every single object being removed during recursive removal of the directory.
Of course if writeback starts hitting pohmelfs inodes during deletion time it is possible that not only
information about the highest removed directory will be sent, but also about some underlying subdirs, but
that does not matter a lot, since this is a very short condition (inode is in dirty list and yet not removed
by the recursive removal) and number of such inodes is still much smaller compared to overall number of removed
objects.
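The mechanism can be modelled in userspace with a toy tree. The following Python sketch assumes that writeback never runs in the middle of the removal; `Node`, `remove_recursive` and `writeback` are illustrative names, not pohmelfs functions.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        self.removed = []   # "sync list": names removed since last writeback
        self.dirty = False
        if parent is not None:
            parent.children[name] = self

def remove_recursive(node):
    """Local rm -r: nothing is sent, each parent just records the removal."""
    for child in list(node.children.values()):
        remove_recursive(child)
    parent = node.parent
    del parent.children[node.name]
    parent.removed.append(node.name)
    parent.dirty = True

def writeback(node, path=""):
    """Collect the removal messages that would be sent to the server."""
    messages = []
    here = path + "/" + node.name if node.parent else ""
    if node.dirty:
        messages += [("remove", here + "/" + n) for n in node.removed]
        node.removed, node.dirty = [], False
    # Removed subtrees are no longer reachable, so their sync lists are never sent.
    for child in node.children.values():
        messages += writeback(child, here)
    return messages
```

Removing a directory with hundreds of objects below it leaves exactly one dirty node reachable from the root, so the next `writeback()` produces a single `("remove", path)` message.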
Actually cache coherency algorithm is the last serious thing to implement in pohmelfs I think. There are bugs
of course and some feature extensions, but major milestone will be set after this got implemented.
Stay tuned!
/devel/fs :: Link / Comments ()
Thu, 21 Feb 2008
CacheFS and NFS local caching.
David Howells of RedHat recently
posted
next round of his CacheFS implementation. Main idea of the project is to
store locally data and metadata modification on disk.
Cache is implemented as write-through one. Locally data is stored as
usual files on a special partition formatted as one or another filesystem.
David also posted
benchmarks
of his approach. Metadata intensive operations showed significant slowdown
with the local on-disk cache, getting metadata from local cache also shows
a slowdown. The former can be explained by the write-through nature of the cache
and slow local disk operations, which is also a reason for the slower
metadata reading.
There is also no cache-coherency algorithm implemented for CacheFS. Another problem,
also pointed out by Kevin Coffman, is possibly slower reading of data from the cache than
from the local filesystem (and from the remote one if bandwidth is not a limiting
factor, which is frequently the case).
This is third (actually the first :) local cache implementation for the network
filesystem, so competition between
CRFS,
POHMELFS and
CACHEFS becomes even
more interesting :)
Stay tuned!
/devel/fs :: Link / Comments ()
Wed, 20 Feb 2008
Latency problems in pohmelfs.
trying to make at least something...
As was mentioned full inode resync logic
is very slow.
Latency is introduced likely somewhere at protocol layer, which is used
by pohmelfs. To test this scenario and find out the best possible
solution I implemented trivial network module and userspace server, which
talk to each other via protocol very similar to what is used in lookup/create
operations in pohmelfs. Server and client also maintain trees of the objects
it sent/received, so that model would be as much as possible similar to
pohmelfs usage patterns.
It's time to test things and find out where the problem lies, but as usual
there are problems. You are sick, everything is aching, but you
want to beat the crap, to move a bit further, to make something
interesting, so you start implementing the tiny bits, you start thinking,
you finally make the things, so you become happy and proud, and that is
just to find out that all testing machines you had access to previously
are turned off, and new ones are behind a firewall and there is no access
to the network from the ass of the world. This is called 'shit happens'.
/devel/fs :: Link / Comments ()
Wed, 13 Feb 2008
POHMELFS got full inode number resync logic.
Now it updates all upper inodes in the tree when doing writeback for some inodes.
Here is a result:
/mnt/tmp$ mkdir -p 1/2/3/4
/mnt/tmp$ echo qweqweqwe > 1/2/3/4/file
/mnt/tmp$ ls -liR ./
./:
3332986296 drwxr-xr-x 3 zbr users 0 2008-02-13 12:07 1
./1:
3332988600 drwxr-xr-x 3 zbr users 0 2008-02-13 12:07 2
./1/2:
3306456568 drwxr-xr-x 3 zbr users 0 2008-02-13 12:07 3
./1/2/3:
3332985144 drwxr-xr-x 2 zbr users 0 2008-02-13 12:07 4
./1/2/3/4:
3306458488 -rw-r--r-- 0 zbr users 10 2008-02-13 12:07 file
/mnt/tmp$ sync
/mnt/tmp$
/mnt/tmp$ ls -liR ./
./:
557065 drwxr-xr-x 3 zbr users 0 2008-02-13 12:07 1
./1:
557066 drwxr-xr-x 3 zbr users 0 2008-02-13 12:07 2
./1/2:
557069 drwxr-xr-x 3 zbr users 0 2008-02-13 12:07 3
./1/2/3:
557070 drwxr-xr-x 2 zbr users 0 2008-02-13 12:07 4
./1/2/3/4:
557071 -rw-r--r-- 0 zbr users 10 2008-02-13 12:07 file
It also works with much bigger trees (like untarring the linux kernel tree,
although the ugliness of the userspace server requires raising the maximum number of open
file descriptors).
There is a single problem in this case: it is damn slow. And I do not see
an easy explanation for that. Well, tcpdump shows a small window, but that is an end result
I think, not a reason, and the reason is likely in the protocol pohmelfs uses - the system sends
a number of short packets in round-robin fashion, which may be slow for some reason.
Since I'm waiting for real hardware to test things on (since oprofile does not work on installed
Xen version), I can only handwave about the root of the problem...
And that is exactly the same problem which was with write-through cache pohmelfs had first, I think
even timings are similar, so after this problem is fixed, new version will be released.
There is another problem, which complicates the development - I got a cold (second one this year, and third
one for the last 3 or 4 years though), but such condition with some temperature, when brain is in the
'hinged' state between sick and good shape, opens very fun feelings about things around, which usually
ends up with very interesting results.
/devel/fs :: Link / Comments ()
Tue, 12 Feb 2008
POHMELFS got inode number resync logic.
It happens when inode in question is being under writeback -
protocol implements quite simple ping-pong message passing,
so result looks like this:
/mnt/tmp$ echo qweqweqwe > qwe
/mnt/tmp$ ls -lai ./
total 8
557057 drwxrwxrwt 2 root root 4096 2008-02-12 19:58 .
2 drwxr-xr-x 22 root root 4096 2008-02-12 19:58 ..
3322992632 -rw-r--r-- 0 zbr users 10 2008-02-12 20:32 qwe
/mnt/tmp$ sync
/mnt/tmp$ ls -lai ./
total 8
557057 drwxrwxrwt 2 root root 4096 2008-02-12 19:58 .
2 drwxr-xr-x 22 root root 4096 2008-02-12 19:58 ..
557065 -rw-r--r-- 0 zbr users 10 2008-02-12 20:32 qwe
But overall it does not work, since writeback can happen for any inode
inside the whole not-synced tree, so trying to sync inode number for some
obscure object, which sits in the directory server never saw before, is quite
problematic - the whole tree has to be traversed from the inode under writeback
up to the one which is known for the server host.
Although this is not a very complex task, there is a question about what to
sync. Should the whole directory content be synced, or just a single inode?
If the former, then should we force writeback for other objects in the directory under
resync... I think the simplest case is to force only higher layer object creations,
not syncing their content (like other objects in the directory), but the directory itself
should be marked as dirty, so that access from different clients forces the appropriate
resynchronization.
/devel/fs :: Link / Comments ()
Mon, 11 Feb 2008
Initial implementation of the offline and cache coherency algorithms.
It is rather dumb and even does not have state machine handling
in the usual meaning.
The existing pohmelfs implementation has only two places where the content of the inode
is 'globally' modified; by 'globally' I mean changes which have to be seen
by other clients if they access the given inode.
First one is directory reading, when inode in question gets information about
other inodes in given one, another one is object creation. Object removal is local
operation, and there are no collisions if multiple clients delete the same object
simultaneously.
When directory is being read first time, pohmelfs just syncs its content from the server,
all subsequent reads happen from cache, since all creations and removals happen locally.
This case is simple.
When pohmelfs is about to create an object, it marks parent inode as dirty,
if parent inode was not marked dirty previously, this ends up sending a single
message to the server. Server in turn can return content of the directory in question,
if that inode was already modified by different client. If there are objects with the same
name as local ones, local objects are 'renamed' to the 'oldname-synctime', so that
user could later run diff or whatever and merge changes. That is how offline
pohmelfs clients work.
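The renaming rule can be shown with a tiny Python sketch. It assumes a directory is just a mapping from name to content and that only objects whose content actually differs get renamed; `merge_directory` and the `synctime` string format are made up for the example.

```python
def merge_directory(local, server, synctime):
    """Merge the server's directory listing into the local (offline) one.

    local, server: dicts mapping object name -> content.
    The server version always keeps the original name; a conflicting
    local object survives as 'oldname-synctime'.
    """
    merged = dict(server)
    for name, content in local.items():
        if name in server and server[name] != content:
            merged["%s-%s" % (name, synctime)] = content
        else:
            merged.setdefault(name, content)
    return merged

local = {"notes": "my offline edit", "todo": "same", "draft": "only local"}
server = {"notes": "edit from another client", "todo": "same"}
merged = merge_directory(local, server, "20080211")
```

After the merge the user sees both `notes` and `notes-20080211` and can run diff on them, while untouched and local-only objects keep their names.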
Object is always created in the local cache only with local inode number. So far
it is never being sent to server (although code which does it and changes the inode
content exists), even writeback does not work right now (since server does not know
about object with local inode numbers). This part is a bit more complex: pohmelfs
has to sync inode (i.e. to send current inode info, wait until server creates object,
then receive real inode info and change local cache) either in writeback (when
system forces to writeback a page(s), appropriate inode will be synced first)
or in cache coherency algo. For that purpose each network state locking first checks if there
are messages in the queue from the server, which have to be processed first.
So far only receiving server content is supported; forcing the client to send its own content on request
from the server is the basis of cache coherency and this is not yet turned on. Here
the major race lives, which can lead to the full resync of the idea actually. After we locked
our own network state and checked that there are no requests from the server, the client can start
sending its own commands, but before they reach the server, it can start a CC resync
(and send messages into the same pipe as the client's commands) initiated
by a different client, which will break the protocol state machine. This is the main idea to think about.
Oh, and to implement the same logic on server :)
/devel/fs :: Link / Comments ()
Thu, 07 Feb 2008
POHMELFS and CRFS in the news.
At LWN.net. And as usual I do not have an account this time...
So, will wait for a week for free article, by that time pohmelfs will contain very tasty things,
which do not exist in any other fs out there (or at least in the single filesystem).
Edited to add that Simon Holm Thøgersen shared a link to the article. It is somewhat fun,
although author (Jake Edge) writes quite differently from Jonathan Corbet imho. Article
does not compare pohmelfs and crfs, but shows that they are very similar. I've known, that
Zach Brown works about a year on CRFS, while pohmelfs exists
less than a month. Someone shared a secret knowledge about meaning of the pohmelfs abbreviation
in russian, well, maybe he/she is right, who knows...
Article does not cover features scheduled for pohmelfs like offline working and inode resync logic.
Commenters try to compare crfs and pohmelfs with afs and pnfs. Both do not have metadata caching
mechanisms, so they are fundamentally different, pnfs in addition allows to implement closed
extensions, which will lead to vendor lock.
One point to writer Jake Edge is that he does not use names in the articles, but only last names.
/devel/fs :: Link / Comments ()
Filesystem freezer. Removable device.
There is a long discussion in linux-fsdevel
about various filesystem freezing implementations and features it should have.
Main goal of this project is to freeze any filesystem, so that all write requests
would be blocked. This allows to implement consistent backups. This task
belongs to block layer though, and this patchset actually implements that by
suspending underlying block device. Although interface (ioctl) is a bit ugly,
it will likely be accepted, since other filesystems (namely XFS) have such feature
via their own private ioctls. People say that it does not always work though.
LVM supports consistent backups natively, but having such interesting feature without
need to work on top of device mapper would be a great deal!
This highlighted a very interesting project I have in mind (actually it will be
another reinvention of the wheel though) about various removable devices. Actually
it is not only about removable ones, but any devices which can suddenly disappear or get stuck
(like a network filesystem, a broken cable to a local disk or a bad drive).
The old idea is to remount such a device as read-only and return an error to
any attempt to access it. There is a frevoke() syscall which does that
for a given file descriptor - it is marked as erroneous so access to it returns an error,
but this does not fix the problem with a network filesystem, for example. Let's suppose
we have an NFS client which is stuck because the server was disconnected; there are cases when
it will never resume or return an error. Or a bad block/bad drive access, which will be retried
again and again forever...
Revoking particular file descriptor is simple task, but what if we have a web server,
which accesses broken drive for each new client or similar scenario? While we revoke one file descriptor,
server will create another two, stuck in the middle of the operation.
The very good solution I have in mind is to break all existing access paths (the block
layer has access to all bios) and either replace the underlying device with a fake one,
so that all requests would be completed with an error (consider it like hotplug/unplug
of a storage device), or replace the filesystem (inode and file) operations, so that
they return an error (that is like hotplug/unplug of the filesystem). In the latter case
it would even be possible to change the filesystem on the fly! First, plug a filesystem which
just queues requests instead of processing data, then unplug the real filesystem,
plug the new one and unplug the fake one.
Not sure it is very useful functionality, but very interesting...
/devel/fs :: Link / Comments ()
Btrfs 0.12 has been released.
Chris Mason changed the on-disk format again, which leads
to a very noticeable (30 times!) speed improvement
for random write access (from 1 mb/s to 30 mb/s).
This release also contains a mount option and some tweaks for SSDs (solid-state disks),
mainly write clustering without taking into account the directory
the file writes belong to. Simple ENOSPC handling was also added;
although it is still possible to crash the machine when there is
no space left on the device, now it is a bit harder.
Next step for btrfs
is to support multiple devices for single filesystem via
subvolumes.
Release notes.
/devel/fs :: Link / Comments ()
Wed, 06 Feb 2008
Continuing POHMELFS client side caching design (offline working capabilities).
As I wrote previously,
accepted design of the local cache
allows not only to fix problem cases with inode generation numbers, but also
provides a very interesting feature with offline working.
Let's suppose the client was moved offline or just has not yet synced its cache with the server.
It can work without any problem, and later, when it connects back to the server, the system will resync
its data with the server's. For all files which are different on client and server, the client will
have its own version, but with a different name (like orig_name-$date_of_sync),
so that the user could run diff or anything else and merge changes properly.
Number of usage cases for this excellent imho functionality is extremely large...
There is a problem though, since client's memory is limited, and eventually writeback will
start pushing data to server, so for such cases client has to have ability to cache not only
to mem, but to disk too. That is future extension though.
An anonymous reader dropped me a note that such behaviour of locally cached files,
whose inode numbers change after resync with the server, will be frowned upon by some
RSBAC systems.
I believe that an inode-only based approach is broken because of heavy problems with filesystems
where a file can be changed by different clients. It is possible to remove a file and then
create a new one which gets the same inode number as the just removed one, so without knowing
the name of the file the system will be screwed. And how does such a system work with hardlinks, which
have the same inode number as the target object, but different names?
/devel/fs :: Link / Comments ()
Tue, 05 Feb 2008
POHMELFS inode generation and cache coherency.
I think I've just designed the way to fix the problem with
overlapping inodes on different clients or server and clients.
Here is a short problem description: when a client locally creates some
object, it has to assign a unique number to its inode and put it into the
global hash tables. With local cache and maximum performance (or when the client
is offline) it should not connect to the server and perform the create operation
at all; instead it should pick some number for the inode and work with it.
The problem is that a number of clients can have the same inode number for
different inodes, or actually have the same object but with different inode
numbers on different clients' machines.
When clients and server have to sync their states, a problem arises: the server
does not know about an inode with the client's number and thus sync cannot happen.
The solution is quite simple imo, and solves both the cache coherency problem and
the inode number one.
Clients use any numbers they like: for example sequential increase from zero.
When a new object is created its parent is marked as dirty by the client (if it is already
marked as dirty by another client, that client is forced to push its changes to the server,
which then forwards them to the new client), and the client uses its own inode numbering
scheme. When later there is a need for resync (like forced writeback or the above case
of cache coherency synchronization), the client sends the inode content to the server
with both the name and the local inode number. The server then creates an object and assigns
a real unique inode number to it, which is then returned back to the client. The client
removes the inode with the old (local) number from the hash and inserts it back with the new
inode number. That's all.
Simple. And it allows working with any filesystem on the server side because the system
uses both object name and object id (inode number) as identifiers during creation time.
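A toy Python model of this renumbering, under loud assumptions: the `Server` and `Client` classes, the starting number 557065 and the dict used as the inode hash are all invented for the illustration, and the sketch ignores the case where a server number collides with a still-unsynced local one.

```python
import itertools

class Server:
    def __init__(self, first_ino=557065):
        self._next = itertools.count(first_ino)
        self.objects = {}  # real inode number -> name

    def create(self, name, local_ino):
        # local_ino is received but only the server-chosen number is kept.
        ino = next(self._next)
        self.objects[ino] = name
        return ino

class Client:
    def __init__(self):
        self._next = itertools.count(0)  # local numbering: sequential from zero
        self.inodes = {}                 # inode hash: number -> name
        self.unsynced = set()

    def create(self, name):
        """Purely local creation, no network round trip."""
        ino = next(self._next)
        self.inodes[ino] = name
        self.unsynced.add(ino)
        return ino

    def resync(self, server):
        """Send name + local number, then rehash under the number returned."""
        for local_ino in sorted(self.unsynced):
            name = self.inodes.pop(local_ino)
            real_ino = server.create(name, local_ino)
            self.inodes[real_ino] = name
        self.unsynced.clear()
```

After two local creations followed by `resync()`, the client's hash holds the objects under the server's numbers instead of 0 and 1, which is the same effect `ls -li` shows before and after `sync`.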
So far I do not see any drawbacks in this approach, but practice will show if it is
correct design or not. Stay tuned.
/devel/fs :: Link / Comments ()
Thu, 31 Jan 2008
POHMELFS release notes.
One can grab release notes, my thoughts (a bit chaotic) and code
here (POHMELFS core)
and local-only-cache hack here.
Please note that POHMELFS is less than one month old, so do not
be too severe with it :)
And I'm going to have some fuel about this release, it was hard, but bloody cool!
/devel/fs :: Link / Comments ()
First POHMELFS version, codename water:50ml, has been released.
A small benchmark of the local cached mode:
$ time tar -xf /home/zbr/threading.tar
         POHMELFS     NFS v3 (async)
real     0m0.043s     0m1.679s
Which is damn 40 times!
Excited? Below is a bucket with ice for you and me.
Of course this will not be _that_ huge difference in a real world, when
tested archives are larger (this one is a git archive of my
userspace threading
library), which is very small. Since it is so small there is no writeback
cache flushing.
But you got the key :)
And that version will not be released, since it uses so heavy hack,
called local cache, which is never synced with remote server. Actually one
can consider this as tmpfs or something like that. Code supports sync,
but since inode generation process is very different, files and dirs can not
be blindly synced to the ext3 fs. So, I will release POHMELFS as two patches:
first one is a network filesystem implementation with
write-through cache, when object is first created on the remote side and then
populated to the local cache. This one is slow.
Second patch is a hack to disable writeback caching and implement local caching
only, which is very fast.
After that I will start thinking about how to generically solve the problem with
syncing local changes with remote server. This, among others, will allow offline work
with automatic syncing after reconnect.
This is not intended for inclusion, CRFS
is a bit ahead of POHMELFS, but it is not generic enough (because of above problem)
and works only with BTRFS.
And, btw, I changed the naming convention: since having a set of volumes from 50ml to 1 liter
is not enough for serious development, I will prepend a liquid name for each row. So, it will
be water:{50ml, 100ml ... 1 liter}, tea {50 ml ... 1 liter} ... spirit {50ml ... 1 liter}.
Amount of different "waters" I know should be enough for this project :)
Stay tuned!
/devel/fs :: Link / Comments ()
Nasty dentry abuse or...
... searching for rakes by stepping on them in a dark room. That is how I can describe
the process of hunting for obscure bugs in filesystem code.
Preface 1.
The system locks up hard without a single message in dmesg, although all kernel
hacking options are enabled in the config. The system responds to ping, but there
is no way to log in or to do something as a local user.
Preface 2.
I recall, things were cool.
Bisecting is not my friend today, since fair number of fixes was added and
while I can find situation, when new bug does not exist, old ones can kill
the system, so I decided to manually check every patch in git I added
for the last days. Since I do not know VFS enough, there are several things
I just copied from other filesystems (most of them do it that way),
so I started to drop some bits out of that code in pohmelfs.
Eventually I found, that lookup, which fails to find requested dentry
in most filesystems adds NULL inode into dentry either via d_add()
or via d_splice_alias(). Both look harmless, except that dentry
with NULL inode exists in the dentry cache. Maybe it is good and there is
some other bug in pohmelfs, but after I added it I started to get those obscure
freezes (it is quite easily reproducible with almost 100% probability in some tests),
and sometimes a general protection fault happened in VFS code during umount.
So, I just removed the code which adds a NULL inode into the dentry via d_add()
and things are good again. I do not know how frequently this can happen in a local filesystem,
but a fact is a fact: after removing this code pohmelfs behaves excellently (modulo its speed).
Edited to add: no, something wrong still exists in the system, although I'm not sure whom
to blame:
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.23-pohmelfs #4
-------------------------------------------------------
bash/4116 is trying to acquire lock:
(&journal->j_list_lock){--..}, at: [] journal_try_to_free_buffers+0xd4/0x187 [jbd]
but task is already holding lock:
(inode_lock){--..}, at: [] drop_pagecache+0x48/0xd8
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (inode_lock){--..}:
[] __lock_acquire+0xa66/0xc48
[] lock_acquire+0x7a/0x94
[] _spin_lock+0x38/0x62
[] __mark_inode_dirty+0xce/0x147
[] __set_page_dirty+0xd0/0xdf
[] mark_buffer_dirty+0x8b/0x92
[] __journal_temp_unlink_buffer+0x174/0x17b [jbd]
[] __journal_unfile_buffer+0xb/0x15 [jbd]
[] __journal_refile_buffer+0x6a/0xe3 [jbd]
[] journal_commit_transaction+0xf46/0x11eb [jbd]
[] kjournald+0xb5/0x1c1 [jbd]
[] kthread+0x3b/0x63
[] kernel_thread_helper+0x7/0x10
[] 0xffffffff
-> #0 (&journal->j_list_lock){--..}:
[] __lock_acquire+0x952/0xc48
[] lock_acquire+0x7a/0x94
[] _spin_lock+0x38/0x62
[] journal_try_to_free_buffers+0xd4/0x187 [jbd]
[] ext3_releasepage+0x68/0x74 [ext3]
[] try_to_release_page+0x33/0x44
[] __invalidate_mapping_pages+0x74/0xe0
[] drop_pagecache+0x70/0xd8
[] drop_caches_sysctl_handler+0x36/0x4e
[] proc_sys_write+0x6b/0x85
[] vfs_write+0x82/0xb8
[] sys_write+0x3d/0x61
[] syscall_call+0x7/0xb
[] 0xffffffff
other info that might help us debug this:
2 locks held by bash/4116:
#0: (&type->s_umount_key#11){----}, at: [] drop_pagecache+0x38/0xd8
#1: (inode_lock){--..}, at: [] drop_pagecache+0x48/0xd8
stack backtrace:
[] show_trace_log_lvl+0x1a/0x2f
[] show_trace+0x12/0x14
[] dump_stack+0x16/0x18
[] print_circular_bug_tail+0x5f/0x68
[] __lock_acquire+0x952/0xc48
[] lock_acquire+0x7a/0x94
[] _spin_lock+0x38/0x62
[] journal_try_to_free_buffers+0xd4/0x187 [jbd]
[] ext3_releasepage+0x68/0x74 [ext3]
[] try_to_release_page+0x33/0x44
[] __invalidate_mapping_pages+0x74/0xe0
[] drop_pagecache+0x70/0xd8
[] drop_caches_sysctl_handler+0x36/0x4e
[] proc_sys_write+0x6b/0x85
[] vfs_write+0x82/0xb8
[] sys_write+0x3d/0x61
[] syscall_call+0x7/0xb
=======================
Although it does not contain any signs of pohmelfs, it still can be related...
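The report above boils down to two code paths taking the same two locks in opposite order: kjournald ends up taking inode_lock while holding j_list_lock, and drop_pagecache takes j_list_lock while holding inode_lock. A minimal userspace sketch of that check follows. It is not kernel code and not how lockdep is implemented; the names lock_graph, lock_graph_init and lock_graph_record are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes, named after the two locks in the report. */
enum lock_class { INODE_LOCK, J_LIST_LOCK, NR_LOCK_CLASSES };

/* held_before[a][b] is true when some path took b while holding a. */
struct lock_graph {
	bool held_before[NR_LOCK_CLASSES][NR_LOCK_CLASSES];
};

static void lock_graph_init(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/*
 * Record the order "held, then wanted". Returns false when the reverse
 * order was already recorded, i.e. the two paths together can deadlock.
 * Only direct two-lock inversions are detected here; the real lockdep
 * validator walks the whole dependency chain.
 */
static bool lock_graph_record(struct lock_graph *g,
			      enum lock_class held, enum lock_class wanted)
{
	assert(held < NR_LOCK_CLASSES && wanted < NR_LOCK_CLASSES);

	if (g->held_before[wanted][held])
		return false;

	g->held_before[held][wanted] = true;
	return true;
}
```

Recording (J_LIST_LOCK, INODE_LOCK) for the kjournald path succeeds, and recording (INODE_LOCK, J_LIST_LOCK) for the drop_pagecache path then returns false, which is the situation the kernel complains about.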
/devel/fs :: Link / Comments ()
Wed, 30 Jan 2008
POHMELFS first release: 50 ml.
It is postponed until tomorrow. The kernel side is fully ready and was quite
actively tested (I think I found lots of tricks used by ext2 and others
when they maintain link counters and process inodes). There are some issues
with local caches, which I will think about later. There are two caches
right now: one is global and converts a hash to an inode, the other one is
per-inode and contains only hash to inode number keys, for example
it contains hard links and directories like '.' and '..'. Other usual
directories and files exist in both caches. The latter cache is used
for the ->readdir() implementation, since it is also indexed
by the offset field.
I will not release code today because of userspace server, which is so
utterly bad (for every single operation it has to traverse tree of the
objects and to open/close each parent's file descriptor), so it screams
for rewriting. At least for the initial rewrite it will open every single
object it contains, so that requests from remote client would not require
lots and lots of tree traversals. I know that this is not a good solution,
but the only good solution is to move server into the kernel too, but it
will take several days to complete, so it will be scheduled for future versions.
Also found that debugging in Xen is a nightmare. First, it does not support oprofile
(at least the latest version constantly tells me "No sample file found" when I try
to see the report). Second, it is buggy: I have (so it seems) two Xen domains with identical kernels,
and one of them regularly freezes in such obscure places that it is impossible
to debug it correctly. And then I lost the first setup, which worked well (some LDAP problems
do not allow me to log in to that domain anymore). Third,
the Xen setup I have is slow (very damn slow). Fourth, it is unfair testing,
since different domains can eat all the CPU during one test or another and that will
not be easily detected.
So, for initial testing it is enough, but real development will require real hardware.
Stay tuned...
/devel/fs :: Link / Comments ()
Tue, 29 Jan 2008
CRFS release plans.
Zach Brown will release
his CRFS (Cache/Coherent Remote File System) at
LCA this Friday.
My congratulations!
This does not change any plans about pohmelfs though,
first version of which will be released today or tomorrow :)
/devel/fs :: Link / Comments ()
Mon, 28 Jan 2008
Is Lustre dead after Sun's acquisition?
It looks so.
First, because the new (1.8) release, which will see the light this summer, will run as a userspace application
(I want to highlight that it is a parallel (!) high-performance (!!) filesystem (!!!))
on top of ZFS, which is a slow
(and a high-end test: zfs is faster than ufs
only in a single setup, which says how bad the Solaris VFS cache is, although that is speculation only)
filesystem, designed and first implemented at Sun, then ported to
Linux via the zfs-fuse project.
Userspace zfs runs slower
than the kernel one in most cases; actually it is faster than kernel zfs only in a single test, and the difference is close to the error rate (about 5-8%).
Sun posted those tests to lustre-devel a couple of months ago.
Second, because kernel support of Lustre (it is based on ext3) is "too complex" for Sun,
and thus will be dropped after 2.0 release (end of the year):
... because it removes the burden of having to maintain kernel patches for Linux.
The encumbrance of kernel patches has made development and debugging of Lustre considerably
more complex than in user space; it has slowed our support for new Linux kernels and distros;
and it's even been the source of some nasty regressions when unsupported kernel APIs changed
from under us.
Btw, Lustre 2.0 will support clustered metadata, which will allow metadata-intensive
operations to scale greatly.
Such a situation is perfect for new distributed filesystem development!
/devel/fs :: Link / Comments ()
POHMELFS naming convention and the first release date.
I've just decided that
POHMELFS
will not use a traditional versioning system (1, 2, 3 or 0.1, 0.2 and so on),
but a completely new one, related to its name.
As you probably know,
POHMELFS stands for Parallel Optimized
Host Message Exchange Layered
File System, so it is very logical to use the following naming convention:
50 ml, 100 ml, 0.3l, pint, 0.5l, 0.7l, 1 liter and so on...
The first release is scheduled for this week. It will not include the cache coherency
algorithm, but will have a completely new and faster local cache.
Stay tuned!
/devel/fs :: Link / Comments ()
Fri, 25 Jan 2008
POHMELFS got correct rmdir support.
That was quite easy: somehow, when a directory is being removed, its own reference
counter has to be dropped twice and the counter of the parent directory once.
Files do not require that (or there is a bug in my code): they only drop
their own counter.
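The counting above matches the usual Unix link count convention: a directory is referenced by its entry in the parent and by its own '.', and its '..' entry adds one reference to the parent. Whether that is exactly what happens inside POHMELFS is an assumption. Below is a small userspace model of the bookkeeping; struct node and the model_* helpers are hypothetical names, not kernel or POHMELFS code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the link counter kept in an inode. */
struct node {
	unsigned int nlink;
	bool is_dir;
};

/* New directory: one link from the parent entry, one from its own '.';
 * its '..' entry adds one link to the parent. */
static void model_mkdir(struct node *parent, struct node *dir)
{
	dir->is_dir = true;
	dir->nlink = 2;
	parent->nlink++;
}

/* New regular file: only the parent entry refers to it. */
static void model_create(struct node *file)
{
	file->is_dir = false;
	file->nlink = 1;
}

/* Removing an empty directory undoes all three links from model_mkdir(). */
static void model_rmdir(struct node *parent, struct node *dir)
{
	assert(dir->is_dir && dir->nlink == 2 && parent->nlink > 0);
	dir->nlink -= 2;
	parent->nlink--;
}

/* Removing a file only drops its own counter. */
static void model_unlink(struct node *file)
{
	assert(!file->is_dir && file->nlink > 0);
	file->nlink--;
}
```

Starting from a root directory with nlink 2, a mkdir followed by rmdir brings every counter back to where it started, which is the property that was broken while only one link was being dropped.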
Also started link()/symlink() implementation. The former
has the following problem: the userspace server has a mapping between inode
number and object name, and when link() is executed, it creates
a new object which refers to the existing inode under a different name, so the code
fails. I will think about how to implement it without creating a dentry/inode
cache on the server side, but that will be another argument against the userspace
server and for a kernelspace one. In the kernel all those operations should be very
straightforward and fast.
symlink() requires a new operation (i.e. a new network structure to
be transferred), which will include the symlink name, the name of the object it
refers to (this can be an arbitrary string) and the parent directory entry.
It should not be complex to implement.
After all these things are completed, I will perform
LTP testing on top of it,
and then run some benchmarks...
Stay tuned.
/devel/fs :: Link / Comments ()
Thu, 24 Jan 2008
POHMELFS development progress.
I've performed a number of tests (before the electricity was shut down), which included
untar, execution and compilation of small objects. They all went perfectly
fine except directory removal, which has some troubles because I only decrement
the number of links in the object, not including the directory itself, so funny things can
be observed during unmounting (like 100% CPU usage, likely produced by the dentry cache
processing code in VFS). I also found how crappy Debian Etch (or
at least the installation I have) is - I do not know
why, but every ls operation first tries to access the ldaprc
file in every directory I run it in. If you could only see which files the gcc compiler
wants to see in the compilation directory for a simple fstatat()
testing application...
But overall it looks ok, so far without the cache coherency protocol involved,
but I think I have a pretty clear idea of how to implement it correctly.
I'm looking at this and recall that not that long ago I wanted to get a Linux kernel
hacker position in some company. They developed a multi-layer cache system
(i.e. vanilla page cache in memory, then a lower level cache on disk and finally
tape storage) and asked me about my experience. It was quite miserable (and I would
not say I suddenly became brilliant :).
Then there was a question about what an inode is...
And now I develop my own network filesystem, then local and distributed ones - how
interestingly things move with time, what will be next?... I believe that everything
that happened was excellent, and will be even better.
/devel/fs :: Link / Comments ()
Fri, 18 Jan 2008
POHMELFS got initial writing support.
$ ls -l /mnt/tmp/
total 0
$ echo asdasdasdzxczxczxcqeqweqwe > /mnt/tmp/test
$ sync
$ ls -l /mnt/tmp/
total 0
-rw-r--r-- 1 zbr users 27 Jan 18 22:29 test
$ cat /mnt/tmp/test
asdasdasdzxczxczxcqeqweqwe
$ mount | grep pohmel
qweqwe on /mnt type pohmel (rw)
The same data is on the server, and it was only written there after sync
was executed, i.e. exactly in the ->writeback() callback and thus via the page cache.
I will describe it in detail in the next post.
To be a complete (simple!) FS, it still needs inode operations for special files and link support. Both
are quite simple (and probably can be postponed); the most interesting idea I have to think about
is metadata caching (so far it is a write-through cache, which is not optimal, I want a write-back one).
The next complex task is the cache coherency algorithms. It will be started after testing (including performance)
of the initial POHMELFS implementation without cache coherency involved at all.
Stay tuned!
/devel/fs :: Link / Comments ()
Anatomy of the filesystem. Object creation and removal.
Let's first discuss object creation. It is pretty simple,
each directory inode has inode_operations
structure, which contains ->create()/->mkdir()
callbacks. Prototype of both looks like this:
static int pohmelfs_create(struct inode *dir, struct dentry *dentry, int mode,
struct nameidata *nd);
static int pohmelfs_mkdir(struct inode *dir, struct dentry *dentry, int mode);
Where dir is parent directory inode, dentry
is directory entry structure, which contains inode for given object
(dentry->d_inode, it is NULL for the object being created, since there is no
inode yet for the given dentry),
its name (dentry->d_name)
and lots of other interesting fields, which are not that interesting for
filesystem creation. FS code should allocate space for the new entry and
add it there.
At the end one has to fill the dentry with the new inode info. It can be done either by
d_add(dentry, &npi->vfs_inode);, or more correctly by
d_instantiate(dentry, &npi->vfs_inode);, which is called from d_add(),
which then adds the dentry into hash chains. Ext2 also marks the inode as dirty multiple times, and so does minix.
This operation has no effect on a network filesystem, afaics, but for block based filesystems
it adds the inode into the dirty list. However, practice shows that d_instantiate(dentry, &npi->vfs_inode);
is not enough, and d_add(dentry, &npi->vfs_inode); should be called for a network
filesystem.
Object removal is essentially the same. The following callbacks are invoked by the VFS layer
when an object is being deleted: ->unlink() and ->rmdir().
The former is called for usual files, nodes and so on, the latter when you
call rmdir(). Both have the following prototype:
static int pohmelfs_unlink(struct inode *dir, struct dentry *dentry);
static int pohmelfs_rmdir(struct inode *dir, struct dentry *dentry);
Where dir is parent directory inode and dentry contains directory entry,
which in turn has inode pointer and name of the object.
Filesystem should remove appropriate object from the disk, update
its fields and mainly offsets, used in the
->readdir()
callbacks.
All described callbacks should return negative error value or zero in case of correct completion.
/devel/fs :: Link / Comments ()
Wed, 16 Jan 2008
Filesystems and disk caches.
It is known that disk caches are generally very bad for data
integrity in case of various hardware failures or power outages.
It looks like even the safest filesystem will have a hard time recovering
in such cases.
Alan Cox describes
how Ext3 behaves in such a situation: if a power failure during a write damages the sector, ext3 can
not recover; a power failure during a write may cause random numbers to be returned on read, but
fsck should handle that; ext3 should survive if a power failure damages some sectors
around the sector which was written. All of the above does not always happen, and bad things
can happen in every case.
XFS has even more serious
damage in case of a power failure.
/devel/fs :: Link / Comments ()
Tue, 15 Jan 2008
POHMELFS development progress.
If you are curious about the strange delay in POHMELFS development, do not think
it is closed or stuck. There is a number of things I'm working on in this network
filesystem, and the delay is only because of administrivia around my testing environment
and things like that...
Now it seems things have settled down and I have some news.
First, it supports object creation in the filesystem, so far only regular files, but
directories and links are just a matter of additional flags, so it is simple.
Second, it supports object removal (tested on files only though). It does not support
file writing yet, and all metadata operations described above (removal and creation)
perform network sending and receiving (removal can be done in the local cache only).
I will write a more detailed explanation of the operations involved just after directory/link
creation is ready, likely tomorrow.
/devel/fs :: Link / Comments ()
BTRFS 0.10 has been released.
Chris Mason announced
new release of the BTRFS filesystem.
According to changelog, this version contains pretty serious changes:
- on-disk format changes, now it supports back references from every data and metadata block.
This allows future extensions like implementation of on-line fsck
(a question arises: why is it ever needed for a COW FS?) and data migration between different
devices.
- online resizing (including shrinking)
- in-place conversion from ext3 to btrfs :) Although it is offline only, it is a very good
step towards easier migration for users.
The conversion program uses the copy on write nature of Btrfs to preserve the
original Ext3 FS, sharing the data blocks between Btrfs and Ext3 metadata.
Btrfs metadata is created inside the free space of the Ext3 filesystem, and it
is possible to either make the conversion permanent (reclaiming the space used
by Ext3) or roll back the conversion to the original Ext3 filesystem.
- data=ordered support. (Probably it is an option of the transaction log journal)
- mount options to disable checksumming and COW (the latter explains a lot about
fsck and journalling)
- barrier support
From the changelog alone, it looks really impressive, my congratulations to the
project. The list of unfixed bugs worries me a bit, but I'm pretty sure things will be fixed.
/devel/fs :: Link / Comments ()
Direct IO with filesystem from the kernel and fast mapping for loop device.
Although every bit of the system is easily accessible from the kernel,
it is quite hard to do filesystem related tasks, which are generally only
performed from the userspace. For example to read and write files. Actually
one can call the whole sys_open()/sys_read()/sys_write() path
from the kernel, but it is quite slow and ineffective.
Likely the most common example is the loop block device driver, which allows
a usual file to look like a block device, so one can
mount it, create files there and so on.
With time the loop driver became more and more complex. I recall my first
block layer driver (async block device,
which was similar to the loop device, but allowed to perform a lot of operations
asynchronously; it was used to test the acrypto
crypto system) was based on it.
The loop device is quite slow, so Jens Axboe (block layer maintainer) came into the game and
extended it to support a much faster mapping of the blocks to read/write from the kernel
than the existing one.
His first version was extended by Chris Mason (btrfs
author among others), who basically moved the mapping code into the filesystem,
so address space operations were extended to include new callbacks called
->map_extent() and ->extent_io_complete().
The former is used to map an offset inside the file into an extent. Basically an extent is an area
on the disk bigger than a block; so far it is not supported by the mainline tree (at least the 2.6.24 tree),
so one can consider this callback a mapping from file offset into block number. Usually it is
implemented by the filesystem specific ->get_block() callback. The extent part of the patchset
adds a special tree of extents, which can be addressed by offset in the address space; if there
is no extent in the tree, it can be inserted. Extent creation is implemented via ->get_block().
The second callback, ->extent_io_complete(), is only used to notify the calling layer when
IO is completed; so far it is only used to show when hole filling is completed. Actually I do not know
how this callback can be used by a classical filesystem, but copy-on-write ones should benefit greatly,
since they automatically get an async completion, so the higher-layer tree can be updated. Classical
filesystems already handle this situation though. Since it is only implemented for hole filling, it looks
like a little hack :)
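The lookup-then-insert behaviour of the extent mapping can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: the patchset uses a tree of extents, while this model uses a small linear array to stay short, and struct extent, struct extent_map, map_offset and demo_get_block are hypothetical names that do not come from the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXT_BLOCK_SIZE 4096u

/* Hypothetical extent: a run of file blocks backed by contiguous disk blocks. */
struct extent {
	uint64_t file_block;	/* first file block covered */
	uint64_t disk_block;	/* disk block backing file_block */
	uint64_t nr_blocks;	/* length of the run */
};

struct extent_map {
	struct extent ext[16];
	size_t nr;
	/* Stand-in for the filesystem specific ->get_block(). */
	uint64_t (*get_block)(uint64_t file_block);
};

/* Made-up block allocator used for the example: file block N lives at 1000 + N. */
static uint64_t demo_get_block(uint64_t file_block)
{
	return 1000 + file_block;
}

/* Map a byte offset in the file to a disk block number. */
static uint64_t map_offset(struct extent_map *m, uint64_t offset)
{
	uint64_t fb = offset / EXT_BLOCK_SIZE;
	size_t i;

	/* Hit: some cached extent already covers this file block. */
	for (i = 0; i < m->nr; i++) {
		const struct extent *e = &m->ext[i];

		if (fb >= e->file_block && fb - e->file_block < e->nr_blocks)
			return e->disk_block + (fb - e->file_block);
	}

	/* Miss: ask get_block() and remember the answer as a one-block extent. */
	assert(m->nr < sizeof(m->ext) / sizeof(m->ext[0]));
	m->ext[m->nr] = (struct extent){ fb, m->get_block(fb), 1 };
	return m->ext[m->nr++].disk_block;
}
```

With get_block set to demo_get_block, the first lookup of an offset inserts an extent and later lookups inside the same block are served from the cached entry without calling get_block() again.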
Here is Jens' first presentation,
and here is Chris' presentation
of the extent mapping code used to implement fast mapping in loop device.
/devel/fs :: Link / Comments ()
Sun, 13 Jan 2008
POHMELFS filesystem development progress.
So far it is not that big - I'm still trying to setup 3 testing machines (I do not
have physical access to them, so there should be a way to reboot them, check console
output and so on), actually it is 3 small Xen domains on remote machine, so things
are a bit more complex, but it is not enough for initial testing.
Since pohmelfs testing is postponed a bit, I started designing and hacking on a distributed cache coherency system.
For now I will implement the so called MESI
protocol, which is used, for example, in IA-32 SMP machines. There is a number of problems,
since a distributed system is very different from a bus-driven SMP machine. For example,
there is no way in a remotely sane distributed system for a single node to snoop data requests
made by other nodes, but this trick is heavily used to catch requests for modified cache lines.
The cache coherency protocol is a very interesting problem in itself, so I am developing it first as a standalone
application, which will be scalability tested against a huge number of users. Then I will integrate it into
pohmelfs.
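For reference, the textbook MESI transitions for one cached object, seen from a single node, can be written as a small state function. This is a sketch of the classic protocol only, not of the standalone application mentioned above. It assumes that remote reads and writes arrive as explicit messages (since there is no bus to snoop), and the names mesi_state, mesi_event and mesi_next are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum mesi_state { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED };

/* Remote events are assumed to be delivered as messages, not snooped. */
enum mesi_event { EV_LOCAL_READ, EV_LOCAL_WRITE, EV_REMOTE_READ, EV_REMOTE_WRITE };

/*
 * Next state of the local copy. others_have_copy only matters when an
 * invalid copy is read and the data has to be fetched.
 */
static enum mesi_state mesi_next(enum mesi_state s, enum mesi_event ev,
				 bool others_have_copy)
{
	assert(s <= MESI_MODIFIED);

	switch (ev) {
	case EV_LOCAL_READ:
		if (s == MESI_INVALID)
			return others_have_copy ? MESI_SHARED : MESI_EXCLUSIVE;
		return s;
	case EV_LOCAL_WRITE:
		/* All other copies have to be invalidated before this. */
		return MESI_MODIFIED;
	case EV_REMOTE_READ:
		/* A modified copy has to be written back first. */
		return s == MESI_INVALID ? MESI_INVALID : MESI_SHARED;
	case EV_REMOTE_WRITE:
		return MESI_INVALID;
	}
	return s;
}
```

The function only computes the state; the side effects named in the comments (invalidating other copies, writing modified data back) are exactly the parts that need network messages in a distributed setup.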
I do not have fast internet at home anymore: I returned the SkyLink modem and will use crappy
GPRS until a good internet connection is set up, so this makes things a bit more complex too...
Right now I have less time for hacking, since quite a lot of spare time is being
eaten by other things, but I expect to be in shape soon.
So, if you do not see frequent updates, it is just a busy period, things will be ok.
Stay tuned...
/devel/fs :: Link / Comments ()
Wed, 09 Jan 2008
Cached metadata operations on clients and remote server.
Things are not that simple actually - there is no way to work with an offline
server with existing filesystems. Since every existing filesystem uses its own
inode generation methods, a client disconnected from the server can not create
new objects in its cache, since their inode numbers will not correspond to the ones
which would be created if the server were online. When a network filesystem is bound
to a single server filesystem only, it can use the same logic and then only resolve the
problems which appear when multiple clients created different objects with the same inode numbers
while the server was offline; with a single client there would be no such problem at all.
So, to correctly implement new object creation I've completed non-cached create/remove
methods for the objects.
Right now I'm waiting for the server setup to start testing new features (file writing support
and file/dir creation/removal). I hope it will be done today, so that I can share
problems and interesting results found during this stage.
/devel/fs :: Link / Comments ()
2008 Linux Storage and Filesystem Workshop.
I was invited to LSF workshop,
but it is quite hard for me to attend. Not even counting visa problems, travel and
other such small things.
As the kernel summit showed me, this is actually a very personal meeting, i.e. people
come there to meet other people they have something to talk about with. Mostly it is
all about personal contact, I think.
I do not have enough personal contacts in the community actually, so there would be
quite few people to talk to about different things; that, I believe, is the main
reason.
I think we will have very interesting talks by email, irc and the like first :)
/devel/fs :: Link / Comments ()
Tue, 08 Jan 2008
Write support in POHMELFS.
My network filesystem got file writing support, which is rather trivial
right now: the ->prepare_write()/->commit_write() callbacks
do nothing, but the ->writepage() method sends data to the server.
It uses a very simple request/reply protocol to report errors on the server
side, and does not include any cache coherency mechanisms yet. Since
only ->writepage() is used, data always stays in the client's cache
and is only sent to the server when the local system wants it (for example
when the system needs to flush some data to the storage or when it wants
more memory).
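The effect of doing all the work in ->writepage() can be shown with a tiny userspace model: writes only touch the cached page and mark it dirty, and the server sees the data when the page is flushed. The sketch below is not POHMELFS code; struct model_page, struct model_server, model_write and model_writepage are hypothetical, and a memcpy stands in for the network send.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u

/* One cached page on the client. */
struct model_page {
	unsigned char data[MODEL_PAGE_SIZE];
	bool dirty;
};

/* Stand-in for the remote server: keeps the last page it received. */
struct model_server {
	unsigned char data[MODEL_PAGE_SIZE];
	unsigned int pages_received;
};

/* A write only modifies the cached page; nothing is sent yet. */
static void model_write(struct model_page *pg, size_t off,
			const void *buf, size_t len)
{
	assert(off <= MODEL_PAGE_SIZE && len <= MODEL_PAGE_SIZE - off);
	memcpy(pg->data + off, buf, len);
	pg->dirty = true;
}

/* Called when the system decides to flush: send the page if it is dirty. */
static void model_writepage(struct model_page *pg, struct model_server *srv)
{
	if (!pg->dirty)
		return;

	memcpy(srv->data, pg->data, MODEL_PAGE_SIZE);	/* "network send" */
	srv->pages_received++;
	pg->dirty = false;
}
```

Two small writes into the same page followed by a single flush result in one transfer, and flushing a clean page sends nothing.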
Next step is to implement metadata operations - directory entry creation/modifications
(like file/directory create/remove/move, link/unlink and so on) and file metadata operations
(like attributes management and truncation).
After these tasks are completed (I expect them to be finished quite soon, it is not that
complex to operate on locally cached entries), the cache coherency protocol will enter the game.
So far it will be quite simple: each client will have a number of states associated with each
inode, so when one or another is changed, the server will be notified, and when another client
is about to access modified data, it will be synced to the server.
Another task is to test client scalability: when there are multiple users working on the same
client of the pohmel filesystem, how well does the network filesystem perform? Is locking too coarse?
It is right now - there is a single lock which guards each network operation, and it should not
be changed except by introducing multiple sockets, which is quite a bad decision imho, since
the network is supposed to be the bottleneck (or remote storage speed, but that can be changed
by switching to faster storage) in this scenario, so having too fine-grained locks for different
network operations does not change anything at all. The local cache, which contains inodes,
can be searched using three different tuples (I described them
previously), but there are
two locks: one lock for offset based searches (offset inside the address space of the inode, for example
reading directory content, where each directory entry in the stream is located by its offset in the given
stream), and another lock for more generic operations like searching for an inode by its number or by the hash
of its name in the parent direntry (including length and parent inode number). Although both of these
operations are supposed to be very fast (about O(log2(N)),
where N is the total number of inodes in the filesystem), practice can break that dream, since that speed
can be too low for very dense filesystems.
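A lookup key built from the name, its length and the parent inode number could look like the sketch below. The post does not say which hash function POHMELFS uses; 32-bit FNV-1a is used here purely as a stand-in, and fnv1a and name_key are hypothetical helper names. The resulting value depends on the byte order of the machine, which is fine for a local cache key.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One FNV-1a pass over a buffer, continuing from hash state h. */
static uint32_t fnv1a(uint32_t h, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	while (len--) {
		h ^= *p++;
		h *= 16777619u;		/* 32-bit FNV prime */
	}
	return h;
}

/* Key for "name of length len inside the directory parent_ino". */
static uint32_t name_key(uint64_t parent_ino, const char *name, uint32_t len)
{
	uint32_t h = 2166136261u;	/* 32-bit FNV offset basis */

	assert(name != NULL);
	h = fnv1a(h, &parent_ino, sizeof(parent_ino));
	h = fnv1a(h, &len, sizeof(len));
	return fnv1a(h, name, len);
}
```

Mixing the parent inode number into the key means that the same file name in two different directories maps to two different keys, so one tree can index the whole filesystem. Distinct names can still collide on a 32-bit key, so a real lookup would have to compare the names as well.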
The last one is the userspace server, which is quite simple so far and likely has its own bottlenecks. One of the crazy
ideas is to move it into the kernel, so that lookup of the inode (file or directory in userspace)
could be very fast. It will also reduce the number of unneeded copies (there is a number of them - I use
simple send()/recv() instead of mapping and generally there is at least one unneeded,
but unavoidable in userspace, copy from kernelspace to userspace).
Some work should be done on server redundancy - right now there is no failover recovery either on clients
(I do not know of any filesystem which supports that though, do not confuse that with NFS.
All operations with the local cache will succeed of course, but reading from the remote side
will stall), or on servers (if the server fails, clients can not
proceed with work, since there are no other servers which could catch the data and metadata flows.
It has to be fixed).
Anyway, there is a number of interesting tasks to complete, and I expect to have something to show quite soon...
/devel/fs :: Link / Comments ()