Wed, 07 May 2008
Fast transactions in POHMELFS.
POHMELFS just switched to faster transactions, allocated one by one with
even smaller overhead (although it does not use kernel_sendpage()
for page sending yet; it copies data).
The system still waits after each transaction instead of waiting once after all of them
have been sent, but with the new transaction allocation it is 1.5 times faster: 98 MB/s vs. 64 MB/s.
Note that without waiting for transaction completion it gets the full wire speed of 125 MB/s
with a 1500 byte MTU. And that is with highmem pages and thus a slow kmap()
of each one, and an unmap after completion. I do not use ->sendpage(),
since that would force me to split a proper set of iovecs into mixed
calls of kernel_sendmsg() and kernel_sendpage(),
which I want to avoid so far. Now it is (again) faster than NFS, but I want to move further.
So, the solution is rather trivial: wait until several transactions
are completed. The whole infrastructure is already there: in-flight transaction
storage, per-transaction completion and destruction callbacks, proper reference counting
and async completion.
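The idea of waiting once for several transactions can be sketched in userspace Python. This is only an
illustration, not the kernel code: the InFlight class and its method names are hypothetical, and a
thread stands in for the receiving side that gets completion messages from the server.

```python
import threading


class InFlight:
    """Tracks sent-but-unacknowledged transactions (hypothetical names)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = {}  # transaction id -> completion callback

    def add(self, trans_id, on_complete):
        # Register a transaction right before it is sent; do not wait here.
        with self._cond:
            self._pending[trans_id] = on_complete

    def complete(self, trans_id):
        # Called by the receiving side when the server's ack arrives.
        with self._cond:
            callback = self._pending.pop(trans_id)
            self._cond.notify_all()
        callback(trans_id)

    def wait_all(self, timeout=None):
        # Block once, after everything was sent, instead of once per transaction.
        with self._cond:
            return self._cond.wait_for(lambda: not self._pending, timeout)


completed = []
inflight = InFlight()
for trans_id in range(4):
    inflight.add(trans_id, completed.append)  # "send" without waiting

# A separate thread plays the role of the server acknowledging each transaction.
acker = threading.Thread(target=lambda: [inflight.complete(i) for i in range(4)])
acker.start()
all_done = inflight.wait_all(timeout=5)
acker.join()
```

The sender only blocks in wait_all(), so the cost of a network round trip is paid once per batch
rather than once per transaction.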
Still, only writing transactions are used (i.e. reading, lookup and others will not be
redirected to different servers).
There are some bugs of course, but that's the first development version after all.
/devel/fs :: Link / Comments (0)
Mon, 05 May 2008
POHMELFS transaction support. Failover (re)connection to different servers.
POHMELFS just got full transaction support. So far it is only used in the ->writepages()
callback, which is invoked by the writeback mechanism. POHMELFS uses lazy transaction support:
it waits after each transaction, which includes a header and the data to be written for at most
14 pages. 14 is a magic number of pages which corresponds to the struct pagevec size
used by generic writeback; transaction size is limited by a mount option and is 32 pages by default.
Performance dropped from 125 MB/s down to 64 MB/s, which is not acceptable.
The main problem is of course waiting for each transaction to be completed (i.e. for the completion
message from the server). There should not be per-transaction waiting; instead writeback has to
allocate as many transactions as needed, send them one after another, and only start waiting
for them when there are no more pages to be written. This is the next task.
The transaction mechanism allows quite simple reconnection to different master servers in case of failure,
and rollback of the failed transaction. For example, one can provide a number of main
servers (which have to be in sync with each other and be able to synchronize themselves,
or they can just use shared storage), so the POHMELFS client will switch between them if the current
one has failed. The system will detect the failure and reconnect; if the reconnect fails, the next
server will be used and the whole transaction will be resent there.
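The failover order described above (current server, one reconnect, then the next server, always with
the whole transaction) can be sketched like this. It is a simplified model, not the client code:
send_with_failover(), ServerDown and the server names are made up for the example, and the network
is replaced by a fake send function.

```python
class ServerDown(Exception):
    """Raised by the (fake) transport when a server does not answer."""


def send_with_failover(transaction, servers, send):
    """Try each server twice (initial send, then after a reconnect) before moving on."""
    for server in servers:
        for _attempt in ("send", "after-reconnect"):
            try:
                # The whole transaction is resent every time, never a part of it.
                return send(server, transaction)
            except ServerDown:
                continue
    raise ServerDown("no server accepted the transaction")


attempts = []


def fake_send(server, transaction):
    # Assumption for the demo: "master-a" is dead, "master-b" is alive.
    attempts.append(server)
    if server == "master-a":
        raise ServerDown(server)
    return server, len(transaction)


result = send_with_failover(
    [b"header", b"page0", b"page1"], ["master-a", "master-b"], fake_send
)
```

Here the first server is tried twice and then the complete transaction (header plus both pages)
lands on the second one.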
It is also possible to write a transaction to a different server on demand (it may or may not be connected
already, but it has to have an address structure, which so far is only obtained during pre-mount configuration),
which is a prerequisite for parallel data processing. One can create a simple patch to write transactions
one after another to servers in round-robin fashion.
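Such a round-robin distribution is tiny when written down. A sketch with hypothetical transaction
and server names (the real patch would of course work with the client's transaction and address
structures):

```python
from itertools import cycle


def round_robin_assign(transactions, servers):
    """Pair each transaction with the next server in turn."""
    return list(zip(transactions, cycle(servers)))


plan = round_robin_assign(["t0", "t1", "t2", "t3", "t4"], ["srv-a", "srv-b"])
```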
Right now only write transactions are used (and can be combined with object creation if needed); read ones
are pending, as are multiple parallel transactions (which are not complex, but the main task is how to wait
for them all to be completed; very similar code is used in pohmelfs_aio_read()).
There is also a pending task of cache coherency support (server-originated messages
to clients which used the same pages another client is writing into,
also including metadata coherency messages like uid/gid/inode size and other changes).
It is not that complex a task, and mostly requires server modifications.
Stay tuned!
/devel/fs :: Link / Comments (0)
Fri, 02 May 2008
Design of the POHMELFS transaction model.
It is heavily based on how netlink is implemented in the Linux kernel.
Besides the fact that it is likely the most ugly and complex protocol
among the communication models supported by the kernel, it is also the
most effective, extendible and feature-rich one.
This model is based on attributes, which are embedded into
the message. Each attribute has a header, which includes the size
of the attached data. So one can put an
effectively unlimited amount of data into any message (limited only by
the size field and practical assumptions of the communication), and it is possible
to create a message which contains any number of different attributes.
The main problem of netlink is its padding and alignment ugliness.
The protocol tries to get every bit out of the communication, so there is a huge
amount of very hairy things there.
I like to drink and (un)fortunately I get pretty bad quality drinks sometimes,
but I'm absolutely sure that when Alexey Kuznetsov designed the netlink attribute alignment
policies he had a really bad hangover after likely the worst crap he ever drank.
So, netlink attributes are very ugly, but you can extend them however you like.
The same applies to POHMELFS transactions.
You can put any new attribute into the transaction in a very trivial manner (I worked
with netlink a lot, even created the
kernel connector
to simplify the kernel development side, so I know that taste). Although transaction size is limited,
it is controlled only by a mount option (the default is 32 IO vectors, each one
of PAGE_SIZE (4k on x86), in one transaction).
Thus one can easily implement, for example, any protocol security labeling:
just add a new per-packet attribute.
So it is easily possible to extend the communication protocol infinitely with full backward
compatibility.
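To show why size-prefixed attributes give backward compatibility, here is a small userspace sketch.
It is not the POHMELFS wire format and not netlink either: the header layout (16-bit type, 32-bit
size, no padding) and the attribute numbers are assumptions made for the example.

```python
import struct

# Assumed header: attribute type (16 bit) and payload size (32 bit), network byte order.
ATTR_HEADER = struct.Struct("!HI")


def pack_attrs(attrs):
    """Serialize (type, payload) pairs into one message."""
    return b"".join(ATTR_HEADER.pack(t, len(data)) + data for t, data in attrs)


def unpack_attrs(message, known_types):
    """Parse a message, skipping attributes this (older) parser does not know."""
    offset, found, skipped = 0, {}, []
    while offset < len(message):
        attr_type, size = ATTR_HEADER.unpack_from(message, offset)
        offset += ATTR_HEADER.size
        payload = message[offset:offset + size]
        offset += size  # the size field is all that is needed to step over an attribute
        if attr_type in known_types:
            found[attr_type] = payload
        else:
            skipped.append(attr_type)
    return found, skipped


# Hypothetical attribute numbers; ATTR_SECLABEL plays the "new" attribute.
ATTR_PATH, ATTR_DATA, ATTR_SECLABEL = 1, 2, 3
message = pack_attrs(
    [(ATTR_PATH, b"/mnt/file"), (ATTR_SECLABEL, b"label"), (ATTR_DATA, b"x" * 10)]
)
found, skipped = unpack_attrs(message, known_types={ATTR_PATH, ATTR_DATA})
```

A parser that has never heard of the security label still reads the path and the data correctly,
because it can step over the unknown attribute using only its size.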
/devel/fs :: Link / Comments (0)
Tue, 29 Apr 2008
POHMELFS transactions and ACID.
POHMELFS just got initial transaction support and the ability to connect to multiple master servers.
Master servers are those which say where data is placed. Essentially
they are the same servers which may provide that data, but main server addresses are
provided during pre-mount configuration time, and data server addresses will be provided
by the main servers (if the main ones do not want to return data themselves) at run time.
Main servers can also be used to request data in parallel, or to switch between them
when the currently active one has failed.
So far that is the theory; practice is rather miserable: the POHMELFS client connects to
multiple servers, but works with only one. Errors are detected, and a switch to the next
server could happen, but it is not done, since there is a serious problem with this
approach: neither server nor client supports
ACID for data being written.
Here we come to the introduction of transactions: a transaction is multiple commands wrapped into
a single atomic operation. In case of an error during a transaction
write, the whole transaction will be resent to a different server (or to the same one after reconnect).
This is rather simple (although transactions are not supported by the server and the client
does not wrap any command into one yet), but it still does not solve the ACID problem.
Since POHMELFS has a writeback cache, its writes never go to the server directly; instead writeback
is scheduled by the system, and it starts writing pages to the server. The current POHMELFS implementation
uses only the ->writepage() method, which is invoked for each page.
It does not require the server to return an explicit acknowledgement that the page was written;
instead it relies on the underlying transport protocol (like TCP) to handle guaranteed delivery,
so data can be queued somewhere when the connection is dropped, and the POHMELFS client
does not know whether the data was really written or not. Having a per-page acknowledgement can fix
the ACID problem really trivially, but that may (or may not) end up with severe performance
degradation. As a better solution I consider my own ->writepages()
implementation, where each transaction will contain multiple pages to be written,
and thus a smaller number of explicit acks to be received from the server, and thus smaller performance
degradation. In case of failure the whole transaction has to be resent to a different server, of
course.
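The saving in acknowledgements is simple arithmetic. A sketch with made-up numbers: 1000 dirty pages
is an arbitrary example, and 32 pages per transaction is just one possible limit.

```python
def acks_needed(dirty_pages, pages_per_transaction):
    """Number of server acks when each transaction carries up to N pages."""
    return -(-dirty_pages // pages_per_transaction)  # ceiling division


per_page = acks_needed(1000, 1)          # one ack for every page
per_transaction = acks_needed(1000, 32)  # one ack for every 32-page transaction
```

With these numbers the client waits for 32 acks instead of 1000.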
The server does not support data mirroring to multiple root directories yet, so actually
not too much of the above description is implemented, but transactions and multiple
server connections exist, and soon the client will get support for reconnection and proper
transaction processing.
/devel/fs :: Link / Comments (0)
Sun, 27 Apr 2008
Detailed POHMELFS roadmap.
Transaction support will be added into the kernel client.
It is possible that it will be exported to userspace (thus
there will be synchronous write-through operations).
The kernel client will also get locking support (fcntl()
ones first, then more fine-grained ones); this is different from
byte-range
read/write locking, which will be done on the server. It is possible to export
that to the client too (it will actually be part of the POHMELFS locking API, which will
be used for fcntl() too).
The simplest case is data invalidation in the client's cache (i.e. if one client
issued a writeback for a given page, it has to be marked as not up-to-date on other
clients). Likely it will be done at the beginning of next week. So far it
will be the last cache coherency item. The task is really simple because of
the asynchronous processing of all data in the kernel client. The server will have
to store not only an index of directories to watch for object changes,
but also a per-object set of pages read by each client, so that the appropriate
users can be notified that a page is no longer up-to-date and has to
be refreshed.
The userspace server will get parallel and distributed facilities. Parallel processing
will be done first by allowing lookup and readdir callbacks to return information
about objects which contains the address of the server where the object is actually
located, so that server could read, write or check status there. So far the whole
file will be stored on one server, i.e. in the first implementation it will not
be possible to store half of the file on one server and the other half on a different
one. Then it can be extended.
The server will get the ability to store data in different root directories (so that the client
is not able to see shadow copies). There will be simple regexp policies for data storing,
for example '*.jpg' has to be stored in root1 and root2, '*.txt' only in root1 and so
on. Each root directory can be a local or a remote mounted one; userspace does not care
about this.
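The storage policies from the example above could look roughly like this. It is only a sketch: the
patterns in the text are shell-style globs, so fnmatchcase() is used here instead of real regular
expressions, and both the first-match rule and the default root for unmatched names are assumptions.

```python
from fnmatch import fnmatchcase

# Policies from the example in the text: pattern -> root directories to store into.
POLICIES = [("*.jpg", ["root1", "root2"]), ("*.txt", ["root1"])]


def roots_for(name, policies=POLICIES, default=("root1",)):
    """Return the root directories an object with this name is stored in."""
    for pattern, roots in policies:
        if fnmatchcase(name, pattern):
            return list(roots)  # first matching policy wins (an assumption)
    return list(default)        # fallback for unmatched names (an assumption)


jpg_roots = roots_for("photo.jpg")
txt_roots = roots_for("notes.txt")
other_roots = roots_for("archive.bin")
```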
The main part is already completed: I have a vision of what the system has to provide and what
it will look like, so with good design of the low-level mechanisms it becomes
a doable task for a predictable timeframe.
Stay tuned!
/devel/fs :: Link / Comments (0)
Fri, 25 Apr 2008
POHMELFS release.
Vodka and beer together are glad to provide a new POHMELFS release for you.
POHMELFS stands for Parallel Optimized Host Message Exchange Layered File System.
This is a high performance network filesystem with a local coherent cache of data and metadata.
Its main goal is distributed parallel processing of data. Network filesystem is a client transport.
The POHMELFS protocol was proven to be superior to NFS in lots of operations
(if not in all, then that is in the roadmap).
Basic POHMELFS features:
- Local coherent (notes 1 and 2) cache for data and metadata.
- Completely async processing of all events (hard links and symlinks are the only exceptions),
including object creation and data reading.
- Flexible object architecture optimized for network processing. Ability to create long paths to an object
and remove arbitrarily huge directories in a single network command.
- High performance is one of the main design goals.
- Very fast and scalable multithreaded userspace server. Being in userspace, it works with any underlying filesystem
and still is much faster than the async in-kernel NFS one.
Roadmap includes:
- Server extension to allow storing data on multiple devices (like creating mirroring), first by saving data in several
local directories (think about server, which mounted remote dirs over POHMELFS or NFS, and local dirs).
- Client/server extension to report lookup and readdir requests not only for local destination, but also to different
addresses, so that reading/writing could be done from different nodes in parallel.
- Strong authentication and possible data encryption in the network channel.
- Extend client to be able to switch between different servers (if one goes down,
client automatically reconnects to second and so on).
- Async writing of the data from receiving kernel thread into userspace pages via copy_to_user() (check development tracking
blog for results).
One can grab the sources from the archive
or check the homepage.
Enjoy!
P.S. Moved to listen blues and drink a beer.
/devel/fs :: Link / Comments (0)
Thu, 24 Apr 2008
Second POHMELFS release.
It is scheduled for tomorrow; today I have to prepare myself for it.
The whole idea and implementation started during fun New Year vacations,
so I have to repeat the process at least to some degree...
This release will not include direct writing to userspace from the async thread,
since this approach happened to be really non-trivial. What I
described
for the page fault handling works only for the first fault: when the page is populated into
the table, it can be referenced and written into, and things just work. The problem
happens when the same page is used for a second read (i.e. a new try from userspace;
for example, if the size of the written data is increased to more than two pages, 'cat'
will use the same two pages to read data). With the second write from the kernel there will be
a page fault again, although the page exists in the table, and the fault can not be handled
(at least its reason will not be removed, since it will happen again and again), since
the page table entry looks really good for the system, but not for the CPU.
I checked two cases: the usual copy_to_user() from the kernel on behalf of
the userspace thread which invoked a read syscall, and the same code, but with the copy performed
from a different thread. The page table entry (pte) looks very similar in both cases
(with regard to all flags at least), but the fault always happens for the second write into the same
page when the thread's mm context was changed to point to the original userspace one.
This does not change whether or not the userspace thread was scheduled away from its CPU.
The difference from get_user_pages() in this part is mainly the fact that the resulting page is locked
in the kernel (by increasing its reference counter at least), but I still want to produce the same
behaviour as a usual page fault during a copy on behalf of the userspace thread.
So, I am stuck with this problem, but since it is very interesting I will find a solution.
Meanwhile, this release will include following things:
- POHMELFS client. Full client-side caching. Async operations for all major events
(not including the
copy_to_user() hack described previously, just async
notifications and a copy on behalf of the original userspace thread).
Support for usual files and directories only; special files like
device files or pipes are not interesting at this point. They are quite simple to implement, but
so far there is no need for that. The client has support for object creation/removal
cache coherency messages.
- POHMELFS userspace server. Object creation/removal cache coherency message broadcasting will
be commented out; no locking.
Stay tuned!
/devel/fs :: Link / Comments (0)
Tue, 22 Apr 2008
Cache coherency in POHMELFS. Continue.
While moving home I thought a lot about cache coherency issues.
While we believe that NFS has a coherent cache, since it is somewhat
write-through, its cache actually is not synchronous, since a really long
time can pass between object creation and the moment when other clients see
the new object, for example when the client which creates the object has a
slow link... So object creation and removal should not be synced to other
clients during writeback on one of them; instead, clients which are interested
in the object perform a lookup, which may or may not return the object. This is not a
race or cache non-coherency, this is a usual multithreaded environment without
synchronization between clients.
What we really care about is data consistency on the server. When we have
a multipage write which overlaps with another write from a different client,
we should not read data back from the middle of a transaction. Locking the
whole file is not an option; instead proper byte-range (page-range actually)
locking has to be implemented. I already have a
prototype,
but have to check it in real life.
So, other competing projects may or may not follow my way and, based on my analysis, drop
creation/removal/stat coherency from their TODO lists (afaics, no one has implemented
that yet :) and concentrate on server read/write locking.
And I will start some bits of VM hacking: the plan is to implement a generic enough
(well, working on x86 for a start :)
mechanism to copy data to userspace via copy_to_user() from a different thread
(i.e. not the one which started the syscall), while the original one sleeps in the syscall.
Likely it will be somewhat similar to what
I did for the zero-copy userspace sniffer
and to how get_user_pages() works.
The result, which has to be as fast as the usual copy_to_user() (otherwise it is not
an interesting solution), will be used in the POHMELFS client and its async reading.
/devel/fs :: Link / Comments (6)
Mon, 21 Apr 2008
Cache coherency in POHMELFS.
Example:
Client 1 Client 2
# ls -a /mnt/
. ..
ls -a /mnt
. ..
echo qwe > /mnt/asdasd
sync
ls -a /mnt/
. .. asdasd
rm -f /mnt/asdasd
sync
ls -a /mnt/
. ..
dmesg | tail -n1
pohmelfs_remove_response: parent: 2, path: '//asdasd'.
ls -a /mnt
. .. asdasd
As you might have noticed, when one client creates an object and it is written back
to the server (during writeback), it is broadcast to all clients which read the same
directory before. This information is stored on the server in a binary tree, so it takes
(M-1)*O(log(N)) time, where M is the total number of clients and N is the number of directories
they read. This can be further optimized though.
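The bookkeeping behind that broadcast can be modelled in a few lines. This is a sketch, not the server
code: the DirWatchers class is hypothetical, and a plain dict of sets replaces the binary tree used
by the real server, so the lookup cost differs from the estimate above.

```python
class DirWatchers:
    """Remembers which clients read which directory (dict instead of a binary tree)."""

    def __init__(self):
        self._readers = {}  # directory path -> set of client ids

    def note_read(self, client, directory):
        # Called for every readdir or lookup a client performs.
        self._readers.setdefault(directory, set()).add(client)

    def targets(self, writer, directory):
        # Everyone who read the directory, except the client that made the change.
        return sorted(self._readers.get(directory, set()) - {writer})


watchers = DirWatchers()
watchers.note_read(1, "/mnt")
watchers.note_read(2, "/mnt")
watchers.note_read(3, "/other")
notify = watchers.targets(writer=1, directory="/mnt")
```

Client 3 never read /mnt, so only client 2 gets the creation message.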
Objects are not removed from clients when one of them removes it (and this is synced
to the server via writeback), since so far I can not call sys_unlink() directly
from the module, and I have not yet written code to deal with the dentry cache (that will be simple);
instead you can see in dmesg that the other clients received a command and just need to drop
the inode and dentry.
Also inode information is not broadcast yet (for example when file size increases
or access rights are changed), so new files always have zero size. This information should be
broadcast during writing, and since the server is heavily multithreaded, this should not
hurt performance.
There is a different opinion though: we do not need cache coherency at all, since the last writer
will overwrite data anyway, and when we open a new object, we first look it up on the server,
so if it was created there, it will be opened, but if it exists only in the cache of some other client, we
do not know about it anyway. We can broadcast the above messages during object creation on clients,
but this will effectively be a write-through cache, since we can create the object on the server at that time.
Anyway, I will proceed with either remove/stat messages, or with the ability to copy data to userspace
from a different thread. The latter looks like a very interesting hack.
/devel/fs :: Link / Comments (2)
Sun, 20 Apr 2008
Real Jedi does not use kernel.
He writes a new one or extends the existing one, but that is from a different series.
This one will tell you how one will be able to build a distributed
and then parallel filesystem using POHMELFS.
The headline says it all: the POHMELFS server will not be placed into the kernel
so far, since it is already very fast (compared to the in-kernel async NFS server),
userspace programming is a bit easier, and mostly because there is no
need to wait about 10 minutes while servers come up after an IPMI reboot.
They are located somewhere I do not know where and there is no possibility
to quickly reboot them by hand, so the servers have lots of things to do to bring themselves
up even if something was really screwed: network boot, add here SCSI probing,
possible fsck, initial BIOS memtest (8GB)...
So, planned POHMELFS server updates:
- PMCC - poor man's cache coherency protocol. Scheduled for the first half of next week, btw.
- server extension to allow storing data on multiple devices (like creating mirroring),
first by saving data in several local directories (think about server, which mounted remote
dirs over POHMELFS or NFS, and local dirs).
- client/server extension to report lookup and readdir requests not only for local destination,
but also to different addresses, so that reading/writing could be done from different nodes
in parallel.
Somewhere at the beginning there is also a task to extend the client to be able to
switch between different servers (if one goes down, the client automatically reconnects to the second
and so on).
And the most complex task is server parallelization, i.e. the ability to have multiple
servers which handle the same metadata work in parallel and be coherent. AFAIK, there
are no such (at least open) solutions: neither Lustre, nor PVFS2, nor Ceph,
nor glusterfs, nor whatever.
There are solutions with a master-slave setup (IIRC, Lustre works that way), and Ceph has the ability
to spread metadata between multiple servers, but they do not handle the same sets of objects,
so there is no metadata server redundancy.
So far I consider this the most complex part, and I have not yet come to a solution.
/devel/fs :: Link / Comments (0)
Fri, 18 Apr 2008
Poor man's cache coherency protocol design for POHMELFS.
As you might know,
POHMELFS is a network
filesystem with a client-side cache of data and metadata. Any place with a cache has to
provide a cache coherency algorithm to sync data with other users.
There are two common cases when caches become non-coherent:
- a client created/removed/modified an object which is not shared with other clients (i.e. this
object does not exist in their caches and no object with the same name was created on different
clients)
- an object being handled by one client exists in other caches
The poor man's solution for the above problems is quite easy: the client flushes its changes
to whatever objects it wants during local writeback, and these changes are then propagated to all
other clients which worked with the parent object (this information is stored on the server
each time a client reads a directory or performs a lookup). In the first non-coherent case above the client
will just receive a new object from the server, which will be easily imported into the existing tree
(because of the async nature of POHMELFS it is a trivial task, which right now works out of the box,
although only on the client). In the latter case there might be a problem if the local object was modified:
we can either replace its content with the new data, or (better) rename the local object to
something different (like the old name plus sync time), so that the user could merge the data manually.
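The "rename on conflict" variant can be sketched as follows. Everything here is hypothetical: the
cache is a plain dict, and the timestamp format of the new name is just one possible choice for
"old name plus sync time".

```python
import time


def conflict_name(name, sync_time):
    """Build the name for a locally modified copy: old name plus sync time (UTC)."""
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(sync_time))
    return f"{name}.{stamp}"


def resolve(local_cache, name, server_data, sync_time):
    """Import the server's object; keep a modified local copy under a new name."""
    entry = local_cache.get(name)
    if entry is not None and entry["modified"]:
        local_cache[conflict_name(name, sync_time)] = entry
    local_cache[name] = {"data": server_data, "modified": False}


cache = {"report.txt": {"data": b"local edit", "modified": True}}
resolve(cache, "report.txt", b"from server", sync_time=0)
names = sorted(cache)
```

After the sync both versions exist locally, so nothing is lost and the user can merge them by hand.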
So far there will be no locks; they will be implemented next.
/devel/fs :: Link / Comments (0)
POHMELFS AIO reading benchmark vs async NFS.
After I spent two days implementing real AIO for POHMELFS, the following things happened:
- Implemented 3 different AIO schemes, two of which could be zero-copy. Here is a brief description of them.
First, the POHMELFS ->aio_read() callback schedules a number of pages to be read from the server
(if a page is already up-to-date, it is copied to userspace, otherwise a network request is sent), then
it waits...
  - when async data is received from the remote side, the appropriate inode and pages are found, then the (physical)
  userspace page is locked in memory and data is either received into that page, or received into a VFS
  cache page and then copied into the userspace one. Then the userspace page is unlocked.
  - when async data is received (note that it is received completely asynchronously in a different thread) into
  a VFS cache page, the receiving thread copies data into userspace via copy_to_user(). Since the receiver
  thread has a completely different virtual memory layout, it can not simply copy data to the provided userspace address;
  first it has to set up page tables equal to the userspace thread's layout. In theory setting the CR3 register
  on x86 should be enough, but that's only theory. I was not able to fully complete this method, since eventually
  the thread crashed (obviously: the userspace thread could still be active on a different CPU, so installing the same
  CR3 register on different CPUs pointing to the same page tables leads to crappy things). This interesting hack
  can be finished though.
  - when async data is received, pages are marked as ready and placed into a list, so the userspace thread can copy
  them back via copy_to_user(). The simplest method. And it works great (graphs below).
- found a bug in 2.6.25-rc7 shmem when removing 1gb file from it:
Bad page state in process 'rm'
page:c49948c0 flags:0xf7d4a600 mapping:00000000 mapcount:0 count:0
Trying to fix it up, but a reboot is needed
Backtrace:
Pid: 9454, comm: rm Not tainted 2.6.25-rc7 #11
[] bad_page+0x52/0x7a
[] free_hot_cold_page+0x5e/0x15a
[] __pagevec_free+0x18/0x22
[] release_pages+0xfb/0x142
[] __pagevec_release+0x15/0x1d
[] truncate_inode_pages_range+0xea/0x29f
[] __link_path_walk+0xa7e/0xb28
[] truncate_inode_pages+0x9/0xc
[] shmem_delete_inode+0x26/0xac
[] shmem_delete_inode+0x0/0xac
[] generic_delete_inode+0x88/0xec
[] iput+0x60/0x62
[] do_unlinkat+0xb7/0xf9
[] do_page_fault+0x2b6/0x6c2
[] do_page_fault+0x31e/0x6c2
[] sys_ioctl+0x2c/0x43
[] sysenter_past_esp+0x5f/0x85
[] pci_scan_single_device+0x377/0x446
Did not try to investigate (this is my testing server, not tainted with POHMELFS code).
- Ran multiple tests...
Test details for the second round of POHMELFS vs NFS fight.
Hardware and software was already described in the first round,
I need to note, that server (2.6.25-rc7) has all debugging options turned off.
Tests performed: kernel tree reading
(find linux-2.6.24.4 -type f | xargs cat > /dev/null)
from disk over the net (XFS filesystem, cold server and client caches) and big file reading
from the tmpfs (to eliminate server disk latencies). Graph was added to the previous round results.

Note that async NFS and POHMELFS behave very similarly in operations which involve reading from the disk;
that is because of disk latencies (although the 10krpm SCSI disk used allows about 80 MB/s sequential read,
XFS behaves quite badly with lots of small files). The tmpfs comparison shows the advantages of the
POHMELFS network protocol.
Reading from a huge remote tmpfs file is about 2 times faster for POHMELFS because of its AIO implementation,
although that is not the main reason: the server was almost always capable of handling requests from the POHMELFS client
one by one using one thread, which saturated about 70% of the bandwidth (add here all debug options turned on on the client).
One of the main factors, I think, is readahead being turned off: sync readahead has zero advantage in an asynchronous
network filesystem, since while it waits for readahead to complete it could schedule new requests, whereas
the ->readpage() method used in readahead waits until the page is transferred, and only then
does the readahead code schedule a new request. One can implement ->readpages() though.
A kernel tree reading micro-benchmark was also performed: POHMELFS has a 2-times win because of its network protocol, which
batches server replies (via TCP_CORK only though; I think I need to implement a better directory reading command).
Another solution is to correctly implement the transactional model, which is the next task now.
/devel/fs :: Link / Comments (0)
Wed, 16 Apr 2008
Massively multithreaded POHMELFS server.
Because of the completely asynchronous nature of POHMELFS
it is possible to implement a multithreaded server, where not only are requests from
different clients processed in parallel, but async requests from the
same user are also handled simultaneously by a pool of threads.
Such multithreading requires introducing a transactional model of communication.
Take for example object creation followed by writing data: right now this race is handled
by sending a reply after creation, so the whole writeback sleeps waiting for it,
which drops performance (to NFS level). A transaction, on the contrary, will contain both operations,
which will be processed by the same thread without a race. It can also handle
other problematic places with multiple server threads.
So far the userspace server can run one or several processing threads per client,
but no transactions are implemented. I just started the
AIO
reading implementation, which should provide a great speedup for any reading
workload.
Stay tuned!
/devel/fs :: Link / Comments (0)
Mon, 14 Apr 2008
Initial network filesystem benchmark. POHMELFS vs NFS. Round 1.
Hardware (both client and server have the same hardware).
4-way (2 logical (HT) + 2 physical cpus) 3.00 Xeon (32 bits with PAE :), 8 GB of RAM,
Intel 82541GI gbit adapters, Seagate ST3300007LC 10k rpm scsi disk on
Adaptec AIC7902 PCI-X Ultra320 SCSI adapter.
Software.
Server: 2.6.25-rc7 kernel, in-kernel NFS server, userspace POHMELFS server.
Client: 2.6.25-rc8 kernel, in-kernel clients.
Both have all kernel debugging turned on.
Round 1. Huge directory (linux-2.6.24.4.tar archive) untarring over the network.
Picture shows it all.

Notice that there is no test for POHMELFS reading (that is why it is only the first round),
since it is miserable. And I know the reason: I'm lazy, so I use the generic reading function
(generic_file_aio_read()), but actually Linux does not have AIO reading from usual files,
so it is very synchronous and requires reading data page by page, so we have a pretty
broken system with regard to network performance.
Since reading is not async, I will reimplement generic_file_aio_read() as
pohmelfs_aio_read(), which will be a real AIO reading function. That will be the second round,
which POHMELFS will win.
But it can not win the game. Because things are changing. Today I learned that
if a filesystem has only 20 users over the world, then it
should not be
merged, since the burden
of changing something generic in VFS (and thus propagating it to filesystems)
is too high.
What has happened? Have Linux kernel maintainers started to be afraid of changes?
Afraid of more code? Afraid of something new they do not want?..
Eh, and they say they want more developers... They want monkeys who will do only what
they were asked to do.
POHMELFS will be sent for review of course, but it is highly unlikely
I will push it upstream.
/devel/fs :: Link / Comments (6)
Fri, 11 Apr 2008
Unhashed inodes can not be synced during writeback. Debunked.
The problem happened to be quite simple: writeback happens for
inodes in the sb->s_io superblock list. They are placed
there from the sb->s_dirty list, which contains dirty inodes.
Dirty inodes can be placed into that list via mark_inode_dirty(),
which checks whether the inode is hashed; if it is not, it will not be placed into the dirty list.
Hashed has a synonym in comments: valid...
There is the sb->s_op->dirty_inode() superblock operation callback, which is invoked
first, so one can still implement one's own inode cache, not use inode hash tables, not
hash inodes, and still put inodes into the dirty list and thus be able to run writeback on them.
/devel/fs :: Link / Comments (0)
Thu, 10 Apr 2008
Busy inodes after unmount.
VFS: Busy inodes after unmount of pohmel. Self-destruct in 5 seconds. Have a nice day...
After removing the private cache of inodes I found that objects which were
sent by the server and which were never attached to a directory entry (dentry)
will never be freed.
So, essentially this does not work with Linux VFS:
iget()/iget_locked()
...
umount
Inodes, created by iget()/iget_locked() will be placed into at least three
different lists:
- inode_in_use - global list of ever created inodes, which have i_count and i_nlink more than 0
- s_inodes - per superblock list, which contains every inode, created for this superblock
- inode_hashtable - hash table indexed by inode number. If you want to work with writeback, your inodes have to be there. Did not yet investigate why.
So, essentially all inodes which you created are accessible by VFS and will be checked
during umount via generic_shutdown_super()->invalidate_inodes(),
where the system will notice that if an inode in the s_inodes list has a non-zero reference
counter (of course, otherwise it would already have been freed by the filesystem), then this inode
can not be freed. Thus we have a leak.
The above lists can only be accessed under the global inode lock, so it is not a good idea to destroy inodes
by traversing them in, for example, the ->put_super() callback or in any other filesystem callback,
so I had to add a list of all inodes into the POHMELFS superblock. Ugly.
/devel/fs :: Link / Comments (0)
POHMELFS development status.
It has developed very rapidly last couple of days,
so essentially I rewrote it. I think it is ready for the next
release, which I will announce in a day or so.
Right now all first-milestone features except cache-coherency (check below),
which I planned, are completed (although maybe not in the most
optimal way sometimes).
Because of name cache usage it is now possible to create huge paths
with multiple directories via a single command. The same applies to directory
removal,
although that is because of a different design issue.
It would be possible to rewrite generic read/write helpers and provide
set of pages into POHMELFS network stack (which is page
based for data now), but I decided that for the first
step it is not needed.
POHMELFS now has fully async processing of all operations except link creation
(I just decided that it is a bit simpler to make them write-through,
it was done because of laziness and not some fundamental arch problems).
It was achieved by serious (read: from scratch) changes in the arch,
which had its own problematic places, namely error reporting. Because of this
move it becomes really simple to implement any kind of protocol, if it obeys
async rules, namely sending of a message never requires a sync reply,
and where one is needed, the reply comes as an independent incoming message,
which is processed asynchronously from waiting and via a common state machine.
Such an arch allows a simple cache coherency algorithm, where the server just sends
missed entries or commands to remove some objects, and the client's core handles that just
fine since its receiving code does not depend on the sending one. This is not
a 100% correct way to handle collisions (collisions thus become new objects
in the filesystem tree, like old name plus some suffix) and not real cache-coherency,
but it is what lots of the users need.
Writeback cache does not play very well with cache-coherency, since every metadata
change (like object creation or removal)
has to be checked against server state, since different clients can do the same with
the same object. The level of paranoia has to be thought of in advance.
The first cache-coherency step is implementation of the trivial scheme, where
every object is synced during its writeback time and changes are broadcast by the server
to other clients. If another client has the same object being processed,
it can either be renamed to a collision name or just overwritten. Having locks
and thus real states is the next step.
Also, POHMELFS does not have authentication and strong checksums right now,
and although this is a simple task to implement, its priority is questionable.
There is also possibility to implement cryptographically strong encryption of the
communication channels.
So, lots of ideas, but main part is ready - async data processing design was
definitely a right choice to implement, so all other features become very simple
to complete.
New release will be announced very soon, stay tuned!
/devel/fs :: Link / Comments (0)
Sun, 06 Apr 2008
There is only one way: asynchronous.
This is a new motto for POHMELFS.
It is a completely new filesystem now.
POHMELFS got new page processing code (sending side: commands and data), new lookup,
which is based on the Linux VFS inode cache without reinventing the wheel (comment
says it is very smp-friendly, although I do not quite understand how
it is possible with global inode_lock), it also got
completely new object creation and referencing path. It is possible
to create a huge path (up to 4k, but can be easily extended if there will be such demand)
with multiple objects in it with only single network command.
But the main feature of new POHMELFS is its name cache. I did not find
how to hook into the VFS dentry cache, so I invented my own. It is fast
to traverse from a child to the highest level parent, which is actively
used in the POHMELFS writeback path. Although it is not 100% the best
storage, just a simple RB-tree (and thus requires an smp-unfriendly mutex), the whole
idea shows its gains already. Eventually it will be replaced with a
faster and more scalable approach protected by RCU (even a properly sized hash
table will show better scalability, although dynamic resizing of hash tables
prevents RCU usage), but I started from the simplest ground.
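The child-to-parent traversal mentioned above can be sketched as follows. This is a minimal Python illustration, not the POHMELFS code: NameEntry and full_path() are hypothetical names, and the RB-tree that stores the entries (and its mutex) is left out.

```python
class NameEntry:
    """One name-cache entry: an object name plus a pointer to its parent entry."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def full_path(entry):
    """Walk from the entry up to the root and build the '/'-separated path."""
    parts = []
    while entry is not None and entry.parent is not None:
        parts.append(entry.name)
        entry = entry.parent
    return "/" + "/".join(reversed(parts))


root = NameEntry("")
a = NameEntry("a", root)
b = NameEntry("b", a)
f = NameEntry("file", b)
print(full_path(f))
```

During writeback such a walk would give the full path of the object being written, which is what a single create command for the whole path needs.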
POHMELFS already outperforms async NFS during untarring and completely saturates
my testing Xen domains (both network and disk speed), while NFS is almost two
times slower. Testing machines have 256 MB of RAM, maximum 3 MB/s interconnect speed
(something is likely broken in the Xen setup, since it is supposed to be 100 Mbit/s
and there is no high load), which is very unfriendly (read: in such a scenario POHMELFS
will show its worst results) for POHMELFS, but nevertheless it is fast.
It became not only much faster, but also simpler. Its userspace server has
half as many lines of code (816 vs. 1613), kernel side is smaller and simpler too:
mainly there are no zillions of different trees indexed by any possible keys,
so far only per-inode tree of child names for readdir and per-superblock path
entry cache.
There are drawbacks of course: there is no receiving code (at all). It will be a dedicated
thread, which will asynchronously process all incoming packets (mostly
readdir async return, read page content and cache-coherency messages). First
two are really simple. The last one will be implemented as a full MOSI/MSI
library for inode content. Likely it will be possible to use in my
other projects.
P.S. I frequently think that I'm very good vapourware seller :)
Stay tuned!
/devel/fs :: Link / Comments (0)
Wed, 02 Apr 2008
Unhashed inodes can not be synced during writeback.
So essentially there is no way to implement one's own inode
cache tied to the system's writeback mechanism, which is bad
news. POHMELFS in its current reincarnation does not use the
system's inode cache and all its inodes are unhashed, which
results in the fact that they are never synced, since the writeback
mechanism just does not see them.
So I will fall back to hashed inodes, which will be used just for that,
and writeback for a single inode will end up creating the directory structure
for all the upper layer objects.
Another idea is to implement my own writeback, which would be scheduled from the
main one or after memory notifications; this approach has lots of
advantages actually, but let's first complete the simpler part with hashed inodes.
This is called learning curve - I'm essentially where I was before,
but with extended baggage of knowledge.
/devel/fs :: Link / Comments (0)
Sun, 30 Mar 2008
To SSD or not to SSD.
A couple of days ago I talked with a person who ordered 4 high-end 128G SSD disks
to create a RAID for testing purposes; seek time for those devices is 0.1ms.
Each one costs about $4k. His main workload is databases, i.e. random reads and writes,
so we calculated that theoretically it has to be about 14 times faster than
high-end SCSI disks with 3.5 ms seek latency and about 100Mb/s sequential access speed
in the given
workload of processing random data in 8-16kb chunks (usual 'page' in sql servers).
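One way to arrive at a figure close to 14 is a simple per-request model: seek time plus the time to transfer one chunk. The sketch below is a reconstruction under assumptions, not a record of the actual calculation: it reads 100Mb/s as 100 megabytes per second, assumes the SSD transfers at the same rate as the disk, and takes 1 kb as 1000 bytes.

```python
def request_time_ms(seek_ms, chunk_bytes, bytes_per_sec=100e6):
    """Service time of one random request: one seek plus one chunk transfer."""
    return seek_ms + chunk_bytes / bytes_per_sec * 1000.0


def speedup(chunk_bytes, disk_seek_ms=3.5, ssd_seek_ms=0.1):
    """How many times faster the SSD serves one random request."""
    return (request_time_ms(disk_seek_ms, chunk_bytes)
            / request_time_ms(ssd_seek_ms, chunk_bytes))


for kb in (8, 16):
    print(f"{kb} kb chunks: SSD is about {speedup(kb * 1000):.1f} times faster")
```

Under these assumptions the 16kb case lands near the quoted 14, while smaller chunks give a larger ratio because the seek dominates even more.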
Besides the fact that putting 14 disks into a mirror will
be as fast as a single SSD disk (theoretically), it will be 14 times more reliable
and likely have a smaller price,
main workload is to replace RAM with SSD, not disks with SSD.
My prognosis is that SSD will be at most 2-3 times faster (if it will be faster
at all, since its theoretical performance advantages can be killed by the FS)
than a SCSI disk for the
given workload, and as is, it is not a breakthrough technology.
If I'm wrong (it will be tested likely next week with
sysbench read-write benchmark),
I will buy a good bottle of whiskey for us, otherwise...
/devel/fs :: Link / Comments (5)
Thu, 27 Mar 2008
Filesystem as a database or database in filesystem.
I actually do not understand what prevents filesystem writers to implement
trivial interface and access library for metadata manipulations,
which would allow not only path lookup,
but also lookup for various keys, for example stored in extended attributes.
Yes, it requires filesystem changes, but I can not believe it is impossible
or even too complex.
Need to think...
/devel/fs :: Link / Comments (2)
Wed, 26 Mar 2008
Added maildir benchmark results.
The simulation works on each filesystem in the following stages:
- The empty filesystem is created and mounted.
- The directory structure is created, with no files.
- A single delivery simulator and retrieval simulator are run
simultaneously. The script waits for each of the simulators to finish,
and then runs the sync command before proceeding to the next
step.
- The above step is repeated with 2, 4, 8, and then 16 delivery simulators.
Delivery Simulator.
The delivery simulator does actual maildir deliveries to the given directory:
- It writes a file with a unique file name to the tmp subdirectory.
- It fsyncs the newly written file.
- It renames the file into the new subdirectory.
- It fsyncs the new subdirectory (to ensure that
directory is actually on disk, as most Linux filesystems don't
automatically perform this action during the rename).
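The four delivery steps above can be sketched in a few lines of Python. This is not the benchmark's perl script, only an illustration: deliver(), make_maildir() and the uuid-based file name are hypothetical choices, and fsync on a directory descriptor is assumed to work the way it does on Linux.

```python
import os
import tempfile
import uuid


def make_maildir(root):
    """Create the tmp/new/cur layout of a maildir."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    return root


def deliver(maildir, message):
    """Deliver one message (bytes) following the four steps listed above."""
    name = uuid.uuid4().hex  # hypothetical unique-name scheme
    tmp_path = os.path.join(maildir, "tmp", name)
    new_path = os.path.join(maildir, "new", name)

    # Steps 1 and 2: write the file into tmp/ and fsync it.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(message)
        f.flush()
        os.fsync(f.fileno())

    # Step 3: rename the file into new/.
    os.rename(tmp_path, new_path)

    # Step 4: fsync the new/ directory so the rename itself reaches the disk.
    dir_fd = os.open(os.path.join(maildir, "new"), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return new_path


if __name__ == "__main__":
    box = make_maildir(tempfile.mkdtemp())
    print(deliver(box, b"Subject: test\n\nhello\n"))
```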
More details on the original page.
Briefly, it is a multithreaded maildir simulation.
And the results
are quite different compared to, for example, postmark: very good results from xfs, jfs and reiserfs.
There are no ext2 and btrfs results, since perl's fsync says that the
file descriptor opened there is invalid:
Invalid argument at /root/fs_bench/maildir_fsbench/fsbench/fake-deliver line 38.
An interested reader can check the sources and show me the problem, but ext2 worked pretty fine with
the 2.6.20 kernel and the then-current glibc/perl/whatever in Debian.
Anyway, results can be found at the contest
homepage.
Now all testing is over.
Main conclusion: things got worse compared to 2.6.20 and there was no major breakthrough in filesystem development, at least
from a performance point of view.
/devel/fs :: Link / Comments (0)
Additional XFS test with slightly different mount/mkfs options.
mkfs: -d agcount=75 -l size=64m
mount: logbufs=8,nobarrier,noatime,nodiratime,osyncisdsync
Postmark results:


Results are slightly better than
previous
xfs run, although barriers are turned off, which I blame to be the main reason. Other
filesystems did not turn off directory atime also.
Anyway, even with these results XFS is still much worse than any other FS (except reiserfs)
for this workload.
/devel/fs :: Link / Comments (0)
Tue, 25 Mar 2008
Filesystem contest results.
An interested reader can check out the results
of the ext2/3/4, reiserfs, reiser4, jfs, xfs and btrfs fight for the first prizes
in dbench,
iozone,
postmark,
maildir performance bench
and simple file creation micro-benchmark.
It does not contain the maildir benchmark yet, I will add it tomorrow or later today:
the xfs run has not completed yet and there are no graphs.
As a conclusion: nothing major changed since
previous contest,
new btrfs filesystem behaves not that bad in some cases,
but quite slow in others... Nothing changed.
Does it mean, that we need something new?
/devel/fs :: Link / Comments (3)
POHMELFS status.
I've started mostly from scratch. I think it is a good sign
when a project can be rewritten without any pain to implement
really interesting ideas instead of having multiple crutches all over the
place. This also means that it is not that complex, so I do not regret
the dropped code.
Now it is in a very testing stage without network protocol at all,
but I test new paradigm in the pohmelfs: its inodes will not be hashed
into global hash table, but instead will be placed into local
trie-like structure, which (optionally) will allow RCU-fied lookup.
Something similar to data structure created for
multidimensional trie
used for unified socket lookup patch.
I really like the
two-hash
approach, but since there is no proof (yet) that it will work for all possible cases,
I will first implement a radix-like tree to store object names. The network
protocol will also operate on full-length paths, which actually can be
a bad idea, I will see.
Another uber cool feature of the full-path approach
is ability to create number of directories, which form a path to given
object, in a single command, i.e. when client sends a network command
to create object /a/b/c/d/file, there is no need to send
separate commands to create /a, /a/b and so on,
it can be done automatically by server. This requires to send not only
path though, but also information about permissions for each subdir.
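A server-side handler for such a command could look roughly like the sketch below. It is only an illustration in Python: create_with_parents() and its parameters are hypothetical, the command is assumed to carry one permission value per directory of the path, and sanity checks (for example rejecting '..' elements) are left out.

```python
import os
import stat
import tempfile


def create_with_parents(root, path, dir_modes, file_mode=0o644):
    """Create the object at 'path' under 'root', making missing parent
    directories on the way. dir_modes holds one mode per directory element."""
    parts = [p for p in path.split("/") if p]
    dirs, name = parts[:-1], parts[-1]
    if len(dir_modes) != len(dirs):
        raise ValueError("need one mode per directory element of the path")

    current = root
    for element, mode in zip(dirs, dir_modes):
        current = os.path.join(current, element)
        if not os.path.isdir(current):
            os.mkdir(current)
            os.chmod(current, mode)  # chmod, so the umask does not alter the mode

    target = os.path.join(current, name)
    os.close(os.open(target, os.O_WRONLY | os.O_CREAT, file_mode))
    return target


if __name__ == "__main__":
    export = tempfile.mkdtemp()
    made = create_with_parents(export, "/a/b/c/d/file", [0o755, 0o755, 0o750, 0o700])
    print(made, oct(stat.S_IMODE(os.stat(os.path.dirname(made)).st_mode)))
```

One command thus replaces five, at the cost of sending the list of modes along with the path.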
/devel/fs :: Link / Comments (2)
Second filesystem contest is over.
Although I plan to run additional couple of tests for
btrfs,
namely all tests for nodatacow option and without ssd option,
which will likely take part of the day. But all others were
already completed, so expect nice graphs tomorrow.
There were a number of surprises during that testing. For example
reiser4 constantly freezes the test box in dbench workload
with 150-200 threads. There are no messages in dmesg, but nothing
is turned on in the kernel hacking section of the config. Both
btrfs and reiser4 are very slow creating and writing into
lots of small (4k) files. Reiser4 is two times faster than btrfs,
the latter creates/writes/syncs/closes about 10 files per second
average when 10k-30k files are created one-by-one.
Ext4 is also slower than any other (except above two) filesystem
in this microbenchmark.
Something strange happened between the 2.6.20 and 2.6.24 kernels: the above file
creation microbenchmark produced much worse results for all
filesystems (an order of magnitude in some cases) compared to the previous
contest.
Maybe sync code was implemented correctly, I do not know...
I will likely drop maildir
benchmark results, since perl script which works there constantly tells me,
that fsync() has invalid parameter...
So, wait about 12 hours (I have to get some sleep: do not mix
absinthe with different red wines and beer; when I did that last
night it was quite tasty, but not so this morning)
/devel/fs :: Link / Comments (0)
Mon, 24 Mar 2008
BTRFS got subvolumes support.
Subvolumes
are block devices on top of which btrfs
can be created. This is first known filesystem in Linux which can be built on
top of multiple block devices. Chris Mason renamed his unstable branch to
really-really unstable because of that. It is possible to put devices into
mirror or striping mode, although it is far from being clear from short
mail description.
Although support for mirror and striping in filesystem is questionable feature,
ability to create filesystem on top of multiple block devices with per-device
allocation policies is a huge step in Linux filesystem development.
/devel/fs :: Link / Comments (2)
Thu, 20 Mar 2008
Second filesystem contest has been started.
So far I removed maildir
test and file creation benchmark, the former requires manual start in my
scripts, the latter requires some filesystems to be removed from the run,
namely Reiser4 and BTRFS, both are very slow creating and writing into lots of small
(4k) files. XFS is probably also a candidate, although with optimizations, described below,
it behaves much better than with default options and 2.6.21 tree.
So, we have dbench,
iozone and
postmark queued...
Testing is being performed with 2.6.24.3 tree, Reiser4 was ported from the latest
breakout of -mm tree (requires lots of manual patching to be started on recent kernels).
BTRFS was taken from the unstable
branch, since it is the same as 0.13 AFAICS. All other filesystems were taken from the
vanilla tree.
There are following optimisations for the filesystems:
- XFS: mkfs: -d agcount=1 -l size=128m,version=2, mount: noatime,logbsize=256k, as suggested by Dave Chinner
- EXT4: mkfs: none, mount: data=writeback,noatime,extents
- EXT3: mkfs: none, mount: data=writeback,noatime
- EXT2: mkfs: none, mount: noatime
- JFS: mkfs: none, mount: noatime
- REISER4: mkfs: none, mount: noatime
- REISER3 aka REISERFS: mkfs: --format 3.6, mount: noatime
- BTRFS: mkfs: -l 4k -n 4k, mount: noatime,nodatasum, for postmark also added ssd option, as suggested by Chris Mason
First results are expected to be ready tomorrow evening or even (past)weekend... Although all runs
are being performed automatically, generation of the nice graphs
requires a manual start. Then I will proceed with the
maildir
test and file creation benchmark.
/devel/fs :: Link / Comments (0)
Fri, 14 Mar 2008
Why binary trees are bad. New cache structure for pohmelfs.
I already found experimentally that write-through cache scales very badly,
even noticeably worse than no cache at all for some workloads, so an ideal
solution
does not involve any kind of write-through operations, notably no synchronous commands
which require an immediate response.
This means that inode numbers will differ on client and server, so there should be
some kind of tracked dependency between them so that operations on different machines
can be done in sync. The initial thought was to use a binary tree to store pointers to the appropriate
inodes, which would be indexed on server and clients by a combination of hashes of the inode (direntry)
and its parent data. Even embedded systems can easily have millions of inodes, so the choice
seemed correct at first glance. Now I think differently, since there
is a serious problem with indexing of such a tree.
Since the only information common to both client and server is the object name, it should be used
as a key, maybe not the name directly, but its hash, that does not matter at this point. Here comes a
problem with the binary tree choice: in a binary tree there is no connection between the real parent in the
filesystem and the parent in the binary tree, so there will be serious problems when we put two
different objects with the same name into the binary tree - there will be a conflict. To solve
this problem we should use some information about where this object is placed, i.e. information
about its parent directory. Using the parent name hash as a part of the key in the binary tree
does not solve the problem either, since there might exist multiple directories with the same name and
the same object in them. We could solve the problem by putting into the key a hash of the object's name
and a hash of the parent's key (which in turn is a hash of the name and a hash of its parent's key), so this
recursive hashing would end up at the highest level (i.e. root directory). This works, but there
might be a scalability problem with the following issues:
- server has to either cache opened directories or reopen them one-by-one when accessing an object
- when an object is moved/renamed, all keys of its children and parent have to be changed
which is unacceptable. So a new solution was thought of.
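The recursive key scheme and its rename problem can be shown with a small sketch. It is an illustration only: SHA-1 and the way name and parent key are concatenated are arbitrary assumptions, and key() and path_key() are hypothetical helpers.

```python
import hashlib

ROOT_KEY = b""  # key of the root directory, where the recursion ends


def key(name, parent_key):
    """Key of an object: hash of its name combined with the key of its parent."""
    return hashlib.sha1(parent_key + b"/" + name.encode()).digest()


def path_key(path):
    """Apply the recursive hashing from the root down to the object."""
    k = ROOT_KEY
    for name in path.strip("/").split("/"):
        k = key(name, k)
    return k


# Same name in different directories: no conflict any more.
print(path_key("/a/file") != path_key("/b/file"))

# Renaming /a to /z changes the key of everything below it.
print(path_key("/a/x/file") != path_key("/z/x/file"))
```

The second check is exactly the scalability issue from the list: one rename at the top invalidates the keys of the whole subtree.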
So far I have two ideas:
- kind of radix tree
- multi-layer hash tables indexed by double name hash
While the former is kind of obvious, the latter is a quite interesting but very simple idea. Consider
that each directory has a hash table of its children, indexed by a double hash of the child's name.
We need a double hash to remove the possibility of collision (I can not prove mathematically (maybe only yet)
that there are two hashes which will not allow simultaneous collision in both, but feel quite strongly
that such hash pairs exist) and to use them in network commands. Commands can be optimised either
to use the full path if it is short enough (just send a path string during writeback or readpage as a
path to where data belongs) or to use an array of hashes of the path elements instead of '/' separated
names. Hash tables actually have to be changed to a different data structure capable of hosting not only
small hash values, but full 32 or 64 bit hashes. It can be a binary tree or a judy array, something similar
to what was used in
unified socket storage. The former looks a bit excessive.
Using such an approach it is possible to look up an object in O(k) operations, where 'k' is the number of directories
in a path; very usually it is smaller than 10, which for a binary tree corresponds to at most 1024 inodes,
which is too small for a real system.
This approach (especially when the full path is being sent) allows eliminating the scalability problems mentioned above.
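A minimal sketch of the per-directory tables, assuming CRC32 and a truncated SHA-256 as the two hashes (an arbitrary pair picked for illustration, not one proven to be collision-free together). A Python dict keyed by the hash pair stands in for the binary tree or judy array; Dir, lookup() and the other names are hypothetical.

```python
import hashlib
import zlib


def double_hash(name):
    """Two independent 32-bit hashes of one name."""
    data = name.encode()
    h1 = zlib.crc32(data)
    h2 = int.from_bytes(hashlib.sha256(data).digest()[:4], "big")
    return (h1, h2)


class Dir:
    def __init__(self):
        self.children = {}  # (h1, h2) -> Dir or file payload

    def add(self, name, obj):
        self.children[double_hash(name)] = obj
        return obj


def path_to_hashes(path):
    """What a network command could carry instead of '/' separated names."""
    return [double_hash(name) for name in path.strip("/").split("/")]


def lookup_by_hashes(root, hashes):
    """One table access per path element: O(k) for k elements."""
    node = root
    for pair in hashes:
        node = node.children[pair]
    return node


def lookup(root, path):
    return lookup_by_hashes(root, path_to_hashes(path))


root = Dir()
b = root.add("a", Dir()).add("b", Dir())
b.add("file", "payload of /a/b/file")
root.add("c", Dir()).add("file", "payload of /c/file")
print(lookup(root, "/a/b/file"))
```

Renaming a directory here only changes one entry in its parent's table, the children keep their keys.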
Implementation start is scheduled for today, but I have to think about details first.
/devel/fs :: Link / Comments (4)
Wed, 12 Mar 2008
(Cache) Coherent Remote File System sources are available now.
Zach Brown has
announced
that the CRFS source code is now open.
CRFS is a network filesystem which works
with remote BTRFS volume and supports
cache on clients.
Here is a brief set of features CRFS supports:
- the user space server exports a private BTRFS volume
- the network protocol operates on ranges of BTRFS disk items
- the kernel client provides posix semantics by operating on items
- the server can grant and revoke client caches of data and metadata
The CRFS protocol is closely tied to how BTRFS is organized. For example there is natural
batching of some commands like the recursive delete commands: since btrfs keys are
placed one-by-one, there is no need for an additional command to be sent, instead
the first one can be extended to cover a wider key range.
As you might notice, pohmelfs
was started as a competitor to the crfs project, because the latter is interesting and was closed. Right
now pohmelfs has a set of very interesting features crfs does not and likely will not support (like offline
working, different server filesystem support), also its todo list has plenty of very interesting
stuff, so it will not be closed. Instead I plan to continue the competition (which is a bit
complex for me, since it is the first filesystem I write and essentially I did not know what an inode was
before) and fully complete pohmelfs. Although I subscribed to crfs-devel :)
My new shiny servers will be installed today, so tomorrow I will start (re)implementation of
the ground ideas of
pohmelfs.
Stay tuned!
/devel/fs :: Link / Comments (3)
Thu, 06 Mar 2008
POHMELFS: was done just wrong!
So, the last several days, devoted mostly to thinking about things and some
experiments with them, led me to the headline conclusion: pohmelfs was done
just wrong!
Its network ping-pong protocol is wrong, its inode resync logic and overall
need for inode number change is wrong, its writeback logic is wrong (btw, why
Linux VFS calls writeback for inode after it calls writeback for inode's pages?
This leads to the inode number resync code duplication and fair number of problems),
its userspace server cache is wrong (well, its userspace server is a braindamage,
but that does not prevent it from being wrong too), and the most important: it becomes complex,
so I frequently have to read my own code multiple times to understand what I meant here or
there.
That just has to be changed (mostly just removed)!
Thinking about all that crap led me to a more philosophical conclusion: any network
protocol which requires a precise acknowledgement for each packet is broken. Period.
TCP is not broken, since it can send acks for multiple packets. TCP can aggregate on both
sides of the connection (which can lead to the huge
performance increase
as was observed in userspace network
stack over netchannels),
so it is a stream, not a ping-pong, although its policy for ack generation is not always the best decision.
Out of curiosity, why original ping and traceroute commands were not implemented as TCP applications
which would catch ack/rst packets?
So, anything ping-pong like is just broken. Never ever use that logic at all, since it breaks performance
and ability to extend. More to the game, it breaks ability to create real duplex communication,
since while you expect an ack you can get data from the other peer for different command.
So, the brilliant idea (yes, I sometimes get them from the deep abyss of the mindless) is to convert the POHMELFS
protocol into two real streams: from client to server and a completely independent stream from server to client.
It has zillions of benefits, but let's see how it is going to be implemented and what will be fully broken in the filesystem.
First, there will not be resync logic. At all. Each inode (and its number) on the client will not correspond
to any inode object on the server, so local inodes will never be synced with the server ones. Instead the cache of objects
on the server side will be indexed by special keys containing name, length and other parameters needed for unique number generation.
The client inode number will never be sent to the server, so object creation will have only a single direction: just send a packet.
If there is an unrecoverable error, the connection can be broken, so subsequent command sending would reconnect or make some
changes. Things like permissions will be guarded by the client, though there might be an out-of-space problem.
Second, commands which require feedback from the server, like reading directory content, will become completely
asynchronous, so feedback from the server will not be exactly a sync reply for the given command, instead
we can wait until the directory content is populated and start providing it back to VFS.
Third, and the main one, there is a possibility to stream commands both from client and server. Since client commands
now do not require a sync ack/reply, they can be batched for maximum performance, but that is not the main feature;
really interesting is the ability to receive a stream of commands from the server, so each of them can be parsed
independently from the original client command state. This allows implementing a cache coherency protocol without major
pain and having a high performance stream of data from server to client.
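The three points above can be pictured with a toy model of the client side. This is only a sketch of the idea in Python, with no real networking: the message names (dir_entry, dir_done, remove) and the Client class are hypothetical and are not the POHMELFS wire protocol.

```python
from collections import deque


class Client:
    def __init__(self):
        self.to_server = deque()          # client -> server stream
        self.dir_cache = {}               # directory path -> names received so far
        self.dir_complete = set()         # directories whose listing fully arrived
        self.local_objects = {"/a/old"}   # objects currently known locally
        # One handler per incoming message type, independent of what was sent.
        self.handlers = {
            "dir_entry": self.on_dir_entry,
            "dir_done": self.on_dir_done,
            "remove": self.on_remove,
        }

    def send(self, command, **args):
        """Queue a command; never wait for an ack or a reply."""
        self.to_server.append((command, args))

    def receive(self, message, **args):
        """Process one server message on its own, whatever is still in flight."""
        self.handlers[message](**args)

    def on_dir_entry(self, path, name):
        self.dir_cache.setdefault(path, []).append(name)

    def on_dir_done(self, path):
        self.dir_complete.add(path)       # content can now be handed to VFS

    def on_remove(self, path):
        self.local_objects.discard(path)  # cache coherency message


client = Client()
# Commands are batched back to back, nothing blocks on a reply.
client.send("create", path="/a/file")
client.send("write", path="/a/file", data=b"qweqweqwe")
client.send("readdir", path="/a")

# The server stream: readdir results interleaved with a coherency message.
for message, args in [
    ("dir_entry", {"path": "/a", "name": "file"}),
    ("remove", {"path": "/a/old"}),
    ("dir_entry", {"path": "/a", "name": "other"}),
    ("dir_done", {"path": "/a"}),
]:
    client.receive(message, **args)

print(client.dir_cache, client.local_objects)
```

The remove message arrives in the middle of the readdir results and is handled without disturbing them, which is the duplex property a ping-pong protocol can not give.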
More to the game is ->sendpage()/sendfile(), which are
broken
without proper acknowledge, so to fix the issue I plan to submit a socket extension patch, which will call
appropriate registered callback when page reference counter is about to be dropped, which automatically means
data was received on the remote side. This kind of acknowledge does not break connection down more than
simple unidirectional bulk transfer, so it is fast.
So, I started deleting lots of code and implementing the needed bits; the nearest future will show how broken my approach is.
This raises a question about design vs. evolution... I actually prefer the former, but frequently end up with the
latter (like this decision about the network protocol, which is a design, but only after several evolution steps
in the wrong direction). This reminds me of the kernel evolution
topic, which does not actually show anything good for the kernel: there are lots of dead-end evolutionary branches which
believe they are the top of the progress, maybe mankind is one of them...
That was a lyrical digression, so back to business!
/devel/fs :: Link / Comments (0)
Sun, 02 Mar 2008
Removing arbitrary size directory with single network command in POHMELFS.
All operations in pohmelfs
are made locally and are propagated back to the server during writeback time (or via the cache coherency
algorithm, which is not implemented fully yet). POHMELFS uses
the writeback cache in all its power, which allows removing a directory of arbitrary size
using only a single network command.
During unlink/rmdir time the local object is removed and potentially destroyed, while a short reference
to what it was is stored in a sync list of the parent, which is marked as dirty. So, when writeback
hits the parent directory of the just removed object, it sends all information about the removed objects to the server.
So, when a directory with an arbitrary number
of subdirs and other objects is recursively removed locally, information is not sent, but added to the appropriate
parent subdirs, which are removed in their own turn, so when the whole subdir is removed, only a single object
becomes dirty - the parent of the just-removed directory, which contains information about the removed
dir. A message about this will be sent later (on writeback or because of the cache coherency protocol), which will
force the server to remove the whole subdir recursively. This is much faster than sending information about
every single object being removed during recursive removal of the directory.
Of course if writeback starts hitting pohmelfs inodes during deletion time it is possible that not only
information about the highest removed directory will be sent, but also about some underlying subdirs, but
that does not matter a lot, since this is a very short condition (inode is in dirty list and yet not removed
by the recursive removal) and number of such inodes is still much smaller compared to overall number of removed
objects.
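The sync-list mechanism can be modelled in a few lines. This is a toy Python sketch, not the POHMELFS implementation: Node, remove() and writeback() are hypothetical names, and writeback is assumed to run only after the recursive removal has finished.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        self.sync_list = []   # names removed from this directory, not yet sent
        self.dirty = False
        if parent is not None:
            parent.children[name] = self


def remove(node):
    """Local recursive removal: nothing is sent, only parents are marked dirty."""
    for child in list(node.children.values()):
        remove(child)
    parent = node.parent
    del parent.children[node.name]
    parent.sync_list.append(node.name)
    parent.dirty = True
    node.parent = None        # the object is detached and can be destroyed


def writeback(node, path=""):
    """Collect the commands for every dirty directory still in the tree."""
    commands = []
    if node.dirty:
        commands += [("remove", path + "/" + name) for name in node.sync_list]
        node.sync_list.clear()
        node.dirty = False
    for child in node.children.values():
        commands += writeback(child, path + "/" + child.name)
    return commands


root = Node("")
top = Node("top", root)
victim = Node("victim", top)
for i in range(3):
    subdir = Node(f"dir{i}", victim)
    for j in range(4):
        Node(f"file{j}", subdir)

remove(victim)            # 16 objects are removed locally
print(writeback(root))    # a single command for the server
```

The sync lists of the removed subdirs disappear together with them, so only the surviving parent has anything to send.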
Actually cache coherency algorithm is the last serious thing to implement in pohmelfs I think. There are bugs
of course and some feature extensions, but major milestone will be set after this got implemented.
Stay tuned!
/devel/fs :: Link / Comments (0)
Thu, 21 Feb 2008
CacheFS and NFS local caching.
David Howells of RedHat recently
posted
the next round of his CacheFS implementation. The main idea of the project is to
store data and metadata modifications locally on disk.
The cache is implemented as a write-through one. Locally, data is stored as
usual files on a special partition formatted with one or another filesystem.
David also posted
benchmarks
of his approach. Metadata intensive operations showed significant slowdown
with the local on-disk cache, getting metadata from local cache also shows
a slowdown. The former can be explained by the write-through nature of the cache
and slow local disk operations, which is also a reason for the slowdown of
metadata reading.
There is also no cache-coherency algorithm implemented for CacheFS. Another problem,
also pointed out by Kevin Coffman, is possibly slower reading of data from the cache than
from the local filesystem (and from the remote one if bandwidth is not a limiting
factor, which is frequently the case).
This is third (actually the first :) local cache implementation for the network
filesystem, so competition between
CRFS,
POHMELFS and
CACHEFS becomes even
more interesting :)
Stay tuned!
/devel/fs :: Link / Comments (0)
Wed, 20 Feb 2008
Latency problems in pohmelfs.
trying to make at least something...
As was mentioned full inode resync logic
is very slow.
Latency is introduced likely somewhere at protocol layer, which is used
by pohmelfs. To test this scenario and find out the best possible
solution I implemented trivial network module and userspace server, which
talk to each other via protocol very similar to what is used in lookup/create
operations in pohmelfs. Server and client also maintain trees of the objects
they sent/received, so that the model would be as similar as possible to
pohmelfs usage patterns.
It's time to test things and find out where the problem lies, but as usual
there are problems. You are sick, everything is aching, but you
want to beat the crap, to move a bit further, to make something
interesting, so you start implementing the tiny bits, you start thinking,
you finally make the things, so you become happy and proud, and that is
just to find out, that all testing machines you had access previously
are turned off, and new ones are behind a firewall and there is no access
to the network from the ass of the world. This is called 'shit happens'.
/devel/fs :: Link / Comments (0)
Wed, 13 Feb 2008
POHMELFS got full inode number resync logic.
Now it updates all upper inodes in the tree when doing writeback for some inodes.
Here is a result:
/mnt/tmp$ mkdir -p 1/2/3/4
/mnt/tmp$ echo qweqweqwe > 1/2/3/4/file
/mnt/tmp$ ls -liR ./
./:
3332986296 drwxr-xr-x 3 zbr users 0 2008-02-13 12:07 1
./1:
3332988600 drwxr-xr-x 3 zbr users 0 2008-02-13 12:07 2
./1/2:
3306456568 drwxr-xr-x 3 zbr users 0 2008-02-13 12:07 3
./1/2/3:
3332985144 drwxr-xr-x 2 zbr users 0 2008-02-13 12:07 4
./1/2/3/4:
3306458488 -rw-r--r-- 0 zbr users 10 2008-02-13 12:07 file
/mnt/tmp$ sync
/mnt/tmp$
/mnt/tmp$ ls -liR ./
./:
557065 drwxr-xr-x 3 zbr users 0 2008-02-13 12:07 1
./1:
557066 drwxr-xr-x 3 zbr users 0 2008-02-13 12:07 2
./1/2:
557069 drwxr-xr-x 3 zbr users 0 2008-02-13 12:07 3
./1/2/3:
557070 drwxr-xr-x 2 zbr users 0 2008-02-13 12:07 4
./1/2/3/4:
557071 -rw-r--r-- 0 zbr users 10 2008-02-13 12:07 file
It also works with much bigger trees (like untarring the linux kernel tree,
although the ugliness of the userspace server requires raising the maximum number of open
file descriptors).
There is a single problem in this case: it is damn slow. And I do not see
an easy explanation for that. Well, tcpdump shows a small window, but I think that is an end result,
not a reason, and the reason is likely in the protocol pohmelfs uses - the system sends
a number of short packets in round-robin fashion, which may be slow for some reason.
Since I'm waiting for real hardware to test things on (oprofile does not work on the installed
Xen version), I can only handwave about the root of the problem...
And that is exactly the same problem the first pohmelfs write-through cache had, I think
even the timings are similar, so after this problem is fixed, a new version will be released.
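A back-of-the-envelope model of why a ping-pong exchange of short messages can be slow no matter how fast the wire is. This is only a sketch, not a measurement: the function names and every number in it (message count, round-trip time, message size, link speed) are made-up illustration values, and whether this effect is what really hurts pohmelfs is exactly the open question above.

```python
def serialized_time(n_messages, rtt_s):
    """Each short message waits for its reply before the next one is sent,
    so the total time is bounded by round trips, not by bandwidth."""
    return n_messages * rtt_s


def pipelined_time(n_messages, rtt_s, msg_bytes, bandwidth_bps):
    """All messages are pushed back to back and replies are collected later,
    so only one round trip is paid on top of the raw transfer time."""
    return rtt_s + n_messages * msg_bytes * 8 / bandwidth_bps


if __name__ == "__main__":
    # Hypothetical numbers: 20000 objects, 0.2 ms RTT, 100-byte messages, 1 Gbit/s.
    n, rtt, size, bw = 20000, 0.0002, 100, 1e9
    print("one message per round trip: %.4f s" % serialized_time(n, rtt))
    print("pipelined:                  %.4f s" % pipelined_time(n, rtt, size, bw))
```

In this toy model the serialized case scales with the number of objects times the round-trip time, which would fit a tree sync that is slow even on a fast link.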
There is another problem which complicates development - I got a cold (the second one this year, though only the third
one in the last 3 or 4 years), but such a condition with some temperature, when the brain is in a
'hinged' state between sick and good shape, brings very fun feelings about things around, which usually
ends up with very interesting results.
/devel/fs :: Link / Comments (0)
Tue, 12 Feb 2008
POHMELFS got inode number resync logic.
It happens when the inode in question is under writeback -
the protocol implements quite simple ping-pong message passing,
so the result looks like this:
/mnt/tmp$ echo qweqweqwe > qwe
/mnt/tmp$ ls -lai ./
total 8
557057 drwxrwxrwt 2 root root 4096 2008-02-12 19:58 .
2 drwxr-xr-x 22 root root 4096 2008-02-12 19:58 ..
3322992632 -rw-r--r-- 0 zbr users 10 2008-02-12 20:32 qwe
/mnt/tmp$ sync
/mnt/tmp$ ls -lai ./
total 8
557057 drwxrwxrwt 2 root root 4096 2008-02-12 19:58 .
2 drwxr-xr-x 22 root root 4096 2008-02-12 19:58 ..
557065 -rw-r--r-- 0 zbr users 10 2008-02-12 20:32 qwe
But overall it does not work, since writeback can happen for any inode
inside the whole not-synced tree, so trying to sync the inode number for some
obscure object, which sits in a directory the server never saw before, is quite
problematic - the whole tree has to be traversed from the inode under writeback
up to the one which is known to the server.
Although this is not a very complex task, there is a question about what to
sync. Should the whole directory content be synced, or just a single inode?
If the former, then should we force writeback for other objects in the directory under
resync... I think the simplest case is to force only the creation of higher-layer objects,
without syncing their content (like other objects in the directory), but the directory itself
should be marked as dirty, so that access from different clients forces the appropriate
resynchronization.
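The traversal described above can be sketched roughly like this. It is a toy userspace model, not pohmelfs code: the `Inode` class, its `synced`/`dirty` flags and the `create_on_server` callback are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Inode:
    name: str
    parent: Optional["Inode"] = None
    synced: bool = False  # True once the server knows about this object
    dirty: bool = False   # True if other clients must resync this directory


def unsynced_chain(inode: Inode) -> List[Inode]:
    """Walk from the inode under writeback up to the first ancestor the
    server already knows, and return the unknown objects topmost first."""
    chain = []
    node = inode
    while node is not None and not node.synced:
        chain.append(node)
        node = node.parent
    chain.reverse()
    return chain


def sync_for_writeback(inode: Inode, create_on_server: Callable[[Inode], None]) -> None:
    """Create only the missing higher-layer objects, parents before children.
    Directory contents are not synced; the parent is just marked dirty."""
    for node in unsynced_chain(inode):
        create_on_server(node)
        node.synced = True
        if node.parent is not None:
            node.parent.dirty = True
```

Creating parents before children means the server never sees an object inside a directory it does not know yet, which is the problem the single-inode resync runs into.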
/devel/fs :: Link / Comments (0)
Mon, 11 Feb 2008
Initial implementation of the offline and cache coherency algorithms.
It is rather dumb and does not even have state machine handling
in the usual meaning.
The existing pohmelfs implementation has only two places where the content of an inode
is 'globally' modified; by 'globally' I mean changes which have to be seen
by other clients if they access the given inode.
The first one is directory reading, when the inode in question gets information about
the other inodes inside it; the other one is object creation. Object removal is a local
operation, and there are no collisions if multiple clients delete the same object
simultaneously.
When a directory is read for the first time, pohmelfs just syncs its content from the server;
all subsequent reads happen from the cache, since all creations and removals happen locally.
This case is simple.
When pohmelfs is about to create an object, it marks the parent inode as dirty.
If the parent inode was not marked dirty previously, this ends up sending a single
message to the server. The server in turn can return the content of the directory in question
if that inode was already modified by a different client. If there are objects with the same
name as local ones, the local objects are 'renamed' to 'oldname-synctime', so that
the user can later run diff or whatever and merge the changes. That is how offline
pohmelfs clients work.
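The renaming rule for conflicting names can be illustrated with a small sketch. It assumes a directory is just a dict from name to object, and that 'synctime' is whatever timestamp the client uses at sync time; the function name and the exact suffix format are made up for the example.

```python
def merge_directory(local, server, synctime):
    """Merge the directory content returned by the server with local objects.

    Server entries keep their names. A local object whose name collides with
    a server entry is kept under 'oldname-synctime', so nothing is lost and
    the user can diff and merge later. (A collision of the renamed entry
    itself is not handled in this sketch.)
    """
    merged = dict(server)
    for name, obj in local.items():
        if name in server:
            merged["%s-%s" % (name, synctime)] = obj
        else:
            merged[name] = obj
    return merged
```

For example, with a local `qwe` and a server `qwe`, the merged directory ends up holding the server's `qwe` plus the local copy under a name like `qwe-1202745600`.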
An object is always created in the local cache only, with a local inode number. So far
it is never sent to the server (although code which does that and changes the inode
content exists); even writeback does not work right now (since the server does not know
about objects with local inode numbers). This part is a bit more complex: pohmelfs
has to sync the inode (i.e. send the current inode info, wait until the server creates the object,
then receive the real inode info and change the local cache) either in writeback (when
the system forces writeback of a page or pages, the appropriate inode will be synced first)
or in the cache coherency algo. For that purpose each network state locking first checks if there
are messages in the queue from the server which have to be processed first.
So far only receiving server content is supported; sending own content on request
from the server is the basis of cache coherency, and this is not yet turned on. Here
the major race lives, which can actually lead to a full resync of the idea. After we have locked
our own network state and checked that there are no requests from the server, the client can start
sending its own commands, but before they reach the server, it can start a CC resync
(and send messages into the same pipe as the client's commands) initiated
by a different client, which will break the protocol state machine. This is the main idea to think about.
Oh, and to implement the same logic on the server :)
/devel/fs :: Link / Comments (0)
Thu, 07 Feb 2008
POHMELFS and CRFS in the news.
At LWN.net. And as usual I do not have an account this time...
So, I will wait a week for the free article; by that time pohmelfs will contain very tasty things,
which do not exist in any other fs out there (or at least not in a single filesystem).
Edited to add that Simon Holm Thøgersen shared a link to the article. It is somewhat fun,
although the author (Jake Edge) writes quite differently from Jonathan Corbet imho. The article
does not compare pohmelfs and crfs, but shows that they are very similar. I learned that
Zach Brown has worked on CRFS for about a year, while pohmelfs has existed for
less than a month. Someone shared secret knowledge about the meaning of the pohmelfs abbreviation
in Russian, well, maybe he/she is right, who knows...
The article does not cover features scheduled for pohmelfs like offline working and inode resync logic.
Commenters try to compare crfs and pohmelfs with afs and pnfs. The latter two do not have metadata caching
mechanisms, so they are fundamentally different; pnfs in addition allows closed
extensions to be implemented, which will lead to vendor lock-in.
One remark about the writer Jake Edge is that he does not use full names in the articles, only last names.
/devel/fs :: Link / Comments (2)
Filesystem freezer. Removable device.
There is a long discussion in linux-fsdevel
about various filesystem freezing implementations and the features they should have.
The main goal of this project is to freeze any filesystem, so that all write requests
are blocked. This allows consistent backups to be implemented. This task
belongs to the block layer though, and this patchset actually implements it by
suspending the underlying block device. Although the interface (ioctl) is a bit ugly,
it will likely be accepted, since other filesystems (namely XFS) have such a feature
via their own private ioctls. People say that it does not always work though.
LVM supports consistent backups natively, but having such an interesting feature without
the need to work on top of device mapper would be a great deal!
This highlighted a very interesting project I have in mind (actually it will be
another reinvention of the wheel though) about various removable devices. Actually
it is not only about removable devices, but about any device which can suddenly disappear or get stuck
(like a network filesystem, a broken cable to a local disk or a bad drive).
The old idea is to remount such a device read-only, with an error returned on
any attempt to access it. There is a frevoke() syscall which does that
for a given file descriptor - it is marked as erroneous, so access to it returns an error,
but this does not fix the problem with a network filesystem, for example. Let's suppose
we have an NFS client which is stuck because the server was disconnected; there are cases when
it will never resume or return an error. Or a bad block/bad drive access, which will be retried
again and again forever...
Revoking a particular file descriptor is a simple task, but what if we have a web server
which accesses the broken drive for each new client, or a similar scenario? While we revoke one file descriptor,
the server will create another two, stuck in the middle of the operation.
The very good solution I have in mind is to break all existing access paths (the block
layer has access to all bios) and either replace the underlying device with a fake one,
so that all requests are completed with an error (consider it like hotplug/unplug
of a storage device), or replace the filesystem (inode and file) operations, so that
they return an error (that is like hotplug/unplug of the filesystem). In the latter case
it would even be possible to change the filesystem on the fly! First, plug in a filesystem which
just queues requests instead of processing data, then unplug the real filesystem,
plug in the new one and unplug the fake one.
Not sure it is very useful functionality, but it is very interesting...
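The plug/unplug idea can be modelled in userspace as swapping an operations table behind a mount point. This is only a toy sketch of the concept: `Mount`, `DictFs`, `FailingOps`, `QueueingOps`, `begin_swap` and `finish_swap` are hypothetical names, there is no locking, and nothing here corresponds to a real kernel interface.

```python
import errno


class DictFs:
    """Stand-in for a real filesystem: stores written data in a dict."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data


class FailingOps:
    """Operations table that completes every request with an error."""

    def write(self, path, data):
        raise OSError(errno.EIO, "backing device is gone")


class QueueingOps:
    """Fake filesystem that only queues requests instead of processing them."""

    def __init__(self):
        self.pending = []

    def write(self, path, data):
        self.pending.append((path, data))


class Mount:
    """Callers always go through the mount, so replacing `ops` redirects them."""

    def __init__(self, ops):
        self.ops = ops

    def write(self, path, data):
        return self.ops.write(path, data)


def begin_swap(mount):
    """Plug the queueing filesystem in place of the real one."""
    queue = QueueingOps()
    mount.ops = queue
    return queue


def finish_swap(mount, queue, new_fs):
    """Replay queued requests into the new filesystem, then plug it in.
    Single-threaded sketch: a real version would have to block new requests
    while the queue is drained."""
    for path, data in queue.pending:
        new_fs.write(path, data)
    queue.pending.clear()
    mount.ops = new_fs
```

Setting `mount.ops = FailingOps()` models the other variant, where the operations are replaced so that every access returns an error instead of hanging.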
/devel/fs :: Link / Comments (0)
Btrfs 0.12 has been released.
Chris Mason changed the on-disk format again, which leads
to a very noticeable (30 times!) speed improvement
for random write access (from 1 MB/s to 30 MB/s).
This release also contains a mount option and some tweaks for SSD (solid-state disk),
mainly write clustering that does not take into account the directory
the file writes belong to. Also added is simple ENOSPC handling;
although it is still possible to crash the machine when there is
no space left on the device, it is now a bit harder.
The next step for btrfs
is to support multiple devices for a single filesystem via
subvolumes.
Release notes.
/devel/fs :: Link / Comments (0)