Fri, 18 Apr 2008
Poor man's cache coherency protocol design for POHMELFS.
As you might know, POHMELFS is a network filesystem with a client-side cache of data and metadata.
Anything with a cache has to provide a cache-coherency algorithm to sync data with other users.
There are two common cases when caches become non-coherent:
- a client created/removed/modified an object which is not shared with other clients (i.e. this
object does not exist in their caches and no object with the same name was created on different
clients)
- an object being handled by one client also exists in other caches
Poor man's solution for the above problems is quite simple: a client flushes its changes
to whatever objects it wants during local writeback, and these changes are then propagated to all
other clients which worked with the parent object (this information will be stored on the server
each time a client reads a directory or performs a lookup). In the first non-coherent case above the client
will just receive a new object from the server, which will be easily imported into the existing tree
(because of the async nature of POHMELFS this is a trivial task, which right now works out of the box,
although only on the client). In the latter case there might be a problem if the local object was modified:
here we can either replace its content with the new data, or (better) rename the local object to
something different (like the old name plus sync time), so that the user could merge data manually.
So far there will be no locks, which will be implemented next.
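The conflict handling described above can be sketched in a few lines. This is only an illustration and not POHMELFS code: the `cache` dictionary, the `dirty` set, the `apply_remote_update()` name and the "name.timestamp" naming scheme are all assumptions made for the example.

```python
import time


def apply_remote_update(cache, dirty, name, remote_data, sync_time=None):
    """Import an object pushed by the server into a (hypothetical) client cache.

    cache: dict mapping object name -> data
    dirty: set of names with local modifications that were not flushed yet
    Returns the name under which local changes were preserved, or None.
    """
    saved_as = None
    if name in cache and name in dirty:
        # Second non-coherent case: the object exists locally and was modified.
        # Keep the local version under "old name plus sync time" so the user
        # can merge the data manually.
        stamp = int(time.time() if sync_time is None else sync_time)
        saved_as = f"{name}.{stamp}"
        cache[saved_as] = cache[name]
        dirty.discard(name)
        dirty.add(saved_as)  # assumption: the renamed copy is still unflushed
    # First case (or after the rename): just import the server's object.
    cache[name] = remote_data
    return saved_as
```

An unmodified or unknown object is simply overwritten or created, while a locally modified one survives under the renamed key.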
/devel/fs
POHMELFS AIO reading benchmark vs async NFS.
After I spent two days implementing real AIO for POHMELFS, the following things happened:
- Implemented 3 different AIO schemes, two of which could be zero-copy. Here is a brief description of them.
First, the POHMELFS ->aio_read() callback schedules a number of pages to be read from the server
(if a page is already up-to-date, it is copied to userspace, otherwise a network request is sent), then
it waits...
- when async data is received from the remote side, the appropriate inode and pages are found, then the (physical)
userspace page is locked in memory and data is either received directly into that page, or received into a VFS
cache page and then copied into the userspace one. Then the userspace page is unlocked.
- when async data is received (note that it is received completely asynchronously in a different thread) into
a VFS cache page, the receiver thread copies data into userspace via copy_to_user(). Since the receiver
thread has a completely different virtual memory layout, it can not simply copy data to the provided userspace address:
first it has to set up page tables equal to the userspace thread's layout. In theory setting the CR3 register
on x86 should be enough, but that's only theory: I was not able to fully complete this method, since eventually
the thread crashed (obviously: the userspace thread could still be active on a different CPU, so installing the same CR3 value
on different CPUs, pointing to the same page tables, leads to crappy things). This interesting hack can be finished though.
- when async data is received, pages are marked as ready and placed into a list, so the userspace thread can copy
them back via copy_to_user(). The simplest method. And it works great (graphs below).
- found a bug in 2.6.25-rc7 shmem when removing a 1 GB file from it:
Bad page state in process 'rm'
page:c49948c0 flags:0xf7d4a600 mapping:00000000 mapcount:0 count:0
Trying to fix it up, but a reboot is needed
Backtrace:
Pid: 9454, comm: rm Not tainted 2.6.25-rc7 #11
[] bad_page+0x52/0x7a
[] free_hot_cold_page+0x5e/0x15a
[] __pagevec_free+0x18/0x22
[] release_pages+0xfb/0x142
[] __pagevec_release+0x15/0x1d
[] truncate_inode_pages_range+0xea/0x29f
[] __link_path_walk+0xa7e/0xb28
[] truncate_inode_pages+0x9/0xc
[] shmem_delete_inode+0x26/0xac
[] shmem_delete_inode+0x0/0xac
[] generic_delete_inode+0x88/0xec
[] iput+0x60/0x62
[] do_unlinkat+0xb7/0xf9
[] do_page_fault+0x2b6/0x6c2
[] do_page_fault+0x31e/0x6c2
[] sys_ioctl+0x2c/0x43
[] sysenter_past_esp+0x5f/0x85
[] pci_scan_single_device+0x377/0x446
Did not try to investigate (this is my testing server, not tainted with POHMELFS code).
- Ran multiple tests...
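The third AIO scheme from the list above (the receiver marks pages as ready and puts them on a list, and the reading thread copies them out itself) can be modelled in userspace roughly like this. It is a sketch under stated assumptions, not the kernel code: the `receiver()` and `aio_read()` functions are made up for the illustration, a `queue.Queue` stands in for the ready list, and a plain slice assignment plays the role of copy_to_user().

```python
import queue
import threading


def receiver(pages, ready):
    """Stand-in for the network receiver thread: queue each page as it 'arrives'."""
    for index, data in pages:
        ready.put((index, data))  # page is marked ready and placed on the list
    ready.put(None)  # sentinel: no more pages will arrive


def aio_read(pages, page_size):
    """Reader side: wait for ready pages and copy them into the user buffer."""
    ready = queue.Queue()
    user_buf = bytearray(page_size * len(pages))
    worker = threading.Thread(target=receiver, args=(pages, ready))
    worker.start()
    while True:
        item = ready.get()  # the reading thread sleeps here until a page is ready
        if item is None:
            break
        index, data = item
        offset = index * page_size
        # The copy happens in the reader's own context, so no page table
        # tricks are needed (this plays the role of copy_to_user()).
        user_buf[offset:offset + len(data)] = data
    worker.join()
    return bytes(user_buf)
```

Pages may arrive in any order; the page index decides where each one lands in the buffer.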
Test details for the second round of the POHMELFS vs NFS fight.
Hardware and software were already described in the first round;
I need to note that the server (2.6.25-rc7) has all debugging options turned off.
Tests performed: kernel tree reading
(find linux-2.6.24.4 -type f | xargs cat > /dev/null)
from disk over the net (XFS filesystem, cold server and client caches) and big file reading
from tmpfs (to eliminate server disk latencies). The graph was added to the previous round's results.

Note that async NFS and POHMELFS behave very similarly in operations which involve reading from the disk;
that is because of disk latencies (although the 10krpm SCSI disk used allows about 80 MB/s sequential read,
XFS behaves quite badly with lots of small files). The tmpfs comparison shows the advantages of the
POHMELFS network protocol.
Reading from a huge remote tmpfs file is about 2 times faster for POHMELFS because of its AIO implementation,
although that is not the main reason: the server was almost always capable of handling requests from the POHMELFS client
one-by-one using one thread, which saturated about 70% of the bandwidth (add here all debug options turned on on the client).
One of the main factors, I think, is readahead being turned off. Sync readahead has zero advantage in an asynchronous
network filesystem: the ->readpage() method used in readahead waits until the page is transferred, and only then does the
readahead code schedule a new request, while during that wait the filesystem could already be scheduling new requests.
One can implement ->readpages() though.
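The readahead argument can be made concrete with a toy cost model. This is only a back-of-the-envelope sketch: the function names and the two parameters (`rtt` for the request round trip, `xfer` for the time one page spends on the wire) are assumptions for the illustration, not measurements from the tests above.

```python
def sync_read_time(pages, rtt, xfer):
    """One request at a time: wait for each page before asking for the next."""
    return pages * (rtt + xfer)


def pipelined_read_time(pages, rtt, xfer):
    """All requests scheduled up front: pay the round trip once, then stream pages."""
    if pages == 0:
        return 0
    return rtt + pages * xfer
```

In this model the synchronous path pays the round trip once per page, so the gap between the two grows with the number of pages and with network latency, which is the effect described for ->readpage() based readahead.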
A kernel tree reading micro-benchmark was also performed: POHMELFS wins by a factor of 2 because of its network protocol, which
batches server replies (via TCP_CORK only though; I think I need to implement a better directory reading command).
Another solution is to correctly implement the transactional model, which is the next task now.
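For readers who have not used TCP_CORK, here is a minimal userspace sketch of batching several small replies with it. It is not the POHMELFS server code: the `send_batched()` helper is hypothetical, and it assumes a connected TCP socket on Linux (on platforms without TCP_CORK it just sends the replies as they are).

```python
import socket


def send_batched(sock, replies):
    """Send several small replies so the kernel can coalesce them into full segments."""
    cork = getattr(socket, "TCP_CORK", None)  # Linux-only option
    if cork is not None:
        # While corked, partial frames are held back instead of being sent at once.
        sock.setsockopt(socket.IPPROTO_TCP, cork, 1)
    try:
        for reply in replies:
            sock.sendall(reply)
    finally:
        if cork is not None:
            # Removing the cork flushes whatever is still queued.
            sock.setsockopt(socket.IPPROTO_TCP, cork, 0)
```

The receiver sees the same byte stream either way; corking only changes how the bytes are packed into packets on the wire.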
/devel/fs