Fri, 18 Apr 2008
POHMELFS AIO reading benchmark vs async NFS.
After I spent two days implementing real AIO for POHMELFS, the following things happened:
- Implemented 3 different AIO schemes, two of which could be zero-copy. Here is a brief description of them.
First, the POHMELFS ->aio_read() callback schedules a number of pages to be read from the server
(if a page is already up-to-date, it is copied to userspace, otherwise a network request is sent), then
it waits...
- when async data is received from the remote side, the appropriate inode and pages are found, then the (physical)
userspace page is locked in memory and data is either received directly into that page, or received into a VFS
cache page and then copied into the userspace one. Then the userspace page is unlocked.
- when async data is received into a VFS cache page (note that it is received completely asynchronously, in a
different thread), the receiving thread copies the data into userspace via copy_to_user(). Since the receiver
thread has a completely different virtual memory layout, it cannot simply copy data to the provided userspace address:
first it has to set up page tables equal to the userspace thread's layout. In theory, setting the CR3 register
on x86 should be enough, but that's only theory. I was not able to fully complete this method, since eventually
the thread crashed (obviously: the userspace thread could still be active on a different CPU, so loading the same CR3 value
on different CPUs, pointing to the same page tables, leads to crappy things). This interesting hack can be finished though.
- when async data is received, pages are marked as ready and placed into a list, so the userspace thread can copy
them back via copy_to_user(). This is the simplest method, and it works great (graphs below).
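
To make the third scheme a bit more concrete, here is a small userspace model of it in C. It is only a sketch under stated assumptions, not the actual POHMELFS code: all names (model_page, model_aio, model_submit and so on) are hypothetical, memcpy() stands in for copy_to_user(), and there is no locking or wakeup between the receiver and the reader, which real code would need.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 4096            /* hypothetical page size for this model */

/* Hypothetical stand-in for a VFS cache page. */
struct model_page {
    unsigned char data[PAGE_SZ];
    int uptodate;               /* page already holds valid data */
    size_t index;               /* page index within the file */
    struct model_page *next;    /* link in the ready list */
};

/* Hypothetical per-request state the reader waits on. */
struct model_aio {
    struct model_page *ready_head, *ready_tail;
    size_t pending;             /* pages requested from the server */
    size_t copied;              /* pages already copied to the user buffer */
    unsigned char *ubuf;        /* stands in for the userspace buffer */
    size_t first_index;         /* index of the first page of the request */
};

/* memcpy() stands in for copy_to_user() in this model. */
void model_copy_page(struct model_aio *aio, const struct model_page *pg)
{
    memcpy(aio->ubuf + (pg->index - aio->first_index) * PAGE_SZ,
           pg->data, PAGE_SZ);
    aio->copied++;
}

/* Submission: up-to-date pages are copied at once, others are "requested". */
void model_submit(struct model_aio *aio, struct model_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (pages[i].uptodate)
            model_copy_page(aio, &pages[i]);
        else
            aio->pending++;     /* a network request would be sent here */
    }
}

/* Receiver side: data arrived, mark the page ready and queue it. */
void model_page_received(struct model_aio *aio, struct model_page *pg)
{
    pg->uptodate = 1;
    pg->next = NULL;
    if (aio->ready_tail)
        aio->ready_tail->next = pg;
    else
        aio->ready_head = pg;
    aio->ready_tail = pg;
}

/* Reader side: drain the ready list in the reader's own context.
 * Returns the number of pages copied. */
size_t model_drain_ready(struct model_aio *aio)
{
    size_t n = 0;

    while (aio->ready_head) {
        struct model_page *pg = aio->ready_head;

        aio->ready_head = pg->next;
        if (!aio->ready_head)
            aio->ready_tail = NULL;
        model_copy_page(aio, pg);
        aio->pending--;
        n++;
    }
    return n;
}
```

The point of the design is that the copy always runs in the reader's context, so the userspace address is valid without any page table tricks; the price is one extra copy from the cache page.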
- Found a bug in 2.6.25-rc7 shmem when removing a 1 GB file from it:
Bad page state in process 'rm'
page:c49948c0 flags:0xf7d4a600 mapping:00000000 mapcount:0 count:0
Trying to fix it up, but a reboot is needed
Backtrace:
Pid: 9454, comm: rm Not tainted 2.6.25-rc7 #11
[] bad_page+0x52/0x7a
[] free_hot_cold_page+0x5e/0x15a
[] __pagevec_free+0x18/0x22
[] release_pages+0xfb/0x142
[] __pagevec_release+0x15/0x1d
[] truncate_inode_pages_range+0xea/0x29f
[] __link_path_walk+0xa7e/0xb28
[] truncate_inode_pages+0x9/0xc
[] shmem_delete_inode+0x26/0xac
[] shmem_delete_inode+0x0/0xac
[] generic_delete_inode+0x88/0xec
[] iput+0x60/0x62
[] do_unlinkat+0xb7/0xf9
[] do_page_fault+0x2b6/0x6c2
[] do_page_fault+0x31e/0x6c2
[] sys_ioctl+0x2c/0x43
[] sysenter_past_esp+0x5f/0x85
[] pci_scan_single_device+0x377/0x446
I did not try to investigate (this is my testing server, not tainted with POHMELFS code).
- Ran multiple tests...
Test details for the second round of the POHMELFS vs NFS fight.
The hardware and software were already described in the first round;
I need to note that the server (2.6.25-rc7) has all debugging options turned off.
Tests performed: kernel tree reading
(find linux-2.6.24.4 -type f | xargs cat > /dev/null)
from disk over the net (XFS filesystem, cold server and client caches) and big file reading
from tmpfs (to eliminate server disk latencies). The graph was added to the previous round's results.

Note that async NFS and POHMELFS behave very similarly on operations that involve reading from disk;
that is because of disk latencies (although the 10k rpm SCSI disk used allows about 80 MB/s sequential reads,
XFS behaves quite badly with lots of small files). The tmpfs comparison shows the advantages of the
POHMELFS network protocol.
Reading from a huge remote tmpfs file is about 2 times faster for POHMELFS because of its AIO implementation,
although that is not the main reason: the server was almost always capable of handling requests from the POHMELFS client
one-by-one using one thread, which saturated about 70% of the bandwidth (keep in mind that all debug options are turned on on the client).
One of the main factors, I think, is readahead being turned off. Sync readahead has zero advantage in an asynchronous
network filesystem: the ->readpage() method used in readahead waits until the page is transferred, and only then
does the readahead code schedule a new request, while the filesystem could have been scheduling new requests
during that wait. One can implement ->readpages() though.
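
The readahead argument can be put into a back-of-the-envelope model. The sketch below is an idealized illustration, not a measurement: the function names and the rtt_us/xfer_us parameters are hypothetical, and it ignores server processing time and bandwidth limits. It only contrasts "wait for each page before asking for the next one" with "issue all requests up front".

```c
#include <assert.h>
#include <stdint.h>

/* One request at a time: every page pays a full round trip,
 * then its transfer time, before the next request is sent. */
uint64_t serial_read_us(uint64_t pages, uint64_t rtt_us, uint64_t xfer_us)
{
    return pages * (rtt_us + xfer_us);
}

/* All requests issued up front: one round trip of latency,
 * then the pages arrive back to back. */
uint64_t pipelined_read_us(uint64_t pages, uint64_t rtt_us, uint64_t xfer_us)
{
    if (pages == 0)
        return 0;
    return rtt_us + pages * xfer_us;
}
```

With made-up numbers, say 1000 pages, a 200 us round trip and 35 us of transfer per page, the serial variant takes 235000 us and the pipelined one 35200 us. The real gap depends on the actual network and server, so treat these figures purely as an example of the shape of the effect.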
A kernel tree reading micro-benchmark was also performed: POHMELFS has a 2-times win because of its network protocol, which
batches server replies (via TCP_CORK only though; I think I need to implement a better directory reading command).
Another solution is to correctly implement the transactional model, which is the next task now.
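
For readers who have not used TCP_CORK: below is a minimal sketch of how a userspace server could batch several small replies with it on Linux. This is not the POHMELFS server code; set_cork() and send_batched() are hypothetical helper names, the userspace setting is an assumption, and short writes are not retried to keep the example small. The only real API used is setsockopt() with the Linux-specific TCP_CORK option.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Set (on = 1) or clear (on = 0) TCP_CORK on a TCP socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int set_cork(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Hypothetical helper: queue several small replies while the socket
 * is corked, then uncork so the kernel sends them in full frames.
 * Returns the number of bytes written, or -1 on error. */
ssize_t send_batched(int fd, const struct iovec *replies, int count)
{
    ssize_t total = 0;

    if (set_cork(fd, 1) < 0)
        return -1;

    for (int i = 0; i < count; i++) {
        ssize_t n = write(fd, replies[i].iov_base, replies[i].iov_len);

        if (n < 0) {
            set_cork(fd, 0);    /* best effort: do not leave it corked */
            return -1;
        }
        total += n;             /* short writes are not retried here */
    }

    if (set_cork(fd, 0) < 0)    /* uncork: flush whatever is queued */
        return -1;
    return total;
}
```

Corking only coalesces what the server already sends; it does not reduce the number of requests the client has to make, which is why a better directory reading command would still help.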