Zbr's days.
February
Sun Mon Tue Wed Thu Fri Sat
         
29  
2008
Months
Feb
Aug Sep
Oct Nov Dec

About TODO Blog RSS Old blog Projects Gallery Notes

Fri, 29 Feb 2008

Debugging undebuggable.

If something looks undebuggable from the first view, than take a secon one. Better from different angle. Some problems require third look.

Bits of history of the problem. Pohmelfs has extremely large latencies when syncing local inode to the remote server. This involves sending a command to the server to create an object with given name and receive back a response with its real inode information (like inode number and other fields cached for faster stat() and similar workloads). Pohmelfs then changes local inode info to match the real data.
Syncing of small tree of 500 files takes about 40 (!) seconds. Well, in Xen environment where I develop this things local creation of 500 files in single ext3 directory takes more than 15 seconds, but another 25 is a pure overhead.
That was short description of previous series.

Next, problems of fixing the problems.
First, Xen version used at that testing machine is old enough, so oprofile does not work. Second, I do not know VFS internals enough (this is my first filesystem, interested reader can find how I managed to step likely on every possible rakes on that field, some of them were even small kid rakes...) to determine where there is a possibility to catch that long delays, but since linux filesystem is actually a not that complex system, but set of callbacks, implementation is not really outstanding, but knowing in which condition each callback can be invoked and which problems can be here or there is kind of a magic... Third, remote userspace pohmelfs server was not actually written by me, instead its bytecode was blown out because of some substances inspiration, so it can be very much a reason for all the problems, given that it is trivial as pretty much all my userspace code, even total rewrite will not fix the issue.

So, latency problem in pohmelfs looked really undebuggable. But you know, cup of excellent tea (from tea-packet) with lemon can fix any problem (or high themperature and substances, or fair amount of alcohol, everyone has fun the way he likes), so it was first decided to implement a simple network kernel module which would connect to remote userspace server and exchange messages in a similar fasion like pohmelfs does.
Such module was implemented, started and showed excellent performance (about 1 thousand of messages per second send and received back in test network, which is several orders of magnitude faster than pohmelfs). So, move back to VFS and pray for inspiration.

Inspiration was met today (thanks Arnaldo, likely it is because I'm getting healthier :).
I always thought that number of subsequent calls for recv() is not a good idea no matter where: in kernel or userspace, since it takes a socket lock, which in turn can introduce latencies found, so I eliminated subsequent recvs in pohmelfs code (testing module was written better and does sending and receiving without such 'fragments'), which resulted in... nothing, results did not changed at all. So, wrong step, but having subsequent sending calls in a row is not a good idea too, so I replaced them with allocation and copy, so that there would be only single kernel_sendmsg() call. As you might expect performance... changed by 30 times. Just by having single send call instead of two for as much as 500 invokations forced the whole network exchange to behave completely different.
So, to debug problem further I extended testing module and introduced ability to send and receive data not by single packet but via two fragments: 4 bytes and rest of the packet (60 bytes). Here is a result table for 1000 of messages sent and received back by testing module:

no fragments:				1.43 seconds
send fragments (4 and 60 bytes):	40.43 seconds
recv fragments (4 and 60 bytes):	1.43 seconds
both fragmentations:			40.43 seconds
It is 30 times difference just for simple application change!
tcpdump on receiving side shows that subsequent fragments sending results in a real message sending all the time kernel_sendmsg() is invoked, which results on ack for each such message (both 4 and 60 bytes), which completely degrades tcp window and connection just can not recover with such behaviour.

So, all that words were written just to show that even undebuggable from the first view problems can be easily solved, and that harmless (from the first view again) programming mistakes can result in very interesting results...

Now back to drawing board to think how to improve pohmelfs protocol even more to get the last bits out of the wire.

Btw, interested reader can get my network testing module and userspace from theirs just created homepage.

/devel/networking :: Link / Comments (4)

M.S. wrote at 2008-02-29 18:45:

Hello,

nice investigation -- and I never thought it would make such a difference!

Before changing the sendmsg/recvmsg-code, have you tried to set flag MSG_MORE in msg_flags of msghdr for subsequent calls? Maybe you can gain a similar effect like you do by copying all the data in one large chunk and then sending this chunk?

zbr wrote at 2008-02-29 19:08:

I did not change sendmsg/recvmsg code, instead fixed application usage. Yes, MSG_MORE, if set for the first fragment, fixes this problem too, but pohmelfs protocol is constructed in the manner, which does not allow to always know in advance if there will be any additional packets sent. Copying into single big chunk and sending it via one call was essentially what I ended up with, although I can not say this is a good solution because of additional copy, so I'm rethinking it right now to something more interesting.

ak wrote at 2008-03-01 19:39:

You don't need the continuous buffer, sendmsg will happily do scatter gather using a struct iovec for you

There is also TCP_CORK btw

zbr wrote at 2008-03-02 16:22:

Yes, sendmsg's scatter/gather is exactly what I want to use to avoid additional copy. Or rethink protocol a bit to always have data aligned for the single buffer sending, or actually a combination of both. TCP_CORK only works with TCP, while pohmelfs can work with any network protocol. Putting protocol specific hacks into filesystem callbacks is a really broken approach, which will heavily strike back when it will be used for different protocol like SCTP or even UDP or completely new media like IB (yes, I know about ip over IB, but that is additional potentially slow layer, so better be avoided if possible).

Please solve this captcha to be allowed to post (need to reload in a minute): 5 + 21

Comments are closed for this story.