Zbr's days.

About :: TODO :: Blog :: RSS :: Old blog :: Projects :: GIT :: Gallery :: Notes

Fri, 29 Feb 2008

Debugging undebuggable.

If something looks undebuggable from the first view, than take a secon one. Better from different angle. Some problems require third look.

Bits of history of the problem. Pohmelfs has extremely large latencies when syncing local inode to the remote server. This involves sending a command to the server to create an object with given name and receive back a response with its real inode information (like inode number and other fields cached for faster stat() and similar workloads). Pohmelfs then changes local inode info to match the real data.
Syncing of small tree of 500 files takes about 40 (!) seconds. Well, in Xen environment where I develop this things local creation of 500 files in single ext3 directory takes more than 15 seconds, but another 25 is a pure overhead.
That was short description of previous series.

Next, problems of fixing the problems.
First, Xen version used at that testing machine is old enough, so oprofile does not work. Second, I do not know VFS internals enough (this is my first filesystem, interested reader can find how I managed to step likely on every possible rakes on that field, some of them were even small kid rakes...) to determine where there is a possibility to catch that long delays, but since linux filesystem is actually a not that complex system, but set of callbacks, implementation is not really outstanding, but knowing in which condition each callback can be invoked and which problems can be here or there is kind of a magic... Third, remote userspace pohmelfs server was not actually written by me, instead its bytecode was blown out because of some substances inspiration, so it can be very much a reason for all the problems, given that it is trivial as pretty much all my userspace code, even total rewrite will not fix the issue.

So, latency problem in pohmelfs looked really undebuggable. But you know, cup of excellent tea (from tea-packet) with lemon can fix any problem (or high themperature and substances, or fair amount of alcohol, everyone has fun the way he likes), so it was first decided to implement a simple network kernel module which would connect to remote userspace server and exchange messages in a similar fasion like pohmelfs does.
Such module was implemented, started and showed excellent performance (about 1 thousand of messages per second send and received back in test network, which is several orders of magnitude faster than pohmelfs). So, move back to VFS and pray for inspiration.

Inspiration was met today (thanks Arnaldo, likely it is because I'm getting healthier :).
I always thought that number of subsequent calls for recv() is not a good idea no matter where: in kernel or userspace, since it takes a socket lock, which in turn can introduce latencies found, so I eliminated subsequent recvs in pohmelfs code (testing module was written better and does sending and receiving without such 'fragments'), which resulted in... nothing, results did not changed at all. So, wrong step, but having subsequent sending calls in a row is not a good idea too, so I replaced them with allocation and copy, so that there would be only single kernel_sendmsg() call. As you might expect performance... changed by 30 times. Just by having single send call instead of two for as much as 500 invokations forced the whole network exchange to behave completely different.
So, to debug problem further I extended testing module and introduced ability to send and receive data not by single packet but via two fragments: 4 bytes and rest of the packet (60 bytes). Here is a result table for 1000 of messages sent and received back by testing module:

no fragments:				1.43 seconds
send fragments (4 and 60 bytes):	40.43 seconds
recv fragments (4 and 60 bytes):	1.43 seconds
both fragmentations:			40.43 seconds
It is 30 times difference just for simple application change!
tcpdump on receiving side shows that subsequent fragments sending results in a real message sending all the time kernel_sendmsg() is invoked, which results on ack for each such message (both 4 and 60 bytes), which completely degrades tcp window and connection just can not recover with such behaviour.

So, all that words were written just to show that even undebuggable from the first view problems can be easily solved, and that harmless (from the first view again) programming mistakes can result in very interesting results...

Now back to drawing board to think how to improve pohmelfs protocol even more to get the last bits out of the wire.

Btw, interested reader can get my network testing module and userspace from theirs just created homepage.

/devel/networking :: Link / Comments ()