Zbr's days.


Thu, 14 Aug 2008

Distributed storage debugging.

DST testing revealed a number of bugs, which would be easy to fix if I did not have to debug essentially two separate subsystems of DST: the client and the server. Both share a lot of code, but when one of them starts complaining it is quite hard to find out which side broke the protocol.

So far peers can connect and start the initial data exchange, but there is a major problem if the page size differs between the nodes. A block IO request (the bio structure) operates on pages (stored in bio_vec structures) and has a size and offset attached to each page. If the page sizes differ and the server node has the smaller page, it has to store somewhere how to split its own set of pages, allocated for the given bio size, into the chunks expected by the client (since we need to transfer a size/offset pair for each page). There is no such mechanism right now. A naive approach is possible, where the server node allocates bio pages of the size requested by the client, but this will break after a short time, since the only allocation the Linux VM can be relied on to satisfy is a single page: larger contiguous allocations start failing once memory gets fragmented.
Another approach is to allocate a whole block request (bio) on the server for each page of client data (each bio_vec structure), but this has too much overhead for sequential access and for the common case, where the page size is equal on both sides of the network channel.
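To make the missing mechanism more concrete, here is a minimal userspace sketch of the bookkeeping the server would need: it takes the sizes of the client's segments and lays them onto smaller server pages, recording which server page, offset and length hold each part. This is not DST code. The names struct piece and map_segments() are hypothetical, and the sketch assumes the server packs the client data back to back into its own pages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: one slice of a client segment inside one server page. */
struct piece {
    size_t client_seg;   /* index of the client segment this slice belongs to */
    size_t srv_page;     /* index of the server page that holds the bytes */
    unsigned int offset; /* offset of the slice inside that server page */
    unsigned int len;    /* length of the slice in bytes */
};

/*
 * Map client segments onto server pages of srv_page_size bytes, assuming the
 * server packs the data back to back. Returns the number of slices written
 * to out[], or 0 if out[] is too small (or there was nothing to map).
 */
static size_t map_segments(const unsigned int *client_sizes, size_t nr_client,
                           unsigned int srv_page_size,
                           struct piece *out, size_t max_out)
{
    size_t n = 0;
    size_t pos = 0; /* byte position in the server's packed data */

    assert(srv_page_size > 0);

    for (size_t i = 0; i < nr_client; i++) {
        unsigned int left = client_sizes[i];

        while (left > 0) {
            unsigned int off = (unsigned int)(pos % srv_page_size);
            unsigned int room = srv_page_size - off;
            unsigned int len = left < room ? left : room;

            if (n == max_out)
                return 0;

            out[n].client_seg = i;
            out[n].srv_page = pos / srv_page_size;
            out[n].offset = off;
            out[n].len = len;
            n++;

            pos += len;
            left -= len;
        }
    }
    return n;
}
```

With a 16384-byte client segment and 4096-byte server pages this yields four slices, one per server page. When both sides use the same page size and full pages, each client segment maps to exactly one slice, so the common case stays cheap.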

The network block device (NBD) does not have this problem, since its server lives in userspace and can allocate an arbitrary amount of RAM that is contiguous in virtual memory. DST could do the same in the kernel by simply allocating the needed buffers with vmalloc(), but using virtually mapped memory this way is very slow. iSCSI uses a single command per block request.

So far I plan to implement the following scheme for the read command (the only one that has the described problem): the client will iterate over all blocks in each bio it is about to send and will send as many commands as there are non-contiguous blocks in that block request. The server will receive those blocks as separate subcommands and will allocate a new bio for each of them. The client will need to raise the transaction reference counter to the number of such commands, since the server can reply to them in arbitrary order.
In the common case this should not actually happen, and I have not seen it in practice either, since most read bios come either from readahead (where they are contiguous) or from single block requests (which, if bigger than the page size, will also be contiguous). Nevertheless, in theory bios with a number of non-contiguous blocks can exist, and DST should be ready for them.
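The planned read scheme can be sketched in plain userspace C like this. It is only an illustration under my own assumptions, not DST code: struct blk, struct cmd, struct trans, split_into_commands() and trans_put() are hypothetical names, and a block is treated as contiguous with the previous one when it starts exactly where the previous one ended.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one block inside a bio. */
struct blk {
    unsigned long long start; /* starting position in bytes */
    unsigned int len;         /* length in bytes */
};

/* Hypothetical read command covering one contiguous run of blocks. */
struct cmd {
    unsigned long long start;
    unsigned long long len;
};

/* Hypothetical transaction: completes when every subcommand was answered. */
struct trans {
    unsigned int refcnt;
};

/*
 * Merge adjacent contiguous blocks into runs, one command per run.
 * Returns the number of commands, or 0 if out[] is too small.
 */
static size_t split_into_commands(const struct blk *blks, size_t nr,
                                  struct cmd *out, size_t max_out)
{
    size_t n = 0;

    for (size_t i = 0; i < nr; i++) {
        /* Extend the current run if this block starts where it ends. */
        if (n > 0 && out[n - 1].start + out[n - 1].len == blks[i].start) {
            out[n - 1].len += blks[i].len;
            continue;
        }
        if (n == max_out)
            return 0;
        out[n].start = blks[i].start;
        out[n].len = blks[i].len;
        n++;
    }
    return n;
}

/* Drop one reference per reply; returns 1 when the transaction is done. */
static int trans_put(struct trans *t)
{
    assert(t->refcnt > 0);
    return --t->refcnt == 0;
}
```

The client would set refcnt to the value returned by split_into_commands() before sending anything and call trans_put() for every reply. Since only the count matters, the order of the replies does not. A fully contiguous bio collapses into a single command, so readahead and single block requests pay nothing extra.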
