Zbr's days.
October
Sun Mon Tue Wed Thu Fri Sat
 
26
     
2007
Months
Oct

About TODO Blog RSS Old blog Projects Gallery Notes

Fri, 26 Oct 2007

Byte range locking.

Last couple of days I wroked on the tricky algorithm for byte-range locking, which is especially interesting for distributed systems.

Consider the input, which contains of the ordered number of items (for example usual file - ordered set of bytes), and (potentially huge) number of users, who wants to read or write to the arbitrary contiguous areas of the input (like number of threads wants to read or write to the same file with random offsets and data size). Without proper byte-range locking this scenario clearly shows a bottleneck with locking contention.

My algorithm is quite simple - each user requests an ability to write to its area and during that write it locks only it, any subsequent write to the non-overlapping area will be performed in parallel with other writes, writing in case of overlaping regions will suspend second write (which overlaps with current one) until current one is completed. Second write will grab a lock (actually there are no locks, but a tree of permitted writes) for the region maxed of two overlapped regions (i.e. theirs concatenations), so that third write, which overlaps with one of the first two (current or subsequent) would be postponed too. When underlying write is completed, it schedules subsequent write, which in turn will schedule other subsequent write to the overlapping region. Thus all writes become asynchronous and writes to the non-overlapped regions will be perfomed in parallel. This does not match POSIX semantic (since writes to the overlapped regions are postponed and performed in order, writes to the other regions can be completed earlier), but in any given time any read from given region will return correct data for that region.

This problem especially rises when number of threads write to the same file on the different nodes, which can not rely on local page locking (line in Linux VFS).

There are some problems with the algorithm though - it uses (sometimes) too long recursive calls, which should be eliminated before used in real life. That is what I'm working on right now.

/devel/fs :: Link / Comments (4)

Sly wrote at 2007-10-29 15:01:

Are there going to be asynchronous variant of IO?

If yes, do you guarantee to keep the sequence of schedulled write's? (taking in account that subsequent(?) writes may proceed simultaneously)

In other words, what's going to happen to write1 and write2 on the picture? Will they "do down" or stay on the top? http://img87.imageshack.us/img87/8624/writelockbe1.png

Shall it violate data integrity if read is requested in between?

Zbr wrote at 2007-10-29 15:48:

POSIX requires that w2 should be completed after w1, but that hurts performance in distributed system, so I think that this rule should be violated and allow w2 to be completed in parallel with the tree of w1 writes (in case they do not overlap, otherwise it is a FIFO).

In case of the tree below w1 - w1 will be written after all regions which it overlaps are written, probably in parallel with some other writes, which do not overlap with given one.

Writes itself will be async, but VFS calls are sync, so userspace will think that writes are synchronous.

Sly wrote at 2007-10-29 17:32:

The point is that once synchronous IO is used, one can only execute a single operation per user thread. The user may notice the moment he enters the system call and when he leaves. BUT there's no way to determine the sequence in which those writes are actually called. What if the thread was interrupted by context-switch just before the system call?

Therefore, if user API is synchronous, the system may reorder write operations as it likes as long as it doesn't let the thread off. It may always justify itself by saying that was the thread scheduler to reorder the system calls :)

But troubles may rise if user has ability to schedule operations without waiting for it to complete (O_NONBLOCK?)

Maybe you should introduce some kind of sync-points or queues. The latter requests in one queue shouldn't overtake the earlier ones. Though, they may complete "at one moment".

Zbr wrote at 2007-10-29 17:45:

If thread has entered (blockable) write() call, it will not leave it (except by signal), but can be scheduled away (and will). This particular write will be executed later according to the tree of simultaneous (FIFO-like) writes to the overlapping regions.

Writes in the kernel itself will be (should be) async - callback based mechanism to really start writing, when previous thread completed its operation on the overlapping region.

Even if write is non-blocking, there is no need to have anything special - it will be just returned -EAGAIN while another thread is being writing to the same area.

This is very low-level locking layer - it is essentially extension of the linux VFS page lock to the distributed system.

Please solve this captcha to be allowed to post (need to reload in a minute): 25 - 4

Comments are closed for this story.