Zbr's days.

About :: TODO :: Blog :: RSS :: Old blog :: Projects :: GIT :: Gallery :: Notes

Fri, 26 Oct 2007

Byte range locking.

Last couple of days I wroked on the tricky algorithm for byte-range locking, which is especially interesting for distributed systems.

Consider the input, which contains of the ordered number of items (for example usual file - ordered set of bytes), and (potentially huge) number of users, who wants to read or write to the arbitrary contiguous areas of the input (like number of threads wants to read or write to the same file with random offsets and data size). Without proper byte-range locking this scenario clearly shows a bottleneck with locking contention.

My algorithm is quite simple - each user requests an ability to write to its area and during that write it locks only it, any subsequent write to the non-overlapping area will be performed in parallel with other writes, writing in case of overlaping regions will suspend second write (which overlaps with current one) until current one is completed. Second write will grab a lock (actually there are no locks, but a tree of permitted writes) for the region maxed of two overlapped regions (i.e. theirs concatenations), so that third write, which overlaps with one of the first two (current or subsequent) would be postponed too. When underlying write is completed, it schedules subsequent write, which in turn will schedule other subsequent write to the overlapping region. Thus all writes become asynchronous and writes to the non-overlapped regions will be perfomed in parallel. This does not match POSIX semantic (since writes to the overlapped regions are postponed and performed in order, writes to the other regions can be completed earlier), but in any given time any read from given region will return correct data for that region.

This problem especially rises when number of threads write to the same file on the different nodes, which can not rely on local page locking (line in Linux VFS).

There are some problems with the algorithm though - it uses (sometimes) too long recursive calls, which should be eliminated before used in real life. That is what I'm working on right now.

/devel/fs :: Link / Comments ()