Zbr's days.

Tue, 22 Apr 2008

Cache coherency in POHMELFS, continued.

While moving home I thought a lot about cache coherency. We tend to believe that NFS has a coherent cache because it is more or less write-through, but its cache is not actually synchronous: a lot of time can pass between the moment an object is created and the moment other clients see it, for example when the client that created the object has a slow link. So object creation and removal do not need to be synced to other clients during writeback on one of them. Instead, clients interested in an object perform a lookup, which may or may not return it. This is neither a race nor a cache coherency violation; it is the usual behavior of a multithreaded environment in which clients do not synchronize with each other.
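A minimal sketch of that argument, assuming a toy writeback client; `Server`, `WritebackClient`, `create`, `flush` and `lookup` are hypothetical names, not POHMELFS code. Client B's negative lookup before A's writeback is the same outcome it would get from a lookup that simply ran a moment before A created the object.

```python
class Server:
    def __init__(self):
        self.objects = set()   # objects the server knows about


class WritebackClient:
    """Hypothetical client with a writeback cache: creations stay local until flush()."""

    def __init__(self, server):
        self.server = server
        self.pending = set()   # created locally, not yet written back

    def create(self, name):
        # No message to the server and no notification to other clients.
        self.pending.add(name)

    def flush(self):
        # Writeback: push local creations to the server.
        self.server.objects |= self.pending
        self.pending.clear()

    def lookup(self, name):
        # Own creations are visible at once; other clients' only after their writeback.
        return name in self.pending or name in self.server.objects
```

Usage: after `a.create("foo")`, `a.lookup("foo")` is true while `b.lookup("foo")` is false until `a.flush()` runs.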

What we really care about is data consistency on the server. When a multipage write overlaps with a write from a different client, nobody should read back data from the middle of either transaction. Locking the whole file is not the way to go; instead, proper byte-range (actually page-range) locking has to be implemented. I already have a prototype, but I still have to test it in real life.
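A minimal in-process sketch of page-range locking, assuming 4 KiB pages and a single server process; `page_range` and `PageRangeLock` are hypothetical names, not the prototype mentioned above. A writer (or a reader that must not see a half-applied multipage write) locks only the pages its byte range touches, so disjoint writes proceed in parallel while overlapping ones are serialized.

```python
import threading

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def page_range(offset, length):
    """Return the inclusive (first, last) page indices covered by a byte range."""
    if length <= 0:
        raise ValueError("length must be positive")
    return offset // PAGE_SIZE, (offset + length - 1) // PAGE_SIZE


class PageRangeLock:
    """Exclusive locks on page ranges; overlapping requests wait, disjoint ones do not."""

    def __init__(self):
        self._cond = threading.Condition()
        self._held = []  # (first, last) page ranges currently locked

    def _overlaps(self, rng):
        first, last = rng
        return any(not (last < f or first > l) for f, l in self._held)

    def try_acquire(self, offset, length):
        """Lock without waiting; return the locked range, or None if it conflicts."""
        rng = page_range(offset, length)
        with self._cond:
            if self._overlaps(rng):
                return None
            self._held.append(rng)
            return rng

    def acquire(self, offset, length):
        """Block until no held range overlaps, then lock and return the range."""
        rng = page_range(offset, length)
        with self._cond:
            while self._overlaps(rng):
                self._cond.wait()
            self._held.append(rng)
            return rng

    def release(self, rng):
        with self._cond:
            self._held.remove(rng)
            self._cond.notify_all()  # wake waiters so they can re-check for overlap
```

Locking whole pages rather than exact bytes matches the "page-range actually" remark: two writes into different bytes of the same page still conflict, because the page is the unit that gets written back.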

So, based on my analysis, other competing projects may or may not follow my approach: drop creation/removal/stat coherency from the TODO list (as far as I can see, no one has implemented it yet :) and concentrate on server-side read/write locking.

I will also start some VM hacking. The plan is to implement a generic enough mechanism (well, working on x86 for a start :) to copy data to userspace via copy_to_user() from a different thread (i.e. not the one that started the syscall), while the original thread sleeps in the syscall. It will likely be somewhat similar to what I did for the zero-copy userspace sniffer and to how get_user_pages() works.
The result has to be as fast as a regular copy_to_user(), otherwise it is not an interesting solution; it will be used in the POHMELFS client for asynchronous reading.
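The kernel mechanism itself does not fit in a few lines, but its control flow can be modeled in userspace as a rough analogy: the thread that issued the read sleeps, and a different thread copies the reply directly into the sleeping thread's buffer before waking it. Here a `bytearray` stands in for the user buffer, and `ReadRequest`, `receiver` and `blocking_read` are hypothetical names; copy_to_user() and get_user_pages() themselves are not modeled.

```python
import threading


class ReadRequest:
    """A pending read: the caller's buffer plus a completion event (hypothetical)."""

    def __init__(self, size):
        self.buf = bytearray(size)  # stands in for the caller's userspace buffer
        self.done = threading.Event()
        self.nbytes = 0


def receiver(req, data):
    """Runs in another thread: copy the reply straight into the waiting caller's buffer."""
    n = min(len(data), len(req.buf))
    req.buf[:n] = data[:n]
    req.nbytes = n
    req.done.set()  # wake the thread sleeping in "the syscall"


def blocking_read(size, reply):
    req = ReadRequest(size)
    t = threading.Thread(target=receiver, args=(req, reply))
    t.start()
    req.done.wait()  # the original thread sleeps until the copy is complete
    t.join()
    return bytes(req.buf[:req.nbytes])
```

The point of the design is that no intermediate buffer is handed back to the sleeping thread: the data lands in its destination once, which is why the kernel version must be as cheap as a plain copy_to_user().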

/devel/fs :: Comments (6)

Alexander Boström wrote at 2008-04-26 20:52:

Real consistency:

The write/mkdir/rename/open(O_CREAT)/etc syscall returns after the server has replied "done".

The server tells all the other clients to "invalidate this part of the cache" and waits for them to respond before replying "done" to the original writing client.

If the server keeps track of the contents of client caches it can reduce the amount of "invalidate" messages. There are various other tricks to use to speed things up.
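A sequential sketch of the scheme described in this comment, with acknowledgements modeled as synchronous return values; `Client`, `Server`, `holders` and `invalidations_sent` are hypothetical names. The server invalidates only clients it knows may hold the object (the "keep track of client caches" trick) and replies "done" only after every acknowledgement has arrived.

```python
class Client:
    def __init__(self, name, server):
        self.name = name
        self.server = server
        self.cache = {}

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.server.read(self, key)
        return self.cache[key]

    def write(self, key, value):
        reply = self.server.write(self, key, value)  # returns only once the server says "done"
        self.cache[key] = value
        return reply

    def invalidate(self, key):
        self.cache.pop(key, None)
        return "ack"


class Server:
    def __init__(self):
        self.data = {}
        self.holders = {}  # key -> clients that may have it cached
        self.invalidations_sent = 0

    def read(self, client, key):
        self.holders.setdefault(key, set()).add(client)
        return self.data.get(key)

    def write(self, client, key, value):
        # Invalidate every other possible holder and collect its ack first.
        for other in self.holders.get(key, set()) - {client}:
            assert other.invalidate(key) == "ack"
            self.invalidations_sent += 1
        self.holders[key] = {client}
        self.data[key] = value
        return "done"
```

A client that never read the object is never contacted, so the number of invalidate messages depends on who actually caches it rather than on the total number of clients.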

Zbr wrote at 2008-04-26 21:33:

POHMELFS supports writeback cache, so it will not ask the server during open/create/mkdir commands at all. It is not a problem if the same object is opened on different clients, or created on them and then synced.

Actually, your example has a race: one client writes data to the server, which sends a message to the other clients, but before it is received another client starts sending data to the server. The same applies to create/mkdir. It can only be fixed with server-side locking (or fcntl() locking for files), which POHMELFS does not support yet; it is indeed on the TODO list.

I would agree on the read/write bits, but let's check what happens on an SMP system with simultaneous reading and writing into the file: the reader will just get whatever has been written so far, for example half of a page. So there is no synchronization between them at all. If multiple threads write into the same page, the last writer wins.

It is possible to invalidate a page on other clients when one of them syncs it to disk; that is what the VFS does for mapped files, and what I will commit soon.
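A toy model of the last two points, assuming 4 KiB pages and hypothetical `Server`/`Client` classes rather than POHMELFS internals: writes stay dirty in the writer's cache (writeback), syncing a page makes the server drop that page from the other clients' caches, and when two clients write the same page, whichever syncs last wins, just as with unsynchronized threads on one SMP machine.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


class Server:
    def __init__(self):
        self.pages = {}    # page index -> page contents
        self.clients = []

    def sync_page(self, writer, index, data):
        self.pages[index] = data
        for c in self.clients:
            if c is not writer:
                c.cached.pop(index, None)  # invalidate the stale copy on other clients


class Client:
    def __init__(self, server):
        self.server = server
        self.cached = {}  # clean pages
        self.dirty = {}   # pages written locally but not yet synced (writeback)
        server.clients.append(self)

    def read_page(self, index):
        if index in self.dirty:
            return self.dirty[index]
        if index not in self.cached:
            self.cached[index] = self.server.pages.get(index, bytes(PAGE_SIZE))
        return self.cached[index]

    def write_page(self, index, data):
        self.dirty[index] = data  # nothing is sent to the server yet

    def sync(self):
        for index, data in self.dirty.items():
            self.server.sync_page(self, index, data)
            self.cached[index] = data
        self.dirty.clear()
```

Until `sync()` runs, other clients keep reading their old copy, which is the same non-guarantee a reader has on a single SMP machine.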

Alexander Boström wrote at 2008-04-27 10:39:

Hi again!

Personally, I think "real consistency" between clients is cool and the way to go, but it's not needed everywhere.

It's very neat if you can do a complex "make -j" distributed among multiple hosts, but that does make things more complicated and possibly also slower in less demanding situations.

> POHMELFS supports writeback cache, so it will not ask the server during open/create/mkdir commands at all

Well, yes: if the client knows that it's the only one holding a certain part of the dataset in its cache, so there are no other caches to invalidate, then it doesn't need to tell the server about every operation. Which is kinda cool.

> Actually, your example has a race: one client writes data to the server, which sends a message to the other clients, but before it is received another client starts sending data to the server.

Hmm, are you sure? The message broadcast to other clients isn't really "create a directory 'foo' in 'bar'", but rather "invalidate your cache of 'bar'" because the server is going to update 'bar' after the other clients have responded and there must not be any stale caches. The server doesn't actually need to tell the clients that it's creating 'foo'. They will find out when they try to access 'bar' and start to repopulate their caches.

> I would agree on the read/write bits, but let's check what happens on an SMP system with simultaneous reading and writing

Hmm... Sync between different clients doesn't have to be better than sync between different processes and processors in the same machine. Is that what you mean?

Zbr wrote at 2008-04-27 20:45:

By the race I meant that both clients can send the command "create directory 'dir'" without knowing that the other client is going to create the same one. So when the server receives the first message, the second one is already in the network queue, and neither client has an object to invalidate. Both send the command; the first one creates the object, and the second one may or may not, but that no longer matters for it: the object exists. If it then tries to write something into it, it may hit a problem (if permissions differ), but that is another story.
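The create race can be reproduced locally with two threads standing in for two clients that both decide to create the same directory; `race` and `client_mkdir` are hypothetical helpers, and the local filesystem stands in for the server. Exactly one mkdir succeeds and the other finds the directory already there, and serializing the two calls with a lock (roughly what a server-side create lock would do) does not change that outcome.

```python
import os
import tempfile
import threading


def client_mkdir(path, results, name):
    """A hypothetical client sends 'create directory' without knowing about the other one."""
    try:
        os.mkdir(path)
        results[name] = "created"
    except FileExistsError:
        results[name] = "already exists"


def race(lock=None):
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "dir")
        results = {}

        def run(name):
            if lock is None:
                client_mkdir(path, results, name)
            else:
                with lock:  # serialize creation, as a create lock would
                    client_mkdir(path, results, name)

        threads = [threading.Thread(target=run, args=(n,)) for n in ("A", "B")]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return sorted(results.values()), os.path.isdir(path)
```

Which client wins varies between runs, but the end state is always one existing directory and one client that found it already there.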

> Sync between different clients doesn't have to be better than sync between different processes and processors in the same machine. Is that what you mean?

Yep.

Christoph wrote at 2008-05-10 12:41:

I think you still need to get some kind of lock from the server *before* a directory or file is created. The advantage of a writeback cache is still that for a new directory (think tar xvfz linux.tgz) no additional communication is needed.

Zbr wrote at 2008-05-10 15:57:

Sure, locking has to be implemented and can be used by clients via fcntl(); so far there is no locking at all. But taking a lock for creation is very questionable to me, mainly because it does not prevent any race.
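For files, the byte-range locking a client would request via fcntl() is what POSIX record locks already provide on a local filesystem. The sketch below uses Python's `fcntl.lockf()`, which issues fcntl() record-lock requests, on a local temporary file, assuming a Unix system with `os.fork`; `child_can_lock` and `demo` are hypothetical helpers. Record locks belong to a process, so a forked child conflicts with the parent's lock on an overlapping range but can lock a disjoint one. On a network filesystem the server would have to enforce such locks, which this comment says POHMELFS does not do yet.

```python
import fcntl
import os
import tempfile


def child_can_lock(fd, start, length):
    """Fork a child and report whether it can exclusively lock [start, start + length)."""
    pid = os.fork()
    if pid == 0:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, length, start, os.SEEK_SET)
            os._exit(0)
        except OSError:  # EAGAIN or EACCES: the range is locked by another process
            os._exit(1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0


def demo():
    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * 8192)
        f.flush()
        fd = f.fileno()
        fcntl.lockf(fd, fcntl.LOCK_EX, 4096, 0, os.SEEK_SET)  # parent locks bytes 0..4095
        overlapping = child_can_lock(fd, 2048, 4096)  # bytes 2048..6143 overlap the lock
        disjoint = child_can_lock(fd, 4096, 4096)     # bytes 4096..8191 do not
        return overlapping, disjoint
```

These locks are advisory: they only coordinate processes that ask for them, which is why they serialize writers who cooperate but say nothing about who creates an object first.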
