Zbr's days.

Thu, 20 Dec 2007

open-by-inode() vs. name lookup in network filesystems.

A network filesystem is a tricky beast: depending on where it is implemented (kernel or userspace), it looks very different. By 'very' I mean that the differences are really complex.

In the kernel an inode, the basic identity of an object, exists for every object that has been looked up, until special steps are taken to drop it. Usually it stays alive: for example, when you traverse a directory, the inodes for every object you checked continue to exist even if you no longer use that directory. When a file is opened, the inode is attached to the file; when the file is closed, the inode lives on. This is a fundamental feature of the split between directory entries and inodes: directory entries are linked into the tree that we can see, while inodes are shadow objects behind those entries.
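The split between directory entries and inodes can be observed from userspace. The sketch below is only an illustration, assuming a Linux-like system and a temporary directory that supports hard links: two names point at one inode, and the inode stays usable through an open descriptor after both names are removed. The function name `inode_demo` is made up for this example, and it does not show the in-kernel inode cache itself, only its visible effect.

```python
import os
import tempfile

def inode_demo():
    """Show that names and inodes are separate objects (illustrative only)."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a")
        b = os.path.join(d, "b")
        with open(a, "w") as f:
            f.write("data")

        # A hard link is a second directory entry for the same inode.
        os.link(a, b)
        same_inode = os.stat(a).st_ino == os.stat(b).st_ino

        fd = os.open(a, os.O_RDONLY)
        try:
            # Remove both directory entries; the inode is still alive
            # because the open file keeps a reference to it.
            os.unlink(a)
            os.unlink(b)
            still_readable = os.read(fd, 4) == b"data"
        finally:
            os.close(fd)
        return same_inode, still_readable
```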

In userspace things are completely different: there are no inodes, only files identified by file descriptors. That's all. So when the kernel performs a lookup, it checks for a name in the inode with a given number, i.e. it performs an in-kernel reference-by-inode operation. In userspace there is no API to get a file handle by inode number (except rare special cases, which I think Zach uses in CRFS, and which is likely a good speedup for Btrfs). Basically, userspace must either hold an open file descriptor for the parent directory, or perform a reverse lookup: build a path and open the directory to check whether some object exists there, since userspace can only work with file descriptors.
open-by-inode was marked by Linus Torvalds as fundamentally broken for a number of reasons (namely races with directory layout changes such as move and rename), and that is likely correct, but the absence of such an API greatly reduces the performance of userspace metadata operations.
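To make the reverse lookup concrete, here is a minimal sketch of what a userspace server might have to do when a client refers to a directory by inode number. Everything here is an assumption for illustration: the `InodeTable` class, its methods, and the idea of storing `(parent inode, name)` pairs are not taken from any real server. It also assumes a single filesystem, since inode numbers are only unique within one.

```python
import os
import tempfile

class InodeTable:
    """Hypothetical userspace map: inode number -> (parent inode, name)."""

    def __init__(self, root):
        self.root = root
        self.root_ino = os.stat(root).st_ino
        self.entries = {}

    def path_of(self, ino):
        # Reverse lookup: climb parent links up to the root, then build a path.
        parts = []
        while ino != self.root_ino:
            parent, name = self.entries[ino]
            parts.append(name)
            ino = parent
        return os.path.join(self.root, *reversed(parts))

    def remember(self, parent_ino, name):
        # Record an object that was looked up before; return its inode number.
        path = os.path.join(self.path_of(parent_ino), name)
        ino = os.lstat(path).st_ino
        self.entries[ino] = (parent_ino, name)
        return ino

    def lookup(self, dir_ino, name):
        # There is no "open by inode" here: the path is rebuilt on every call.
        path = os.path.join(self.path_of(dir_ino), name)
        try:
            return os.lstat(path).st_ino
        except FileNotFoundError:
            return None
```

Note that the table goes stale as soon as someone renames or moves a directory behind the server's back, which is the same class of race that makes open-by-inode problematic.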

Having the network file server in the kernel is of course much (MUCH) simpler and faster, but for now its implementation is postponed a bit.
The initial server will be quite dumb: it will always perform a lookup from the root and always close the directory afterwards. Later it will be possible to add a cache of opened directories...
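The two strategies could look roughly like the sketch below. It is not the actual server code: `lookup_from_root` and `DirCache` are hypothetical names, and the code assumes a platform where Python's `os.open` and `os.stat` support the `dir_fd` argument (Linux does).

```python
import os

DIR_FLAGS = os.O_RDONLY | os.O_DIRECTORY

def lookup_from_root(root, relpath):
    """Dumb lookup: walk from the root, closing every directory on the way."""
    parts = [p for p in relpath.split("/") if p]
    fd = os.open(root, DIR_FLAGS)
    try:
        for name in parts[:-1]:
            next_fd = os.open(name, DIR_FLAGS, dir_fd=fd)
            os.close(fd)
            fd = next_fd
        if not parts:
            return os.fstat(fd)
        return os.stat(parts[-1], dir_fd=fd, follow_symlinks=False)
    except (FileNotFoundError, NotADirectoryError):
        return None
    finally:
        os.close(fd)

class DirCache:
    """Hypothetical cache of open directory descriptors, keyed by path."""

    def __init__(self, root):
        self.root = root
        self.fds = {}

    def dir_fd(self, reldir):
        # Open each directory once and keep the descriptor around.
        if reldir not in self.fds:
            self.fds[reldir] = os.open(os.path.join(self.root, reldir), DIR_FLAGS)
        return self.fds[reldir]

    def lookup(self, reldir, name):
        try:
            return os.stat(name, dir_fd=self.dir_fd(reldir), follow_symlinks=False)
        except FileNotFoundError:
            return None

    def close(self):
        for fd in self.fds.values():
            os.close(fd)
        self.fds.clear()
```

The cached variant saves the walk from the root on repeated lookups, at the cost of holding descriptors open and of keys (paths) that can go stale when directories are renamed.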
