Zbr's days.


Mon, 31 Dec 2007

Apartment development mostly completed.

I finished painting, gluing, covering and lights, and called friends. There are a number of tasks still to be completed, but they are mostly very small, so they can be postponed for a while. I am really excited about my loft: it looks very interesting to me and there is nothing I would like to change or fix.
Groovy!

/devel/flat :: Link / Comments (0)


Sun, 30 Dec 2007

Meanwhile on the apartment development side.

A lot of changes. A huge step forward was made today (and last night). Right now I have completed my table (although it has only a single layer of varnish and an unfinished rim), mostly finished the kitchen (there was not enough wallpaper, but I painted the ceiling, glued all the wallpaper I had, and will set up the floor cover tomorrow), and finished painting in the room (I have a blue wall now) and the hall (no uchuu yet). So the only really needed thing is to remove a huge amount of dust and garbage from the apartment and then set up the floor cover.
I think I'm ready for the New Year celebration, and the amount of work I have done will absolutely end up with a good celebration.

/devel/flat :: Link / Comments (0)


Sat, 29 Dec 2007

POHMELFS abbreviation.

POHMELFS stands for Parallel Optimized Host Message Exchange Layered File System.

And it has a metadata cache on the client now. It contains just pohmelfs inodes, which are indexed by three keys: the first is a tuple consisting of the name hash, the parent inode number and the length of the string (this guarantees that there will be no identical tuples), the second is the inode number, and the last one is the offset in the address space of the parent.
The cache update operation is independent from its usage, although both are guarded by the same lock.
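The three keys above can be sketched in plain userspace C. This is only an illustration under assumptions: the structure names, the toy hash function and the comparator are hypothetical and are not taken from the pohmelfs sources.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical first key: (name hash, parent inode number, name length). */
struct name_key {
	uint32_t hash;
	uint64_t parent_ino;
	uint32_t len;
};

/* Hypothetical cached inode carrying all three keys described above. */
struct cached_inode {
	struct name_key name;	/* key 1: name tuple */
	uint64_t ino;		/* key 2: inode number */
	uint64_t dir_offset;	/* key 3: offset in the parent's address space */
};

/* Toy string hash (djb2 style), used only for illustration. */
static uint32_t name_hash(const char *s, uint32_t len)
{
	uint32_t h = 5381;

	for (uint32_t i = 0; i < len; i++)
		h = h * 33 + (unsigned char)s[i];
	return h;
}

/* Total order over name keys, usable as a tree comparator. */
static int name_key_cmp(const struct name_key *a, const struct name_key *b)
{
	if (a->hash != b->hash)
		return a->hash < b->hash ? -1 : 1;
	if (a->parent_ino != b->parent_ino)
		return a->parent_ino < b->parent_ino ? -1 : 1;
	if (a->len != b->len)
		return a->len < b->len ? -1 : 1;
	return 0;
}

/* Build the name key for an object called `name` inside directory `parent_ino`. */
static struct name_key make_name_key(uint64_t parent_ino, const char *name)
{
	struct name_key k;

	k.len = (uint32_t)strlen(name);
	k.hash = name_hash(name, k.len);
	k.parent_ino = parent_ino;
	return k;
}
```

The same name in two different directories produces different keys, because the parent inode number is part of the tuple.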

/devel/fs :: Link / Comments (0)


Fri, 28 Dec 2007

Table development.

It happened that I changed my table design once more, now it has a single right angle. Since my wooden door has gaps between the wood plates, I decided not to remove the orgalite (paper filled with glue) plates from the top and bottom of the wood plates, but orgalite does not soak up a mordant, so I will get a coloured varnish and cover the table plate again. It is quite hard to work with wood, especially on straight edges, without a plane, using only a knife and an electric jigsaw, so I will buy one too. I also need a set of chisels.
Given that, the table is still at a very early development stage, but one can check a preliminary photo (made by phone, so the quality is not very good) here.

Since the table is postponed, I will paint the walls in the hall and try to clean my loft a bit. I would also like to set up a boiler and start the last arch (instead of the kitchen door), but that is too loud a process, so I will likely postpone it until tomorrow too.
It will be a busy day, and quite a short one actually - Mephody and Irin arrive, which is the start of the NY celebration process!

/devel/flat :: Link / Comments (0)


Thu, 27 Dec 2007

Meanwhile on the apartment development side: blue wall and table.

Yes, I've made it - now I have a blue wall. The colour is called 'royal marine', and although it does not look like a sea, it looks great. (I have not been near the sea for so many years that I do not even recall when the last time was, so maybe it has changed; and I have never seen an ocean at all, so I will change that too.) But to make my feelings worse the devil told me to get not enough paint, so I actually have only three quarters of the wall painted. I will fix it tomorrow or in a day, when I go to the development shop to get an LED cord for the ceiling. So, painting in the room and hall will be finished very soon.
Roughly the same applies to my table development. Not that I made much progress, but I cleaned my old wooden entrance door (which is the base for the table), put it on the floor and drew the table contour on it. It looks very expressive and completely different from the pictures above: there are no straight lines (although its base is the letter 'L'), it will have only a single leg (I bought it today) at the end of the longer side, and its opposite side will be attached to the walls and will be level with the window-sill. Maybe later I will replace the single leg with one running from floor to ceiling - that side of the table is essentially round, so it will be convenient to put some round (glass) shelves there. I tried to saw the table using my electric jigsaw, but it is quite a loud process and it is about 24:00 in Moscow already, so I postponed it until tomorrow or later. It has to be completed this year, since I need a table to put a lot of things on (for example a fir tree made of the lots of empty beer bottles and jars I have here).
Since I have no chairs, I will make a couple of long benches if I have enough material (there are two doors here - the bigger one (about 200x90 cm) is used as the base for the table, the smaller one (200x60 cm) for the smaller part of the letter 'L', and the rest will be used for benches). I could get wood plates in the development shop (it has a lot of interesting types there), but I have some trouble getting them home without a car and do not want to wait for delivery (which will take about a week).
I also got a water hatch for my bathroom today, but I will not set it up, since I have no glue for ceramic tiles, so it is better to devote this time to other interesting tasks.

Expect some photos of my loft closer to NY time...

/devel/flat :: Link / Comments (0)


Wed, 26 Dec 2007

New release of the distributed storage: Groundhogs strike back: no New Year for humans!

Short changelog:

  • mirroring algorithm improvements
  • debug cleanups
  • extended mirroring initialization
  • documentation update
  • name is 'Groundhogs strike back: no New Year for humans' now
As usual, one can get patch or pull changes from the project homepage.

/devel/dst :: Link / Comments (2)


CDMA (EVDO) vs GPRS.

Good people gave me a SkyLink CDMA modem (model USB CNU-550 with EVDO support) to test the internet connection in Linux and compare it against GPRS.

Well, here are my conclusions:

  • CDMA always works, while GPRS really sucks in the middle of the day. Well, this has to be confirmed tomorrow, but I tested the CDMA modem at about 19:00 and it worked OK, while MTS (Mobile TeleSystems in Russia) GPRS at about 12:00 worked very badly (I connected quickly, but an ssh login took an enormously long time).
  • CDMA speed is usually about 10-16 kb/s, while GPRS usually cannot get higher than 1-2 kb/s. More on this: I believe a CDMA session can have higher speeds (the pppd session requests about 900 kbit/s), but the behaviour of the initial login (I saw quite a few of them at different initial speeds during usual work and userspace network stack testing) shows there is some limitation on the server or hardware side (i.e. on the SkyLink side, either because of a special tariff scale or because of driver limitations; I was told there is no 16 kb/s limitation with the given card though). A note for testing different congestion control algorithms and developing my own if needed: the speed frequently drops below 10 kb/s during download of a big file. SkyLink is a very interesting source of data for such development: it looks like its RTT is quite high for the default (new CUBIC) congestion control, at least a low-traffic source that wants very small latency (mutt over ssh on a remote host without serious traffic shaping) works quite badly in this setup. Very likely this is just empty speculation and the problem is in the hardware on the server side (i.e. it supports bulk streaming access at high speeds (16 kb/s), but fails with low-latency applications, which work in a small-packet ping-pong environment).
  • The CDMA USB CNU-550 modem works OK in Linux (modulo the above issues) with these peers/pap-secrets files, without any problems.
Anyway, CDMA SkyLink is much faster and smoother than MTS GPRS, so the decision about which is better is quite obvious...

/other :: Link / Comments (3)


Tue, 25 Dec 2007

Continuing CRFS debates.

Zach Brown again shed some light on his CRFS design and implementation. Let's compare the facts with my thoughts.

The most exciting news is that CRFS caches not only data but also metadata on the client, which is flushed to the server on writeback. That is what allows it to have 4-6 times higher performance in metadata-intensive operations.

Another piece of news is actually quite bad for the majority of potential CRFS users - the userspace server is btrfs specific, which can be another gain in the benchmarks (although it should be noticeably smaller than the metadata caching part). The server does not require any additional patches, but since it is btrfs specific, it likely worked on top of a ramdisk (when the test was performed with RAM storage), not tmpfs. The userspace server has exclusive access to the given block device, so it is not allowed to simultaneously mount it in the usual way (probably it is possible to mount it read-only whilst it is used by CRFS).
The client kernel module only depends on the ->write_begin()/->write_end() patchset by Nick Piggin, which was added to mainline recently.

Batching of network requests happens naturally in the request/reply protocol, but a reply contains not just the single requested object, but a set of them; since the client caches metadata, it can check whether data is in the cache or not and update it if needed.

With that knowledge, let's summarize the given bits:

  • CRFS is btrfs specific, while pohmelfs is supposed to be fs-agnostic. This CRFS feature allows faster (probably even noticeably faster) access to on-disk data. Do not think it is a bad sign, consider it a client-server filesystem: no one claims AFS is bad because there is only an AFS-specific kernel server. Here it is the same, but the server is in userspace.
    From another point of view, not being able to work with the same btrfs volume locally can be a show-stopper for some users.
  • Metadata caching. That rocks. It has to be implemented.
  • Extended request/reply protocol: i.e. do not reply with only a single object (if that was not explicitly requested), but try to combine objects. The most obvious example is the ->readdir() callback, where each request from the client should transfer multiple objects, which will then be cached.
I think I was correct in most if not all of my predictions about CRFS, probably I should try weather forecasting next time...

Given that, I have clear expectations of what pohmelfs should have and which results we should expect.
CRFS project is a serious step forward in this area, so it is very exciting to work with its ideas and move further.
Stay tuned!

/devel/fs :: Link / Comments (2)


Mon, 24 Dec 2007

Climbing evening.

That was hard. That was really bloody hard, but great training. I climbed high on a number of new routes - first, for warm-up, I tried something new without a label (a new yellow route in the left vertical sector); it happened to be quite a complex route, so I completed it with a couple of falls since I did not know its exact holds. The next several attempts were on the same part of the complex route that starts on the horizontal negative slope; I skipped the start, since I wanted to know how the route behaves higher up, and I already knew that its start is very complex and fully corresponds to its category (the red 7a route in the middle sector).
There were also a couple of simpler routes I did for warm-up and at the very end, to completely drain my power and get the blood moving.
It was excellent time!

/life :: Link / Comments (0)


First CRFS (cache coherent remote file system) results.

Zach Brown posted first public results of his CRFS filesystem.
He compared NFS and CRFS with the remote storage on disk (likely btrfs) and in RAM (tmpfs) for two operations: creating a big number of files/dirs (a lot of metadata operations) with small writes (untarring a kernel archive) and reading all that data into RAM.
In both tests CRFS is noticeably faster: the metadata operation test (untarring a kernel archive) is 4 times faster for disk storage and about 6 times faster for RAM, and CRFS reading is about 1.8 times faster than NFS.

Very impressive results, although without knowledge of the CRFS internals it is quite hard to tell where and how such a gain was created, so I will handwave here :)
When CRFS is opened (if it will be), we will check my thoughts..

First, since there was a tmpfs test, the userspace server does not use anything btrfs specific (like open by inode). There is a possibility that btrfs exports some ioctls or that the kernel was patched, but right now I will not consider this a fact. So, first, the userspace server can work on top of any filesystem.

Second, reading is only 2 times faster, while metadata operations are 4-6 times faster. Zach says it is limited by disk speed, so this means metadata was heavily cached. There is a question, though: does the server see the last metadata change, or will it be sent to the server only when another client accesses the cached data (so that caches become coherent)? Taking into account that NFS always sends metadata changes, it looks like CRFS does not. If that is correct, then there is a question whether it needs to send metadata updates at all until sync or flush has started.

Third, the userspace server is fast. With the logic I described for the pohmelfs server, I think it will not be able to compete, so there is room for thought.

Fourth, the network protocol in CRFS batches requests. This can be done either because of a special transactional layer between VFS callbacks and the network, or because of the way VFS callbacks work: for example, data is not sent in the ->commit_write() callback, but only in ->writepage() and only if there is a strong demand for that. The same applies to metadata operations - how are they batched and network communication reduced to get a 4-6 times performance increase? The simplest case is to never send them at creation time at all, but only when writeback for files has started (or the cache-coherence algorithm requires it), so when for example a directory is created only a notification about the dirty parent dir is sent, and when a new file is created in this new dir, the content of the directory is transferred.

Anyway, pohmelfs currently has none of the features above, it is actually read-only, but I already see where it can be improved - for example directory listing (the ->readdir() callback) is invoked for each access (i.e. each ls /mnt forces the directory content to be resent), since pohmelfs does not cache it.

There is a fair number of changes I want to implement to catch up with CRFS (I think so :), so stay tuned, I will implement basic functionality first and will run the same tests too...

Making bets? I vote for slower than NFS speeds, because of bad userspace support and no caching of the metadata.
But pohmelfs has been in development for only 3 days, it is quite young... So, stay tuned.

/devel/fs :: Link / Comments (0)


Sun, 23 Dec 2007

Continuing apartment development.

I think I have finished gluing ceramic tiles, at least for this year: first, I have no glue anymore; second, I only have one vertical strip 2.5 tiles wide left to glue, where the water hatch will be located, and since I do not have a hatch, I am not gluing those tiles.
Dirty work in the bathroom has been essentially completed - I will fill the 2 mm gaps between tiles with plaster and attach the ceiling soon, and that will be the end.
The next dirty work is ceramic granite in the hall and checkroom - that will take some time, but since I have no glue and am not sure it will be delivered this year, I can postpone this task too. So, the main issues are finishing the painting and the table.
When my head is aching I frequently think up something really interesting and new, so I have a new design for the table in mind; if I complete it, that will be really great.
First, the table is not movable, but attached to a wall (potentially two walls in the corner). The end of the table that does not touch the wall is round and has a single leg (preferably a steel tube); the other part, which touches the wall, has a turn, so the table looks like the letter 'L' with the smaller part attached to the wall (and window), and the bigger part accessible from both sides.
Or something like that...

/devel/flat :: Link / Comments (0)


GPRS sucks!

Even the MTS one is so slow... It is enough to read emails and check some news (you know, I have infinite patience which almost never reaches its end), but I want a normal connection.
I know there is no Ded Moroz (Santa Claus), so there will be no fast internet until after the New Year vacations (10 days in Russia), when I will start kicking local ISPs again.

/other :: Link / Comments (0)


Sat, 22 Dec 2007

Meanwhile on the apartment development side.

I painted most of the room, and then decided to make one wall either ultramarine or just marine blue. Just because I want to. So I am waiting for the paint, otherwise the room would be completed.
I also glued some ceramic tiles in the bathroom - it is almost ready too. Today I spent most of the time cutting tiles with an angle grinder to make different shapes for corners, hatches, the door and so on. I became dirty as hell, but completed everything except corners and ceiling. Hopefully I will finish them tomorrow (or likely not :).

/devel/flat :: Link / Comments (0)


Fri, 21 Dec 2007

Anatomy of the filesystem ->readpage() callback.

This callback is used to read a page from the storage into RAM. It has the following prototype:

static int pohmelfs_readpage(struct file *file, struct page *page)
Where file is the object associated with the file opened in userspace, and page is the page where the filesystem has to put the data.
On-disk filesystems usually use VFS helpers (like mpage_readpage() or block_read_full_page()), which map the page into a set of buffer_head objects, which are then submitted to the block layer, where the next level of reading from the disk happens. This mapping is implemented via a per-filesystem get_block() callback.

Pohmelfs does not follow this standard, since it does not know which filesystem is on the remote side, and since there is no block device under it. So it just uses a request/reply protocol to get the given page from the remote host. The page structure already contains its offset from the beginning of the file (from the beginning of the address space actually), and it is locked, so simultaneous access is not possible; we only need to fetch the data and mark the page uptodate (if the copy was successful).
Simple.

Here is the result:
server $ md5sum /tmp/ltp-full-20071130.tgz
77bf4032c10c03e858512a5a90c05015  /tmp/ltp-full-20071130.tgz

client # md5sum /mnt/tmp/ltp-full-20071130.tgz
77bf4032c10c03e858512a5a90c05015  /mnt/tmp/ltp-full-20071130.tgz
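The request/reply ->readpage() logic described above can be modelled in userspace. This is a minimal sketch under assumptions: struct sketch_page, server_read() and sketch_readpage() are hypothetical names, the "server" is just an in-memory buffer, and no real networking or kernel API is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical stand-in for struct page: index in the address space, data, flag. */
struct sketch_page {
	size_t index;
	unsigned char data[SKETCH_PAGE_SIZE];
	int uptodate;
};

/* Hypothetical "server": answers an (offset, size) request from an in-memory file. */
static size_t server_read(const unsigned char *file, size_t file_size,
			  size_t offset, unsigned char *buf, size_t size)
{
	if (offset >= file_size)
		return 0;
	if (size > file_size - offset)
		size = file_size - offset;
	memcpy(buf, file + offset, size);
	return size;
}

/* Client side: request the page at page->index, zero the tail, mark it uptodate. */
static int sketch_readpage(const unsigned char *file, size_t file_size,
			   struct sketch_page *page)
{
	size_t got = server_read(file, file_size,
				 page->index * SKETCH_PAGE_SIZE,
				 page->data, SKETCH_PAGE_SIZE);

	/* A short reply means end of file: the rest of the page reads as zeroes. */
	memset(page->data + got, 0, SKETCH_PAGE_SIZE - got);
	page->uptodate = 1;
	return 0;
}
```

The offset sent to the server is derived only from the page index, which mirrors the point that the page already knows its position in the address space.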

/devel/fs :: Link / Comments (0)


Anatomy of the filesystem: ->lookup() and ->read_inode() callbacks. First pohmelfs results.

I talked about the ->readdir() callback previously, now it's time to get to the other two most significant callbacks in the VFS layer.
I call these three the most significant, since without them it is impossible to mount a filesystem and get data from it; they have to be implemented for any FS.

Ok, let's first look at ->lookup().
It has the following prototype:

struct dentry *pohmelfs_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd)
As the name suggests, this callback is used to look up the inode for a given directory entry.
One can check struct dentry: it contains a qstr field, which in turn has a char array containing the name, its length and its hashed value (used in the dentry cache).
When the inode number is found for the given directory entry, an inode has to be allocated and filled with metainformation. It then should be added to the dentry:
err = -ENOMEM;
inode = iget(dir->i_sb, cmd->ino);
if (!inode)
	goto err_out_free;

kfree(data);

d_add(dentry, inode);
That's all for this callback. Pohmelfs uses a simple request/reply protocol to get the inode for a given name. The userspace server is rather dumb and keeps a linked list (it will be changed to a tree) of all object names in a given directory, so it looks the parent directory up, then finds the given name in the dir, and then sends the data to the client. This operation can potentially be fast (only two tree lookups - one to get the parent dir in the main tree and one to find the object in the dir).
In future the pohmelfs client can cache the received information, so that subsequent access to the same dir would not require rather slow network operations. Right now it does not.
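The two-step server lookup (find the parent directory, then find the name inside it) can be sketched with plain linked lists. The structures and function names below are hypothetical and only illustrate the idea; the real fserver code is not shown in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical directory entry kept by the server: a name and its inode number. */
struct srv_entry {
	const char *name;
	uint64_t ino;
	struct srv_entry *next;
};

/* Hypothetical directory: its own inode number plus a list of its entries. */
struct srv_dir {
	uint64_t ino;
	struct srv_entry *entries;
	struct srv_dir *next;
};

/* Step 1: find the parent directory by its inode number. */
static struct srv_dir *srv_find_dir(struct srv_dir *dirs, uint64_t parent_ino)
{
	for (struct srv_dir *d = dirs; d; d = d->next)
		if (d->ino == parent_ino)
			return d;
	return NULL;
}

/* Step 2: find the name inside it. Returns the inode number, or 0 if not found. */
static uint64_t srv_lookup(struct srv_dir *dirs, uint64_t parent_ino,
			   const char *name, size_t len)
{
	struct srv_dir *d = srv_find_dir(dirs, parent_ino);

	if (!d)
		return 0;
	for (struct srv_entry *e = d->entries; e; e = e->next)
		if (strlen(e->name) == len && memcmp(e->name, name, len) == 0)
			return e->ino;
	return 0;
}
```

Both steps are linear scans here; replacing each list with a tree is what would turn them into the two tree lookups mentioned above.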

The second callback is ->read_inode(). As the name suggests, it has to read the inode's metainformation from disk to RAM. It has the following prototype:
static void pohmelfs_read_inode(struct inode *inode)
Quite simple. The following members have to be filled in this callback:
  • i_mode - file mode (file/dir/something else, access rights)
  • i_nlink - number of links to this inode
  • i_uid/i_gid - uid/gid of the owner
  • i_blocks - number of blocks allocated for this object on disk
  • i_rdev - if the object is a device special file, this holds the device numbers
  • i_size - size of the object
  • i_version - used by some filesystems to show that given inode is dead (or not uptodate)
  • i_blkbits - 1 shifted left by this number gives the filesystem block size
  • i_mtime/i_atime/i_ctime - modify/access/change time for given inode
  • i_fop - file operations for given inode, these operations include read/write/readdir/aio_read and so on
  • i_op - inode operations, this includes lookup
  • i_mapping->a_ops - address space operations, these include readpage/writepage/sync_page/prepare_write/commit_write operations
Pohmelfs uses a simple request/reply protocol to get this information from the remote server (except for the operations tables).
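A userspace server could gather most of these fields with stat(2). The sketch below is an assumption about how that might look, not the actual fserver code: struct sketch_inode and both functions are hypothetical, and st_blksize (the preferred I/O size) is used as a stand-in for the filesystem block size.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical userspace mirror of the inode fields listed above. */
struct sketch_inode {
	mode_t i_mode;
	nlink_t i_nlink;
	uid_t i_uid;
	gid_t i_gid;
	off_t i_size;
	blkcnt_t i_blocks;
	unsigned int i_blkbits;
	time_t i_mtime, i_atime, i_ctime;
};

/* Smallest number of bits such that (1 << bits) >= blksize. */
static unsigned int sketch_blkbits(unsigned long blksize)
{
	unsigned int bits = 0;

	while ((1UL << bits) < blksize)
		bits++;
	return bits;
}

/* Fill the sketch inode from stat(2); returns 0 on success, -1 on error. */
static int sketch_fill_inode(const char *path, struct sketch_inode *inode)
{
	struct stat st;

	if (stat(path, &st) < 0)
		return -1;

	inode->i_mode = st.st_mode;
	inode->i_nlink = st.st_nlink;
	inode->i_uid = st.st_uid;
	inode->i_gid = st.st_gid;
	inode->i_size = st.st_size;
	inode->i_blocks = st.st_blocks;
	/* Assumption: preferred I/O size stands in for the block size. */
	inode->i_blkbits = sketch_blkbits((unsigned long)st.st_blksize);
	inode->i_mtime = st.st_mtime;
	inode->i_atime = st.st_atime;
	inode->i_ctime = st.st_ctime;
	return 0;
}
```

The operations tables (i_fop, i_op, a_ops) are deliberately absent: they are chosen on the client from i_mode and never travel over the wire.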

Having that, one can create a simple
$ wc -l fs/pohmelfs/*.[ch]
   120 fs/pohmelfs/config.c
   218 fs/pohmelfs/dir.c
   417 fs/pohmelfs/inode.c
    96 fs/pohmelfs/net.c
   169 fs/pohmelfs/netfs.h
  1020 total
network filesystem, which allows reading data from the remote server
$ wc -l ./fserver/*.[ch]
   267 cfg.c
   750 fserver.c
   581 list.h
   390 rbtree.c
   164 rbtree.h
  2152 total
Note that I just took rbtree.[ch] and list.h from the kernel sources.

Here is an example on client machine:
# ./cfg -a 192.168.4.81 -p 10250 -i 0
# mount -t pohmel /dev/hdb1 /mnt

# ls -l /mnt/
total 88
drwxr-xr-x   2 root root  4096 2007-12-21 15:01 bin
drwxr-xr-x   4 root root  3072 2007-12-21 15:01 boot
drwxr-xr-x  11 root root  3780 2007-12-21 15:01 dev
drwxr-xr-x 105 root root 12288 2007-12-21 15:01 etc
drwxr-xr-x   6 root root  4096 2007-12-21 15:01 home
drwxr-xr-x  14 root root  4096 2007-12-21 15:01 lib
drwx------   2 root root 16384 2007-12-21 15:01 lost+found
drwxr-xr-x   2 root root  4096 2007-12-21 15:01 media
drwxr-xr-x   2 root root     0 2007-12-21 15:01 misc
drwxr-xr-x   4 root root    28 2007-12-21 15:01 mnt
drwxr-xr-x   2 root root     0 2007-12-21 15:01 net
drwxr-xr-x   2 root root  4096 2007-12-21 15:01 opt
dr-xr-xr-x 197 root root     0 2007-12-21 15:01 proc
drwxr-x---   9 root root  4096 2007-12-21 15:01 root
drwxr-xr-x   2 root root  4096 2007-12-21 15:01 sbin
drwxr-xr-x   5 root root     0 2007-12-21 15:01 selinux
drwxr-xr-x   3 root root  4096 2007-12-21 15:01 srv
drwxr-xr-x   5 root root  4096 2007-12-21 15:01 storage1
drwxr-xr-x   7 root root  4096 2007-12-21 15:01 storage2
drwxr-xr-x  12 root root     0 2007-12-21 15:01 sys
drwxrwxrwt  20 root root  4096 2007-12-21 15:01 tmp
drwxr-xr-x  13 root root  4096 2007-12-21 15:01 usr
drwxr-xr-x  23 root root  4096 2007-12-21 15:01 var

# mount | grep mnt
/dev/hdb1 on /mnt type pohmel (rw)
Believe it or not, that is exactly the content of '/' on my desktop, which is used as the server.

The next step is the readpage/writepage/prepare_write/commit_write callbacks, which will allow reading and writing files.
Stay tuned.

/devel/fs :: Link / Comments (0)


Thu, 20 Dec 2007

open-by-inode() vs. name lookup in network filesystems.

A network filesystem is a tricky beast - depending on where it is implemented (kernel or userspace) it is very different. By 'very' I mean really complex differences.

In the kernel an inode, the basic object's identity, always exists for all objects that were looked up before (until special steps are completed and the inode is dropped, but usually it stays alive). For example, when you traverse some dir, inodes for every object you checked continue to exist, even if you no longer use that directory. When a file is opened, the inode is attached to the file; when the file is closed, the inode lives on. This is a fundamental feature of the split between directory entries and inodes - directory entries are linked into the tree, which we can see, but inodes are shadow objects behind those entries.

In userspace things are completely different: there are no inodes, only files, identified by file descriptors. That's all. So, when the kernel performs a lookup, it checks some name in the inode with a given number - i.e. it performs an in-kernel reference-by-inode operation, but in userspace there is no API (except rare special cases, which I think Zach uses in CRFS, and that is likely a good speedup for btrfs) to get a file handle by inode number. Basically userspace should either hold an open file descriptor for the parent directory, or perform a reverse lookup, create a path and open the directory to check whether some object exists there, since userspace can only work with file descriptors.
open-by-inode was called fundamentally broken by Linus Torvalds for a number of reasons (namely because of races with directory layout changes like move and rename), and likely that is correct, but the absence of such an API greatly reduces the performance of userspace metadata operations.

Having the network fileserver in the kernel is of course much (MUCH) simpler and faster, but for now its implementation will be postponed a bit.
The initial server will be quite dumb - it will always perform a lookup from the root and always close the directory; later it will be possible to add a cache of opened directories...

/devel/fs :: Link / Comments (8)


Wed, 19 Dec 2007

Anatomy of the filesystem. ->readdir() callback.

Here I will write simple notes about how some callbacks are used in the Linux VFS and what a filesystem writer should implement to be correctly understood by the VFS layer.

Let's start with essentially the first callback invoked on a filesystem after it has been mounted. As the name suggests, ->readdir() is used to read directory content. Its prototype looks like this:

static int pohmelfs_readdir(struct file *filp, void *dirent, filldir_t filldir)
where filp is a file structure which is connected to the root inode (which you have to initialize in the ->fill_super() callback to be able to mount the fs). dirent is a magic structure, which hosts all the directory content you will read, and filldir is a function, which transforms directory names into the dirent structure.
Its prototype looks like this:
int filldir(void * __buf, const char * name, int namlen, loff_t offset,
	u64 ino, unsigned int d_type)
and is invoked this way:
size = 1;
if (filldir(dirent, ".", size, filp->f_pos, inode->i_ino, DT_DIR) < 0)
	return -1;
filp->f_pos += size;
I think every step is very straightforward, except the last two arguments: the former is the inode number, which is the unique id of the structure; every filesystem has to store it on disk for every inode. Obviously in Unix '.' refers to the current dir, so its inode number should be taken from the current dir inode. For the '..' directory, which is the parent of the given one, filldir() is executed in the following way:
size = 2;
if (filldir(dirent, "..", size, filp->f_pos, parent_ino(filp->f_path.dentry), DT_DIR) < 0)
	return -1;
filp->f_pos += size;
and for some other dir:
size = 8;
if (filldir(dirent, "test_dir", size, filp->f_pos, 14, DT_DIR) < 0)
	return -1;
filp->f_pos += size;
where '14' is inode number for 'test_dir' subdir.
Directory listing for this filesystem will look like this (data from live pohmelfs setup):
# ls -la /mnt/
total 9
drwxr-xr-x  1 root root 4096 1969-12-31 20:02 .
drwxr-xr-x 21 root root 1024 2007-02-08 15:04 ..
drwxr-xr-x  1 root root 4096 1969-12-31 20:02 test_dir

# mount | grep mnt
/dev/hdb1 on /mnt type pohmel (rw)
The last parameter of filldir() is the type of the directory entry; DT_DIR is for directories and it corresponds to bits 12-15 of the stat.st_mode returned from a stat() call.
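The relation between st_mode and the DT_* constants can be checked with a few lines of C. This is a small sketch for Linux with glibc; mode_to_dtype() is a hypothetical helper name.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <dirent.h>
#include <sys/stat.h>
#include <sys/types.h>

/* The file type lives in bits 12-15 of st_mode; shifted down it gives DT_*. */
static unsigned char mode_to_dtype(mode_t mode)
{
	return (unsigned char)((mode & S_IFMT) >> 12);
}
```

For example, mode_to_dtype(S_IFDIR | 0755) yields DT_DIR, so a server that only has a stat() result can still fill the last filldir() argument.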

Note that ->readdir() will be invoked (by ls -la at least) until filp->f_pos stops changing, so after you have filled your directory entry and properly updated filp->f_pos, you have to check whether filp->f_pos has reached the size of the directory (here I mean the overall size used by every copied directory entry), and if it has (is equal or greater), just return 0.
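The "call until f_pos stops changing" contract can be simulated in userspace. Everything below is a hypothetical model (sketch_file, sketch_readdir(), list_dir() and the simplified filldir signature are not kernel API); as in the examples above, the offset advances by the name length of each emitted entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_dirent {
	const char *name;
	unsigned long ino;
};

struct sketch_file {
	long f_pos;
};

/* Simplified filldir-like callback type (hypothetical, no d_type argument). */
typedef int (*sketch_filldir_t)(void *buf, const char *name, int namlen,
				long offset, unsigned long ino);

/* Example callback: just counts the entries it is given. */
static int count_filldir(void *buf, const char *name, int namlen,
			 long offset, unsigned long ino)
{
	(void)name; (void)namlen; (void)offset; (void)ino;
	(*(int *)buf)++;
	return 0;
}

/* Emit every entry starting at or after f_pos; do nothing once f_pos is past the end. */
static int sketch_readdir(struct sketch_file *filp, void *buf,
			  sketch_filldir_t filldir,
			  const struct sketch_dirent *ents, size_t nr)
{
	long off = 0;

	for (size_t i = 0; i < nr; i++) {
		int len = (int)strlen(ents[i].name);

		if (off == filp->f_pos) {
			if (filldir(buf, ents[i].name, len, filp->f_pos, ents[i].ino) < 0)
				return -1;
			filp->f_pos += len;
		}
		off += len;
	}
	return 0;
}

/* Caller side: keep invoking readdir until f_pos stops changing. */
static int list_dir(const struct sketch_dirent *ents, size_t nr)
{
	struct sketch_file filp = { 0 };
	int count = 0;
	long prev;

	do {
		prev = filp.f_pos;
		if (sketch_readdir(&filp, &count, count_filldir, ents, nr) < 0)
			return -1;
	} while (filp.f_pos != prev);
	return count;
}
```

In this model the second call finds no entry starting at the current offset, leaves f_pos untouched and returns 0, which is what ends the loop in the caller.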

So, how should a network filesystem behave here? The answer is pretty simple: it should just send a request to the remote server to provide a directory listing, copy the answer to an allocated buffer and fill the directory with the provided data. It is possible to cache that data here, but each subsequent ->readdir() has to check with the server that the data is still valid and has not changed.

With this work pohmelfs becomes a network filesystem, with many interesting features I have in mind, which I will reveal when they get implemented.
This is not intended for mainline inclusion, since Zach Brown's work was first and will likely be more stable and/or feature complete by the time this stuff becomes ready.

But nevertheless, stay tuned..

/devel/fs :: Link / Comments (0)


Tue, 18 Dec 2007

Fundamental race between block layer/IO and networking.

This post is about the impossibility of working without races with the network's ->sendpage() method, which is used mostly to transfer IO mapped pages, without either turning off offload capabilities and copying data into a new buffer, or using own acks in the protocol.

->sendpage() in the optimised case (hardware supports checksum offloading and scatter/gather) will not copy the content of the page to a new buffer, but instead will increase the page's reference counter, so that the page cannot be freed. When ->sendpage() returns, this does not guarantee that data was sent, received by the remote side or whatever, since the packet can be queued (in hardware or a qdisc) and it can be retransmitted later. There is no way to know that data was received until the ACK (let's talk about TCP) is received, but there is no API to know that the ACK was received. When the ACK is received, the appropriate packet will be found in the TCP retransmit queue and freed, which will drop the page's reference counter.
If the user expects (and there is actually no other way) that after ->sendpage() returns the data can be processed (for example rewritten), then there is a non-zero probability that the remote side will get this new data instead of the old, which can lead to state machine breaks and data corruption.
One can try to use sendfile() and simultaneously write data to the file - the remote side can get a mix of the old and new data. One can argue that using proper locking around sendfile() and write will help, but actually it will not - consider the case when we send only a single page: after sendfile() has returned, data can still be in the queue, so a subsequent write, which no longer races with sendfile() itself but still races with the actual data sending, will overwrite the data, and the remote side will get the new data instead of the old.

There are two fixes for this problem: the first is not to use ->sendpage() (or to use it with a copy of the data into a new buffer, which is essentially how the usual send() works), the second is to use a protocol-specific acknowledgement system, so that any subsequent operation on the given data would be postponed not until ->sendpage()/sendfile() returns, but until that ACK is received.
Both greatly harm performance.
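The race can be shown with a toy model that involves no real sockets: a transmit slot either keeps a reference to the caller's page (the zero-copy case) or a private copy (the usual send() case), and the page is rewritten before the queued data reaches the "wire". All names here are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define BUF_SIZE 16

/* Hypothetical transmit queue slot. */
struct tx_slot {
	const char *ref;	/* what will actually be put on the wire */
	char copy[BUF_SIZE];	/* private snapshot, used only by the copying path */
};

/* Zero-copy path: only remember where the caller's page is. */
static void queue_zero_copy(struct tx_slot *s, const char *page)
{
	s->ref = page;
}

/* Copying path: snapshot the page at queue time. */
static void queue_with_copy(struct tx_slot *s, const char *page)
{
	memcpy(s->copy, page, BUF_SIZE);
	s->ref = s->copy;
}

/* The queued packet finally leaves the queue (or is retransmitted). */
static void transmit(const struct tx_slot *s, char *wire)
{
	memcpy(wire, s->ref, BUF_SIZE);
}

/* Returns 1 if the receiver got the old data, 0 if it got the rewritten data. */
static int receiver_sees_old(int zero_copy)
{
	char page[BUF_SIZE] = "old data";
	char wire[BUF_SIZE];
	struct tx_slot slot;

	if (zero_copy)
		queue_zero_copy(&slot, page);
	else
		queue_with_copy(&slot, page);

	/* The send call has returned; the caller now rewrites the page. */
	strcpy(page, "new data");

	transmit(&slot, wire);
	return strcmp(wire, "old data") == 0;
}
```

With the copying path the receiver gets the old data; with the zero-copy path it gets whatever is in the page at transmit time, which is the problem described above.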

I would be really glad to find that my conclusions are incorrect.

/devel/fs :: Link / Comments (8)


Climbing evening.

That was a very good, although again somewhat shorter, training - most of it was devoted to the complex route that starts on the horizontal negative slope, which sucked power very quickly, so that at the end (after about 3 hours) I was not able to complete even small parts of it (while doing it quite stably at the beginning). The route especially requires back and arms, so after the training I feel tired as hell, which is great of course!
It was very good time there today!

/life :: Link / Comments (0)


Mon, 17 Dec 2007

New release of the distributed storage: Dancing with the smoked neutrino.

Short changelog:

  • new improved mirroring algorithm.
    This algorithm uses sliding window approach for full resync and write log for partial resync.
  • fixed a number of typos and debug cleanups
  • update inode size when linear algorithm changes the size of the storage at run time
  • extended number of sysfs files and documentation for them
  • fixed leak in local export node setup
  • name is 'Dancing with the smoked neutrino' now
Overall list of features of the DST can be found on project's homepage.

DST is also exported as a git tree available for clone and pull from here.

The interested reader can test DST with the 2.6.23 tree too (it should compile fine, but was not tested).

/devel/dst :: Link / Comments (4)


New distributed storage mirroring algorithm.

Resync logic - sliding window algorithm.

At startup the system checks the age (a unique cookie) of each node, and if it does not match that of the first node, it resyncs all data from the first node in the mirror to the others (the non-synced nodes). Each non-synced node has a window, which slides from the start of the node to the end. During resync all requests which enter the window are queued, thus the window has to be sufficiently small. When the window is synced from the other nodes, the queued requests are written and the window moves forward, so the next window is started only when the previous one is fully completed. When the window reaches the end of the node, the node is marked as synchronized.
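
A minimal sketch of the sliding window idea, in Python rather than kernel C. The function name full_resync(), the incoming_writes map and the handling of writes that land outside the window (written straight to both nodes) are assumptions made for the illustration; only the "queue what enters the window, replay it when the window is synced, then move forward" part comes from the description above.

```python
def full_resync(source, target, window_size, incoming_writes):
    """Copy source into target window by window (toy model).

    incoming_writes maps a window step number to a list of
    (offset, byte) writes that arrive while that window is being synced.
    """
    start, step = 0, 0
    while start < len(source):
        end = min(start + window_size, len(source))
        queued = []
        for offset, value in incoming_writes.get(step, []):
            if start <= offset < end:
                # Request enters the window: queue it until the window is done.
                queued.append((offset, value))
            else:
                # Assumption: requests outside the window go to both nodes.
                source[offset] = value
                target[offset] = value
        # Sync the window from the clean node.
        target[start:end] = source[start:end]
        # Replay the queued requests, then slide the window forward.
        for offset, value in queued:
            source[offset] = value
            target[offset] = value
        start = end
        step += 1
    return True  # the node can now be marked as synchronized


source = bytearray(b"abcdefgh")
target = bytearray(8)
writes = {0: [(1, ord("X")), (6, ord("Y"))], 1: [(0, ord("Z"))]}
synced = full_resync(source, target, window_size=4, incoming_writes=writes)
print(bytes(source), bytes(target))
```

The smaller the window, the fewer requests get delayed at any moment, which is why the window has to be kept sufficiently small.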

If the age of the node matches the first one, but its log contains a different number of write log entries compared to the first node (the first node is always considered clean), then a partial resync is scheduled. A partial resync will also be scheduled when the log entry pointed to by the resync index of the node contains an error.

The mechanism of this resync type is the following: the system selects a synced node (checking each node's flags), fetches the log entry pointed to by the resync index of the given node and resyncs data from the other nodes to the given one. Then it checks the rest of the write log for other failed writes, so that the next resync block would be fetched for them.

The mirroring log is used to store write request information. It is allocated on disk and in memory (sync happens each time the resync work queue fires), and eats about 1% of free RAM or disk (whichever is less). Each write updates the log, so when a node goes offline, its log will be updated with error values, so that these entries can be resynced when the node is back online. When the number of failed writes becomes equal to the number of entries in the write log, recovery becomes impossible (since old log entries were overwritten) and a full resync is scheduled.

This does not work well when there are multiple writes to the same location - they are considered different writes and thus will be resynced multiple times. The right solution is to check the log on each write, preferably with the log being not an array but a tree.
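
The write log behaviour can be sketched as a fixed-size ring. This is a toy Python model, not the DST code: the WriteLog class, its method names and the dedup flag are hypothetical, and the sorted set used for deduplication only stands in for the tree mentioned above.

```python
from collections import deque


class WriteLog:
    """Fixed-size ring of write records for one mirror node (toy model)."""

    def __init__(self, entries):
        self.entries = entries
        self.ring = deque(maxlen=entries)  # old records are overwritten
        self.failed = 0

    def record(self, sector, ok):
        # Every write updates the log; failed writes carry an error mark.
        self.ring.append((sector, ok))
        if not ok:
            self.failed += 1

    def resync_plan(self, dedup=False):
        # Too many failures: old entries were overwritten, log is useless.
        if self.failed >= self.entries:
            return "full", []
        sectors = [sector for sector, ok in self.ring if not ok]
        if dedup:
            # Illustration of the proposed improvement: one entry per location.
            sectors = sorted(set(sectors))
        return "partial", sectors


log = WriteLog(entries=4)
log.record(10, True)
log.record(11, False)   # node went offline
log.record(11, False)   # same location written again
plan_array = log.resync_plan()
plan_dedup = log.resync_plan(dedup=True)
log.record(12, False)
log.record(13, False)
plan_overflow = log.resync_plan()
print(plan_array, plan_dedup, plan_overflow)
```

With the plain array the same sector shows up twice in the partial resync plan; once the failure count reaches the log size, only a full resync is left.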

/devel/dst :: Link / Comments (0)


Fri, 14 Dec 2007

Linux Test Project on top of DST storage.

# pwd
/mnt/ltp-full-20071130

# ./runltp -p -f fs -d `pwd`/tmp
...
# cat /mnt/ltp-full-20071130/results/results.2007-12-14.11.21.41.17106 
Test Start Time: Fri Dec 14 11:21:41 2007
-----------------------------------------
Testcase                       Result     Exit Value
--------                       ------     ----------
gf01                           PASS       0    
gf02                           PASS       0    
gf03                           PASS       0    
gf04                           PASS       0    
gf05                           PASS       0    
gf06                           PASS       0    
gf07                           PASS       0    

-----------------------------------------------
Total Tests: 57
Total Failures: 0
Kernel Version: 2.6.22-rc5-dst
Machine Architecture: x86_64
Hostname: uganda

# mount | grep mnt
/dev/dst-storage-32 on /mnt type xfs (rw)

# cat /sys/devices/storage/n-0-ffff*/type
R: 192.168.4.81:1025
R: 192.168.4.81:1026
All 'fs' tests completed successfully, although I saw the following dump in dmesg:
[ 8398.605691] BUG: MAX_LOCK_DEPTH too low!
[ 8398.609641] turning off the locking correctness validator.
which is an XFS bug.

Since DST is quite a dumb device, these tests will not find tricky places, but they are good for generating high load on top of the given block device.

/devel/dst :: Link / Comments (0)


New release of the userspace network stack.

Changed the data reading function: now it does not copy the TCP header into the user's buffer, only data. Also forced the packet socket reading path to limit the maximum number of packets to be read which do not match the created netchannel.
As usual, the new release is available from the project homepage.

/devel/networking/unetstack :: Link / Comments (0)


New mirroring module in the distributed storage.

$ git-diff-index --stat HEAD drivers/block/dst/alg_mirror.c
 drivers/block/dst/alg_mirror.c |  745 ++++++++++++++++++++--------------------
 1 files changed, 364 insertions(+), 381 deletions(-)
It is cool and works well in my environment, but (like the previous one) it forces a total mirror resync after the main storage node reboots or crashes (if it is required, for example when the array was not yet in sync and the main node rebooted).
I want to extend the DST mirroring algorithm not to force a full resync, but to store a log of the writes on each node, so that when a new array starts, it would check not only the age of the nodes (a unique id stored at the end of each node; if it does not match, a total resync starts), but also the write log, so that if the latter does not match, only the selected regions would be synchronized.

Stay tuned...

/devel/dst :: Link / Comments (0)


Thu, 13 Dec 2007

Why pushing project into the kernel is not a main goal?..

One has to have some courage and not be afraid to throw something out and create new things instead of old ones, even if it requires a lot of effort and causes some problems in the short term.
So I've just erased the mirroring algorithm from DST and will rewrite it mostly from scratch, since I have a very interesting sync algorithm in mind, which will not require a clean/dirty bitmap.
Having DST in the kernel would not allow me to have such flexibility...

/devel/dst :: Link / Comments (0)


Wed, 12 Dec 2007

Climbing evening.

It was again a bit late training and thus shorter than usual, but nevertheless it was very saturated - I tried the old complex start on the horizontal negative slope and several times managed to complete it fully. That's a very interesting and complex trace itself, but some time ago I tried some of its bits and completed them. I think I can finish it without falls after several trainings, but right now I'm working on what I think is the most complex part: the power-sucking start.
A horizontal negative slope is usually a big problem for me because of my power endurance, and it also requires a very strong back in some movements, so right now I'm feeling that I still have some muscles in the body and they did not disappear after sitting in the chair most of the time.
Excellent time!

/life :: Link / Comments (0)


I was a bit pessimistic about DST design bugs.

Things are only bad when resync of the mirror node is in place...
I fixed both issues, but will spend additional time debugging and testing them, since I do not like how it was done. I think I will rewrite the mirroring resync logic.

Subrata Modak of IBM suggested using the Linux Test Project, which I found to have interesting benchmarks; while being mostly useful for filesystem development, they can still find some bugs in DST.

/devel/dst :: Link / Comments (0)


Shame on me or how complex are design bugs...

I have to admit that mirroring in DST is not currently well supported.
First, because of a bug I made in the early development stage. In DST there are two objects which represent a part of the storage. The first one is a node: this object contains information about the type of the storage and pointers to the structure which represents the low-level device itself (like a block device or a network connection). A network connection in turn is represented as a state structure, which contains the socket, the state machine for transferred data and so on. Nodes are used when a block IO request comes from the higher layer, and states are used when data is transferred via the network. The former uses fine-grained reference counters: when a node is being operated on (a request is processed), its reference counter is increased; if the operation becomes asynchronous (for example the sending queue is full and thus the block cannot be sent right now), then the block request is queued into the state's request list and the reference counter for the node is dropped. If it reaches zero, the node is freed, which in turn calls the exit callback for the state, which flushes the queue of requests.
Things seem simple and correct, but the devil is in the details - the async processing thread can enter the game at any point and process the state too, which leads to bugs.
Second, DST mirroring can eat all your memory during resync, since it does not check the amount of free RAM in the system and tries to allocate new pages until all memory is used. This is already fixed in the private tree though.
And the last (known) problem is the mirror bitmap - it uses a single bit for a single sector of the device, and although it uses vmalloc(), it is still too much RAM.
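
To get a feeling for the numbers, here is a small back-of-the-envelope calculation in Python. The 512-byte sector size and the 1 TiB device are assumptions chosen for the example, and the coarser one-bit-per-megabyte variant is shown purely for comparison, it is not something DST does.

```python
SECTOR_SIZE = 512  # bytes; assumed sector size
TIB = 1 << 40      # example device size (assumption)


def bitmap_bytes(device_bytes, bytes_per_bit=SECTOR_SIZE):
    """Memory needed for a bitmap with one bit per bytes_per_bit of device."""
    bits = -(-device_bytes // bytes_per_bit)  # ceiling division
    return -(-bits // 8)


per_sector = bitmap_bytes(TIB)              # one bit per sector
per_megabyte = bitmap_bytes(TIB, 1 << 20)   # hypothetical coarser bitmap

print(per_sector // (1 << 20), "MiB for one bit per sector")
print(per_megabyte // 1024, "KiB for one bit per megabyte")
```

Under these assumptions a single 1 TiB mirror needs 256 MiB of bitmap, which makes it clear why even vmalloc() does not really help.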

Back to fixing.

/devel/dst :: Link / Comments (0)


Tue, 11 Dec 2007

First pohmelfs dmesg and bits of Linux VFS internals.

[ 9941.748766] pohmelfs_alloc_inode, inode: ffff81003bc83ac8.
[ 9941.755070] pohmelfs_read_inode, inode: ffff81003bc83ae0, num: 12,
	inode is regular: 0, dir: 1, link: 0.
[ 9947.667710] pohmelfs_readdir: filp: ffff81003c5ad6a8, inode: ffff81003bc83ae0,
	dirent: ffff81003c5aff38, filldir: ffffffff8027f274.
[ 9950.283976] pohmelfs_readdir: filp: ffff81003e82faa8, inode: ffff81003bc83ae0,
	dirent: ffff81003a00ff38, filldir: ffffffff8027f274.
[10028.705354] pohmelfs_readdir: filp: ffff81003d4f1068, inode: ffff81003bc83ae0,
	dirent: ffff81003e10ff38, filldir: ffffffff8027f274.
[10095.745022] pohmelfs_lookup: dir: ffff81003bc83ae0, dentry: ffff81003b5343a0,
	nameidata: ffff81003e10fe88.
[10095.754922] pohmelfs_lookup: dir: ffff81003bc83ae0, dentry: ffff81003b5343a0,
	nameidata: ffff81003e10fdf8.

uganda:~# mount | grep pohmel
/dev/hdb1 on /mnt type pohmel (rw)

uganda:~# ls -la /mnt/
total 0
It is about 12kb of code just to register my own filesystem and provide a number of VFS callbacks, so that the filesystem can be mounted.
It is not possible to create files or directories since the directory lookup method is not implemented (it returns NULL); ls -l does not show any data since the ->readdir() callback does not fill directory entries, as there are no such objects in the filesystem at all.

As you understood, this is a fairly trivial implementation, which was created just as a reference point. So far it includes stubs for the following VFS methods:
  • basic address space operations:
    • ->readpage() which reads a page, usually implemented as a generic mpage_readpage(), which uses per-filesystem get_block() callback. This is called via read path, when file's page is not in the page cache yet.
    • ->writepage() - writes a page usually via generic block_write_full_page() helper, which uses per-filesystem get_block() callback. This is called by the VFS core when there is a need to write page from the cache to disk. This happens for example when you call sync and friends.
    • ->prepare_write()/->commit_write() - they are called via the write path (for example from generic_file_buffered_write()); these functions have to reserve space on disk, update related metadata and perform other private filesystem steps for the given page, which will be flushed to that on-disk area in ->writepage().
  • basic directory inode and file operations:
    • file operations include ->read() callback, which has to return -EISDIR, and ->readdir(), which has to read directory entries for given inode into provided buffer. Right now it is empty.
    • inode operations are used to create/remove/lookup and perform other tasks on directory content. Read-only filesystems only have to provide the ->lookup() callback, which is used to look up the inode for a given directory entry. Others have to implement a lot more operations: create, lookup, link, unlink, symlink, mkdir, rmdir, mknod, rename, setattr, a set of extended attribute operations and so on... Pohmelfs currently does not perform anything at all, but already provides an empty lookup callback.
  • basic file operations (file operations itself and inode operations for regular files):
    • file operations for regular files are currently those provided by generic_ro_fops, which include:
      • ->llseek() - generic_file_llseek() - seek inside file mapping, it just updates files current position and performs some checks, so it does not include anything filesystem specific.
      • ->read() - do_sync_read() - a helper used by read syscall, it will eventually call ->aio_read(), which is generic_file_aio_read() for this file operations, it will call ->readpage() for pages, which are not yet in the page cache
      • ->aio_read(), described above.
      • ->mmap() - generic_file_readonly_mmap() - it will set up mapping operations, which include only a fault handler, which in turn will call page_cache_read(), which ends up with ->readpage() calls. Of course mapping is a bit more complex task, but from the filesystem point of view that is all we have to know.
      • ->splice_read() - generic_file_splice_read() - this callback is used for splice system calls, which ends up calling the same ->readpage() callback for the set of pages, which are put into spliced buffer of pages.
    • inode operations for regular files are not needed if it is a read-only filesystem (although it can provide some useful callbacks like getting extended and usual attributes); for a usual filesystem at least the ->truncate() and ->getattr() callbacks are required.
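
The common thread in the list above is that most read-side callbacks end up in ->readpage() for pages which are not yet in the page cache. Below is a toy Python model of that path, not kernel code: ToyFile, its dictionary-based page cache and the readpage_calls counter are hypothetical names used only to show the control flow.

```python
PAGE_SIZE = 4096


class ToyFile:
    """Toy model of the read path: page cache lookup, readpage on a miss."""

    def __init__(self, backing):
        self.backing = backing   # bytes "on disk"
        self.page_cache = {}     # page index -> page content
        self.readpage_calls = 0

    def readpage(self, index):
        # Stand-in for the filesystem's ->readpage() callback.
        self.readpage_calls += 1
        start = index * PAGE_SIZE
        self.page_cache[index] = self.backing[start:start + PAGE_SIZE]

    def read(self, pos, count):
        # Stand-in for the generic read helper used by the read syscall.
        out = bytearray()
        end = min(pos + count, len(self.backing))
        while pos < end:
            index, offset = divmod(pos, PAGE_SIZE)
            if index not in self.page_cache:
                self.readpage(index)  # only called on a cache miss
            chunk = self.page_cache[index][offset:offset + (end - pos)]
            out += chunk
            pos += len(chunk)
        return bytes(out)


backing = bytes(range(256)) * 40  # 10240 bytes, i.e. two and a half pages
f = ToyFile(backing)
first = f.read(4000, 200)         # crosses the boundary of pages 0 and 1
calls_after_first = f.readpage_calls
second = f.read(4000, 200)        # now served from the cache
calls_after_second = f.readpage_calls
print(calls_after_first, calls_after_second)
```

The second read does not reach the filesystem at all, which is why a read-only filesystem can get away with so few callbacks of its own.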

/devel/fs :: Link / Comments (0)


Mon, 10 Dec 2007

I have started the layoff process.

Most of the projects have been moved to colleagues, talks with management completed.
Just waiting for tiny bits and that's all...

/devel/other :: Link / Comments (2)


PohmelFS.

linux-2.6.fs$ mkdir fs/pohmelfs
linux-2.6.fs$ date
Mon Dec 10 19:38:53 MSK 2007
Stay tuned...

This is a working name of the filesystem, I will think about release name later.
First I will implement a simple base, which will just register itself with the Linux VFS code, so I will put here some specs about what the Linux VFS requires from a filesystem. In parallel it will be used as a base for a network filesystem and/or a distributed/local filesystem.

/devel/fs :: Link / Comments (2)


New distributed storage release: Gamardjoba, genacvale!

Short changelog:

  • wakeup state when mirror detected error to speed up reconnect
  • if connecting in csum mode to no-csum server, do not enable csums
  • do not clean queue until all users are removed
  • allow to increase size of the storage in linear add callback (with this change it is possible to add nodes into linear array in real time without stopping storage. Filesystem has to be prepared for the case when underlying device has changed its size. Real-time addon of mirror nodes is also supported)
  • allow to delete gendisk only after device was started
  • dst debug config option
  • Name: Gamardjoba, genacvale! ('Hi friend' in Georgian)
Great thanks to Matthew Hodgson (matthew_mxtelecom.com) for debugging!

As usual, one can get new release from the project homepage.

/devel/dst :: Link / Comments (0)


Sat, 08 Dec 2007

Pancho Villa.

I spent an excellent time in this Mexican restaurant with friends. We celebrated Perec's birthday: tequila did flow plentifully, burritos were hot, and hours disappeared silently.
Excellent time!

/life :: Link / Comments (0)


Fri, 07 Dec 2007

Climbing evening.

It was quite a short and not very hard training - I was a bit later than usual, and most of the time I tried a quite old but very complex start on the horizontal negative slope. Meanwhile I talked with the instructor and found that the start in question is missing one hold which was there originally, so that should explain why I fell. I will continue that red trace next time, I even want to put a huge paper around another hold, located where the old one was: 'I'm a red hold, I'm just feigning'.

/life :: Link / Comments (0)


Strong checksums in DST rock.

Great thanks to the person who suggested that I implement them, and to Zach Brown, who showed that the Castagnoli CRC is a better one than Adler.

I've debugged a setup where the system failed to mount an XFS filesystem on top of distributed storage, and after turning on strong checksums, the system detected they were wrong, so some corruption happened during filesystem setup.
Turning off TSO, RX and TX offload of the e1000 NICs on the machines which form the storage fixed the problem.
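
For reference, this is what a Castagnoli CRC looks like in its simplest bit-by-bit form, next to Adler-32 from the Python standard library. It is only a sketch: the crc32c() function below is a slow textbook implementation written for the example, not the kernel's code, and the 512-byte block with one flipped bit is an invented test case. It shows corruption being detected, it does not try to demonstrate why one checksum is stronger than the other.

```python
import zlib

CRC32C_POLY = 0x82F63B78  # Castagnoli polynomial, reflected form


def crc32c(data):
    """Bit-by-bit CRC32C; slow but short (illustration only)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ CRC32C_POLY if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


# Standard check value for CRC-32C.
check = crc32c(b"123456789")

# A block as sent, and the same block with a single bit flipped on the way.
block = bytes(range(256)) * 2
corrupted = bytearray(block)
corrupted[100] ^= 0x04

crc_detects = crc32c(block) != crc32c(bytes(corrupted))
adler_detects = zlib.adler32(block) != zlib.adler32(bytes(corrupted))
print(hex(check), crc_detects, adler_detects)
```

A checksum computed above the NIC, like this one, is what exposes corruption introduced by the offload path itself.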

Strong checksums rock!

/devel/dst :: Link / Comments (3)


Distributed storage and long distances.

I've just completed some tests over the distributed system, created on top of usual internet links between machines, located in Moscow, Russia and London, UK.
A remote target was set up, then an XFS filesystem was created and mounted, and some tests were run.
One of the machines (main storage server) is located behind at least one NAT firewall.

/devel/dst :: Link / Comments (4)


The return of syslets.

Zach Brown announced a new syslet patchset aimed at simplifying and stabilizing basic async operations. Syslets are a mechanism for performing syscalls asynchronously - when a syscall is about to block, a new thread is started; execution blocks, and the old thread is scheduled away to the new one, on behalf of which userspace continues its execution.
Version 7 of the patchset was built on top of the indirect syscall; threadlets, userspace function execution and async IO were removed from the patchset for simplicity, and a number of comments and code clarifications were added.
The main goal of syslets right now is to make the fundamental things work right.

Asynchronous IO operations already have a long history - they were implemented as a state machine in KAIO and kevent AIO, and the kernel supports AIO for direct IO operations (userspace requires libaio).
The syslet approach was shown to be in some cases much slower than libaio (which is actually a sync operation for usual files), but this was attributed to unfairness of the CFS scheduler, and (iirc) it was fixed/extended.

My main objection against this is the fact that when you have thousands of actively running applications, the system starts sucking badly, but if it is possible to reduce the maximum number of working threads per user to some reasonable limit, things will be just fine. Syslets (and their more friendly user, threadlets) were supported by Linus and Ingo Molnar, so very likely this will be the default way to do asynchronous IO and other operations.
Right now Zach highlighted the following problems:

  • ring buffer of syslet statuses limitations
  • ptrace() problems
  • stale data (when thread issuing a syslet calls for example setuid(), in which case another thread, which actually executes blocked syscall, contains wrong data)
  • problems with sys_clone() and syslets, sys_clone() is actually a mechanism to create a new thread in syslets, so we get a recursion
All of the above problems are technically possible to resolve, and I think it is not that bad to introduce some simple limitations for users, so that the majority of async IO questions are resolved with this mechanism.

/devel/other :: Link / Comments (0)


B(something)-tree vs RB-tree. On-disk allocations.

In the previous article it was shown how btree and rbtree behave when allocations are done in memory. In such conditions btree should suck compared to rbtree, and generally it is true, although in some conditions its insert speed can be even slightly higher than that of rbtree.

Now, let's check how they behave when all allocations are performed from disk.
The graph below shows the insert speed for both rbtree and btree in such conditions; each node was allocated at a 1024+sizeof(node) offset from the previous one, so that readahead and thus cached disk pages would not influence the results.
In total 1 million keys were inserted into the tree.
Search speed is roughly the same as in the in-memory tests, since most of the tree sat in RAM after insertion.

B(something)-tree vs RB-tree. On-disk allocations

The high jump around 220 keys is likely the place where the node size becomes big enough, and the number of nodes small enough, that the whole tree started to fit in the page cache. In some cases there is no such peak and the graph slowly moves to around 40k insertions per second, which likely happens when some background task is actively using the page cache, flushing the test file's pages out of memory.

/devel/fs :: Link / Comments (0)


The most discouragement-resistant hacker out there.

That is what Jonathan Corbet calls me :)

/devel/other :: Link / Comments (0)


Thu, 06 Dec 2007

Multithreaded filesystem access.

Trees are generally (if not always) very bad at parallel access, since there is no good strategy for what to lock: tree modifications usually require changes to more than one node and in some cases (like b-tree or AVL tree) can lead to changes at every layer.
Thus it is much simpler to lock the whole tree during any change, but since not every node of the tree is in main memory and thus has to be fetched from disk, this can lead to long delays per operation.
In contrast, Linux VFS operates on pages, where each page is locked individually.
A similar change for hash tables (i.e. one lock per hash bucket) actually leads to lower performance than when the whole table is locked by a single lock, because the cache lines containing the per-bucket locks bounce badly; but this, again, only applies to main memory, since access to a single bucket in a hash table is usually quite cheap even if it contains several entries.
So, I do not know a perfect locking scheme for trees allocated on disk, and I will find that knowledge in experiments.
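
For readers who have not seen it, per-bucket locking looks roughly like this. It is a Python sketch with hypothetical names (BucketLockedTable, put(), get()); it only shows the structure of the scheme, since cache line bouncing between CPUs cannot be reproduced in an interpreter.

```python
import threading


class BucketLockedTable:
    """Hash table with one lock per bucket instead of one global lock."""

    def __init__(self, buckets=16):
        self.buckets = [dict() for _ in range(buckets)]
        self.locks = [threading.Lock() for _ in range(buckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        i = self._index(key)
        with self.locks[i]:       # only this bucket is blocked
            self.buckets[i][key] = value

    def get(self, key, default=None):
        i = self._index(key)
        with self.locks[i]:
            return self.buckets[i].get(key, default)


table = BucketLockedTable()


def worker(base):
    for i in range(1000):
        table.put(base + i, i)


threads = [threading.Thread(target=worker, args=(n * 1000,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(len(bucket) for bucket in table.buckets)
print(total, table.get(2500))
```

For an on-disk tree the question is what plays the role of the bucket: a node, a page or a whole subtree, and that is exactly what the experiments have to show.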

The best solution, and the one closest to real life, is of course a trivial filesystem.
Initially this will be a simple and very small kernel module with a basic filesystem in it, so that it can be trivially changed to support an on-disk filesystem and a network filesystem.

I have wanted to put my dirty hands into it for quite a while already, so it is time to start...
Stay tuned!

/devel/fs :: Link / Comments (2)


A simple way to crash machine using XFS and DST.

Let's suppose you want to create an XFS filesystem on top of a DST array. If you mistakenly run mkfs.xfs /dev/sda1 (let's suppose you want to create the DST storage on top of the /dev/sda1 device) and then start DST on top of /dev/sda1:

./dst -n storage -A alg_mirror -d /dev/sda1 -R -s0 -S0
this will overwrite the last sector of /dev/sda1, where XFS stores its metadata. Mounting XFS after that will almost certainly crash the machine on 2.6.22 kernels because of some bugs in XFS, which appear when XFS reads corrupted metadata from the last sector.

To work with DST you have to operate with /dev/dst-$storage-$num devices (i.e. run mkfs.xfs /dev/dst-$storage-$num), and not with underlying ones.

/devel/dst :: Link / Comments (0)


Wed, 05 Dec 2007

BTRFS 0.9 release.

Chris Mason announced new release of his btrfs filesystem.
It includes:

  • bigger filesystem block sizes
  • extended attributes (no ACL yet)
  • extent alignment parameter
  • inlining of the file data into btree
  • number of performance and stability improvements
Chris also showed a rough timeline for the filesystem development.

As he pointed out, btrfs is still very bad under database loads and does not support multithreaded operations.

As you probably got, the implemented inlining of file data into the btree is virtually the scaling inodes algorithm, although a bit simpler.
I do like btrfs, and wish great success to this filesystem. But only until I start my own :)
Kidding of course.

/devel/fs :: Link / Comments (0)


Storage hotplugging in DST.

For the interested reader: yes, it is possible to add disks into DST storage on the fly, but be sure that your filesystem supports that (in case of a linear setup); mirroring is fairly transparent.
The command to add another node to a mirror setup is pretty simple:

./dst -n storage -A alg_mirror -S0 -s0 -a kano -p 1026
Just like adding usual node into the storage before it was started.

Please note that when adding a node which is smaller than the current device size, the device size will be reduced and this can damage your filesystem!
The same applies to linear setup.

/devel/dst :: Link / Comments (0)


Tue, 04 Dec 2007

DST FAQ.

The most frequently asked question about DST is:

Can you give us a summary of how this differs from using device mapper with NBD or iSCSI?
Answer is quite simple:
From the higher point of view it does not, but it operates quite differently: it has async processing of the requests, thus it is not blocking; it has a different protocol with smaller overhead; it supports strong checksums; it has an in-kernel export server, which supports simple security attributes (i.e. allow to connect, to read or to write). It uses a smaller amount of memory (zero additional allocations in the common path for linear mapping, not including network allocations, and a smaller number of additional allocations for the mirroring case). DST supports failure recovery in case of a dropped connection (the core will reconnect to the remote node when it is ready), thus it is possible to turn remote nodes off and on without special administration steps. DST has simple autoconfiguration at startup time (checksum support and storage size autonegotiation). It is possible to turn one of the mirror nodes off and use it as an offline backup: since a DST mirror node stores its data at the end of the storage, it can be mounted locally.

/devel/dst :: Link / Comments (0)


New distributed storage subsystem release.

This is a maintenance release and includes bug fixes and simple feature extensions only.

Short changelog:

  • fixed bug with XFS metadata update (it can provide slab pages to the DST, so it is not allowed to transfer them using ->sendpage())
  • fixed async error completion path
  • extended netlink communication channel to report errors back to userspace
  • DST name is now "The 10'th dynasty of smuggled slothes"
  • number of fixes for userspace DST target
Great thanks to Matthew Hodgson (matthew_mxtelecom.com) for debugging and fixes for userspace DST target and preliminary netlink extension patches.

As usual you can download this release from the homepage.

If you want to try distributed storage this release is a really good candidate to start with.

Enjoy!

Update: This release includes bug fixes for all bugs described here, including uninterruptible sync read operations.

/devel/dst :: Link / Comments (2)


The 22'th century netchannels release.

This is the 22nd release of netchannels, a peer-to-peer protocol agnostic communication channel between hardware and users. It uses a unified cache to store channels, and allows buffers for data to be allocated from a userspace mapped area or from another preallocated set of pages (like the VFS cache). All protocol processing happens in process context.

A user of the system can be, for example, userspace - it allows one to receive and send traffic from the wire without any kernel interference, to implement one's own protocols and to offload their processing to the hardware.

This idea was originally proposed and implemented by Van Jacobson. This patchset (with the userspace network stack) is a logical continuation of the idea with a move to full peer-to-peer processing.

Short changelog:

  • update cached route in the netchannel when it expires
Thanks to Salvatore Del Popolo (delpopolo_dit.unitn.it) for testing.

You can get the latest sources from netchannels homepage.

The userspace network stack is available from its own homepage.

/devel/networking :: Link / Comments (0)


Mon, 03 Dec 2007

Climbing evening.

That was a great training; although I completed not that many traces, they were all good. Several old ones from really simple to quite interesting at the beginning, then a number of traverses and boulderings. When Grange arrived it was already quite late, so we made a couple of simple traces without rest in between, and then the greatest trace started: 7a+ (although I was allowed to modify it a bit, so I reduced its category down to 6c/6c+) over the black holds in the left center sector. I did not finish it (it has not been on-sight for quite a while already), since I was too tired, but I made the key point several times, which flushed me down to the bottom.
I was tired as hell and that was a great feeling!

/life :: Link / Comments (0)


Sun, 02 Dec 2007

Distributed filesystem roadmap.

  • Distributed storage. This step is mostly completed, although some bugs are there and there is a number of features to be implemented; work is being done on them and it is on the finish line. The feature list includes:
    • sync/barrier support
    • error report to userspace via netlink (patch was made by Matthew Hodgson (matthew_mxtelecom.com))
    • some thoughts about sync operations which can get stuck in uninterruptible state if there are some problems with remote nodes (Hi NFS), I will create a fix for this issue for DST at least.

    There is a nasty bug in DST currently, which I cannot reproduce locally, so I debug it with Matthew on his setup.
    There is also a fair number of fixes for the userspace DST target made by him. Great thanks!
  • Local filesystem with very scalable and fast on-disk format, possibilities to have on-line backups, snapshots, no fsck, scalable locking (multithreading reading and writing).
    This originally had a design very similar to btrfs, but I want to move further and have the ability to perform multithreaded and then multi-machine access to the same files. Call me a loser or a wheel reinventer (I would not be where I am if I cared about it), but I want to have a project where I know every single bit, to be able to fix things quickly and break something if it is needed for a better implementation.
  • linking both network and fs layers together, this will include distributed byte-range locking and cache coherency for client nodes. Bits of this step I described in short discussion with Zach Brown.
This does not mean the steps will be completed in the above order, I'm working in parallel in different directions and some parts can appear earlier, so that I would be able to evaluate their problems.
Bug fixes obviously have the highest priority.

/devel/fs :: Link / Comments (0)


Meanwhile on the apartment development side.

I reached a big milestone yesterday - I completed wallpaper glueing in the hall, room and checkroom. Although it still requires some fixes and bits of work with a knife, it is a huge step forward. I also wanted to paint the walls in the room yesterday, but fell in slack. Today is supposed to be another heavy working day - I will go to the development shop (on the opposite side of Moscow) to get a neon cord, a water system hatch and a ceiling for the bathroom. If I return not that late, I will start setting them up, otherwise I will paint the room.
I'm curious when I will have a real vacation and some rest, but I think I found an answer - tomorrow I will start the discharge process at work and expect it to be completed in a week or two at most, so I will have about two weeks before the new year celebrations. Most of them will be devoted to the apartment development though.
Well, we will definitely have some rest in an eternity...

/devel/flat :: Link / Comments (0)