Zbr's days.

Mon, 31 Dec 2007

Mostly completed apartment development.

I finished the painting, gluing, floor covering and lights, and called friends. There are a number of tasks left to complete, but they are mostly very small, so they can be postponed for a while. I am really excited about my loft: it looks very interesting to me, and there is nothing I would like to change or fix.
Groovy!

/devel/flat :: Link / Comments (0)


Sun, 30 Dec 2007

Meanwhile on the apartment development side.

A lot of changes. A huge step forward was made today (and last night). Right now I have completed my table (although it has only a single layer of varnish and an unfinished rim), mostly finished the kitchen (there was not enough wallpaper, but I painted the ceiling, glued all the wallpaper I had, and will lay the floor cover tomorrow), and finished painting the room (I have a blue wall now) and the hall (no uchuu yet). So the only thing really needed is to remove the huge amount of dust and garbage from the apartment and then lay the floor cover.
I think I'm ready for the New Year celebration, and the amount of work I have done will absolutely end with a good celebration.

/devel/flat :: Link / Comments (0)


Sat, 29 Dec 2007

POHMELFS abbreviation.

POHMELFS stands for Parallel Optimized Host Message Exchange Layered File System.

And it has a metadata cache on the client now. It contains just pohmelfs inodes, which are indexed by three keys: the first is a tuple of name hash, parent inode number and length of the name string (this is intended to guarantee that there are no identical tuples), the second is the inode number, and the last one is the offset in the address space of the parent.
The cache update operation is independent of its usage, although both are guarded by the same lock.
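To make the three-key indexing more concrete, here is a minimal userspace sketch. It is not the pohmelfs code: the structure names, the FNV-1a hash and the linear scan are all assumptions picked for illustration (the real client would presumably use a tree and whatever hash the dentry cache provides).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name key: (name hash, parent inode number, name length). */
struct name_key {
	uint32_t hash;
	uint64_t parent_ino;
	uint32_t len;
};

/* Hypothetical cached inode carrying all three index keys. */
struct cached_inode {
	struct name_key key;	/* key 1: name tuple */
	uint64_t ino;		/* key 2: inode number */
	uint64_t offset;	/* key 3: offset in the parent's address space */
};

/* FNV-1a, standing in for whatever hash the real code uses. */
static uint32_t name_hash(const char *name, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= (unsigned char)name[i];
		h *= 16777619u;
	}
	return h;
}

static struct name_key make_key(uint64_t parent_ino, const char *name)
{
	size_t len = strlen(name);
	struct name_key k = { name_hash(name, len), parent_ino, (uint32_t)len };

	return k;
}

/* Total order over name keys, so they could live in a tree or sorted array. */
static int name_key_cmp(const struct name_key *a, const struct name_key *b)
{
	if (a->hash != b->hash)
		return a->hash < b->hash ? -1 : 1;
	if (a->parent_ino != b->parent_ino)
		return a->parent_ino < b->parent_ino ? -1 : 1;
	if (a->len != b->len)
		return a->len < b->len ? -1 : 1;
	return 0;
}

/* Linear scan for brevity; returns NULL when the name is not cached. */
static struct cached_inode *cache_lookup_name(struct cached_inode *cache,
		size_t count, uint64_t parent_ino, const char *name)
{
	struct name_key k;
	size_t i;

	assert(cache != NULL && name != NULL);
	k = make_key(parent_ino, name);
	for (i = 0; i < count; i++)
		if (name_key_cmp(&cache[i].key, &k) == 0)
			return &cache[i];
	return NULL;
}
```

In a scheme like this, two different names of the same length in the same directory could still collide on the hash, so a real implementation would probably need to compare the names as well.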

/devel/fs :: Link / Comments (0)


Fri, 28 Dec 2007

Table development.

It so happened that I changed my table design once more; now it has a single right angle. Since my wooden door has gaps between the wood plates, I decided not to remove the orgalite (paper filled with glue) sheets from the top and bottom of the wood plates, but orgalite does not soak up mordant, so I will get a coloured varnish and cover the tabletop again. It is quite hard to work with wood, especially on straight edges, without a plane, using only a knife and an electric jigsaw, so I will buy a plane too. I also need a set of chisels.
Given that, the table is still at a very early development stage, but one can check a preliminary photo (taken with a phone, so the quality is not very good) here.

Since the table is postponed, I will paint the walls in the hall and try to clean my loft a bit. I would also like to set up a boiler and start the last arch (in place of the kitchen door), but that is too loud a process, so I will likely postpone it until tomorrow too.
It will be a busy day, and quite a short one actually - Mephody and Irin arrive, which is the start of the NY celebration process!

/devel/flat :: Link / Comments (0)


Thu, 27 Dec 2007

Meanwhile on the apartment development side: blue wall and table.

Yes, I've made it - now I have a blue wall. The colour is called 'royal marine', and although it does not look like the sea (I have not been near the sea for so many years that I do not even recall when the last time was, so maybe it has changed; and I have never seen an ocean at all, so I will change that too), it looks great. But to make me feel worse, the devil told me to buy too little paint, so I actually have only three quarters of the wall painted. I will fix it tomorrow, or on the day I go to the development shop to get an LED cord for the ceiling. So painting in the room and hall will be finished very soon.
The same roughly applies to my table development. Not that I made much progress, but I cleaned my old wooden entrance door (which is the base for the table), laid it on the floor and drew the table's contour on it. It looks very expressive and completely different from the pictures above: there are no straight lines (although its base is the letter 'L'), it will have only a single leg (I bought it today) at the end of the longer side, and its opposite side will be attached to the walls and will be level with the window-sill. Maybe later I will replace the single leg with one running from floor to ceiling - that side of the table is essentially round, so it will be convenient to put some round (glass) shelves there. I tried to saw the table out with my electric jigsaw, but it is quite a loud process and it is already about 24-00 in Moscow, so I postponed it until tomorrow or later. It has to be completed this year, since I need a table to put a lot of things on (for example, a fir tree made of the many empty beer bottles and jars I have here).
Since I have no chairs, I will make a couple of long benches if I have enough material (there are two doors here: the bigger one (about 200x90 cm) is used as the base for the table, the smaller one (200x60 cm) for the smaller part of the letter 'L', and the rest will be used for benches). I could get wood plates at the development shop (it has a lot of interesting types there), but I have some trouble getting them home without a car and do not want to wait for delivery (which would take about a week).
I also got a water hatch for my bathroom today, but I will not install it, since I have no glue for ceramic tiles, so it is better to devote this time to other interesting tasks.

Expect some photos of my loft closer to NY time...

/devel/flat :: Link / Comments (0)


Wed, 26 Dec 2007

New release of the distributed storage: Groundhogs strike back: no New Year for humans!

Short changelog:

  • mirroring algorithm improvements
  • debug cleanups
  • extended mirroring initialization
  • documentation update
  • name is 'Groundhogs strike back: no New Year for humans' now
As usual, one can get a patch or pull the changes from the project homepage.

/devel/dst :: Link / Comments (2)


CDMA (EVDO) vs GPRS.

Good people gave me a SkyLink CDMA modem (model USB CNU-550 with EVDO support) to test the internet connection in Linux and compare it against GPRS.

Well, here are my conclusions:

  • CDMA always works, while GPRS really sucks in the middle of the day. Well, it has to be proven tomorrow, but I tested the CDMA modem at about 19-00 and it worked ok, while MTS (Mobile TeleSystems in Russia) at about 12-00 worked very badly (I connected quickly, but ssh login took an enormously long time).
  • CDMA speed is usually about 10-16 kb/s, while GPRS usually cannot go higher than 1-2 kb/sec. More on this: I believe a CDMA session can reach higher speeds (the pppd session requests about 900 kbit/sec). The behaviour of the initial login (I saw quite a few of them at different initial speeds during usual work and userspace network stack testing) shows there is some limitation on the server or hardware side (i.e. on SkyLink's side, either because of a special tariff scale or driver limitations; I was told there is no 16 kb/sec limitation with the given card, though). A note for testing different congestion control algorithms and developing my own if needed: speed frequently drops below 10 kb/sec during download of a big file. SkyLink is a very interesting source of data for such development: it looks like its RTT is quite high for the default (new CUBIC) congestion control; at least a low-traffic source that wants very small latency (mutt over ssh on a remote host without serious traffic shaping) works quite badly in this setup. Very likely this is just empty speculation and the problem is in the hardware on the server (i.e. it supports bulk streaming access at high speeds (16 kb/sec), but fails with low-latency applications, which work in a small-packet ping-pong environment).
  • The CDMA USB CNU-550 modem works ok in Linux (modulo the above issues) with these peers/pap-secrets files, without any problems.
Anyway, CDMA SkyLink is much faster and smoother than MTS GPRS, so the decision about which is better is quite obvious...

/other :: Link / Comments (3)


Tue, 25 Dec 2007

Continuing CRFS debates.

Zach Brown again shed some light on his CRFS design and implementation. Let's compare the facts with my thoughts.

The most exciting news is that CRFS caches not only data but also metadata on the client, and flushes it to the server on writeback. That is what allows 4-6 times higher performance in metadata-intensive operations.

Another piece of news is actually quite bad for the majority of potential CRFS users: the userspace server is btrfs specific, which can be another gain in the benchmarks (although it should be noticeably smaller than the metadata caching part). The server does not require any additional patches, but since it is btrfs specific, it likely works on top of a ramdisk (when the test was performed with RAM storage), not tmpfs. The userspace server has exclusive access to the given block device, so the device cannot be simultaneously mounted the usual way (probably it is possible to mount it read-only whilst it is used by CRFS).
The client kernel module depends only on the ->write_begin()/->write_end() patchset by Nick Piggin, which was added to mainline recently.

Batching of network requests happens naturally in the request/reply protocol, but a reply covers not just a single request but a set of them; since the client caches metadata, it can check whether data is in the cache and update it if needed.

With that knowledge, let's summarize the given bits:

  • CRFS is btrfs specific, while pohmelfs is supposed to be fs-agnostic. This CRFS feature allows faster (probably even noticeably faster) access to on-disk data. Do not think it is a bad sign; consider it a client-server filesystem: no one claims AFS is bad because there is only an AFS-specific kernel server. Here it is the same, but the server is in userspace.
    From another point of view, not being able to work with the same btrfs volume locally can be a show-stopper for some users.
  • Metadata caching. That rocks. It has to be implemented.
  • Extended request/reply protocol: i.e. do not reply with only a single object (unless it was explicitly requested), but try to combine objects. The most obvious example is the ->readdir() callback, where each request from the client should transfer multiple objects, which will then be cached.
I think I was correct in most if not all of my predictions about CRFS; probably I should try forecasting the weather next time...

Given that, I have clear expectations of what pohmelfs should have and which results we should expect.
The CRFS project is a serious step forward in this area, so it is very exciting to work with its ideas and move further.
Stay tuned!

/devel/fs :: Link / Comments (2)


Mon, 24 Dec 2007

Climbing evening.

That was hard. That was really bloody hard, but great training. I climbed high over a number of new routes. First, for warming up, I tried something new without a label (a new yellow route in the left vertical sector); it turned out to be quite a complex route, so I completed it with a couple of falls, since I did not know its exact holds. The next several climbs were all on the same part of the complex route that starts on the horizontal negative slope (the red 7a route in the middle sector), but I skipped that start, since I wanted to know how the route behaves higher up; I already knew that its start is very complex and fully corresponds to its category.
There were also a couple of simpler routes I did for warming up and, at the very end, to completely burn off my strength and get the blood moving.
It was an excellent time!

/life :: Link / Comments (0)


First CRFS (cache coherent remote file system) results.

Zach Brown posted the first public results of his CRFS filesystem.
He compared NFS and CRFS with remote storage on disk (likely btrfs) and in RAM (tmpfs) for two operations: a big number of file/dir creations (a lot of metadata operations) with small writes (untarring a kernel archive), and reading all that data into RAM.
In both tests CRFS is noticeably faster: the metadata operation test (untarring a kernel archive) is 4 times faster for disk storage and about 6 times faster for RAM, and CRFS reading is about 1.8 times faster than NFS.

Very impressive results, although without knowledge of the CRFS internals it is quite hard to tell where and how such a gain was created, so I will handwave here :)
When CRFS is opened (if it will be), we will check my thoughts...

First, since there was a tmpfs test, the userspace server does not use anything btrfs specific (like open by inode). There is a possibility that btrfs exports some ioctls or that the kernel was patched, but right now I will not consider this a fact. So, first: the userspace server can work on top of any filesystem.

Second, reading is only 2 times faster, while metadata operations are 4-6 times faster. Zach says it is limited by disk speed, so this means metadata was heavily cached. There is a question, though: does the server see the last metadata change, or will it be sent to the server only when another client accesses the cached data (so that caches become coherent)? Taking into account that NFS always sends metadata changes, it looks like CRFS does not. If that is correct, then there is another question: does it need to send metadata updates at all until a sync or flush is started?

Third, the userspace server is fast. With the logic I described for the pohmelfs server, I think it will not be able to compete, so there is room for thought.

Fourth, the network protocol in CRFS batches requests. This can be done either because of a special transactional layer between VFS callbacks and the network, or because of the way VFS callbacks work: for example, data is not sent in the ->commit_write() callback, but only in ->writepage(), and only if there is a strong demand for that. The same applies to metadata operations: how are they batched and network communication reduced to get a 4-6 times performance increase? The simplest case is to never send them at creation time at all, but only when writeback for files has started (or the cache-coherence algorithm requires it), so when, for example, a directory is created, only a notification about the dirty parent dir is sent, and when a new file is created in this new dir, the content of the directory is transferred.

Anyway, pohmelfs currently has none of the features above, it is actually read-only, but I already see where it can be improved: for example, directory listing (the ->readdir() callback) is invoked for each access (i.e. each ls /mnt forces the directory content to be resent), since pohmelfs does not cache it.

There is a fair number of changes I want to implement to catch up with CRFS (I think so :), so stay tuned: I will implement basic functionality first and will run the same tests too...

Making bets? I vote for slower-than-NFS speeds, because of bad userspace support and no caching of the metadata.
But pohmelfs has been in development for only 3 days, it is quite young... So, stay tuned.

/devel/fs :: Link / Comments (0)


Sun, 23 Dec 2007

Continuing apartment development.

I think I have finished gluing ceramic tiles, at least for this year: first, I have no glue left; second, I only have to glue one vertical strip 2.5 tiles wide, where the water hatch will be located, and since I do not have a hatch, I am not gluing those tiles.
The dirty work in the bathroom is essentially complete - I will fill the 2 mm gaps between tiles with plaster and attach the ceiling soon, and that will be the end.
The next dirty job is the ceramic granite in the hall and checkroom. That will take some time, but since I have no glue and am not sure it will be delivered this year, I can postpone this task too. So the main issues are finishing the painting and the table.
When my head is aching I frequently think up something really interesting and new, so I have a new design for the table in mind; if I complete it, that will be really great.
First, the table is not movable but attached to a wall (potentially two walls in the corner). The end of the table that does not touch the wall is round and has a single leg (preferably a steel tube); the other part, which touches the wall, has a turn, so the table looks like the letter 'L' with the smaller part attached to the wall (and window), while the bigger part is accessible from both sides.
Or something like that...

/devel/flat :: Link / Comments (0)


GPRS sucks!

Even the MTS one is so slow... It is enough to read emails and check some news (you know, I have infinite patience which almost never reaches its end), but I want a normal connection.
I know there is no Ded Moroz (Santa Claus), so there will be no fast internet until after the New Year vacation (10 days in Russia), when I will start kicking local ISPs again.

/other :: Link / Comments (0)


Sat, 22 Dec 2007

Meanwhile on the apartment development side.

I painted most of the room, and then decided to make one wall either ultramarine or just marine blue. Just because I want to, so I am waiting for the paint; otherwise the room would be completed.
I also glued some bits of ceramic tile in the bathroom - it is almost ready too. Today I spent most of the time cutting tiles with an angle grinder to make different shapes for corners, hatches, the door and so on. I became dirty as hell, but completed everything except the corners and ceiling. Hopefully I will finish them tomorrow (or likely not :).

/devel/flat :: Link / Comments (0)


Fri, 21 Dec 2007

Anatomy of the filesystem ->readpage() callback.

This callback is used to read a page from the storage into RAM. It has the following prototype:

static int pohmelfs_readpage(struct file *file, struct page *page)
where file is the object associated with the file opened in userspace, and page is the page where the filesystem has to put the data.
On-disk filesystems usually use VFS helpers (like mpage_readpage() or block_read_full_page()), which map the page onto a set of buffer_head objects, which are then submitted to the block layer, where the next level of reading from the disk happens. This mapping is implemented via a per-filesystem get_block() callback.

Pohmelfs does not follow this standard, since it does not know which filesystem is on the remote side, and since there is no block device under it. So it just uses a request/reply protocol to get the given page from the remote host. The page structure already contains its offset from the beginning of the file (from the beginning of the address space, actually), and it is locked, so simultaneous access is not possible. So we only need to fetch the data and, if the copy was successful, mark the page as uptodate.
Simple.

Here is the result:
server $ md5sum /tmp/ltp-full-20071130.tgz
77bf4032c10c03e858512a5a90c05015  /tmp/ltp-full-20071130.tgz

client # md5sum /mnt/tmp/ltp-full-20071130.tgz
77bf4032c10c03e858512a5a90c05015  /mnt/tmp/ltp-full-20071130.tgz
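The request/reply read described above can be modelled in plain userspace C. This is only a sketch under assumptions: the model_* names are hypothetical, the "remote host" is just a buffer in memory, the page size is fixed at 4096 bytes, and zero-filling the tail of the last page is my assumption about what a reader would want, not something taken from the pohmelfs source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  4096

/* Hypothetical stand-in for struct page: index in the address space + data. */
struct model_page {
	unsigned long index;
	int uptodate;
	unsigned char data[MODEL_PAGE_SIZE];
};

/* Hypothetical stand-in for the remote server: a file held in memory. */
struct model_remote {
	const unsigned char *file;
	size_t size;
};

/* "Request/reply": ask for len bytes at offset, get back how many arrived. */
static size_t remote_read(const struct model_remote *r, uint64_t offset,
			  void *buf, size_t len)
{
	size_t n;

	if (offset >= r->size)
		return 0;
	n = r->size - (size_t)offset;
	if (n > len)
		n = len;
	memcpy(buf, r->file + offset, n);
	return n;
}

/* Fetch one page: offset comes from the page index, then mark it uptodate. */
static int model_readpage(const struct model_remote *r, struct model_page *page)
{
	uint64_t offset;
	size_t n;

	assert(r != NULL && page != NULL);
	offset = (uint64_t)page->index << MODEL_PAGE_SHIFT;
	n = remote_read(r, offset, page->data, MODEL_PAGE_SIZE);
	/* Assumption: bytes past end of file read as zeroes. */
	memset(page->data + n, 0, MODEL_PAGE_SIZE - n);
	page->uptodate = 1;
	return 0;
}
```

The only per-page state the reader needs is the page index, which is why the callback itself stays so small.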

/devel/fs :: Link / Comments (0)


Anatomy of the filesystem. ->lookup() and ->read_inode() callbacks. First pohmelfs results.

I talked about the ->readdir() callback previously; now it is time to cover the other two most significant callbacks in the VFS layer.
I call these three the most significant, since without them it is impossible to mount a filesystem and get data from it; they have to be implemented for any FS.

Ok, let's first look at ->lookup().
It has the following prototype:

struct dentry *pohmelfs_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd)
As the name suggests, this callback is used to look up the inode for a given directory entry.
One can check struct dentry: it contains a qstr field, which in turn holds the char array containing the name, along with its length and hashed value (used in the dentry cache).
When the inode number is found for the given directory entry, the inode has to be allocated and filled with metainformation. It should then be added to the dentry:
err = -ENOMEM;
inode = iget(dir->i_sb, cmd->ino);
if (!inode)
	goto err_out_free;

kfree(data);

d_add(dentry, inode);
That's all for this callback. Pohmelfs uses a simple request/reply protocol to get the inode for a given name. The userspace server is rather dumb and keeps a linked list (it will be changed to a tree) of all object names in a given directory, so it looks the parent directory up, then finds the given name in the dir, and then sends the data to the client. This operation can potentially be fast (only two tree lookups: one to get the parent dir in the main tree and one to find the object in the dir).
In the future the pohmelfs client can cache the received information, so that subsequent access to the same dir would not require rather slow network operations. Right now it does not.
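The two-step server lookup (find the parent directory, then find the name inside it) could look roughly like the sketch below. It is not the fserver code: the srv_* structures and the use of plain singly linked lists for both steps are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical object (file or dir) known to the server. */
struct srv_obj {
	const char *name;
	uint64_t ino;
	struct srv_obj *next;
};

/* Hypothetical directory: its inode number and the list of its objects. */
struct srv_dir {
	uint64_t ino;
	struct srv_obj *objects;
	struct srv_dir *next;
};

/*
 * Step 1: find the parent directory by inode number.
 * Step 2: find the name inside that directory.
 * Returns 0 and stores the inode number, or -ENOENT.
 */
static int srv_lookup(const struct srv_dir *dirs, uint64_t parent_ino,
		      const char *name, uint64_t *ino)
{
	const struct srv_dir *d;
	const struct srv_obj *o;

	assert(name != NULL && ino != NULL);

	for (d = dirs; d; d = d->next)
		if (d->ino == parent_ino)
			break;
	if (!d)
		return -ENOENT;

	for (o = d->objects; o; o = o->next) {
		if (strcmp(o->name, name) == 0) {
			*ino = o->ino;
			return 0;
		}
	}
	return -ENOENT;
}
```

Replacing either list with a tree changes only the two loops; the request/reply shape (parent inode and name in, inode number out) stays the same.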

The second callback is ->read_inode(). As the name suggests, it has to read the inode's metainformation from disk into RAM. It has the following prototype:
static void pohmelfs_read_inode(struct inode *inode)
Quite simple. The following members have to be filled in this callback:
  • i_mode - file mode (file/dir/something else, access rights)
  • i_nlink - number of links to this inode
  • i_uid/i_gid - uid/gid of the owner
  • i_blocks - number of blocks allocated for this object on disk
  • i_rdev - if the object is not a regular file, this will hold device numbers
  • i_size - size of the object
  • i_version - used by some filesystems to show that the given inode is dead (or not uptodate)
  • i_blkbits - 1 shifted left by this number gives the filesystem block size
  • i_mtime/i_atime/i_ctime - modify/access/change time for the given inode
  • i_fop - file operations for the given inode; these include read/write/readdir/aio_read and so on
  • i_op - inode operations; these include lookup
  • a_ops (reached via i_mapping) - address space operations; these include the readpage/writepage/sync_page/prepare_write/commit_write operations
Pohmelfs uses a simple request/reply protocol to get this information from the remote server (except the various operation tables).

Having that, one can create a simple
$ wc -l fs/pohmelfs/*.[ch]
   120 fs/pohmelfs/config.c
   218 fs/pohmelfs/dir.c
   417 fs/pohmelfs/inode.c
    96 fs/pohmelfs/net.c
   169 fs/pohmelfs/netfs.h
  1020 total
network filesystem, which allows reading data from the remote server
$ wc -l ./fserver/*.[ch]
   267 cfg.c
   750 fserver.c
   581 list.h
   390 rbtree.c
   164 rbtree.h
  2152 total
Note that I just took rbtree.[ch] and list.h from the kernel sources.

Here is an example on client machine:
# ./cfg -a 192.168.4.81 -p 10250 -i 0
# mount -t pohmel /dev/hdb1 /mnt

# ls -l /mnt/
total 88
drwxr-xr-x   2 root root  4096 2007-12-21 15:01 bin
drwxr-xr-x   4 root root  3072 2007-12-21 15:01 boot
drwxr-xr-x  11 root root  3780 2007-12-21 15:01 dev
drwxr-xr-x 105 root root 12288 2007-12-21 15:01 etc
drwxr-xr-x   6 root root  4096 2007-12-21 15:01 home
drwxr-xr-x  14 root root  4096 2007-12-21 15:01 lib
drwx------   2 root root 16384 2007-12-21 15:01 lost+found
drwxr-xr-x   2 root root  4096 2007-12-21 15:01 media
drwxr-xr-x   2 root root     0 2007-12-21 15:01 misc
drwxr-xr-x   4 root root    28 2007-12-21 15:01 mnt
drwxr-xr-x   2 root root     0 2007-12-21 15:01 net
drwxr-xr-x   2 root root  4096 2007-12-21 15:01 opt
dr-xr-xr-x 197 root root     0 2007-12-21 15:01 proc
drwxr-x---   9 root root  4096 2007-12-21 15:01 root
drwxr-xr-x   2 root root  4096 2007-12-21 15:01 sbin
drwxr-xr-x   5 root root     0 2007-12-21 15:01 selinux
drwxr-xr-x   3 root root  4096 2007-12-21 15:01 srv
drwxr-xr-x   5 root root  4096 2007-12-21 15:01 storage1
drwxr-xr-x   7 root root  4096 2007-12-21 15:01 storage2
drwxr-xr-x  12 root root     0 2007-12-21 15:01 sys
drwxrwxrwt  20 root root  4096 2007-12-21 15:01 tmp
drwxr-xr-x  13 root root  4096 2007-12-21 15:01 usr
drwxr-xr-x  23 root root  4096 2007-12-21 15:01 var

# mount | grep mnt
/dev/hdb1 on /mnt type pohmel (rw)
Believe it or not, that is exactly the content of '/' on my desktop, which is used as the server.

The next step is the readpage/writepage/prepare_write/commit_write callbacks, which will allow reading and writing files.
Stay tuned.

/devel/fs :: Link / Comments (0)


Thu, 20 Dec 2007

open-by-inode() vs. name lookup in network filesystems.

A network filesystem is a tricky bastard - depending on where it is implemented (kernel or userspace) it is very different. By 'very' I mean really complex differences.

In the kernel, the inode, or an object's basic identity, always exists for all objects checked before (until special steps are completed and the inode is dropped, but usually it stays alive). For example, when you traverse some dir, inodes for every object you checked continue to exist, even if you no longer use that directory. When a file is opened, the inode is attached to the file; when the file is closed, the inode lives on. This is a fundamental feature of the split between directory entries and inodes: directory entries are linked into the tree, which we can see, but inodes are shadow objects behind those entries.

In userspace things are completely different: there are no inodes, only files, identified by file descriptors. That's all. So, when the kernel performs a lookup, it checks some name in the inode with a given number - i.e. it performs an in-kernel reference-by-inode operation - but in userspace there is no API (except rare special cases, which I think Zach uses in CRFS, and that is likely a good speedup for Btrfs) to get a file handle by inode number. Basically, userspace should either have an open file descriptor for the parent directory, or perform a reverse lookup, create a path and open the directory to check if some object exists there, since userspace can only work with file descriptors.
open-by-inode was marked by Linus Torvalds as fundamentally broken for a number of reasons (namely because of races with directory layout changes like move and rename), and likely that is correct, but the absence of such an API greatly reduces the performance of userspace metadata operations.

Having the network fileserver in the kernel is of course much (MUCH) simpler and faster, but for now its implementation will be postponed a bit.
The initial server will be quite dumb - it will always perform a lookup from the root and always close the directory; later it will be possible to add a cache of opened directories...

/devel/fs :: Link / Comments (8)


Wed, 19 Dec 2007

Anatomy of the filesystem. ->readdir() callback.

Here I will write simple notes about how some callbacks are used in the Linux VFS and what a filesystem writer should implement to be correctly understood by the VFS layer.

Let's start with essentially the first callback invoked on a filesystem after it has been mounted. As the name suggests, ->readdir() is used to read directory content. Its prototype looks like this:

static int pohmelfs_readdir(struct file *filp, void *dirent, filldir_t filldir)
where filp is a file structure connected to the root inode (which you have to initialize in the ->fill_super() callback to be able to mount the fs). Dirent is a magic structure, which hosts all the directory content you will read, and filldir is a function which transforms directory names into the dirent structure.
Its prototype looks like this:
int filldir(void * __buf, const char * name, int namlen, loff_t offset,
	u64 ino, unsigned int d_type)
and is invoked this way:
size = 1;
if (filldir(dirent, ".", size, filp->f_pos, inode->i_ino, DT_DIR) < 0)
	return -1;
filp->f_pos += size;
I think every step is very straightforward, except the last two arguments. The former is the inode number, which is the unique id of the inode; every filesystem has to store it on disk for every inode. Obviously, in Unix '.' refers to the current dir, so its inode number should be taken from the current dir's inode. For the '..' directory, which is the parent of the given one, filldir() is executed the following way:
size = 2;
if (filldir(dirent, "..", size, filp->f_pos, parent_ino(filp->f_path.dentry), DT_DIR) < 0)
	return -1;
filp->f_pos += size;
and for some other dir:
size = 8;
if (filldir(dirent, "test_dir", size, filp->f_pos, 14, DT_DIR) < 0)
	return -1;
filp->f_pos += size;
where '14' is the inode number of the 'test_dir' subdir.
The directory listing for this filesystem will look like this (data from a live pohmelfs setup):
# ls -la /mnt/
total 9
drwxr-xr-x  1 root root 4096 1969-12-31 20:02 .
drwxr-xr-x 21 root root 1024 2007-02-08 15:04 ..
drwxr-xr-x  1 root root 4096 1969-12-31 20:02 test_dir

# mount | grep mnt
/dev/hdb1 on /mnt type pohmel (rw)
The last parameter of filldir() is the type of the directory entry; DT_DIR is for directories, and it corresponds to bits 12-15 of the stat.st_mode returned from a stat() call.

Note that ->readdir() will be invoked (by ls -la at least) until filp->f_pos stops changing, so after you have filled your directory entry and properly updated filp->f_pos, you have to check whether the provided filp->f_pos has reached or exceeded the size of the directory (here I mean the overall size used by every copied directory entry), and if it has, just return 0.
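The relation between the DT_* constants and st_mode can be checked from userspace with a few lines. This is a sketch assuming Linux with glibc, where the file type field occupies bits 12-15 of st_mode; mode_to_dtype() is a hypothetical helper name (glibc offers an IFTODT() macro for the same purpose).

```c
#define _DEFAULT_SOURCE /* exposes DT_* and S_IF* under strict -std= modes (glibc) */
#include <assert.h>
#include <dirent.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: derive the d_type value from the st_mode type bits. */
static unsigned char mode_to_dtype(mode_t mode)
{
	/* Assumption made explicit: the type field is exactly bits 12-15. */
	assert(S_IFMT == 0170000);
	return (unsigned char)((mode & S_IFMT) >> 12);
}
```

For a directory, mode_to_dtype(S_IFDIR | 0755) gives DT_DIR, which is the value passed as the last filldir() argument in the snippets above.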
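The "call until f_pos stops changing" contract can be modelled in userspace. The sketch below is an assumption-laden toy, not kernel code: model_dir, emit_fn and list_dir() are hypothetical names, each call emits one entry, and the position advances by the name length, as in the snippets above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical directory: just an array of names. */
struct model_dir {
	const char **names;
	size_t count;
};

/* Hypothetical filldir-like callback. */
typedef int (*emit_fn)(void *buf, const char *name, size_t namelen, long offset);

/* Overall size used by every directory entry (sum of name lengths). */
static long dir_size(const struct model_dir *d)
{
	long size = 0;
	size_t i;

	for (i = 0; i < d->count; i++)
		size += (long)strlen(d->names[i]);
	return size;
}

/* One ->readdir()-like call: emit the entry at *pos, then advance *pos. */
static int model_readdir(const struct model_dir *d, long *pos, void *buf, emit_fn emit)
{
	long off = 0;
	size_t i;

	assert(d != NULL && pos != NULL && emit != NULL);
	if (*pos >= dir_size(d))
		return 0;	/* end of directory: position is left unchanged */

	for (i = 0; i < d->count; i++) {
		long len = (long)strlen(d->names[i]);

		if (off == *pos) {
			if (emit(buf, d->names[i], (size_t)len, *pos) < 0)
				return -1;
			*pos += len;
			return 0;
		}
		off += len;
	}
	return 0;
}

/* Example callback: count the entries seen. */
static int count_entry(void *buf, const char *name, size_t namelen, long offset)
{
	(void)name; (void)namelen; (void)offset;
	(*(int *)buf)++;
	return 0;
}

/* Caller side: keep calling until the position stops changing. */
static int list_dir(const struct model_dir *d, long *final_pos)
{
	long pos = 0, prev;
	int entries = 0;

	do {
		prev = pos;
		if (model_readdir(d, &pos, &entries, count_entry) < 0)
			return -1;
	} while (pos != prev);

	*final_pos = pos;
	return entries;
}
```

With the three entries from the listing above ('.', '..' and 'test_dir'), the caller sees three entries and stops at position 1 + 2 + 8 = 11.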

So, how should a network filesystem behave here? The answer is pretty simple: it should just send a request to the remote server for a directory listing, copy the answer to an allocated buffer and fill the directory with the provided data. It is possible to cache that data here, but each subsequent ->readdir() has to check with the server that the data is still valid and has not changed.

With this work pohmelfs becomes a network filesystem, with many interesting features I have in mind, which I will reveal when they get implemented.
This is not intended for mainline inclusion, since Zach Brown's work was first and will likely be more stable and/or feature-complete by the time this stuff becomes ready.

But nevertheless, stay tuned..

/devel/fs :: Link / Comments (0)


Tue, 18 Dec 2007

Fundamental race between block layer/IO and networking.

This post is about the impossibility of working without races with the network's ->sendpage() method, which is used mostly to transfer IO mapped pages, unless one either turns off offload capabilities and copies data into a new buffer, or uses own acks in the protocol.

->sendpage() in the optimised case (hardware supports checksum offloading and scatter/gather) will not copy the content of the page to a new buffer, but instead will increase the page's reference counter, so that the page cannot be freed. When ->sendpage() returns, this does not guarantee that the data was sent, received by the remote side or whatever, since the packet can be queued (in hardware or a qdisc) and can later be retransmitted. There is no way to know that the data was received until an ACK (let's talk about TCP) is received, but there is no API to learn that the ACK was received. When the ACK is received, the appropriate packet will be found in the TCP retransmit queue and freed, which drops the page's reference counter.
If the user expects (and there is actually no other way) that after ->sendpage() returns the data can be processed (for example rewritten), then there is a non-zero probability that the remote side will get this new data instead of the old, which can lead to state machine breaks and data corruption.
One can try to use sendfile() and simultaneously write data to the file - the remote side can get a mix of the old and new data. One can argue that using proper locking around sendfile() and write will help, but actually it will not. Consider the case when we send only a single page: after sendfile() has returned, the data can still be in the queue, so a subsequent write, which no longer races with sendfile() itself but does race with the actual data sending, will overwrite the data, and the remote side will get the new data instead of the old.

There are two fixes for this problem: the first is not to use ->sendpage() (or to use it with a copy of the data into a new buffer, which is essentially how the usual send() works); the second is to use a protocol-specific acknowledgement system, so that any subsequent operation on the given data is postponed not until ->sendpage()/sendfile() returns, but until that ACK is received.
Both greatly harm performance.
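The race can be shown with a toy userspace model. Nothing here is kernel code: queued_packet, queue_by_reference(), queue_by_copy() and transmit() are hypothetical names, and "transmit" simply means "whatever the queue holds at the moment the bytes finally go out".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_PAYLOAD 64

/* Hypothetical queued packet: either references the page or owns a copy. */
struct queued_packet {
	const unsigned char *ref;		/* zero-copy: points into the caller's page */
	unsigned char copy[MODEL_MAX_PAYLOAD];	/* private copy, as send() would make */
	size_t len;
	int copied;
};

/* Models ->sendpage() with offloads: returns at once, keeps only a reference. */
static void queue_by_reference(struct queued_packet *p,
			       const unsigned char *page, size_t len)
{
	assert(len <= MODEL_MAX_PAYLOAD);
	p->ref = page;
	p->len = len;
	p->copied = 0;
}

/* Models the first fix: copy the data into a new buffer before returning. */
static void queue_by_copy(struct queued_packet *p,
			  const unsigned char *page, size_t len)
{
	assert(len <= MODEL_MAX_PAYLOAD);
	memcpy(p->copy, page, len);
	p->ref = NULL;
	p->len = len;
	p->copied = 1;
}

/* Models the later (re)transmission: this is what the remote side receives. */
static size_t transmit(const struct queued_packet *p, unsigned char *wire)
{
	memcpy(wire, p->copied ? p->copy : p->ref, p->len);
	return p->len;
}
```

If the page is rewritten between queue_by_reference() and transmit(), the wire carries the new bytes; with queue_by_copy() it carries the old ones, at the price of the extra copy. The second fix from the text would instead delay the rewrite until a protocol-level ACK arrives.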

I would be really glad to find that my conclusions are incorrect.

/devel/fs :: Link / Comments (8)


Climbing evening.

That was a very good although again a bit shorter training session - most of it was devoted to the complex route with the start on the horizontal negative slope, which drained my strength very quickly, so that at the end (after about 3 hours) I was not able to complete even small parts of it (while doing it quite stably at the beginning). The route demands a lot from the back and arms especially, so after the training I feel tired as hell, which is great of course!
It was a very good time there today!

/life :: Link / Comments (0)


Mon, 17 Dec 2007

New release of the distributed storage: Dancing with the smoked neutrino.

Short changelog:

  • new improved mirroring algorithm.
    This algorithm uses sliding window approach for full resync and write log for partial resync.
  • fixed a number of typos and debug cleanups
  • update inode size when linear algorithm changes the size of the storage in run time
  • extended number of sysfs files and documentation for them
  • fixed leak in local export node setup
  • name is 'Dancing with the smoked neutrino' now
Overall list of features of the DST can be found on project's homepage.

DST is also exported as a git tree available for clone and pull from here.

Interested reader can test DST with 2.6.23 tree too (it should compile fine, but was not tested).

/devel/dst :: Link / Comments (4)


New distributed storage mirroring algorithm.

Resync logic - sliding window algorithm.

At startup the system checks the age (a unique cookie) of each node, and if it does not match that of the first node, it resyncs all data from the first node in the mirror to the others (the non-synced nodes). Each non-synced node has a window, which slides from the start of the node to the end. During resync all requests which enter the window are queued, thus the window has to be sufficiently small. When the window has been synced from the other nodes, the queued requests are written and the window moves forward, so the next window's resync starts only when the previous window is fully completed. When the window reaches the end of the node, the node is marked as synchronized.
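
The sliding window described above can be sketched as a small simulation. This is not the DST code: full_resync() and the incoming_writes argument are hypothetical names used only to drive the example, and the handling of writes that fall outside the current window (written through at once, and mirrored if they land in the already synced part) is an assumption of this sketch.

```python
def full_resync(source, target, window, incoming_writes):
    """Copy source to target one window at a time.

    incoming_writes maps a window start to the (index, value) writes that
    arrive while that window is being synced (hypothetical test input).
    """
    start = 0
    while start < len(source):
        end = min(start + window, len(source))
        queued = []
        for index, value in incoming_writes.get(start, []):
            if start <= index < end:
                queued.append((index, value))   # inside the window: queue it
            else:
                source[index] = value           # outside: write through
                if index < start:
                    target[index] = value       # synced region is mirrored
        target[start:end] = source[start:end]   # sync the window itself
        for index, value in queued:             # replay the queued writes
            source[index] = value
            target[index] = value
        start = end                             # slide the window forward
    return target


source = list(range(10))
target = [None] * 10
writes = {0: [(1, "a"), (7, "b")], 4: [(0, "c"), (5, "d")]}
full_resync(source, target, window=4, incoming_writes=writes)
print(target == source)
```

A smaller window means fewer requests wait in the queue at any time, which is why the window has to be kept small.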

If the age of the node matches the first one, but its log contains a different number of write log entries compared to the first node (the first node is always considered clean), then a partial resync is scheduled. A partial resync will also be scheduled when the log entry pointed to by the resync index of the node contains an error.

The mechanism of this resync type is the following: the system selects a synced node (checking each node's flags), fetches the log entry pointed to by the resync index of the given node and resyncs data from the other nodes to the given one. Then it checks the rest of the write log for other failed writes, so that the next resync block would be fetched for them.

The mirroring log is used to store write request information. It is allocated on disk and in memory (sync happens each time the resync work queue fires), and eats about 1% of free RAM or disk (whichever is less). Each write updates the log, so when a node goes offline, its log will be updated with error values, so that these entries can be resynced when the node is back online. When the number of failed writes becomes equal to the number of entries in the write log, recovery becomes impossible (since old log entries were overwritten) and a full resync is scheduled.

This does not work well when there are multiple writes to the same locations - they are considered different writes and thus will be resynced multiple times. The right solution is to check the log for each write, preferably with the log being not an array but a tree.
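
A minimal sketch of such a write log, assuming a fixed-size ring of (sector, ok) entries. The class WriteLog and its methods are hypothetical and do not mirror the DST structures. The sketch treats the log as unusable once the ring has wrapped past the first failed entry, which matches the rule above when every write fails after the node goes offline.

```python
class WriteLog:
    """Toy fixed-size ring log of writes, not the DST on-disk format."""

    def __init__(self, size):
        self.size = size
        self.entries = [None] * size
        self.next = 0              # total number of writes recorded
        self.first_failure = None  # position of the oldest failed write

    def record(self, sector, ok):
        if not ok and self.first_failure is None:
            self.first_failure = self.next
        self.entries[self.next % self.size] = (sector, ok)
        self.next += 1

    def plan(self):
        """Decide which kind of resync is needed when the node returns."""
        if self.first_failure is None:
            return ("none", [])
        if self.next - self.first_failure >= self.size:
            return ("full", [])    # failures have filled the whole log
        failed = [s for s, ok in filter(None, self.entries) if not ok]
        return ("partial", failed)


log = WriteLog(size=4)
log.record(10, True)
log.record(20, False)   # node went offline
log.record(20, False)   # same sector written again
print(log.plan())       # sector 20 is listed twice
```

The duplicate entry for sector 20 is the weakness mentioned above: an array log resyncs the same location twice, while a lookup structure keyed by sector would not.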

/devel/dst :: Link / Comments (0)


Fri, 14 Dec 2007

Linux Test Project on top of DST storage.

# pwd
/mnt/ltp-full-20071130

# ./runltp -p -f fs -d `pwd`/tmp
...
# cat /mnt/ltp-full-20071130/results/results.2007-12-14.11.21.41.17106 
Test Start Time: Fri Dec 14 11:21:41 2007
-----------------------------------------
Testcase                       Result     Exit Value
--------                       ------     ----------
gf01                           PASS       0    
gf02                           PASS       0    
gf03                           PASS       0    
gf04                           PASS       0    
gf05                           PASS       0    
gf06                           PASS       0    
gf07                           PASS       0    

-----------------------------------------------
Total Tests: 57
Total Failures: 0
Kernel Version: 2.6.22-rc5-dst
Machine Architecture: x86_64
Hostname: uganda

# mount | grep mnt
/dev/dst-storage-32 on /mnt type xfs (rw)

# cat /sys/devices/storage/n-0-ffff*/type
R: 192.168.4.81:1025
R: 192.168.4.81:1026
All 'fs' tests completed successfully, although I saw the following dump in dmesg:
[ 8398.605691] BUG: MAX_LOCK_DEPTH too low!
[ 8398.609641] turning off the locking correctness validator.
which is an XFS bug.

Since DST is quite a dumb device, these tests will not find tricky places, but they are good for generating high load on top of the given block device.

/devel/dst :: Link / Comments (0)


New release of the userspace network stack.

Changed the data reading function: now it does not copy the TCP header into the user's buffer, only data. Also forced the packet socket reading path to limit the maximum number of packets to be read which do not match the created netchannel.
As usual, new release is available from project homepage.

/devel/networking/unetstack :: Link / Comments (0)


New mirroring module in the distributed storage.

$ git-diff-index --stat HEAD drivers/block/dst/alg_mirror.c
 drivers/block/dst/alg_mirror.c |  745 ++++++++++++++++++++--------------------
 1 files changed, 364 insertions(+), 381 deletions(-)
It is cool and works well in my environment, but (like the previous one) it forces a total mirror resync after the main storage node reboots or crashes (if it is required, for example when the array was not already in sync and the main node rebooted).
I want to extend the DST mirroring algorithm not to force a full resync, but to store a log of the writes on each node, so when a new array starts, it would check not only the age of the nodes (a unique id stored at the end of each node; if it does not match, a total resync starts), but also the write log, so that if the latter does not match, only a selected number of regions would be synchronized.

Stay tuned...

/devel/dst :: Link / Comments (0)


Thu, 13 Dec 2007

Why pushing project into the kernel is not a main goal?..

One has to have some courage and not be afraid to throw something out and create new things instead of old, even if it requires a lot of effort and causes some problems in the short term.
So I've just erased the mirroring algorithm from DST and will rewrite it mostly from scratch, since I have a very interesting sync algorithm in mind, which will not require a clean/dirty bitmap.
Having DST in the kernel would not allow me such flexibility...

/devel/dst :: Link / Comments (0)


Wed, 12 Dec 2007

Climbing evening.

It was again a bit late training and thus shorter than usual, but nevertheless it was very saturated - I tried the old complex start on the horizontal negative slope and several times managed to complete it fully. That's a very interesting and complex trace in itself, but some time ago I tried some of its bits and completed them. I think I can finish it without falls after several trainings, but right now I'm working on what I think is the most complex part: the power-sucking start.
Horizontal negative slope is usually a big problem for me because of my power endurance, it also requires a very strong back in some movements, so right now I'm feeling that I still have some muscles in the body and they did not disappear after sitting in the chair most of the time.
Excellent time!

/life :: Link / Comments (0)


I was a bit pessimistic about DST design bugs.

Things are only bad when resync of the mirror node is in place...
I fixed both issues, but will spend additional time debugging and testing them, since I do not like how it was done. I think I will rewrite the mirroring resync logic.

Subrata Modak of IBM suggested using the Linux Test Project, which I found to have interesting benchmarks; while they are mostly useful for filesystem development, they can still find some bugs in DST.

/devel/dst :: Link / Comments (0)


Shame on me or how complex are design bugs...

I have to admit that mirroring in DST is not currently well supported.
First, because of a bug I made in the early development stage. In DST there are two objects which represent a part of the storage. The first one is a node: this object contains information about the type of the storage and pointers to the structure which represents the low-level device itself (like a block device or a network connection). A network connection in turn is represented as a state structure, which contains the socket, the state machine for transferred data and so on. Nodes are used when a block IO request comes from the higher layer, and states are used when data is transferred via the network. The former uses fine-grained reference counters: when a node is being operated on (a request is processed), its reference counter is increased; if the operation becomes asynchronous (for example the sending queue is full and thus the block can not be sent right now), then the block request is queued into the state's request list and the reference counter for the node is dropped. If it reaches zero, the node is freed, which in turn calls the exit callback for the state, which flushes the queue of requests.
Things seem simple and correct, but the devil is in the details - the async processing thread can enter the game at any point and process the state too, which leads to bugs.
Second, DST mirroring can eat all your memory during resync, since it does not check the amount of free RAM in the system and tries to allocate new pages until all memory is used. This is already fixed in the private tree though.
And the last (known) problem is the mirror bitmap - it uses a single bit for a single sector of the device, and although it uses vmalloc(), it is still too much RAM.
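
To put a number on the bitmap cost, here is a small worked calculation. The 512-byte sector size and the 1 TiB device are assumptions chosen for the example, not figures from DST.

```python
SECTOR_SIZE = 512  # bytes per sector, assumed for the example


def bitmap_bytes(device_bytes, sector_size=SECTOR_SIZE):
    """Size of a bitmap holding one bit per sector, rounded up to bytes."""
    sectors = device_bytes // sector_size
    return (sectors + 7) // 8


TIB = 1 << 40
MIB = 1 << 20
bitmap_mib = bitmap_bytes(TIB) // MIB
print(bitmap_mib, "MiB of bitmap per TiB of storage")
```

Under these assumptions every tebibyte of mirrored storage needs 256 MiB of bitmap, which was a large share of the RAM in a typical 2007 machine.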

Back to fixing.

/devel/dst :: Link / Comments (0)


Tue, 11 Dec 2007

First pohmelfs dmesg and bits of Linux VFS internals.

[ 9941.748766] pohmelfs_alloc_inode, inode: ffff81003bc83ac8.
[ 9941.755070] pohmelfs_read_inode, inode: ffff81003bc83ae0, num: 12,
	inode is regular: 0, dir: 1, link: 0.
[ 9947.667710] pohmelfs_readdir: filp: ffff81003c5ad6a8, inode: ffff81003bc83ae0,
	dirent: ffff81003c5aff38, filldir: ffffffff8027f274.
[ 9950.283976] pohmelfs_readdir: filp: ffff81003e82faa8, inode: ffff81003bc83ae0,
	dirent: ffff81003a00ff38, filldir: ffffffff8027f274.
[10028.705354] pohmelfs_readdir: filp: ffff81003d4f1068, inode: ffff81003bc83ae0,
	dirent: ffff81003e10ff38, filldir: ffffffff8027f274.
[10095.745022] pohmelfs_lookup: dir: ffff81003bc83ae0, dentry: ffff81003b5343a0,
	nameidata: ffff81003e10fe88.
[10095.754922] pohmelfs_lookup: dir: ffff81003bc83ae0, dentry: ffff81003b5343a0,
	nameidata: ffff81003e10fdf8.

uganda:~# mount | grep pohmel
/dev/hdb1 on /mnt type pohmel (rw)

uganda:~# ls -la /mnt/
total 0
It is about 12kb of code just to register own filesystem and provide a number of VFS callbacks, so that the filesystem can be mounted.
It is not possible to create files or directories since the directory lookup method is not implemented (it returns NULL), and ls -l does not show any data since the ->readdir() callback does not fill directory entries, as there are no such objects in the filesystem at all.

As you understood, this is a fairly trivial implementation, which was created just as a reference point. So far it includes stubs for the following VFS methods:
  • basic address space operations:
    • ->readpage() - reads a page, usually implemented via the generic mpage_readpage(), which uses the per-filesystem get_block() callback. This is called via the read path, when the file's page is not in the page cache yet.
    • ->writepage() - writes a page, usually via the generic block_write_full_page() helper, which uses the per-filesystem get_block() callback. This is called by the VFS core when there is a need to write a page from the cache to disk. This happens for example when you call sync and friends.
    • ->prepare_write()/->commit_write() - they are called via the write path (for example from generic_file_buffered_write()); these functions have to reserve space on disk, update related metadata and perform other private filesystem steps for the given page, which will be flushed to that on-disk area in ->writepage().
  • basic directory inode and file operations:
    • file operations include the ->read() callback, which has to return -EISDIR, and ->readdir(), which has to read directory entries for the given inode into the provided buffer. Right now it is empty.
    • inode operations are used to create/remove/lookup and perform other tasks on directory content. Read-only filesystems only have to provide the ->lookup() callback, which is used to look up the inode for a given directory entry. Others have to implement a lot more operations: create, lookup, link, unlink, symlink, mkdir, rmdir, mknod, rename, setattr, a set of extended attribute operations and so on... Pohmelfs currently does not perform anything at all, but already provides an empty lookup callback.
  • basic file operations (file operations themselves and inode operations for regular files):
    • file operations for regular files are currently those provided by generic_ro_fops, which include:
      • ->llseek() - generic_file_llseek() - seek inside the file mapping; it just updates the file's current position and performs some checks, so it does not include anything filesystem specific.
      • ->read() - do_sync_read() - a helper used by the read syscall; it will eventually call ->aio_read(), which is generic_file_aio_read() for these file operations, and that will call ->readpage() for pages which are not yet in the page cache.
      • ->aio_read(), described above.
      • ->mmap() - generic_file_readonly_mmap() - it will set up the mapping's operations, which include only a fault handler, which in turn will call page_cache_read(), which ends up with ->readpage() calls. Of course mapping is a bit more complex task, but from the filesystem point of view that is all we have to know.
      • ->splice_read() - generic_file_splice_read() - this callback is used for splice system calls, which end up calling the same ->readpage() callback for the set of pages which are put into the spliced buffer of pages.
    • inode operations for regular files are not needed if it is a read-only filesystem (although it can provide some useful callbacks like getting extended and usual attributes); for a usual filesystem at least the ->truncate() and ->getattr() callbacks are required.
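
Most of the read-side callbacks in the list above end up in ->readpage() on a page cache miss. The toy model below shows only that control flow. ToyFile, its 4-byte pages and its method names are invented for the illustration and have nothing to do with the real kernel structures.

```python
class ToyFile:
    """Userspace toy: a generic read path in front of a readpage() callback."""

    PAGE_SIZE = 4  # absurdly small page, to keep the example readable

    def __init__(self, backing):
        self.backing = backing      # stands in for the on-disk data
        self.page_cache = {}        # page index -> page content
        self.readpage_calls = 0

    def readpage(self, index):
        """Filesystem-specific part: fill one page from the backing store."""
        self.readpage_calls += 1
        start = index * self.PAGE_SIZE
        self.page_cache[index] = self.backing[start:start + self.PAGE_SIZE]

    def read(self, pos, count):
        """Generic part: walk the pages, calling readpage() only on a miss."""
        out = b""
        end = min(pos + count, len(self.backing))
        while pos < end:
            index, offset = divmod(pos, self.PAGE_SIZE)
            if index not in self.page_cache:
                self.readpage(index)
            chunk = self.page_cache[index][offset:offset + (end - pos)]
            out += chunk
            pos += len(chunk)
        return out


f = ToyFile(b"hello, pohmelfs!")
first = f.read(2, 7)
calls_after_first = f.readpage_calls
second = f.read(2, 7)          # served from the cache, no new readpage()
print(first, calls_after_first, f.readpage_calls)
```

The point of the split is that the generic part knows nothing about the storage, so a filesystem only has to supply the page-filling callback.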

/devel/fs :: Link / Comments (0)


Mon, 10 Dec 2007

I have started laid off process.

Most of the projects have been moved to colleagues, talks with management are completed.
Just waiting for tiny bits and that's all...

/devel/other :: Link / Comments (2)


PohmelFS.

linux-2.6.fs$ mkdir fs/pohmelfs
linux-2.6.fs$ date
Mon Dec 10 19:38:53 MSK 2007
Stay tuned...

This is a working name of the filesystem, I will think about release name later.
First I will implement a simple base, which will just register itself with the Linux VFS code, so that I will put here some specs about what Linux VFS requires from the filesystem. In parallel it will be used as a base for either network filesystem and/or distributed/local filesystem.

/devel/fs :: Link / Comments (2)


New distributed storage release: Gamardjoba, genacvale!

Short changelog:

  • wake up state when mirror detects an error to speed up reconnect
  • if connecting in csum mode to no-csum server, do not enable csums
  • do not clean queue until all users are removed
  • allow to increase size of the storage in linear add callback (with this change it is possible to add nodes into linear array in real time without stopping storage. Filesystem has to be prepared for the case when underlying device has changed its size. Real-time addon of mirror nodes is also supported)
  • allow to delete gendisk only after device was started
  • dst debug config option
  • Name: Gamardjoba, genacvale! ('Hi friend' in Georgian)
Great thanks to Matthew Hodgson (matthew_mxtelecom.com) for debugging!

As usual, one can get new release from the project homepage.

/devel/dst :: Link / Comments (0)


Sat, 08 Dec 2007

Pancho Villa.

I spent an excellent time in this Mexican restaurant with friends. We celebrated Perec's birthday: tequila did flow plentifully, burritos were hot, and hours disappeared silently.
Excellent time!

/life :: Link / Comments (0)


Fri, 07 Dec 2007

Climbing evening.

It was quite a short and not very hard training - I was a bit later than usual, and most of the time I tried a quite old but very complex start on the horizontal negative slope. Meanwhile I talked with the instructor and found that the start in question is missing one hold, which was there originally, so that should explain why I fell. I will continue that red trace next time, I even want to put a huge paper around another hold, located where the old one was: 'I'm a red hold, I'm just feigning'.

/life :: Link / Comments (0)


Strong checksums in DST rock.

Great thanks to the person who suggested that I implement them, and to Zach Brown, who showed that Castagnoli CRC is a better one than Adler.

I've debugged a setup where the system failed to mount an XFS filesystem on top of distributed storage, and after strong checksums were turned on, the system detected they were wrong, so some corruption happened during filesystem setup.
Turning off TSO, RX and TX offload of the e1000 NICs on the machines which form the storage fixed the problem.
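
For reference, the Castagnoli CRC (CRC-32C) can be written bit by bit in a few lines. This is a slow illustrative sketch, not the code DST uses: the function name crc32c() and the 512-byte test block are choices made for the example. The reflected polynomial 0x82F63B78 and the check value for the string "123456789" are the standard ones for CRC-32C.

```python
def crc32c(data, crc=0):
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


block = bytes(range(256)) * 2      # one 512-byte "sector" of test data
corrupted = bytearray(block)
corrupted[100] ^= 0x01             # flip a single bit "on the wire"

print(hex(crc32c(b"123456789")))   # standard check value: 0xe3069283
print(crc32c(block) != crc32c(bytes(corrupted)))
```

A checksum computed over the data before it is handed to the network stack and verified after it leaves the stack on the other side catches corruption that happens in between, which is how a problem like the offload one above becomes visible.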

Strong checksums rock!

/devel/dst :: Link / Comments (3)


Distributed storage and long distances.

I've just completed some tests over the distributed system, created on top of usual internet links between machines, located in Moscow, Russia and London, UK.
Remote target was setup, then XFS filesystem created, mounted and some tests ran.
One of the machines (main storage server) is located behind at least one NAT firewall.

/devel/dst :: Link / Comments (4)


The return of syslets.

Zach Brown announced a new syslet patchset aimed at simplifying and stabilizing basic async operations. Syslets are a mechanism for performing syscalls asynchronously - when a syscall is about to block, a new thread is started; the old thread blocks and is scheduled away in favour of the new one, on behalf of which userspace continues its execution.
Version 7 of the patchset was built on top of indirect syscalls; threadlets, userspace function execution and async IO were removed from the patchset for simplicity, and a number of comments and code clarifications were added.
Main goal of the syslets right now is to make fundamental things working right.

Asynchronous IO has too long a history already - it was implemented as a state machine in KAIO and kevent AIO, and the kernel supports AIO for direct IO operations (userspace requires libaio).
The syslet approach was shown to be in some cases much slower than libaio (which is actually a sync operation for usual files), but this was attributed to unfairness of the CFS scheduler, and (iirc) it was fixed/extended.

My main objection against this is the fact that when you have thousands of actively running applications, the system starts sucking badly, but if it is possible to reduce the maximum number of working threads per user to some reasonable limit, things will be just fine. Syslets (and their more friendly user, threadlets) were supported by Linus and Ingo Molnar, so very likely this will be the default way to do asynchronous IO and other operations.
Right now Zach highlighted the following problems:

  • ring buffer of syslet statuses limitations
  • ptrace() problems
  • stale data (when thread issuing a syslet calls for example setuid(), in which case another thread, which actually executes blocked syscall, contains wrong data)
  • problems with sys_clone() and syslets, sys_clone() is actually a mechanism to create a new thread in syslets, so we get a recursion
All the above problems are technically not impossible to resolve, and I think it is not that bad to introduce some simple limitations for users, so that the majority of async IO questions are resolved with this mechanism.

/devel/other :: Link / Comments (0)


B(something)-tree vs RB-tree. On-disk allocations.

In the previous article it was shown how btree and rbtree behave when allocations are done in memory. In such conditions btree should suck compared to rbtree, and generally it is true, although in some conditions its insert speed can be even slightly higher than rbtree's.

Now, let's check how they behave when all allocations are performed from disk.
The graph below shows insert speed for both rbtree and btree in such conditions; each node was allocated at a 1024+sizeof(node) offset from the previous one, so that readahead and thus cached disk pages would not influence the results.
In total 1 million keys were inserted into the tree.
Search speed is roughly the same as in the in-memory tests, since most of the tree sat in RAM after insertion.

B(something)-tree vs RB-tree. On-disk allocations

The high jump around 220 keys is likely the place where the node size becomes big enough, and the number of nodes small enough, that the whole tree starts to fit into the page cache. In some cases there is no such peak and the graph slowly moves to around 40k insertions per second, which likely happens when some background task is actively using the page cache, flushing the test file's pages away from memory.

/devel/fs :: Link / Comments (0)


The most discouragement-resistant hacker out there.

That is how Jonathan Corbet calls me :)

/devel/other :: Link / Comments (0)


Thu, 06 Dec 2007

Multithreaded filesystem access.

Trees are generally (if not always) very bad for parallel access, since there is no good strategy for what to lock: tree modifications usually require changes to more than one node and in some cases (like b-tree or AVL tree) can lead to changes at every layer.
Thus it is much simpler to lock the whole tree during any changes, but since not every node in the tree is in main memory and thus has to be fetched from disk, this can lead to long delays per operation.
In contrast, Linux VFS operates with pages, where each page is locked individually.
Similar changes for hash tables (i.e. one lock per hash bucket) can actually lead to lower performance than locking the whole table with a single lock, because the cache lines containing the per-bucket locks bounce between CPUs; but this, again, is only applicable to main memory, since access to a single bucket in the hash table is usually quite cheap even if it contains several entries.
So, I do not know a perfect locking scheme for trees when they are allocated on disk, so I will find that knowledge in experiments.
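
For comparison with whole-structure locking, this is what one lock per hash bucket looks like in its simplest form. It is a sketch of the locking granularity only: BucketLockedTable is a made-up class, and Python threads will not reproduce the cache-line bouncing discussed above, so the example says nothing about performance.

```python
import threading


class BucketLockedTable:
    """Toy hash table with one lock per bucket instead of one global lock."""

    def __init__(self, buckets=16):
        self.buckets = [dict() for _ in range(buckets)]
        self.locks = [threading.Lock() for _ in range(buckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        i = self._index(key)
        with self.locks[i]:            # only this bucket is blocked
            self.buckets[i][key] = value

    def get(self, key, default=None):
        i = self._index(key)
        with self.locks[i]:
            return self.buckets[i].get(key, default)

    def __len__(self):
        return sum(len(b) for b in self.buckets)


table = BucketLockedTable()


def worker(base):
    for i in range(1000):
        table.put(base + i, i)


threads = [threading.Thread(target=worker, args=(n * 1000,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(table))
```

A hash table makes this easy because a key touches exactly one bucket. A tree insert may touch nodes on several layers, which is why the same trick does not carry over directly.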

The best solution, which is the most related to real life, is a trivial filesystem of course.
Initially this will be a simple and very small kernel module with a basic filesystem in it, so that it could be trivially changed to support an on-disk filesystem and a network filesystem.

I have wanted to put my dirty hands into it for quite a while already, so it is time to start...
Stay tuned!

/devel/fs :: Link / Comments (2)


A simple way to crash machine using XFS and DST.

Let's suppose you want to create an XFS on top of a DST array. If you mistakenly run mkfs.xfs /dev/sda1 (let's suppose you want to create DST storage on top of the /dev/sda1 device) and then start DST on top of /dev/sda1:

./dst -n storage -A alg_mirror -d /dev/sda1 -R -s0 -S0
this will overwrite the last sector of /dev/sda1, where XFS stores its metadata. Mounting XFS after that will lead to an almost 100% certain crash of the machine on 2.6.22 kernels because of some bugs in XFS, which appear when XFS reads corrupted metadata from the last sector.

To work with DST you have to operate with /dev/dst-$storage-$num devices (i.e. run mkfs.xfs /dev/dst-$storage-$num), and not with underlying ones.

/devel/dst :: Link / Comments (0)


Wed, 05 Dec 2007

BTRFS 0.9 release.

Chris Mason announced new release of his btrfs filesystem.
It includes:

  • bigger filesystem block sizes
  • extended attributes (no ACL yet)
  • extent alignment parameter
  • inlining of the file data into btree
  • number of performance and stability improvements
Chris also showed a rough timeline for the filesystem development.

As he pointed out, btrfs is still very bad in database loads and does not support multithreaded operations.

As you probably got, the implemented inlining of file data into the btree is virtually the scaling inodes algorithm, although a bit simpler.
I do like btrfs, and wish great success to this filesystem. But only until I start my own :)
Kidding of course.

/devel/fs :: Link / Comments (0)


Storage hotplugging in DST.

For the interested reader: yes, it is possible to add disks into DST storage on the fly, but be sure that your filesystem supports that (in case of linear setup), mirroring is fairly transparent.
Command to add another node into mirror setup is pretty simple:

./dst -n storage -A alg_mirror -S0 -s0 -a kano -p 1026
Just like adding usual node into the storage before it was started.

Please note that when adding a node which is smaller than the current device size, the device size will be reduced and this can damage your filesystem!
The same applies to linear setup.

/devel/dst :: Link / Comments (0)


Tue, 04 Dec 2007

DST FAQ.

The most frequently asked question about DST is:

Can you give us a summary of how this differs from using device mapper with NBD or iSCSI?
Answer is quite simple:
From the higher point of view it does not, but it operates quite differently: it has async processing of the requests, thus not blocking; it has a different protocol with smaller overhead; it supports strong checksums; it has an in-kernel export server, which supports simple security attributes (i.e. allow to connect, to read or to write). It uses a smaller amount of memory (zero additional allocations in the common path for linear mapping, not including network allocations, and a smaller number of additional allocations for the mirroring case). DST supports failure recovery in case of a dropped connection (the core will reconnect to the remote node when it is ready), thus it is possible to turn remote nodes off and on without special administration steps. DST has simple autoconfiguration at startup time (checksum support and storage size autonegotiation). It is possible to turn one of the mirror nodes off and use it as an offline backup, since a DST mirror node stores its data at the end of the storage, so it can be mounted locally.

/devel/dst :: Link / Comments (0)


New distributed storage subsystem release.

This is a maintenance release and includes bug fixes and simple feature extensions only.

Short changelog:

  • fixed bug with XFS metadata update (it can provide slab pages to the DST, so it is not allowed to transfer them using ->sendpage())
  • fixed async error completion path
  • extended netlink communication channel to report errors back to userspace
  • DST name is now "The 10'th dynasty of smuggled slothes"
  • number of fixes for userspace DST target
Great thanks to Matthew Hodgson (matthew_mxtelecom.com) for debugging and fixes for userspace DST target and preliminary netlink extension patches.

As usual you can download this release from the homepage.

If you want to try distributed storage this release is a really good candidate to start with.

Enjoy!

Update: This release includes bug fixes for all bugs described here, including uninterruptible sync read operations.

/devel/dst :: Link / Comments (2)


The 22nd century netchannels release.

This is the 22nd release of netchannels, a peer-to-peer protocol-agnostic communication channel between hardware and users. It uses a unified cache to store channels, and allows buffers for data to be allocated from a userspace mapped area or from another preallocated set of pages (like the VFS cache). All protocol processing happens in process context.

Users of the system can be for example userspace - it allows to receive and send traffic from the wire without any kernel interference, to implement own protocols and offload its processing to the hardware.

This idea was originally proposed and implemented by Van Jacobson. This patchset (with the userspace network stack) is a logical continuation of the idea with a move to full peer-to-peer processing.

Short changelog:

  • update cached route in the netchannel when it expires
Thanks to Salvatore Del Popolo (delpopolo_dit.unitn.it) for testing.

You can get the latest sources from netchannels homepage.

Userspace network stack is available from own homepage.

/devel/networking :: Link / Comments (0)


Mon, 03 Dec 2007

Climbing evening.

That was a great training; although I completed not that many traces, they all were good. Several old ones from really simple to quite interesting at the beginning, then a number of traverses and boulderings. When Grange arrived it was already quite late, so we made a couple of simple traces without rest in between, and then the greatest trace started: 7a+ (although I was allowed to modify it a bit, so I reduced its category down to 6c/6c+) over the black holds in the left center sector. I did not finish it (it has not been an on-sight for quite a while already), since I was too tired, but I made the key point several times, which flushed me down to the bottom.
I was tired as hell and that was a great feeling!

/life :: Link / Comments (0)


Sun, 02 Dec 2007

Distributed filesystem roadmap.

  • Distributed storage. This step is mostly completed; although some bugs are there and there is a number of features to be implemented, work is being done on them and it is on the finish line. The feature list includes:
    • sync/barrier support
    • error reporting to userspace via netlink (patch was made by Matthew Hodgson (matthew_mxtelecom.com))
    • some thoughts about sync operations which can get stuck in uninterruptible state if there are some problems with remote nodes (Hi NFS), I will create a fix for this issue for DST at least.

    There is a nasty bug in DST currently, which I can not reproduce locally, so I debug it with Matthew on his setup.
    There is also a fair number of fixes for the userspace DST target made by him. Great thanks!
  • Local filesystem with a very scalable and fast on-disk format, possibilities to have on-line backups, snapshots, no fsck, scalable locking (multithreaded reading and writing).
    This originally had a design very similar to btrfs, but I want to move further and have the ability to perform multithreaded and then multi-machine access to the same files. Call me a loser or wheel reinventor (I would not be where I am if I cared about it), but I want to have a project where I know every single bit, to be able to fix things quickly and break something if it is needed for a better implementation.
  • linking both network and fs layers together; this will include distributed byte-range locking and cache coherency for client nodes. Bits of this step I described in a short discussion with Zach Brown.
This does not mean the steps will be completed in the above order: I'm working in parallel in different directions and some parts can appear earlier, so that I would be able to evaluate their problems.
Bug fixes obviously have the highest priority.

/devel/fs :: Link / Comments (0)


Meanwhile at apartment development side.

I reached a big milestone yesterday - I completed wallpaper glueing in the hall, room and checkroom. Although it still requires some fixes and bits of work with a knife, it is a huge step forward. I also wanted to paint the walls in the room yesterday, but slacked off. Today is supposed to be another heavy working day - I will go to the development shop (on the opposite side of Moscow) to get neon cord, a water system hatch and a ceiling for the bathroom. If I return not too late, I will start setting them up, otherwise I will paint the room.
I'm curious when I will have a real vacation and some rest, but I think I found an answer - tomorrow I will start the discharge process at work and expect it to be completed in a week or two at most, so I will have about two weeks before the New Year celebrations. Most of them will be devoted to the apartment development though.
Well, we will definitely have some rest in eternity...

/devel/flat :: Link / Comments (0)


Fri, 30 Nov 2007

Climbing evening.

That was a very good one - I tried a number of old traces, which I had either never climbed or climbed a couple of times and dropped. This included ones that are simple at first glance (and by category) but actually quite complex on the wall, and ones that are really complex by category and damn bloody complex on the wall.
Eventually I even fixed one trace to be a bit simpler, so that it matched its category (7a+), with the permission of the instructors, although I think it became too simple after just a single hold change.
That was a good time!

/life :: Link / Comments (0)


Thu, 29 Nov 2007

The 21st netchannels release.

Netchannel is a peer-to-peer, protocol-agnostic communication channel between hardware and users. It uses a unified cache to store channels and allows allocating buffers for data from a userspace mapped area or from another preallocated set of pages (like the VFS cache). All protocol processing happens in process context.

A user of the system can be, for example, userspace - this allows receiving and sending traffic from the wire without any kernel interference, implementing own protocols and offloading their processing to the hardware.

This idea was originally proposed and implemented by Van Jacobson. This patchset (with the userspace network stack) is a logical continuation of the idea with a move to full peer-to-peer processing.

One of its users is the userspace network stack.

Short changelog:

  • fixed queue length usage
  • fixed dst release path. Both problems reported by Salvatore Del Popolo (delpopolo_dit.unitn.it)
  • removed nat user
More details can be found on project homepage.

/devel/networking :: Link / Comments (0)


B(something)-tree vs. RB-tree.

These simple benchmarks test btree vs rbtree search and insert speeds, when nodes are allocated in memory.
The btree is (likely) a b+tree where each data node is located at the bottom layer (probably it is a bit different, the algorithm does not strictly follow the rules described in btree papers). 1 million entries were inserted or searched in the tree.
The graphs below show the speed of the given operation depending on the number of keys in the btree node (when the number of keys is smaller than 20, a linear search of the key is used, otherwise binary search).

[Graph: B(something)-tree vs. RB-tree. Search speed.]

[Graph: B(something)-tree vs. RB-tree. Insert speed.]

As you see, btree search is about 2-2.5 times slower than rbtree. This can be clearly explained by the fact that rbtree contains data in each node, while btree contains it only in the lowest nodes, so each btree search has to travel over all layers (this is roughly equal to log2(number of keys in each node)) multiplied by the number of keys in each node (in the worst case), while rbtree needs to perform the same amount of searches only in the worst case, and generally requires about 2 times fewer traversals.
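
The in-node lookup used in these tests (linear scan for nodes with fewer than 20 keys, binary search otherwise) can be sketched roughly as follows. This is only an illustration in Python, not the benchmark code: the names find_slot and LINEAR_THRESHOLD are made up here, and the only detail taken from the test is the threshold of 20 keys.

```python
from bisect import bisect_left

# Threshold taken from the text: nodes with fewer than 20 keys are scanned linearly.
LINEAR_THRESHOLD = 20


def find_slot(keys, key):
    """Return the index of the first key >= `key` in a sorted node.

    `keys` is assumed to be a sorted list of the keys stored in one node.
    """
    if len(keys) < LINEAR_THRESHOLD:
        # Small node: a plain scan touches at most len(keys) entries.
        for index, candidate in enumerate(keys):
            if candidate >= key:
                return index
        return len(keys)
    # Large node: binary search needs about log2(len(keys)) comparisons.
    return bisect_left(keys, key)
```

In a search the returned slot would pick the child to descend into, so this cost is paid once per layer of the tree.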

Btree insert can be even slightly faster than rbtree, since during insertion the tree grows from the low levels and each new allocation reserves space for several keys, thus greatly reducing the number of needed splits or rotations.

I will run the same tests for the case when all nodes are allocated from disk storage, which should show a clear win for the btree approach.
Stay tuned.

/devel/fs :: Link / Comments (0)


Astonishingly screwed tapeworm.

New release of the distributed storage subsystem. This is a maintenance release and includes bug fixes only.

Short changelog:

  • use node's size in sectors instead of bytes
  • fixed old/new ages for the first node. Error spotted by Matthew Hodgson (matthew_mxtelecom.com)
  • fixed debug printk declaration
  • new name
The overall list of DST features can be found on the project's homepage.

/devel/dst :: Link / Comments (4)


Wed, 28 Nov 2007

Slackass.

Yes, you, who sat at the computer instead of going climbing. Who likely forgot how to hold a rope, how holds look and how to use them. You, who no longer know what power in the muscles is and how tiredness kicks the body. You, who strain only when standing up from the chair.
You, who promised to go climbing and miss trainings one by one, you are a slackass :)

Meanwhile I damaged a foot at the training - I climbed without insurance on the horizontal negative slope and when I moved on top of it (only about 4 meters up) I fell and unfortunately managed to land my right foot in the hole between the floor mats and damage it by kicking the floor. I think it is not cracked, but it aches quite noticeably when I move. So likely I will become a slackass too.
It was quite an interesting training though - I even climbed high on the wall with the strongest negative slope - I tried a new trace on-sight and fell several times during that, although the trace is quite simple and the holds are big. I also tried several boulderings and the above start on the horizontal negative slope several times.
It was a good training.

/life :: Link / Comments (1)


Teams make a business, individuals make innovations.

/other :: Link / Comments (4)


Reducing entropy of (software) bugs in the universe.

Yesterday I added a bug to Fedora Core bugzilla, today I fixed one bug in kernel bugzilla related to IPv6 addrconf.

My karma is clean again.

/devel/other :: Link / Comments (0)


20 hours.

Yesterday I woke up at about 6 o'clock and went back to bed (hammock) today at about 2 o'clock. This resulted in properly working (although lacking tricky interesting features) search/insert operations of the given btree. It looks a bit different from a usual btree (or b+, b# or b-whatever); right now it supports 64-bit keys, but I plan to extend it to 128 bits. So far I have only tested it with usual memory and thus it was a bit limited; I will run the rb-tree vs. btree benchmark for on-disk and in-memory allocations for billions of keys.
Stay tuned.

/devel/fs :: Link / Comments (0)


Tue, 27 Nov 2007

Fibre Channel over Ethernet Project.

Robert Love (if I understood correctly it is not that Robert Love who wrote "Linux System Programming", "Linux Kernel Development" and "Linux in a Nutshell" :) announced a new Intel project aimed at allowing systems with an Ethernet adapter and a Fibre Channel Forwarder to log in to a Fibre Channel fabric (the FCF is a "gateway" that bridges the LAN and the SAN). That fabric login was previously reserved exclusively for Fibre Channel HBAs.
The system provides both Fibre Channel and Ethernet transport modules, as well as a software target and initiator. Right now the code can not be imported into the tree (small BSD code usage, small amount of documentation, ioctl() usage and kernel/userspace interaction), but there are several git trees, so that interested users can set up a testbed.

Homepage: http://open-fcoe.org/

/devel/other :: Link / Comments (0)


Reproducible GTK (probably buffer overflow) bug in FC7.

Program received signal SIGSEGV, Segmentation fault.
0x00b096e3 in ?? () from /usr/lib/libgdk_pixbuf-2.0.so.0
(gdb) bt
#0  0x00b096e3 in ?? () from /usr/lib/libgdk_pixbuf-2.0.so.0
#1  0x00b026f1 in gdk_pixbuf_composite_color () from /usr/lib/libgdk_pixbuf-2.0.so.0
#2  0x08083ece in gtk_tree_path_free ()
#3  0x0808450d in gtk_tree_path_free ()
#4  0x068c4a91 in ?? () from /lib/libglib-2.0.so.0
#5  0x068c67f2 in g_main_context_dispatch () from /lib/libglib-2.0.so.0
#6  0x068c97cf in ?? () from /lib/libglib-2.0.so.0
#7  0x068c9b79 in g_main_loop_run () from /lib/libglib-2.0.so.0
#8  0x06f20f44 in gtk_main () from /usr/lib/libgtk-x11-2.0.so.0
#9  0x08097d3f in gtk_tree_path_free ()
#10 0x007bff70 in __libc_start_main () from /lib/libc.so.6
#11 0x080532c1 in gtk_tree_path_free ()
(gdb) 
It was obtained during btree debugging - I generated a big graph using Graphviz and tried to view it with gqview, which crashed badly. All updates were installed. x86 arch. I've filed a bug in Fedora bugzilla, but I'm not sure it will be resolved.
Crap - I still can not develop my interesting btree, but I'm very close to the finish.

/devel/other :: Link / Comments (0)


Mon, 26 Nov 2007

Climbing evening.

That was a great training. Although besides a couple of warming traverses I only did the same start on the horizontal negative slope, I tried it many, really many, REALLY many times. That sucked all the power, so at the end I was tired as hell, but that is a great feeling.
I met climbers there who had not been there for quite a while, so I thought that I would not meet them again anymore, but no, things changed...
Bloody excellent time!

/life :: Link / Comments (0)


Sat, 24 Nov 2007

Coherent Remote File System.

Zach Brown has an extremely interesting idea of network filesystem implementation.
One can think about it as an NFS client or, more precisely, as a client-server protocol which allows clients to have a cache of data instead of relying on the server. This of course requires a cache coherency protocol to be involved in client-server communications, which makes things more complex.
Simply put, this works as a trivial filesystem mounted on clients, where each read/write/meta operation is performed on top of locally cached data; if data is not present in the local cache, it is fetched from the server. A client flushes its updated cache to the server under a number of conditions, either because of the usual writeback process or because of the cache coherency process (i.e. when another node reads from a file updated by the given client).
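
A toy model of that behaviour may make it easier to follow. It is only my sketch of the description above, not Zach's design: the Server and Client classes and all their method names are invented, whole files stand in for cached pages, and invalidation of stale copies on other clients is left out completely.

```python
class Server:
    """Holds the authoritative copy and remembers who has unflushed updates."""

    def __init__(self):
        self.data = {}      # path -> bytes stored on the server
        self.writers = {}   # path -> client holding a dirty cached copy

    def note_dirty(self, path, client):
        self.writers[path] = client

    def write_back(self, path, data):
        self.data[path] = data
        self.writers.pop(path, None)

    def read(self, path, reader):
        owner = self.writers.get(path)
        if owner is not None and owner is not reader:
            # Coherency step: another node reads a file updated by `owner`,
            # so the owner has to flush its cache first.
            owner.flush(path)
        return self.data.get(path, b"")


class Client:
    """Performs every operation on the local cache, fetching on a miss."""

    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.dirty = set()

    def read(self, path):
        if path not in self.cache:
            self.cache[path] = self.server.read(path, self)
        return self.cache[path]

    def write(self, path, data):
        self.cache[path] = data          # the update stays local for now
        self.dirty.add(path)
        self.server.note_dirty(path, self)

    def flush(self, path):
        # Called by usual writeback or by the server's coherency step.
        if path in self.dirty:
            self.server.write_back(path, self.cache[path])
            self.dirty.discard(path)
```

With two clients, a write on the first one does not reach the server until either it flushes or the second client reads the same file.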

Zach will present it at LCA this February.
So far it is a closed Oracle project (as far as I know the open sourcing process is on the way, just like it was with Chris Mason's btrfs), and I strongly want to implement exactly the same idea myself :)
This process will have a number of benefits:

  • simple open source filesystem, which can be used as a base for real filesystem development (do not confuse it with virtual filesystems like sysfs or debugfs)
  • ability to extend it for own protocols
  • cache coherency mechanism will be used in distributed filesystem
  • possibility to test byte range locking in a real life
  • implement filesystem bits first in userspace (I do not want to introduce additional mispredicted behaviour because of FUSE)
Zach, what about small competition? :)
Frankly speaking I'm not an expert in cache coherency protocols or in filesystem development either (you will not believe me, but for the last several days I have been trying to implement an interesting B-tree, but with each day spent on that problem I comment out more and more bits in the code and it still does not work the way I want :). With recent trends I believe I will have pretty high-end hardware soon to perform various tests and find common and tricky bottlenecks.

This implementation can be used by various users aimed at distributed systems who do not want to have (or bother with) real filesystem development and who are ready to have a server in userspace on top of existing filesystems (in the receiving zero-copy project I showed a huge problem with in-kernel usage of some Linux filesystems, especially those which use in-kernel JBD journaling, where it is impossible to preallocate (->prepare_write()) a number of pages for a given file and then write into them and commit (->commit_write()) at once for maximum performance).

/devel/fs :: Link / Comments (2)


Meanwhile on the apartment development side.

I've completed the big arc in the room and finished most of the smaller arc for the checkroom (rough strong emery paper rocks); it requires some polishing and eventual wallpaper gluing and painting. I think I will finish this part (as well as painting all the room's walls) next weekend. I want to complete the room, checkroom and (hopefully) hall. The latter requires a bit more work - if I have enough time and glue for ceramics, I will set up a ceramic granite floor. I hope to get a ceiling and a water system hatch for the bathroom next weekend too, so that it would be completed too.
When it is ready, I think I will have some rest, or maybe will go the usual way - proceed with development of the kitchen. It is not that complex and will not require a lot of time actually, even if I install (and I want to) a hinged ceiling there too.

/devel/flat :: Link / Comments (0)


Tue, 20 Nov 2007

Crazy company wanted!

I am tired of my paid work. I really like all the people here, but when I'm assigned tasks which can be completed in several hours without major thinking, without interest and without a good understanding of what they are needed for and whether they will be used at all, and that has been happening constantly for the last years, I feel really frustrated.

If you read this, then very likely you know what I can do and how I behave (frequently not very good and friendly), and thus understand my intentions.

I want to work on my own projects first. If you believe that they correlate with your business and want to pay me for doing that with some influence over TODO list, then feel free to drop me a mail.
There are probably some issues with the process, which we can discuss privately.

/devel/other :: Link / Comments (4)


Maintenance release of the distributed storage subsystem.

It contains only the following bug fix:

  • Cleanup sysfs files on error path. Patch by Chris Madden (chris_reflexsecurity.com)
You can find the latest release on the project homepage.

/devel/dst :: Link / Comments (0)


Mon, 19 Nov 2007

Kind of working...

Hacking on getting a motion JPEG (Morgan codec) live dataflow from the adv202 hardware codec.
One can watch the resulting several-second SWF 'movie' (the hardware around, captured by a small analog camera connected to the above-mentioned codec on an AMCC PPC 405gpr cpu board), about 1.5 Mb.

/devel/other :: Link / Comments (2)


Sat, 17 Nov 2007

Meanwhile on the apartment development side: the concrete jungle king.

Yes, I made it, I installed the lavatory pan and wash-stand in the bathroom, although it is not finished yet (I did not complete gluing ceramic tiles on the small wall with the door and on the part of the wall where the water/sewerage hatch is located).
This required completing the sewerage and water projects in the bathroom. When I get my camera back, I will make some photos - the water system is not that trivial: it consists of filters for hot and cold water, meters and headers (water collectors). It does not yet support a boiler, but I will set it up soon.
One of my fellows lives near a good development shop, so he will bring me a special hatch, where I will install tiles; neon cord for my hinged ceiling; and a ceiling for the bathroom.
I think I will complete it this month and it would be great if I install a shower cabin, since washing without it is a real pain in the ass.
Tomorrow I will start polishing my main arc at the room's door area and will glue the bits of wallpaper I removed when I made it. I also plan to start painting on the wallpapers tomorrow (and of course my 'uchuu'). If there is enough time I will also develop the small arc in the checkroom and install chains for the coatrack.
The nearest future plans include hall development (ceramic granite on the floor, wallpapers and paint on the walls; fortunately I already finished painting the ceiling, this time it does not have any special details) and a move to the kitchen setup.
I also want to make a table as soon as possible and start developing interesting things at home.

Apartment development sucks really lots of my power, but I do like it very much. Although I frequently break things and then have to fix them and move forward, it is a good way I believe.

/devel/flat :: Link / Comments (0)


Fri, 16 Nov 2007

ARM MMU domains.

Grange brought me an Xscale board, so I will start the MMU domains feature implementation for the 2.6 kernel.
This is a new area for me, and it is quite time limited - I have to return the board in about two weeks, so either I will complete it and submit it for inclusion into the 2.6 kernel tree, or abandon it for lack of time.
The board requires a 24 V power setup and I do not have it right now (I do not even have two 12 V cords), so I will postpone powering it until Monday.

/devel/arm :: Link / Comments (0)


Climbing evening.

That was a very good training, Grange finally completed his manager's tasks (i.e. slacking on the meetings and fscking brains of the subordinates) so climbing training was very interesting. Not from the beginning, since he was late, but nevertheless.
Anyway, training started with the usual warming traverses, then I moved to the negative slope and tried several complex starts. I did that on the wall created for bottom rope insurance, where I had not climbed for quite a while already. It took quite a bit of time and power, so when I started to climb upstairs on the walls with Grange, I was not very fresh.
Nevertheless I managed to complete several old traces and tried a couple of new ones. The simpler one I finished without serious problems; I can probably say I finished it on-sight, although I fell a couple of times, but just because I was already too tired, not because of a lack of technique.
Another new trace was quite a complex one with a start on the horizontal negative slope, which I previously did only partially: either the horizontal slope or the rest of the trace. I tried to combine them, but fell, although I believe I will finish it next time if I start early when I have enough power.
That was really good time there!

/life :: Link / Comments (0)


Thu, 15 Nov 2007

Ground points of the filesystem development.

1. Data read/write rebalance in the filesystem.
When it is possible to add/remove storages from the system, there is a clear question about their utilisation. First, when you have your data spread over different nodes/storages, reading will always be faster, since it can be performed in parallel.
On the other hand, this can lead to heavy data fragmentation if done incorrectly (like in the case of tightly packed data in the first place, which after spreading will require heavy write/update overhead).
So, this is a good solution for read-mostly setups, but a bad choice for write-mostly cases.
The cleanest solution for this issue I see is to use copy-on-write semantics, which implies that each new write will be placed in a new location. Thus when a new storage is added to the filesystem, it will be readily utilized for new writes, which in turn can work with delayed allocation and extents, heavily reducing fragmentation.
Reading is a bit trickier: ideally data should be spread over the new storage, but having large contiguous regions for the same file is a huge win because of read-ahead logic and the way disks work, so only fragmented files have to be moved around. Here we enter defragmentation land, which is very small and easy in a copy-on-write design - the file should be read and written to get a new contiguous region, or a special operation should be introduced to do essentially the same without writing to the data (like doing that on sync or flush).

So, to summarise my ideas, the only thing needed for high-performance read and write in the case of multiple (or extendible) storages is to have copy-on-write semantics behind the IO logic with correctly implemented balancing algorithms (like proper delayed allocation and extent usage).
This is the first base point of my filesystem design.
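
To make the point about new storages concrete, here is a very small sketch of copy-on-write placement. Everything in it is an assumption made for illustration (the Storage and CowAllocator names, the "most free space wins" policy, block-sized writes); delayed allocation, extents and reclaiming of old locations are not modelled at all.

```python
class Storage:
    """One backing device, filled from offset 0 upwards."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.used = 0

    def free(self):
        return self.capacity - self.used


class CowAllocator:
    """Every write gets a fresh location; nothing is overwritten in place."""

    def __init__(self, storages):
        self.storages = list(storages)
        self.extents = {}   # (file, block) -> (storage name, offset)

    def add_storage(self, storage):
        self.storages.append(storage)

    def write(self, file, block, length=1):
        # Assumed policy: send the write to the storage with the most free space.
        target = max(self.storages, key=Storage.free)
        if target.free() < length:
            raise OSError("no space left on any storage")
        location = (target.name, target.used)
        target.used += length
        # The previous location of this block (if any) simply becomes garbage.
        self.extents[(file, block)] = location
        return location
```

Because a rewrite never reuses the old place, a freshly added (empty) storage starts receiving writes immediately, without any explicit rebalancing of existing data.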

2. Locking.
Obviously, the fewer locks you have, the less time you will spend in busy loops (zero in the perfect case).
Thus the main design principle is to allow multiple IO (simultaneous reads and writes) and metadata (file creation/deletion and so on) operations.
While multiple readers are handled just fine in the Linux kernel via generic_file_aio_read(), all writers are stuck on inode->i_mutex in generic_file_aio_write(), which effectively blocks multithreaded writing to the same file. But inode->i_mutex should actually only guard metadata updates, not the writing itself, so this issue has to be resolved in any filesystem aimed at high performance applications (no filesystem in the Linux kernel currently tries to avoid grabbing inode->i_mutex for writes). Taking into account the number of hacks I implemented for networking without touching a lot of core code, I'm pretty sure I will be able to do so for my own filesystem only.
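
One possible way to let writers to different parts of a file run in parallel is a table of byte-range locks instead of one per-file mutex. The sketch below is a userspace Python illustration of that idea only; it is not kernel code and not a finished design, and the RangeLocks class with its try_lock/unlock methods is hypothetical.

```python
import threading


class RangeLocks:
    """Per-file table of locked byte ranges, ends are exclusive."""

    def __init__(self):
        # Held only while the table is inspected or changed, never during IO.
        self._guard = threading.Lock()
        self._held = []

    def try_lock(self, start, end):
        """Lock [start, end) unless it overlaps a range that is already held."""
        with self._guard:
            for held_start, held_end in self._held:
                if start < held_end and held_start < end:
                    return False
            self._held.append((start, end))
            return True

    def unlock(self, start, end):
        with self._guard:
            self._held.remove((start, end))
```

Two writers working on [0, 4096) and [4096, 8192) both get their locks, while a third one touching [100, 200) has to wait until the first range is released.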

3. Motivation.
I do strongly believe that it is impossible to make really good things when you are forced to do them. So, my idealism tells me that when you are paid to do the work, it will not be completed in the best way. Do not confuse this with getting money for things you do for yourself or on your own initiative; they are completely different approaches.

4. Fun.
It has to be fun. If a project starts sucking the power without good feedback, it has to be completed to the next milestone and frozen. If something is not interesting, it should be avoided.

Those were my rules for a successful filesystem project; the last two items obviously apply to any other project.

Stay tuned :)

/devel/fs :: Link / Comments (0)


CEPH distributed storage.

It was announced on LWN and kerneltrap recently.
I already wrote about this filesystem; after that I found (from a discussion with Zach Brown) that this filesystem does not have byte-range locking and when a number of threads write to the same file, they become sync writes (i.e. no cache coherence protocols involved). I'm also not sure what this is about: I/O workloads should be done with the client cache off because the writeback is too non-deterministic.

Those were my envious comments :), now the good news.
First, Sage Weil (the author) works full-time on this project and funds it from his own web hosting company, so it is possible to attract developers for money (he even hired someone to write a kernel client instead of the FUSE one). Second, it has a completed design and a working implementation (although some design issues are questionable).
So, it is likely a good choice for you to take a look at, if you are searching for a solution which should be ready shortly.

/devel/dst :: Link / Comments (0)


Wed, 14 Nov 2007

Climbing evening.

That was a very good training, although quite short - after the usual warming traverses I tried a number of starts of various traces on the negative horizontal slope, where I later tried several traces. I found that apart from the horizontal slope I can already do them pretty easily, so I continue to develop power endurance on the negative slope. I can not say there is major progress in that area, but I can complete some starts on the negative horizontal slope which I previously could not, so likely there is some gain from those exercises.

/life :: Link / Comments (0)


Perfect bugs.

A recent thread started by Natalie Protasevich, who is the kernel bugzilla master now, shows a number of bugs which were reported recently against different kernel subsystems.
Andrew Morton replied, marking most of the bugs (if not almost all) as not responded to by developers at all, with some words about decreased kernel quality. This raised quite a heavy (void actually) discussion about how this should be fixed (not the bugs, but 'the process') and so on (I deleted the whole thread after reading about one quarter of all messages, if not less).

There are two interesting moments of fixing bugs, which I want to highlight here.

First, do not even expect that someone will look at your bug just because of that. I like to fix bugs, I really like to do it, but having 'no reply from developers' behind it does not force anyone to start doing so. And it is frustrating when work is being done, and done pretty well, but instead kernel leaders say that there was no reply from developers. This frustration and the complete wrongness of such an approach was shown in the thread, and I hope was taken into account.

Another issue is bug quality. I have a number of friends who are able to read minds, but right now they are all on vacation, so it is pretty hard to determine what the bug is (like 'I used a 3 years old kernel and it works bad/crashes/destroys my data/whatever'). Yes, reporting a bug is not that trivial and simple; if you want it to be fixed, please help us to do so, do not throw it over and expect things to change.
A really perfect bug has a description and a test case. While I wrote this entry I found how a performance regression, reported by Nick Piggin, was analyzed by David Miller and tested by Nick, the problem was found and the bug got fixed by Herbert Xu. Just because it was a good bug report, with tight cooperation with the reporter.
If it had not been fixed right now (Americans will go to bed soon or should be there already), I would have wanted to fix it myself, just because it contains a perfect description of the problem (performance degradation using a special benchmark tool) and it is possible to (easily) perform the same tests locally (the tool is the tbench benchmark run over loopback).

And I am pretty sure, I even insist, assure and can even prove, that when a bug is reported correctly and there is a way for developers to catch the problem, it will be fixed immediately.

Of course it is not always possible (like when the bug is in a driver and only a limited number of people have the hardware), but even then the reporter should do a bit (just a really small bit) of work - find the maintainer of the driver (it is easy - check the MAINTAINERS file in the source tree or search for the driver name in the mail archives) and kick them. Provide a lot of info, maybe resend the bug report several times (yes, people frequently forget about such things as fixing their own bugs), copy the appropriate mailing lists.
If you have enough background, start helping developers a little bit more: use git-bisect to find the exact commit which caused a regression, if it is a recent bug; or add debug prints to determine where the driver stops working; or just run different (as simple as possible) tests to show the exact condition where the problem occurs, so that developers can reproduce the problem using their own setup.
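
For those who never used it: the search git-bisect performs is a plain binary search over the history between a known good and a known bad commit. The sketch below shows only that idea under simplifying assumptions (linear history, a reliable test); first_bad and is_bad are names invented here, this is not how git itself is implemented.

```python
def first_bad(commits, is_bad):
    """Find the first bad commit in an ordered list of commits.

    Assumes commits[0] is known to be good and commits[-1] is known to be
    bad, and that every commit after the regression is bad too.
    `is_bad` is whatever test reproduces the problem on a given commit.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        middle = (good + bad) // 2
        if is_bad(commits[middle]):
            bad = middle      # regression is at `middle` or earlier
        else:
            good = middle     # regression came after `middle`
    return commits[bad]
```

Each test halves the remaining range, so even a thousand commits between the working and the broken kernel need only about ten builds and test runs.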

These are really simple things to get bugs fixed, and we really can fix 'the process' without all those words, just by doing things.

/devel/other :: Link / Comments (0)


Tue, 13 Nov 2007

Moved to development shop.

To get some small things - screws, drills for ceramics, nuts, chains and so on; I also got bits of colour to draw uchuu on the wall.
I searched for a shower cabin, but the "Leroy Merlin" shop has only crappy stuff; maybe only a couple of cabins were good, but they contain things I really do not want to have there, like a radio, own shower sockets, a seat and so on... I want a simple thing: two glass walls (maybe rounded) with good mechanical parts and fixings.
Still searching...

/devel/flat :: Link / Comments (0)


Mon, 12 Nov 2007

Climbing evening.

It was a very good training - I got very tired climbing several new traces on the vertical wall but with a start on the horizontal negative balcony. The end of the training was devoted to simplified climbing - I started not at the balcony, but higher on the vertical part, but even that sucked power very noticeably on the new traces.
Anyway, th