Mon, 31 Dec 2007
Apartment development mostly completed.
I finished the painting, gluing, covering and lights, and called friends.
There are a number of tasks still to be completed, but most are very
small, so they can be postponed for a while.
I am really excited about my loft: it
looks very interesting to me, and there is nothing
I would like to change or fix.
Groovy!
/devel/flat :: Link / Comments (0)
Sun, 30 Dec 2007
Meanwhile on the apartment development side.
A lot of changes. A huge step forward was made today (and last night).
So far I have completed my table (although it has only a single layer of varnish
and an unfinished rim), mostly finished the kitchen (there was not enough
wallpaper, but I painted the ceiling, glued all the wallpaper I had,
and will lay the floor cover tomorrow), and finished painting the room
(I have a blue wall now) and the hall (no uchuu yet).
So, the only thing really needed is to remove the huge amount of dust and garbage
from the apartment and then lay the floor cover.
I think I'm ready for the New Year celebration, and the amount of work I have done will
absolutely end in a good celebration.
/devel/flat :: Link / Comments (0)
Sat, 29 Dec 2007
POHMELFS abbreviation.
POHMELFS stands for Parallel Optimized
Host Message Exchange Layered
File System.
It now has a metadata cache on the client. The cache contains just
pohmelfs inodes, which are indexed by
three keys: the first is a tuple of the
name hash, parent inode number and length of the name string (this is meant
to ensure that no two tuples are identical), the second is the
inode number, and the last one is the offset in the address space of the parent.
Cache updates are independent of cache usage, although both are guarded
by the same lock.
/devel/fs :: Link / Comments (0)
Fri, 28 Dec 2007
Table development.
It so happened that I changed my table design once more; now
it has a single right angle. Since my wooden door has
gaps between the wood plates, I decided not to remove the orgalite
(paper filled with glue) sheets from the top and bottom of the
wood plates, but orgalite does not soak up a mordant, so I will
get a coloured varnish and cover the table top again.
It is quite hard to work with wood, especially on
straight edges, without a plane, using only a knife and an electric jigsaw,
so I will buy a plane too. I also need a set of chisels.
Given that, the table is still at a very early development stage,
but one can check a preliminary photo (taken with a phone,
so the quality is not very good)
here.
Since the table is postponed, I will paint the walls in the hall and
try to clean my loft a bit. I would also like to set up a boiler and
start the last arch (in place of the kitchen door), but that is too
loud a process, so I will likely postpone it until tomorrow too.
It will be a busy day, and quite a short one actually: Mephody and Irin arrive,
which is the start of the NY celebration process!
/devel/flat :: Link / Comments (0)
Thu, 27 Dec 2007
Meanwhile on the apartment development side: blue wall and table.
Yes, I've made it: now I have a blue wall. The colour is called 'royal marine',
and although it does not look like the sea (I have not been near the
sea for so many years that I do not even recall when the last time was;
maybe the sea has changed, and I have never seen an ocean at all, so I will
change that too), it looks great. But to spoil the mood,
the devil told me to buy too little paint, so I actually have only three quarters
of the wall painted. I will fix it tomorrow or the day after, when I
go to the hardware store to get an LED cord for the
ceiling. So, painting in the room and hall will be finished very soon.
Roughly the same applies to my
table
development. Not that I made much progress, but I cleaned my old wooden entrance door
(which is the base for the table), put it on the floor and drew the table contour on it.
It looks very expressive and completely different from the above pictures: there
are no straight lines (although its base is the letter 'L'),
it will have only a single leg (I bought it today) at the
end of the longer side, and its opposite side will be attached to the walls
and will be level with the window sill. Maybe later I will replace the single leg
with a floor-to-ceiling one: that side of the table is essentially round,
so it will be convenient to put some round (glass) shelves there.
I tried to saw the table using my electric jigsaw, but it is quite a loud process
and it is already about 24:00 in Moscow,
so I postponed it until tomorrow or later. It has to be completed this year,
since I need a table to put a lot of things on (for example a fir tree made of
the lots of empty beer bottles and jars I have here).
Since I have no chairs, I will make a couple of long benches if I have enough
material (there are two doors here: the bigger one (about 200x90 cm)
is used as the base for the table, the smaller one (200x60 cm) for the smaller part
of the letter 'L', and the rest will be used for benches).
I could get wood plates in the hardware store (it has a lot of interesting
types), but I have some trouble getting them home
without a car and do not want to wait for delivery (which would take about a week).
I also got a water hatch for my bathroom today, but I will not install it, since
I have no glue for ceramic tiles, so it is better to devote this time to other interesting
tasks.
Expect some photos of my loft closer to NY time...
/devel/flat :: Link / Comments (0)
Wed, 26 Dec 2007
New release of the distributed storage: Groundhogs strike back: no New Year for humans!
Short changelog:
- mirroring algorithm improvements
- debug cleanups
- extended mirroring initialization
- documentation update
- name is 'Groundhogs strike back: no New Year for humans' now
As usual, one can get the patch or pull the changes from the project
homepage.
/devel/dst :: Link / Comments (2)
CDMA (EVDO) vs GPRS.
Good people gave me a SkyLink CDMA modem (USB model CNU-550 with EVDO support)
to test the internet connection in Linux and compare it against GPRS.
Well, here are my conclusions:
- CDMA always works, while GPRS really sucks in the middle of the day. This still has to be confirmed
tomorrow, but I tested the CDMA modem at about 19:00 and it worked fine, while MTS (Mobile TeleSystems
in Russia) at about 12:00 worked very badly (I connected quickly, but an ssh login took an enormously long time).
- CDMA speed is usually about 10-16 kb/s, while GPRS usually cannot go higher than 1-2 kb/s.
More on this: I believe a CDMA session can reach higher speeds (the pppd session requests about
900 kbit/s), but the behaviour of the initial login (I saw quite a few of them at different initial speeds
during usual work and userspace network stack testing)
shows that there is some limitation on the server or hardware side (i.e. on the SkyLink side, either because of
a special tariff scale or because of driver limitations; I was told there is no 16 kb/s limitation
with the given card, though).
A note to self: test different congestion
control algorithms and develop my own if needed, since the speed frequently drops below 10 kb/s
during the download of a big file. SkyLink is a very interesting source of data for such development:
it looks like its RTT is quite high for the default (new CUBIC) congestion control; at least a low-traffic but
latency-sensitive source (
mutt over ssh on remote host
without serious traffic shaping) works quite badly in this setup.
Very likely this is just empty speculation and the problem is in the hardware on the server side (i.e.
it supports bulk streaming access at high speeds (16 kb/s), but fails to work well
with low-latency applications, which exchange small packets in a ping-pong pattern).
- The CDMA USB CNU-550 modem works fine in Linux (modulo the above issues) with these peers/pap-secrets
files without any problems.
Anyway, CDMA SkyLink is much faster and smoother than MTS GPRS, so the decision
about which is better is quite obvious...
/other :: Link / Comments (3)
Tue, 25 Dec 2007
Continuing CRFS debates.
Zach Brown again shed
some light on his CRFS design and implementation. Let's compare the facts with
my thoughts.
The most exciting news is that
CRFS caches not only data but also metadata on the client, and flushes it to the server on writeback.
That is what allows it to have 4-6 times higher performance in metadata-intensive
operations.
Another piece of news is actually quite bad for the majority of potential CRFS users:
the userspace server is btrfs
specific, which can be another source of gain in the benchmarks (although it should
be noticeably smaller than the metadata caching part). The server does not require
any additional patches, but since it is btrfs specific, it likely worked
on top of a ramdisk (when the test was performed with RAM storage), not tmpfs.
The userspace server has exclusive access to the given block device, so it is not possible
to simultaneously mount it in the usual way (probably it is possible to mount
it read-only whilst it is used by CRFS).
The client kernel module depends only on the ->write_begin()/->write_end()
patchset by Nick Piggin, which was added to mainline
recently.
Batching of network requests happens naturally in the request/reply protocol:
a reply carries not just the single requested item, but a set of them. Since the client caches
metadata, it can check whether data is in the cache or not and update it if needed.
With that knowledge, let's summarize the given bits:
- CRFS is btrfs specific, while pohmelfs is supposed to be fs-agnostic. This CRFS feature
allows faster (probably even noticeably faster) access to on-disk data. Do not think
of it as a bad sign; consider CRFS a client-server filesystem: no one claims AFS is bad because
there is only an AFS-specific kernel server. It is the same here, but the server is in userspace.
From another point of view, not being able to work with the same btrfs volume locally can be a
show-stopper for some users.
- Metadata caching. That rocks. It has to be implemented.
- Extended request/reply protocol: i.e. do not reply with only a single object (unless that was
explicitly requested), but try to combine objects. The most obvious example is the
->readdir()
callback, where each request from the client should transfer multiple objects, which will then
be cached.
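To make the "combine objects in one reply" idea concrete, here is a minimal sketch of packing several directory entries into a single reply buffer. The wire format (struct reply_entry) and both helpers are hypothetical and invented for this illustration; the real CRFS and pohmelfs protocols are not described in the post.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire header for one object inside a batched reply. */
struct reply_entry {
    uint64_t ino;       /* inode number of the object */
    uint32_t name_len;  /* length of the name that follows the header */
};

/*
 * Append one (ino, name) pair to buf at offset *used.
 * Returns 0 on success, -1 if the entry does not fit.
 */
int reply_pack(uint8_t *buf, size_t cap, size_t *used,
               uint64_t ino, const char *name)
{
    struct reply_entry e;
    size_t need;

    assert(*used <= cap);
    memset(&e, 0, sizeof(e));
    e.ino = ino;
    e.name_len = (uint32_t)strlen(name);
    need = sizeof(e) + e.name_len;
    if (cap - *used < need)
        return -1;
    memcpy(buf + *used, &e, sizeof(e));
    memcpy(buf + *used + sizeof(e), name, e.name_len);
    *used += need;
    return 0;
}

/* Count the entries in a packed reply by walking the headers. */
size_t reply_count(const uint8_t *buf, size_t used)
{
    size_t off = 0, n = 0;

    while (off + sizeof(struct reply_entry) <= used) {
        struct reply_entry e;

        memcpy(&e, buf + off, sizeof(e));
        off += sizeof(e) + e.name_len;
        n++;
    }
    return n;
}
```

The client side would walk the same headers and insert every object into its cache, so one network round trip serves many later lookups.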
I think I was correct in most if not all of my predictions about CRFS; probably I should try forecasting the weather next time...
Given that, I have clear expectations of what pohmelfs should have and which results we should expect.
The CRFS project is a serious step forward in this area, so it is very exciting to work with its ideas
and move further.
Stay tuned!
/devel/fs :: Link / Comments (2)
Mon, 24 Dec 2007
Climbing evening.
That was hard. That was really bloody hard, but great training.
I climbed a number of new routes. First, to warm up,
I tried something new without a label (the new yellow
route in the left vertical sector). It happened to be
quite a complex route, so I completed it with a couple of falls,
since I did not know its exact holds. The next several attempts
were all on the same part of the complex route that starts on the horizontal
negative slope, but I skipped that start, since I wanted to know
how the route behaves higher up. I already knew that its start is very
complex and fully corresponds to its category (the red 7a route in the middle sector).
There were also a couple of simpler routes I did to warm up and at the very end
to completely drain my strength and get the blood going.
It was an excellent time!
/life :: Link / Comments (0)
First CRFS (cache coherent remote file system) results.
Zach Brown posted
the first public results of his CRFS filesystem.
He compared NFS and CRFS with the remote storage on disk (likely btrfs) and in RAM (tmpfs) for two operations:
a big number of file/dir creations (a lot of metadata operations) with small writes (untarring a kernel archive),
and reading all that data into RAM.
In both tests CRFS is noticeably faster: the metadata operation test (untarring a kernel archive) is 4 times faster
for disk storage and about 6 times faster for RAM, and CRFS reading is about 1.8 times faster than NFS.
Very impressive results, although without knowledge of the CRFS internals it is quite hard
to tell where and how such a gain was achieved, so I will handwave here :)
When CRFS is opened (if it is), we will check my thoughts...
First, since there was a tmpfs test, the userspace server does not use anything btrfs specific (like open
by inode). There is a possibility that btrfs exports some ioctls or that the kernel was patched, but for now
I will not consider this a fact. So, first: the userspace server can work on top of any filesystem.
Second, reading is only about 2 times faster, while metadata operations are 4-6 times faster. Zach says
it is limited by disk speed, so this means metadata was heavily cached. There is a question, though: does
the server see the latest metadata change, or will it be sent to the server only when another client accesses
the cached data (so that the caches become coherent)? Taking into account that NFS always sends metadata changes,
it looks like CRFS does not. If that is correct, then there is another question: does it need to send metadata updates
at all until a sync or flush is started?
Third, the userspace server is fast. With the logic I
described for pohmelfs server,
I think pohmelfs will not be able to compete, so there is food for thought.
Fourth, the network protocol in CRFS batches requests. This can be done either because of a special transactional layer
between the VFS callbacks and the network, or because of the way the VFS callbacks work: for example, data is not sent
in the ->commit_write() callback, but only in ->writepage(), and only if there is a strong demand
for that. The same applies to metadata operations: how are they batched, and network communication reduced,
to get a 4-6 times performance increase? The simplest approach is never to send them at creation time at all,
but only when writeback for the files starts (or the cache-coherence algorithm requires it). So when, for example, a directory
is created, only a notification about the dirty parent dir is sent, and when a new file is created in this new dir, the content
of the directory is transferred.
Anyway, pohmelfs currently has none of the features above (it is actually read-only), but I already
see where it can be improved: for example, directory listing (the ->readdir()
callback) is invoked for each access (i.e. each ls /mnt forces the directory content to be resent), since
pohmelfs does not cache it.
There is a fair number of changes I want to implement to catch up with CRFS (I think so :), so stay tuned. I will
implement basic functionality first and will run the same tests too...
Making bets? I vote for slower-than-NFS speeds, because of the bad userspace support and no
caching of the metadata.
But pohmelfs has been in development for only 3 days, it is quite young... So, stay tuned.
/devel/fs :: Link / Comments (0)
Sun, 23 Dec 2007
Continuing apartment development.
I think I have finished gluing ceramic tiles, at least for this year: first, I have no glue
left; second, the only thing left to glue is one vertical strip 2.5 tiles wide,
where the water hatch will be located, and since I do not have a hatch yet, I am not gluing those tiles.
The dirty work in the bathroom is essentially complete: I will fill the 2 mm gaps
between the tiles with plaster and attach the ceiling soon, and that will be the end.
The next dirty work is ceramic granite in the hall and cloakroom. That will take some time,
but since I have no glue and am not sure it will be delivered this year, I can postpone this
task too. So, the main remaining issues are finishing the painting and the table.
When my head aches I frequently think up something really interesting and new,
so I have a new design for the table in mind; if I complete it, it will be really great.
First, the table is not movable, but attached to a wall (potentially to two walls in the corner).
The end of the table that does not touch the wall is round and has a single leg (preferably a steel tube);
the other part, which touches the wall, has a turn, so the table looks like the letter 'L' with the smaller part
attached to the wall (and window), while the bigger part is accessible from both sides.
Or something like that...
/devel/flat :: Link / Comments (0)
GPRS sucks!
Even the MTS one is so slow... It is enough to
read emails and check some news (you know, I have infinite patience,
which almost never runs out), but I want a normal connection.
I know there is no Ded Moroz (Santa Claus), so there will be no fast
internet until after the New Year vacations (10 days in Russia),
when I will start kicking local ISPs again.
/other :: Link / Comments (0)
Sat, 22 Dec 2007
Meanwhile on the apartment development side.
I painted most of the room, and then decided to make one wall either ultramarine
or just marine blue. Just because I want to, so I am waiting for the paint; otherwise
the room would be completed.
I also glued some ceramic tiles in the bathroom; it is almost ready too.
Today I spent most of the time cutting tiles with a
corner-grinding machine
to make different shapes for corners, hatches, the door and so on. I became dirty as hell,
but completed everything except the corners and ceiling. Hopefully I will finish them tomorrow
(or likely not :).
/devel/flat :: Link / Comments (0)
Fri, 21 Dec 2007
Anatomy of the filesystem ->readpage() callback.
This callback is used to read a page from the storage into RAM. It has the following prototype:
static int pohmelfs_readpage(struct file *file, struct page *page)
where file is the object associated with the file opened in userspace,
and page is the page where the filesystem has to put the data.
On-disk filesystems usually use VFS helpers (like mpage_readpage()
or block_read_full_page()), which map the page into a set of buffer_head
objects, which are then submitted to the block layer, where the next level of reading from the
disk happens. This mapping is implemented via a per-filesystem get_block()
callback.
Pohmelfs does not follow this standard, since it does not know which filesystem is
on the remote side, and since there is no block device under it. So it just
uses a request/reply protocol to get the given page from the remote host. The page structure
already contains its offset from the beginning of the file (from the beginning of the
address space, actually), and it is locked, so simultaneous access is not possible.
So we only need to fetch the data and, if the copy was successful, mark the page as uptodate.
Simple.
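As a rough illustration of the request side, the sketch below turns a page index into a byte range to ask the server for. The request structure, the 4096-byte page size and the function name are all assumptions made for this example; the actual pohmelfs wire format is not shown in the post.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12                      /* assumption: 4096-byte pages */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Hypothetical request header for fetching one page from the server. */
struct read_request {
    uint64_t ino;     /* which remote object to read */
    uint64_t offset;  /* byte offset of the page in the file */
    uint32_t size;    /* number of bytes wanted */
};

/* Build the request for the page with the given index in the address space. */
struct read_request page_request(uint64_t ino, uint64_t page_index,
                                 uint64_t file_size)
{
    struct read_request req;
    uint64_t off = page_index << SKETCH_PAGE_SHIFT;

    assert(off < file_size);
    req.ino = ino;
    req.offset = off;
    /* The last page of a file may be only partially filled. */
    if (file_size - off < SKETCH_PAGE_SIZE)
        req.size = (uint32_t)(file_size - off);
    else
        req.size = (uint32_t)SKETCH_PAGE_SIZE;
    return req;
}
```

For a 10000-byte file, page 0 maps to offset 0 with a full 4096 bytes, while page 2 maps to offset 8192 with only the remaining 1808 bytes.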
Here is the result:
server $ md5sum /tmp/ltp-full-20071130.tgz
77bf4032c10c03e858512a5a90c05015 /tmp/ltp-full-20071130.tgz
client # md5sum /mnt/tmp/ltp-full-20071130.tgz
77bf4032c10c03e858512a5a90c05015 /mnt/tmp/ltp-full-20071130.tgz
/devel/fs :: Link / Comments (0)
Anatomy of the filesystem. ->lookup() and ->read_inode() callbacks. First pohmelfs results.
I talked about the ->readdir() callback
previously,
now it's time to look at the other two most significant callbacks in the VFS layer.
I call these three the most significant, since without them it is impossible to
mount a filesystem and get data from it; they have to be implemented by any FS.
OK, let's first look at ->lookup().
It has the following prototype:
struct dentry *pohmelfs_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd)
As the name suggests, this callback is used to look up the inode for a given directory entry.
One can check struct dentry: it contains a qstr field, which in turn
has a char array containing the name, as well as its length and hashed value (used in the dentry cache).
When the inode number is found for the given directory entry, the inode has to be allocated and filled
with metainformation. It should then be added to the dentry:
	err = -ENOMEM;
	inode = iget(dir->i_sb, cmd->ino);
	if (!inode)
		goto err_out_free;
	kfree(data);
	d_add(dentry, inode);
That's all for this callback. Pohmelfs uses a simple request/reply protocol to get the inode for a given name.
The userspace server is rather dumb and keeps a linked list (it will be changed to a tree) of all object names
in a given directory, so it looks the parent directory up, then finds the given name in that dir, and then sends
the data to the client. This operation can potentially be fast (only two tree lookups: one to get the parent dir
in the main tree and one to find the object in the dir).
In future the pohmelfs client can cache the received information, so that subsequent access to the same dir would
not require rather slow network operations. Right now it does not.
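The two-step server lookup described above (find the parent directory, then find the name inside it) can be sketched with plain linked lists. The structures and function names here are hypothetical; the real fserver code is not shown in the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical server-side structures; names are illustrative only. */
struct obj {
    const char *name;
    uint64_t ino;
    struct obj *next;      /* next object in the same directory */
};

struct dir {
    uint64_t ino;          /* inode number of the directory itself */
    struct obj *objects;   /* linked list of objects in this directory */
    struct dir *next;      /* next known directory */
};

/* Step 1: find the parent directory by its inode number. */
static struct dir *find_dir(struct dir *dirs, uint64_t parent_ino)
{
    for (; dirs; dirs = dirs->next)
        if (dirs->ino == parent_ino)
            return dirs;
    return NULL;
}

/* Step 2: find the name inside it. Returns the inode number, or 0 if absent. */
uint64_t server_lookup(struct dir *dirs, uint64_t parent_ino, const char *name)
{
    struct dir *d = find_dir(dirs, parent_ino);
    struct obj *o;

    assert(name != NULL);
    if (!d)
        return 0;
    for (o = d->objects; o; o = o->next)
        if (strcmp(o->name, name) == 0)
            return o->ino;
    return 0;
}
```

Both steps are linear scans here; replacing each list with a tree gives the "two tree lookups" cost mentioned above.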
The second callback is ->read_inode(). As the name suggests, it has to read
the inode's metainformation from disk into RAM. It has the following prototype:
static void pohmelfs_read_inode(struct inode *inode)
Quite simple. The following members have to be filled in this callback:
- i_mode - file mode (file/dir/something else, access rights)
- i_nlink - number of links to this inode
- i_uid/i_gid - uid/gid of the owner
- i_blocks - number of blocks allocated for this object on disk
- i_rdev - if the object is not a regular file, this will hold the device numbers
- i_size - size of the object
- i_version - used by some filesystems to show that the given inode is dead (or not uptodate)
- i_blkbits - 1 shifted left by this number gives the filesystem block size
- i_mtime/i_atime/i_ctime - modify/access/change time of the given inode
- i_fop - file operations for the given inode; these include read/write/readdir/aio_read and so on
- i_op - inode operations; these include lookup
- a_ops - address space operations (set via inode->i_mapping); these include readpage/writepage/sync_page/prepare_write/commit_write
Pohmelfs uses a simple request/reply protocol to get this information from the remote server (except
the various operation tables).
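The i_blkbits arithmetic from the list above is easy to check in isolation. This is a small standalone sketch; both helper names are made up for the example and do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Return bits such that (1 << bits) == block_size; block_size must be a power of two. */
unsigned int blkbits_for(uint32_t block_size)
{
    unsigned int bits = 0;

    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    while ((1U << bits) < block_size)
        bits++;
    return bits;
}

/* Number of filesystem blocks needed to hold size bytes, rounding up. */
uint64_t blocks_for_size(uint64_t size, unsigned int blkbits)
{
    uint64_t block_size = 1ULL << blkbits;

    return (size + block_size - 1) >> blkbits;
}
```

For a 4096-byte block the shift is 12, and a 4097-byte object then needs two such blocks.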
Having that, one can create a simple network filesystem
$ wc -l fs/pohmelfs/*.[ch]
120 fs/pohmelfs/config.c
218 fs/pohmelfs/dir.c
417 fs/pohmelfs/inode.c
96 fs/pohmelfs/net.c
169 fs/pohmelfs/netfs.h
1020 total
which allows reading data from the remote server
$ wc -l ./fserver/*.[ch]
267 cfg.c
750 fserver.c
581 list.h
390 rbtree.c
164 rbtree.h
2152 total
Note that I just took rbtree.[ch] and list.h from the kernel sources.
Here is an example on client machine:
# ./cfg -a 192.168.4.81 -p 10250 -i 0
# mount -t pohmel /dev/hdb1 /mnt
# ls -l /mnt/
total 88
drwxr-xr-x 2 root root 4096 2007-12-21 15:01 bin
drwxr-xr-x 4 root root 3072 2007-12-21 15:01 boot
drwxr-xr-x 11 root root 3780 2007-12-21 15:01 dev
drwxr-xr-x 105 root root 12288 2007-12-21 15:01 etc
drwxr-xr-x 6 root root 4096 2007-12-21 15:01 home
drwxr-xr-x 14 root root 4096 2007-12-21 15:01 lib
drwx------ 2 root root 16384 2007-12-21 15:01 lost+found
drwxr-xr-x 2 root root 4096 2007-12-21 15:01 media
drwxr-xr-x 2 root root 0 2007-12-21 15:01 misc
drwxr-xr-x 4 root root 28 2007-12-21 15:01 mnt
drwxr-xr-x 2 root root 0 2007-12-21 15:01 net
drwxr-xr-x 2 root root 4096 2007-12-21 15:01 opt
dr-xr-xr-x 197 root root 0 2007-12-21 15:01 proc
drwxr-x--- 9 root root 4096 2007-12-21 15:01 root
drwxr-xr-x 2 root root 4096 2007-12-21 15:01 sbin
drwxr-xr-x 5 root root 0 2007-12-21 15:01 selinux
drwxr-xr-x 3 root root 4096 2007-12-21 15:01 srv
drwxr-xr-x 5 root root 4096 2007-12-21 15:01 storage1
drwxr-xr-x 7 root root 4096 2007-12-21 15:01 storage2
drwxr-xr-x 12 root root 0 2007-12-21 15:01 sys
drwxrwxrwt 20 root root 4096 2007-12-21 15:01 tmp
drwxr-xr-x 13 root root 4096 2007-12-21 15:01 usr
drwxr-xr-x 23 root root 4096 2007-12-21 15:01 var
# mount | grep mnt
/dev/hdb1 on /mnt type pohmel (rw)
Believe it or not, that is exactly the content of '/' on my desktop,
which is used as the server.
The next step is the readpage/writepage/prepare_write/commit_write callbacks, which will allow
reading and writing files.
Stay tuned.
/devel/fs :: Link / Comments (0)
Thu, 20 Dec 2007
open-by-inode() vs. name lookup in network filesystems.
A network filesystem is a tricky beast: depending on where it is implemented
(kernel or userspace) it is very different. By 'very' I mean really complex differences.
In the kernel an inode, the basic identity of an object, always exists for all objects
looked up before (until special steps are completed and the inode is dropped, but usually
it stays alive). For example, when you traverse some dir, the inodes for every object
you checked continue to exist, even if you no longer use that directory.
When a file is opened, the inode is attached to the file; when the file is closed, the inode
lives on. This is a fundamental feature of the split between directory entries and inodes:
directory entries are linked into the tree, which we can see, but inodes
are shadow objects behind those entries.
In userspace things are completely different: there are no inodes, only files,
identified by file descriptors. That's all. So, when the kernel performs a lookup,
it checks some name in the inode with a given number, i.e. it performs an in-kernel
reference-by-inode operation, but in userspace there is no API (except rare special cases,
which I think Zach uses in
CRFS,
and that is likely a good speedup for Btrfs)
to get a file handle by inode number. Basically, userspace should either have
an open file descriptor for the parent directory, or perform a reverse lookup,
build a path and open the directory to check whether some object exists there, since
userspace can only work with file descriptors.
open-by-inode was called fundamentally broken by Linus Torvalds
for a number of reasons (namely because of races with directory layout changes
like move and rename), and likely that is correct, but the absence of such an API
greatly reduces the performance of userspace metadata operations.
Having the network fileserver in the kernel is of course much (MUCH) simpler and faster,
but for now its implementation will be postponed a bit.
The initial server will be quite dumb: it will always perform a lookup from the
root and always close the directory; later it will be possible to add a cache of opened directories...
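The "dumb" userspace approach can be sketched with the standard POSIX directory API: open the directory by its full path, scan it for the name, and close it again every time. The function name is hypothetical and this is not the fserver code, just an illustration of the open/scan/close pattern.

```c
#include <assert.h>
#include <dirent.h>
#include <string.h>
#include <sys/types.h>

/*
 * Look a name up in the directory given by its full path from the root.
 * Returns the inode number of the entry, or 0 if the directory cannot be
 * opened or the name is not there. Nothing is cached: the directory is
 * opened and closed on every call.
 */
unsigned long long lookup_by_path(const char *dir_path, const char *name)
{
    DIR *d;
    struct dirent *e;
    unsigned long long ino = 0;

    assert(dir_path != NULL && name != NULL);
    d = opendir(dir_path);
    if (!d)
        return 0;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, name) == 0) {
            ino = (unsigned long long)e->d_ino;
            break;
        }
    }
    closedir(d);
    return ino;
}
```

A cache of opened directories would replace the opendir()/closedir() pair with a lookup of an already opened handle, which is where the later speedup would come from.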
/devel/fs :: Link / Comments (8)
Wed, 19 Dec 2007
Anatomy of the filesystem. ->readdir() callback.
Here I will write simple notes about how some callbacks are
used in the Linux VFS and what a filesystem writer should implement
to be correctly understood by the VFS layer.
Let's start with essentially the first callback invoked by the VFS after
the fs has been mounted. As the name suggests, ->readdir()
is used to read directory content. Its prototype looks like this:
static int pohmelfs_readdir(struct file *filp, void *dirent, filldir_t filldir)
where filp is the file structure connected to the root inode (which you
have to initialize in the ->fill_super() callback to be able to mount the fs),
dirent is a magic structure which hosts all the directory content you will read,
and filldir is a function which transforms directory names
into the dirent structure.
Its prototype looks like this:
int filldir(void * __buf, const char * name, int namlen, loff_t offset,
	u64 ino, unsigned int d_type)
and it is invoked this way:
	size = 1;
	if (filldir(dirent, ".", size, filp->f_pos, inode->i_ino, DT_DIR) < 0)
		return -1;
	filp->f_pos += size;
I think every step is very straightforward, except the last two arguments.
The former is the inode number, which is the unique id of the object; every filesystem
has to store it on disk for every inode. Obviously, in Unix '.' refers to the current
dir, so its inode number should be taken from the current dir inode. For the '..' directory,
which is the parent of the given one, filldir() is executed in the following way:
	size = 2;
	if (filldir(dirent, "..", size, filp->f_pos, parent_ino(filp->f_path.dentry), DT_DIR) < 0)
		return -1;
	filp->f_pos += size;
and for some other dir:
	size = 8;
	if (filldir(dirent, "test_dir", size, filp->f_pos, 14, DT_DIR) < 0)
		return -1;
	filp->f_pos += size;
where '14' is the inode number of the 'test_dir' subdir.
The directory listing for this filesystem will look like this (data from a live pohmelfs setup):
# ls -la /mnt/
total 9
drwxr-xr-x 1 root root 4096 1969-12-31 20:02 .
drwxr-xr-x 21 root root 1024 2007-02-08 15:04 ..
drwxr-xr-x 1 root root 4096 1969-12-31 20:02 test_dir
# mount | grep mnt
/dev/hdb1 on /mnt type pohmel (rw)
The last parameter of filldir() is the type of the directory entry; DT_DIR
is for directories, and it corresponds to bits 12-15 of the stat.st_mode returned by the
stat() call.
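The relation between st_mode and the directory entry type is just a mask and a shift. The sketch below uses local copies of the Linux constant values (prefixed SK_ to make clear they are defined here, not taken from system headers) so that it stays self-contained.

```c
#include <assert.h>

/* Local copies of the Linux values, defined here to keep the sketch standalone. */
#define SK_S_IFMT   0170000u  /* file type mask in st_mode (bits 12-15) */
#define SK_S_IFDIR  0040000u  /* directory */
#define SK_S_IFREG  0100000u  /* regular file */
#define SK_DT_DIR   4u        /* directory entry type for directories */
#define SK_DT_REG   8u        /* directory entry type for regular files */

/* Extract bits 12-15 of st_mode, which gives the directory entry type. */
unsigned int mode_to_dtype(unsigned int st_mode)
{
    /* Mode values in this sketch are expected to fit in 16 bits. */
    assert((st_mode & ~0177777u) == 0);
    return (st_mode & SK_S_IFMT) >> 12;
}
```

So a directory with mode 040755 yields 4 (DT_DIR) and a regular file with mode 0100644 yields 8 (DT_REG).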
Note that ->readdir() will be invoked (by ls -la at least) until
filp->f_pos stops changing. So after you have filled your directory entries and properly updated
filp->f_pos, you have to check whether the provided filp->f_pos has reached
the size of the directory (here I mean the overall size used by every copied directory entry), and if it has
reached or exceeded it, just return 0.
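The termination rule can be tried out in userspace with stand-ins for the kernel objects. Everything below (fake_file, fake_readdir, the callback type) is hypothetical and only mimics the shape of the kernel interface; positions advance by name length, as in the snippets above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-ins for the kernel objects; all names are hypothetical. */
struct fake_file { long f_pos; };
typedef int (*fake_filldir_t)(void *buf, const char *name, int namlen, long offset);

static const char *entries[] = { ".", "..", "test_dir" };
#define NR_ENTRIES (sizeof(entries) / sizeof(entries[0]))

/* Total "size" of the directory: the sum of all name lengths. */
static long dir_size(void)
{
    long sum = 0;
    size_t i;

    for (i = 0; i < NR_ENTRIES; i++)
        sum += (long)strlen(entries[i]);
    return sum;
}

/* Emit every entry at or after f_pos, advancing f_pos by each name length. */
int fake_readdir(struct fake_file *filp, void *buf, fake_filldir_t filldir)
{
    long pos = 0;
    size_t i;

    if (filp->f_pos >= dir_size())
        return 0;                       /* everything was already returned */
    for (i = 0; i < NR_ENTRIES; i++) {
        int len = (int)strlen(entries[i]);

        if (pos >= filp->f_pos) {
            if (filldir(buf, entries[i], len, filp->f_pos) < 0)
                return -1;
            filp->f_pos += len;
        }
        pos += len;
    }
    return 0;
}

/* Callback that just counts the entries it is given. */
int count_entry(void *buf, const char *name, int namlen, long offset)
{
    (void)name; (void)namlen; (void)offset;
    (*(int *)buf)++;
    return 0;
}

/* Call fake_readdir() until f_pos stops changing, as the caller of ->readdir() does. */
int list_all(void)
{
    struct fake_file f = { 0 };
    int count = 0;
    long prev;

    do {
        prev = f.f_pos;
        if (fake_readdir(&f, &count, count_entry) < 0)
            return -1;
    } while (f.f_pos != prev);
    assert(f.f_pos == dir_size());
    return count;
}
```

The second call sees f_pos already equal to the directory size and returns 0 without emitting anything, which is what ends the loop.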
So, how should a network filesystem
behave here? The answer is pretty simple: it should just send a request to the remote server for the
directory listing, copy the answer to an allocated buffer and fill the directory with the provided data. It is possible
to cache that data here, but each subsequent ->readdir() has to check with the server that the data
is still valid and has not changed.
With this work pohmelfs becomes a network filesystem,
with many interesting features I have in mind, which I will reveal when they get implemented.
This is not intended for mainline inclusion, since Zach Brown's work came first
and will likely be more stable and/or feature complete by the time this stuff becomes ready.
But nevertheless, stay tuned...
/devel/fs :: Link / Comments (0)
Tue, 18 Dec 2007
Fundamental race between block layer/IO and networking.
This post is about the impossibility of working without races with the
network's ->sendpage() method, which is used mostly
to transfer IO-mapped pages, without either turning off offload capabilities
and copying data into a new buffer or using own acks in the protocol.
In the optimised case (hardware supports checksum offloading
and scatter/gather), ->sendpage() will not copy the content of the page to a new buffer, but instead
will increase the page's reference counter, so that the page cannot be freed. When
->sendpage() returns, this does not guarantee that the data was sent, received by the remote
side or whatever, since the packet can be queued (in hardware or a qdisc) and can later be retransmitted.
There is no way to know that the data was received until the ACK (let's talk about TCP)
is received, but there is no API to learn that the ACK was received. When the ACK is received,
the appropriate packet will be found in the TCP retransmit queue and freed, which will drop the page's
reference counter.
If the user expects (and there is actually no other way) that after
->sendpage() returns the data can be processed (for example rewritten),
then there is a non-zero probability that the remote side will get this new data instead
of the old, which can lead to state machine breakage and data corruption.
One can try to use sendfile() and simultaneously write data to the
file: the remote side can get a mix of the old and new data. One can argue that using proper locking
around sendfile() and write will help, but actually it will not.
Consider the case when we send only a single page: after sendfile() has returned,
the data can still be in the queue, so a subsequent write, which no longer races with
sendfile() itself but still races with the actual sending of the data, will overwrite the data, and
the remote side will get the new data instead of the old.
There are two fixes for this problem: the first is not to use ->sendpage()
(or to use it with a copy of the data into a new buffer, which is essentially how
the usual send() works); the second is to use a protocol-specific acknowledgement
system, so that any subsequent operation on the given data is postponed not until
->sendpage()/sendfile() returns, but until that ACK is received.
Both greatly harm performance.
I would be really glad to find that my conclusions are incorrect.
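The race can be modelled in a few lines of userspace code: a queued packet either keeps a reference to the caller's page (the zero-copy path) or owns a snapshot of it (the copying path). This is a toy model with made-up names and a tiny 8-byte "page", not networking code.

```c
#include <assert.h>
#include <string.h>

#define PAGE_BYTES 8   /* tiny "page" to keep the sketch short */

/* A queued packet either references the caller's page or owns a private copy. */
struct queued_packet {
    const char *data;            /* what will actually go on the wire */
    char copy[PAGE_BYTES];       /* used only by the copying path */
};

/* Zero-copy path: only a reference to the page is queued. */
void queue_by_reference(struct queued_packet *p, const char *page)
{
    p->data = page;
}

/* Copying path: the content is snapshotted at queue time. */
void queue_by_copy(struct queued_packet *p, const char *page)
{
    memcpy(p->copy, page, PAGE_BYTES);
    p->data = p->copy;
}

/* "Transmit" the packet later: returns 1 if the wire sees the expected bytes. */
int wire_matches(const struct queued_packet *p, const char *expected)
{
    assert(p->data != NULL);
    return memcmp(p->data, expected, PAGE_BYTES) == 0;
}
```

If the caller fills the page with "old-data", queues it both ways, and then rewrites the page to "new-data" before transmission, the referenced packet carries the new bytes while the copied packet still carries the old ones. That is the trade-off between the two paths: the copy is safe but costs a memcpy per page.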
/devel/fs :: Link / Comments (8)
Climbing evening.
That was a very good, although again somewhat shorter, training session.
Most of it was devoted to the complex route that starts on the
horizontal negative slope, which drained my strength very quickly, so that
at the end (after about 3 hours) I was not able to complete even small
parts of it (while doing them quite reliably at the beginning).
The route demands a lot from the back and arms especially, so after the training I feel
tired as hell, which is great of course!
It was a very good time there today!
/life :: Link / Comments (0)
Mon, 17 Dec 2007
New release of the distributed storage: Dancing with the smoked neutrino.
Short changelog:
- new improved mirroring algorithm.
This algorithm uses sliding window approach for full resync
and write log for partial resync.
- fixed a number of typos, debug cleanups
- update inode size when the linear algorithm changes the size of the
storage at run time
- extended number of sysfs files and documentation for them
- fixed leak in local export node setup
- name is 'Dancing with the smoked neutrino' now
The overall list of DST features can be found on the project's
homepage.
DST is also exported as a git tree available for clone and pull from
here.
An interested reader can test DST with the 2.6.23 tree too
(it should compile fine, but was not tested).
/devel/dst :: Link / Comments (4)
New distributed storage mirroring algorithm.
Resync logic - sliding window algorithm.
At startup the system checks the age (a unique cookie) of each node and, if it
does not match that of the first node, resyncs all data from the first node in
the mirror to the others (non-synced nodes). Each non-synced node has a
window, which slides from the start of the node to the end.
During resync all requests which enter the window are queued, thus
the window has to be sufficiently small. When the window has been synced from the
other nodes, the queued requests are written and the window moves forward,
so the next window starts to resync only when the previous one is fully completed.
When the window reaches the end of the node, the node is marked as synchronized.
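The sliding window can be sketched in a few lines of Python. This is only a model under my own assumptions, not the DST code: nodes are byte arrays, the full_resync() name and the incoming_writes argument are hypothetical, and the handling of writes outside the window (applied to the clean node at once, mirrored only if the region is already synced) is an assumption, since the description above only covers writes that hit the window.

```python
def full_resync(source, target, window, incoming_writes):
    """Copy source to target window by window.

    incoming_writes maps a window start to a list of (offset, byte) writes
    arriving while that window is being synced (hypothetical demo input).
    """
    size = len(source)
    for start in range(0, size, window):
        end = min(start + window, size)
        queued = []
        for offset, value in incoming_writes.get(start, []):
            if start <= offset < end:
                queued.append((offset, value))  # hits the window: wait
            else:
                source[offset] = value
                if offset < start:
                    target[offset] = value  # region is already synced
                # beyond the window: the window will copy it later
        target[start:end] = source[start:end]  # sync the window itself
        for offset, value in queued:  # flush the queued requests
            source[offset] = value
            target[offset] = value
    return target


source = bytearray(b"abcdefgh")
target = bytearray(len(source))
writes = {0: [(1, ord("X")), (6, ord("Y"))], 4: [(2, ord("Z")), (5, ord("W"))]}
full_resync(source, target, 4, writes)
print(source, target)
```

Both arrays end up identical, and the smaller the window, the fewer requests have to wait in the queue.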
If the age of the node matches the first one, but its log contains a different
number of write log entries compared to the first node (the first node is always
considered clean), then a partial resync is scheduled.
A partial resync will also be scheduled when the log entry pointed to by the resync
index of the node contains an error.
The mechanism of this resync type is the following: the system selects a synced node
(checking each node's flags), fetches the log entry pointed to by the resync
index of the given node and resyncs the data from the other nodes to the given one.
Then it scans the rest of the write log for
other failed writes, so that the next resync block is fetched for
them.
The mirroring log is used to store write request information.
It is allocated on disk and in memory (they are synced each time
the resync work queue fires), and eats about 1% of free RAM or disk
(whichever is less). Each write updates the log, so when a node goes offline,
its log is updated with error values, so that these entries
can be resynced when the node is back online. When the number of
failed writes becomes equal to the number of entries in the write log,
recovery becomes impossible (since old log entries were overwritten)
and a full resync is scheduled.
This does not work well when there are multiple
writes to the same locations - they are considered to be different
writes and thus will be resynced multiple times.
The right solution is to check the log on each write, which would work better if the log
were not an array, but a tree.
/devel/dst :: Link / Comments (0)
Fri, 14 Dec 2007
Linux Test Project on top of DST storage.
# pwd
/mnt/ltp-full-20071130
# ./runltp -p -f fs -d `pwd`/tmp
...
# cat /mnt/ltp-full-20071130/results/results.2007-12-14.11.21.41.17106
Test Start Time: Fri Dec 14 11:21:41 2007
-----------------------------------------
Testcase Result Exit Value
-------- ------ ----------
gf01 PASS 0
gf02 PASS 0
gf03 PASS 0
gf04 PASS 0
gf05 PASS 0
gf06 PASS 0
gf07 PASS 0
-----------------------------------------------
Total Tests: 57
Total Failures: 0
Kernel Version: 2.6.22-rc5-dst
Machine Architecture: x86_64
Hostname: uganda
# mount | grep mnt
/dev/dst-storage-32 on /mnt type xfs (rw)
# cat /sys/devices/storage/n-0-ffff*/type
R: 192.168.4.81:1025
R: 192.168.4.81:1026
All 'fs' tests completed successfully, although I saw the following dump in dmesg:
[ 8398.605691] BUG: MAX_LOCK_DEPTH too low!
[ 8398.609641] turning off the locking correctness validator.
which is an XFS bug.
Since DST is quite a dumb device, these tests will not find tricky places, but they are good
for generating high load on top of the given block device.
/devel/dst :: Link / Comments (0)
New release of the userspace network stack.
Changed the data reading function: now it does not copy the TCP header into
the user's buffer, only the data. Also forced the packet socket reading path
to limit the maximum number of packets to be read which do not match
the created netchannel.
As usual, new release is available from project
homepage.
/devel/networking/unetstack :: Link / Comments (0)
New mirroring module in the distributed storage.
$ git-diff-index --stat HEAD drivers/block/dst/alg_mirror.c
drivers/block/dst/alg_mirror.c | 745 ++++++++++++++++++++--------------------
1 files changed, 364 insertions(+), 381 deletions(-)
It is cool and works well in my environment, but (like the previous one) it
forces a total mirror resync after the main storage node reboots or crashes (if it is
required, for example when the array was not yet in sync and the main node rebooted).
I want to extend the DST mirroring algorithm not to force a full resync, but to store a log
of the writes on each node, so when a new array starts, it would check not only
the age of the nodes (a unique id stored at the end of each node; if it does not match,
a total resync starts), but also the write log, so that if the latter does not match, only
the selected regions would be synchronized.
Stay tuned...
/devel/dst :: Link / Comments (0)
Thu, 13 Dec 2007
Why pushing project into the kernel is not a main goal?..
One has to have some courage and not be afraid to throw something
out and create new things instead of the old ones, even if it requires a lot
of effort and causes some problems in the short term.
So I've just erased the mirroring algorithm from DST and will rewrite it mostly
from scratch, since I have a very interesting sync algorithm in mind,
which will not require a clean/dirty bitmap.
Having DST in the kernel would not allow me to have such flexibility...
/devel/dst :: Link / Comments (0)
Wed, 12 Dec 2007
Climbing evening.
It was again a bit late training and thus shorter than usual,
but nevertheless it was very intense - I tried the old complex
start on the horizontal negative slope and several times
managed to complete it fully. That's a very interesting and complex
trace itself, but some time ago I tried some of its bits and completed
them. I think I can finish it without falls after several trainings,
but right now I'm working on what I think is the most complex part: the
power sucking start.
A horizontal negative slope is usually a big problem for me because of
my power endurance, it also requires a very strong back in some movements,
so right now I'm feeling that I still have some muscles in the body and
they did not disappear after sitting in the chair most of the time.
Excellent time!
/life :: Link / Comments (0)
I was a bit pessimistic about DST design bugs.
Things are only bad when a resync of the mirror node is in progress...
I fixed both issues, but will spend additional time debugging and testing
them, since I do not like how it was done. I think I will rewrite the mirroring
resync logic.
Subrata Modak of IBM suggested using the
Linux Test Project, which I found to
have interesting benchmarks, which, while being very useful for filesystem
development, still can find some bugs in DST.
/devel/dst :: Link / Comments (0)
Shame on me or how complex are design bugs...
I have to admit that mirroring in DST is not currently well supported.
First, because of a bug I made in the early development stage: in DST there
are two objects which represent a part of the storage. The first one is a node:
this object contains information about the type of the storage and pointers to
the structure which represents the low level device itself (like a block device or a network
connection). A network connection in turn is represented as a state structure,
which contains a socket, a state machine for transferred data and so on.
Nodes are used when a block IO request comes from the higher layer and
states are used when data is transferred via the network. The former use
fine grained reference counters: when a node is being operated on (a request is processed),
its reference counter is increased; if the operation becomes asynchronous
(for example the sending queue is full and thus the block can not be sent right now),
then the block request is queued into the state's request list and the reference counter for
the node is dropped. If it reaches zero, the node is freed, which in turn
calls the exit callback for the state, which flushes the queue of requests.
Things seem simple and correct, but the devil is in the details - the async processing thread
can enter the game at any point and process the state too, which leads to bugs.
Second, DST mirroring can eat all your memory during resync, since it does not check
the amount of free RAM in the system and tries to allocate new pages until all memory is used.
This is already fixed in the private tree though.
And the last (known) problem is the mirror bitmap - it uses a single bit per sector
of the device, and although it uses vmalloc(), it still takes too much RAM.
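To put a number on the bitmap cost, here is a small back-of-the-envelope calculation in Python. The 512-byte sector size and the 1 TiB device are my assumptions for illustration, and bitmap_bytes() is a hypothetical helper, not something from DST.

```python
SECTOR_SIZE = 512  # bytes per sector (assumption)
TIB = 1 << 40
MIB = 1 << 20


def bitmap_bytes(device_bytes, sector_size=SECTOR_SIZE):
    # One bit per sector, rounded up to whole bytes.
    sectors = device_bytes // sector_size
    return (sectors + 7) // 8


one_tib_bitmap = bitmap_bytes(TIB)
print(one_tib_bitmap // MIB, "MiB of bitmap per TiB of storage")
```

That is 2^31 sectors and thus 2^28 bytes, i.e. 256 MiB of RAM for every TiB of mirrored storage under these assumptions.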
Back to fixing.
/devel/dst :: Link / Comments (0)
Tue, 11 Dec 2007
First pohmelfs dmesg and bits of Linux VFS internals.
[ 9941.748766] pohmelfs_alloc_inode, inode: ffff81003bc83ac8.
[ 9941.755070] pohmelfs_read_inode, inode: ffff81003bc83ae0, num: 12,
inode is regular: 0, dir: 1, link: 0.
[ 9947.667710] pohmelfs_readdir: filp: ffff81003c5ad6a8, inode: ffff81003bc83ae0,
dirent: ffff81003c5aff38, filldir: ffffffff8027f274.
[ 9950.283976] pohmelfs_readdir: filp: ffff81003e82faa8, inode: ffff81003bc83ae0,
dirent: ffff81003a00ff38, filldir: ffffffff8027f274.
[10028.705354] pohmelfs_readdir: filp: ffff81003d4f1068, inode: ffff81003bc83ae0,
dirent: ffff81003e10ff38, filldir: ffffffff8027f274.
[10095.745022] pohmelfs_lookup: dir: ffff81003bc83ae0, dentry: ffff81003b5343a0,
nameidata: ffff81003e10fe88.
[10095.754922] pohmelfs_lookup: dir: ffff81003bc83ae0, dentry: ffff81003b5343a0,
nameidata: ffff81003e10fdf8.
uganda:~# mount | grep pohmel
/dev/hdb1 on /mnt type pohmel (rw)
uganda:~# ls -la /mnt/
total 0
It is about 12kb of code just to register one's own filesystem and provide a number of VFS
callbacks, so that the filesystem can be mounted.
It is not possible to create files or directories since the directory lookup method is not
implemented (it returns NULL), and ls -l does not show any data since the ->readdir()
callback does not fill directory entries, because there are no such objects in the filesystem
at all.
As you can see, this is a fairly trivial implementation, which was created just as a reference point.
So far it includes stubs for the following VFS methods:
- basic address space operations:
->readpage() which reads a page, usually implemented as a generic mpage_readpage(),
which uses per-filesystem get_block() callback. This is called via read path,
when file's page is not in the page cache yet.
->writepage() - writes a page usually via generic block_write_full_page()
helper, which uses per-filesystem get_block() callback. This is called by the VFS
core when there is a need to write page from the cache to disk. This happens for example when
you call sync and friends.
->prepare_write()/->commit_write() - they are called via the write path (for example from
generic_file_buffered_write()). These functions have to reserve space on disk,
update related metadata and perform other private filesystem steps for the given page, which will be
flushed to that on-disk area in ->writepage().
- basic directory inode and file operations:
- file operations include
->read() callback, which has to return -EISDIR, and
->readdir(), which has to read directory entries for given inode into provided buffer.
Right now it is empty.
- inode operations are used to create/remove/lookup and perform other tasks on directory content.
Readonly filesystems only have to provide the
->lookup() callback, which is used to look up the
inode for a given directory entry. Others have to implement a lot more operations: create, lookup, link, unlink,
symlink, mkdir, rmdir, mknod, rename, setattr, a set of extended attribute operations and so on...
Pohmelfs currently does not perform anything at all, but already provides an empty lookup callback.
- basic file operations (file operations itself and inode operations for regular files):
- file operations for regular files are currently those provided by
generic_ro_fops,
which include:
->llseek() - generic_file_llseek() - seek inside the file mapping; it just updates
the file's current position and performs some checks, so it does not include anything filesystem specific.
->read() - do_sync_read() - a helper used by the read syscall; it will eventually call
->aio_read(), which is generic_file_aio_read() for these file operations;
it will call ->readpage() for pages which are not yet in the page cache.
->aio_read(), described above.
->mmap() - generic_file_readonly_mmap() - it sets up the mapping's operations,
which include only a fault handler, which in turn will call page_cache_read(), which
ends up with ->readpage() calls. Of course mapping is a bit more complex task, but from
the filesystem point of view that is all we have to know.
->splice_read() - generic_file_splice_read() - this callback is used for splice
system calls, which ends up calling the same ->readpage() callback for the set of pages,
which are put into spliced buffer of pages.
- inode operations for regular files are not needed for a readonly filesystem (although it can provide
some useful callbacks like getting extended and usual attributes); for a usual filesystem at least
the ->truncate() and ->getattr() callbacks are required.
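The common theme of the read-side callbacks above is that everything ends up in ->readpage() on a page cache miss. A toy Python model of that path follows; it is not VFS code: the ToyFile class, its dict-based page cache and the call counter are hypothetical and only illustrate the control flow.

```python
PAGE_SIZE = 4096


class ToyFile:
    """Model of a read path: serve from the page cache, fill it on a miss."""

    def __init__(self, backing):
        self.backing = backing  # stands in for the on-disk data
        self.page_cache = {}    # page index -> page contents
        self.readpage_calls = 0

    def readpage(self, index):
        # Stands in for the filesystem's ->readpage() callback.
        self.readpage_calls += 1
        start = index * PAGE_SIZE
        self.page_cache[index] = self.backing[start:start + PAGE_SIZE]

    def read(self, pos, count):
        out = bytearray()
        end = min(pos + count, len(self.backing))
        while pos < end:
            index, offset = divmod(pos, PAGE_SIZE)
            if index not in self.page_cache:
                self.readpage(index)  # only called on a cache miss
            chunk = self.page_cache[index][offset:offset + (end - pos)]
            out += chunk
            pos += len(chunk)
        return bytes(out)


backing = bytes(range(256)) * 40  # 10240 bytes, i.e. two and a half pages
f = ToyFile(backing)
first = f.read(4000, 200)  # crosses the boundary between pages 0 and 1
calls_after_first = f.readpage_calls
second = f.read(4000, 200)  # served from the cache
calls_after_second = f.readpage_calls
print(calls_after_first, calls_after_second)
```

The second read does not touch the backing store at all, which is why a readonly filesystem can get away with implementing little more than ->readpage().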
/devel/fs :: Link / Comments (0)
Mon, 10 Dec 2007
I have started the layoff process.
Most of the projects have been moved to colleagues, talks with management are completed.
Just waiting for tiny bits and that's all...
/devel/other :: Link / Comments (2)
PohmelFS.
linux-2.6.fs$ mkdir fs/pohmelfs
linux-2.6.fs$ date
Mon Dec 10 19:38:53 MSK 2007
Stay tuned...
This is a working name of the filesystem, I will think about release name later.
First I will implement a simple base, which will just register itself with the Linux
VFS code, so I will put here some specs about what the Linux VFS requires from a
filesystem. In parallel it will be used as a base for a
network filesystem
and/or a distributed/local filesystem.
/devel/fs :: Link / Comments (2)
New distributed storage release: Gamardjoba, genacvale!
Short changelog:
- wake up the state when the mirror detects an error to speed up reconnect
- if connecting in csum mode to no-csum server, do not enable csums
- do not clean queue until all users are removed
- allow to increase size of the storage in linear add callback
(with this change it is possible to add nodes into linear array
in real time without stopping storage. Filesystem has to be prepared
for the case when underlying device has changed its size.
Real-time addon of mirror nodes is also supported)
- allow to delete gendisk only after device was started
- dst debug config option
- Name: Gamardjoba, genacvale! ('Hi friend' in Georgian)
Great thanks to Matthew Hodgson (matthew_mxtelecom.com) for debugging!
As usual, one can get new release from the project homepage.
/devel/dst :: Link / Comments (0)
Sat, 08 Dec 2007
Pancho Villa.
I had an excellent time in this Mexican restaurant with friends.
We celebrated Perec's birthday: tequila did flow plentifully, burritos
were hot, and hours disappeared silently.
Excellent time!
/life :: Link / Comments (0)
Fri, 07 Dec 2007
Climbing evening.
It was quite a short and not very hard training - I was a bit later
than usual, and most of the time I tried a quite old but very complex
start on the horizontal negative slope. Meanwhile I talked with the instructor
and found that start in question does not contain one hold, which was
there originally, so that should explain why I fell. I will continue that
red trace next time, I even want to put a huge paper around another hold, located
where old one was: 'I'm a red hold, I'm just feigning'.
/life :: Link / Comments (0)
Strong checksums in DST rock.
Great thanks to person, who suggested me
to implement them and Zach Brown, who showed, that
Castagnoli crc is a better one than Adler.
I've debugged a setup where the system failed to mount an XFS filesystem on top of the distributed storage,
and after strong checksums were turned on, the system detected they were wrong, so some corruption
happened during filesystem setup.
Turning off TSO, RX and TX offload of e1000 nics on machines, which form the storage, fixed the problem.
Strong checksums rock!
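For the curious, the Castagnoli CRC (CRC-32C) is simple enough to write down. Below is a slow bit-by-bit Python sketch which only shows the idea; it is not the DST code, and the block contents are made up for the demo.

```python
def crc32c(data):
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# Standard check value for CRC-32C.
check = crc32c(b"123456789")

# A hypothetical block damaged by a single flipped bit on its way to the peer.
block = bytes(range(256)) * 2
corrupted = bytearray(block)
corrupted[100] ^= 0x01
detected = crc32c(block) != crc32c(bytes(corrupted))
print(hex(check), detected)
```

The sender stores the checksum next to the block and the receiver recomputes it, so a mismatch like the one above is what reveals corruption introduced between the two.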
/devel/dst :: Link / Comments (3)
Distributed storage and long distances.
I've just completed some tests over a distributed system
created on top of usual internet links between machines
located in Moscow, Russia and London, UK.
A remote target was set up, then an XFS filesystem was created and mounted,
and some tests were run.
One of the machines (the main storage server) is located behind at least
one NAT firewall.
/devel/dst :: Link / Comments (4)
The return of syslets.
Zach Brown announced
a new syslet patchset aimed to simplify and stabilize basic async operations.
Syslets are a mechanism for performing syscalls asynchronously - a new thread
is started when a syscall is about to block: execution blocks and the old thread
is scheduled away in favour of the new one, on behalf of which userspace continues its
execution.
Version 7 of the patchset was built on top of indirect syscalls; threadlets,
userspace function execution and async IO were removed from the patchset for simplicity, and
a number of comments and code clarifications were added.
Main goal of the syslets right now is to make fundamental things working right.
Asynchronous IO already has a long history - it was implemented
as a state machine in KAIO and kevent AIO, and the
kernel supports AIO for direct IO operations (userspace requires libaio).
The syslet approach was shown to be in some cases much slower than libaio (which is actually
a sync operation for usual files), but this was attributed to unfairness of the CFS scheduler,
and (iirc) it was fixed/extended.
My main objection against this is the fact that when you have thousands of actively
running applications, the system starts sucking badly, but if it is possible to reduce the maximum
number of working threads per user to some reasonable limit, things will be just fine.
Syslets (and their more friendly user, threadlets) were supported by Linus and Ingo Molnar,
so very likely they will be the default way to do asynchronous IO and other operations.
Right now Zach highlighted following problems:
- limitations of the ring buffer of syslet statuses
- ptrace() problems
- stale data (when the thread issuing a syslet calls for example
setuid(),
in which case another thread, which actually executes the blocked syscall, contains wrong data)
- problems with
sys_clone() and syslets: sys_clone() is actually
the mechanism used to create a new thread in syslets, so we get a recursion
All of the above problems can technically be resolved, and I think it is not
that bad to introduce some simple limitations for users, so that the majority of async IO questions
are resolved with this mechanism.
/devel/other :: Link / Comments (0)
B(something)-tree vs RB-tree. On-disk allocations.
In the previous
article it was shown how btree and rbtree behave when allocations are
done in memory. In such conditions btree should suck compared to rbtree, and generally this is
true, although in some conditions its insert speed can be even slightly higher than rbtree's.
Now, let's check how they behave when all allocations are performed from disk.
The graph below shows insert speed for both rbtree and btree in such conditions;
each node was allocated with a 1024+sizeof(node) offset from the previous one
so that readahead and thus cached disk pages would not influence the results.
In total 1 million keys were inserted into the tree.
Search speed is roughly the same as in the in-memory tests, since most of the tree
sat in RAM after insertion.

The high jump around 220 keys is likely the place where the node size becomes big enough,
and the number of nodes small enough, that the whole tree starts to fit into the page cache.
In some cases there is no such peak and the graph slowly moves to around 40k insertions
per second, which likely happens when some background task is actively using the page
cache, flushing the test file's pages out of memory.
/devel/fs :: Link / Comments (0)
The most discouragement-resistant hacker out there.
That is what Jonathan Corbet calls me :)
/devel/other :: Link / Comments (0)
Thu, 06 Dec 2007
Multithreaded filesystem access.
Trees are generally (if not always) very bad at parallel access,
since there is no good strategy for what to lock, and tree modifications
usually require changes to more than one node and in some cases (like b-tree
or AVL tree) can lead to changes at every layer.
Thus it is much simpler to lock the whole tree during any changes, but since
not every node of the tree is in main memory and thus has to be fetched
from disk, this can lead to long delays per operation.
By contrast, the Linux VFS operates with pages, where each page is locked individually.
Similar changes for hash tables (i.e. one lock per hash bucket) actually lead
to lower performance than when the whole table is protected by a single lock, because
the cache lines containing the per-bucket locks bounce between CPUs, but this, again, is only
applicable to main memory, since usually access to a single bucket in the hash table
is quite cheap even if it contains several entries.
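To make the "one lock per hash bucket" scheme concrete, here is a minimal Python sketch. The BucketLockedTable class is hypothetical and says nothing about performance (Python threads will not show cache line bouncing); it only shows which lock covers which data.

```python
import threading


class BucketLockedTable:
    """Hash table where every bucket is guarded by its own lock."""

    def __init__(self, buckets=16):
        self.buckets = [dict() for _ in range(buckets)]
        self.locks = [threading.Lock() for _ in range(buckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        i = self._index(key)
        with self.locks[i]:  # only this bucket is blocked for other threads
            self.buckets[i][key] = value

    def get(self, key, default=None):
        i = self._index(key)
        with self.locks[i]:
            return self.buckets[i].get(key, default)

    def __len__(self):
        return sum(len(bucket) for bucket in self.buckets)


table = BucketLockedTable()


def writer(base):
    for n in range(1000):
        table.put(base + n, n)


threads = [threading.Thread(target=writer, args=(t * 1000,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = len(table)
print(total)
```

Replacing the list of locks with a single one gives the "whole table" variant, which is the same trade-off as locking a whole tree versus its individual nodes.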
So, I do not know a perfect locking scheme for trees when they are allocated
on disk, so I will find that knowledge in experiments.
The best testbed, which is the closest to real life, is a trivial filesystem of course.
Initially this will be a simple and very small kernel module with a basic filesystem
in it, so that it could be trivially changed to support an on-disk filesystem and
a network filesystem.
I have wanted to put my dirty hands into it for quite a while already, so it is time to start...
Stay tuned!
/devel/fs :: Link / Comments (2)
A simple way to crash machine using XFS and DST.
Let's suppose you want to create an XFS on top of DST array.
If you mistakenly run mkfs.xfs /dev/sda1 (let's suppose
you want to create the DST storage on top of the /dev/sda1 device)
and then start DST on top of /dev/sda1:
./dst -n storage -A alg_mirror -d /dev/sda1 -R -s0 -S0
this will overwrite the last sector of the /dev/sda1,
where XFS stores its metadata. Mounting XFS after that will lead
to almost 100% crash of the machine on 2.6.22 kernels because of some
bugs in XFS, which appear when XFS reads corrupted metadata from the
last sector.
To work with DST you have to operate with /dev/dst-$storage-$num
devices (i.e. run mkfs.xfs /dev/dst-$storage-$num), and not with
underlying ones.
/devel/dst :: Link / Comments (0)
Wed, 05 Dec 2007
BTRFS 0.9 release.
Chris Mason announced
new release of his btrfs filesystem.
It includes:
- bigger filesystem block sizes
- extended attributes (no ACL yet)
- extent alignment parameter
- inlining of the file data into btree
- number of performance and stability improvements
Chris also showed
a rough timeline for the filesystem development.
As he pointed out, btrfs is still very bad at database loads and does not support
multithreaded operations.
As you probably noticed, the implemented inlining of the file data into the btree
is virtually the scaling inodes
algorithm, although a bit simpler.
I do like btrfs, and wish great success to this filesystem. But only until I start my own :)
Kidding of course.
/devel/fs :: Link / Comments (0)
Storage hotplugging in DST.
For the interested reader: yes, it is possible to add disks
into DST storage on the fly, but be sure that your filesystem supports that
(in case of linear setup), mirroring is fairly transparent.
Command to add another node into mirror setup is pretty simple:
./dst -n storage -A alg_mirror -S0 -s0 -a kano -p 1026
Just like adding usual node into the storage before it was started.
Please note, that when adding node which is smaller than current device size,
device size will be reduced and this can damage your filesystem!
The same applies to linear setup.
/devel/dst :: Link / Comments (0)
Tue, 04 Dec 2007
DST FAQ.
The most frequently asked question about DST is:
Can you give us a summary of how this differs from using device mapper with NBD or iSCSI?
Answer is quite simple:
From the higher point of view it does not, but it operates quite differently:
it has async processing of the requests, thus not blocking, it has
different protocol with smaller overhead, supports strong checksums, has
in-kernel export server, which supports simple security attributes (i.e.
allow to connect, to read or write). It uses smaller amount of memory
(zero additional allocations in the common path for linear mapping,
not including network allocations, it uses smaller amount of additional
allocations for mirroring case).
DST supports failure recovery in case of dropped connection (core will
reconnect to the remote node when it is ready), thus it is possible to
turn off and on remote nodes without special administration steps. DST
has simple autoconfiguration at the startup time (support checksums and
storage size autonegotiation). It is possible to turn one of the mirror
nodes off and use it as an offline backup, since a DST mirror node stores
its own data at the end of the storage, so it can be mounted locally.
/devel/dst :: Link / Comments (0)
New distributed storage subsystem release.
This is a maintenance release and includes
bug fixes and simple feature extensions only.
Short changelog:
- fixed bug with XFS metadata update (it can provide slab pages to the
DST, so it is not allowed to transfer them using
->sendpage())
- fixed async error completion path
- extended netlink communication channel to report errors back to userspace
- DST name is now "The 10'th dynasty of smuggled slothes"
- number of fixes for userspace DST target
Great thanks to Matthew Hodgson (matthew_mxtelecom.com) for debugging and
fixes for userspace DST target and preliminary netlink extension patches.
As usual you can download this release from the homepage.
If you want to try distributed storage this release is a really good candidate to start with.
Enjoy!
Update: This release includes bug fixes for all bugs described
here,
including uninterruptible sync read operations.
/devel/dst :: Link / Comments (2)
The 22'th century netchannels release.
This is the 22'th release of the netchannels, a peer-to-peer protocol
agnostic communication channel between hardware and users. It uses
unified cache to store channels, allows to allocate buffers for data
from userspace mapped area or from other preallocated set of pages
(like VFS cache). All protocol processing happens in process context.
Users of the system can be for example userspace - it allows to receive
and send traffic from the wire without any kernel interference, to
implement own protocols and offload its processing to the hardware.
This idea was originally proposed and implemented by Van Jacobson.
This patchset (with the userspace network stack) is a logical continuation
of the idea with a move to full peer-to-peer processing.
Short changelog:
- update cached route in the netchannel when it expires
Thanks to Salvatore Del Popolo (delpopolo_dit.unitn.it) for testing.
You can get the latest sources from netchannels homepage.
Userspace network stack is available from own homepage.
/devel/networking :: Link / Comments (0)
Mon, 03 Dec 2007
Climbing evening.
That was a great training; although I completed not that many traces,
they all were good. Several old ones from really simple to quite
interesting at the beginning, then a number of traverses and boulderings.
When Grange arrived it was already
quite late, so we made a couple of simple traces without rest in between,
and then the greatest trace started: 7a+ (although I was allowed to modify
it a bit, so I reduced its category down to 6c/6c+) over the black holds
in the left center sector. I did not finish it (it has not been on-sight for
quite a while already), since I was too tired, but attempted the key point several times,
which flushed me down to the bottom.
I am tired as hell and that is a great feeling!
/life :: Link / Comments (0)
Sun, 02 Dec 2007
Distributed filesystem roadmap.
- Distributed storage. This step is mostly completed, although some bugs are there
and there is a number of features to be implemented; work is being done on them
and it is on the finish line. The feature list includes:
- sync/barrier support
- error reporting to userspace via netlink (the patch was made by Matthew Hodgson (matthew_mxtelecom.com))
- some thoughts about sync operations which can get stuck in uninterruptible state if there are some
problems with remote nodes (Hi NFS), I will create a fix for this issue for DST at least.
There is a nasty bug in DST currently, which I can not reproduce locally, so I debug it with Matthew
on his setup.
There is also a fair number of fixes for the userspace DST target made by him. Great thanks!
- Local filesystem with a very scalable and fast on-disk format,
the possibility to have on-line backups, snapshots, no fsck, scalable
locking (multithreaded reading and writing).
This originally had a design very similar to btrfs,
but I want to move further and have the ability to perform multithreaded and then
multi-machine access to the same files. Call me a loser or wheel reinventor (I would not be
where I am if I cared about it), but I want to have a project where I know every single bit
to be able to fix things quickly and break something if it is needed for a better implementation.
- linking both network and fs layers together, this will include
distributed byte-range locking and cache coherency for client nodes. Bits of this step
I described in short discussion
with Zach Brown.
This does not mean the steps will be completed in the above order, I'm working in parallel in
different directions and some parts can appear earlier, so that I would be able to evaluate
their problems.
Bug fixes obviously have the highest priority.
/devel/fs :: Link / Comments (0)
Meanwhile on the apartment development side.
I reached a big milestone yesterday - I completed wallpaper glueing
in the hall, room and checkroom. Although it still requires
some fixes and bits of work with a knife, it is a huge step forward.
I also wanted to paint the walls in the room yesterday, but slacked off.
Today is supposed to be another heavy working day - I will go
to the development shop (on the opposite side of Moscow) to get neon cord,
water system hatch and ceiling for the bathroom. If I return not too late,
I will start setting them up, otherwise I will paint the room.
I'm curious when I will have a real vacation and some rest, but I think I found an answer -
tomorrow I will start the discharge process at work and expect it to be completed
in a week or two at most, so I will have about two weeks before the new year
celebrations. Most of them will be devoted to the apartment development though.
Well, we will definitely have some rest in eternity...
/devel/flat :: Link / Comments (0)