Thu, 31 Jan 2008
POHMELFS release notes.
One can grab release notes, my thoughts (a bit chaotic) and code
here (POHMELFS core)
and local-only-cache hack here.
Please note that POHMELFS is less than one month old, so do not
be too severe with it :)
And I'm going to have some fuel about this release, it was hard, but bloody cool!
/devel/fs :: Link / Comments (0)
First POHMELFS version, codename water:50ml, has been released.
A small benchmark of the local cached mode:
$ time tar -xf /home/zbr/threading.tar
        POHMELFS    NFS v3 (async)
real    0m0.043s    0m1.679s
Which is damn near 40 times faster!
Excited? Below is a bucket with ice for you and me.
Of course the difference will not be _that_ huge in the real world, where
tested archives are larger (this one is a git archive of my
userspace threading
library, which is very small). Since it is so small, there is no writeback
cache flushing.
But you get the idea :)
And that version will not be released in this form, since it uses a heavy hack
called local cache, which is never synced with the remote server. Actually one
can consider this as tmpfs or something like that. The code supports sync,
but since the inode generation process is very different, files and dirs can not
be blindly synced to the ext3 fs. So, I will release POHMELFS as two patches:
the first one is a network filesystem implementation with a
write-through cache, where an object is first created on the remote side and then
populated into the local cache. This one is slow.
The second patch is a hack to disable writeback caching and implement local caching
only, which is very fast.
After that I will start thinking about how to generically solve the problem of
syncing local changes with the remote server. This, among other things, will allow offline work
with automatic syncing after reconnect.
This is not intended for inclusion, CRFS
is a bit ahead of POHMELFS, but it is not generic enough (because of above problem)
and works only with BTRFS.
And, btw, I changed the naming convention: since having a set of volumes from 50ml to 1 liter
is not enough for serious development, I will prepend a liquid name to each row. So, it will
be water:{50ml, 100ml ... 1 liter}, tea:{50ml ... 1 liter} ... spirit:{50ml ... 1 liter}.
The number of different "waters" I know should be enough for this project :)
Stay tuned!
/devel/fs :: Link / Comments (0)
Nasty dentry abuse or...
... searching for rakes by stepping on them in a dark room. That is how I can describe
the process of hunting for obscure bugs in filesystem code.
Preface 1.
System locks up hard without a single message in dmesg, although all kernel
hacking options are enabled in the config. System responds to ping, but there
is no way to log in or to do something as a local user.
Preface 2.
I recall, things were cool.
Bisecting is not my friend today, since a fair number of fixes were added, and
while I can find a state where the new bug does not exist, old ones can kill
the system there, so I decided to manually check every patch I added to git
over the last days. Since I do not know VFS well enough, there are several things
I just copied from other filesystems (most of them do it that way),
so I started to drop some bits of that code out of pohmelfs.
Eventually I found that a lookup which fails to find the requested dentry
in most filesystems adds a NULL inode into the dentry, either via d_add()
or via d_splice_alias(). Both look harmless, except that a dentry
with a NULL inode then exists in the dentry cache. Maybe that is fine and there is
some other bug in pohmelfs, but after I added it I started to get those obscure
freezes (quite easily reproducible, with almost 100% probability in some tests),
and sometimes a general protection fault happened in VFS code during umount.
So, I just removed the code which adds a NULL inode into the dentry via d_add()
and things are good again. I do not know how frequently this can happen in a local filesystem,
but a fact is a fact: after removing this code pohmelfs behaves excellently (modulo its speed).
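For readers who have not met the idea: a dentry with a NULL inode is usually called a negative dentry, and it caches the fact that a name does not exist, so the next lookup of the same name does not have to ask the backend again. Below is a toy userspace model of that behaviour only. It is a sketch, not kernel code: toy_dentry, toy_cache_add(), toy_lookup() and the always-empty toy_backend_lookup() are hypothetical names invented for illustration, and the real dcache is far more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_CACHE_SIZE 8

struct toy_inode { unsigned long ino; };

struct toy_dentry {
    char name[32];
    struct toy_inode *inode;   /* NULL means "known not to exist" */
    int used;
};

static struct toy_dentry toy_cache[TOY_CACHE_SIZE];
static int toy_backend_calls;  /* how many times the "server" was asked */

/* Hypothetical backend: pretends the remote side has no objects at all. */
static struct toy_inode *toy_backend_lookup(const char *name)
{
    (void)name;
    toy_backend_calls++;
    return NULL;
}

/* Find a cached entry by name; NULL means the name was never cached. */
static struct toy_dentry *toy_cache_find(const char *name)
{
    for (size_t i = 0; i < TOY_CACHE_SIZE; i++)
        if (toy_cache[i].used && strcmp(toy_cache[i].name, name) == 0)
            return &toy_cache[i];
    return NULL;
}

/* Cache a name; passing a NULL inode creates a negative entry. */
static struct toy_dentry *toy_cache_add(const char *name, struct toy_inode *inode)
{
    assert(name != NULL);
    for (size_t i = 0; i < TOY_CACHE_SIZE; i++) {
        if (!toy_cache[i].used) {
            strncpy(toy_cache[i].name, name, sizeof(toy_cache[i].name) - 1);
            toy_cache[i].name[sizeof(toy_cache[i].name) - 1] = '\0';
            toy_cache[i].inode = inode;
            toy_cache[i].used = 1;
            return &toy_cache[i];
        }
    }
    return NULL; /* cache is full */
}

/* Lookup that remembers failures: a miss is cached as a negative entry. */
static struct toy_inode *toy_lookup(const char *name)
{
    struct toy_dentry *d = toy_cache_find(name);

    if (d)
        return d->inode;       /* cache hit, positive or negative */

    struct toy_inode *inode = toy_backend_lookup(name);
    toy_cache_add(name, inode);
    return inode;
}
```

Looking up the same missing name twice asks the backend only once; the second answer comes from the negative entry. Whether keeping such entries is safe for a network filesystem, where another client may create the name in the meantime, is a separate question this sketch does not address.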
Edited to add: no, something wrong still exists in the system, although I'm not sure whom
to blame:
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.23-pohmelfs #4
-------------------------------------------------------
bash/4116 is trying to acquire lock:
(&journal->j_list_lock){--..}, at: [] journal_try_to_free_buffers+0xd4/0x187 [jbd]
but task is already holding lock:
(inode_lock){--..}, at: [] drop_pagecache+0x48/0xd8
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (inode_lock){--..}:
[] __lock_acquire+0xa66/0xc48
[] lock_acquire+0x7a/0x94
[] _spin_lock+0x38/0x62
[] __mark_inode_dirty+0xce/0x147
[] __set_page_dirty+0xd0/0xdf
[] mark_buffer_dirty+0x8b/0x92
[] __journal_temp_unlink_buffer+0x174/0x17b [jbd]
[] __journal_unfile_buffer+0xb/0x15 [jbd]
[] __journal_refile_buffer+0x6a/0xe3 [jbd]
[] journal_commit_transaction+0xf46/0x11eb [jbd]
[] kjournald+0xb5/0x1c1 [jbd]
[] kthread+0x3b/0x63
[] kernel_thread_helper+0x7/0x10
[] 0xffffffff
-> #0 (&journal->j_list_lock){--..}:
[] __lock_acquire+0x952/0xc48
[] lock_acquire+0x7a/0x94
[] _spin_lock+0x38/0x62
[] journal_try_to_free_buffers+0xd4/0x187 [jbd]
[] ext3_releasepage+0x68/0x74 [ext3]
[] try_to_release_page+0x33/0x44
[] __invalidate_mapping_pages+0x74/0xe0
[] drop_pagecache+0x70/0xd8
[] drop_caches_sysctl_handler+0x36/0x4e
[] proc_sys_write+0x6b/0x85
[] vfs_write+0x82/0xb8
[] sys_write+0x3d/0x61
[] syscall_call+0x7/0xb
[] 0xffffffff
other info that might help us debug this:
2 locks held by bash/4116:
#0: (&type->s_umount_key#11){----}, at: [] drop_pagecache+0x38/0xd8
#1: (inode_lock){--..}, at: [] drop_pagecache+0x48/0xd8
stack backtrace:
[] show_trace_log_lvl+0x1a/0x2f
[] show_trace+0x12/0x14
[] dump_stack+0x16/0x18
[] print_circular_bug_tail+0x5f/0x68
[] __lock_acquire+0x952/0xc48
[] lock_acquire+0x7a/0x94
[] _spin_lock+0x38/0x62
[] journal_try_to_free_buffers+0xd4/0x187 [jbd]
[] ext3_releasepage+0x68/0x74 [ext3]
[] try_to_release_page+0x33/0x44
[] __invalidate_mapping_pages+0x74/0xe0
[] drop_pagecache+0x70/0xd8
[] drop_caches_sysctl_handler+0x36/0x4e
[] proc_sys_write+0x6b/0x85
[] vfs_write+0x82/0xb8
[] sys_write+0x3d/0x61
[] syscall_call+0x7/0xb
=======================
Although it does not contain any signs of pohmelfs, it still can be related...
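Reading the report above: chain #1 shows inode_lock being taken by kjournald somewhere under journal_commit_transaction(), apparently while j_list_lock is held, and chain #0 shows bash holding inode_lock in drop_pagecache() and then asking for j_list_lock, which is the opposite order. A lock-order checker only needs to remember "B was taken while A was held" edges and complain when a new edge closes a cycle. The following is a minimal userspace sketch of that idea; it is not how lockdep is actually implemented, and all names in it (order, note_acquire(), the two enum values) are made up for illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_LOCKS 8

/* Hypothetical lock class ids for the two locks in the report. */
enum { J_LIST_LOCK, INODE_LOCK };

/* order[a][b] != 0 means: lock b was taken while lock a was held. */
static unsigned char order[MAX_LOCKS][MAX_LOCKS];

/* Depth-first search: is 'to' reachable from 'from' along recorded edges? */
static int reachable(int from, int to, unsigned char *seen)
{
    if (from == to)
        return 1;
    seen[from] = 1;
    for (int next = 0; next < MAX_LOCKS; next++)
        if (order[from][next] && !seen[next] && reachable(next, to, seen))
            return 1;
    return 0;
}

/*
 * Record that 'acquiring' is being taken while 'held' is held.
 * Returns 1 if the new dependency would close a cycle (possible
 * deadlock), 0 if the ordering is consistent with what was seen so far.
 */
static int note_acquire(int held, int acquiring)
{
    unsigned char seen[MAX_LOCKS];

    assert(held >= 0 && held < MAX_LOCKS);
    assert(acquiring >= 0 && acquiring < MAX_LOCKS);

    memset(seen, 0, sizeof(seen));
    /* A path acquiring -> ... -> held already exists: the new edge inverts it. */
    if (reachable(acquiring, held, seen))
        return 1;

    order[held][acquiring] = 1;
    return 0;
}
```

With this model, note_acquire(J_LIST_LOCK, INODE_LOCK) is accepted and a later note_acquire(INODE_LOCK, J_LIST_LOCK) is flagged, which mirrors the two chains in the report. Note that such a checker reports the possibility of a deadlock from the ordering alone; nothing has to actually hang.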
/devel/fs :: Link / Comments (0)
BTRFS subvolumes.
Chris Mason created a short specification
for the subvolumes in BTRFS. Subvolumes allow the filesystem to allocate blocks
on several devices and use tricky algorithms to distribute the load
between storages.
Overall this is an excellent idea, but the specification raises some questions and I believe
it is too heavily tied to the ZFS design.
I will drop my thoughts here, which may be completely wrong though.
Here are some features btrfs will support with subvolume implementation:
Mirrored metadata, configurable up to N mirrors (where N > 2); Mirrored data extents;
Checksum failure resolution by using a mirrored copy; Striped data extents and others.
They are clear targets for block layer, but there are following notes on why it is not:
If Btrfs were to rely on device mapper or MD for mirroring, it would not be able to resolve checksum
failures by checking the mirrored copy. The lower layers don't know the checksum or granularity of
the filesystem blocks, and so they are not able to verify the data they return.
Well, that's not entirely correct, since a checksum has to be checked not against the other mirror, but
against the data itself (i.e. it has to be recalculated after the read), since data can be damaged
during transfer, and that is not such a rare condition. Thus checksums from different mirrors can both
be wrong but equal, which without recalculation would signal that everything is ok while it is not.
Recalculating the block checksum can be faster for smaller blocks than reading it from another disk.
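The "recalculate after read" point can be sketched in a few lines: the reader recomputes the checksum over the bytes it actually received and compares it with the stored value, and only then decides whether to try another copy. This is a userspace illustration under loud assumptions: the checksum is plain Adler-32, picked here only for brevity, and pick_good_mirror() is a hypothetical helper, not anything from Btrfs, MD or device mapper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain Adler-32 over a buffer; chosen only because it is short. */
static uint32_t toy_checksum(const unsigned char *data, size_t len)
{
    uint32_t a = 1, b = 0;

    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/*
 * Recompute the checksum of every copy as it was read and return the
 * index of the first one that matches the stored checksum, or -1 when
 * all copies are damaged. The copies are never compared to each other.
 */
static int pick_good_mirror(const unsigned char *const mirrors[],
                            size_t nr_mirrors, size_t len, uint32_t stored)
{
    assert(mirrors != NULL);

    for (size_t i = 0; i < nr_mirrors; i++)
        if (toy_checksum(mirrors[i], len) == stored)
            return (int)i;
    return -1;
}
```

The design point is that a copy is trusted only if its recomputed checksum matches the stored one, so two mirrors that are damaged in the same way are both rejected instead of confirming each other.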
If Btrfs were to rely on device mapper for aggregating all of the physical devices into a single big
address space, it would not have sufficient information to allocate mirrored copies on different devices.
Keeping this information in sync between Btrfs and the device mapper would be difficult and error prone.
Actually it is very simple. DST
supports such interaction, for example.
Instead I propose and will use the following scheme for subvolumes (I like the name) in a local filesystem:
there is a pool of devices, and there are allocation policies for each one in the following form (just an example):
files with '*.jpg' pattern are allocated from device 1, '*.log' from device 2, metadata is stored on device 3,
small files are allocated on device 4, and so on. Then each device has its own policy on mirroring its data to the needed
number of storages.
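The policy example above could look roughly like this as a lookup function. This is only a userspace sketch of the idea: the rule table, pick_device(), the 4096-byte "small file" limit, the fallback device 0 and the order in which rules are checked are all assumptions made for illustration; only the patterns and device numbers 1-4 come from the example. Pattern matching uses POSIX fnmatch().

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

struct alloc_rule {
    const char *pattern;   /* shell-style glob matched against the file name */
    int device;            /* device the matching files are allocated from */
};

/* Name-based rules taken from the example in the text. */
static const struct alloc_rule rules[] = {
    { "*.jpg", 1 },
    { "*.log", 2 },
};

#define METADATA_DEVICE    3
#define SMALL_FILE_DEVICE  4
#define DEFAULT_DEVICE     0      /* assumption: fallback when nothing matches */
#define SMALL_FILE_LIMIT   4096   /* assumption: what counts as a small file */

/*
 * Pick a device for a new allocation. Assumed priority: metadata first,
 * then name patterns, then the small-file rule, then the default device.
 */
static int pick_device(const char *name, size_t size, int is_metadata)
{
    assert(name != NULL);

    if (is_metadata)
        return METADATA_DEVICE;

    for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++)
        if (fnmatch(rules[i].pattern, name, 0) == 0)
            return rules[i].device;

    if (size < SMALL_FILE_LIMIT)
        return SMALL_FILE_DEVICE;

    return DEFAULT_DEVICE;
}
```

The per-device mirroring policy would then be a second, independent table keyed by the device number returned here.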
And, a side note, it looks like Chris Mason uses Mac OSX for development or at least for writing documentation,
since a screenshot of high-level design
clearly has Mac's shadows and fonts :)
/devel/dst :: Link / Comments (0)
Vodka, martini, absinthe.
All were damn cool. For the first time I tried vodka with martini,
shaken, but not stirred. Perfect. James Bond knows his stuff.
I mixed about half vodka and half martini, so overall it was a bit softer than
pure fire water, with an excellent, slightly dry and sweet aftertaste of martini.
I recommend it.
Absinthe was very good as usual too. Although it is very strong (around 70%
alcohol) it is very tasty because of thujone (I do not know the real amount of it,
since vendors usually put wrong specs on the bottles). It can be drunk without additional
food or drinks. Better with burnt sugar - the resulting caramel adds a very interesting
sweet taste.
/life :: Link / Comments (0)
Wed, 30 Jan 2008
POHMELFS first release: 50 ml.
Is postponed until tomorrow. Kernel side is fully ready and was quite
actively tested (I think I found lots of tricks used by ext2 and others
when they maintain link counters and process inodes). There are some issues
with local caches, which I will think about later. There are two caches
right now: one is global, for hash to inode conversion; the other one is
per-inode and contains only hash to inode number keys, for example
hard links and directories like '.' and '..'. Other usual
directories and files exist in both caches. The latter cache is used
for the ->readdir() implementation, since it is also indexed
by the offset field.
I will not release code today because of userspace server, which is so
utterly bad (for every single operation it has to traverse tree of the
objects and to open/close each parent's file descriptor), so it screams
for rewriting. At least for the initial rewrite it will open every single
object it contains, so that requests from remote client would not require
lots and lots of tree traversals. I know that this is not a good solution,
but the only good solution is to move server into the kernel too, but it
will take several days to complete, so it will be scheduled for future versions.
Also found that debugging in Xen is a nightmare: first, it does not support oprofile
(at least the latest version constantly tells me "No sample file found" when I try
to see the report); second, it is buggy - I have (or so it looks) two xen domains with identical kernels,
and one of them regularly freezes in such obscure places that it is impossible
to debug it correctly. And then I lost the first setup, which worked well (some ldap
problems do not allow logging in to that domain anymore)... Third,
the Xen setup I have is slow (very damn slow); fourth, it is unfair testing,
since other domains can eat all the cpu during one test or another, and that will
not be easily detected.
So, for initial testing it is enough, but real development will require real hardware.
Stay tuned...
/devel/fs :: Link / Comments (0)
Tue, 29 Jan 2008
CRFS release plans.
Zach Brown will release
his CRFS (Cache/Coherent Remote File System) at
LCA this Friday.
My congratulations!
This does not change any plans about pohmelfs though,
first version of which will be released today or tomorrow :)
/devel/fs :: Link / Comments (0)
Mon, 28 Jan 2008
Is Lustre dead after Sun's acquisition?
It looks so.
First, because the new (1.8) release, which will see the light this summer, will run as a userspace application
(I want to highlight that it is a parallel (!) high-performance (!!) filesystem (!!!))
on top of ZFS, which is a slow
(see a high-end test: zfs is faster than ufs
in only a single setup, which says how bad the solaris vfs cache is, although that is speculation only)
filesystem, designed and first implemented at Sun, then ported to
Linux via the zfs-fuse project.
Userspace zfs runs slower
than the kernel one in most cases; actually it is faster than kernel zfs in only a single test, and the difference is close to the error rate (about 5-8%).
Sun posted those tests to lustre-devel a couple of months ago.
Second, because kernel support of Lustre (it is based on ext3) is "too complex" for Sun,
and thus will be dropped after 2.0 release (end of the year):
... because it removes the burden of having to maintain kernel patches for Linux.
The encumbrance of kernel patches has made development and debugging of Lustre considerably
more complex than in user space; it has slowed our support for new Linux kernels and distros;
and it's even been the source of some nasty regressions when unsupported kernel APIs changed
from under us.
Btw, Lustre 2.0 will support clustered metadata, which will allow metadata-intensive
operations to scale greatly.
Such a situation is perfect for new distributed filesystem development!
/devel/fs :: Link / Comments (2)
POHMELFS naming convention and the first release date.
I've just decided, that
POHMELFS
will not use traditional versioning (1, 2, 3 or 0.1, 0.2 and so on) system,
but completely new, related to its name.
As you probably know,
POHMELFS stands for Parallel Optimized
Host Message Exchange Layered
File System, so it is very logical to use the following naming convention:
50 ml, 100 ml, 0.3l, pint, 0.5l, 0.7l, 1 liter and so on...
The first release is scheduled for this week. It will not include the cache coherency
algorithm, but will have a completely new and faster local cache.
Stay tuned!
/devel/fs :: Link / Comments (2)
Sat, 26 Jan 2008
Meanwhile at apartment development side.
I think I completed my bathroom cleaning, ceiling installation
and so on; the only missing thing there is a single corner
3 tiles wide - I have no glue for ceramic tiles so can not
finish it yet, but eventually things will be completed.
Today I even installed a washing machine there. Likely
for all of you this sounds completely crazy, but I have lived
with hand washing for more than a year already in my loft; now I think
I have entered the 21st century. Not tested yet though.
Next task is either to get glue for ceramic tiles and finish the bathroom
and hall, or to implement book/bottle shelves. My camera will be ready
(hopefully soon, I already confirmed the repair price), so expect
a lot of photos of the place I have been living in for a while already.
I really like it.
/devel/flat :: Link / Comments (0)
Fri, 25 Jan 2008
Climbing evening.
That was quite a small training, muscles were not as tired as usual,
but fingers on both hands and feet were rubbed quite hard.
The former because of the lack of trainings, the latter because of
huge holes in the shoes. Nevertheless I'm tired enough
to say that this was a very good training, which contained
a number of new starts, completed without serious problems,
and a number of boulderings. At the very end I started to perform
several starts of the same trace on the negative slope
and move down from about 4-5 meters over different holds;
that was repeated in a short loop of three rounds, which sucked
out the last power and rubbed off the last skin on the fingers that I needed
to feel comfortable. It took about 3 hours including sauna, shower
and climbing, and that was an excellent time.
/life :: Link / Comments (0)
POHMELFS got correct rmdir support.
That was quite easy: somehow, when a directory is being removed, it is required
to drop its reference counter twice and to drop one for the parent directory.
Files do not require that (or there is a bug in my code): they only drop
their own counter.
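Assuming the counter in question is the link count, the "somehow" has a conventional Unix explanation: an empty directory is referenced twice (by its entry in the parent and by its own '.'), and its '..' adds one reference to the parent, while a regular file is referenced only by its name. The toy model below shows just this accounting in userspace; struct toy_inode and the toy_*() helpers are hypothetical names for illustration, not pohmelfs or VFS code.

```c
#include <assert.h>

struct toy_inode {
    unsigned int nlink;
    int is_dir;
};

/* mkdir: the new directory gets 2 links, the parent gains one for "..". */
static void toy_mkdir(struct toy_inode *parent, struct toy_inode *dir)
{
    assert(parent->is_dir);
    dir->is_dir = 1;
    dir->nlink = 2;      /* entry in the parent + its own "." */
    parent->nlink++;     /* the child's ".." points back at the parent */
}

/* create: a regular file is referenced only by its name in the parent. */
static void toy_create(struct toy_inode *parent, struct toy_inode *file)
{
    assert(parent->is_dir);
    file->is_dir = 0;
    file->nlink = 1;
}

/* rmdir: undo exactly what toy_mkdir() did. */
static void toy_rmdir(struct toy_inode *parent, struct toy_inode *dir)
{
    assert(dir->is_dir && dir->nlink == 2);   /* only empty directories */
    dir->nlink -= 2;     /* drop the name in the parent and "." */
    parent->nlink--;     /* drop the reference held by ".." */
}

/* unlink: a file only drops its own counter, the parent is untouched. */
static void toy_unlink(struct toy_inode *parent, struct toy_inode *file)
{
    (void)parent;
    assert(!file->is_dir && file->nlink > 0);
    file->nlink--;
}
```

Starting from a root directory with two links, a mkdir followed by rmdir brings both counters back to where they were, and the removed directory ends at zero, which is what allows it to be freed.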
Also started link()/symlink() implementation. The former
has the following problem: the userspace server has a mapping between inode
number and object name; when link() is executed, it creates a
new object, which refers to the existing inode under a different name, so the code
fails. I will think about how to implement it without creating a dentry/inode
cache on the server side, but that will be another argument against a userspace
server and for a kernelspace one. In the kernel all those operations should be very
straightforward and fast.
symlink() requires a new operation (i.e. a new network structure to
be transferred), which will include the symlink name, the name of the object it
refers to (this can be an arbitrary string) and the parent directory entry.
Should not be complex to implement.
After all these things are completed, I will perform
LTP testing on top of it,
and then run some benchmarks...
Stay tuned.
/devel/fs :: Link / Comments (0)
Thu, 24 Jan 2008
POHMELFS development progress.
I've performed a number of tests (before electricity was shut down), which included
untar, execution and compilation of small objects. They all went perfectly
fine except directory removal; it has some troubles because I only decrement the
number of links in the object, not including the directory itself, so funny things can
be observed during unmounting (like 100% cpu usage, likely produced by the dentry cache
processing code in VFS). I also found how crappy Debian Etch (or
at least the installation I have) is - I do not know
why, but every ls operation first tries to access an ldaprc
file in every directory I run it in. If only you could see which files the gcc compiler
wants to see in the compilation directory for a simple fstatat()
testing application...
But overall it looks ok, so far without cache coherency protocol involved,
but I think I have pretty clear idea on how to implement it correctly.
I'm looking at this and recall that not that long ago I wanted to get a linux kernel
hacker position in some company; they developed a multi-layer cache system
(i.e. vanilla page cache in memory, then a lower level cache on disk and finally
tape storage) and asked me about my experience. It was quite miserable (and I would
not say I suddenly became brilliant :).
Then there was a question about what an inode is...
And now I develop my own network filesystem, then a local and a distributed one - how
interestingly things move with time, what will be next?... I believe that everything
that happened was excellent, and it will be even better.
/devel/fs :: Link / Comments (0)
Grange has a birthday!
My congratulations and wishes for better year than ever!
/life :: Link / Comments (1)
Tue, 22 Jan 2008
Meanwhile at apartment development side.
I spent the whole day filling gaps between ceramic tiles
(2-3 mm) with the special water-proof plaster. It is about 23:00 in Moscow,
and things are not yet completed, but I will postpone the rest for a while,
since I am tired of working in dust today.
Also got lots of wood plates to either finish the table (although they
are thinner than the table itself) or start book/bottle shelves
development.
I got about 12 meters of 10 and 13 cm wood plates (three plates of two meters each
for each size), each 2 cm thick. That should be enough for a smaller installation,
but since I want to have a set of X-shelves about 2 meters high, I will get
more when I finish working with those. Likely I will start it this weekend.
/devel/flat :: Link / Comments (0)
New DST release: Succumbed to live ant.
This is a maintenance release only and it contains
only the following change:
- do not allocate a rather big address structure on the stack
during local export node initialization
Great thanks to Serge Leschinsky and Konstantin Kalin for testing.
As usual one can get the latest version from
project homepage
or via git tree.
/devel/dst :: Link / Comments (8)
Sat, 19 Jan 2008
Fire in the office.
Not where I'm sitting right now, but couple of floors above. Smell of smoke,
several fire brigades, a lot of water on higher floors...
Crap.
Update: after some talks with security, the official version is about a development process
which happened on the 5th floor (welding of the main support columns), which resulted
in a fire around the lagging, which resulted in a more serious fire on the 5th floor.
As I was told, the 4th floor was heavily flooded (there were 5 fire brigades), but the third one,
where I sat, had its electricity on, so likely nothing major happened there.
/life :: Link / Comments (0)
Fri, 18 Jan 2008
POHMELFS got initial writing support.
$ ls -l /mnt/tmp/
total 0
$ echo asdasdasdzxczxczxcqeqweqwe > /mnt/tmp/test
$ sync
$ ls -l /mnt/tmp/
total 0
-rw-r--r-- 1 zbr users 27 Jan 18 22:29 test
$ cat /mnt/tmp/test
asdasdasdzxczxczxcqeqweqwe
$ mount | grep pohmel
qweqwe on /mnt type pohmel (rw)
The same data is on the server, and it was only written there after sync
was executed, i.e. exactly in the ->writeback() callback and thus via the page cache.
I will describe it in detail in the next post.
To be a complete (simple!) FS, I have to implement inode operations for special files and link support; both
are quite simple (and probably can be postponed). The most interesting idea I have to think about
is metadata caching (so far it is a write-through cache, which is not optimal, I want a write-back one).
The next complex task is cache coherency algorithms. It will be started after testing (including performance)
of the initial POHMELFS implementation without cache coherency involved at all.
Stay tuned!
/devel/fs :: Link / Comments (0)
Anatomy of the filesystem. Object creation and removal.
Let's first discuss object creation. It is pretty simple:
each directory inode has an inode_operations
structure, which contains ->create()/->mkdir()
callbacks. Prototypes of both look like this:
static int pohmelfs_create(struct inode *dir, struct dentry *dentry, int mode,
struct nameidata *nd);
static int pohmelfs_mkdir(struct inode *dir, struct dentry *dentry, int mode);
Where dir is the parent directory inode, dentry
is the directory entry structure, which contains the inode for the given object
(dentry->d_inode, it is NULL for the object being created, since there is no
inode yet for the given dentry),
its name (dentry->d_name)
and lots of other interesting fields, which are not that interesting for
filesystem creation. FS code should allocate space for the new entry and
add it there.
At the end one has to fill the dentry with the new inode info. It can be done either by
d_add(dentry, &npi->vfs_inode);, or more correctly by
d_instantiate(dentry, &npi->vfs_inode);, which is called from d_add(),
which then adds the dentry into hash chains. Ext2 also marks the inode as dirty multiple times, and so does minix.
This operation has no effect on a network filesystem, afaics, but for block based filesystems
it adds the inode into the dirty list. However, practice shows that d_instantiate(dentry, &npi->vfs_inode);
is not enough, and d_add(dentry, &npi->vfs_inode); should be called for a network
filesystem.
Object removal is essentially the same. The following callbacks are invoked by the VFS layer
when an object is being deleted: ->unlink() and ->rmdir().
The former is called for usual files, nodes and so on, the latter - when you
call rmdir(). Both have the following prototype:
static int pohmelfs_unlink(struct inode *dir, struct dentry *dentry);
static int pohmelfs_rmdir(struct inode *dir, struct dentry *dentry);
Where dir is the parent directory inode and dentry contains the directory entry,
which in turn has the inode pointer and the name of the object.
The filesystem should remove the appropriate object from the disk and update
its fields, mainly the offsets used in the
->readdir()
callbacks.
All described callbacks should return a negative error value, or zero in case of correct completion.
/devel/fs :: Link / Comments (0)
Fedora upgrade sucks.
Since I got a fast internet connection I decided to upgrade Fedora Core 7, installed
on my laptop, to its next version via yum. The machine has 256 Mb of RAM and 512 Mb of swap,
so I expected there would be no problems, but I was wrong - it ended up in an OOM condition
and yum got stuck, so I killed it (with SIGKILL, since it did not respond
to anything else). Subsequent runs end up with a transaction check error where some
packages (likely just installed) conflict with other ones (probably with old versions),
and I expect that the machine will be unusable after suspend/resume or reboot, and there
is no way to roll back installed packages...
Although I started the downgrade process, it is about 3 o'clock in Moscow, so I will move
to bed and hopefully things will be resolved this morning.
Another serious design problem of the whole yum system is its dependency tracking.
It requires downloading a 3-5 Mb sqlite database almost every time one wants to install
any single package, even one which has no dependencies to be resolved
or is only several kilobytes in size.
/devel/other :: Link / Comments (0)
Thu, 17 Jan 2008
Meanwhile at apartment development side.
I've entered the civilized world - I have Internet at home now,
and not a crappy slow gprs link, but a real bloody fast connection.
As additional steps I completed the hinged clustered ceiling in the bathroom,
although its pieces will be removed and placed back after some time,
since I did not finish tile gluing yet, since I have no glue...
I also got some appliances, which make my living even more comfortable.
Also created an interesting design for book shelves: they will be shared
with the bottle cellar and will look like a lot of letters 'X' near each other,
or like a grating rotated by 45 degrees; it will be about 1.5-2 meters high.
I will make it of wood together with the missing part of the
table soon
(hopefully I will get materials this weekend). There is also a plan to get
tile glue and lay ceramic granite in the hall; after this task is finished,
I can say the whole development process is really near its end.
/devel/flat :: Link / Comments (0)
Wed, 16 Jan 2008
Filesystems and disk caches.
It is known that disk caches are generally very bad for data
integrity in case of various hardware failures or power outages.
It looks like even the safest filesystem will have a hard time recovering
in such cases.
Alan Cox describes
how Ext3 behaves in such situations: if a powerfail during a write damages the sector, ext3 can
not recover; a powerfail during a write may cause random numbers to be returned on read, but
fsck should handle that; ext3 should survive if a powerfail damages some sectors
around the sector which was written. All of the above does not always happen, and bad things
can happen in every case.
XFS has even more serious
damage in case of powerfail.
/devel/fs :: Link / Comments (0)
Tue, 15 Jan 2008
POHMELFS development progress.
If you are curious about the strange delay in POHMELFS development, do not think
it is closed or stuck; there are a number of things I'm working on in this network
filesystem, and the delay is only because of administrivia around my testing environment
and things like that...
Now it seems things have settled down and I have some news.
First, it supports object creation in the filesystem, so far only regular files, but
directories and links are just a matter of additional flags, so it is simple.
Second, it supports object removal (tested on files only though). It does not support
file writing yet, and all metadata operations described above (removal and creation)
perform network sending and receiving (removal can be done in local cache only).
I will write a more detailed explanation of the operations involved just after directory/link
creation is ready, likely tomorrow.
/devel/fs :: Link / Comments (0)
BTRFS 0.10 has been released.
Chris Mason announced
new release of the BTRFS filesystem.
According to the changelog, this version contains pretty serious changes:
- on-disk format changes, now it supports back references from every data and metadata block.
This allows future extensions like an on-line fsck implementation
(a question arises: why is it needed at all for a COW FS?) and data migration between different
devices.
- online resizing (including shrinking)
- in-place conversion from ext3 to btrfs :) Although it is offline only, it is a very good
step towards easier migration for users.
The conversion program uses the copy on write nature of Btrfs to preserve the
original Ext3 FS, sharing the data blocks between Btrfs and Ext3 metadata.
Btrfs metadata is created inside the free space of the Ext3 filesystem, and it
is possible to either make the conversion permanent (reclaiming the space used
by Ext3) or roll back the conversion to the original Ext3 filesystem.
- data=ordered support. (Probably it is an option of the transaction log journal)
- mount options to disable checksumming and COW (the latter explains a lot about
fsck and journalling)
- barrier support
From the changelog alone, it looks really impressive, my congratulations to the
project. The list of unfixed bugs worries me a bit, but I'm pretty sure things will be fixed.
/devel/fs :: Link / Comments (0)
Direct IO with filesystem from the kernel and fast mapping for loop device.
Although every bit of the system is easily accessible from the kernel,
it is quite hard to do filesystem related tasks which are generally only
performed from userspace, for example reading and writing files. Actually
one can call the whole sys_open()/sys_read()/sys_write() path
from the kernel, but it is quite slow and ineffective.
Likely the most common example is the loop block device driver, which allows
making a usual file look like a block device, so one can
mount it, create files there and so on.
With time the loop driver became more and more complex; I recall my first
block layer driver (async block device,
which was similar to the loop device, but allowed performing a lot of operations
asynchronously; it was used to test the acrypto
crypto system) was based on it.
The loop device is quite slow, so Jens Axboe (block layer maintainer) came into the game and
extended it to support a much faster mapping of the blocks to read/write from the kernel
than the existing one.
His first version was extended by Chris Mason (btrfs
author among other things), who basically moved the mapping code into the filesystem,
so address space operations were extended to include new callbacks called
->map_extent() and ->extent_io_complete().
The former is used to map an offset inside the file into an extent. Basically an extent is an area
on the disk bigger than a block; so far it is not supported by the mainline tree (at least the 2.6.24 tree),
so one can consider this callback a mapping from file offset into block number. Usually it is
implemented by the filesystem specific ->get_block() callback. The extent part of the patchset
adds a special tree of extents, which can be addressed by offset in the address space; if there
is no extent in the tree, it can be inserted. Extent creation is implemented via ->get_block().
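The offset-to-extent mapping can be illustrated with a small userspace model. Everything here is an assumption made for the sake of the example: struct toy_extent, toy_map_extent() and toy_file_to_disk() are hypothetical names, and a sorted array with binary search stands in for the extent tree of the patchset. It only shows what "map a file offset into an extent" means, including the case where the offset falls into a hole.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous run of file data on disk. */
struct toy_extent {
    uint64_t file_off;   /* start offset inside the file, in bytes */
    uint64_t len;        /* length of the run, in bytes */
    uint64_t disk_off;   /* where the run starts on disk, in bytes */
};

/*
 * Find the extent covering 'off' in an array sorted by file_off with
 * non-overlapping entries. Returns NULL when the offset is in a hole.
 */
static const struct toy_extent *toy_map_extent(const struct toy_extent *ext,
                                               size_t nr, uint64_t off)
{
    size_t lo = 0, hi = nr;

    assert(ext != NULL || nr == 0);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (off < ext[mid].file_off)
            hi = mid;
        else if (off >= ext[mid].file_off + ext[mid].len)
            lo = mid + 1;
        else
            return &ext[mid];
    }
    return NULL;
}

/* Translate a file offset into a disk offset; -1 means "hole, not mapped". */
static int toy_file_to_disk(const struct toy_extent *ext, size_t nr,
                            uint64_t off, uint64_t *disk)
{
    const struct toy_extent *e = toy_map_extent(ext, nr, off);

    if (!e)
        return -1;
    *disk = e->disk_off + (off - e->file_off);
    return 0;
}
```

In the real patchset a miss would presumably be the point where the filesystem is asked (via ->get_block()) to create the mapping and insert it; the sketch just reports the hole.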
The second callback, ->extent_io_complete(), is only used to notify the calling layer when
IO is completed; so far it is only used to show when hole filling is completed. Actually I do not know
how this callback can be used by a classical filesystem, but copy-on-write ones should benefit greatly,
since they automatically get an async completion, so the higher-layer tree can be updated. Classical
filesystems already handle this situation though. Since it is only implemented for hole filling, it looks
like a little hack :)
Here is Jens' first presentation,
and here is Chris' presentation
of the extent mapping code used to implement fast mapping in loop device.
/devel/fs :: Link / Comments (0)
Mon, 14 Jan 2008
Climbing evening or bringing body in shape with the liver.
It was not that simple a training, although I finished a number
of very interesting traces including two completely new for me;
one of them I finished on-sight, the other one I failed since I did not
know where the correct holds were placed. I was quite surprised that
I can climb not that badly even after all the vacations and related
volume consumption. I almost killed my climbing shoes, so I will need to
get new ones, since climbing is not that pleasant when the big toe
looks out at the universe and touches the wall through a huge hole.
Anyway, it was a really good first training of this year!
/life :: Link / Comments (0)
Sun, 13 Jan 2008
I have a hacker's hat now!
It was presented to me at the New Year celebration by Mephody and Irin.
People look strangely at me; fortunately I do not wear it on the streets.
It looks remotely similar to Alan Cox's
one. I do not have such a beard though. And brain. But I'm working on the latter.
Abr and Tanya gave me a hugely useful thing - a USB IO port system, which has multiple
GPIO and other IO output bits controlled via USB. It has to be soldered first, and there
is only a Windows application, so this will require a bit of reverse engineering (if there is no
open protocol, I did not check that yet). Great thanks!
I also have a T-shirt showing my evolution from the monkey to the Man and then to the computer
geek.
I do not have a camera yet to take photos of my presents and the apartment under development, but stay tuned,
it should be ready soon too...
/life :: Link / Comments (2)
POHMELFS filesystem development progress.
So far it is not that big - I'm still trying to set up 3 testing machines (I do not
have physical access to them, so there should be a way to reboot them, check console
output and so on). Actually they are 3 small Xen domains on a remote machine, so things
are a bit more complex, and the setup is not yet enough for initial testing.
Since pohmelfs testing is postponed a bit, I started designing and hacking on a distributed cache coherency system.
So far I will implement the so-called MESI
protocol, which is used, for example, in IA-32 SMP machines. There is a number of problems,
since a distributed system is very different from a bus-driven SMP machine. For example,
there is no way in a remotely sane distributed system for a single node to snoop data requests
made by other nodes, but this trick is heavily used to catch requests for modified cache lines.
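As a rough illustration of MESI without bus snooping, here is a minimal sketch in which a central directory tracks the state each node holds for one cached object. The directory-based approach, the Directory class and its methods are assumptions made for the example and are not the design of the planned system.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class Directory:
    """Tracks the MESI state of one cached object on every node."""

    def __init__(self):
        self.states = {}  # node id -> State; missing nodes are INVALID

    def state(self, node):
        return self.states.get(node, State.INVALID)

    def read(self, node):
        if self.state(node) is not State.INVALID:
            return  # cache hit, no state change
        holders = [n for n, s in self.states.items()
                   if n != node and s is not State.INVALID]
        for other in holders:
            # A MODIFIED holder would write its data back here;
            # MODIFIED and EXCLUSIVE holders are downgraded to SHARED.
            self.states[other] = State.SHARED
        self.states[node] = State.SHARED if holders else State.EXCLUSIVE

    def write(self, node):
        # Every other copy is invalidated before the writer proceeds.
        for other in list(self.states):
            if other != node:
                self.states[other] = State.INVALID
        self.states[node] = State.MODIFIED
```

Because nodes can not observe each other's requests, every read miss and every write has to go through the directory, which plays the role that snooping plays on a shared bus.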
The cache coherency protocol is a very interesting problem in itself, so I am developing it first as a standalone
application, which will be tested for scalability against a huge number of users. Then I will integrate it into
pohmelfs.
I do not have fast internet at home anymore: I returned the SkyLink modem and will use crappy
GPRS until a good internet connection is set up, so this makes things a bit more complex too...
Right now I have less time for hacking, since quite a lot of spare time is being
eaten by other things, but I expect to be in shape soon.
So, if you do not see frequent updates, it is just a temporary thing, things will be ok.
Stay tuned...
/devel/fs :: Link / Comments (0)
Thu, 10 Jan 2008
How to fix Debian upgrade process with "A non-dpkg owned copy of the libc6-i686 package was found." error.
I've checked the preinst script in Debian's libc6_2.7-5_i386.deb and found
that the above error
only occurs if either the /lib/tls/i686/cmov/libc.so.6 or the /lib/i686/cmov/libc.so.6
file exists. My system had the former, which was a symlink to libc-2.3.6.so. I removed
that link and the upgrade from etch to testing completed successfully.
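A small sketch for checking a system for the two paths named above before upgrading. The function name and the root parameter are made up for the example, and using lexists() so that dangling symlinks are also reported is my assumption, not necessarily what the preinst script tests.

```python
import os

# The two paths named above, relative to the filesystem root.
STALE_LIBC_PATHS = (
    "lib/tls/i686/cmov/libc.so.6",
    "lib/i686/cmov/libc.so.6",
)


def find_stale_libc(root="/"):
    """Return the listed libc paths that exist under root (symlinks included)."""
    candidates = (os.path.join(root, rel) for rel in STALE_LIBC_PATHS)
    return [path for path in candidates if os.path.lexists(path)]


if __name__ == "__main__":
    for path in find_stale_libc():
        print("found:", path)
```

It only reports the files. Whether removing them is safe on a given system is a separate decision.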
I performed the above steps on two different machines, one of which runs my own 2.6.23 kernel and the other
Debian's 2.6.18 one. The former booted successfully and the latter did not, so take that into account:
it looks like some kernel changes (from 2.6.18 to 2.6.22 in Debian testing) resulted in an unbootable machine.
/devel/other :: Link / Comments (4)
Why don't I like Debian.
Because it breaks my dreams.
Actually only a single dream: a dream about perfect life.
I always wanted to believe that Debian is able to perform
an easy upgrade between major versions, since it has the so much hated/loved
stable/testing/unstable split. I know, Fedora, SuSE and others can not
perform a major leap between versions using only command line tools.
Sometimes they can (especially Fedora on x86),
but Debian (in my dreams) has to do that always.
And it has just fucked my sweet dream:
Do you want to upgrade glibc now? [Y/n]
A non-dpkg owned copy of the libc6-i686 package was found.
It is not safe to upgrade the C library in this situation;
please remove that copy of the C library and try again.
dpkg: error processing /var/cache/apt/archives/libc6_2.7-5_i386.deb (--install):
subprocess pre-installation script returned error exit status 1
What the hell does it mean? How the hell is this possible? I do not know,
but as of today I hate Debian.
I tried a number of things to cure the situation, but failed.
I'm pretty sure there is a probability that my hands are connected to the ass,
and I only think and believe that they are connected to the shoulders.
After about 3-4 hours of this crap I eventually removed the libc6 package from my
installation and immediately everything stopped working:
# ls
bash: /bin/ls: No such file or directory
The only reason to break a seems-to-be-cool Debian Etch installation was its too old
glibc (libc6) package, which does not contain openat() and the related syscalls
that are extensively used in the pohmelfs userspace server.
And here
is the reason. I do not know how correct this change is, but it does not allow upgrading glibc
(and more generally performing a dist-upgrade) from etch to testing in my setup.
/devel/other :: Link / Comments (6)
My testing environment.
Just like the good old days: several machines with 256 MB of RAM
and a 1-3 MB/sec connection to and between them. Things are not that bad,
there are several Xeon (E5345) machines around with InfiniBand cards and
several GB of RAM, but that requires setup, installation and so on, so
right now it is enough to have smaller systems, which compile
a small kernel in about 30 minutes and untar it in 4 minutes. I am not in a hurry.
/devel/other :: Link / Comments (0)
Wed, 09 Jan 2008
Cached metadata operations on clients and remote server.
Things are not that simple actually - there is no way to work with an offline
server on top of existing filesystems. Since every existing filesystem uses its own
inode generation method, a client disconnected from the server can not create
new objects in its cache, since their inode numbers will not correspond to the ones
which would be created if the server were online. When the network filesystem is bound
to a single server filesystem only, it can use the same logic and then only has to resolve
the problem of multiple clients creating different objects with the same inode numbers
while the server was offline. With a single client there would be no such problem at all.
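The multi-client conflict can be shown with a toy model. It assumes, purely for illustration, that the server filesystem hands out sequential inode numbers and that each offline client mimics it starting from the last number it saw. Real filesystems allocate inodes differently, and both function names are made up.

```python
import itertools


def offline_create(last_server_ino, names):
    """Mimic the server's allocator locally: next free number per new name."""
    counter = itertools.count(last_server_ino + 1)
    return {name: next(counter) for name in names}


def find_conflicts(*clients):
    """Return {inode number: [names]} for numbers used by more than one object."""
    owners = {}
    for created in clients:
        for name, ino in created.items():
            owners.setdefault(ino, []).append(name)
    return {ino: names for ino, names in owners.items() if len(names) > 1}
```

With one client find_conflicts() is always empty. With two clients that both last saw inode 100, the first object each of them creates gets number 101, and that has to be resolved on reconnect.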
So, to correctly implement new object creation I've completed non-cached create/remove
methods for the objects.
Right now I'm waiting for the server setup to start testing the new features (file writing support
and file/dir creation/removal). I hope it will be done today, so that I can share
the problems and interesting results found during this stage.
/devel/fs :: Link / Comments (0)
2008 Linux Storage and Filesystem Workshop.
I was invited to the LSF workshop,
but it is quite hard for me to attend, not even counting visa problems, travel and
other such small things.
As the kernel summit showed me, this is actually a very personal meeting, i.e. people
come there to meet other people who have something to talk about. Mostly it is
all about personal contact, I think.
I do not have enough personal contacts in the community actually, so there would be
quite few people to talk to about different things, which I believe is the main
reason.
I think we will have very interesting talks by email, IRC and the like first :)
/devel/fs :: Link / Comments (0)
Tue, 08 Jan 2008
Write support in POHMELFS.
My network filesystem got file writing support, which is rather trivial
right now - the ->prepare_write()/->commit_write() callbacks
do nothing, but the ->writepage() method sends data to the server.
It uses a very simple request/reply protocol to report errors on the server
side, and does not include any cache coherency mechanisms yet. Since
only ->writepage() is used, data always stays in the client's cache
and is only sent to the server when the local system wants it (for example
when the system needs to flush some data to the storage or when it wants
more memory).
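The behaviour described above, where writes only touch the local cache and data leaves the client at writeback time, can be modelled with a small sketch. The PageCache class and the send callable are hypothetical stand-ins for the kernel page cache and the network request, not POHMELFS code.

```python
class PageCache:
    """Toy write-back cache: writes stay local until a page is written back."""

    def __init__(self, send):
        self.send = send    # hypothetical callable(index, data) -> None
        self.pages = {}     # page index -> bytes
        self.dirty = set()  # indexes of pages not yet sent to the server

    def write(self, index, data):
        # Counterpart of the empty prepare/commit step: no network traffic.
        self.pages[index] = data
        self.dirty.add(index)

    def writepage(self, index):
        # The only place where data is sent to the server.
        self.send(index, self.pages[index])
        self.dirty.discard(index)

    def flush(self):
        # Called when the system wants to sync or needs memory back.
        for index in sorted(self.dirty):
            self.writepage(index)
```

Until flush() or writepage() runs, the server has seen nothing, which is why no other client can observe the data without a coherency protocol.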
The next step is to implement metadata operations - directory entry creation/modification
(like file/directory create/remove/move, link/unlink and so on) and file metadata operations
(like attribute management and truncation).
After these tasks are completed (I expect that to be quite soon, it is not that
complex to operate on local cached entries), the cache coherency protocol will enter the game.
So far it will be quite simple: each client will have a number of states associated with each
inode, so when one or another is changed, the server will be notified, and when another client
is about to access modified data, it will be synced to the server.
Another task is to test client scalability: when there are multiple users working on the same
client of the pohmel filesystem, how well does the network filesystem perform? Is the locking too coarse?
It is right now - there is a single lock, which guards each network operation, and it should not
be changed except by introducing multiple sockets, which is quite a bad decision imho, since
the network is supposed to be the bottleneck (or remote storage speed, but that can be changed
by switching to faster storage) in this scenario, so having too fine-grained locks for different
network operations does not change anything at all. The local cache, which contains inodes,
can be operated on using three different tuples (I described them
previously), but there are
two locks: one lock for offset-based searches (offset inside the address space of the inode, for example
reading directory content, where each directory entry in the stream is located by its offset in the given
stream), and another lock for more generic operations like searching for an inode by its number or by the hash
of its name in the parent direntry (including length and parent inode number). Although both kinds of
operations are supposed to be very fast (about O(log2(N)),
where N is the total number of inodes in the filesystem), practice can break that dream, since that speed
can be too low for very dense filesystems.
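A rough sketch of a cache with three lookup keys guarded by two locks, as described above. Everything here is an assumption made for illustration: the class and method names, the use of crc32 as the name hash, and the sorted list plus dictionaries standing in for the real search trees.

```python
import bisect
import threading
import zlib


class InodeCache:
    """Toy inode cache: lookup by directory offset, inode number or name hash."""

    def __init__(self):
        self.offset_lock = threading.Lock()   # guards offset-based searches
        self.generic_lock = threading.Lock()  # guards number and name lookups
        self._offset_keys = []  # sorted (parent_ino, offset) pairs
        self._by_offset = {}    # (parent_ino, offset) -> ino
        self._by_ino = {}       # ino -> name
        self._by_name = {}      # (parent_ino, name length, name hash) -> ino

    @staticmethod
    def _name_key(parent_ino, name):
        # Hash collisions are ignored in this sketch.
        return (parent_ino, len(name), zlib.crc32(name))

    def insert(self, parent_ino, offset, ino, name):
        with self.offset_lock:
            key = (parent_ino, offset)
            if key not in self._by_offset:
                bisect.insort(self._offset_keys, key)
            self._by_offset[key] = ino
        with self.generic_lock:
            self._by_ino[ino] = name
            self._by_name[self._name_key(parent_ino, name)] = ino

    def lookup_by_offset(self, parent_ino, offset):
        """Return the first entry of the directory at or after offset."""
        with self.offset_lock:
            keys = self._offset_keys
            i = bisect.bisect_left(keys, (parent_ino, offset))
            if i < len(keys) and keys[i][0] == parent_ino:
                return self._by_offset[keys[i]]
            return None

    def lookup_by_ino(self, ino):
        with self.generic_lock:
            return self._by_ino.get(ino)

    def lookup_by_name(self, parent_ino, name):
        with self.generic_lock:
            return self._by_name.get(self._name_key(parent_ino, name))
```

Directory reads and name or number lookups take different locks here, so a long readdir-style scan does not block a lookup by inode number.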
The last one is the userspace server, which is quite simple so far and likely has its own bottlenecks. One of the crazy
ideas is to move it into the kernel, so that lookup of the inode (file or directory in userspace)
could be very fast. It will also reduce the number of unneeded copies (there is a number of them - I use
simple send()/recv() instead of mapping, and generally there is at least one unneeded,
but unavoidable in userspace, copy from kernelspace to userspace).
Some work should be done on server redundancy - right now there is no failover recovery either on clients
(I do not know of any filesystem which supports that though, do not confuse that with NFS.
All operations with the local cache will succeed of course, but reading from the remote side
will stall), or on servers (if a server fails, clients can not
proceed with work, since there are no other servers which could catch the data and metadata flows.
It has to be fixed).
Anyway, there is a number of interesting tasks to complete, and I expect to have something to show quite soon...
/devel/fs :: Link / Comments (0)
Sat, 05 Jan 2008
Enough!
Of slacking, that is. I'm tired of not doing anything interesting, so
I've just returned.
Because of some changes in the environment, I have no access
to my old setups, so it will take some time to resurrect the old
installations. I still have all the sources though, so I will
continue all my developments.
I will also continue the apartment development process, although I will postpone
some of its bits for a while - it is possible to live there (although
it was possible before too, maybe a little spartan) with
quite a lot of comfort, but there is a number of things to finish.
Overall, I'm back to business. Stay tuned!
/devel :: Link / Comments (0)
Tue, 01 Jan 2008
Happy New Year!
2007 was a very interesting year for me, I made and started
so many cool things, and 2008 will for sure be even better.
I'm having such a good time now, and really like how it goes!
Thanks a lot to all my friends for their time and just for being themselves,
thanks to the angels and daemons which guarded me and created troubles,
thanks to everyone who is around, for making my life what it is.
Happy New Year!
/other :: Link / Comments (0)