Thu, 31 Jan 2008
POHMELFS release notes.
One can grab the release notes, my thoughts (a bit chaotic) and the code
here (POHMELFS core)
and the local-only-cache hack here.
Please note that POHMELFS is less than one month old, so do not
be too hard on it :)
And I'm going to have some fuel to celebrate this release: it was hard, but bloody cool!
/devel/fs :: Link / Comments (0)
First POHMELFS version, codename water:50ml, has been released.
A small benchmark of the local cached mode:
$ time tar -xf /home/zbr/threading.tar
        POHMELFS    NFS v3 (async)
real    0m0.043s    0m1.679s
Which is damn near 40 times faster!
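For the record, the ratio can be checked with a couple of lines. This is only the arithmetic on the two "real" timings quoted above; the function name is made up for illustration.

```python
# Timings copied from the "real" row of the benchmark above, in seconds.
POHMELFS_SECONDS = 0.043
NFS_SECONDS = 1.679


def speedup(slow: float, fast: float) -> float:
    """Return how many times shorter the `fast` run is than the `slow` one."""
    return slow / fast


ratio = speedup(NFS_SECONDS, POHMELFS_SECONDS)
print(f"POHMELFS is about {ratio:.1f}x faster than NFS v3 (async) on this archive")
```

So the exact figure is slightly above 39, which rounds to the 40 quoted above.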
Excited? Below is a bucket of ice for you and me.
Of course the difference will not be _that_ huge in the real world, where
tested archives are larger (this one is a git archive of my
userspace threading
library, which is very small). Since it is so small, there is no writeback
cache flushing.
But you get the idea :)
And that version will not be released, since it uses a really heavy hack,
called the local cache, which is never synced with the remote server. Actually one
can think of it as tmpfs or something like that. The code supports sync,
but since the inode generation process is very different, files and dirs cannot
be blindly synced to the ext3 fs. So, I will release POHMELFS as two patches:
the first one is a network filesystem implementation with a
write-through cache, where an object is first created on the remote side and then
populated into the local cache. This one is slow.
The second patch is a hack to disable writeback caching and implement local caching
only, which is very fast.
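To make the difference between the two modes concrete, here is a toy userspace model. It is not POHMELFS code: the class names are invented, and the "server" only counts round trips instead of talking to a network. It assumes one round trip per created object.

```python
class RemoteServer:
    """Stand-in for the remote side; counts round trips instead of using a network."""

    def __init__(self):
        self.objects = {}
        self.round_trips = 0

    def create(self, name, data):
        self.round_trips += 1
        self.objects[name] = data


class WriteThroughCache:
    """Object is created on the remote side first, then put into the local cache."""

    def __init__(self, server):
        self.server = server
        self.local = {}

    def create(self, name, data):
        self.server.create(name, data)  # pay the round trip first
        self.local[name] = data         # then populate the local cache


class LocalOnlyCache:
    """The hack: objects live only in the local cache and are never synced."""

    def __init__(self, server):
        self.server = server  # kept only to show that it is never touched
        self.local = {}

    def create(self, name, data):
        self.local[name] = data


def unpack(cache, names):
    """Pretend to extract an archive by creating every file in it."""
    for name in names:
        cache.create(name, b"contents")


files = [f"file{i}" for i in range(100)]
wt_server, lo_server = RemoteServer(), RemoteServer()
wt, lo = WriteThroughCache(wt_server), LocalOnlyCache(lo_server)
unpack(wt, files)
unpack(lo, files)
print("write-through round trips:", wt_server.round_trips)
print("local-only round trips:", lo_server.round_trips)
```

In this model the write-through mode pays for every created object, while the local-only mode never touches the server, which is also why nothing ever reaches the remote side.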
After that I will start thinking about how to generically solve the problem of
syncing local changes with the remote server. This, among other things, will allow offline work
with automatic syncing after reconnect.
This is not intended for inclusion. CRFS
is a bit ahead of POHMELFS, but it is not generic enough (because of the above problem)
and works only with BTRFS.
And, btw, I changed the naming convention: since a set of volumes from 50ml to 1 liter
is not enough for serious development, I will prepend a liquid name to each row. So, it will
be water:{50ml, 100ml ... 1 liter}, tea:{50ml ... 1 liter} ... spirit:{50ml ... 1 liter}.
The number of different "waters" I know should be enough for this project :)
Stay tuned!
/devel/fs :: Link / Comments (0)
Nasty dentry abuse or...
... searching for rakes by stepping on them in a dark room. That is how I can describe
the process of hunting for obscure bugs in filesystem code.
Preface 1.
The system locks up hard without a single message in dmesg, although all kernel
hacking options are enabled in the config. The system responds to ping, but there
is no way to log in or to do something as a local user.
Preface 2.
I recall that things used to be cool.
Bisecting is not my friend today, since a fair number of fixes were added, and
while I can find a state where the new bug does not exist, old ones can kill
the system there. So I decided to manually check every patch I added to git
over the last few days. Since I do not know the VFS well enough, there are several things
I just copied from other filesystems (most of them do it that way),
so I started to drop some bits of that code out of pohmelfs.
Eventually I found that, in most filesystems, a lookup which fails to find the requested
dentry adds a NULL inode to the dentry either via d_add()
or via d_splice_alias(). Both look harmless, except that a dentry
with a NULL inode then exists in the dentry cache. Maybe it is fine and there is
some other bug in pohmelfs, but after I added it I started to get those obscure
freezes (quite easily reproducible, with almost 100% probability in some tests),
and sometimes a general protection fault happened in VFS code during umount.
So, I just removed the code which adds a NULL inode to the dentry via d_add(),
and things are good again. I do not know how frequently this can happen in a local filesystem,
but a fact is a fact: after removing this code pohmelfs behaves excellently (modulo its speed).
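A dentry with a NULL inode is usually called a negative dentry: the cache remembers that the name does not exist. The toy model below is plain userspace Python with invented names, not the kernel API, and it says nothing about the freezes. It only sketches what caching a failed lookup buys, and what dropping that step costs.

```python
class ToyDcache:
    """Toy name cache: maps a name to an inode number, or to None for a
    negative entry (the name is known not to exist)."""

    def __init__(self, backing_names):
        self.backing = dict(backing_names)  # name -> inode number "on disk"
        self.cache = {}
        self.backing_lookups = 0

    def lookup(self, name, cache_negative=True):
        if name in self.cache:
            return self.cache[name]  # may be None: a cached "no such file"
        self.backing_lookups += 1    # the slow path: ask the filesystem
        inode = self.backing.get(name)
        if inode is not None or cache_negative:
            self.cache[name] = inode
        return inode


with_negative = ToyDcache({"a.txt": 11})
with_negative.lookup("missing")
with_negative.lookup("missing")

without_negative = ToyDcache({"a.txt": 11})
without_negative.lookup("missing", cache_negative=False)
without_negative.lookup("missing", cache_negative=False)

print(with_negative.backing_lookups, without_negative.backing_lookups)
```

With negative entries the second lookup of a missing name is answered from the cache; without them every miss goes back to the filesystem, which for a network filesystem would presumably mean a round trip.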
Edited to add: no, something wrong still exists in the system, although I'm not sure whom
to blame:
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.23-pohmelfs #4
-------------------------------------------------------
bash/4116 is trying to acquire lock:
(&journal->j_list_lock){--..}, at: [] journal_try_to_free_buffers+0xd4/0x187 [jbd]
but task is already holding lock:
(inode_lock){--..}, at: [] drop_pagecache+0x48/0xd8
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (inode_lock){--..}:
[] __lock_acquire+0xa66/0xc48
[] lock_acquire+0x7a/0x94
[] _spin_lock+0x38/0x62
[] __mark_inode_dirty+0xce/0x147
[] __set_page_dirty+0xd0/0xdf
[] mark_buffer_dirty+0x8b/0x92
[] __journal_temp_unlink_buffer+0x174/0x17b [jbd]
[] __journal_unfile_buffer+0xb/0x15 [jbd]
[] __journal_refile_buffer+0x6a/0xe3 [jbd]
[] journal_commit_transaction+0xf46/0x11eb [jbd]
[] kjournald+0xb5/0x1c1 [jbd]
[] kthread+0x3b/0x63
[] kernel_thread_helper+0x7/0x10
[] 0xffffffff
-> #0 (&journal->j_list_lock){--..}:
[] __lock_acquire+0x952/0xc48
[] lock_acquire+0x7a/0x94
[] _spin_lock+0x38/0x62
[] journal_try_to_free_buffers+0xd4/0x187 [jbd]
[] ext3_releasepage+0x68/0x74 [ext3]
[] try_to_release_page+0x33/0x44
[] __invalidate_mapping_pages+0x74/0xe0
[] drop_pagecache+0x70/0xd8
[] drop_caches_sysctl_handler+0x36/0x4e
[] proc_sys_write+0x6b/0x85
[] vfs_write+0x82/0xb8
[] sys_write+0x3d/0x61
[] syscall_call+0x7/0xb
[] 0xffffffff
other info that might help us debug this:
2 locks held by bash/4116:
#0: (&type->s_umount_key#11){----}, at: [] drop_pagecache+0x38/0xd8
#1: (inode_lock){--..}, at: [] drop_pagecache+0x48/0xd8
stack backtrace:
[] show_trace_log_lvl+0x1a/0x2f
[] show_trace+0x12/0x14
[] dump_stack+0x16/0x18
[] print_circular_bug_tail+0x5f/0x68
[] __lock_acquire+0x952/0xc48
[] lock_acquire+0x7a/0x94
[] _spin_lock+0x38/0x62
[] journal_try_to_free_buffers+0xd4/0x187 [jbd]
[] ext3_releasepage+0x68/0x74 [ext3]
[] try_to_release_page+0x33/0x44
[] __invalidate_mapping_pages+0x74/0xe0
[] drop_pagecache+0x70/0xd8
[] drop_caches_sysctl_handler+0x36/0x4e
[] proc_sys_write+0x6b/0x85
[] vfs_write+0x82/0xb8
[] sys_write+0x3d/0x61
[] syscall_call+0x7/0xb
=======================
Although it does not contain any signs of pohmelfs, it still can be related...
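As I read the report above, kjournald took inode_lock while holding j_list_lock, and now bash holds inode_lock and wants j_list_lock. The sketch below shows the idea of such a check with a tiny lock-order graph. It is a simplified illustration with invented names, not how lockdep is actually implemented.

```python
from collections import defaultdict


class LockOrderGraph:
    """Remembers which locks were taken while which other locks were held."""

    def __init__(self):
        # held lock -> set of locks acquired while holding it
        self.after = defaultdict(set)

    def _reachable(self, start, target):
        """Depth-first search: can we get from `start` to `target`?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.after.get(node, ()))
        return False

    def acquire(self, held, new):
        """Record `new` taken while `held` is held.

        Returns True if `held` is already reachable from `new`, i.e. the
        new dependency closes a cycle and a deadlock is possible.
        """
        circular = self._reachable(new, held)
        self.after[held].add(new)
        return circular


graph = LockOrderGraph()
# Chain #1 in the report: inode_lock taken under j_list_lock (kjournald).
first = graph.acquire("j_list_lock", "inode_lock")
# Chain #0 in the report: j_list_lock wanted under inode_lock (bash).
second = graph.acquire("inode_lock", "j_list_lock")
print(first, second)
```

The first dependency is harmless on its own; the second one closes the loop, which is what the "possible circular locking dependency" message complains about.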
/devel/fs :: Link / Comments (0)
BTRFS subvolumes.
Chris Mason created a short specification
for the subvolumes in BTRFS. Subvolumes allow the filesystem to allocate blocks
on several devices and use tricky algorithms to distribute the load
between storages.
Overall this is an excellent idea, but the specification raises some questions and I believe
it is too heavily tied to the ZFS design.
I will drop my thoughts here, which may be completely wrong though.
Here are some features btrfs will support with the subvolume implementation:
mirrored metadata, configurable up to N mirrors (where N > 2); mirrored data extents;
checksum failure resolution by using a mirrored copy; striped data extents, and others.
They are clear targets for the block layer, but there are the following notes on why it is not used:
If Btrfs were to rely on device mapper or MD for mirroring, it would not be able to resolve checksum
failures by checking the mirrored copy. The lower layers don't know the checksum or granularity of
the filesystem blocks, and so they are not able to verify the data they return.
Well, that's not entirely correct, since the checksum has to be checked not against the other mirror, but
against the data itself (i.e. it has to be recalculated after the read), since data can be damaged
during transfer and that is not such a rare condition. Thus checksums from different mirrors can
both be wrong but equal, which without recalculation would signal that everything is ok, while it is not.
Recalculating a block checksum can be faster for smaller blocks than reading it from another disk.
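A minimal sketch of that argument, assuming CRC32 as the checksum and a made-up in-memory layout for a mirror copy (neither is taken from the btrfs specification): comparing the stored checksums of two mirrors says nothing about the data, while recalculating the checksum over what was actually read does.

```python
import zlib


def stored_checksums_agree(copy_a, copy_b):
    """Compare only the checksums stored next to each mirror copy."""
    return copy_a["checksum"] == copy_b["checksum"]


def verify_by_recalculation(copy):
    """Recompute the checksum over the data that was actually read."""
    return zlib.crc32(copy["data"]) == copy["checksum"]


original = b"filesystem block payload"
good_sum = zlib.crc32(original)

# Hypothetical failure: the data arrives with one flipped bit ('a' -> 'A'),
# while the stored checksums are intact and identical on both mirrors.
damaged = b"filesystem block pAyload"
mirror_a = {"data": damaged, "checksum": good_sum}
mirror_b = {"data": damaged, "checksum": good_sum}

print("mirrors agree:", stored_checksums_agree(mirror_a, mirror_b))
print("data verifies:", verify_by_recalculation(mirror_a))
```

The mirrors happily agree with each other, and only the recalculation notices that the block is damaged.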
If Btrfs were to rely on device mapper for aggregating all of the physical devices into a single big
address space, it would not have sufficient information to allocate mirrored copies on different devices.
Keeping this information in sync between Btrfs and the device mapper would be difficult and error prone.
Actually it is very simple. DST
supports such interaction, for example.
Instead I propose and will use the following scheme for subvolumes (I like the name) in a local filesystem:
there is a pool of devices, and there are allocation policies for each one in the following form (just an example):
files matching the '*.jpg' pattern are allocated from device 1, '*.log' from device 2, metadata is stored on device 3,
small files are allocated on device 4, and so on. Then each device has its own policy on mirroring its data to the needed
number of storages.
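The example policy above could be sketched roughly like this. The device numbers and patterns come from the example; the small-file threshold, the fallback device, the order in which rules are checked and the mirror counts are my assumptions, picked only so the code runs.

```python
from fnmatch import fnmatchcase

# Pattern rules from the example above: (glob pattern, device number).
PATTERN_POLICIES = [("*.jpg", 1), ("*.log", 2)]
METADATA_DEVICE = 3
SMALL_FILE_DEVICE = 4

# Assumptions, not part of the scheme described above.
SMALL_FILE_LIMIT = 4096  # bytes; hypothetical threshold for "small"
DEFAULT_DEVICE = 0       # hypothetical fallback device
MIRRORS_PER_DEVICE = {0: 1, 1: 1, 2: 1, 3: 3, 4: 2}  # hypothetical mirror counts


def pick_device(name, size, is_metadata=False):
    """Choose the device an object is allocated from."""
    if is_metadata:
        return METADATA_DEVICE
    for pattern, device in PATTERN_POLICIES:
        if fnmatchcase(name, pattern):
            return device
    if size < SMALL_FILE_LIMIT:
        return SMALL_FILE_DEVICE
    return DEFAULT_DEVICE


def placement(name, size, is_metadata=False):
    """Return (device, number of mirrors) for an object."""
    device = pick_device(name, size, is_metadata)
    return device, MIRRORS_PER_DEVICE[device]


print(placement("photo.jpg", 2_000_000))
print(placement("inode table", 0, is_metadata=True))
```

Whether a pattern should win over the small-file rule is exactly the kind of decision such a policy has to make; here the pattern is simply checked first.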
And, a side note: it looks like Chris Mason uses Mac OS X for development, or at least for writing documentation,
since a screenshot of the high-level design
clearly has Mac's shadows and fonts :)
/devel/dst :: Link / Comments (0)
Vodka, martini, absinthe.
All were damn cool. For the first time I tried vodka with martini,
shaken, but not stirred. Perfect. James Bond knows his stuff.
I mixed about half vodka and half martini, so overall it was a bit softer than
pure fire water, with the excellent, slightly dry and sweet aftertaste of martini.
I recommend it.
Absinthe was very good as usual too. Although it is very strong (around 70%
alcohol) it is very tasty because of thujone (I do not know the real amount of it,
since vendors usually put wrong specs on bottles). It can be drunk without additional
food or drinks. Better with burnt sugar: the resulting caramel adds a very interesting
sweet taste.
/life :: Link / Comments (0)