Zbr's days.

Thu, 31 Jan 2008

POHMELFS release notes.

One can grab release notes, my thoughts (a bit chaotic) and code here (POHMELFS core) and local-only-cache hack here.

Please note that POHMELFS is less than one month old, so do not be too severe with it :)

And I'm going to have some fuel about this release, it was hard, but bloody cool!

/devel/fs :: Link / Comments (0)


First POHMELFS version, codename water:50ml, has been released.

A small benchmark of the local cached mode:

$ time tar -xf /home/zbr/threading.tar

        POHMELFS    NFS v3 (async)
real    0m0.043s    0m1.679s

Which is damn near 40 times faster!
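For the record, the ratio can be checked with a few lines of Python. This is only a sketch of the arithmetic on the two `real` timings above; the helper name `parse_time` is mine and not part of any tool.

```python
import re

def parse_time(text):
    """Convert a `time`-style value such as '0m1.679s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unexpected time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))

pohmelfs_seconds = parse_time("0m0.043s")
nfs_seconds = parse_time("0m1.679s")

# How many times faster the local cached mode was in this single run.
speedup = nfs_seconds / pohmelfs_seconds
print(f"speedup: {speedup:.1f}x")
```

So it is about 39 times, measured once on one tiny archive.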

Excited? Below is a bucket with ice for you and me.

Of course the difference will not be _that_ huge in the real world, where tested archives are larger. This one is a git archive of my userspace threading library, which is very small, and since it is so small there is no writeback cache flushing.
But you get the idea :)

And that version will not be released as is, since it relies on a heavy hack called the local cache, which is never synced with the remote server. Actually, one can think of it as tmpfs or something like that. The code supports sync, but since the inode generation process is very different, files and dirs cannot be blindly synced to the ext3 fs. So I will release POHMELFS as two patches. The first one is a network filesystem implementation with a write-through cache, where an object is first created on the remote side and then populated into the local cache. This one is slow.
The second patch is a hack to disable writeback caching and implement local caching only, which is very fast.
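To show the difference between the two modes, here is a toy model in Python. It is not POHMELFS code: `RemoteStore`, `WriteThroughCache` and `LocalOnlyCache` are names made up for this sketch, and a "round trip" is just a counter standing in for network latency.

```python
class RemoteStore:
    """Hypothetical stand-in for the remote server; counts round trips."""

    def __init__(self):
        self.objects = {}
        self.round_trips = 0

    def create(self, path, data):
        self.round_trips += 1
        self.objects[path] = data


class WriteThroughCache:
    """Object is created on the remote side first, then put into the local cache."""

    def __init__(self, remote):
        self.remote = remote
        self.local = {}

    def create(self, path, data):
        self.remote.create(path, data)  # pay for the network on every create
        self.local[path] = data

    def read(self, path):
        return self.local[path]


class LocalOnlyCache:
    """Object lives only in the local cache and is never synced anywhere."""

    def __init__(self):
        self.local = {}

    def create(self, path, data):
        self.local[path] = data

    def read(self, path):
        return self.local[path]


remote = RemoteStore()
write_through = WriteThroughCache(remote)
local_only = LocalOnlyCache()

for name in ("a.c", "b.c", "Makefile"):
    write_through.create(name, b"data")
    local_only.create(name, b"data")

print("write-through round trips:", remote.round_trips)
print("objects on the server:", sorted(remote.objects))
```

The write-through variant pays one round trip per created object but the server has everything. The local-only variant pays nothing and the server knows nothing, which is exactly why it behaves like tmpfs.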

After that I will start thinking about how to generically solve the problem of syncing local changes with the remote server. This, among other things, will allow offline work with automatic syncing after reconnect.

This is not intended for inclusion. CRFS is a bit ahead of POHMELFS, but it is not generic enough (because of the above problem) and works only with BTRFS.

And, btw, I changed the naming convention: since a set of volumes from 50ml to 1 liter is not enough for serious development, I will prepend a liquid name for each row. So it will be water:{50ml, 100ml ... 1 liter}, tea:{50ml ... 1 liter} ... spirit:{50ml ... 1 liter}. The number of different "waters" I know should be enough for this project :)

Stay tuned!

/devel/fs :: Link / Comments (0)


Nasty dentry abuse or...

... searching for rakes by stepping on them in a dark room. That is how I can describe the process of hunting for obscure bugs in filesystem code.

Preface 1.
The system locks up hard without a single message in dmesg, although all kernel hacking options are enabled in the config. The system responds to ping, but there is no way to log in or to do something as a local user.

Preface 2.
I recall, things were cool.

Bisecting is not my friend today, since a fair number of fixes were added, and while I can find a state where the new bug does not exist, the old ones can kill the system. So I decided to manually check every patch I added to git over the last days. Since I do not know VFS well enough, there are several things I just copied from other filesystems (most of them do it that way), so I started to drop some bits of that code from pohmelfs.
Eventually I found that in most filesystems a lookup which fails to find the requested dentry adds a NULL inode to the dentry, either via d_add() or via d_splice_alias(). Both look harmless, except that a dentry with a NULL inode then exists in the dentry cache. Maybe that is fine and there is some other bug in pohmelfs, but after I added it I started to get those obscure freezes (quite easily reproducible, with almost 100% probability in some tests), and sometimes a general protection fault happened in VFS code during umount.

So I just removed the code which adds a NULL inode to the dentry via d_add(), and things are good again. I do not know how frequently this can happen in a local filesystem, but a fact is a fact: after removing this code pohmelfs behaves excellently (modulo its speed).
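For those who have not met them: a dentry with a NULL inode is a so-called negative dentry, a cached "this name does not exist" answer. Below is a toy Python model of the idea, not kernel code. `ToyDcache` and its fields are invented for this sketch, and it says nothing about where the freeze actually comes from.

```python
class ToyDcache:
    """Toy name cache: maps a name to an inode number, or to None (negative entry)."""

    def __init__(self, backing):
        self.backing = backing  # hypothetical on-disk directory: name -> inode number
        self.entries = {}       # cached results, None marks a negative entry
        self.fs_lookups = 0     # how many times the filesystem itself was asked

    def lookup(self, name, cache_negative=True):
        if name in self.entries:
            return self.entries[name]
        self.fs_lookups += 1
        inode = self.backing.get(name)
        # A miss is cached only when negative entries are allowed.
        if inode is not None or cache_negative:
            self.entries[name] = inode
        return inode


with_negative = ToyDcache({"README": 12})
with_negative.lookup("missing")
with_negative.lookup("missing")

without_negative = ToyDcache({"README": 12})
without_negative.lookup("missing", cache_negative=False)
without_negative.lookup("missing", cache_negative=False)

print("lookups with negative entries:", with_negative.fs_lookups)
print("lookups without negative entries:", without_negative.fs_lookups)
```

In this model, dropping the negative entries costs an extra trip to the filesystem for every repeated miss, which for a network filesystem means the network.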

Edited to add: no, something wrong still exists in the system, although I'm not sure whom to blame:

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.23-pohmelfs #4
-------------------------------------------------------
bash/4116 is trying to acquire lock:
 (&journal->j_list_lock){--..}, at: [] journal_try_to_free_buffers+0xd4/0x187 [jbd]

but task is already holding lock:
 (inode_lock){--..}, at: [] drop_pagecache+0x48/0xd8

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (inode_lock){--..}:
       [] __lock_acquire+0xa66/0xc48
       [] lock_acquire+0x7a/0x94
       [] _spin_lock+0x38/0x62
       [] __mark_inode_dirty+0xce/0x147
       [] __set_page_dirty+0xd0/0xdf
       [] mark_buffer_dirty+0x8b/0x92
       [] __journal_temp_unlink_buffer+0x174/0x17b [jbd]
       [] __journal_unfile_buffer+0xb/0x15 [jbd]
       [] __journal_refile_buffer+0x6a/0xe3 [jbd]
       [] journal_commit_transaction+0xf46/0x11eb [jbd]
       [] kjournald+0xb5/0x1c1 [jbd]
       [] kthread+0x3b/0x63
       [] kernel_thread_helper+0x7/0x10
       [] 0xffffffff

-> #0 (&journal->j_list_lock){--..}:
       [] __lock_acquire+0x952/0xc48
       [] lock_acquire+0x7a/0x94
       [] _spin_lock+0x38/0x62
       [] journal_try_to_free_buffers+0xd4/0x187 [jbd]
       [] ext3_releasepage+0x68/0x74 [ext3]
       [] try_to_release_page+0x33/0x44
       [] __invalidate_mapping_pages+0x74/0xe0
       [] drop_pagecache+0x70/0xd8
       [] drop_caches_sysctl_handler+0x36/0x4e
       [] proc_sys_write+0x6b/0x85
       [] vfs_write+0x82/0xb8
       [] sys_write+0x3d/0x61
       [] syscall_call+0x7/0xb
       [] 0xffffffff

other info that might help us debug this:

2 locks held by bash/4116:
 #0:  (&type->s_umount_key#11){----}, at: [] drop_pagecache+0x38/0xd8
 #1:  (inode_lock){--..}, at: [] drop_pagecache+0x48/0xd8

stack backtrace:
 [] show_trace_log_lvl+0x1a/0x2f
 [] show_trace+0x12/0x14
 [] dump_stack+0x16/0x18
 [] print_circular_bug_tail+0x5f/0x68
 [] __lock_acquire+0x952/0xc48
 [] lock_acquire+0x7a/0x94
 [] _spin_lock+0x38/0x62
 [] journal_try_to_free_buffers+0xd4/0x187 [jbd]
 [] ext3_releasepage+0x68/0x74 [ext3]
 [] try_to_release_page+0x33/0x44
 [] __invalidate_mapping_pages+0x74/0xe0
 [] drop_pagecache+0x70/0xd8
 [] drop_caches_sysctl_handler+0x36/0x4e
 [] proc_sys_write+0x6b/0x85
 [] vfs_write+0x82/0xb8
 [] sys_write+0x3d/0x61
 [] syscall_call+0x7/0xb
 =======================
Although it does not contain any signs of pohmelfs, it still can be related...
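As I read the report, kjournald takes inode_lock while holding j_list_lock, and the drop_pagecache path does it the other way around. The check itself boils down to finding a cycle in a "taken while holding" graph. Here is a minimal Python sketch of that idea; `LockOrderChecker` is a made-up name and the real lockdep is far more involved.

```python
from collections import defaultdict


class LockOrderChecker:
    """Records 'lock B was taken while holding lock A' edges and spots cycles."""

    def __init__(self):
        self.taken_after = defaultdict(set)  # held lock -> locks acquired under it

    def _reachable(self, src, dst):
        # Depth-first search over the recorded ordering edges.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.taken_after[node])
        return False

    def acquire(self, held, new):
        """Record that `new` is taken under `held`; True means a circular dependency."""
        circular = self._reachable(new, held)
        self.taken_after[held].add(new)
        return circular


checker = LockOrderChecker()

# Chain #1 from the report: the journal commit path ends up in __mark_inode_dirty.
first = checker.acquire(held="j_list_lock", new="inode_lock")

# Chain #0: drop_pagecache holds inode_lock and reaches journal_try_to_free_buffers.
second = checker.acquire(held="inode_lock", new="j_list_lock")

print("kjournald ordering circular:", first)
print("drop_pagecache ordering circular:", second)
```

The second acquisition closes the loop, which is what "possible circular locking dependency" means.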

/devel/fs :: Link / Comments (0)


BTRFS subvolumes.

Chris Mason created a short specification for subvolumes in BTRFS. Subvolumes allow the filesystem to allocate blocks on several devices and use tricky algorithms to distribute the load between storages.
Overall this is an excellent idea, but the specification raises some questions, and I believe it is too heavily tied to the ZFS design.

I will drop my thoughts here, which may be completely wrong though.

Here are some features btrfs will support with subvolume implementation: Mirrored metadata, configurable up to N mirrors (where N > 2); Mirrored data extents; Checksum failure resolution by using a mirrored copy; Striped data extents and others.

These are clear targets for the block layer, but the specification gives the following reasons why they are not implemented there:

If Btrfs were to rely on device mapper or MD for mirroring, it would not be able to resolve checksum failures by checking the mirrored copy. The lower layers don't know the checksum or granularity of the filesystem blocks, and so they are not able to verify the data they return.
Well, that's not entirely correct: a checksum has to be checked not against the other mirror but against the data itself (i.e. it has to be recalculated after the read), since data can be damaged during transfer, and that is not such a rare condition. Checksums from different mirrors can both be wrong but equal, so comparing them without recalculating would signal that everything is ok when it is not.
Recalculating a block checksum can also be faster for small blocks than reading it from another disk.
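What I mean by recalculating can be sketched in a few lines of Python. This is an illustration only: mirrors are plain dicts, CRC32 from `zlib` stands in for whatever checksum the filesystem uses, and `read_verified` is a name I made up.

```python
import zlib


def read_verified(mirrors, block_no, expected_crc):
    """Return (mirror index, data) for the first copy whose recomputed CRC matches."""
    for index, mirror in enumerate(mirrors):
        data = mirror[block_no]
        # Recompute the checksum over the bytes actually read.
        if zlib.crc32(data) == expected_crc:
            return index, data
    raise IOError(f"block {block_no}: no mirror passed the checksum")


good = b"hello block"
stored_crc = zlib.crc32(good)  # checksum kept in metadata at write time

mirrors = [
    {0: b"hellX block"},  # copy damaged on disk or in transfer
    {0: good},            # intact copy
]

index, data = read_verified(mirrors, 0, stored_crc)
print("served from mirror", index)
```

Whoever holds the stored checksum and the data can do this check, and the second disk is touched only when the first copy fails.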

If Btrfs were to rely on device mapper for aggregating all of the physical devices into a single big address space, it would not have sufficient information to allocate mirrored copies on different devices. Keeping this information in sync between Btrfs and the device mapper would be difficult and error prone.
Actually it is very simple. DST, for example, supports such interaction.

Instead I propose and will use the following scheme for subvolumes (I like the name) in a local filesystem: there is a pool of devices, and there are allocation policies for each one in the following form (just an example): files matching the '*.jpg' pattern are allocated from device 1, '*.log' from device 2, metadata is stored on device 3, small files are allocated on device 4, and so on. Then each device has its own policy on mirroring its data to the needed number of storages.
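A rough Python sketch of such a policy table follows. Everything in it is hypothetical: the device names, the 4096-byte limit for "small" files, the order in which rules are checked and the mirror counts are only there to make the example concrete.

```python
from fnmatch import fnmatch

SMALL_FILE_LIMIT = 4096  # assumed threshold for "small files", in bytes

# Assumed per-device mirroring policy: device -> number of copies to keep.
MIRROR_COPIES = {
    "device1": 1,
    "device2": 1,
    "device3": 3,
    "device4": 2,
    "default": 1,
}


def pick_device(name, size, is_metadata=False):
    """Choose a device for a new object according to the example policy."""
    if is_metadata:
        return "device3"
    if fnmatch(name, "*.jpg"):
        return "device1"
    if fnmatch(name, "*.log"):
        return "device2"
    if size < SMALL_FILE_LIMIT:
        return "device4"
    return "default"


for name, size, meta in [
    ("photo.jpg", 2_000_000, False),
    ("syslog.log", 100, False),
    ("inode-table", 512, True),
    ("notes.txt", 300, False),
    ("movie.avi", 700_000_000, False),
]:
    device = pick_device(name, size, meta)
    print(f"{name}: {device}, copies: {MIRROR_COPIES[device]}")
```

The allocation decision and the mirroring decision stay separate here: the first picks a device, and the second is a property of that device.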

And, as a side note, it looks like Chris Mason uses Mac OS X for development, or at least for writing documentation, since the screenshot of the high-level design clearly has Mac shadows and fonts :)

/devel/dst :: Link / Comments (0)


Vodka, martini, absinthe.

All were damn cool. For the first time I tried vodka with martini, shaken, not stirred. Perfect. James Bond knows his stuff. I mixed about half vodka and half martini, so overall it was a bit softer than pure fire water, with the excellent, slightly dry and sweet aftertaste of martini. I recommend it.

Absinthe was very good as usual too. Although it is very strong (around 70% alcohol), it is very tasty because of thujone (I do not know the real amount of it, since vendors usually put wrong specs on bottles). It can be drunk without additional food or drinks. It is better with burnt sugar: the resulting caramel adds a very interesting sweet taste.

/life :: Link / Comments (0)