Sun, 30 Mar 2008
To SSD or not to SSD.
A couple of days ago I talked with a person who ordered 4 high-end 128 GB SSD disks
to create a RAID for testing purposes; seek time for those devices is 0.1 ms.
Each one costs about $4k. His main workload is databases, i.e. random reads and writes,
so we calculated that theoretically it has to be about 14 times faster than
high-end SCSI disks with 3.5 ms seek latency and about 100 MB/s sequential access speed
in the given
workload of processing random data in 8-16 KB chunks (the usual 'page' in SQL servers).
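The estimate can be reproduced with a very simple model: one random request costs one seek plus the transfer of one chunk. The sketch below is only an illustration under my own assumptions: the SSD is given the same 100 MB/s sequential speed as the SCSI disk (the post does not state the SSD transfer rate), and the function names are made up. With these numbers 16 KB chunks give a ratio of roughly 14, and smaller chunks give a larger one.

```python
def request_time_ms(seek_ms, chunk_kb, seq_mb_per_s):
    """Time to serve one random request: one seek plus the transfer of one chunk."""
    transfer_ms = chunk_kb / 1024.0 / seq_mb_per_s * 1000.0
    return seek_ms + transfer_ms


def ssd_speedup(chunk_kb, scsi_seek_ms=3.5, ssd_seek_ms=0.1, seq_mb_per_s=100.0):
    """Ratio of SCSI to SSD request time.

    Assumption: both devices transfer data at the same sequential speed,
    so only the seek time differs.
    """
    scsi = request_time_ms(scsi_seek_ms, chunk_kb, seq_mb_per_s)
    ssd = request_time_ms(ssd_seek_ms, chunk_kb, seq_mb_per_s)
    return scsi / ssd


if __name__ == "__main__":
    for chunk_kb in (8, 16):
        print(f"{chunk_kb} KB chunks: about {ssd_speedup(chunk_kb):.1f}x")
```

This model ignores the filesystem, queueing and caches completely, which is exactly why the real number can end up much lower.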
Besides the fact that putting 14 disks into a mirror will
be as fast as a single SSD disk (theoretically), will be 14 times more reliable
and will likely cost less,
the main goal is to replace RAM with SSD, not disks with SSD.
My prognosis is that the SSD will be at most 2-3 times faster (if it is faster
at all, since its theoretical performance advantages can be killed by the FS)
than a SCSI disk for
the given workload, and as is, it is not a breakthrough technology.
If I'm wrong (it will be tested likely next week with
sysbench read-write benchmark),
I will buy a good bottle of whiskey for us, otherwise...
/devel/fs :: Link / Comments (5)
Continue DST roadmap.
So, I have to admit that I rethought my
opinion
about mirroring/redundancy at filesystem layer - it is useful for lots of cases,
and modulo bugs in
DST
mirroring (mostly a leak, which I can not find in my lab,
and network/block layer race,
which has existed in sendfile() for years and just strikes DST a lot,
although there is a workaround) I decided to rewrite the mirroring algorithm in such a way
that it could be used in other projects.
There is also an idea of how to fix the above-mentioned network/block layer race in a
very non-disturbing manner, which was privately called soft
DST barriers.
The idea is to replace the skb destructor with a private one, which will confirm that
the pages are no longer used (for example, call bio_endio() or
release a splice buffer). This callback will be installed only for special sockets
which provide it (like DST, sendfile() or any other
->sendpage() users like Samba). The idea was
not killed at its roots,
which is a good sign for a start.
/devel/dst :: Link / Comments (8)
Thu, 27 Mar 2008
Filesystem as a database or database in filesystem.
I actually do not understand what prevents filesystem writers from implementing
a trivial interface and access library for metadata manipulation,
which would allow not only path lookup,
but also lookup by various keys, for example ones stored in extended attributes.
Yes, it requires filesystem changes, but I can not believe it is impossible
or even too complex.
Need to think...
/devel/fs :: Link / Comments (2)
Distributed storage roadmap.
DST
project has looked quiet for a while, but actually it is not.
There is a bug in the mirror algorithm, which I am considering rewriting. Not because of
this bug, but because it will be used in a special setup where it has to be extended.
Consider a highly available *SQL cluster with multiple storage nodes combined into
a mirror and several main systems which run the database software. Unfortunately
only a single main system serves queries; another one has to be turned on when the first one fails.
The task is to create a system which will automatically switch between main nodes and
recover if either main nodes or storage nodes become unavailable, so that the whole
system does not stop if something goes wrong with the machines. It has to scale
to tens of nodes as a must, and later to hundreds, without problems.
This is not a performance scalability solution - so far only a single node should be able to
collect multiple data nodes into a storage, and if that node fails it has to be switched,
but so far I do not know of any working and free solution for the problem. The solution created
for main node switching can also be used whenever any server (for example a metadata server
in a cluster) fails and has to be switched.
It will also force me to finally implement barriers in DST.
As a possible helper for availability messages
I am considering the abandoned CARP-like
protocol (in userspace).
/devel/dst :: Link / Comments (0)
Wed, 26 Mar 2008
Climbing evening.
That was a hard training: since I climb only once per week (this year so far),
I can not get into my best shape, mostly in the resistance part, so my fingers
got rubbed quite quickly...
Nevertheless I finished my
jumpings
and reached the needed hold. The jump actually was not that high - about 30-40 centimeters,
but it has to be done from holds which are about 2 meters below the final hold,
so it was not that simple, especially when the holds for the arms are only 10 centimeters
higher than those for the legs, and the main body's chakra (it is a nice name for the ass
in my own public dictionary) really does not like to fly and wants to land.
Then I did a number of various starts, some with jumps, others were just the usual
complex traces without additional requirements from me...
Evening sauna, shower and great pleasure of the day.
Excellent time!
/life :: Link / Comments (0)
Added maildir benchmark results.
The simulation works on each filesystem in the following stages:
- The empty filesystem is created and mounted.
- The directory structure is created, with no files.
- A single delivery simulator and retrieval simulator are run
simultaneously. The script waits for each of the simulators to finish,
and then runs the sync command before proceeding to the next
step.
- The above step is repeated with 2, 4, 8, and then 16 delivery simulators.
Delivery Simulator.
The delivery simulator does actual maildir deliveries to the given directory:
- It writes a file with a unique file name to the tmp subdirectory.
- It fsyncs the newly written file.
- It renames the file into the new subdirectory.
- It fsyncs the new subdirectory (to ensure that
directory is actually on disk, as most Linux filesystems don't
automatically perform this action during the rename).
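The four delivery steps above can be sketched as follows. This is not the benchmark's Perl script, just a minimal Python illustration; the `deliver` function and the file name format are my assumptions, and fsync on a directory descriptor is assumed to work the way it does on Linux.

```python
import os
import socket
import time


def deliver(maildir, message, seq=0):
    """Deliver one message (bytes) into `maildir` following the four steps above."""
    # Unique-enough name for a sketch: time, pid, counter, host (format is an assumption).
    name = f"{int(time.time())}.{os.getpid()}_{seq}.{socket.gethostname()}"
    tmp_path = os.path.join(maildir, "tmp", name)
    new_path = os.path.join(maildir, "new", name)

    # 1. Write the file under tmp/; O_EXCL turns a name clash into an error.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(message)
        f.flush()
        # 2. fsync the newly written file.
        os.fsync(f.fileno())

    # 3. Rename the file into new/.
    os.rename(tmp_path, new_path)

    # 4. fsync the new/ directory so that the rename itself reaches the disk.
    dir_fd = os.open(os.path.join(maildir, "new"), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return new_path
```

The maildir is expected to already contain the tmp and new subdirectories, which matches the stage where the benchmark creates the directory structure with no files.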
More details are on the original page.
Briefly speaking, it is a multithreaded maildir simulation.
And the results
are quite different compared to, for example, postmark: very good results from xfs, jfs and reiserfs.
There are no ext2 and btrfs results, since Perl's fsync says that
the file descriptor opened there is invalid:
Invalid argument at /root/fs_bench/maildir_fsbench/fsbench/fake-deliver line 38.
An interested reader can check the sources and show me the problem, but ext2 worked pretty fine with
the 2.6.20 kernel and whatever glibc/perl/etc. was in Debian at that time.
Anyway, results can be found at the contest
homepage.
Now all testing is over.
Main conclusion: things got worse compared to 2.6.20 and there was no major breakthrough in filesystem development, at least
from the performance point of view.
/devel/fs :: Link / Comments (0)
Additional XFS test with slightly different mount/mkfs options.
mkfs: -d agcount=75 -l size=64m
mount: logbufs=8,nobarrier,noatime,nodiratime,osyncisdsync
Postmark results:


Results are slightly better than in the
previous
xfs run, although barriers are turned off, which I suspect to be the main reason. Also, the other
filesystems did not turn off directory atime.
Anyway, even with these results XFS is still much worse than any other FS (except reiserfs)
for this workload.
/devel/fs :: Link / Comments (0)
Tue, 25 Mar 2008
Filesystem contest results.
An interested reader can check out the results
of the ext2/3/4, reiserfs, reiser4, jfs, xfs and btrfs fight for the first prizes
in dbench,
iozone,
postmark,
maildir performance bench
and simple file creation micro-benchmark.
It does not contain the maildir benchmark yet (I will add it tomorrow or later today),
xfs has not completed yet and there are no graphs.
As a conclusion: nothing major has changed since the
previous contest,
the new btrfs filesystem behaves not that bad in some cases,
but is quite slow in others... Nothing changed.
Does it mean that we need something new?
/devel/fs :: Link / Comments (3)
POHMELFS status.
I've started mostly from scratch. I think it is a good sign
when a project can be rewritten without any pain to implement
really interesting ideas instead of having multiple crutches all over the
place. This also means that it is not that complex, so I do not regret
the dropped code.
Now it is in a very early testing stage without any network protocol at all,
but I am testing a new paradigm in pohmelfs: its inodes will not be hashed
into a global hash table, but instead will be placed into a local
trie-like structure, which (optionally) will allow RCU-fied lookup.
Something similar to the data structure created for the
multidimensional trie
used in the unified socket lookup patch.
I like the
two-hash
approach very much, but since there is no proof (yet) that it will work for all possible cases,
I will first implement a radix-like tree to store object names. The network
protocol will also operate on full-length paths, which actually can be
a bad idea; I will see.
Another uber cool feature of the full-path approach
is the ability to create a number of directories, which form the path to a given
object, in a single command, i.e. when a client sends a network command
to create the object /a/b/c/d/file, there is no need to send
separate commands to create /a, /a/b and so on,
it can be done automatically by the server. This requires sending not only
the path though, but also information about permissions for each subdir.
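A minimal sketch of how a server could handle such a full-path create command. Everything here is an assumption made for illustration: the server is modelled as an in-memory tree of dicts, and the command carries one permission value per directory component, as the paragraph above suggests; the real protocol is not defined yet.

```python
def create_with_parents(tree, path, dir_modes, file_mode):
    """Create `path` in the in-memory `tree`, making missing parent dirs on the way.

    `dir_modes` holds one permission value per directory component; this is
    the extra information the command has to carry along with the path.
    """
    parts = [p for p in path.split("/") if p]
    if not parts:
        raise ValueError("empty path")
    *dirs, filename = parts
    if len(dir_modes) != len(dirs):
        raise ValueError("need exactly one mode per directory component")

    node = tree
    for name, mode in zip(dirs, dir_modes):
        children = node["children"]
        if name not in children:
            # Missing directory: create it with the mode the client sent.
            children[name] = {"mode": mode, "children": {}}
        node = children[name]
        if "children" not in node:
            # A file is in the way of the path.
            raise NotADirectoryError(name)

    node["children"][filename] = {"mode": file_mode, "data": b""}
    return node["children"][filename]


if __name__ == "__main__":
    root = {"mode": 0o755, "children": {}}
    create_with_parents(root, "/a/b/c/d/file", [0o755, 0o755, 0o750, 0o700], 0o644)
    print(sorted(root["children"]))
```

Directories which already exist keep their permissions, only the missing ones are created with the modes from the command.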
/devel/fs :: Link / Comments (2)
Second filesystem contest is over.
Although I plan to run a couple of additional tests for
btrfs,
namely all tests with the nodatacow option and without the ssd option,
which will likely take part of the day. But all the others have
already completed, so expect nice graphs tomorrow.
There were a number of surprises during this testing. For example,
reiser4 constantly freezes the test box in the dbench workload
with 150-200 threads. There are no messages in dmesg, but nothing
is turned on in the kernel hacking section of the config. Both
btrfs and reiser4 are very slow at creating and writing into
lots of small (4k) files. Reiser4 is two times faster than btrfs;
the latter creates/writes/syncs/closes about 10 files per second
on average when 10k-30k files are created one-by-one.
Ext4 is also slower than any other filesystem (except the above two)
in this microbenchmark.
Something strange happened between the 2.6.20 and 2.6.24 kernels: the above file
creation microbenchmark produced much worse results for all
filesystems (by an order of magnitude in some cases) compared to the previous
contest.
Maybe the sync code is now implemented correctly, I do not know...
I will likely drop the maildir
benchmark results, since the perl script which works there constantly tells me
that fsync() has an invalid parameter...
So, wait about 12 hours (I have to get some sleep: do not mix
absinthe with different red wines and beer; when I did that last
night, it was quite tasty, but not this morning)
/devel/fs :: Link / Comments (0)
Mon, 24 Mar 2008
BTRFS got subvolumes support.
Subvolumes
are block devices on top of which btrfs
can be created. This is the first known filesystem in Linux which can be built on
top of multiple block devices. Chris Mason renamed his unstable branch to
really-really unstable because of that. It is possible to put devices into
mirror or striping mode, although that is far from clear from the short
mail description.
Although support for mirroring and striping in a filesystem is a questionable feature,
the ability to create a filesystem on top of multiple block devices with per-device
allocation policies is a huge step in Linux filesystem development.
/devel/fs :: Link / Comments (2)
Thu, 20 Mar 2008
Second filesystem contest has been started.
So far I have removed the maildir
test and the file creation benchmark: the former requires a manual start in my
scripts, the latter requires some filesystems to be removed from the run,
namely Reiser4 and BTRFS, since both are very slow at creating and writing into lots of small
(4k) files. XFS is probably also a candidate, although with the optimizations described below
it behaves much better than with default options and the 2.6.21 tree.
So, we have dbench,
iozone and
postmark queued...
Testing is being performed with 2.6.24.3 tree, Reiser4 was ported from the latest
breakout of -mm tree (requires lots of manual patching to be started on recent kernels).
BTRFS was taken from the unstable
branch, since it is the same as 0.13 AFAICS. All other filesystems were taken from the
vanilla tree.
The following optimisations are used for the filesystems:
- XFS: mkfs: -d agcount=1 -l size=128m,version=2, mount: noatime,logbsize=256k, as suggested by Dave Chinner
- EXT4: mkfs: none, mount: data=writeback,noatime,extents
- EXT3: mkfs: none, mount: data=writeback,noatime
- EXT2: mkfs: none, mount: noatime
- JFS: mkfs: none, mount: noatime
- REISER4: mkfs: none, mount: noatime
- REISER3 aka REISERFS: mkfs: --format 3.6, mount: noatime
- BTRFS: mkfs: -l 4k -n 4k, mount: noatime,nodatasum, for postmark also added ssd option, as suggested by Chris Mason
First results are expected to be ready tomorrow evening or even (past the) weekend... Although all runs
are being performed automatically, generating nice graphs
requires a manual start. Then I will proceed with the
maildir
test and the file creation benchmark.
/devel/fs :: Link / Comments (0)
Wed, 19 Mar 2008
Climbing evening: jumping-jumping-jumping.
That was a really cool training today: the most exciting
part was lots of jumps. It was not a new trace, but a special
hold on the balcony, which could be reached from lower positions
with a jump. I spent more than an hour jumping from different holds
to the finish one, although I did not succeed in the main jumping direction.
Instead I damaged a shoulder, rubbed my fingers and toes, got tired as
hell and got a zillion units of pleasure. Also added a couple of simple
and quite complex old traces to the mix.
That was excellent time!
/life :: Link / Comments (0)
I have a very bad karma: hardware specification of the testing machines.
3 Intel E7520 systems, each one has two 3 GHz Xeon CPUs with HT enabled and EDAC bits,
4 GB of RAM, Adaptec AIC7902 Ultra320 SCSI adapter. Disks:
FUJITSU MAU3036NC 15k rpm 32 GB system disk (will also be used in testing), two of them
will be installed in a mirror later,
SEAGATE ST3300007LC 10k rpm 300 GB testing disk.
The former has about 90 MB/s linear read speed, the latter - 75 MB/s.
About 5 minutes to fully compile and link a loadable kernel.
Pretty neat machines, and I have managed to lose three system disks already; doesn't
that say something about my bad karma? Without any load, without kernel changes, without anything...
Is it because they are called devfs[123] and thus are struck by problems
like that old virtual filesystem, which eventually died a tortured death?
Waiting again... Since one machine is still alive, I will start the filesystem contest
tomorrow; development will be postponed a bit.
/devel/other :: Link / Comments (1)
Temptation is over...
One man, 12 nights (13 days), one bottle of Cuban rum and
little bits of Scotch whisky, 82 'House M.D.' episodes... feels good.
Meanwhile got three 2-way Xeon servers with 2-4 (I forgot) gigs of RAM and
gigabit link between them. Not bad for start.
Also gathered lots of power and inspiration, so, here is a plan:
- second filesystem contest,
now I will test btrfs 0.13 and btrfs-unstable in addition to previous
ext[234], reiser[34], jfs and xfs running for the first prizes in
dbench,
iozone,
postmark,
maildir performance bench
and simple file creation micro-benchmark. Results will show whether there is a need for
yet another local filesystem. Making bets?
- fix two problems in distributed storage:
there is a leak in mirror resync and an inability to start a storage if the config contains
wrong network addresses.
- rewrite
core pohmelfs algorithms to make it not good, but really good. This change will
make it first against
CRFS and
CacheFS :)
POHMELFS is not where I want it to be right now.
- fix HIFN driver
bug.
Lots of stuff scheduled to be started tomorrow (actually today: it is about 3:30 AM here),
but unlikely to happen - there is a plan to go climbing.
But nevertheless, stay tuned, lots of interesting stuff is coming!
/other :: Link / Comments (2)
Sun, 16 Mar 2008
Struggle against a temptation.
I can not win in a fight with some issues (or better to
call them by their real name: temptations), so... there is
an old method to solve this problem:
if you can not win against some temptation, just fall for it
That is what I have been doing for the last couple of days: I have already
watched 2.5 seasons of the "House M.D." series and expect to complete the last
ones soon...
That is why I do not write about the real hacking problems I work on,
but... stay tuned, I'm just accumulating really strong power.
/other :: Link / Comments (2)
Sat, 15 Mar 2008
Linux sucks? I believe I already told that.
Sorry, but:
# mount /dev/dvd /mnt
...^C
# dmesg | tail
[ 853.189807] sr 1:0:0:0: [sr0] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
[ 853.189822] sr 1:0:0:0: [sr0] Sense Key : Medium Error [current]
[ 853.189832] sr 1:0:0:0: [sr0] Add. Sense: No seek complete
[ 853.189843] end_request: I/O error, dev sr0, sector 9180408
[ 853.189852] Buffer I/O error on device sr0, logical block 1147551
...
# dd if=/dev/dvd of=/tmp/data bs=1M
# mount -o loop /tmp/data /mnt
# ls /mnt
Doctor House
So I can not mount the dvd via mount, but can do the same after a sequential
read of the dvd into a file. This seeking error looks like a problem with the
hardware _or_ the linux driver. I know hardware sucks, but if sequential read works
I can not understand why any other does not...
Sigh, it is the 21st century out there, iirc...
The main problem is that everything else sucks even more. And everyone is guilty,
for example there is a bug in the hifn 795x hardware crypto accelerator
driver I wrote, which is in the mainline, or two in the
DST
project, or anywhere else. I wish the world were perfect...
/devel/other :: Link / Comments (3)
Fri, 14 Mar 2008
Why binary trees are bad. New cache structure for pohmelfs.
I already found experimentally that a write-through cache scales very badly,
even noticeably worse than no cache at all for some workloads, so an ideal
solution
does not involve any kind of write-through operations, notably no synchronous commands
which require an immediate response.
This means that inode numbers will differ on client and server, so there should be
some kind of tracked dependency between them so that operations on different machines
can be done in sync. The initial thought was to use a binary tree to store pointers to the appropriate
inodes, which would be indexed on server and clients by a combination of hashes of the inode (direntry)
and its parent data. Even embedded systems can easily have millions of inodes, so the choice
seemed correct at first sight. Now I think differently, since there
is a serious problem with indexing such a tree.
Since the only information common to both client and server is the object name, it should be used
as the key, maybe not the name directly but its hash; that does not matter at this point. Here comes the
problem with the binary tree choice: in a binary tree there is no connection between the real parent in the
filesystem and the parent in the binary tree, so there will be serious problems when we put two
different objects with the same name into the binary tree - there will be a conflict. To solve
this problem we should use some information about where the object is placed, i.e. information
about its parent directory. Using the parent name hash as a part of the key in the binary tree
does not solve the problem either, since there might exist multiple directories with the same name and
the same object in them. We could solve the problem by putting into the key the hash of the object's name
and the hash of the parent's key (which in turn is the hash of its name and the hash of its parent's key), so this
recursive hashing would end up at the highest level (i.e. the root directory). This works, but there
might be scalability problems because of the following issues:
- the server has to either cache opened directories or reopen them one-by-one when accessing an object
- when an object is moved/renamed, all keys of its children and parent have to be changed
which is unacceptable. So a new solution was thought of.
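The recursive key described above can be sketched in a few lines. This is only an illustration: SHA-1 as the hash, the root key and the function names are my assumptions, not something pohmelfs uses.

```python
import hashlib


def name_hash(name):
    """Hash of a single name (SHA-1 is an arbitrary choice for this sketch)."""
    return hashlib.sha1(name.encode()).digest()


# Key of the root directory, the point where the recursion ends.
ROOT_KEY = name_hash("/")


def object_key(path):
    """key(object) = H(H(name) + key(parent)), computed from the root downwards."""
    key = ROOT_KEY
    for name in (p for p in path.split("/") if p):
        key = hashlib.sha1(name_hash(name) + key).digest()
    return key


if __name__ == "__main__":
    # Same object name, same parent name, different grandparents: keys differ.
    print(object_key("/a/x/file") != object_key("/b/x/file"))
    # Renaming /a to /r changes the key of every object below it.
    print(object_key("/a/x/file") != object_key("/r/x/file"))
```

The second check shows the rename problem from the list: after a rename of a directory the keys of everything below it have to be recomputed.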
So far I have two ideas:
- kind of radix tree
- multi-layer hash tables indexed by double name hash
While the former is kind of obvious, the latter is a quite interesting but very simple idea. Consider
that each directory has a hash table of its children, indexed by a double hash of the child's name.
We need a double hash to remove the possibility of collision (I can not prove mathematically (maybe only yet)
that there are two hashes which will not allow simultaneous collision in both, but I feel quite strongly
that such hash pairs exist) and to use them in network commands. Commands can be optimised either
to use the full path if it is short enough (just send a path string during writeback or readpage as a
path to where the data belongs) or to use an array of hashes of the path elements instead of '/' separated
names. The hash tables actually have to be changed to a different data structure capable of hosting not only
small hash values, but full 32 or 64 bit hashes. It can be a binary tree or a judy array, something similar
to what was used in
unified socket storage. The former looks a bit excessive.
Using such an approach it is possible to look up an object in O(k) operations, where 'k' is the number of directories
in the path; it is usually smaller than 10, which for a binary tree corresponds to at most 1024 inodes,
which is too small for a real system.
This approach (especially when the full path is being sent) allows eliminating the scalability problems mentioned above.
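A rough sketch of the multi-layer idea: every directory owns its own table of children keyed by a pair of hashes of the child's name, and a lookup walks one table per path element. The choice of CRC32 plus a 4-byte BLAKE2b digest as the hash pair, the Python dict as the per-directory table and all names are assumptions for illustration only, and the sketch simply trusts that the two hashes never collide simultaneously.

```python
import hashlib
import zlib


def double_hash(name):
    """Two independent 32-bit hashes of a name (this hash pair is an assumption)."""
    data = name.encode()
    h1 = zlib.crc32(data)
    h2 = int.from_bytes(hashlib.blake2b(data, digest_size=4).digest(), "big")
    return (h1, h2)


def path_to_hashes(path):
    """What could be sent over the network instead of '/' separated names."""
    return [double_hash(p) for p in path.split("/") if p]


class Node:
    def __init__(self, payload=None):
        self.payload = payload
        self.children = {}  # (h1, h2) of the child's name -> Node


def insert(root, path, payload):
    node = root
    for key in path_to_hashes(path):
        node = node.children.setdefault(key, Node())
    node.payload = payload
    return node


def lookup(root, hashes):
    """O(k) lookup: one table access per path element."""
    node = root
    for key in hashes:
        node = node.children.get(key)
        if node is None:
            return None
    return node


if __name__ == "__main__":
    root = Node()
    insert(root, "/a/b/file", "first")
    insert(root, "/c/b/file", "second")
    print(lookup(root, path_to_hashes("/c/b/file")).payload)
```

Objects with the same name in different directories do not conflict here, since they live in different per-directory tables, and a rename only touches one entry in one table.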
Implementation start is scheduled for today, but I have to think about details first.
/devel/fs :: Link / Comments (4)
Wed, 12 Mar 2008
Climbing evening.
It has been quite a while since I last climbed, so today I've reopened
the season. It was a surprisingly good training, although I did not do that much,
only a couple of warm-up traverses and three new traces, which I had tried before
though. One of them was noticeably complex, but extremely interesting.
For those who visit Skala-city
it is the trace over the yellow holds in the right corner with the 6c+ category.
Complex, and I did not even expect to complete it without falls, but
everything was better than expected - of course I fell, but eventually completed it.
At the end I moved to the sauna and shower. Excellent finish of the day!
I like it.
/life :: Link / Comments (0)
(Cache) Coherent Remote File System sources are available now.
Zach Brown has
announced
that the CRFS source code is now open.
CRFS is a network filesystem which works
with remote BTRFS volume and supports
cache on clients.
Here is a brief set of features CRFS supports:
- the user space server exports a private BTRFS volume
- the network protocol operates on ranges of BTRFS disk items
- the kernel client provides posix semantics by operating on items
- the server can grant and revoke client caches of data and metadata
The CRFS protocol is tightly tied to how BTRFS is organized. For example, there is natural
batching of some commands like recursive delete, since btrfs keys
are placed one after another, so there is no need for an additional command to be sent; instead
the first one can be extended to cover a wider key range.
As you might notice, pohmelfs
was started as a competitor to the crfs project, because the latter is interesting and was closed. Right
now pohmelfs has a set of very interesting features crfs does not and likely will not support (like offline
work and support for different server filesystems), and its todo list has plenty of very interesting
stuff, so it will not be closed. Instead I plan to continue the competition (which is a bit
complex for me, since it is the first filesystem I write and essentially I did not know what an inode was
before) and fully complete pohmelfs. Although I subscribed to crfs-devel :)
My new shiny servers will be installed today, so tomorrow I will start the (re)implementation of
the ground ideas of
pohmelfs.
Stay tuned!
/devel/fs :: Link / Comments (3)
Tue, 11 Mar 2008
Do you know what patience means?
No, it's not yoga, I talked with people who did it regularly... It is not talking
with trolls, they are harmless. Trying to understand VFS or ext3, whose sources
are encrypted in the linux tarball, is interesting. Playing chess and 'go'
is just a warm-up.
Real patience is checked when you are cooking a ravioli.

That is how I spent a Monday:
- half an hour to make the pastry
- half an hour to cook up the minced meat
- about 10-15 minutes to roll out a small piece of pastry into a thin circle
- about 20-30 minutes to make 10 raviolis
So making the above set of them takes as long as 3 hours, so not that much.
That was the theory; in practice the pastry for every third or fourth 'blank' wanted to run away
and did not allow 'gluing' its parts together, which took 5-10 minutes
for each one to 'fix'. In some cases it ended up with smaller examples of 'Resident Evil'
creatures, like those you can find in the right part of the above picture.

Looks good? And really tasty.
So, forget yoga and chess - real patience is trained
by cooking.
/life :: Link / Comments (9)
Mon, 10 Mar 2008
Touching someone's ego.
There are two ways to point out mistakes made by others:
tell them "hey, you have an error :)" or "you have an error here and there".
These can end up with
"yeah, that's a good fix" or "you do not know these things, stop doing this".
It looks like there is no difference between the first messages, but the results
will contrast strongly. If you do not care
about communication with the person who made a mistake, but only care
about things getting fixed, there is no difference in how you point out the error,
but be ready (although you do not care) that the person will reply to you
quite aggressively and can resist making a fix if it is not vital
for the things being discussed. It is of course wrong and kind of childish,
but that is how people very frequently reply. If you care about communication
with the person in question, then speak the way you want to be spoken to.
It is not very simple actually, but do not expect an easy solution with
a teaching tone.
This reminds me of how kernel maintainers reply to people who make some contribution.
There are very good people who start a discussion in a friendly way even if the patch
or question is really wrong; this can end up with pointing to a mail archive or faq.
There are people (we all know who) who only reply: this is wrong,
you have a race. Sometimes the racy places may even be pointed out.
So, be cool with others and do not pretend to be the smartest one.
That of course touches me too... I'm frequently a hard one to talk with.
/other :: Link / Comments (0)
Fri, 07 Mar 2008
Just a random thought on photography.
I believe that photographers who only make black and white photos
are not as good as those who do not fear to make coloured photos.
BW ones are good almost every time, while coloured ones are usually not that
interesting, and changing them to BW frequently fixes the shot.
Just a humble opinion...
/other :: Link / Comments (10)
A gentle hint.

I am antisocial. Not always, but frequently. And never in well-known company.
Got a number of whiskey drops (solely for medicinal purposes) and made this creature.
A gentle hint: officially.
/other :: Link / Comments (3)
Thu, 06 Mar 2008
Got 4 seasons of "House M.D."
Only completed half of the 4'th season...
And there is a fair number of "South Park" episodes still unwatched.
Work seems to be stopped for a while...
Does anyone know how to watch them all without wasting time?
/life :: Link / Comments (0)
POHMELFS: was done just wrong!
So, the last several days, devoted mostly to thinking about things and some
experiments with them, led me to the headline conclusion: pohmelfs was done
just wrong!
Its network ping-pong protocol is wrong, its inode resync logic and overall
need for inode number change is wrong, its writeback logic is wrong (btw, why does the
Linux VFS call writeback for an inode after it calls writeback for the inode's pages?
This leads to duplication of the inode number resync code and a fair number of problems),
its userspace server cache is wrong (well, its userspace server is a braindamage,
but that does not prevent it from being wrong too), and the most important: it has become complex,
so I frequently have to read my own code multiple times to understand what I meant here or
there.
That just has to be changed (mostly just removed)!
Thinking about all that crap led me to a more philosophical conclusion: any network
protocol which requires a precise acknowledgement for each packet is broken. Period.
TCP is not broken, since it can send acks for multiple packets. TCP can aggregate on both
sides of the connection (which can lead to a huge
performance increase
as was observed in the userspace network
stack over netchannels),
so it is a stream, not a ping-pong, although its policy for ack generation is not always the best decision.
Out of curiosity, why were the original ping and traceroute commands not implemented as TCP applications
which would catch ack/rst packets?
So, anything ping-pong like is just broken. Never ever use that logic at all, since it breaks performance
and the ability to extend. Moreover, it breaks the ability to create real duplex communication,
since while you expect an ack you can get data from the other peer for a different command.
So, the brilliant idea (yes, I sometimes get them from the deep abyss of the mindless) is to convert the POHMELFS
protocol into two real streams: one from client to server and a completely independent stream from server to client.
It has zillions of benefits, but let's see how it is going to be implemented and what will be fully broken in the filesystem.
First, there will be no resync logic. At all. Each inode (and its number) on the client will not correspond
to any inode object on the server, so local inodes will never be synced with the server ones. Instead the cache of objects
on the server side will be indexed by special keys containing the name, length and other parameters needed for unique number generation.
The client inode number will never be sent to the server, so object creation will have only a single direction: just send a packet.
If there is an unrecoverable error, the connection can be broken, so subsequent command sending would reconnect or make some
changes. Things like permissions will be guarded by the client, although there might be a no-space problem.
Second, commands which require feedback from the server, like reading directory content, will become completely
asynchronous, so feedback from the server will not be exactly a sync reply for the given command; instead
we can wait until the directory content has been populated and start providing it back to the VFS.
Third, and the main one, there is a possibility of streaming commands both from client and server. Since client commands
now do not require a sync ack/reply, they can be batched for maximum performance, but that is not the main feature:
really interesting is the ability to receive a stream of commands from the server, so each of them can be parsed
independently of the original client command state. This allows implementing a cache coherency protocol without major
pain and having a high performance stream of data from server to client.
On top of that there are ->sendpage()/sendfile(), which are
broken
without a proper acknowledgement, so to fix the issue I plan to submit a socket extension patch, which will call
an appropriate registered callback when the page reference counter is about to be dropped, which automatically means
the data was received on the remote side. This kind of acknowledgement does not slow the connection down more than
a simple unidirectional bulk transfer, so it is fast.
So, I started deleting lots of code and implementing the needed bits; the nearest future will show how broken my approach is.
This raises a question about design vs. evolution... I actually prefer the former, but frequently end up with the
latter (like this decision about the network protocol, which is a design, but only after several evolution steps
in the wrong direction). This reminds me of the kernel evolution
topic, which does not actually show anything good for the kernel: there are lots of dead-end evolutionary branches which
believe they are the top of progress, maybe mankind is one of them...
That was a lyrical digression, so back to business!
/devel/fs :: Link / Comments (0)
Sun, 02 Mar 2008
Removing arbitrary size directory with single network command in POHMELFS.
All operations in pohmelfs
are made locally and are propagated back to the server at writeback time (or via the cache coherency
algorithm, which is not fully implemented yet). POHMELFS uses
the writeback cache in all its power, which allows removing a directory of arbitrary size
using only a single network command.
At unlink/rmdir time the local object is removed and potentially destroyed, while a short reference
to what it was is stored in a sync list of the parent, which is marked as dirty. So, when writeback
hits the parent directory of the just removed object, it sends all information about the removed objects to the server.
So, when a directory with an arbitrary number
of subdirs and other objects is recursively removed locally, the information is not sent, but added to the appropriate
parent subdirs, which are removed in their own turn, so when the whole subdir is removed, only a single object
becomes dirty - the parent of the just-removed directory, which contains information about the removed
dir. A message about this will be sent later (on writeback or because of the cache coherency protocol), which will
force the server to remove the whole subdir recursively. This is much faster than sending information about
every single object being removed during recursive removal of the directory.
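The mechanism can be modelled in userspace with a toy inode cache. This is a simplified sketch, not pohmelfs code: the `Inode` class, the `removed` sync list and the `("remove", path)` command format are all invented for illustration, and writeback is assumed to run only after the recursive removal has finished.

```python
class Inode:
    def __init__(self, name, parent=None, is_dir=False):
        self.name = name
        self.parent = parent
        self.children = {} if is_dir else None
        self.removed = []    # sync list: names removed from this directory
        self.dirty = False
        if parent is not None:
            parent.children[name] = self


def remove(inode):
    """Remove `inode` locally; only its parent records the removal and gets dirty."""
    if inode.parent is None:
        raise ValueError("can not remove the root")
    if inode.children is not None:
        for child in list(inode.children.values()):
            remove(child)
    parent = inode.parent
    del parent.children[inode.name]
    parent.removed.append(inode.name)
    parent.dirty = True
    inode.parent = None      # detached: its own sync list is destroyed with it


def writeback(root):
    """Collect network commands from all dirty directories reachable from root."""
    commands = []
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if node.dirty:
            for name in node.removed:
                commands.append(("remove", f"{path}/{name}"))
            node.removed.clear()
            node.dirty = False
        for child in node.children.values():
            if child.children is not None:
                stack.append((child, f"{path}/{child.name}"))
    return commands


if __name__ == "__main__":
    root = Inode("", is_dir=True)
    top = Inode("dir", Inode("a", root, is_dir=True), is_dir=True)
    for i in range(3):
        sub = Inode(f"sub{i}", top, is_dir=True)
        for j in range(5):
            Inode(f"file{j}", sub)
    remove(top)
    print(writeback(root))
```

The removed subdirectories do collect their own sync lists during the recursion, but they are detached from the tree before writeback can reach them, so only the surviving parent has anything to send.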
Of course, if writeback starts hitting pohmelfs inodes during deletion it is possible that not only
information about the highest removed directory will be sent, but also about some underlying subdirs, but
that does not matter a lot, since this is a very short-lived condition (the inode is in the dirty list and not yet removed
by the recursive removal) and the number of such inodes is still much smaller than the overall number of removed
objects.
Actually the cache coherency algorithm is the last serious thing to implement in pohmelfs, I think. There are bugs
of course and some feature extensions, but a major milestone will be reached once it is implemented.
Stay tuned!
/devel/fs :: Link / Comments (0)
Sat, 01 Mar 2008
Celebrating WiJo's birthday.
We had a small chillout in 'The last drop' and at home later (and earlier, since the celebration
began right at 00:00, although Wijo said he was born at 10:00, that did not matter already).
Besides other interesting presents,
he got a bottle of Hennessy XO (yes, I was lazy enough and presented that simple stuff).
So I had a chance to compare it (before it was quickly finished) with other cognacs
I had drunk before. Well, my opinion about cognac being untasty coloured vodka was confirmed again.
IMHO Hennessy XO, VS and VSOP (Very Special Oshe Pizdets) are all not that interesting drinks,
just like the other cognacs I have had. Next time
I will try Remy Martin, but I think that it will be similar. Yep, I'm not a fan, but rum-m-m-m...
/life :: Link / Comments (0)