If something looks undebuggable at first sight, then take a second look,
preferably from a different angle. Some problems require a third look.
A bit of history of the problem.
Pohmelfs has extremely large latencies
when syncing a local inode to the remote server. This involves sending
a command to the server to create an object with the given name and receiving back
a response with its real inode information (like the inode number and other
fields cached for faster stat() and similar workloads). Pohmelfs
then changes the local inode info to match the real data.
Syncing a small tree of 500 files takes about 40 (!) seconds. Well, in the Xen
environment where I develop these things, local creation of 500 files in a single
ext3 directory takes more than 15 seconds, but the other 25 are pure overhead.
That was a short description of the previous series.
Next, the problems of fixing the problems.
First, the Xen version used on that testing machine is old enough that oprofile
does not work. Second, I do not know VFS internals well enough to determine where
such long delays could come from (this is my first filesystem, and the
interested reader can find how I managed to
step
on likely every possible rake
in that field, some of them even small kids' rakes...). A Linux filesystem is actually
not that complex a system but a set of callbacks, so the implementation is nothing outstanding,
but knowing under which conditions each callback can be invoked and which problems can lurk
here or there is a kind of magic... Third, the remote userspace pohmelfs server was not actually
written by me; instead its bytecode was blown out under the inspiration of some substances,
so it could very well be the reason for all the problems. But given that it is as trivial as
pretty much all my userspace code, even a total rewrite would not fix the issue.
So, the latency problem in pohmelfs looked really undebuggable. But you know, a cup of excellent
tea (from a tea bag) with lemon can fix any problem (or high temperature and substances,
or a fair amount of alcohol, everyone has fun the way he likes), so I first
decided to implement
a simple network kernel module which would connect to the remote userspace server and exchange
messages in a fashion similar to what pohmelfs does.
Such a module was implemented and started, and it showed excellent performance (about one thousand messages
per second sent and received back in the test network, which is well over an order of magnitude faster
than pohmelfs). So, back to VFS and praying for inspiration.
Inspiration arrived today (thanks Arnaldo; likely it is because I'm getting healthier :).
I always thought that a number of subsequent recv() calls is not a good idea no matter
where, in kernel or userspace, since each one takes the socket lock, which in turn could introduce the latencies found.
So I eliminated subsequent recvs in the pohmelfs code (the testing module was written better and does
its sending and receiving without such 'fragments'), which resulted in... nothing, the results did not change
at all. So, a wrong step, but having several sending calls in a row is not a good idea either,
so I replaced them with an allocation and a copy, so that there would be only a single kernel_sendmsg()
call. As you might expect, performance... changed by 30 times. Just having a single send call instead of two
for as few as 500 invocations forced the whole network exchange to behave completely differently.
So, to debug the problem further I extended the testing module and introduced the ability to send and receive
data not as a single packet but as two fragments: 4 bytes and the rest of the packet (60 bytes). Here is the result
table for 1000 messages sent and received back by the testing module:
no fragments: 1.43 seconds
send fragments (4 and 60 bytes): 40.43 seconds
recv fragments (4 and 60 bytes): 1.43 seconds
both fragmentations: 40.43 seconds
That is a 30-fold difference from just a simple application change! tcpdump on the receiving side shows that sending subsequent fragments results in a real message being sent
every time kernel_sendmsg() is invoked, which results in an ack for each such message (both 4 and 60
bytes), which completely degrades the tcp window, and the connection just can not recover from such behaviour.
So, all those words were written just to show that even problems that look undebuggable at first sight can be easily
solved, and that programming mistakes which look harmless (at first sight again) can lead to very interesting results...
Now back to the drawing board to think about how to improve the pohmelfs protocol even more to get the last bits out of the wire.
Btw, the interested reader can get my network testing module and userspace server from their
just-created homepage.
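The send-side fix described above (copy the header and the body into one buffer so that only a single send call is made) can be sketched in userspace Python. This is only an illustration: the real change was made around kernel_sendmsg() in the kernel code, and the 4-byte length header, the function names and the 60-byte payload are assumptions modelled on the test above, not the actual pohmelfs protocol.

```python
import socket
import struct

# Hypothetical wire format: 4-byte big-endian payload length, then the payload.
HEADER = struct.Struct("!I")

def send_fragmented(sock, payload):
    """Two send calls per message: the slow pattern from the table above."""
    sock.sendall(HEADER.pack(len(payload)))
    sock.sendall(payload)

def send_coalesced(sock, payload):
    """Copy header and payload into one buffer and send it with one call."""
    sock.sendall(HEADER.pack(len(payload)) + payload)

def recv_exact(sock, size):
    """Read exactly `size` bytes, or raise if the peer closed the socket."""
    chunks = []
    while size > 0:
        chunk = sock.recv(size)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)

def recv_message(sock):
    """Receive one length-prefixed message, whichever way it was sent."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)

if __name__ == "__main__":
    a, b = socket.socketpair()
    send_coalesced(a, b"x" * 60)
    print(len(recv_message(b)))
    a.close()
    b.close()
```

Both senders put identical bytes on the wire; only the number of send calls differs. A local socket pair will not reproduce the slowdown, it only shows the framing. On a real TCP link, a commonly cited cause of this kind of write-write-read stall is the interaction between Nagle's algorithm and delayed acks, but whether that is what happened here was not verified.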
I can not take a photo of the workplace where the wooden plates are laid out
because of the crappiest repair service ever, Nikon's, so enjoy this
pseudo-graphics instead :)
Originally it was supposed to host not only bottles of wine (I do not
drink wine though), beer and strong alcohol, but also some books,
but I made a mistake in the design, which was made worse when the
plates were sawn by other people (well, I gained a huge experience of not
letting others do a thing I can do better, even if I pay them for that),
so the cells became smaller (each can now host only about 5-6 classical
(0.5l) bottles of beer; each side of the square is about 17 cm)
and have a cut side. Both of these issues forced me
to rethink the design a little bit, so the shelves now have reduced functionality,
but probably look even more interesting than in the original design.
So far I have not finished polishing the wood and have not started staining
and plastering it, but have only constructed the whole structure on the floor,
so it will take a while to finish, but the result is expected to be worth all
the effort.
Someone told me that there is something similar in Ikea. Well, there is always
something similar in Ikea no matter what you created or only thought
about, but that does not matter at all: only the process of creation
matters, and that is the most interesting and important part.
Trying to make at least something while fscking sick.
I've created a simple traffic simulator, which contains a variable number
of cars and lights, each of which can be programmed with a different acceleration,
maximum allowed speed, and stop and deceleration distances. Each light can be
programmed to switch after a different interval. There are only two lights
in real life, red and green (at least in Moscow, very unfortunately): no one
ever cares about yellow, and lots of drivers deliberately accelerate when they see yellow...
So, only two colors.
Since I did not bother to implement a nice config for each car and light,
there is only a single set of parameters, but command line parameters allow
varying the initial number of cars and the distance between them, the number of time frames before
lights change state or a new car enters the road, and the number of lights
and the distance between them.
There are two known problems with lights on the road. The first is bad drivers
who do not maintain a big enough buffer, so they have to wait until the car
in front of them moves far enough before they can start, and this takes time
out of the limited timeframe of the green light. If the buffer is large enough, drivers
can start simultaneously and thus move much faster.
One can simulate
this behaviour with a variable initial number of cars and with different distances between
them. If the distance is less than the stop distance (i.e. the distance at which a driver has to
stop the car, it is 4 in the current setup), then the driver will have to wait
until the distance becomes greater than or equal to the stop distance; if the driver stopped farther away than
the stop distance (let's say 5 'meters'), then the car can start simultaneously with the head one. The latter
approach allows more cars to move through the light during a fixed time frame,
but psychologically it feels better to stay as close as possible to the head
car, which introduces a latency, since we have to wait until the head car moves far enough
before we can start. This leads to a negative exponential speed increase for each car
behind instead of the linear one drivers would get by maintaining the buffer. The appropriate
equations are quite simple: the difference in the distance moved during a single
time frame is proportional to the acceleration of the head car, which in turn
is proportional to its coordinate, so we have a simple differential equation,
whose solution is a negative exponential. One can read a bit more
here.
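The effect of the buffer can be shown with a toy model much smaller than the real simulator. The sketch below is not the code of the traffic program; the function name, the update rule (a blocked car simply stands still) and the default numbers other than the stop distance of 4 are assumptions made for illustration.

```python
def cars_through_light(n_cars, gap, stop_distance=4.0, accel=1.0,
                       max_speed=10.0, green_frames=10):
    """Count cars that cross the light (position 0) during one green phase.

    Cars start at rest in a queue: the head car at 0, car i at -i * gap.
    A car may accelerate only while the distance to the car in front,
    measured at the start of the frame, is at least stop_distance.
    """
    pos = [-i * gap for i in range(n_cars)]
    vel = [0.0] * n_cars
    for _ in range(green_frames):
        old = list(pos)  # every driver reacts to last frame's positions
        for i in range(n_cars):
            blocked = i > 0 and (old[i - 1] - old[i]) < stop_distance
            if blocked:
                vel[i] = 0.0  # too close to the car ahead: keep waiting
            else:
                vel[i] = min(vel[i] + accel, max_speed)
            pos[i] += vel[i]
    return sum(1 for x in pos if x > 0)

if __name__ == "__main__":
    print("gap 5 (buffer kept):", cars_through_light(20, gap=5))
    print("gap 2 (too close):  ", cars_through_light(20, gap=2))
```

With a gap of 5 every car starts in the same frame, while with a gap of 2 each car has to wait for the one ahead to pull away first, so fewer cars get through the same green phase.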
The second problem is the light interval. If the interval is too short, then cars can not get going,
and only a couple of them move forward; and if it is long enough, then during the red light
a large backlog of cars can accumulate, and it will not be cleared during the green light
because of the above problem: each car has to wait until the head one moves some distance away.
The latter is actually worse, since the backlog can become so huge that it will never be cleared
at all, which will lead to a complete stall of the traffic flow (at the back side; the front will
move, but the number of cars joining at the tail will be bigger than the number of cars which leave the
traffic jam).
One can play with the program, called
traffic.
It requires the gtk2 devel package to be installed. The homepage contains essentially the same text
and a link to the source code.
It also shows a usage example.
David Howells of Red Hat recently
posted
the next round of his CacheFS implementation. The main idea of the project is to
store data and metadata modifications locally on disk.
The cache is implemented as a write-through one. Locally, data is stored as
usual files on a special partition formatted with one filesystem or another.
David also posted
benchmarks
of his approach. Metadata-intensive operations showed a significant slowdown
with the local on-disk cache, and getting metadata from the local cache also shows
a slowdown. The former can be explained by the write-through nature of the cache
and slow local disk operations, which are also the reason for the
slower metadata reading.
There is also no cache-coherency algorithm implemented for CacheFS. Another problem,
also pointed out by Kevin Coffman, is possibly slower reading of data from the cache than
from the local filesystem (and from the remote one if bandwidth is not a limiting
factor, which is frequently the case).
This is the third (actually the first :) local cache implementation for a network
filesystem, so the competition between
CRFS,
POHMELFS and
CACHEFS becomes even
more interesting :)
Stay tuned!
As was mentioned, the full inode resync logic
is very slow.
The latency is likely introduced somewhere at the protocol layer used
by pohmelfs. To test this scenario and find the best possible
solution I implemented a trivial network module and a userspace server, which
talk to each other via a protocol very similar to what is used in lookup/create
operations in pohmelfs. Server and client also maintain trees of the objects
they sent/received, so that the model would be as similar as possible to
pohmelfs usage patterns.
It's time to test things and find out where the problem lies, but as usual
there are problems. You are sick, everything is aching, but you
want to beat the crap, to move a bit further, to make something
interesting, so you start implementing the tiny bits, you start thinking,
you finally make the thing, so you become happy and proud, and that is
just to find out that all the testing machines you previously had access to
are turned off, and the new ones are behind a firewall, and there is no access
to the network from the ass of the world. This is called 'shit happens'.
At the very least, yum developers do not know that there are systems
with less than 1 Gb of RAM. And it is not even about how slow yum
is. Nor about the fact that to install a 30 kb application yum will
download a 3.5 Mb sqlite database file.
It is about yum programming bugs:
As you might expect, all 70+ packages above also got the 'Cannot allocate memory'
error. My laptop has 256 Mb of RAM and 512 Mb of swap, more than half of which was free.
When I tried to start the same process again, after some applications were killed to free
some memory, yum refused to install the packages because of broken
dependencies...
But libebml has one FC6 version. For the record, FC6 was never
installed on this laptop:
libebml-0.7.7-2.fc6
libebml-0.7.7-3.fc8
Fedora also moved an FC9 stopper bug into the needinfo state without a single
patch/version to test (at least I did not receive any such mail). The bug was
opened with a perfect description, with a probable buffer overflow
somewhere in the image processing/rendering code, with a 100% reproducible example and an image
to test with, and this happened even after another person reported the same problem on rawhide
(and marked it as an fc9 stopper).
How the hell do you expect to get some info after two months of silence from the developers?
(one month after the bug was confirmed in rawhide)
Some people still believe in miracles...
I would like to test it right now, but I can not because of the yum problems...
Old packages, as you might expect, still have that bug somewhere.
The world is far from being perfect :)
I will not turn it off or suspend it, as I do not believe it will work after that. Instead
I will wait until I am able to get a new DVD with some other distro. And it
will not be
Debian either.
For the first time in my life I called a doctor.
The doctor happened to be a nice-looking woman less than 30 years old;
we had a nice talk about my sickness and how (un)successful the cure was.
Since it is the first time I have used any drugs (except aspirin) to cure
the sickness (I believe it is the flu), I managed to miss the point when taking the drug
should have been started, so it was not very useful. After a quick look at me
she found so many possible crappy things I could have that I began to get scared.
I always knew that the less one knows the better one sleeps, but after
a set of questions and 'no, I do not' answers, the sky became a bit less cloudy...
She gave me a list of drugs to get and described what the heck
it is (flu, which caused some complications: something like
bronchitis), so now I feel a bit better, but will be ill (according
to her prognosis) for at least all this week or maybe a bit more.
Apart from the temperature I do not feel that bad, but when it starts rising
after the drug (aspirin) stops working, I start feeling like shit, deja vu.
I actually tried two anti-temperature drugs: aspirin and paracetamol.
The former knocks the temperature down in about 30 minutes, but frequently
this results in excessive sweating, while the latter acts only after an hour
or so, but without any bad effects.
So, as you might notice, there are no updates about tons of my
projects
because of this fucking sickness, but I expect to be able to kick
its ass soon so things will be in good shape again.
Such (crappily forced) 'vacations' make the brain sleep almost all the time,
so when this is over, I expect even higher rates and more
interesting stuff to happen.
Let's suppose one does not have a thermometer, but there are lots
of instruments and equipment around, from screwdrivers to drills, from a simple
ammeter/voltmeter to a laptop. As a hint: there is also electricity,
water and an automatic teakettle.
The task is to measure one's own body temperature and decide whether or not
to take an aspirin. Or to have some fun with the process because the sickness
is quite boring.
The solution is pretty geeky, but first try to think about it yourself.
So, the solution.
It is based on the fact that when a human body or a part of it is placed
into an environment with essentially the same temperature but a much
bigger thermal capacity, it does not feel it.
Try taking a shower at about 36 degrees Centigrade, and you will feel neither
cold nor hot. Things are different when the air on the street is more than 30 degrees
Centigrade, and that is because of the very different thermal
capacities of water (it is huge) and air (very small).
So, back to the task. To determine your temperature you have to put a precise
volume of water into the teakettle (let's say 1 liter; I could measure it because
I have water meters), connect the teakettle to the electricity via an ammeter, and measure
the voltage with a voltmeter.
Then you have to put your arm into the teakettle, turn it on (beware of the heating
element) and note the first time. When your hand feels completely comfortable
(this is the main source of error) you have to note the second time. Then remove your arm,
wait until the water boils and write down the third time.
Now it's time for school physics: the power of the teakettle, which is equal
to the product of current and voltage, multiplied by
the time difference, is equal to the mass of the water multiplied by its thermal capacity
and by the temperature difference gained during that time frame.
So, here are practical results:
current strength I = 3.7 A
voltage V = 231 V
mass of the water m = 1 kg
thermal capacity of the water c = 4200 J/(kg*degree)
time difference for complete boiling (from unknown temperature to 100 degrees Centigrade)
dt0 = 420 seconds
temperature difference dT can be found from the following equation:
I*V*dt0 = m*c*dT
So, we have dT = 100 (boiling temperature) - T0 (initial temperature
of the water) = I*V*dt0/(m*c), which is equal to about 85 degrees Centigrade, so the initial
temperature of the water was about 15 degrees Centigrade.
The time difference between the start of the process and the comfortable temperature
was about 30 seconds, so putting this time frame into the above equation we find
that the temperature changed by about 6 degrees.
Since we already found that the initial temperature was about 15 degrees Centigrade,
the calculated temperature of my body is about 21 degrees Centigrade.
It's time to go back to the grave...
P.S. Yes, I'm a former loser-physicist, that's why I became a kernel hacker; this can
explain a lot...
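For those who want to check the teakettle arithmetic above, here it is as a few lines of Python. The numbers are the measured ones from the post; the function and variable names are just illustrative, and heat lost to the air and to the kettle itself is ignored, as it is in the text.

```python
def temperature_rise(current_a, voltage_v, seconds, mass_kg=1.0, c=4200.0):
    """Heating of the water in degrees: I*V*dt = m*c*dT, solved for dT."""
    return current_a * voltage_v * seconds / (mass_kg * c)

def estimate_body_temperature(current_a, voltage_v, dt_comfort, dt_boil):
    """Return (initial water temperature, estimated body temperature)."""
    rise_to_boil = temperature_rise(current_a, voltage_v, dt_boil)
    t_initial = 100.0 - rise_to_boil  # the water ends at boiling point
    rise_to_comfort = temperature_rise(current_a, voltage_v, dt_comfort)
    return t_initial, t_initial + rise_to_comfort

if __name__ == "__main__":
    t0, t_body = estimate_body_temperature(3.7, 231.0, dt_comfort=30, dt_boil=420)
    print("initial water temperature: %.1f C" % t0)
    print("estimated body temperature: %.1f C" % t_body)
```

The result is dominated by the moment the hand 'feels comfortable', which is why the estimate lands so far from 36.6.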
Here are a number of interesting moments in this miserable
melodrama
(got via Linux Today News).
Stallman wanted to visit Russia this March, and parliament member Viktor Alksnis
promoted his visit and wanted to help with 'administration issues'.
Then LOR (www.linux.org.ru, one of the most popular linux sites
in Russia) moderator Sergey Udaltsov (who lives in Ireland)
sent a letter to Richard, where he
pointed out that Alksnis is not a good man; he also mentioned Alksnis'
"fight against the independence of the Baltic countries" in the late 80s.
Stallman then said that he does not want Alksnis to organize his visit.
Well, a couple of my points about this stupid situation.
First, Alksnis is really not a very smart person in IT, and it looks
very much like he is a usual careerist, since I do not know anything about his
work except the stupid idea of creating a 'national OS',
probably using 'nanotechnologies' (it is a modern trend here :).
Second, Sergey Udaltsov's step is quite braindamaged: while it is ok to
describe who Alksnis is, pointing to Baltic independence is even
more stupid than the 'national os' idea.
So, my simple point is that both Viktor Alksnis and Sergey Udaltsov just
wanted to get some self-advertisement profit out of Richard Stallman's visit
and are not really doing it for the sake of open source. Although
self-advertisement is not a bad idea (you read this blog :), such moves
are stupid.
If Stallman has problems with a visa (I would be surprised if it takes more than two weeks),
then it means he does not really want to
visit Russia. For example, I got the kernel summit invitation more than two months
ahead of the meeting, which was enough to get a visa (it required two (!) visits
to the UK visa office just to hand in and get back the documents,
it took two (!) days to check the documents, and it was
possible to order a courier,
although the previous year the time frame was shorter). If he does not want to,
why would we care?
If someone wants to visit a country, he can find a way to do that himself
and not wash his own brain with stupid rhetoric.
I think this will be the cause for Richard not to visit, since it really
looks like he does not want to do it :)
Well, when you are sick it is generally not a very good idea
to work on something, but this state can produce very interesting
ideas. So, I decided to change my table:
I 'disconnected' its leg (there is only one, the other side is attached to the wall near the window),
removed all the varnish layers with a plane, polished it
a bit with a grinding machine and painted both sides of the table
in a new dark chocolate colour. It looks pretty good, but since the hardboard
used on both sides is made out of wood fibers, it is not 100% smooth;
a second layer of paint is mandatory, and it is possible
that even it will not be enough. I want, if not a completely smooth surface,
then at least one where a hand moved over it would feel nothing. I managed
to paint my hands and feet, but at least the hair was not touched (although
I'm not sure).
I also installed a number of electricity sockets, so there is no need to place the kettle
far from the 'bed' (i.e. part of the floor).
Overall it was a good development day, and I expect the next one to be as
productive if not more. Nevertheless I hate being sick...
It also works with much bigger trees (like untarring the linux kernel tree,
although the ugliness of the userspace server requires raising the maximum number of open
file descriptors).
There is a single problem in this case: it is damn slow. And I do not see
an easy explanation for that. Well, tcpdump shows a small window, but that is an end result,
I think, not a reason, and the reason is likely in the protocol pohmelfs uses: the system sends
a number of short packets in round-robin fashion, which may be slow for some reason.
Since I'm waiting for real hardware to test things on (since oprofile does not work on the installed
Xen version), I can only handwave about the root of the problem...
And that is exactly the same problem the write-through cache pohmelfs had at first; I think
even the timings are similar, so after this problem is fixed, a new version will be released.
There is another problem which complicates the development: I got a cold (the second one this year, and the third
one in the last 3 or 4 years though), but such a condition with some temperature, when the brain is in the
'hinged' state between sick and good shape, opens very fun feelings about things around, which usually
ends up with very interesting results.
But overall it does not work, since writeback can happen for any inode
inside the whole not-yet-synced tree, so trying to sync the inode number for some
obscure object, which sits in a directory the server never saw before, is quite
problematic: the whole tree has to be traversed from the inode under writeback
up to the one which is known to the server host.
Although this is not a very complex task, there is a question about what to
sync. Should the whole directory content be synced, or just a single inode?
If the former, then should we force writeback for other objects in the directory under
resync... I think the simplest option is to force only the creation of higher-layer objects,
without syncing their content (like other objects in the directory), but the directory itself
should be marked as dirty, so that access from different clients forces the appropriate
resynchronization.
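The 'simplest option' above can be sketched as follows. This is not pohmelfs code: the Inode class, its fields and the create_on_server callback are hypothetical, and marking the parent of every created object as dirty is just one reading of the idea.

```python
class Inode:
    """Toy stand-in for a cached inode (all fields are hypothetical)."""
    def __init__(self, name, parent=None, synced=False):
        self.name = name
        self.parent = parent
        self.synced = synced  # True once the server knows this object
        self.dirty = False    # True if the content still needs a resync

def creation_chain(inode):
    """Walk up from the inode under writeback to the first ancestor the
    server already knows, and return the unknown objects top-down."""
    chain = []
    node = inode
    while node is not None and not node.synced:
        chain.append(node)
        node = node.parent
    chain.reverse()
    return chain

def sync_for_writeback(inode, create_on_server):
    """Create only the objects on the path, not their siblings, and mark
    the directories that received a new entry as dirty."""
    for node in creation_chain(inode):
        create_on_server(node)
        node.synced = True
        if node.parent is not None:
            node.parent.dirty = True

if __name__ == "__main__":
    root = Inode("/", synced=True)
    a = Inode("a", root)
    b = Inode("b", a)
    f = Inode("file", b)
    sync_for_writeback(f, lambda node: print("create", node.name))
```

Other objects in directories a and b are deliberately left alone here; the dirty flag is what would later force a resync of the rest of their content.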
That was a bloody excellent training, but since it was only the third
one this year, I can hardly find a part of the body which is not aching right
now. Maybe the ears though...
I started to climb high without the usual warm-up traverses today, so the knees were
damaged first (not counting a couple of hits against the wall and holds). Then came
a fair number of attempts to finish a new (quite complex though) route
over the blue holds in the right central section, which goes over the new
balcony in that area. Eventually I found the solution (although I think there
is another one, which requires a bit more power, but a bit less technique),
but damaged my arms, rubbed all 20 fingers, and made my arms very tired. I also tried
some interesting old routes, but since it was only my second time on them, I failed.
That was ok, legs and back were not aching at that time, but then I found a route
which made the day! An excellent, complex and quite short route in the right
vertical corner of the climbing area.
The route requires very interesting technique, almost without heavy power
moves, but with good balance and stretching.
I spent more than half an hour there, damaged shoulders and fingers,
stretched everything which was not broken before, but completed the route
(although only piece by piece, will finish it cleanly next time), and finally
moved to the shower (even the sauna did not work today) as a number of separate pieces
of crap connected to each other essentially by virtue of the mind.
Excellent day, excellent time!
It is rather dumb and does not even have state machine handling
in the usual meaning.
The existing pohmelfs implementation has only two places where the content of an inode
is 'globally' modified; by 'globally' I mean changes which have to be seen
by other clients if they access the given inode.
The first one is directory reading, when the inode in question gets information about
the other inodes inside it; the other one is object creation. Object removal is a local
operation, and there are no collisions if multiple clients delete the same object
simultaneously.
When a directory is read for the first time, pohmelfs just syncs its content from the server;
all subsequent reads happen from the cache, since all creations and removals happen locally.
This case is simple.
When pohmelfs is about to create an object, it marks the parent inode as dirty.
If the parent inode was not marked dirty previously, this ends up sending a single
message to the server. The server in turn can return the content of the directory in question
if that inode was already modified by a different client. If there are objects with the same
names as local ones, the local objects are 'renamed' to 'oldname-synctime', so that
the user can later run diff or whatever and merge the changes. That is how offline
pohmelfs clients work.
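The renaming rule for colliding names can be illustrated with a small sketch. Only the 'oldname-synctime' pattern comes from the description above; the function name, the use of plain name lists and the numeric format of the sync time are assumptions.

```python
def resolve_collisions(local_names, server_names, sync_time):
    """Merge a directory listing received from the server into the local one.

    Local objects whose names also exist on the server are renamed to
    'oldname-synctime'. Returns (renames, merged listing).
    """
    server = set(server_names)
    renames = {}
    for name in local_names:
        if name in server:
            renames[name] = "{}-{}".format(name, sync_time)
    merged = server | {renames.get(name, name) for name in local_names}
    return renames, sorted(merged)

if __name__ == "__main__":
    renames, listing = resolve_collisions(
        ["notes.txt", "todo.txt"], ["notes.txt", "plan.txt"], 1200000000)
    print(renames)
    print(listing)
```

After the merge both versions of a colliding file sit next to each other in the directory, so the user can diff them and decide which one to keep.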
An object is always created in the local cache only, with a local inode number. So far
it is never sent to the server (although the code which does that and changes the inode
content exists); even writeback does not work right now (since the server does not know
about objects with local inode numbers). This part is a bit more complex: pohmelfs
has to sync the inode (i.e. send the current inode info, wait until the server creates the object,
then receive the real inode info and change the local cache) either in writeback (when
the system forces writeback of a page or pages, the appropriate inode will be synced first)
or in the cache coherency algo. For that purpose each network state locking first checks if there
are messages in the queue from the server which have to be processed first.
So far only receiving content from the server is supported; sending own content on request
from the server is the base of the cache coherency, and this is not yet turned on. Here
lives the major race, which can actually lead to a full resync of the idea. After we have locked
our own network state and checked that there are no requests from the server, the client can start
sending its own commands, but before they reach the server, the server can start a CC resync
(and send messages into the same pipe as the client's commands) initiated
by a different client, which will break the protocol state machine. This is the main thing to think about.
Oh, and the same logic has to be implemented on the server :)
Someone kind added my rss feed to kernelplanet.org,
which is the kernel hackers' place of shame glory :)
Well, the one who did that probably saw that I frequently write quite a lot
of notes per day, that I have no political/hacker/whatever ethic in
the blog, that I make too many English errors (especially when I have
no access to a dictionary) and so on; I hope it is not that bad.
So, a couple of words about what it is. This blog is fully devoted to how
I spend my days: hacking, having a rest, sleeping and going to the toilet...
The blog has comments (with a not very user-friendly captcha), and a number of them
can be found at the end of each message. When a new comment is added, the entry is updated,
so stream-based aggregators will see it as a new one. Usually there are 1-3 entries
per day, sometimes more, sometimes no entries.
I also replaced the electricity switch with a new one and installed
a warm floor heating system in the bathroom, but the latter is not yet
fully completed (the wires are connected without good insulation yet,
and the thermal controller is not placed into its own socket in the wall
but hangs on its wires), but it works (I do not know how well, and
I will turn the electricity off when I leave home); I will check its
thermal capabilities when I'm able to stop the fire...
I started electricity projects in the bathroom, which required
setting up electricity sockets for the light switch and the warm floor. During
installation I managed to get dirty with assembly foam, which,
if you do not know, is much, much harder to clean off than
any other material we have. And I managed to get it
not on my hands (that's usual), but in my hair (it was my first time working without a hat).
This forced me to wash my head with acetone... It was not that bad actually,
except maybe for the smell, so eventually I cleaned almost all the dirty areas
(about one quarter I believe; since I have no mirror I can not say for sure)
and washed it a couple of times the usual way, but that stopped me from further
development for today.
Maybe I will finish it tomorrow.
At LWN.net. And as usual I do not have an account this time...
So, I will wait a week for the free article; by that time pohmelfs will contain very tasty things
which do not exist in any other fs out there (or at least not in a single filesystem).
Edited to add that Simon Holm Thøgersen shared a link to the article. It is somewhat fun,
although the author (Jake Edge) writes quite differently from Jonathan Corbet imho. The article
does not compare pohmelfs and crfs, but shows that they are very similar. I knew that
Zach Brown has been working on CRFS for about a year, while pohmelfs has existed
for less than a month. Someone shared secret knowledge about the meaning of the pohmelfs abbreviation
in Russian; well, maybe he/she is right, who knows...
The article does not cover features scheduled for pohmelfs like offline working and the inode resync logic.
Commenters try to compare crfs and pohmelfs with afs and pnfs. Neither of those has metadata caching
mechanisms, so they are fundamentally different; pnfs in addition allows implementing closed
extensions, which will lead to vendor lock-in.
One remark to the writer, Jake Edge: he does not use first names in the articles, only last names.
There is a long discussion in linux-fsdevel
about various filesystem freezing implementations and the features they should have.
The main goal of this project is to freeze any filesystem, so that all write requests
are blocked. This allows implementing consistent backups. This task
belongs to the block layer though, and this patchset actually implements it by
suspending the underlying block device. Although the interface (ioctl) is a bit ugly,
it will likely be accepted, since other filesystems (namely XFS) have such a feature
via their own private ioctls. People say that it does not always work though.
LVM supports consistent backups natively, but having such an interesting feature without
the need to work on top of device mapper would be a great deal!
This highlighted a very interesting project I have in mind (actually it will be
another reinvention of the wheel though) about various removable devices. Actually
it is not only about removable ones, but about any devices which can suddenly disappear or get stuck
(like a network filesystem, a broken cable to a local disk or a bad drive).
The old idea is to remount such a device read-only, with an error returned on
any attempt to access it. There is a frevoke() syscall which does that
for a given file descriptor: it is marked as erroneous so access to it returns an error,
but this does not fix the problem with a network filesystem, for example. Let's suppose
we have an NFS client which is stuck because the server was disconnected; there are cases when
it will never resume or return an error. Or a bad block/bad drive access, which will be retried
again and again forever...
Revoking particular file descriptor is simple task, but what if we have a web server,
which accesses broken drive for each new client or similar scenario? While we revoke one file descriptor,
server will create another two, stuck in the middle of the operation.
The very good solution I have in mind is to break all existing access paths (the block
layer has access to all bios) and either replace the underlying device with a fake one,
so that all requests are completed with an error (consider it a hotplug/unplug
of the storage device), or replace the filesystem (inode and file) operations, so that
they return an error (that is like a hotplug/unplug of the filesystem). In the latter case
it would even be possible to change the filesystem on the fly! First, plug in a filesystem which
just queues requests instead of processing data, then unplug the real filesystem,
plug in the new one and unplug the fake one.
Not sure it is very useful functionality, but it is very interesting...
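The operation-swapping idea can be sketched in userspace as a toy model. This is only an illustration under loud assumptions: the class names (RealOps, ErrorOps, QueueOps, Mount) are made up for this example, only a write operation is modelled, and nothing here resembles real kernel locking or bio handling.

```python
import errno
from collections import deque


class RealOps:
    """Stand-in for a working filesystem: keeps file data in a dict."""

    def __init__(self):
        self.files = {}

    def write(self, name, data):
        self.files[name] = self.files.get(name, b"") + data
        return len(data)


class ErrorOps:
    """Plugged in when the device is gone: every request fails at once."""

    def write(self, name, data):
        raise OSError(errno.EIO, "device was unplugged")


class QueueOps:
    """Temporary fake filesystem which only queues requests."""

    def __init__(self):
        self.pending = deque()

    def write(self, name, data):
        self.pending.append((name, data))
        return len(data)


class Mount:
    """All access goes through self.ops, so the table can be swapped."""

    def __init__(self, ops):
        self.ops = ops

    def write(self, name, data):
        return self.ops.write(name, data)

    def begin_switch(self):
        # Plug the queueing filesystem; the real one can now be unplugged.
        self.ops = QueueOps()

    def finish_switch(self, new_ops):
        # Plug the new filesystem and replay everything that was queued.
        queued, self.ops = self.ops, new_ops
        for name, data in queued.pending:
            new_ops.write(name, data)
```

With such a table, a dead device is handled by assigning ErrorOps to the mount, and a filesystem change on the fly is begin_switch() followed by finish_switch(new_fs). A real implementation would also have to deal with requests already in flight, which this toy ignores.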
Chris Mason changed the on-disk format again, which leads
to a very noticeable (30 times!) speed improvement
for random write access (from 1 MB/s to 30 MB/s).
This release also contains a mount option and some tweaks for SSDs (solid-state disks),
mainly write clustering which does not take into account the directory
the written files belong to. Simple ENOSPC handling was also added:
although it is still possible to crash the machine when there is
no space left on the device, it is now a bit harder.
The next step for btrfs
is to support multiple devices for a single filesystem via
subvolumes.
Jake Edge posted an article at LWN.net
about various memory pressure notifications which userspace applications may be interested in.
For example, an application can wait for swap in/out notifications or for an OOM condition well before
it is killed by the oom-killer, so it can free some unused RAM (firefox, for instance, could
free part of its cache of recently viewed pages).
Notifications are delivered to userspace via the /dev/mem_notify file, which
is readable and pollable. An alternative is to have SIGIO sent to the process when
the device becomes readable.
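From the application side, waiting on such a pollable file could look roughly like the sketch below. It assumes only what is said above (the /dev/mem_notify name and that the file is pollable); the helper names wait_readable() and wait_for_memory_pressure() are made up for the example, and the exact semantics of the device depend on the patch.

```python
import os
import select

# Path as named in the patch description; it only exists on a patched kernel.
MEM_NOTIFY_PATH = "/dev/mem_notify"


def wait_readable(fd, timeout_ms):
    """Return True if fd became readable within timeout_ms milliseconds."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    events = poller.poll(timeout_ms)
    return any(mask & select.POLLIN for _fd, mask in events)


def wait_for_memory_pressure(timeout_ms, path=MEM_NOTIFY_PATH):
    """Open the notification file and wait once for it to become readable."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return wait_readable(fd, timeout_ms)
    finally:
        os.close(fd)
```

An application would call wait_for_memory_pressure() from a helper thread or add the descriptor to its main event loop, and drop its caches when the call returns True.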
The patch will likely be accepted soon.
This is another example of the real need for a unified event management subsystem in the Linux kernel.
I will order some equipment tomorrow and continue the table and shelves
development.
Since my kitchen is not completed, it will be cleaned and
temporarily transformed into a woodworking shop. I also have to complete
some electrical projects (in the kitchen, bathroom and hall).
That was my second training this year - not very huge progress,
as you can see, so this training was hard. There is a fair
number of new routes, and most of them were quite simple, so I decided
to run several in one go without rest in between. Probably that
was not such a good decision, since after 7 or so of them, completed
in two goes, I was very tired and was not able to climb well
on the more complex routes. So those are postponed until the next
training.
I bought myself new climbing shoes, which are a bit large - usually
I wear climbing shoes 3 sizes smaller, and this time the difference is only
two sizes, but it is my favourite model, so I expect very good climbing.
As I wrote previously,
the accepted design of the local cache
not only fixes the problem cases with inode generation numbers, but also
provides a very interesting feature: offline work.
Let's suppose a client went offline or just has not yet synced its cache with the server.
It can work without any problem, and later, when it connects back to the server, the system will resync
its data with the server's. For every file which differs between client and server, the client will
keep its own version, but under a different name (like orig_name-$date_of_sync),
so that the user can run diff or anything else and merge the changes properly.
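The renaming rule during resync can be sketched like this. It is only a toy model: files are plain name -> content dictionaries, the function names are made up, and the date format used for $date_of_sync is an assumption of the example.

```python
import datetime


def conflict_name(orig_name, sync_date):
    """Name for the client's copy of a file which differs from the server."""
    # The %Y-%m-%d format is an assumption, the design only says "date of sync".
    return "%s-%s" % (orig_name, sync_date.strftime("%Y-%m-%d"))


def resync(client_files, server_files, sync_date):
    """Merge the client's cache into the server view, keeping both versions."""
    result = dict(server_files)
    for name, data in client_files.items():
        if name not in server_files:
            # Created offline and unknown to the server: goes there as is.
            result[name] = data
        elif server_files[name] != data:
            # Both sides changed the file: keep the client's copy aside.
            result[conflict_name(name, sync_date)] = data
    return result
```

After such a resync the user sees the server version under the original name and the local version next to it, ready for diff and a manual merge.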
The number of use cases for this excellent (imho) functionality is extremely large...
There is a problem though: the client's memory is limited, and eventually writeback will
start pushing data to the server, so for such cases the client has to be able to cache not only
in memory, but on disk too. That is a future extension though.
An anonymous reader dropped me a note that such behaviour of locally cached files,
where the inode number changes after a resync with the server, will be frowned upon by some
RSBAC systems.
I believe that an inode-only approach is broken because of serious problems with filesystems
where a file can be changed by different clients. It is possible to remove a file and then
create a new one, and it can get the same inode number as the one just removed, so without knowing
the name of the file such a system will be screwed. And how does such a system work with hardlinks, which
have the same inode number as the target object, but different names?
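The hardlink part of the argument is easy to check from userspace. The small sketch below (the function name is made up for the example) creates a file and a hardlink to it in a temporary directory on a local filesystem which supports hardlinks, and compares the inode numbers reported by stat().

```python
import os
import tempfile


def hardlink_shares_inode():
    """Create a file and a hardlink to it, compare their inode numbers."""
    with tempfile.TemporaryDirectory() as tmpdir:
        target = os.path.join(tmpdir, "target")
        link = os.path.join(tmpdir, "link")
        with open(target, "w") as f:
            f.write("data")
        os.link(target, link)
        # Two different names, one inode.
        return os.stat(target).st_ino == os.stat(link).st_ino
```

So an access control decision made purely on the inode number cannot tell the two names apart.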
assert youKnowWhatYouReallyWant == true;
if (iAmWritingForPersonalUseOnly()) {
    if (iWantAReallyNewParadigm()) { // actually you'll get some irreversible brain damage.
        try {
            return "Huskell"; // dude, I really mean the DAMAGE!
        } catch(ECriticalBrainFailure e) {
            if (preferDotNetWorld()){
                return "F#"; // it's the same as Gb, ain't it?
            } else if (processorCount() >= OH_SO_MANY) {
                return "Erlang"; // start thinking in 1000 threads
            } else if (preferPunctuation() == STRONGLY){
                try {
                    return "J"; // APL needed a transliteration -- and got it
                } catch (EBrainOverload e) {
                    return "K"; // better have a bank hire you soon!
                }
            } else {
                throw new RethinkParadigmException();
                // you should have better selected Haskeel before
            }
        }
    } else {
        if (isDynamicTypingOk()) { // hey, everyone wanna be a cool geek today.
            if (cannotLiveWithoutCurlyBraces()) { // well, who can ?!
                return "Ruby"; // it's Python done better.
            } else if (enjoyIndentation()){
                return "Python"; // it's Ruby done right.
            } else if (shizophrenia->isOK()){
                return "Perl"; // all the expressiveness and imprecision of a human language.
            } else if (sourceCodeConceptIsObsolete()){
                return "Smalltalk"; // ever modified the value of True -- on a live system?
            } else {
                throw new LameException("PHP5"); // stick with this, los^W poor dude
            }
        } else { // static typing obviously
            if (isManagedOk()) { // let PC do some job for me, they are so smart nowadays.
                                 // Sick of doing everything myself.
                if (preferJavaWorld()) { // die, MS, die!!!
                    return "Scala"; // huge, really huge. Must be inspired by Noah's Ark.
                } else if (preferDotNetWorld()) { // stuck on Windows, ha?
                    return "Nemerle"; // kazalos' by... oh, not again...
                } else {
                    throw new IsThereReallyAnythingElseException();
                }
                // computers will eliminate the humankind if they get enough control.
            } else if (unmanagedOnly()) {
                return "D"; // get a whole new language with every new release. Great fun.
            } else {
                throw new YouWantSomethingStrangeHereException();
            }
        }
    }
} else {
    return "Do Whatever Your Boss Says To And Keep Your Mouth Shut Programming Language";
}
I only know a bit of C; some time ago I tried Java, and I once knew what C++ was...
I think I'm living outside of this new and shiny world of programming, and that's cool.
I think I've just designed a way to fix the problem with
overlapping inodes on different clients, or on the server and its clients.
Here is a short problem description: when a client locally creates some
object, it has to assign a unique number to its inode and put it into
the global hash tables. With a local cache and for maximum performance (or when the client
is offline) it should not connect to the server and perform the create operation
at all; instead it should pick some number for the inode and work with it.
The problem is that several clients can pick the same inode number for
different inodes, or end up with the same object under different inode
numbers on different clients' machines.
When the clients and the server have to sync their states, a problem arises: the server
does not know about an inode with the client's number and thus the sync cannot happen.
The solution is quite simple imo, and it solves both the cache coherency problem and
the inode number one.
Clients use any numbers they like: for example, a sequential increase from zero.
When a new object is created, its parent is marked as dirty by the client (if it is already
marked as dirty by another client, that one is forced to push its changes to the server,
which will then forward them to the new client), and the client uses its own inode numbering
scheme. When later there is a need for a resync (like forced writeback or the above case
of cache coherency synchronization), the client sends the inode content to the server
together with both the name and the local inode number. The server then creates an object and assigns
a real unique inode number to it, which is returned back to the client. The client
removes the inode with the old (local) number from the hash and inserts it back with the new
inode number. That's all.
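The renumbering step can be shown with a small userspace model. It is a sketch under stated assumptions, not pohmelfs code: the Server and Client classes are made up, the inode hash is a plain dictionary which maps an inode number to a name, dirty-parent handling is left out, and the server numbers start at 1000 only so that they do not clash with the local ones in this toy.

```python
import itertools


class Server:
    """Hands out the real, globally unique inode numbers."""

    def __init__(self):
        self._next_ino = itertools.count(1000)
        self.objects = {}  # name -> real inode number

    def create(self, name, local_ino):
        # local_ino only identifies the request; the server picks the real one.
        if name not in self.objects:
            self.objects[name] = next(self._next_ino)
        return self.objects[name]


class Client:
    """Numbers its inodes locally and fixes them up during sync."""

    def __init__(self, server):
        self.server = server
        self._next_local = itertools.count(0)
        self.inode_hash = {}   # inode number -> object name
        self.unsynced = set()  # local numbers not yet known to the server

    def create(self, name):
        ino = next(self._next_local)  # no network round trip here
        self.inode_hash[ino] = name
        self.unsynced.add(ino)
        return ino

    def sync(self):
        for local_ino in sorted(self.unsynced):
            # Remove the inode with the local number from the hash...
            name = self.inode_hash.pop(local_ino)
            real_ino = self.server.create(name, local_ino)
            # ...and insert it back under the number the server assigned.
            self.inode_hash[real_ino] = name
        self.unsynced.clear()
```

Two clients which both created inode 0 offline end up with different real numbers after sync(), because only the server allocates those.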
Simple. And it allows working with any filesystem on the server side, because the system
uses both the object name and the object id (inode number) as identifiers at creation time.
So far I do not see any drawbacks in this approach, but practice will show whether it is
the correct design or not. Stay tuned.
We celebrated Grange's
birthday in a small wooden cottage somewhere in the countryside,
where we had a Russian bath, bathing in an ice-hole, shashlik cooked
in the frost, a musical jam (electric guitar, a couple of tom-toms and a saxophone),
snowballs and wrestling, and the main thing: lots of friends.
Those were bloody cool days, thanks a lot for organizing that meeting!