|
|
About
TODO
Blog
RSS
Old blog
Projects
Gallery
Notes
Fri, 31 Aug 2007
New release of the distributed storage subsystem.
It includes security attributes, a lot of documentation
and (purely internal change) per-sector bitmask of the dirty
sectors instead of per-page in the mirror algorithm.
One can find sources of the new userspace utility, userspace documentation
with several setup exmaples and kernel patch on the project
homepage.
Enjoy.
/devel/dst :: Link / Comments (2)
Thu, 30 Aug 2007
Security attributes have been added to DST.
Now it is possible to assign list of address, which are
allowed to connect to local exported node. Also added permissions
bitmask, which so far only includes reading and writing. This permissions
are set in above security arguments.
Another DST TODO item
has been closed.
I think I will write documentation update (overall design bits and userspace usage examples)
tomorrow and release new version before KS (I move there Sep 4, will be in London about 12:20
local time). I will leave netlink configuration and ability
to save state bitmask of the mirror node to the next release (although can implement the latter,
it is not that complex task).
/devel/dst :: Link / Comments (0)
Wed, 29 Aug 2007
Climbing evening.
That was quite lazy training.
Although I completed quite a few traverses, couple of interesting boulderings
and started new exercises on the negative slope to flush the power (climb without
insurance rope several meters up and down in a loop until fall without the power)
I managed to complete couple of simple traces on the negative slope with bottom
rope.
Actually it was a good training, but somewhat uninteresting. There was no spark,
so a bit uninteresting.
/life :: Link / Comments (0)
DST has been moved from per-page to per-sector bitmap.
One of the DST TODO items
is closed.
/devel/dst :: Link / Comments (0)
Tue, 28 Aug 2007
Why LDPC (and similar probabilistic) codes are not suitable for redundant storages.
Just because they operate on bits,
and have probabilistic nature of the recovery process. Getting
huge size of the word to be encoded and sparseness of matrix, there is a possibility
to create a noise, which while being perfectly in the recovery range of the code,
still can not be recovered. Fixed blocksize codes (like Reed-Solomon) are very
constrained to how code was created, i.e. there is no possibility to add
new checks after data was encoded, so it can not recover from errors more than
level defined in encoding algorithm (like RAID5 can only recover after one erasure
failover and RAID6 after two).
LDPC codes can be suitable for flow encoding of the data, which does not require
guaranteed recovery (like voice data in GSM telephony), but for guaranteed recovery
it must be coupled with fixed blocksize code (like Reed-Solomon or any other similar).
Another reason is that this codes are bit-based. There is no possibility to split
data and checksum into bytes or other words and store them independently,
and then recover if one or another set has failed, to implement this LDPC code operations must be
performed over galois field GF(2^w), where 'w' is size of the elemental particle, which can be byte
or word or whatever is needed. Operations over galois field of order 'w' require
either O(2^w) memory overhead or system of 'w' equations to be solved, which is slow.
One can find a bit more on my implementation and tests of the LDPC codes
here.
/devel/dst :: Link / Comments (21)
LDPC iteractive decoding.
This is low-dencity parity check code iteractive decoding algorithm presentation.
Original image was 50x50 bitmap 'transferred' over gaussian (in that degree,
how glibc random number generator (rand()) matches) channel,
where 10% of the noise (swapped bits) were introduced. Rate of the 'transmission'
is 0.5 (i.e. only half of the channel was used to transfer the data, and half for the parity bits).
Code uses hard decoding algorithm, which stops after 4 iterations, since all errors are detected.
Algorithm for matrix generation (takes most of the processing time)
is a bit buggy (can crash, what do you want, I spent 3 hours Sunday night to write it :),
but it does not matter for this presentation.
One can find original presentation by MacKay and Neal (1995)
here,
where authors fully recovered 10000 bits image after 'transmission' over 7.5 % noise channel.

According to my study, LDPC codes and any other similar (let's call them 'probabilistic') codes
can not be used for any reliable transfer, since probability of the error detection is never equal
to one. So, for some matrix some noise can be fixed, but different noise will not. There will be
matrices which will fix at least number of errors, if I understood LDPC correctly, but its generation
is generally very complex (my algorithm tries to automate that a bit though), and I failed
to find a precise description of the error-recovery rate (i.e. how many bits can be 100% recovered
by given matrix (set of weights and word/checksum sizes)). So, for guaranteed transmission some kind
of combined algoritms must be used - LDPC or any other probabilistic code to detect some
errors (as much as possible) and then provide given image to fixed blocksize decoder (like Reed-Solomon),
which guarantees recovery of number of bits, specified during encoding time. Such combined algorithm
will behave better than any of its parts (LDPC and RS separately), since because of probabilistic
nature of the LDPC code, number of errors will be detected, so that error rate becomes small enough
for RS decoder.
One can find source code for LDPC encoder/decoder (code.c),
image to bitmap translator (file2bit.c) and bitmap to picture viewer
(gen_images.c) in archive.
Last two require GTK development library.
I've also created a homepage for this project.
/devel/math/codes :: Link / Comments (0)
[:||||:]
:)
/other :: Link / Comments (0)
Mon, 27 Aug 2007
Climbing evening.
That was really good one - although Grange was late (for a hour),
I managed to complete several interesting traces on the negative slope,
although I fell many times, there is a progress. Number of various traverses,
power endurance exercises on the negative slope and I was completely turned off.
Excellent time.
/life :: Link / Comments (0)
LDPC code presentation.
I'm cooking up a nice LDPC code interactive decoding presentation.
It will be quite small (image is about 50x50 pixels), but I will increase pixel
size on the resulted image.
Presentation (iteractive gif) will show several steps of the image after it was
altered by guassian distribution of the noise.
Stay tuned, it will be ready tomorrow.
/devel/math/codes :: Link / Comments (0)
Fri, 24 Aug 2007
A job.
I just decided to check what knowledge modern russian search engines request
from attendees. That was very fun and contrary very sad - I failed to answer whole set of
questions just because I can not recall some bits, and expect that searching
for the answer is unfair.
But there are some interesting questions, which I would definitely like
to answer.
For example what multiplexing system would you use to handle more than 1000 clients
for the web server. There are number of answers: poll, select, something else
and no need to multiplex anything. I have at least two answers (unfortunately they do not
request a sample code) - something else (guess what? Of course reinvent the wheel
and write my own) and no need to multiplex anything, with quite strong ground.
Another interesting question is about data sharing: you have many terabytes of data and
need to answer a question about how to distribute it between many servers so that
searching would take resonable time, and what will happen if number of them fail.
I think if I will answer, that I will write a distributed storage in kernel, people
will laugh on me.
In one of the questions for Unix system programmer there is a subquestion about possibility
to write fork() for Windows, and mail-system developer, which must know Oracle/MySQL.
There is a question about port scanning - if you have a system with infinite number of sockets,
descriptors, processes and very fast interenet channel and need to scan huge number of ports on the remote
systems, how long it will take if scanning for one port takes a second. My first question
was about which local system is being used, since in one system it is possible to bind many
sockets to the same port, if remote tuple is different, but in another it is impossible, thus
it can only handle 64k connections in a time.
Many questions are about parallel computing in distributed environment. I expect recruiters
want to hear about some kind of message passing algorithm like MPI, but I first thought
about transferring not data request, data processing, so intead of doing 'calculate this
stuff on your data and give me back result so that I will send it to another client'
it would be possible to start a new process to do this stuff, which would migrate to remote
system, process all needed data there without sending its parts to main host and back,
and then provide result to the some client. This is just a though, but given how simple
is rescheduling (well, not _that_ simple, but I managed to write it myself in userspace
in M:N threading model
project), it would be possible to implement as a research project.
So, generally I've found, that I will fail for all positions, since in each vacancy I did not
know some questions, and I like that - I do not want to know everything - I like to get new knowledge.
/devel/other :: Link / Comments (0)
Serious DNS outage.
Now it is fixed.
/other :: Link / Comments (0)
Thu, 23 Aug 2007
Distributed storage in Linux Weekly News.
I've finally subscribed to LWN, so right now I'm reading an article.
Here is a start:
Evgeniy Polyakov is not an easily discouraged developer. He has been the source of a great deal
of interesting kernel code - including a network channels implementation, an asynchronous crypto
framework, the kevent subsystem, the "network tree" memory management layer, and the netlink connector code.
Of all of those patches, only the netlink connector has made it into the mainline kernel - and that was
back in 2005. Undeterred, Evgeniy has come forward with another significant patch set for consideration.
His ambitions are no lower this time around: he would like to replace much of the functionality offered
by the device mapper, iSCSI, and network block device (NBD) layers. He calls the new subsystem distributed
storage, or DST for short. The goal is to allow the creation of high-performance storage networks in a
reliable and easy manner.
Jonathan was likely kidding when wrote that :)
As of my projects, entered kernel, I can add
that netlink connector
code was a 'parent' for generic netlink (genetlink) project, created by Jamal Hadi Salim and Thomas Graf.
And I'm also author and maintainer of Dallas one-wire
bus in Linux kernel.
So, yes, sometimes I do generate some interesting ideas, especially if we will not recall
kevent project.
That was just to add some clarity :)
As a conclusion, Jonathan writes:
This patch set is clearly in a very early state; quite a bit of work would be required before it would be
ready for production use with data that somebody actually cares about. Like all of Evgeniy's patches, DST
contains a number of interesting ideas. If the remaining little details can be taken care of, the DST code
could eventually reach a point where it is seen as a useful addition to the Linux storage subsystem.
He is right of course, there are number of tasks to be completed for this project, but without interest
from other parties (I only say about interest, about discussion of the idea and implementation,
and not about anything else), it is quite hard to determine the right direction.
So far TODO list includes:
- move to per-block data bitmap instead of per-page
- implement optional saving of mirroring/linear information on the remote nodes
- implement netlink based setup
- implement simple security mechanism - like NFS export list (list of IP addresses allowed to connect to given node)
- new redundancy algorithm (low-prio, since I'm only studing them)
Stay tuned.
/devel/dst :: Link / Comments (2)
Kernel summit attendees.
The invitee list
has now been published.
/devel/other :: Link / Comments (0)
Wed, 22 Aug 2007
Climbing evening.
That was really hard one - even warming traverses were turned into
quite complex traces. Climbing on the negative slope with the bottom rope insurance
was initially started as quite simple trace, but then I aded climbing down and up
without the rest over all holds on the negative slope, which sucked all my power,
so I only completed couple of really simple old traces and went to have a rest. getting
into account that this day I started at 6 o'clock, that was really hard time. But it was good.
Just damn good.
/life :: Link / Comments (0)
LDPC codes.
Ok, I've completed hard-decision decoding algorithm,
with randomly generated matrix (if we are lucky) it can safely decode small
words with multiple bits errors - for example word of 16 bit with
3-bit weight of the random matrix it is possible to decode at leat two bits of
errors even if one of them is in transferred checksum itself.
Unfortunately random matrix generation for larger word sizes
does not work (at all). LDPC generation matrices must obey at least
one rule - it must be invertible. Number of 1's in each column depends on
number of 1's in each row, which is known as weight.
So, to correctly decode damaged message one needs
to have correct matrix, so I need to think about a good algorithm of
creating it. In the regular (simpler) case, matrix has
the same number of 1's in each row and column. The simplest case
is wrapping sequence of bits, like this matrix with weight 3:
1 1 1 0 0 0 0 0
0 1 1 1 0 0 0 0
0 0 1 1 1 0 0 0
0 0 0 1 1 1 0 0
0 0 0 0 1 1 1 0
0 0 0 0 0 1 1 1
1 0 0 0 0 0 1 1
1 1 0 0 0 0 0 1
And so on, I will check if it works good.
/devel/math/codes :: Link / Comments (0)
Yahoo's Haddop.
Everyone becomes crazy about Hadoop
project, since it looks like it is the only answer for the world on how
to create distributed systems.
No, it is not a panacea, it has very limited usage area - it can only work
with batched data sets, which does not require modifications and/or shrinking.
At all. Underlying filesystem even does not support anything besides streaming write -
no overwrite or erase. Write once - read many, this is main idea behind Hadoop.
But it does its work very good, which is why Yahoo (and Google too) uses it in its
web indexing servers.
/devel/dst :: Link / Comments (0)
Tue, 21 Aug 2007
Monday is over.
My president.
/other :: Link / Comments (0)
Lightning without thunder.
Interesting, was it lightning or Google-maps took shots with a flash?
/other :: Link / Comments (0)
Mon, 20 Aug 2007
Climbing evening.
That was a good training - not excellent, just good.
I completed several old quite simple traces (about 6a) on the vertical
wall to warm up, but fell too many times on the negative slope with the bottom rope.
Maybe it was a result of a hangover, but I did that trace on-sight (although without fixing the rope)
with much smaller number of falls.
Fun part was rope testing I decided to perform - so I climbed several meters high and then jumped
to some higher hold, but did not get it and instead intentionally fell couple of meters down.
Rope held my jumping body without any problem, so I climbed further (although already without much
power in the arms).
Anyway, I tired quite strongly, and that yb itself is a good sign.
Good time.
/life :: Link / Comments (0)
Gallager codes.
Ok, I read number of papers and know kong-fu now - I understood main principles behind LDPC codes.
Main design principle is to send a redundant message over the channel,
which contains a codeword and a checkword(s). The latter is created by multiplying
generator matrix with a source codeword, usually it is very big (thousands of bits),
but matrix is very sparse, so number of operations is not very big.
Generation matrices must obey some rules, main one is that it must be invertible,
otherwise one row can be selected as linear combinations of others, and thus
appropriate information bit will be lost.
Decoding is quite simple - the most powerful algorithm is to use belief propagation -
each check node in generation graph (or generation matrix, which is just a different
representation of the same entity) sends to codeword nodes what they belief a valid bit
with calculated probability, after error probabily that given bit
is zero or one becomes less than requested number (according to Shannon's theorem
code, which allows to reduce error probabily infinitely,
always exist for given rate and communication capacity), codeword calcualtion (decoding) is completed.
There are two mechanisms - hard and soft decision algos, the former is simpler,
but the latter frequently is faster. Let me show an example of the hard decision algorithm
(gotten from "LDPC Codes a brief Tutorial" by Bernhard M.J. Leiner,
although description there is far from being perfect).
Let generation matrix to be this (not very sparse) set:
0 1 0 1 1 0 0 1
1 1 1 0 0 1 0 0
0 0 1 0 0 1 1 1
1 0 0 1 1 0 1 0
And source codeword is:
1 0 0 1 0 1 0 1
Check word, calculated my multiplication of matrix and codeword is:
0 0 0 0
Let's during transmission of the codeword and check word over the channel
codeword was changed to this (secod bit changed):
1 1 0 1 0 1 0 1
Here is a generation graph (originally proposed by Tanner) of the given matrix:

Here starts decoding algorithm, where each codeword node C first sends its bit
to each check node F.
F0 node will receive 1 1 0 1 bits from C1, C3, C4 and C7 accordingly.
F1 node will receive 1 1 0 1 bits
F2 node will receive 0 1 0 1 bits
F3 node will receive 1 1 0 0 bits
Next step is to calculate the answer for each code node.
Received check word is 0 0 0 0 (calculated above), so set of simple equations starts here.
Each check node F gets three out of four received bits and XORing (summing modulo 2, since this example
works in Galois finite field of power of 1 - GF(1)), and sends to the codeword node
a bit it expects to be correct to satisfy received check bit.
Here is an example for first check node:
X0 ^ 1 ^ 0 ^ 1 = 0. X0 = 0
1 ^ X1 ^ 0 ^ 1 = 0. X1 = 0
1 ^ 1 ^ X2 ^ 1 = 0. X2 = 1
1 ^ 1 ^ 0 ^ X3 = 0. X3 = 0
Then we send Xi to Ci code bits. After all check nodes are processed, codeword nodes has following set of bits:
C0: 0 from F1, 1 from F3, 1 from originally received codeword.
C1: 0 from F0, 0 from F1, 1 from originally received codeword.
C2: 1 from F1, 0 from F2, 0 from originally received codeword.
C3: 0 from F0, 1 from F3, 1 from originally received codeword.
C4: 1 from F0, 0 from F3, 0 from originally received codeword.
C5: 0 from F1, 1 from F2, 1 from originally received codeword.
C6: 0 from F2, 0 from F3, 0 from originally received codeword.
C7: 1 from F0, 1 from F2, 1 from originally received codeword.
Then using the voting for each bit (i.e. which bit has more 'votes' in above table
out of three cases), we get a new codeword:
1 0 0 1 0 1 0 1
The same steps then are repeated until cdeword stopped to change. In our case
we get it after the first run.
Spft decision algorithm usses essentially the same logic, but it operates with probabilities
of the bit to be 1 or 0, each probabilities are recalculated in each run,
and after probability is higher than requested value (or error
probability is less than requested value), loop stops.
Real world examples use much bigger codewords (up to several thousands of bits),
but logic is always the same.
So, I've started initial userspace implementation, if stars are in the right
order I will be able to complete it until move to climbing zone, otherwise I hope
it to be ready tomorrow.
Stay tuned.
/devel/math/codes :: Link / Comments (0)
Sun, 19 Aug 2007
Met with friends.
Fedor and Irin invited friends to theirs new appartment, where I met a lot of old friends.
That was really fun time there, I would definitely like to repeat, but that will wait
until they get all things needed for comfortable living.
Development is essentially completed, but there is no furniture and the like.
I tested Fedor's drill. Quite bad :)
/life :: Link / Comments (0)
Fri, 17 Aug 2007
Climbing evening.
That was excellent training - a lot of traces on the negative slope with bottom
rope, several old and new ones, traces with top insurance on the negative slope -
a lot of hard exercises, and that was really good.
Later me and Grange got some beer and
finixhed evening in even better mood. He even managed to drive much better than usual,
which is likely a fuel economy rather than good practice.
Excellent time.
/life :: Link / Comments (0)
Thu, 16 Aug 2007
Laptop software update.
I have a HP nc6000 laptop, which is quite old and with its 256 mb of ram
is not very convenient smetimes, and is a bit heavy to carry it everyday, but I like it
(interesting note, it had 3 years of wrranty (likely it has not ended yet),
but its price dropped three times after I bought it).
Long time ago I installed Debian Woody there and it worked perfectly ok, but then hard drive
broke, and after replace I installed there Debian Etch and got 100% CPU usage when watching a film
on the full screen (which did not work as a full screen, only in maximized window).
Network adapter (tg3) was also broken - one day after deep hibernate it woke up,
was likely so much surprised about changed world and decided to stop working.
It can not be reset, although pci config space is perfectly
valid for the card and registers seems to be ok (i.e. do not contain 0xff or all zeroes),
but NIC just does not work. Also after I changed disk (in official
HP service center on warranty), rfcomm/bluetooth leds started to work incorrectly.
Today I decided to reinstall the system and tried FC7. Right now it works quite good,
except without network it is a bit unconvenient to work with. For example
default yum repositories are in network, without network it fails to resolve
dependencies (i.e. it can say that you need to install a, b, and c to satisfy d's deps,
but can not install them and fails with python stack trace). When pressing power button,
laptop's screen becomes grey and nothing happens, this can only be fixed by powering system
off pressing power button for a longer time. Hibernate failed I first time tried it in FC7
(system was not able to wake up and was frozen after X were recovered).
In Debian it was ok, but sucked all battery power in quite a short of time.
I will eventually test it in FC7, but not now. Network (tg3) does not work too, likely it is
hardware problem. Video can be watched perfectly ok (even big mpeg4), which is good, but otherwise
things are quite bad.
/devel/other :: Link / Comments (3)
Wed, 15 Aug 2007
Road rules in Russia.
Yep.

Via Maxim Kononenko.
/other :: Link / Comments (0)
Back to ...
... userspace for writing redundancy self-correction codes.
First, I consider Gallager codes (LDPC codes) and to study sparse graph theory.
Just becuse I still can not understand some bits of WEAVER codes because of lack
of documentation and very specific notation in existing.
My initial implementation of encoder and decoder will be in userspace.
To recall how hardware works, I will use it (if will not very lazy) with self-made
optoelectronic link made on top of laser diodes, although do not yet
know where to find a GPIO links, but that is already implementation details.
I will use optic line over the air instead of radio transmission schemes as a first
step in my robot implementation.
Stay tuned, codes bits should not take too long, since Gallager codes and sparse
matrices are well described in papers.
/devel/dst :: Link / Comments (4)
Tue, 14 Aug 2007
New release of the distributed storage subsystem.
I've just completed testing of the following complex setup,
which I think is a good milestone for the project.
One system has 10 local devices, each of which is a mirror device on top of two remote systems.
All 10 devices are combined into single linear device, which is exported
as a local target to remote machine, where that device was added into linear device
with another remote system, and resulted device was tested by performing different IO
tasks.
So, roughly it looks like this:

Main feature of the new release is mirroring to any number of devices with automatic
reconnection to remote nodes when it is ready after a crash and resync. Reconnection is a feature
of the distributed storage core and does not depend on algorithm being used. Resync bitmap
can have various size, so with some compilers it might not compile, I'm considering
to move to per-block bitmask in the next release. Because of this feature resync can be inaccurate
(i.e. partial if less than a page was written to the device before resync is completed),
and thus it will be rescheduled.
Another feature for the next release is documentation (excertpts from blog mostly and overall
design, since it looks like (quite big for me amount of) comments in code is not enough).
/devel/dst :: Link / Comments (2)
Strange math in kernel.
During heavy testing of the mirroring code in the distributed storage (200 remote nodes and one local form a mirror,
which is pretty slow just because 200 nodes are separated between two physical devices,
each one has 100 process accessing each own's file in random order, I can not test 1000 nodes
since it will be too slow, try to put 1000 active readers/writes over 1000 5mb files
in your desktop, for 100 processes my LA is about 77) I found following problem.
To select an index in the dirty bitmap code gets a sector and devides it to chunk size,
so I calculate this:
request start: 7 segments
rest of the data: 0 segments
chunk size: 8 segments
index: (7 - 0) / 8 = 2305843009213693951
Which is 0x1fffffffffffffff. It happens quite rarely, and this problem
does not hurt anything, and in future I plan to move to sector-sized chunks, instead of
dynamically sized, so dividing will be eliminated, but right now it sometimes fires
in dmesg.
/devel/other :: Link / Comments (0)
Kevent has been removed from kernel summit agenda.
As long as other event delivery mechanisms, seems things settled
down and everyone is happy.
/devel/kevent :: Link / Comments (0)
Spam.
I got a mail with following header:
POSIX and up able to recognize all three formats, and let older GNU tar fade out slowly.
I also regulary get excerpts from clasical literature (in english and russian).
What next, parts of kernel code?
Although it failed to get through spam filters...
/other :: Link / Comments (0)
Mon, 13 Aug 2007
Beaten to meat.
That is my feelings after the climbing training. Completely to meat.
I completed number of traces on the negative slope with the bottom rope,
this time with small number of falls, and thus tired more than usual.
It looks like I found why I perform bad with the bottom rope - it's a fear. A fear
to fall, so I fix the rope in very unconvenient positions, which sucks a lot
of power, so eventually I fall. Some time ago I did not fear to fall, I even like
to 'fly' now, I will get that back - let's see how it will come.
/life :: Link / Comments (0)
Sun, 12 Aug 2007
More on DRBD.
I've just read its Linux Euroconf
paper.
DRBD is not RADI1 at all, it is a single storage, which
mirrors all write requests to remote node. So until resync is completed,
you are not allowed to read (and write) to given blocks, you have to reconnect
to slave node when master one is dead, two-nodes-only setup (commercial version has a support
for a bit more easy three-node setup).
Contrary distributed storage allows to form a device on top of any number of remote and local nodes,
reading is performed from the 'nearest' node (in terms of the current head position),
writing is always mirrored to all nodes and if one of them has failed, reconnection will happen automatically.
Resynchronization (in testing stage, not yet 100% ready) happens when system reconnects to remote node.
Resync infomation is not stored on devices itsefl, which allows to mount them independently,
but also means that after one node is dead and not yet completely recovered and main node crashes,
nodes will contain different data and it is up to administrator to decide which one is ready.
In future versions I can add possibility to save sync bitmap on the storage itself (this can be simple,
but this removes possibility to mount remote node's data as standalone system).
If main node has failed, the array is destroyed and one needs either to form new one
on the different node or just connect to another node, but if main node has crashed,
application is dead too...
So, DRBD does not look similar to DST, this is a different project with own aims and results.
/devel/dst :: Link / Comments (0)
Die Hard.
All four films were produced by John McTiernan.
John McFuckingClane.
/other :: Link / Comments (0)
Compilers.
This thread about
using volatile (which I consider as a bad hack) in atomic operations
forces me to think about why the hell there is only one compler in the opensource (and
actually the rest too) world?
Thinking...
Not that I'm going to start writing new compiler tomorrow, I basically do not have
enough knowledge (likely, but I did not even try) now, but I'm thinking, why not to fill this gap.
There is number of projects (start from security systems, the simplest case is OpenBSD) which
do want a compiler, which would allow to easily add own security frameworks, and to have a
competitor for benevolent dictator in compiler land. I think the compiler, which can compile
Linux kernel, is a good development milestone.
/devel/other :: Link / Comments (4)
Mercury rising.
Just watched a film - very interesting, why the main bad guy,
NSA colonel btw, has russian (or slavic) surname Kudrov...
Day of old films - of course all "Die Hard"s.
/life :: Link / Comments (0)
Blues and jazz pubs in Moscow.
I'm looking for small place with live blues and/or jazz music in Moscow.
If you know interesting one, please drop a line in comments.
/life :: Link / Comments (2)
Let it be a day of the rest.
I got bottle of red Chile merlot wine "Canto de flora" and decided to devote the whole
day to lazyness - I did not do this quite for a while already (I do not even
recall when I spent the whole day not doing something), and now I feel myself
good.
Have a nice day.
/life :: Link / Comments (2)
NFS directory reading speed.
Here is a graph of ls -l of the NFS exported and directly (ext3).
Gap is huge.

I tried different NFS export flags (sync/async), mounting it with and witout locks - result is essentially the same.
/devel/other :: Link / Comments (6)
Sat, 11 Aug 2007
Initial implementation of the resync code is ready.
It does not yet completely resyncs, but already fetches data
from remote node according to notsynced blocks bitmap. There is a small
issue with recovery - each block fetching completion can happen
in interrupt context (when data is fetched from local node),
so it is impossible to requeue request there, since generic_make_request()
can sleep, so this work must be postponed to worker thread
and processed requests must be added into backlog queue in the ready
completion callback. After this task is completed, I will release
new version of the distributed storage subsystem with mirroring
support between arbitrary number of nodes.
There is a problem with local storage - when block reading fails for local node,
right now there is no way to repeat the same reading from different node,
so upper layer must repeat it itself (error is retured). Reading failures from remote nodes
are handled correctly. Writing is always performed for all (currently accessible) nodes.
I also want to test how it will scale to thousand of nodes in linear
set for example and mirror of each node to another couple to make things harder.
Likely it will happen tomorrow. Stay tuned.
/devel/dst :: Link / Comments (5)
Fri, 10 Aug 2007
Interview with Chris Mason about btrfs.
One can find it here.
Main issues are stage of the project (early alpha, although Chris uses btrfs for mail spool,
but with mail config, which mirrors mails to stable storage), featule list
(no ACL and extended attributes, non-scalable locking, number of other major issues),
timeline (on-disk format stabilization this year, usable and tested by users early in 2008),
Chris wants btrfs to be default Linux filesystem in a near future, he says, that btrfs
is not ZFS competitor, but only other linux filesystems.
Consider to start my own filesystem, wich such timings I can compete.
Thinking...
/devel/fs :: Link / Comments (2)
Thu, 09 Aug 2007
Initial mirroring implementation entered testing stage.
So far it does not support sync on recover, it only puts dirty
bits for missed chunks in the per-node bitmap and continues to work
with other nodes in the mirror.
Code supports any number of devices in the mirror.
Fancy reconnection logic allows to turn off any mirror nodes, and later system will reconnect to them
and start writing data to. Sync will be done at reconnection time.
As a bonus, one can remove mirror from the storage (physically disconnect a wire for example)
and mount it locally. This works in the simplest case without linear mapping on top/under the
given node of course (since otherwise system state is distributed between given
node and some others), such functionality can be used as live backup - just mount it when node is offline,
and all your data is there.
Resync code will select first node, which has given chunk uptodate,
and fetch data from it.
Stay tuned.
/devel/dst :: Link / Comments (0)
Wed, 08 Aug 2007
Climbing evening.
That was a really good training - some old simple traces
for warming, number of traces on negative slope with bottom rope.
Everything was good except one small problem - I was asked
to tread out new climbing shoes, which were about my size
according to label, but about 1.5 santimeters smaller than
my own, so it was a bit painful (read: it was so fscking painful,
that I wanted to scream at the end of the training and wan not
able to move right when was in that shoes), which
made training hard as hell, but I managed to survive.
I'm actually quite surprised, in what conditions human can continue to be
pleased of surrounding conditions, feelings and life in general.
/life :: Link / Comments (0)
Mirroring in distributed storage.
I know, everyone wants it, and I have a work-in-progress patch, which implements some bits of it.
Btw, previously released distributed storage patches leaked with wrong
changeset, so it contains a bug, which does not allow to use local nodes,
i.e. nodes with locally attached devices. My fault, sorry.
Anyway, back to mirroring - this is what can open a door for wide adoption
of the distributed storage,
especially if this will have ability to work with multiple nodes in a mirror.
My initial implementation allows that, but it does not store sync state on devices
itself (do not know if needed at all), so if one of the nodes was broken and then the whole device is removed,
there is no info which nodes contain correct data. There is also no resync if device node
was removed (i.e. really removed from the storage), but it is possible
just to kill remote node and all missed data will be sent to it,
when it will be back after failover recovery reconnection,
but depending on situation this can stop writing to another nodes (if node was disconnected
during sync for example).
Basically only small part of the needed features is implemented, I would even say
that almost nothing, but nevertheless I started (yesterday evening).
Devil on the shoulder says me to start own small business out of my projects. I would
start long ago (probably with netchannels),
but need to work where I work right now. Working on contract with other parties
seems to be a good idea.
Anyway, the main issue I think quite for a while already is how to shut this devil up,
which wakes up each time I create something interesting.
So, the nearest roadmap is to implement and test mirroring, and then shut the devil somhow
(out of curiosity, what is a price to hold a closed joint-stock company per year in Russia).
Stay tuned.
/devel/dst :: Link / Comments (0)
Device throttling.
In really big systems it is possible to havea huge amount
of block IO requests pending, which can lead to deadlock,
if they require additional allocations (which can happen
in distributed storage for example, since it might require
additional network packet allocations), so some kind of device
throttling must be implemented. Google guys use device specific
mechanism (own semaphore) in theirs Zumastor
prject, but I decided to implement it in block layer, so that
any other users could benefit.
My morning patch allows to limit maximum amount of queued bios
per physical device. By default it is turned off and old block layer
behaviour with unlimited number of bios is used. When turned on (queue
limit is set to something different than -1U via blk_set_queue_limit()),
generic_make_request() will sleep until there is room in the queue.
number of bios is increased in generic_make_request() and reduced either
in bio_endio(), when bio is completely processed (bi_size is zero), and
recharged from original queue when new device is assigned to bio via
blk_set_bdev(). All operations are not atomic, since we do not care about
precise number of bios, but a fact, that we are close or close enough to
the limit.
Tested on distributed storage device - with limit of 2 bios it works slow :)
/devel/dst :: Link / Comments (0)
Tue, 07 Aug 2007
I as bloody wrong.
It is impossible to create an input from fully completed
workspace, so my crack is no more than a simple case, which
does not allow to form an input for given digest.
Instead new equation must be included into the set, which
holds the connection between input and workspace bits.
Back to drawing board.
/devel/math/hash :: Link / Comments (0)
Mon, 06 Aug 2007
Climbing evening.
That was quite good training, but since I did not have a rest aquite for a while
already, I quickly tired and was not able to climb as usual, so failed even on
quite simple traces, which would be completed without problems otherwise.
After number of new traces on the vertical wall I recalled some simple old ones
and eventually even completed one trace on the negative slope.
Although I was too tired to do that good, it was a good training.
/life :: Link / Comments (0)
Breaking Enigma code or cracking SHA1 hash for fun.
Do you recall my intention to crack this hash.
Well, it is first 20 rounds of the most widely used cryptographic digest called SHA1.
SHA1 contains 80 rounds.
I found a simple way to form a workspace, which, after being processed, results
in the given hash value, i.e. algorithm takes needed hash value as parameter and
creates something used as input, hash of which is exactly the same as requested.
This is not a complete crack of the reduced SHA1 algorithm yet, since
workspace (80 bytes for the first 20 rounds) must be turned into 64 bytes of input,
but it is not that complex task.
I do not know if this algorithm will work with full sized SHA1 (80 rounds instead of 20),
but right now I do not see any problems with it.
Here is an example workspace data:
Input data:
5e ca 9b e6 38 cf cd 33 41 cf 61 b3 fb cd 39 df 65 87 61 b8 2c 1e 56 ac 69 d7 d0 18 7f 9b 0f a3
9c 13 99 4c c0 08 c2 de 2d ed c2 d5 99 f8 94 57 d7 a1 e2 35 93 73 0c 11 5a 80 5e 80 ff a8 54 fe
digest: 136be2b1 e949ef99 b85caa61 c97e39cc 7c53ccc5
Cracked data:
workspace (substitute W in sha_transform() with this data):
a57d8667 a57d8667 a57d8667 a57d8667 a57d8667 a57d8667 a57d8667 a57d8667
a57d8667 a57d8667 a57d8667 a57d8667 a57d8667 a57d8667 a57d8667 96ccb97c
a1900adc c7d34989 3c218123 b2380816
digest: 136be2b1 e949ef99 b85caa61 c97e39cc 7c53ccc5
So, the last task in breaking reduced SHA1 is to find input 64 bytes,
which after processed by this method:
W[i] = rol32(W[i-3] ^ W[i-8] ^ W[i-14] ^ W[i-16], 1);
results in found above workspace.
I do not know, if this can be considered as SHA1 crack, my goal was to complete
it upto exactly this point, i.e. break reduced to 20 rounds SHA1 algo.
Although I will ask Bruce Schneier tomorrow if it is.
Stay tuned, but I will go climbing now.
/devel/math/hash :: Link / Comments (0)
High scalability.
I've found a very interesting resource, http://highscalability.com,
where different scalable systems are described in couple of words with
short note on problems they faced and solutions implemented.
Interesting reading.
/devel/other :: Link / Comments (0)
Mnotify.
Al Boldy requested
an introduction of the new notification system used for delivering
meta information changes (like access time).
He says that although it is possible to use inotify for that, it is too expensive
and unconvenient.
Thinking, how long it will take for someone to introduce new generic enough
event delivery mechanism...
/devel/other :: Link / Comments (0)
Sun, 05 Aug 2007
Appartment development.
I've finally completed the room - there is number of small things to finish,
but generally it is done - I finished wallpaper glueing today, which was
likely the most complex part of the whole process (maybe except
hinging a ceiling), since it is not designed to be performed using only
two hands. During this I decided to paint wallpaper and to
draw something interesting on the one of the walls, so just after room is ready,
process automatically jumps to the beginning.
I have an idea what to draw on the wall, but will not disclose until it is ready,
which actually will not happen very soon - I will start bathroom developing
and when there will be strong desire to move to the shop (all I need I bought
half a year ago), I will not process the room, and since there is no colour,
my painting exercises will not start.
I can not draw, completely, so painting something interesting on the one square
meter or even bigger part of the wall will be a challenge, but I do not care - that's
interestng to proceed.
/devel/flat :: Link / Comments (0)
Fri, 03 Aug 2007
Night views.
Or testing new tripod.
More in gallery (big sizes).


/other :: Link / Comments (1)
Thu, 02 Aug 2007
Fixing bug in Linux network stack.
There was an interesting bug posted
to netdev@ today by user with name John (actually the same one was posted
sevaral times already, but this time he included simple application
to trigger it). It was possible that at the end of the connection,
the last TCP segment was sent with wrong port number,
like in example below:
17:50:43.414212 IP 127.0.0.1.50000 > 127.0.0.1.10250: S 1312601602:1312601602(0) win 1500
17:50:43.452081 IP 127.0.0.1.10250 > 127.0.0.1.50000: S 864201221:864201221(0) ack 1312601603 win 32792
17:50:43.414364 IP 127.0.0.1.50000 > 127.0.0.1.10250: . ack 1 win 1500
17:50:43.452649 IP 127.0.0.1.50000 > 127.0.0.1.10250: P 1:17(16) ack 1 win 1500
17:50:43.452666 IP 127.0.0.1.10250 > 127.0.0.1.50000: . ack 17 win 32792
17:50:43.452735 IP 127.0.0.1.50000 > 127.0.0.1.10250: R 1312601619:1312601619(0) win 1500
17:50:43.564760 IP 127.0.0.1.54076 > 127.0.0.1.50000: R 1:1(0) ack 17 win 32792
As you can see, the last RST segment contains wrong port number 54076 instead of 10250.
This does not break anything actually, but is bad as is,
so I decided to spent ths day helping the world instead of hacking the hash.
Number of tricks I used is more than enourmous - that was debug printks all over the place,
padding red zones to catch overflows, delayed operations and even total rename of given variable.
Bug scenario was only 'detectable' when socket is closed, but there is data unread,
so that RST should (according to RFC 2525) be sent, so it is quite rare condition.
Eventually I tracked it down to the fact, that when socket is being closed,
it already contains wrong port field. Work with timers showed, that in such short lived
connection neither of three TCP timers fired, but processing of the last RST in the
connection above shows that port number is still valied there.
Stumbled. how is it ever possible that nothing happend, but something was broken?
In the MIPT, especially in the physical laboratoris, I was continuously told, that
there are no miracles, and I think that is true, so I audited every single usage
of the port field in the inet socket structure (surprisingly not that many cases, maybe several dozens only),
and eventually found that inet_autobind() fucntion can change given field,
after number of debug prints problem was completely localized and fixed by simple patch,
which checks if socket is really alive and thus requires binding. Problem described above
can only happen for semi-alive socket, when it was partially released (namely its port
value is freed) and thus smells bad, but still can be accessed from userspace (socket
reference itself is not released), so that any subsequent sending call could endup
changing port number by binding to the new port. My simple fix checks if socket is partially
alive and if so, it does not allow sending (which is not allowed anyway later in
tcp_sendmsg()) and does not perform autobinding.
That's all, but it sucked about 6 hours. Do not even know if this number is good or bad.
Bug submitter John reports that this bug exists even in 2.4.0 kernel, and it was referred in
web multiple times, but did not force anyone to fix.
Now it is gone. Even if my fix is not correct, I provided enough information, so that real fix
would be simple.
/devel/networking :: Link / Comments (1)
Wed, 01 Aug 2007
Climbing evening.
That was great training - I climbed on vertical wall and tried new traces -
complex and not very - 6b, 6a+, eventually 7a, but rubbed fingers completely -
that one over tiny holds is really interesting, and I think I can complete it with time.
At the end I completed several on-sight traces on the negative slope, but I was already tired enough
and did that without the rest in between, so second one (although not that complex at all, 6a only)
I fell couple of times, but quite sure I would clearly made that trace if started early.
Also made some old interesting traces, number of traverses - the usual stuff.
I tired as hell, but that worth it. Excellent training.
/life :: Link / Comments (0)
A hash.
You might now this one:
#define f1(x,y,z) (z ^ (x & (y ^ z)))
__u32 W[80];
__u32 digest[5];
__u8 in[64];
for (i = 0; i < 20; i++) {
t = f1(b, c, d) + K1 + rol32(a, 5) + e + W[i];
e = d; d = c; c = rol32(b, 30); b = a; a = t;
}
I've started to think about how to solve this problem.
I expect similar results to my one round Jenkins hash analysis
or complete fail.
You know where to find results - stay tuned.
/devel/math/hash :: Link / Comments (0)
|