Tue, 31 Jul 2007
Do you know what Xen is?
No, not virtualization...
I have not asked for inclusion of the distributed storage system in the kernel.
If there is any interest in that, I will try to push it forward, but taking
into account how things went before, I seriously doubt there will be.
Tomorrow I will either start the mirroring implementation or hack on hash (if nothing
is requested about dst, which is more than likely).
/devel/dst :: Link / Comments (0)
Climbing evening.
There was no good training for the last couple of weeks,
so I had quite a lot of power to spend there. Since Grange
is slacking somewhere, I climbed alone, so there were only traverses, boulderings and starts.
Most of the training was devoted to two complex starts on the vertical wall with
a finishing part after a negative horizontal slope, so this drained my strength noticeably.
I have not been this tired for quite a while already, and that's really good.
/life :: Link / Comments (0)
Distributed storage announcement.
Yes, it has happened.
Here is a homepage for this project.
I will announce it on the main lists in a couple of moments.
First step is completed.
/devel/dst :: Link / Comments (0)
Testing reconnection as a failover recovery in distributed storage.
I decided to release the first version of the distributed storage
after this testing is completed and the ability to suspend a node is ready,
without any additional redundancy algorithm being implemented.
If you think any special algo or general efforts in this area should be continued, feel free
to kick me and suggest your ideas.
I will create a web page for the project, put the sources there and announce this to
linux-kernel@, linux-fsdevel@ and netdev@ soon.
Stay tuned.
/devel/dst :: Link / Comments (0)
Mon, 30 Jul 2007
Distributed storage beta is ready. Testing is completed.
About 30 thousand iterations passed successfully,
each iteration included mkfs and several mount/umount/copy/sync operations.
Testing was performed on x86 and x86_64 machines.
After paid work tasks are completed I will start implementing the initial failover
recovery - reconnect - and then add the ability to suspend a node.
/devel/dst :: Link / Comments (0)
Sun, 29 Jul 2007
Distributed storage beta is ready.
One can translate 'beta' as supporting fully working local
and local exporting nodes; the system passed quite heavy tests on both
x86 and x86_64 with local exporting and remote userspace nodes.
The last tricky bit, which I tried to fix for the last couple of days, was in the
way polling and TCP work - it was observed only on small systems with a 100 Mbit NIC connected to
a gigabit LAN. TCP can merge several subsequent segments into one chunk (less than or equal to MSS,
which is usually 1448 bytes iirc), but ->poll() can only say whether there is data
or not. Tcpdumps showed that data from the previous write request and the current read request
(1024 and 24 bytes) were combined into one chunk. The local exporting state machine detected
new data via polling callbacks, but after reading the data of the write request it did not check whether
another request had already arrived. Since the polling state machine is not invoked again until there is a new packet,
the read request sat in the reading queue forever, blocking all operations on the main node (mount
is synchronous).
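The general cure for this class of bug is to keep parsing until the receive buffer holds no more complete requests, instead of handling one request per poll wakeup. Below is a minimal userspace sketch of that loop. The header layout (`struct req_hdr`), the command encoding and the function name `drain_requests()` are assumptions made for illustration; this is not the real dst wire format or state machine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical request header, NOT the real dst wire format. */
struct req_hdr {
	uint32_t cmd;   /* assumed encoding: 0 - read, 1 - write */
	uint32_t size;  /* payload bytes which follow the header */
};

/*
 * Consume every complete request found in buf[0..len).
 * Returns the number of requests handled and stores the number of
 * consumed bytes in *used; an incomplete tail stays for the next wakeup.
 */
static int drain_requests(const unsigned char *buf, size_t len, size_t *used)
{
	size_t off = 0;
	int handled = 0;

	/* Loop instead of stopping after the first request:
	 * the poll callback may not fire again for data already received. */
	while (len - off >= sizeof(struct req_hdr)) {
		struct req_hdr hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (len - off - sizeof(hdr) < hdr.size)
			break; /* payload is not fully received yet */

		off += sizeof(hdr) + hdr.size;
		handled++;
	}

	*used = off;
	return handled;
}
```

With a write request and a read request merged into one TCP chunk, a single call handles both, so the read no longer waits for a packet which never comes.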
Right now a distributed storage formed on top of three remote nodes (one in-kernel local exporting
node and two userspace targets) tries to survive a read/write/mount/umount/sync/mkfs test,
which is organized to fill the filesystem on all three nodes. I will leave this test running for the night;
so far about 200 iterations passed, let's see how it feels tomorrow.
/devel/dst :: Link / Comments (0)
Wallpaper glueing.
Half of the room is completed. Since I forgot my phone
in the office yesterday and do not have any other watch, that was a fun day -
you wake up, look out to the street, do some tasks being completely lost in time.
This night I finished glueing at about 5-00 - when I saw that sunrise was coming,
although I did not know the time (if I had, I would likely have stopped earlier).
Only when I returned to the office (to bring some food here, to read mails
and hack) did I find that it was already about 19-00.
Wallpaper glueing is not as simple a task for one person as I expected -
actually I do not like how I did it, but it is possible that it is only because
the glue has not set yet.
Here are a couple of photos of my ceiling (with and without flash).


A hook.

A bit more in gallery.
/devel/flat :: Link / Comments (0)
CFS vs SD.
Stupid politics and unfairness as usual. And Con is not the only one who suffered.
I do not care about the process, but this sentence raises a question:
that was where the SD patches fell down. They didn't have a maintainer that I could trust to actually care about any other issues than his own.
Do I read this right, that Linus will not accept (or will accept with heavy brakes) patches
from people outside his own circle of fame?
But that is better than "utterly overhyped" and "misbenchmarked" :)
Fuck them, just do what you like and do it well, that is what matters, everything else
is just stupid.
/devel/other :: Link / Comments (2)
Sat, 28 Jul 2007
Big apartment development day.
And the start of the big apartment development weekend.
Working from about 20-00 to 3-00 this night and half of the day
resulted in completed electricity wiring on the hinged ceiling,
finished ceiling painting in the room, checkroom and hall (I will cover it with another
layer though - some colour is left from the previous layers),
and painted radiator tubes. I also replaced the old heavy anchor bolts,
which held my hammock, with new ones, which have the form of a hook,
so it becomes quite easy to get the chains which support the hammock on and off.
Hooks also remove bending force from the chain (which might result in
a torn chain, which can hold 350 kg in the usual direction, but only about 150
in bending).
All of the above could be completed just because I went to the development
shop yesterday evening and bought all the needed equipment. I do plan to complete
the room and checkroom this weekend. And the probability of that happening is very high.
The next task will be the bathroom, since washing in my spartan conditions is absolutely inconvenient.
Unfortunately there is no neon cord in the Leroy Merlin development shop,
so my ceiling is not fully completed yet; I will search further, but its setup is
trivial since all electric parts are already completed.
Maybe I will order a vacuum cleaner and a boiler tomorrow: the former is needed to lay
the floor carpet, since it is very dusty, the latter is an essential part
of the bathroom and my water system project.
/devel/flat :: Link / Comments (0)
Fri, 27 Jul 2007
Kernel Summit.
Usenix will give me a bit more money than the requested ticket costs,
so we will invest a bit more into the British economy...
Know a good pub in Cambridge?
/devel/other :: Link / Comments (0)
Thu, 26 Jul 2007
Exporting local nodes in the distributed storage has entered the testing stage.
Now it is possible to export local storage (a block device) to a remote system.
So far it was only possible to use a userspace application as a target, but now there is a kernel target too.
It is faster. It allows building tree structures of nodes to remove
single points of failure. It simplifies configuration. It makes
recovery simpler to implement (like reconnect, which is the next step before
the first release). This was pretty simple, since the whole infrastructure was already there.
As I said, the next task is initial failover support - so far I only plan to add
reconnect on error. Then I will either release the first version, or (likely) add
a redundancy algo (simple mirror).
And of course testing... I will start running heavy benchmarks (like iozone
or any other from my great filesystem contest
benchmarks) after local node exporting is tested.
And again I did not get to the development shop. Well, eventually...
/devel/dst :: Link / Comments (0)
Hard way learning russian.
/other :: Link / Comments (0)
Linux sucks.
Please tell me how it is possible in the 21st century,
on a two-core system with 3.4 GHz each and two gigs of RAM,
for music to stop playing when the system runs a recursive chown
on a huge dir:
top - 14:36:57 up 21:39, 10 users, load average: 0.91, 0.32, 0.12
Tasks: 151 total, 2 running, 149 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.7%us, 2.0%sy, 0.0%ni, 34.2%id, 62.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2075052k total, 1581176k used, 493876k free, 645032k buffers
Swap: 1951856k total, 0k used, 1951856k free, 434852k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7403 root 20 0 3988 952 632 R 3 0.0 0:04.04 chown
7366 s0mbre 20 0 27884 3776 2904 S 1 0.2 0:03.74 mplayer
This is a default FC7 installation with the most recent kernel 2.6.22.1-27.fc7.
Crap.
/devel/other :: Link / Comments (8)
Quotation of the day.
Chris Mason wrote:
> Definitely, and I'm glad you are. You haven't converted me yet, but
> I look forward to finding the best ideas from our two approaches when
> the patches are further along (ext2 port of fsblock coming along, so
> we'll be able to have races soon :P).
I'm sure we can find some river in Cambridge, winner gets to throw Axboe in.
P.S. Both Chris Mason and Jens Axboe work at Oracle; Jens is the block layer maintainer.
The talk is about how to map data to disk (file offset to block number, flags, sub-page blocks)
and how to replace buffer_heads with a new extent-based tree (an rb-tree is used, although
that can be slow, since it requires locking; a multidimensional
trie similar to the judy array used in unified socket storage allows using RCU protection).
Likely Jens Axboe worked closely with these much-hated buffer_heads :)
/devel/other :: Link / Comments (0)
Wed, 25 Jul 2007
Climbing evening.
That was quite a lazy training - Grange
got lost somewhere, nobody I know was there. I only did several traverses and a number of old boulderings,
but spent most of the time stupidly sitting. Lazy...
/life :: Link / Comments (0)
Met with Abr and Tanya in "Belfast" pub.
More in gallery.
Belfast pub.

Gorbaty bridge late night.

Walking Moscow night...


The culprits of the celebration - Abr and Tanya.
/life :: Link / Comments (0)
Tue, 24 Jul 2007
Distributed storage. Local node support.
I've committed local node support for distributed storage,
so it is now possible to create a storage which is formed out of locally connected
storages and remote ones using the linear mapping algorithm.
The next task is to complete local node exporting support, so that it would be possible
not only to gather remote nodes using this driver, but also
to export local storage to a remote system. Right now it is implemented
in a userspace target, which is obviously slow because of an additional
data copy to/from the mapped file and due to the fact that the system works with usual files in userspace,
so one filesystem (created on top of distributed storage) is placed inside another filesystem
(on the target hosts).
With this type of operation it is possible to create trees of storages, eliminating single points of failure
by distributing control nodes between data ones.
/devel/dst :: Link / Comments (0)
Distributed storage competitors. DRBD.
I've just read the news on Kerneltrap
about DRBD going upstream. DRBD is a Distributed Replicated Block Device,
which allows binding two remote storages to form a mirror. So far it only supports two nodes and has a number of questionable
own implementations of work queues, threads and locks. It is very big, I do not know why. It has a 4gb limitation (likely due to u32 usage).
There are problems with coding style.
That was the problematic part, but drbd has a major advantage - it has a user base. I used drbd with heartbeat several years
ago to create a high-availability cluster at one of my previous jobs. Although it was not very stable (sometimes
it was not possible to write to the system under resync, heartbeat did not detect failures and resync itself had problems),
it worked. So it is possible that it can be imported into the tree.
Unfortunately it was posted neither to netdev@ nor to linux-fsdevel@, so I can not comment on it (although
I'm not sure my comments would be understood correctly and not as advertising of my work :).
So, right now I'm keeping silent about drbd and instead will think about how to better work with block requests
which cross the boundary between nodes, so that as many allocations as possible are eliminated.
While doing that I found a perfect way to make a multi-node mirror, btw, but so far I will postpone its implementation.
/devel/dst :: Link / Comments (0)
Block layer requirements for distributed storage.
Unfortunately the block layer requires cloning a
block IO request each time it is supposed to be split - there is no
way to end IO atomically from several paths. Actually that is theory;
practice shows that it is possible to call bio_endio()
multiple times, since all ->bi_end_io() callbacks I saw
check ->bi_size and only perform actual IO ending when it is zero,
so if it is possible to atomically decrement ->bi_size,
it is safe not to clone the block IO request. That is what is done in the distributed
storage state machine - although it is not perfect (and headachingly complex for me)
and there is a tiny race window.
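The idea of ending one request from several paths can be sketched in userspace with C11 atomics. This is only an illustration under assumptions: `struct fake_bio` and `fake_bio_endio()` are hypothetical stand-ins, not the kernel's `struct bio` or `bio_endio()`, and the sketch does not reproduce the dst state machine or its race window.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for a block IO request; not the kernel's struct bio. */
struct fake_bio {
	atomic_uint remaining;  /* bytes not yet completed, plays the role of ->bi_size */
	atomic_int  end_calls;  /* how many times the final completion has run */
};

/*
 * Called once per node when its part of the request (bytes) is done.
 * Only the caller which brings the counter to zero ends the IO.
 * Returns true if this call performed the final completion.
 */
static bool fake_bio_endio(struct fake_bio *bio, unsigned int bytes)
{
	/* atomic_fetch_sub() returns the value held before the subtraction */
	unsigned int before = atomic_fetch_sub(&bio->remaining, bytes);

	if (before != bytes)
		return false; /* other parts are still in flight */

	atomic_fetch_add(&bio->end_calls, 1);
	return true;
}
```

Because the decrement and the zero check are one atomic operation, exactly one of the completing paths sees the counter reach zero, whatever order the nodes finish in.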
Things become worse with local node support - it (surprisingly) works, but the race window
increases.
I implemented protection against data congestion,
i.e. a situation when a subsequent read goes out before a write is completed. It is done using
a tree of in-flight requests, so that each subsequent request can be copied to/from one in the tree,
but debugging shows that such a situation never happens. This does not degrade performance, since the number of in-flight
requests is very small and they never cross each other's boundaries.
Another block layer issue is the size of the block request. My observations show that it never exceeds
31 pages (even during heavy syncs on my hardware), although the maximum number of pages in a bio is 256.
Likely this is not because of the block layer itself but an ext3 filesystem issue; I did not investigate that.
/devel/dst :: Link / Comments (0)
Mon, 23 Jul 2007
FC7 installation bug.
I selected packages and started installation, then
pressed "Release Notes" and got this:

As you might expect, pressing "Close" does not close the window and instead
shows an empty white-blue-gray area without letters.
The world is far from being perfect...
/devel/other :: Link / Comments (0)
No Debian/Ubuntu on desktop.
After another crash of the terminal during hacking
my infinite patience suddenly ended, so I'm burning an FC7 DVD, which
will replace Ubuntu on my desktop.
/devel/other :: Link / Comments (2)
Window with drops.
/life :: Link / Comments (0)
Sun, 22 Jul 2007
Walking summer evening...
Met with Abr and Tanya. And Anton.

Fountains.


Heron.

Graffiti.


/life :: Link / Comments (0)
Apartments.
I've finally made some small progress in my apartment development process -
I spent this morning setting up electricity - the whole electricity project seems to
be close to completion (at least in the room, hall and checkroom). I've installed
spot lights, completed their wiring, switches and transformer, and fixed the electricity sockets.
I like my hinged ceiling even more now; it will be completed after I paint the last layer of
(clouded pearl 2) colour (eventually I will buy it, sigh) and buy a blue neon cord (I've also
set up its electricity supply today). I do plan to go to the development shop
this week, hopefully on Tuesday, and finally complete the whole room.
I also need to switch to this task as a kind of rest, since after the distributed
storage state machine I feel mybrain^Wmyself a bit tired. After the hinged ceiling gets painted
I will take photos of it; hopefully I will get the neon cord by that day.
/devel/flat :: Link / Comments (0)
Sat, 21 Jul 2007
Testing is completed. Distributed storage alpha is ready.
No problems appeared during this night's testing.
Alpha means that only the linear algorithm exists and I have not yet tested
local nodes and local exporting. Starting this.
The beta version will have full support for local nodes and local exporting,
which will eliminate the need for userspace target support (one can still use
it of course).
The first release will either be based on the beta version or will additionally include
some redundancy algorithm (likely mirroring).
/devel/dst :: Link / Comments (0)
Fri, 20 Jul 2007
Distributed storage. Linear algorithm considered stable.
The preliminary tests I started yesterday
found a huge number of bugs, so I started to clean things up and ended up with
a good structure which describes the whole state machine. It is complex, damn complex,
but that is the price for not having any allocations in the fast path.
Such mental exercises as fixing bugs in that monster require turning the brain
on instead of the usual spinal cord, up to a headache, and force my eyes to jump out of their sockets.
But now they are all fixed - testing completed more than a hundred runs with reads, writes, mounting, unmounting,
syncs and filesystem creations. It is quite slow because of the huge amount of debug prints
all over the code, but it runs continuously and I plan to leave it testing for the whole night.
As previously - if there are no bugs, I will move forward and commit the current stage as stable.
The next step is local and local export node testing. Then performance testing. Then a new redundancy algorithm.
What I really like in this system is how simple it is to configure the whole storage array:
./dst -n $ST -A $ALG -f /dev/dst -a kano -p 1025
./dst -n $ST -A $ALG -f /dev/dst -a 192.168.4.78 -p 1025
./dst -n $ST -A $ALG -f /dev/dst -a via -p 1025 -R
The algorithm automatically requests remote configurations and manages the nodes' info to form an array.
There is no need for any table, for entering size information or for preserving the sizes and offsets.
In dst the configuration order is significant though;
it is also possible to set up the system without autoconfiguration by providing sizes and offsets via the command line.
I will put into the TODO list a feature which would allow storing a node's information
in the attached data itself, which would then not require preserving the order, but not now.
Taking into account how fast it was to reproduce the previous bugs and how long it has been working already, I can confirm
that initial support for the linear algorithm with remote targets, without additional redundancy and failover, is ready.
But let's wait for tomorrow (about a thousand or two testing cycles with different operation modes) until
testing is completed.
/devel/dst :: Link / Comments (0)
Thu, 19 Jul 2007
Distributed storage progress.
So far I have completed the network processing rewrite, so that there are
zero additional allocations in the fast path (compared to two in device
mapper in the best case), although the code itself is a bit ugly in places, so I will clean this up
eventually. There is one possible allocation if the fast path would end up sleeping
in the sending or receiving function; in that case a new request is allocated from a memory
pool and queued to be processed later by a dedicated worker thread (when the socket
is awakened by the bottom half or the sending route); this queue is protected via RCU.
I have not yet tested the local device node, i.e. when the storage contains a local disk as one of its
nodes; local export mode is also not yet tested - it allows exporting a local
disk to remote peers; right now I use a userspace export daemon, which works with local files.
Hunting for a tricky bug took most of the last couple of days - the bug happened when
huge block IO requests ended up covering multiple nodes. In device mapper it is fixed simply
by allocating an additional block IO request, but since I decided to remove additional
block IO allocations completely, I needed to make the dst processing engine flexible enough
to allow such cases. The most tricky part was the situation when a single bio_vec,
i.e. a page, happened to be on the boundary, so that parts of the page are on different nodes.
Simply playing with the size and offset of the page is not allowed, since XFS requires those fields
to be left untouched by the block layer (this is the only such restricting user though). The local node
actually requires allocating a new BIO here, so in the local hot path there might be an additional allocation,
but there are no postponed requests - I just queue the new bio at the end of the queue
for the specified device by calling generic_make_request().
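The boundary case can be illustrated with a small userspace sketch which splits one byte range of a linear array into per-node pieces while keeping the caller's own offset and length intact (the pieces are written to a separate array). The names `struct chunk` and `split_range()` and the byte-based node sizes are assumptions for illustration only; the actual dst engine works on bios and bio_vecs and is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* One piece of a request which fits entirely into a single node. */
struct chunk {
	int      node;    /* index of the node in the linear array */
	uint64_t offset;  /* byte offset inside that node */
	uint32_t len;     /* bytes handled by that node */
};

/*
 * Split the byte range [start, start + len) of a linear array into
 * per-node chunks. node_size[] holds node sizes in bytes.
 * Returns the number of chunks written to out[], or -1 if the range
 * does not fit into the array or into out[].
 */
static int split_range(const uint64_t *node_size, int nodes,
		       uint64_t start, uint32_t len,
		       struct chunk *out, int max_out)
{
	uint64_t base = 0; /* array offset where the current node begins */
	int n = 0;

	for (int i = 0; i < nodes && len; i++) {
		uint64_t end = base + node_size[i];

		if (start < end) {
			uint64_t room = end - start;
			uint32_t part = len < room ? len : (uint32_t)room;

			if (n == max_out)
				return -1;
			out[n].node = i;
			out[n].offset = start - base;
			out[n].len = part;
			n++;
			start += part;
			len -= part;
		}
		base = end;
	}
	return len ? -1 : n;
}
```

A 4096-byte page which begins 1024 bytes before the end of the first node comes out as two chunks: 1024 bytes at the tail of node 0 and 3072 bytes at the start of node 1.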
I've started massive killing testing of the system - the whole night it will mount/umount, read/write,
sync and rebuild the filesystem; if it does not crash or freeze and the log of this crap
does not contain broken data, I will consider remote transfer mode ready. When it is,
local node support and exporting testing will not take too much time (although I already said something similar
and it took me about a week to move forward, but all is well that ends well).
Because of this bug I did not go to the development shop and thus did not complete the ceiling painting,
and thus still do not have wallpaper on the walls and a cover on the floor,
and the whole apartment development process
is in a hung state - every day I fall asleep in my hammock
looking at tons of garbage, concrete and tools around. But I will fix that too, it just requires a bit
more time than I expected (actually it requires exactly those several days I expected, but only if
I work on this problem, and so far I usually do something more interesting. I can pay for completing my loft,
but that is not an interesting variant)...
Stay tuned.
/devel/dst :: Link / Comments (0)
Wed, 18 Jul 2007
Climbing evening.
That was a great one - its main advantage is that I see progress
climbing on the negative slope. It is slow and right now I still
fall too many times even on simple routes (like 6a), which I do
on-sight on the vertical wall. But the number of falls and overall tiredness
get smaller each time - maybe that is because I climb the same routes, but
I try to change them after a short time.
Anyway this training was good - I managed to complete routes with 1-3 falls,
which is a good sign, although those routes are not that complex (6a-6b).
Excellent time!
/life :: Link / Comments (0)
Block layer issues.
I wrote previously
that it is impossible for the same block queue request processing function
to be called simultaneously, but that is not true; at least I see in distributed storage
that it is a perfectly invalid assumption - my request function is called
simultaneously on a single-cpu system (with preemption enabled though).
This is a big surprise, and it is contrary to what is written in the comment
for generic_make_request():
* We only want one ->make_request_fn to be active at a time,
* else stack usage with stacked devices could be a problem.
But having a debug print at the beginning and end of the request function,
I managed to get two subsequent starts without an end in between, so in my setup
with three remote nodes in distributed storage its request function does run
simultaneously for at least two requests. This requires some steps
to clean up the situation; I will think more on how to fix this without serious lock
contention in the hot path.
/devel/dst :: Link / Comments (2)
Tue, 17 Jul 2007
A sign.
I've just loaded a module, it crashed, and to check the reason
I ran 'dmesg' to see the debug output.
And I saw it! I've just fscking found the reason for all the damn problems in the world.
That was just like a sign from god.
This opened my eyes and freed my mind. I know the reason now.
I've seen this:
[479591.764465] All bugs added by David S. Miller <davem@redhat.com>
A couple of days ago I discussed how device mapper depends on the network. Right now
everything depends on the network, even if not directly then via
configuration utilities or some other code which depends on the network
and is required for the given application. And the networking maintainer is responsible
for all your bugs, likely since day one.
It is even more than the Matrix.
/devel/other :: Link / Comments (0)
Mon, 16 Jul 2007
Climbing evening.
That was a good training - after several simple routes
on the vertical wall I started to climb on the negative slope
with a bottom rope. That was quite hard and as usual there were
a lot of falls, but there were also positive moments besides
getting experience - I managed to complete several very interesting
parts on quite complex routes, although that required hanging
for quite a while, but the result justifies the resources.
Several times I almost fell in a very complex position, so I preferred
to gather all the power I had (usually very little) and hang on the hold until
I got the rope or a better hold, so right now quite a few muscles are aching, which is
good.
Actually I fear falling; some time ago I feared it too, but fell
without problems, while right now I prefer to hang intentionally instead of moving up
and falling after several holds. I think I should change that...
/life :: Link / Comments (0)
Data congestion in distributed storage.
I gave this name to the condition when different operations
are queued for the same data in the storage - for example
when a read follows a write. In that case, if the write has not yet
completed, there is no need to send the read request to the remote node;
instead the data can just be copied from the write request to the read one.
This is quite a rare condition for a usual system though, since a growing queue
is a sign that the system does not handle the load, but it is possible in bursts,
so having such an optimization is definitely a win.
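A rough userspace sketch of this optimization: before sending a read to the remote node, look through the writes which are still in flight and, if one of them fully covers the requested range, copy the data locally. The names `struct inflight_write` and `read_from_inflight()` are hypothetical, and a plain array scan is used only for brevity in place of a proper lookup structure; partial overlaps simply fall back to the normal path.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A write which was queued but is not yet acknowledged by the remote node. */
struct inflight_write {
	uint64_t start;             /* first byte covered by the write */
	size_t len;                 /* number of bytes in the write */
	const unsigned char *data;  /* payload of the write */
};

/*
 * Try to satisfy a read of [start, start + len) from the queued writes.
 * w[] is ordered from oldest to newest. Returns true if buf was filled
 * locally, false if the read has to go the normal way.
 */
static bool read_from_inflight(const struct inflight_write *w, size_t count,
			       uint64_t start, size_t len, unsigned char *buf)
{
	/* Scan from newest to oldest so that the latest data wins. */
	for (size_t i = count; i-- > 0; ) {
		uint64_t w_end = w[i].start + w[i].len;

		if (start >= w_end || start + len <= w[i].start)
			continue; /* no overlap with this write */

		if (start < w[i].start || start + len > w_end)
			return false; /* partial overlap: not handled here */

		memcpy(buf, w[i].data + (start - w[i].start), len);
		return true;
	}
	return false;
}
```

Stopping at the newest overlapping write, even when it only partly covers the read, avoids returning stale bytes from an older write.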
This requires processing IO requests on a per-page basis instead of per block IO,
so after such changes are done the system will be able to reorder requests more easily,
for example to send a write request while the remote system is waiting for a read completion.
It also allows not cloning the block IO request for each node covered by the given BIO,
although the BIO must be cloned for local targets (actually I will think about the possibility of
dropping even that cloning by manipulating the bi_idx and bi_vcnt fields).
All that is true for linear block mapping; mirroring (or any other redundancy setup), for example,
requires more sophisticated block operations, so the actual algorithm remapping function gets block IO requests
combined as a single BIO, and then will (or will not) split them into pages.
/devel/dst :: Link / Comments (0)
National question.
/other :: Link / Comments (1)
Sun, 15 Jul 2007
Distributed storage suspend mode or live data migration.
I've just thought about which feature I should add to this -
suspend mode or live migration.
Let's say you want to change a remote node - either temporarily suspend
all IO for the given node (for example to change a local disk)
or completely replace one node with another (for example
switch to a different remote machine). Until
data migration from one node to another or disk replacement is finished,
all block requests which would be completed on the given node
will be frozen until the node is ready. Requests to the other nodes should
continue without stops.
Actually I would be surprised if such functionality did not exist in the existing block
layer hotplug, but I do not even know how to test whether it is there or not -
documentation sucks, there is no feature list (at least I do not know about one),
so I will reinvent the wheel (again).
There is something like queue plug and unplug, but as usual - specs suck. I will
check the LWN kernel line, I recall
Jonathan Corbet wrote about it, but even if it does exist, it can not help
in distributed storage, since that is a single device for the block layer
and thus has only one queue, and stopping all IO requests because of one node
is not politically correct I think. Such a decision should be made by the algorithm of course,
since redundancy might require several nodes to be updated for a single block IO,
or even more - to write to some other node the algorithm may have to read some data from the suspended node,
so the algorithm is the only place which knows what IO must be frozen.
I'm thinking about whether I should release an alpha version right now (modulo the testing I
need to perform for local mode) or implement some other tasty things and show distributed
storage only after that... Pros and cons?
/devel/dst :: Link / Comments (0)
Distributed storage system.
I've added sysfs support, so the device tree looks like this
(a storage named 'storage' created with two remote nodes):
/sys/devices/storage/
/sys/devices/storage/alg : alg_linear
/sys/devices/storage/n-800/type : R: 192.168.4.80:1025
/sys/devices/storage/n-800/size : 800
/sys/devices/storage/n-800/start : 800
/sys/devices/storage/n-0/type : R: 192.168.4.81:1025
/sys/devices/storage/n-0/size : 800
/sys/devices/storage/n-0/start : 0
/sys/devices/storage/remove_all_nodes
/sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
/sys/devices/storage/name : storage
As you can see, there are two nodes in the linear algorithm:
the first one starts at sector 0 and has a size of 800 sectors,
the second one starts at sector 800 and has a size of 800 sectors too.
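To make the linear layout concrete, here is a minimal sketch of how a sector of the array could be mapped to a node and a node-relative sector, using the start/size pairs from the sysfs listing above. The names `struct linear_node` and `linear_lookup()` are hypothetical; this is not the dst lookup code.

```c
#include <stdint.h>

/* Position of one node inside the linear array, in sectors. */
struct linear_node {
	uint64_t start;  /* first sector of the node in the array */
	uint64_t size;   /* number of sectors in the node */
};

/*
 * Find the node which holds the given sector and store the sector
 * number relative to the node start in *local.
 * Returns the node index or -1 when the sector is outside the array.
 */
static int linear_lookup(const struct linear_node *nodes, int count,
			 uint64_t sector, uint64_t *local)
{
	for (int i = 0; i < count; i++) {
		if (sector >= nodes[i].start &&
		    sector - nodes[i].start < nodes[i].size) {
			*local = sector - nodes[i].start;
			return i;
		}
	}
	return -1;
}
```

With nodes `{0, 800}` and `{800, 800}`, sector 799 belongs to the first node and sector 800 is sector 0 of the second one.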
Implemented an initial failover mechanism - if there is a recoverable error
(i.e. not -ENOMEM), then the appropriate algorithm's callback
is invoked. Right now it does not perform any action,
but it can, for example, reconnect to the remote node and resend a block request.
To implement this I need to refactor the code a bit.
Extended userspace support. To set up the above array one just needs to run the following commands:
# ./dst -n storage -A alg_linear -f /dev/dst -a kano -p 1025
# ./dst -n storage -A alg_linear -f /dev/dst -a via -p 1025 -R
To remove an array:
# ./dst -n storage -A alg_linear -f /dev/dst -D
Here is a small help for userspace options:
Usage: ./dst -n storage_name -A algorithm -b backlog -f device_path
-s start -S size -d local_disk -a addr -p port -r <remove>
-R <start array> -D <del array> -h <help>
So, to be ready for the alpha release I need to test local export (so far I only tested
a userspace remote peer, which works on top of a usual file (can be a device file though)) and
local (local block devices) targets.
Also watched three parts of the "Lethal Weapon" film to help my brain not to explode
or flow out of my ears - that was an excellent time.
Stay tuned.
/devel/dst :: Link / Comments (0)
Interesting note about device mapper.
It never checks the return value of allocations.
For the hot path it uses either a memory pool or biosets
(which are in turn memory pools too), and allocation
always happens in process context (where sleeping is allowed),
so neither of them can fail:
the memory pool API internally retries forever (if sleeping is allowed
in the context) until the requested data block can be obtained.
/devel/other :: Link / Comments (0)
Sat, 14 Jul 2007
More iSCSI weirdness and DST notes.
static void __iscsi_get_ctask(struct iscsi_cmd_task *ctask)
{
	atomic_inc(&ctask->refcount);
}

static void iscsi_get_ctask(struct iscsi_cmd_task *ctask)
{
	spin_lock_bh(&ctask->conn->session->lock);
	__iscsi_get_ctask(ctask);
	spin_unlock_bh(&ctask->conn->session->lock);
}

static void __iscsi_put_ctask(struct iscsi_cmd_task *ctask)
{
	if (atomic_dec_and_test(&ctask->refcount))
		iscsi_complete_command(ctask);
}

static void iscsi_put_ctask(struct iscsi_cmd_task *ctask)
{
	spin_lock_bh(&ctask->conn->session->lock);
	__iscsi_put_ctask(ctask);
	spin_unlock_bh(&ctask->conn->session->lock);
}
I tried to avoid locks and allocations as much as possible in distributed storage,
but there are at least two locks in DST: one is used to put a ready socket into the queue
to be processed by a dedicated thread in process context, and another one protects
the queue of requests for each given socket. There is also one possible additional allocation
in the IO request processing path: if it is impossible to complete a request when it arrives
(this happens in process context on behalf of generic_make_request()), a new request
is allocated from a memory pool. It holds the given block IO request and additional fields
that form the processing state machine (sending/receiving offsets and other fields);
the new request is then queued to the tail of the request queue of the given state, which holds
a socket. Actually there is always only one consumer (the processing thread) and one producer
(the node which holds the state which contains the socket), so it would be possible to remove the additional
locking, but there is only one producer because the block layer does not allow several
generic_make_request() processing functions to run simultaneously, which can be changed in the future
(or not), so I will leave it as is for now.
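A minimal userspace sketch of such a per-socket request queue, for illustration only. All names here (dst_queue, dst_request and the helpers) are hypothetical and are not taken from the DST sources, and a C11 atomic_flag stands in for the kernel spinlock:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical request: wraps the original block IO request plus state. */
struct dst_request {
	struct dst_request *next;
	void *bio;		/* opaque stand-in for the original block IO request */
	size_t offset;		/* how much has been sent/received so far */
};

/* Hypothetical per-socket queue protected by a single lock. */
struct dst_queue {
	atomic_flag lock;	/* stands in for a kernel spinlock */
	struct dst_request *head, *tail;
};

void dst_queue_init(struct dst_queue *q)
{
	atomic_flag_clear(&q->lock);
	q->head = q->tail = NULL;
}

static void dst_queue_lock(struct dst_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;	/* spin until the flag is released */
}

static void dst_queue_unlock(struct dst_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Producer side: append a request to the tail of the queue. */
void dst_queue_push_tail(struct dst_queue *q, struct dst_request *req)
{
	assert(req != NULL);
	req->next = NULL;
	dst_queue_lock(q);
	if (q->tail)
		q->tail->next = req;
	else
		q->head = req;
	q->tail = req;
	dst_queue_unlock(q);
}

/* Consumer side: take the oldest request, or NULL if the queue is empty. */
struct dst_request *dst_queue_pop_head(struct dst_queue *q)
{
	struct dst_request *req;

	dst_queue_lock(q);
	req = q->head;
	if (req) {
		q->head = req->next;
		if (!q->head)
			q->tail = NULL;
		req->next = NULL;
	}
	dst_queue_unlock(q);
	return req;
}
```

With exactly one producer and one consumer the lock could in principle be replaced by a lock-free single-producer/single-consumer queue, but that only stays correct while the single-producer assumption holds.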
Actually, looking at DST, it becomes very similar to device mapper on top
of a network block device. Yes, the network block device has some problems,
but there is a whole system in device mapper which allows using existing
redundancy algorithms (like different RAID arrays), so it looks like I am just
reinventing the wheel with my project (you cannot even imagine how frequently
I was told the same about kevent
and epoll), so I will implement a simple device mapper target too.
Although it will be a bit limited, it is noticeably simpler than the existing system.
It does not mean I will drop what I created; there will just be two operation modes.
/devel/dst :: Link / Comments (0)
Fri, 13 Jul 2007
Friday 13. But it works.
I'm talking about distributed storage.
The initial implementation of the distributed storage with linear
(remote) device mapping (details below) works in Linux now.
Linear mapping is no more than a trivial concatenation of several (remote)
nodes into a single local one. There is no redundancy or failover or whatever,
there is only proof-of-concept code, which allows forming a local device
on top of several (remote) nodes. I created a single ext3 filesystem
on top of it (part of the filesystem is on one device, part on another, but the filesystem
does not know about it; it also does not know about any special operations that need
to be performed to work, like those needed for NFS to operate correctly).
It is quite trivial, but works ok.
Actually this is no more than a usual device mapper, but with the ability
to combine local and remote nodes. Due to limitations of the device mapper config
I ended up creating a third system to combine devices into one (device mapper,
md (multiple device) and now distributed storage). I will definitely create
a device mapper module to form remote nodes regardless of its table file limitations,
since it is usually simpler for users to work with an existing system.
There are unsolved problems of course:
- there is no block IO split, i.e. if a single block request crosses the boundary
between devices, it will not be split in two, but completed with errors.
This definitely must be fixed, even though the vast majority of requests are page-size only.
- quite bad userspace support - the configuration utility is ugly (it uses the same ugly char device
with ioctl commands instead of netlink).
- local export was not tested - there are only userspace remote nodes, which work
on top of a usual file.
But nevertheless this is a big step forward.
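The missing block IO split for the linear mapping could look roughly like the following sketch. It is plain userspace C, not the DST code: the names dst_segment and dst_linear_split are hypothetical, and sizes and offsets are assumed to be in sectors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One piece of a request after splitting (hypothetical structure). */
struct dst_segment {
	size_t node;		/* index of the node that gets this piece */
	uint64_t offset;	/* offset inside that node, in sectors */
	uint64_t len;		/* length of the piece, in sectors */
};

/*
 * Split the request [start, start + len) over nodes concatenated linearly.
 * Returns the number of segments written to 'out', or -1 if the request
 * does not fit into the device or 'out' is too small.
 */
int dst_linear_split(const uint64_t *node_size, size_t nodes,
		     uint64_t start, uint64_t len,
		     struct dst_segment *out, size_t max_out)
{
	uint64_t node_start = 0;
	size_t i;
	int n = 0;

	assert(node_size != NULL && out != NULL);

	for (i = 0; i < nodes && len > 0; i++) {
		uint64_t node_end = node_start + node_size[i];

		if (start < node_end) {
			/* Part of the request lands on node i. */
			uint64_t chunk = node_end - start;

			if (chunk > len)
				chunk = len;
			if ((size_t)n == max_out)
				return -1;
			out[n].node = i;
			out[n].offset = start - node_start;
			out[n].len = chunk;
			n++;
			start += chunk;
			len -= chunk;
		}
		node_start = node_end;
	}
	/* Leftover length means the request ran past the last node. */
	return len ? -1 : n;
}
```

For two nodes of 100 and 50 sectors, a request for 8 sectors starting at sector 96 becomes two pieces: 4 sectors at offset 96 of node 0 and 4 sectors at offset 0 of node 1.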
There are things to think about:
- Synchronization. Currently each block request is handled in the order it arrived, i.e.
no new requests will be sent to the remote device until the current one is completed. There are
a number of pros and cons to this idea, and likely things should be changed to add a per-page tag
to show which request a page belongs to (or better, just to put in the global offset of the request). This is
similar to how iSCSI works, where each request has its tag, so that it is possible to send
several of them. This is likely a good idea, but needs further investigation.
Actually, talking about iSCSI - I doubt it will work well (at least on small systems)
if the main sending function works on top of a non-blocking socket with the following loop
without any polling or sleeping:
static void iscsi_xmitworker(struct work_struct *work)
{
	struct iscsi_conn *conn = container_of(work, struct iscsi_conn, xmitwork);
	int rc;

	/*
	 * serialize Xmit worker on a per-connection basis.
	 */
	mutex_lock(&conn->xmitmutex);
	do {
		rc = iscsi_data_xmit(conn);
	} while (rc >= 0 || rc == -EAGAIN);
	mutex_unlock(&conn->xmitmutex);
}
And I am not even talking about atomic-only allocations in the iSCSI stack.
I do not say distributed storage is a perfect solution,
but I designed it with quite a few issues in mind instead of following
a theoretical-only design (which is what several people proposed to do
before actually starting to code and finding all the problematic places)...
- Failure mode. Currently if one node is broken, the whole device stops working. This must be handled
by a per-algorithm decision (i.e. a redundancy algo will just update the required nodes instead of failing).
But the existing linear algorithm requires that property, so it is not that bad for now.
Another issue with failure mode is recovery: if some node was replaced, some information has to be written into it
during the recovery phase. Doing that via the main node (like all existing systems do - they sync
via the main device) is not optimal in a distributed scenario; it would be better to tell
some remote node to send the info to the new node directly instead of via the main node,
but that depends on the algorithm being used, so it should be postponed until
some interesting redundancy mechanism is developed (like WEAVER codes).
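As an illustration of the polling point made about the iscsi_xmitworker() loop above, here is a userspace analogue of a non-blocking send path that sleeps in poll() on EAGAIN instead of retrying immediately. It is only a sketch: the helper name send_all_polling is hypothetical, and an in-kernel version would use the socket's ->poll() callback rather than the poll() system call.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Send 'len' bytes over a socket without blocking in send().
 * When the socket buffer is full (EAGAIN), sleep in poll() until the
 * socket becomes writable instead of spinning on the CPU.
 * Returns the number of bytes sent, or -1 on error.
 */
ssize_t send_all_polling(int fd, const void *buf, size_t len)
{
	const char *p = buf;
	size_t done = 0;

	assert(buf != NULL || len == 0);

	while (done < len) {
		ssize_t rc = send(fd, p + done, len - done,
				  MSG_DONTWAIT | MSG_NOSIGNAL);

		if (rc >= 0) {
			done += (size_t)rc;
			continue;
		}
		if (errno == EINTR)
			continue;
		if (errno == EAGAIN || errno == EWOULDBLOCK) {
			struct pollfd pfd = { .fd = fd, .events = POLLOUT };

			/* Wait for buffer space; this is where the CPU is released. */
			if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
				return -1;
			continue;
		}
		return -1;
	}
	return (ssize_t)done;
}
```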
But overall, the first distributed storage has been formed on top of several remote nodes
(with trivial autoconfig - remote device sizes are requested by the distributed storage core
and that data is used accordingly to form the device).
Briefly, this is quite a small, but definitely a success.
/devel/dst :: Link / Comments (0)
Thu, 12 Jul 2007
Ok, back to distributed storage.
I sort of completed the uninteresting tasks and am finally back in
good shape, so expect interesting bits soon.
So far, the challenge is to perform as few additional
allocations per IO request as possible.
For example, there is at least one additional allocation of struct bio
each time a new block is going to be written/read to/from disk (plus struct request,
but I'm not sure - I last looked into the block layer code too long ago
to remember all the bits). Device mapper adds another two: a cloned BIO and its own request.
Then the network will add (at least) one more.
The first and the last ones are unavoidable (actually they can be removed, at least
the skb allocation I can work around, but that will have questionable error-prone
consequences, so there is no need for such a hack for now), but allocations in between,
i.e. in the control layer of the distributed storage, must be reduced to the
very minimum. The initial state machine I released works with one request at a time,
since each state does not have a request queue.
The network block device, on the contrary, does not have any additional allocations, but it is
purely synchronous, so it does not need to keep track of sent requests.
So, the problem is stated as follows: a node (or state) needs to have a queue/tree/whatever of partially
processed requests. Each request should have a pointer to the original block IO
request. Each state (or node) should have a callback, which will be invoked by
the core of the state machine when input or output processing can happen,
so that the callback will get events and process them in order.
That is essentially what I'm developing right now.
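One possible shape of such a state, sketched in userspace C. Everything here is an assumption made for illustration (the structure layout, the names dst_state and dst_state_request, and the 'budget' argument that models how many bytes the socket can take on one event); it is not the released code.

```c
#include <assert.h>
#include <stddef.h>

/* A partially processed request; points back to the original block IO request. */
struct dst_state_request {
	struct dst_state_request *next;
	void *orig_bio;		/* opaque stand-in for the original block IO request */
	size_t size;		/* total bytes to transfer */
	size_t done;		/* bytes transferred so far */
};

/* A state (node): a queue of requests plus a callback for IO events. */
struct dst_state {
	struct dst_state_request *head, *tail;
	/* Invoked by the core when input or output processing can happen. */
	size_t (*event)(struct dst_state *st, size_t budget);
	unsigned int completed;	/* number of fully processed requests */
};

/* Queue a new request to the tail of the state's queue. */
void dst_state_queue(struct dst_state *st, struct dst_state_request *req)
{
	assert(req != NULL);
	req->next = NULL;
	if (st->tail)
		st->tail->next = req;
	else
		st->head = req;
	st->tail = req;
}

/*
 * Example callback: process queued requests strictly in order, using at
 * most 'budget' bytes. A request that does not fit stays at the head of
 * the queue with its progress recorded. Returns the number of bytes used.
 */
size_t dst_state_process(struct dst_state *st, size_t budget)
{
	size_t used = 0;

	while (st->head && budget > 0) {
		struct dst_state_request *req = st->head;
		size_t chunk = req->size - req->done;

		if (chunk > budget)
			chunk = budget;
		req->done += chunk;
		budget -= chunk;
		used += chunk;
		if (req->done < req->size)
			break;		/* partial progress, wait for the next event */
		st->head = req->next;	/* request completed, drop it from the queue */
		if (!st->head)
			st->tail = NULL;
		st->completed++;
	}
	return used;
}
```

The request structure is the only per-request allocation in this layer, which is the point of keeping the offsets inside it.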
Stay tuned.
/devel/dst :: Link / Comments (0)
Wed, 11 Jul 2007
Climbing evening.
That was a really hard one. First, because I had a hangover. Not that heavy,
but the bottle of beer I had yesterday was quite bad, so my stomach got
a full rainbow of crappy feelings during the day. In the evening it was much better,
but after several warm-up traverses and simple traces things became worse.
Eventually we started to climb with the bottom rope, and that drained my power
completely. Although I fell on every third hold, I did try to complete all the traces
that were started. Eventually I climbed the same trace I had previously led,
but this time without holding the rope, and that was especially interesting. And although
I did not have any power left, it was much better than climbing the same trace with the bottom
rope.
Actually that was only the third training with the bottom rope in particular and on
a negative slope in general in the last half a year, so I think things are going
quite well. Give me some time, and it will be about the same as climbing on vertical
walls.
Excellent time.
/life :: Link / Comments (2)
English words.
If you know me, then you have likely also noticed that
I can frequently form very interesting many-storied indecent
sentences.
But I'm completely lost in that area in English, so phrases like
it's driving me absolutely bananas
from a native English speaker really blow my mind.
/other :: Link / Comments (3)
Tue, 10 Jul 2007
Kernel ->poll() based network state machine released.
I released the initial kernel state machine used
in the 'echo'
in-kernel "server" for educational purposes, by readers' request.
I do not know if it can be useful as is, but it is likely good code for studying,
although, taking into account that there is not a single comment in the code,
I'm not that sure.
Anyway, the interested reader can find the kernel ->poll() based network state machine
in the archive.
/devel/dst :: Link / Comments (4)
Mon, 09 Jul 2007
Climbing evening.
That was an interesting and heavy training. After a couple of initial
traverses and some simple old traces, Grange and I
started bottom rope climbing on the negative slope.
Eventually three traces (quite simple, from 5c+ to 6a+) were completed,
if that word is appropriate in this context: I fell several times
even on the simplest trace, and although I was not suspended for long,
just to relax a bit, I think the trace does not count as completed. Although I am generally quite
bad on the negative slope, this was a good training, since it showed the problematic cases
and forces me to improve the results with (I think a short) time.
/life :: Link / Comments (0)
Sun, 08 Jul 2007
High-performance in-kernel 'echo' server.
Also known as the distributed storage network state machine, it is ready after several hours
last night with a bottle of Martini and some time today.
It has not been integrated into the storage module yet and exists on its own,
but it was specially crafted that way to allow easier testing. This
standalone module allows creating a listening socket, binding it to
a specified address (hardcoded though, since that data is provided by the userspace
control in the distributed storage module), listening for connections,
accepting new clients and reading data from them (I did not test writing, since
it is exactly the same).
All operations are non-blocking and are handled via special worker thread(s),
which also dispatch events via ->poll() callbacks and form
a simple state machine.
So, essentially the network backend
for distributed storage is ready.
The next task is to integrate this state machine into the distributed storage module.
I have about an hour right now, so I will add support for a local node in the storage.
Then a trivial distribution algorithm should be created (actually it already
exists, but it only sends data to a single node), which will send/receive data
to/from several nodes in round-robin or mirror fashion.
That will be the first release milestone. The first release will allow
forming a simple distributed storage without any tricky algorithms.
Its main goal is testing and searching for potential bottlenecks in the design and implementation,
which will be handled later.
Stay tuned.
/devel/dst :: Link / Comments (0)
Sat, 07 Jul 2007
More entries and comments!
I've just added the 'more entries' plugin, so you can find
'next X entries' and 'prev X entries' at the bottom of my blog page
and page through what was written earlier without digging into the archive.
You can also comment in the blog now - I created a small captcha to prevent
bots from putting crap into my blog, which will require you to recall simple math :)
Hacking the feedback plugin to include this
captcha was fun, especially taking into account that my Perl knowledge is somewhere
between zero and void.
Enjoy.
/devel/other :: Link / Comments (11)
Fri, 06 Jul 2007
Climbing evening.
It was general endurance training today - I managed to complete 10 rounds
of 10 sets of 7 different exercises, which is a personal record.
Actually I do not plan to increase the level - it is more than enough, takes
about two hours and exhausts me completely. The main goal is not to increase power,
but to be able to keep the same level for a long time (which is especially
hard for me on the negative slope traces).
/life :: Link / Comments (1)
Sep 4 - first day of the Kernel Summit (agenda).
An FS/VM meeting is planned for that day,
so if there is some interest in areas linked to my development, I can discuss
a set of filesystem related ideas.
/devel/fs :: Link / Comments (0)
Distributed storage's state machine.
It is the heart of the distributed processing engine - it links
together network states and core block IO, thus allowing
several requests to be processed on behalf of one stream without blocking until
the first one is completed.
Since there is no in-kernel event dispatching mechanism, I need to bind
it to the ->poll() callback just like the usual sys_poll()/sys_epoll()
do. There will be one control kernel thread per storage (actually it could
be a per-NIC thread, so that each thread would be bound to a specified NIC
and only process the dataflow related to it, but I think it will complicate
the code without any noticeable gain in my setup, so for the initial implementation
there will be only one thread).
So far the main tasks to be handled via this state machine are:
- local storage being exported to remote nodes - requires listening
socket processing and accepting new clients
- non-blocking sending and receiving
- non-blocking connecting for the main node
I plan to first implement it as a simple in-kernel dispatching mechanism
unrelated to distributed storage (even in a separate module) - testing of this
system is quite simple. After it is completed, the initial implementation of the
distributed storage system without redundancy features will be completed
in a few moments. So far this is the plan for the coming weekend, and now I'm running
out to climbing training.
/devel/dst :: Link / Comments (0)
Small man's pleasures.

Now I have two - old Mach3 and this Braun.
/life :: Link / Comments (0)
Night...
Middle of the night, good weather, hammock, couple of liters of beer, Robert Heinlein's novels...
That is how I live.
/life :: Link / Comments (0)
Thu, 05 Jul 2007
UK September vacations.
I will be there from Sep 4 to 9, three days in Cambridge and two in London,
no matter if Usenix will or will not provide travel assistance; my tickets
have just arrived. I plan to meet with Meph, Ira and Abr and Tanya there and am absolutely
sure it will be fun. Although I was told that a couple of days is not enough (and that
is obviously true) to see the capital of the (lost) Empire,
I do not want our meeting to become boring after several days.
Have fun.
/devel/other :: Link / Comments (0)
The 2014 Olympic Games will be held in Sochi, Russia!
/other :: Link / Comments (0)
Wed, 04 Jul 2007
Climbing evening.
That was a hard training - really lots of traces, starting
from a couple of simple ones up to new, really complex ones.
I also tried two traces on the negative slope - one 6a+,
which I failed on-sight, but just because of lack of knowledge
about the trace (poorly selected holds, did not see some of them and so on),
so I think I will complete it next time without any problems,
and one quite complex 6c, which I managed to finish,
but after about the last third I fell every two-three holds,
since I was completely tired. Eventually I finished the trace, and it
was a really good one; I will definitely try it next time.
The last trace was with the bottom rope on the negative slope,
but since I was too tired, I was not able to complete even half of the trace
(6a+/6b) without a fall, and eventually it became a stupid set of falls
after one-two holds (sometimes I fell several meters on the rope),
but I was asked to put the rope into the trace, so I needed to complete
it even without any power.
Hard and quite exhausting, but really good training.
/life :: Link / Comments (0)
I've gotten a UK visa.
Which means I will go to the Linux Kernel Summit this year in Cambridge.
I will buy a liter of Russian vodka and bring it there; people who want to discuss
things with me will become hostages of Russian amicability.
You have been warned.
/devel/other :: Link / Comments (0)
Tue, 03 Jul 2007
Network related VM deadlock prevention.
There is a fundamental issue when doing VM operations over network
attached storage/device: right now each operation requires at least one additional
allocation (for the network protocol headers), and frequently
(in case of guaranteed delivery like in TCP) it also has to receive from the remote peer
either an acknowledgement that the data was written or the data itself, which is another
allocation. So, if the system is out of memory and wants to swap a page
over the net, it can deadlock trying to allocate space to send the page
or receive an ack.
Peter Zijlstra (and initially Daniel Phillips) proposed
an approach to fix that issue several times - they decided that the best
way is to create a reserve pool when the system is under initial pressure,
and then only provide data for sockets initially marked as 'special',
so that it would be possible to make a little progress, which would likely
be enough in most deadlock cases.
I was slowly opposed to this idea, since in the given implementation
it is possible to fail to allocate the reserve, and there is no fair way to
mark sockets as 'special' - only a couple of them were set up in the kernel,
and if it were exported to userspace, everyone could mark their own sockets
as reservable and thus effectively block the whole idea of providing
a reserve only for the real needs of deadlock avoidance.
Instead I proposed the network allocator,
which was specially designed to be used exclusively by network users.
It grabs a number of pages from main memory and uses them for skb allocations,
thus effectively not depending on main memory conditions. Such separation
is the way to go in a perfect world, but in real life there are problems too
(and one of them is the very idea of separating main system allocations from networking
ones, which raised objections from people). Although the network allocator has a set
of features especially useful in a network environment, right now I want to talk
not about it, but about deadlock avoidance.
Distributed storage
is such a device, which can suffer (actually like any other) from the situation described above,
so I need to think about how to solve it without too invasive changes in the rest of the kernel.
The best thing, I think, is to take ideas both from the network allocator and from Peter Zijlstra's ideas - I plan
to create a patch which would allow binding an independent reserve to any socket.
Such a reserve can be stolen from the socket buffer itself (each socket has a limited socket
buffer where packets are allocated from; it accounts both data and control (skb) lengths),
so when the main allocation via the common path fails, it would be possible to get data
from its own reserve. This allows sending sockets to make progress in case of deadlock.
For receiving the situation is worse, since the system does not know in advance which socket
a given packet will belong to, so it must allocate from a global pool (and thus there must be
an independent global reserve), and then exchange part of the socket's reserve for the
global one (or just copy the packet to a new one allocated from the socket's reserve if it was set up,
or drop it otherwise). A global independent reserve is what I proposed when I stopped advertising
the network allocator, but it seems that it was not taken into account, and the reserve was always allocated
only when the system was under serious memory pressure.
Why is this idea better (from my point of view) than the first two?
First, because it is not as invasive as the network allocator.
Second, it allows separating sockets and effectively makes them fair - a system administrator
or programmer can limit a socket's buffer a bit and request a reserve for special
communication channels, which will have a guaranteed ability to make both sending and receiving progress,
no matter how many of them were set up.
Third, it does not require any changes outside of the network code.
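The sending-side fallback described above (try the common allocation path first, then the socket's own reserve) can be modelled in userspace roughly as follows. This is only a sketch of the idea and not a kernel patch: sock_reserve, sock_alloc and failing_alloc are hypothetical names, the reserve is a simple bump area that is never freed, and alignment is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-socket reserve, set aside before memory gets tight. */
struct sock_reserve {
	unsigned char *mem;	/* memory taken out of the socket buffer limit */
	size_t size;		/* total size of the reserve */
	size_t used;		/* bytes already handed out */
};

typedef void *(*alloc_fn)(size_t);

/* Set up the reserve in advance. Returns 0 on success, -1 on failure. */
int sock_reserve_init(struct sock_reserve *r, size_t size)
{
	r->mem = malloc(size);
	r->size = r->mem ? size : 0;
	r->used = 0;
	return r->mem ? 0 : -1;
}

/*
 * Allocate 'size' bytes for a packet: first via the common path, and only
 * when that fails from the socket's own reserve.
 * Returns NULL when both are exhausted.
 */
void *sock_alloc(struct sock_reserve *r, size_t size, alloc_fn common_alloc)
{
	void *p;

	assert(r != NULL && common_alloc != NULL);

	p = common_alloc(size);
	if (p)
		return p;
	if (size > r->size - r->used)
		return NULL;
	p = r->mem + r->used;
	r->used += size;
	return p;
}

/* Stand-in for the common allocation path under memory pressure. */
void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}
```

The receiving side does not fit this model directly, because the owning socket is not known when the packet buffer is allocated, which is why the text asks for a separate global reserve.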
/devel/networking :: Link / Comments (0)
Distributed storage status.
The day is over, so a little information about the status
of the storage development will not hurt.
Only a couple of bits - completed the userspace part and fixed a number of bugs,
so essentially the simple tasks are over.
After setup I found a very interesting graph of readahead requests:

or maybe it is not readahead, but some other block layer feature, since these were the
first requests received just after the block device was created (actually after the generic disk was created).
Anyway, the next step is definitely a polling state machine.
Paid work requires quite a lot of my attention, so I'm not sure I will complete something
interesting tomorrow, but nevertheless, stay tuned.
/devel/dst :: Link / Comments (0)
Fucking unbelievable.
I needed to make a little hack and perform unauthorized steps
with pages on the Russian UK Visa center site to make it work
and finally say that my visa has been processed and I can
collect my passport back, with or without a visa. Their script is so
broken that it did not work with Firefox at all,
so I downloaded and fixed it a bit
(with zero knowledge of JavaScript) and then verified
the visa status on a colleague's Windows machine.
The information paper they gave me contains a wrong Moscow
telephone number (with a non-existing prefix).
If the UK embassy cannot build a site to inform about visa status and print
its own telephone number correctly, what else should I expect in the future? :)
Actually tomorrow morning I will go to the visa center and get my passport back;
if things go well, I can even participate in
glider flying
after the kernel summit.
/devel/other :: Link / Comments (0)
Hmm, I was a bit optimistic yesterday about devel time.
I moved to climbing around 6 P.M., but finished to work about 15-00 only...
So, not that many things for quite short period of time.
Let's see, what I will complete today.
/devel/dst :: Link / Comments (0)
Mon, 02 Jul 2007
Climbing evening.
That was a damn good training - warm-up traverses, a set of simple
old and new traces without rest, a set of simple traces on the negative slope walls
and eventually a trace with the bottom rope (or 'to lead'). I managed
to finish the trace with the bottom rope with just a couple of falls,
which was very surprising for me. I think I will continue to
work on the negative slope with the bottom rope during the next trainings.
/life :: Link / Comments (0)
Moscow, around 8:00 P.M.
Task has not been completed.
Things to do:
- polling state machine (complex)
- async client accepting (part of the above)
- receiving (part of the above) (complex)
- userspace code (simple)
- local disk target (simple)
- testing (infinite?)
Implemented tasks:
- moved away from device mapper to raw block device (simple)
- block layer - disk and request queue allocation, block device initialization (simple)
- configuration - initial autoconfiguration network protocol (trivial - one structure)
- networking - sending/receiving/listening per node part (simple)
- userspace configuration via ioctl (simple and a bit boring - tried to find
perfect structures, ended with usual crap)
- increased code size from several to 20 kbytes (not sure if it is a good sign,
but size is already about the same as network block device, which is much simpler)
Not that many bits, but I only worked until dinner - it was too tasty and
I had been eating a bit spartan this weekend, so I could not resist having a bit of rest
after the meal...
Stay tuned: if there are no urgent tasks at paid work, the initial implementation
will be completed very soon.
Likely tomorrow I will write a small draft of the networking communication protocol
which will be used in the distributed storage. It is simple, but should include all
possible cases.
And right now I am moving to the climbing zone.
/devel/dst :: Link / Comments (0)
Moscow, around 8:00 A.M.
I'm in the office (no, I did not sleep here, sometimes
I just wake up early - today at about 5:30), looking
at my two monitors and trying to set up a plan for the day.
In less than 10 hours I will move to the climbing zone; until then
I plan no less than to create the first version of the storage,
which is supposed to do no less than allow connecting several
remote storages and forming a single one on the local node (for the
initial implementation it will be enough to have a round-robin
writing algo without redundancy).
Time has started...
/devel/dst :: Link / Comments (0)
Morning.

Origin.
/other :: Link / Comments (0)
Sun, 01 Jul 2007
Moscow lights.

Origin.
/other :: Link / Comments (0)