Zbr's days.

Tue, 31 Jul 2007

Do you know what Xen is?

No, not virtualization...

I have not asked for inclusion of the distributed storage system in the kernel. If there is any interest in that, I will try to push it forward, but taking into account how things went before, I seriously doubt it.
Tomorrow I will either start the mirroring implementation or hack on the hash (if nothing is requested about dst, which is more than likely).

/devel/dst :: Link / Comments (0)


Climbing evening.

There were no good trainings for the last couple of weeks, so I had quite a lot of power to spend there. Since Grange is slacking somewhere, I climbed alone, so there were only traverses, boulderings and starts. Most of the training was devoted to two complex starts on the vertical wall with a finishing part after a negative horizontal slope, so this drained my strength noticeably. I have not been this tired for quite a while already, and that's really good.

/life :: Link / Comments (0)


Distributed storage announcement.

Yes, it has happened.
Here is a homepage for this project.

I will announce it on the main lists in a couple of moments.
First step is completed.

/devel/dst :: Link / Comments (0)


Testing reconnection as a failover recover in distributed storage.

I decided to release the first version of the distributed storage after this testing is completed and the ability to suspend a node is ready, without any additional redundancy algorithm being implemented.
If you think any special algo or general efforts in this area should be continued, feel free to kick me and suggest your ideas.

I will create a web page for the project, put the sources there and announce it to linux-kernel@, linux-fsdevel@ and netdev@ soon.
Stay tuned.

/devel/dst :: Link / Comments (0)


Mon, 30 Jul 2007

Distributed storage beta is ready. Testing is completed.

About 30 thousand iterations passed successfully; each iteration included mkfs and several mount/umount/copy/sync operations.
Testing was performed on x86 and x86_64 machines.

After paid work tasks are completed I will start implementing the initial failover recovery (reconnect) and then add the ability to suspend a node.

/devel/dst :: Link / Comments (0)


Sun, 29 Jul 2007

Distributed storage beta is ready.

One can translate 'beta' as fully supporting local and local exporting nodes; the system passed quite heavy tests on both x86 and x86_64 with local exporting and remote userspace nodes.
The last tricky bit, which I tried to fix for the last couple of days, was in the way polling and TCP work together; it was observed only on small systems with a 100 Mbit NIC connected to a gigabit LAN. TCP can merge several subsequent segments into one chunk (less than or equal to the MSS, which is usually 1448 bytes iirc), but ->poll() can only say whether there is data or not. Tcpdumps showed that data from the previous write request and the current read request (1024 and 24 bytes) were combined into one chunk. The local exporting state machine detected new data via polling callbacks, but after consuming the data from the write request it did not check whether there was a new request behind it. Since the polling state machine will not be invoked again until a new packet arrives, the read request sat in the receive queue forever, blocking all operations on the main node (mount is synchronous).

Right now a distributed storage formed on top of three remote nodes (one in-kernel local exporting node and two userspace targets) is trying to survive a read/write/mount/umount/sync/mkfs test, which is organized to fill the filesystem on all three nodes. I will leave this test running for the night; so far about 200 iterations have passed, let's see how it feels tomorrow.
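The missed request described above comes down to handling a single message per wakeup. A minimal userspace sketch of the fix follows; it assumes a hypothetical 8-byte header that carries the payload length (this is not the real dst wire format) and simply keeps parsing the receive buffer until no complete request is left:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire header (not the real dst format). */
struct req_hdr {
	uint32_t cmd;
	uint32_t size;	/* payload bytes following the header */
};

/*
 * Called once per "data ready" notification. Consumes every complete
 * request in buf[0..len) and returns how many bytes were used, so the
 * caller can keep the unparsed tail for the next wakeup.
 * *handled is set to the number of requests processed.
 */
static size_t drain_requests(const unsigned char *buf, size_t len,
			     unsigned int *handled)
{
	size_t off = 0;

	assert(buf != NULL && handled != NULL);
	*handled = 0;

	/* Loop instead of handling one request: TCP may have merged several. */
	while (len - off >= sizeof(struct req_hdr)) {
		struct req_hdr hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (len - off - sizeof(hdr) < hdr.size)
			break;	/* payload not fully received yet */

		off += sizeof(hdr) + hdr.size;
		(*handled)++;
	}
	return off;
}
```

With a write request and a read request merged into one chunk, a single call reports both of them, so nothing is left waiting for a wakeup that never comes.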

/devel/dst :: Link / Comments (0)


Wallpaper glueing.

Half of the room is completed. Since I forgot my phone in the office yesterday and do not have any other watch, that was a fun day: you wake up, look out at the street and do some tasks while being completely lost in time. This night I finished gluing at about 5-00, when I saw that sunrise was coming, although I did not know the time (if I had, I would likely have stopped earlier).
Only when I returned to the office (to take some food, read mail and hack) did I find that it was already about 19-00.

Wallpaper gluing is not as simple a task for one person as I expected. Actually I do not like how I did it, but it is possible that this is only because the glue has not set yet.

Here are a couple of photos of my ceiling (with and without flash).




A hook.



A bit more in gallery.

/devel/flat :: Link / Comments (0)


CFS vs SD.

Stupid politics and unfairness as usual. And Con not one who suffered. I do not care about the process, but this sentence raises a question:

that was where the SD patches fell down. They didn't have a maintainer that I could trust to actually care about any other issues than his own.
Did I read this right: Linus will not accept (or will accept only with heavy braking) patches from people outside his own circle of fame?

But that is better than "utterly overhyped" and "misbenchmarked" :)

Fuck them, just do what you like and do it well, that's what matters, everything else is just stupid.

/devel/other :: Link / Comments (2)


Sat, 28 Jul 2007

Big apartment development day.

And the start of the big apartment development weekend.

Working from about 20-00 to 3-00 this night and half of the day resulted in completed electrical wiring on the hinged ceiling, finished ceiling painting in the room, checkroom and hall (I will cover it with another layer though, since some colour is left from the previous layers), and painted radiator tubes. I also replaced the old heavy anchor bolts, which held my hammock, with new ones shaped like a hook, so it becomes quite easy to get the chains which support the hammock on and off. Hooks also remove the bending force from the chain (which might result in a torn chain: it can hold 350 kg in the usual direction, but only about 150 in bending).

All of the above could be completed only because I went to the development shop yesterday evening and bought all the needed equipment. I plan to complete the room and checkroom this weekend, and the probability of that happening is very high. The next task will be the bathroom, since washing in my spartan conditions is absolutely inconvenient.
Unfortunately there is no neon cord in the Leroy Merlin development shop, so my ceiling is not fully completed yet; I will search further, but its setup is trivial since all electric parts are already in place.
Maybe I will order a vacuum cleaner and a boiler tomorrow: the former is needed before laying the floor carpet, since it is very dusty, the latter is an essential part of the bathroom and my water system project.

/devel/flat :: Link / Comments (0)


Fri, 27 Jul 2007

Kernel Summit.

Usenix will give me a bit more money than the requested ticket costs, so we will invest a bit more into the British economy...
Know a good pub in Cambridge?

/devel/other :: Link / Comments (0)


Thu, 26 Jul 2007

Exporting local nodes in the distributed storage has entered testing stage.

Now it is possible to export local storage (a block device) to a remote system. So far it was only possible to use a userspace application as a target, but now there is a kernel target too. It is faster. It allows building tree structures of nodes to remove single points of failure. It simplifies configuration. It makes recovery simpler to implement (like reconnect, which is the next step before the first release). This was pretty simple, since the whole infrastructure was already there.

As I said, the next task is initial failover support; so far I only plan to add reconnect on error. Then I will either release the first version or (likely) add a redundancy algo (simple mirror).

And of course testing... I will start running a heavy benchmark (like iozone or any other from my great filesystem contest benchmarks) after local node exporting is tested.

And again I did not get to the development shop. Well, eventually...

/devel/dst :: Link / Comments (0)


Learning Russian the hard way.

Exceptions to the rules of Russian grammar

/other :: Link / Comments (0)


Linux sucks.

Please tell me how it is even possible in the 21st century, on a two-core system with 3.4 GHz each and two gigs of RAM, for music to stop playing when the system runs a recursive chown on a huge dir:

top - 14:36:57 up 21:39, 10 users,  load average: 0.91, 0.32, 0.12
Tasks: 151 total,   2 running, 149 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.7%us,  2.0%sy,  0.0%ni, 34.2%id, 62.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2075052k total,  1581176k used,   493876k free,   645032k buffers
Swap:  1951856k total,        0k used,  1951856k free,   434852k cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
7403 root      20   0  3988  952  632 R    3  0.0   0:04.04 chown
7366 s0mbre    20   0 27884 3776 2904 S    1  0.2   0:03.74 mplayer
This is a default FC7 installation with the most recent kernel, 2.6.22.1-27.fc7.

Crap.

/devel/other :: Link / Comments (8)


Quotation of the day.

Chris Mason wrote:

> Definitely, and I'm glad you are. You haven't converted me yet, but
> I look forward to finding the best ideas from our two approaches when
> the patches are further along (ext2 port of fsblock coming along, so
> we'll be able to have races soon :P).

I'm sure we can find some river in Cambridge, winner gets to throw Axboe in.
P.S. Both Chris Mason and Jens Axboe work at Oracle; Jens is the block layer maintainer. The talk is about how to map data to disk (file offset to block number, flags, sub-page blocks) and how to replace buffer_heads with a new extent-based tree (an rb-tree is used, although that can be slow since it requires locking; a multidimensional trie similar to the judy array used in unified socket storage allows using RCU protection).
Likely Jens Axboe worked closely with these much-hated buffer_heads :)

/devel/other :: Link / Comments (0)


Wed, 25 Jul 2007

Climbing evening.

That was quite a lazy training: Grange got lost somewhere, and there was nobody I know. I only did several traverses and a number of old boulderings, but most of the time I was stupidly sitting. Lazy...

/life :: Link / Comments (0)


Met with Abr and Tanya in "Belfast" pub.

More in gallery.

Belfast pub.

Brits out

Gorbaty bridge late night.

Gorbaty bridge

Walking Moscow night...




Culprits of the celebration - Abr and Tanya.

/life :: Link / Comments (0)


Tue, 24 Jul 2007

Distributed storage. Local node support.

I've committed local node support for distributed storage, so it is now possible to create a storage formed out of locally connected storages and remote ones using the linear mapping algorithm.

The next task is to complete local node exporting support, so that it would be possible not only to gather remote nodes using this driver, but also to export local storage to a remote system. Right now this is implemented in a userspace target, which is obviously slow because of an additional data copy to/from a mapped file and because the system works with usual files in userspace, so one filesystem (created on top of distributed storage) is placed inside another filesystem (on the target hosts).
With this type of operation it is possible to create trees of storages, eliminating single points of failure by distributing control nodes between data ones.

/devel/dst :: Link / Comments (0)


Distributed storage competitors. DRBD.

I've just read the news on Kerneltrap about DRBD going upstream. DRBD is a Distributed Replicated Block Device, which allows binding two remote storages to form a mirror. So far it only supports two nodes and has a number of questionable own implementations of work queues, threads and locks. It is very big, I do not know why. It has a 4gb limitation (u32 usage likely). There are problems with coding style.
That was the problematic part, but drbd has a major advantage: it has a user base. I used drbd with heartbeat several years ago to create a high-availability cluster at one of my previous jobs. Although it was not very stable (sometimes it was not possible to write to the system under resync, heartbeat did not detect failures and resync itself had problems), it worked. So it is possible that it will be merged into the tree.
Unfortunately it was posted neither to netdev@ nor to linux-fsdevel@, so I can not comment on it (although I'm not sure my comments would be understood correctly and not as advertising of my work :).

So, right now I'm keeping silent about drbd and instead will think about how to better handle block requests which cross the boundary between nodes, so that as many allocations as possible are eliminated.
While doing that I found a perfect way to make a multi-node mirror, btw, but I will postpone its implementation for now.

/devel/dst :: Link / Comments (0)


Block layer requirements for distributed storage.

Unfortunately the block layer requires cloning a block IO request each time it is supposed to be split: there is no way to end IO atomically from several paths. Actually that is the theory; practice shows that it is possible to call bio_endio() multiple times, since all ->bi_end_io() callbacks I saw check ->bi_size and only perform the actual IO ending when it is zero, so if it is possible to atomically decrement ->bi_size, it is safe not to clone the block IO request. That is what is done in the distributed storage state machine, although it is not perfect (and headachingly complex for me) and there is a tiny race window.
Things become worse with local node support: it (surprisingly) works, but the race window increases.
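The idea of decrementing the remaining size atomically and completing only at zero can be sketched in userspace with C11 atomics. `struct fake_bio` and its fields are stand-ins invented for this illustration, not the kernel's `struct bio`, and the sketch says nothing about the race window mentioned above:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for a block IO request; not the kernel's struct bio. */
struct fake_bio {
	atomic_uint remaining;		/* bytes not yet completed */
	unsigned int completions;	/* how many times the final step ran */
};

static void fake_bio_init(struct fake_bio *bio, unsigned int size)
{
	atomic_init(&bio->remaining, size);
	bio->completions = 0;
}

/*
 * Each node that finishes its part calls this with the number of bytes it
 * handled. Only the caller that brings the counter to zero runs the final
 * completion, so the request does not have to be cloned per node.
 * Returns true if this call completed the whole request.
 */
static bool fake_bio_end_part(struct fake_bio *bio, unsigned int done)
{
	unsigned int old = atomic_fetch_sub(&bio->remaining, done);

	assert(old >= done);	/* parts must not add up to more than the size */
	if (old != done)
		return false;	/* other parts are still in flight */

	bio->completions++;	/* real code would end the IO here */
	return true;
}
```

Because `atomic_fetch_sub` returns the previous value, exactly one caller sees the counter reach zero, whatever order the parts finish in.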

I implemented protection against data congestion, i.e. a situation when a subsequent read is issued before the write is completed. It is done using a tree of in-flight requests, so that each subsequent request can be copied to/from one in the tree, but debugging shows that such a situation never happens. This does not degrade performance, since the number of in-flight requests is very small and they never cross each other's boundaries.
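The check behind this protection is an overlap test between a new read and the writes still in flight. A minimal sketch, assuming a hypothetical `struct inflight` and using a linear scan instead of the tree for brevity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one in-flight request, in sectors. */
struct inflight {
	uint64_t start;
	uint64_t size;
	bool write;
};

/* Half-open ranges overlap if each one starts before the other ends. */
static bool ranges_overlap(const struct inflight *a, const struct inflight *b)
{
	return a->start < b->start + b->size && b->start < a->start + a->size;
}

/*
 * Returns the first in-flight write that the new read overlaps, or NULL.
 * The caller could then copy data from that write instead of asking the
 * remote node. A linear scan stands in for the tree lookup here.
 */
static const struct inflight *find_pending_write(const struct inflight *q,
						 size_t n,
						 const struct inflight *rd)
{
	size_t i;

	assert(rd != NULL && !rd->write);
	for (i = 0; i < n; i++)
		if (q[i].write && ranges_overlap(&q[i], rd))
			return &q[i];
	return NULL;
}
```

With only a handful of requests in flight the scan is cheap; a tree only starts to pay off when the queue grows.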

Another block layer issue is the size of the block request. My observations show that it never exceeds 31 pages (even during heavy syncs on my hardware), although the maximum number of pages in a bio is 256. Likely this is not because of the block layer itself but an ext3 filesystem issue; I did not investigate that.

/devel/dst :: Link / Comments (0)


Mon, 23 Jul 2007

FC7 installation bug.

I selected packages and started installation, then pressed "Release Notes" and got this:

Release notes bug

As you might expect, pressing "Close" does not close the window and instead shows an empty white-blue-gray area without letters.

The world is far from being perfect...

/devel/other :: Link / Comments (0)


No Debian/Ubuntu on desktop.

After another crash of the terminal during hacking my infinite patience suddenly ended, so I'm burning an FC7 DVD, which will replace Ubuntu on my desktop.

/devel/other :: Link / Comments (2)


Window with drops.

Window with drops

/life :: Link / Comments (0)


Sun, 22 Jul 2007

Walking summer evening...

Met with Abr and Tanya. And Anton.

Anton

Fountains.

Fountain
Fountain

Heron.

Heron

Graffiti.

Graffiti
Graffiti
Graffiti

/life :: Link / Comments (0)


Apartment.

I've finally made some small progress in my apartment development process: I spent this morning setting up electricity, and the whole electricity project seems to be close to completion (at least in the room, hall and checkroom). I've installed spot lights, completed their wiring, switches and transformer, and fixed the electricity sockets. I like my hinged ceiling even more now; it will be completed after I paint the last layer of (clouded pearl 2) colour (eventually I will buy it, sigh) and buy a blue neon cord (I've also set up its electricity supply today). I plan to go to the development shop this week, hopefully on Tuesday, and finally complete the whole room.
I also need to switch to this task as a kind of rest, since after the distributed storage state machine I feel mybrain^Wmyself a bit tired. After the hinged ceiling gets painted I will take photos of it; hopefully I will get the neon cord by that day.

/devel/flat :: Link / Comments (0)


Sat, 21 Jul 2007

Testing is completed. Distributed storage alpha is ready.

No problems appeared during this night's testing.

Alpha means that only the linear algorithm exists and I have not yet tested local nodes and local exporting. Starting on this.
The beta version will have full support for local nodes and local exporting, which will eliminate the need for userspace target support (one can still use it of course).
The first release will either be based on the beta version or will additionally include some redundancy algorithm (likely mirroring).

/devel/dst :: Link / Comments (0)


Fri, 20 Jul 2007

Distributed storage. Linear algorithm considered stable.

The preliminary tests I started yesterday did find a huge number of bugs, so I started to clean things up and ended up with a good structure which describes the whole state machine. It is complex, damn complex, but that is the price for not having any allocations in the fast path. Such mental exercises, like fixing bugs in that monster, require turning the brain on instead of the usual spinal cord, up to a headache, and force my eyes to jump out of their sockets.
But now they are all fixed: testing completed more than a hundred runs with reads, writes, mounting, unmounting, syncs and filesystem creations. It is quite slow because of the huge amount of debug prints all over the code, but it runs continuously and I plan to leave it testing for the whole night.
As before, if there are no bugs, I will move forward and commit the current stage as stable.

The next step is local and local export node testing. Then performance testing. Then a new redundancy algorithm.

What I really like in this system is how simple it is to configure the whole storage array:

./dst -n $ST -A $ALG -f /dev/dst -a kano -p 1025
./dst -n $ST -A $ALG -f /dev/dst -a 192.168.4.78 -p 1025
./dst -n $ST -A $ALG -f /dev/dst -a via -p 1025 -R
The algorithm automatically requests remote configurations and manages node info to form an array. There is no need for any table listing sizes and offsets. In dst the configuration order is significant though; it is also possible to set up the system without autoconfiguration by providing sizes and offsets via the command line. I will put into the TODO list a feature which would allow storing the node's information in the attached data itself, which would then not require preserving the order, but not now.

Taking into account how fast it was to reproduce the previous bugs and how long it has been working already, I can confirm that initial support for the linear algorithm with remote targets, without additional redundancy and failover, is ready.
But let's wait until tomorrow (about a thousand or two testing cycles with different operation modes), when testing is completed.

/devel/dst :: Link / Comments (0)


Thu, 19 Jul 2007

Distributed storage progress.

So far I have completed the network processing rewrite, so that there are zero additional allocations in the fast path (compare to two in device mapper in the best case), although the code itself is a bit ugly in places, so I will clean it up eventually. There is one possible allocation if the fast path would end up sleeping in the sending or receiving function; in that case a new request is allocated from a memory pool and queued to be processed later by a dedicated worker thread (when the socket is awakened by the bottom half or the sending route); this queue is protected via RCU.
I have not yet tested the local device node, i.e. when the storage contains a local disk as one of its nodes; the local export mode is also not yet tested. It allows exporting a local disk to remote peers; right now I use a userspace export daemon, which works with local files.

Hunting for a tricky bug took most of the last couple of days: the bug happened when huge block IO requests ended up covering multiple nodes. In device mapper this is fixed simply by allocating an additional block IO request, but since I decided to remove additional block IO allocations completely, I needed to make the dst processing engine flexible enough to allow such cases. The most tricky part was the situation when a single bio_vec, i.e. a page, happened to be on the boundary, so that parts of the page are on different nodes. Simply playing with the size and offset of the page is not allowed, since XFS requires those fields to be left untouched by the block layer (this is the only such restricting user though). A local node actually requires allocating a new BIO here, so in the local hot path there might be an additional allocation, but there are no postponed requests: I just queue the new bio at the end of the queue for the specified device by calling generic_make_request().

I've started massive kill testing of the system: the whole night it will mount/umount, read/write, sync and rebuild the filesystem; if it does not crash or freeze and the log of this crap does not contain broken data, I will consider remote transfer mode ready. When it is, local node support and export testing will not take too much time (although I already said something similar and it took me about a week to move forward, but all is well that ends well).

Because of this bug I did not get to the development shop and thus did not complete the ceiling painting, and thus still do not have wallpaper on the walls and a cover on the floor, and the whole apartment development process is in a hung state: every day I fall asleep in my hammock looking at tons of garbage, concrete and tools around. But I will fix that too, it just requires a bit more time than I expected (actually it requires exactly those several days I expected, but only if I actually work on this problem, and so far I usually do something more interesting. I can pay for completing my loft, but that is not an interesting option)...
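The arithmetic of a request that crosses a node boundary can be shown without any block layer types. The sketch below only computes the per-node pieces of a sector range; the structures and the function name are hypothetical, and it assumes the nodes are sorted by start and contiguous. It does not show how dst avoids allocating per-piece requests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One node of a linear array: covers sectors [start, start + size). */
struct lin_node {
	uint64_t start;
	uint64_t size;
};

/* One per-node piece of a request. */
struct lin_piece {
	size_t node;	/* index into the node array */
	uint64_t local;	/* first sector, relative to the node start */
	uint64_t len;	/* number of sectors in this piece */
};

/*
 * Split [start, start + len) into per-node pieces.
 * Returns the number of pieces written to out[], or 0 if the range does
 * not fit into the array or needs more than max pieces.
 */
static size_t lin_split(const struct lin_node *nodes, size_t n,
			uint64_t start, uint64_t len,
			struct lin_piece *out, size_t max)
{
	size_t i, cnt = 0;

	assert(nodes != NULL && out != NULL);
	for (i = 0; i < n && len > 0; i++) {
		uint64_t end = nodes[i].start + nodes[i].size;
		uint64_t chunk;

		if (start < nodes[i].start || start >= end)
			continue;	/* the rest does not begin in this node */
		if (cnt == max)
			return 0;

		chunk = end - start;	/* room left in this node */
		if (chunk > len)
			chunk = len;

		out[cnt].node = i;
		out[cnt].local = start - nodes[i].start;
		out[cnt].len = chunk;
		cnt++;

		start += chunk;
		len -= chunk;
	}
	return len == 0 ? cnt : 0;
}
```

For example, with two nodes of 800 sectors each, a request for 8 sectors starting at sector 796 yields two pieces: the last 4 sectors of the first node and the first 4 sectors of the second.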

Stay tuned.

/devel/dst :: Link / Comments (0)


Wed, 18 Jul 2007

Climbing evening.

That was a great one: its main advantage is that I see progress in climbing on the negative slope. It is slow and right now I still fall too many times even on simple routes (like 6a), which I do on-sight on the vertical wall. But the number of falls and the overall tiredness are smaller each time; maybe that is because I climb the same routes, but I try to change them after a short time.
Anyway this training was good: I managed to complete routes with 1-3 falls, which is a good sign, although those routes are not that complex (6a-6b).
Excellent time!

/life :: Link / Comments (0)


Block layer issues.

I wrote previously that it is impossible for the same block queue request processing function to be called simultaneously, but that is not true; at least I see in distributed storage that it is a perfectly invalid assumption: my request function is called simultaneously on a single-CPU system (with preemption enabled though). This is a big surprise, which is contrary to what is written in the comment for generic_make_request():

 * We only want one ->make_request_fn to be active at a time,
 * else stack usage with stacked devices could be a problem.
But with a debug print at the beginning and end of the request function I managed to get two subsequent starts without an end in between, so in my setup with three remote nodes in distributed storage the request function does run simultaneously for at least two requests. This requires some steps to clean up the situation; I will think more on how to fix this without serious lock contention in the hot path.
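Instead of reading debug prints, the same observation can be made with a counter. This is a userspace sketch with C11 atomics and hypothetical function names; it records the highest number of calls that were inside the function at the same time:

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int active;	/* calls currently inside the function */
static atomic_int max_seen;	/* highest concurrency observed so far */

/* Call at the very beginning of the request function. */
static void enter_request_fn(void)
{
	int now = atomic_fetch_add(&active, 1) + 1;
	int prev = atomic_load(&max_seen);

	/* Raise max_seen to 'now' unless someone already stored more. */
	while (prev < now &&
	       !atomic_compare_exchange_weak(&max_seen, &prev, now))
		;
}

/* Call right before every return from the request function. */
static void leave_request_fn(void)
{
	int old = atomic_fetch_sub(&active, 1);

	assert(old > 0);	/* leave without a matching enter */
	(void)old;
}
```

If `max_seen` ever becomes greater than one, the function was entered again before a previous call had returned.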

/devel/dst :: Link / Comments (2)


Tue, 17 Jul 2007

A sign.

I've just loaded a module, it crashed, and to check the reason I ran 'dmesg' to see the debug output.
And I saw it! I've just fscking found the reason for all the damn problems in the world. It was just like a sign from god.
This opened my eyes and freed my mind. I know the reason now.
I saw this:

[479591.764465] All bugs added by David S. Miller <davem@redhat.com>
A couple of days ago I discussed how device mapper depends on the network. Right now everything depends on the network, if not directly then via configuration utilities or some other code which depends on the network and is required for the given application. And the networking maintainer has been responsible for all your bugs, likely since day one.
It is even more than the Matrix.

/devel/other :: Link / Comments (0)


Mon, 16 Jul 2007

Climbing evening.

That was a good training: after several simple routes on the vertical wall I started to climb on the negative slope with a bottom rope. That was quite hard and as usual there were a lot of falls, but there were also positive moments besides gaining experience: I managed to complete several very interesting parts on quite complex routes, although that required hanging for quite a while, but the result justifies the resources.
Several times I almost fell in a very complex position, so I preferred to gather all the power I had (usually very little) and hang on the hold until I got the rope or a better hold, so right now quite a few muscles are aching, which is good.
Actually I fear falling; some time ago I did too, but fell without problems. Right now I prefer to hang intentionally instead of moving up and falling after several holds. I think I should change that...

/life :: Link / Comments (0)


Data congestion in distributed storage.

I gave this name to the condition when different operations are queued for the same data in the storage, for example when a read follows a write. In that case, if the write has not yet completed, there is no need to send the read request to the remote node; instead just copy the data from the write request to the read one. This is quite a rare condition for a usual system though, since a growing queue is a sign that the system does not handle the load, but it is possible in bursts, so having such an optimization is definitely a win.
This requires processing IO requests on a per-page basis instead of per block IO, so after such changes are done, the system will be able to reorder requests more easily, for example to send a write request while the remote system is waiting for read completion. It also allows not cloning the block IO request for each node covered by the given BIO, although the BIO must be cloned for local targets (actually I will think about the possibility of dropping even that cloning by manipulating the bi_idx and bi_vcnt fields).

All that is true for linear block mapping; mirroring, for example (or any other redundancy setup), requires more sophisticated block operations, so the actual algorithm remapping function gets block IO requests combined as a single BIO and then will (or will not) split them into pages.

/devel/dst :: Link / Comments (0)


National question.

National question

/other :: Link / Comments (1)


Sun, 15 Jul 2007

Distributed storage suspend mode or live data migration.

I've just thought about which feature I should add: suspend mode or live migration.
Let's say you want to change a remote node: either temporarily suspend all IO for the given node (for example to change a local disk) or completely replace one node with another (for example to switch to a different remote machine), so that during data migration from one node to another or during disk replacement all block requests which would be completed on the given node are frozen until the node is ready. Requests to other nodes should continue without stops.
Actually I would be surprised if such functionality did not exist in the existing block layer hotplug, but I do not even know how to test whether it is there or not: documentation sucks, there is no feature list (at least I do not know about one), so I will reinvent the wheel (again).
There is something like queue plug and unplug, but as usual, specs suck. I will check the LWN kernel page, I recall Jonathan Corbet wrote about it, but even if it does exist, it can not help in distributed storage, since that is a single device for the block layer and thus has only one queue, and stopping all IO requests because of one node is not politically correct I think. Such a decision should be made by the algorithm of course, since redundancy might require several nodes to be updated for a single block IO, or even more: to write to some other node the algorithm may have to read some data from the suspended node, so the algorithm is the only place which knows what IO must be frozen.

I'm thinking about whether I should release the alpha version right now (modulo testing I need to perform for local mode) or implement some other tasty things and show distributed storage only after that... Pros and cons?
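The per-node freeze described above can be reduced to a small state sketch. Everything here is hypothetical: counters stand in for real request queues, and the decision about which node a request belongs to is assumed to have been made by the algorithm already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node state; counters stand in for request queues. */
struct node_gate {
	bool suspended;
	unsigned int held;	/* requests frozen while the node is suspended */
	unsigned int sent;	/* requests passed on to the node */
};

/* Route one request: freeze it if its node is suspended, send it otherwise. */
static void gate_submit(struct node_gate *node)
{
	assert(node != NULL);
	if (node->suspended)
		node->held++;
	else
		node->sent++;
}

static void gate_suspend(struct node_gate *node)
{
	assert(node != NULL);
	node->suspended = true;
}

/* Resume the node and release everything that was frozen meanwhile. */
static unsigned int gate_resume(struct node_gate *node)
{
	unsigned int released;

	assert(node != NULL);
	released = node->held;
	node->suspended = false;
	node->sent += released;
	node->held = 0;
	return released;
}
```

Since each node has its own gate, suspending one node does not stop requests that go to the others, which a single block layer queue for the whole device could not provide.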

/devel/dst :: Link / Comments (0)


Distributed storage system.

I've added sysfs support, so the device tree looks like this (a storage named 'storage' created with two remote nodes):

/sys/devices/storage/
/sys/devices/storage/alg : alg_linear
/sys/devices/storage/n-800/type : R: 192.168.4.80:1025
/sys/devices/storage/n-800/size : 800
/sys/devices/storage/n-800/start : 800
/sys/devices/storage/n-0/type : R: 192.168.4.81:1025
/sys/devices/storage/n-0/size : 800
/sys/devices/storage/n-0/start : 0
/sys/devices/storage/remove_all_nodes
/sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
/sys/devices/storage/name : storage
As you can see, there are two nodes in the linear algorithm: the first one starts at sector 0 and has a size of 800 sectors, the second one starts at sector 800 and also has a size of 800 sectors.
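The lookup that a linear algorithm has to do for such a layout is a simple range search. A minimal sketch, with a hypothetical `struct lin_node` and function name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One node of a linear array: covers sectors [start, start + size). */
struct lin_node {
	uint64_t start;
	uint64_t size;
};

/*
 * Map an absolute sector to a node index and a node-local sector.
 * Returns -1 if the sector is outside of every node.
 */
static int lin_map(const struct lin_node *nodes, size_t n, uint64_t sector,
		   uint64_t *local)
{
	size_t i;

	assert(nodes != NULL && local != NULL);
	for (i = 0; i < n; i++) {
		if (sector >= nodes[i].start &&
		    sector - nodes[i].start < nodes[i].size) {
			*local = sector - nodes[i].start;
			return (int)i;
		}
	}
	return -1;
}
```

With the two nodes from the listing, sector 799 maps to the first node (local sector 799) and sector 800 maps to the second node (local sector 0).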

Implemented an initial failover mechanism: if there is a recoverable error (i.e. not -ENOMEM), then the appropriate algorithm callback is invoked. Right now it does not perform any action, but it could for example reconnect to the remote node and resend the block request. To implement this I need to refactor the code a bit.

Extended userspace support. To set up the above array one just needs to run the following commands:
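The dispatch can be pictured as follows. This is only a sketch: `struct alg_ops`, `handle_io_error()` and the stub callback are names invented for the example, and the stub counts attempts instead of reconnecting anywhere.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-algorithm operations; names are illustrative only. */
struct alg_ops {
	/* Returns 0 if the request may be retried, negative errno otherwise. */
	int (*error)(void *node, int err);
};

/* Per the post: everything except -ENOMEM is treated as recoverable. */
static bool error_is_recoverable(int err)
{
	return err != -ENOMEM;
}

/* Hand a failed request to the algorithm, or pass the error through. */
static int handle_io_error(const struct alg_ops *ops, void *node, int err)
{
	assert(ops != NULL && err < 0);
	if (!error_is_recoverable(err) || ops->error == NULL)
		return err;
	return ops->error(node, err);
}

/* Example callback: counts reconnect attempts instead of reconnecting. */
static int reconnect_stub(void *node, int err)
{
	(void)err;
	(*(unsigned int *)node)++;
	return 0;
}
```

A real callback would reconnect to the remote node and resend the block request where the stub only increments a counter.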
# ./dst -n storage -A alg_linear -f /dev/dst -a kano -p 1025
# ./dst -n storage -A alg_linear -f /dev/dst -a via -p 1025 -R
To remove an array:
# ./dst -n storage -A alg_linear -f /dev/dst -D
Here is a small help for the userspace options:
Usage: ./dst -n storage_name -A algorithm -b backlog -f device_path 
	-s start -S size -d local_disk -a addr -p port -r <remove> 
	-R <start array> -D <del array> -h <help>
So, to be ready for the alpha release I need to test local export (so far I only tested a userspace remote peer, which works on top of a usual file (it can be a device file though)) and local (local block device) targets.

Also watched three parts of the "Lethal Weapon" film to help my brain not explode or flow out of my ears; that was an excellent time.
Stay tuned.

/devel/dst :: Link / Comments (0)


Interesting note about device mapper.

It never checks the value returned by allocations.

Since for the hot path it uses either a memory pool or biosets (which are in turn memory pools too), and allocation always happens in process context (where sleeping is allowed), neither of them can fail, since the memory pool API internally loops forever (if sleeping is allowed in the context) until the requested data block can be obtained.

/devel/other :: Link / Comments (0)


Sat, 14 Jul 2007

More iSCSI weirdness and DST notes.

static void __iscsi_get_ctask(struct iscsi_cmd_task *ctask)
{
	atomic_inc(&ctask->refcount);
}

static void iscsi_get_ctask(struct iscsi_cmd_task *ctask)
{
	spin_lock_bh(&ctask->conn->session->lock);
	__iscsi_get_ctask(ctask);
	spin_unlock_bh(&ctask->conn->session->lock);
}

static void __iscsi_put_ctask(struct iscsi_cmd_task *ctask)
{
	if (atomic_dec_and_test(&ctask->refcount))
		iscsi_complete_command(ctask);
}

static void iscsi_put_ctask(struct iscsi_cmd_task *ctask)
{
	spin_lock_bh(&ctask->conn->session->lock);
	__iscsi_put_ctask(ctask);
	spin_unlock_bh(&ctask->conn->session->lock);
}
I tried to avoid locks and alocations as much as possible in distributed storage, but there are at least two locks in DST: one is used to put ready socket into queue to be processed by dedicated thread in process context and another one to protect queue of request for each given socket. There is also one possible additional allocation in the IO request processing path - if it is impossible to complete request when it arrived (it happens in process context on behalf of generic_make_request()) new request is allocated from memory pool, which holds given block IO request and additional fields to form processing state machine, to hold sending/receiving offsets and other fields, then new request is queued to the tail of the request queue for given state, which holds a socket. Actually there is always only one consumer (processing thread) and one producer (a node which holds state which contains socket), so it would be possible to remove additional locking, but producer is only one since block layer does not allow to run several generic_make_request() processing functions simultaneously, which can be changed in future (or not), so I will leave it as is for now.
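For illustration, this is what a lock-free queue for exactly one producer and one consumer could look like in userspace C11. It is not what DST does (the lock stays, as said above), and the ring size and all names are assumptions made for the example:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8u	/* must be a power of two */

struct spsc_ring {
	void *slot[RING_SIZE];
	atomic_uint head;	/* written only by the producer */
	atomic_uint tail;	/* written only by the consumer */
};

static void ring_init(struct spsc_ring *r)
{
	atomic_init(&r->head, 0);
	atomic_init(&r->tail, 0);
}

/* Producer side: returns false if the ring is full. */
static bool ring_push(struct spsc_ring *r, void *req)
{
	unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	assert(req != NULL);
	if (head - tail == RING_SIZE)
		return false;

	r->slot[head % RING_SIZE] = req;
	/* Release: the slot write must be visible before the new head. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer side: returns NULL if the ring is empty. */
static void *ring_pop(struct spsc_ring *r)
{
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	unsigned int head = atomic_load_explicit(&r->head, memory_order_acquire);
	void *req;

	if (head == tail)
		return NULL;

	req = r->slot[tail % RING_SIZE];
	/* Release: the slot read must finish before the slot is reused. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return req;
}
```

The scheme is only correct while there really is a single producer and a single consumer; a second producer would bring the lock back.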

Actually, looking at DST, it becomes very similar to device mapper on top of network block device. Yes, network block device has some problems, but there is a whole system in device mapper which allows using existing redundancy algorithms (like different RAID arrays), so it looks like I am just reinventing the wheel with my project (you can not even imagine how frequently I was told the same about kevent and epoll), so I will implement a simple device mapper target too. Although it will be a bit limited, it is noticeably simpler than the existing system. It does not mean I will drop what I created, there will just be two operation modes.

/devel/dst :: Link / Comments (0)


Fri, 13 Jul 2007

Friday 13. But it works.

I'm talking about distributed storage.

Initial implementation of the distributed storage with linear (remote) device mapping (details below) works in Linux now.

Linear mapping is no more than a trivial concatenation of several (remote) nodes into a single local one. There is no redundancy or failover or whatever, there is only proof-of-concept code, which allows forming a local device on top of several (remote) nodes. I created a single ext3 filesystem on top of it (part of the filesystem is on one device, part on another, but the filesystem does not know about it; it also does not know about any special operations needed to work, like those needed for NFS to operate correctly). It is quite trivial, but works ok.
Actually this is no more than a usual device mapper, but with the ability to combine local and remote nodes. Due to limitations of its config I ended up creating a third system to combine devices into one (device mapper, md (multiple device) and now distributed storage). I will definitely create a device mapper module to form remote nodes regardless of its table file limitations, since it is usually simpler for users to work with an existing system.

There are unsolved problems of course:

  • there is no block IO split, i.e. if a single block request crosses the boundary between devices, it will not be split in two, but completed with errors. This definitely must be fixed, even though the vast majority of requests are page-size only.
  • quite bad userspace support - the configuration utility is ugly (it has the same ugly char device with ioctl commands instead of netlink).
  • local export was not tested - there are only userspace remote nodes, which work on top of a usual file.
But nevertheless this is a big step forward.
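
The missing block IO split from the first item could look roughly like the sketch below. It is a standalone illustration in plain C, not code from the storage module: the structures lin_node and lin_chunk and the function lin_split() are hypothetical, and nodes are assumed to be sorted and contiguous (each node starts where the previous one ends).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one node in the linear map. */
struct lin_node {
	uint64_t start;	/* first sector of this node in the combined device */
	uint64_t size;	/* number of sectors the node provides */
};

/* One piece of a request after splitting on node boundaries. */
struct lin_chunk {
	size_t node;		/* index of the node that serves this piece */
	uint64_t offset;	/* sector offset inside that node */
	uint64_t len;		/* number of sectors in this piece */
};

/*
 * Split request [sector, sector + len) into per-node chunks.
 * Assumes nodes are sorted and contiguous, starting at sector 0.
 * Returns the number of chunks written to out, or -1 if the request
 * does not fit into the device or into the out array.
 */
static int lin_split(const struct lin_node *nodes, size_t nr_nodes,
		     uint64_t sector, uint64_t len,
		     struct lin_chunk *out, size_t max_out)
{
	size_t i, n = 0;

	assert(nodes != NULL && out != NULL);
	for (i = 0; i < nr_nodes && len > 0; i++) {
		uint64_t end = nodes[i].start + nodes[i].size;
		uint64_t part;

		if (sector >= end)
			continue;	/* request starts after this node */
		if (n == max_out)
			return -1;
		part = end - sector;	/* sectors left in this node */
		if (part > len)
			part = len;
		out[n].node = i;
		out[n].offset = sector - nodes[i].start;
		out[n].len = part;
		n++;
		sector += part;
		len -= part;
	}
	return len == 0 ? (int)n : -1;
}
```

With two nodes of 100 and 50 sectors, a request for 8 sectors starting at sector 96 becomes two chunks: 4 sectors at offset 96 of the first node and 4 sectors at offset 0 of the second one.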

There are things to think about:
  • Synchronization. Currently each block request is handled in the order it arrived - i.e. no new requests will be sent to the remote device until the current one is completed. There are a number of pros and cons for this idea, and likely things should be changed to allow a per-page tag to show which request it belongs to (or better, just put a global offset of the request) - this is similar to how iSCSI works, where each request has its tag, so that it is possible to send several of them. This is likely a good idea, but needs further investigation.
    Actually, talking about iSCSI - I doubt it will work well (at least on small systems), if the main sending function works on top of a non-blocking socket with the following loop without any polling or sleeping:
    static void iscsi_xmitworker(struct work_struct *work)
    {
    	struct iscsi_conn *conn = container_of(work, struct iscsi_conn, xmitwork);
    	int rc;
    
    	/*
     	 * serialize Xmit worker on a per-connection basis.
    	 */
    	mutex_lock(&conn->xmitmutex);
    	do {
    		rc = iscsi_data_xmit(conn);
    	} while (rc >= 0 || rc == -EAGAIN);
    	mutex_unlock(&conn->xmitmutex);
    }
    And I am not even talking about atomic-only allocations in the iSCSI stack.
    I do not say distributed storage is a perfect solution, but I designed it with quite a few issues in mind instead of following a theoretical-only design (which is what several people proposed to do before actually starting to code and find all problematic places)...
  • Failure mode. Currently if one node is broken, the whole device stops working - this must be handled by a per-algorithm decision (i.e. a redundancy algo will just update the required nodes instead of failing). But the existing linear algorithm requires that property, so it is not that bad for now.
    Another issue with failure mode is recovery - if some node was replaced, some information has to be written into it during the recovery phase. Doing that via the main node (like all existing systems, which sync via the main device) is not optimal in a distributed scenario; it would be better to tell some remote node to send the info to the new node directly instead of via the main node. But that depends on the algorithm being used, so it should be postponed until some interesting redundancy mechanism is developed (like WEAVER codes).
But overall, the first distributed storage has been formed on top of several remote nodes (with trivial autoconfig - remote device sizes are requested by the distributed storage core and that data is used accordingly to form the device).

In short, this is quite a small, but definitely a success.
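
As a note on the synchronization item: tagging each request (here with its global offset, as suggested there) is what would allow several requests to be in flight and completed out of order. Below is a minimal userspace sketch under my own assumptions - the table size, the structure names and the inflight_* functions are hypothetical and are not part of DST or iSCSI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_INFLIGHT 16	/* hypothetical limit on requests in flight */

/* Hypothetical record of a request that was sent but not yet acknowledged. */
struct inflight {
	int used;
	uint64_t tag;	/* here: global offset of the request on the device */
	void *req;	/* original block IO request (opaque in this sketch) */
};

struct inflight_table {
	struct inflight slot[MAX_INFLIGHT];
};

static void inflight_init(struct inflight_table *t)
{
	memset(t, 0, sizeof(*t));
}

/* Remember a sent request under its tag. Returns 0, or -1 if the table is full. */
static int inflight_add(struct inflight_table *t, uint64_t tag, void *req)
{
	size_t i;

	assert(req != NULL);
	for (i = 0; i < MAX_INFLIGHT; i++) {
		if (!t->slot[i].used) {
			t->slot[i].used = 1;
			t->slot[i].tag = tag;
			t->slot[i].req = req;
			return 0;
		}
	}
	return -1;
}

/* A reply carrying the tag arrived: return the matching request and free its slot. */
static void *inflight_complete(struct inflight_table *t, uint64_t tag)
{
	size_t i;

	for (i = 0; i < MAX_INFLIGHT; i++) {
		if (t->slot[i].used && t->slot[i].tag == tag) {
			t->slot[i].used = 0;
			return t->slot[i].req;
		}
	}
	return NULL;	/* unknown tag, e.g. a duplicate reply */
}
```

A linear scan is enough for a handful of requests; a real implementation would likely want a tree or hash keyed by the tag.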

/devel/dst :: Link / Comments (0)


Thu, 12 Jul 2007

Ok, back to distributed storage.

I sort of completed the uninteresting tasks and am finally back in good shape, so expect interesting bits soon.

So far, the challenge is to perform as few additional allocations per IO request as possible.
For example there is at least one additional allocation of struct bio each time a new block is going to be written/read to/from disk (plus struct request, but I'm not sure - I last looked into block layer code too long ago to remember all the bits). Device mapper adds another two - a cloned BIO and its own request.
Then the network will add (at least) another one.
The first and the last ones are unavoidable (actually they can be removed, at least the skb allocation I can work around, but that will have questionable error-prone consequences, so no need for such a hack for now), but allocations in between, i.e. in the control layer of the distributed storage, must be reduced to the very minimum. The initial state machine I released works with one request at a time, since each state does not have a request queue.
Network block device, on the contrary, does not have any additional allocations, but it is purely synchronous, so it does not need to keep track of sent requests.

So, the problem is stated as follows: a node (or state) needs to have a queue/tree/whatever of partially processed requests. Each request should have a pointer to the original block IO request. Each state (or node) should have a callback, which will be invoked by the core of the state machine when input or output processing can happen, so that the callback will get events and process them in order.
That is essentially what I'm developing right now.
Stay tuned.
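
The structure described above (a per-node queue of partially processed requests, each pointing to the original block IO request, plus a callback invoked by the core) can be sketched in plain C as follows. This is an illustration only: all names (st_node, st_request, st_output, st_dispatch) are hypothetical, and the socket is simulated by a fixed number of bytes accepted per output event.

```c
#include <assert.h>
#include <stddef.h>

enum { ST_EV_IN = 1, ST_EV_OUT = 2 };	/* hypothetical event bits */

struct st_request {
	struct st_request *next;
	void *bio;	/* original block IO request (opaque here) */
	size_t done;	/* bytes already processed */
	size_t size;	/* total bytes to process */
};

struct st_node;
typedef void (*st_callback_t)(struct st_node *node, unsigned int events);

struct st_node {
	struct st_request *head, *tail;	/* partially processed requests, FIFO */
	st_callback_t callback;		/* invoked by the core on events */
	size_t chunk;			/* bytes the simulated socket takes per event */
	unsigned int completed;		/* number of fully processed requests */
};

/* Append a request to the tail of the node's queue. */
static void st_queue(struct st_node *node, struct st_request *req)
{
	assert(req->done <= req->size);
	req->next = NULL;
	if (node->tail)
		node->tail->next = req;
	else
		node->head = req;
	node->tail = req;
}

/* Example callback: push the oldest request forward by one chunk. */
static void st_output(struct st_node *node, unsigned int events)
{
	struct st_request *req = node->head;
	size_t left;

	if (!(events & ST_EV_OUT) || !req)
		return;
	left = req->size - req->done;
	req->done += left < node->chunk ? left : node->chunk;
	if (req->done == req->size) {
		/* Fully processed: drop it from the queue. */
		node->head = req->next;
		if (!node->head)
			node->tail = NULL;
		node->completed++;
	}
}

/* Core of the state machine: hand the events to the node's callback. */
static void st_dispatch(struct st_node *node, unsigned int events)
{
	if (node->callback)
		node->callback(node, events);
}
```

A 10 byte request on a node that accepts 4 bytes per event needs three output events to complete, and stays at the head of the queue in between.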

/devel/dst :: Link / Comments (0)


Wed, 11 Jul 2007

Climbing evening.

That was a really hard one. First, because I had a hangover. Not that heavy, but the bottle of beer I had yesterday was quite bad, so my stomach got a full rainbow of crappy feelings during the day. In the evening it was much better, but after several warm-up traverses and simple traces things became worse.
Eventually we started to climb with bottom rope, and that drained my power completely. Although I fell on every third hold, I did try to complete all the traces which were started. Climbing the same trace I previously led, but this time without holding the rope, was especially interesting. And although I did not have any power left by then, it was much better than climbing the same trace with bottom rope.
Actually that was only the third training with bottom rope in particular and on the negative slope in general for the last half of the year, so I think things are going quite well. Give me some time, and it will be about the same as climbing on the vertical walls.
Excellent time.

/life :: Link / Comments (2)


English words.

If you know me, then you have likely also noticed that I can frequently form very interesting many-storied indecent sentences.

But I'm completely lost in that area in English, so phrases like

it's driving me absolutely bananas
from a native English speaker really blow my mind.

/other :: Link / Comments (3)


Tue, 10 Jul 2007

Kernel ->poll() based network state machine released.

I released the initial kernel state machine used in the 'echo' in-kernel "server" for educational purposes at readers' request.
I do not know if it can be useful as is. For studying it is likely good code, but taking into account that there is not a single comment in the code, I'm not that sure.

Anyway, the interested reader can find the kernel ->poll() based network state machine in the archive.

/devel/dst :: Link / Comments (4)


Mon, 09 Jul 2007

Climbing evening.

That was an interesting and heavy training. After a couple of initial traverses and some simple old traces, Grange and I started bottom rope climbing on the negative slope.
Eventually three traces (quite simple, from 5c+ to 6a+) were completed, if that word is appropriate in this context - I fell several times even on the simplest trace, and although I was not hanging for long, just to relax a bit, I think the trace was not really completed. Although I am generally quite bad on the negative slope, this was a good training, since it showed problematic cases and forces me to improve the results with (I think short) time.

/life :: Link / Comments (0)


Sun, 08 Jul 2007

High-performance in-kernel 'echo' server.

Also known as the distributed storage network state machine; it is ready after several hours this night with a bottle of Martini and some of today's time.
It has not been integrated into the storage module yet and exists on its own, but that was done deliberately to allow easier testing. This standalone module allows creating a listening socket, binding it to a specified address (hardcoded though, since that data is provided by the userspace control in the distributed storage module), listening for connections, accepting new clients and reading data from them (I did not test writing, since it is exactly the same).
All operations are non-blocking and are handled via special worker thread(s), which also dispatch events via ->poll() callbacks and form a simple state machine.
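
For readers who want the general shape of such a worker without reading kernel code, here is a userspace analogy built on POSIX poll(). It is not the released module: the in-kernel version hooks into ->poll() callbacks directly, while this sketch only shows one pass of "wait for events, then dispatch them to per-socket callbacks". The names net_state, state_read_cb and worker_run_once are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_STATES 8	/* hypothetical limit for this sketch */

struct net_state;
typedef void (*state_cb_t)(struct net_state *st, short revents);

/* Hypothetical per-descriptor state with its event callback. */
struct net_state {
	int fd;
	short events;		/* events we are interested in, e.g. POLLIN */
	state_cb_t cb;		/* invoked when one of them fires */
	size_t received;	/* bytes read so far */
};

/* Example callback: read one chunk of available data and count the bytes. */
static void state_read_cb(struct net_state *st, short revents)
{
	char buf[256];
	ssize_t n;

	if (!(revents & POLLIN))
		return;
	n = read(st->fd, buf, sizeof(buf));
	if (n > 0)
		st->received += (size_t)n;
}

/*
 * One pass of the worker: wait up to timeout_ms for events on all states
 * and dispatch them. Returns the number of states that got an event,
 * 0 on timeout or -1 on poll() error.
 */
static int worker_run_once(struct net_state *states, size_t nr, int timeout_ms)
{
	struct pollfd pfd[MAX_STATES];
	size_t i;
	int ready, dispatched = 0;

	assert(nr <= MAX_STATES);
	for (i = 0; i < nr; i++) {
		pfd[i].fd = states[i].fd;
		pfd[i].events = states[i].events;
		pfd[i].revents = 0;
	}
	ready = poll(pfd, (nfds_t)nr, timeout_ms);
	if (ready <= 0)
		return ready;
	for (i = 0; i < nr; i++) {
		if (pfd[i].revents) {
			states[i].cb(&states[i], pfd[i].revents);
			dispatched++;
		}
	}
	return dispatched;
}
```

A worker thread would simply call worker_run_once() in a loop; accepting clients and sending fit the same scheme with other callbacks and POLLOUT.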

So, essentially the network backend for distributed storage is ready.
The next task is to integrate this state machine into the distributed storage module.
I have about an hour right now, so I will add support for a local node in the storage.
Then a trivial distribution algorithm should be created (actually it already exists, but it only sends data to a single node), which will send/receive data to several nodes in round-robin or mirror format.
That will be the first release milestone. The first release will allow forming a simple distributed storage without tricky algorithms being used. Its main goal is testing and searching for potential bottlenecks in the design and implementation, which will be handled further.
Stay tuned.

/devel/dst :: Link / Comments (0)


Sat, 07 Jul 2007

More entries and comments!


I've just added the 'more entries' plugin, so you can find 'next X entries' and 'prev X entries' at the bottom of my blog page and page through what was written earlier without digging into the archive.

You can also comment in the blog now - I created a small captcha to prevent bots from putting crap into my blog, which will require you to recall simple math :)
Hacking the feedback plugin to include this captcha was fun, especially taking into account that my Perl knowledge is somewhere between zero and void.

Enjoy.

/devel/other :: Link / Comments (11)


Fri, 06 Jul 2007

Climbing evening.


It was general endurance training today - I managed to complete 10 rounds of 10 sets of 7 different exercises, which is a personal record.
Actually I do not plan to increase the level - it is more than enough, takes about two hours and exhausts me completely. The main goal is not to increase power, but to be able to keep the same level for a long time (which is especially hard for me on the negative slope traces).

/life :: Link / Comments (1)


Sep 4 - first day of the Kernel Summit (agenda).


An FS/VM meeting is planned for that day, so if there is some interest in areas linked to my development, I can discuss a set of filesystem related ideas.

/devel/fs :: Link / Comments (0)


Distributed storage's state machine.


It is the heart of the distributed processing engine - it links together network states and core block IO, thus allowing several requests to be processed on behalf of one stream without blocking until the first one is completed.
Since there is no in-kernel event dispatching mechanism, I need to bind it to the ->poll() callback just like the usual sys_poll()/sys_epoll() does. There will be one control kernel thread per storage (actually it could be a per-NIC thread, so that each thread would be bound to a specified NIC and only process dataflow related to it, but I think it will complicate the code without any noticeable gain in my setup, so for the initial implementation there will be only one thread).
So far the main tasks to be handled via this state machine are:

  • local storage being exported to remote nodes - requires listening socket processing, new clients accepting
  • non-blocking sending and receiving
  • non-blocking connecting for main node
I plan to first implement it as a simple in-kernel dispatching mechanism unrelated to distributed storage (even in a separate module) - testing of this system is quite simple. After it is completed, the initial implementation of the distributed storage system without redundancy features will be completed in a few moments. So far this is the plan for the coming weekend, and now I'm running out to climbing training.

/devel/dst :: Link / Comments (0)


Small man's pleasures.


Braun razor

Now I have two - old Mach3 and this Braun.

/life :: Link / Comments (0)


Night...


Middle of the night, good weather, hammock, couple of liters of beer, Robert Heinlein's novels...
That is how I live.

/life :: Link / Comments (0)


Thu, 05 Jul 2007

UK September vacations.


I will be there from Sep 4 to 9, three days in Cambridge and two in London, no matter whether Usenix will or will not provide travel assistance; my tickets have just arrived. I plan to meet with Meph, Ira and Abr and Tanya there and am absolutely sure it will be fun. Although I was told that a couple of days is not enough (and that is obviously true) to see the capital of the (lost) Empire, I do not want our meeting to become boring after several days.
Have fun.

/devel/other :: Link / Comments (0)


The 2014 Olympic Games will be held in Sochi, Russia!

/other :: Link / Comments (0)


Wed, 04 Jul 2007

Climbing evening.


That was a hard training - really lots of traces, starting from a couple of simple ones down to new really complex ones.
I also tried two traces on the negative slope - one 6a+, which I failed on-sight, but just because of lack of knowledge about the trace (poorly selected holds, did not see some of them and so on), so I think I will complete it next time without any problems, and one quite complex 6c, which I managed to finish, but from about the last third I fell after every two or three holds, since I was completely tired. Eventually I finished the trace, and that was a really good one, I will definitely try it next time.
The last trace was with bottom rope on the negative slope, but since I was too tired, I was not able to complete even half of the trace (6a+/6b) without a fall, and eventually that became a stupid set of falls after every one or two holds, sometimes I fell several meters on the rope, but I was asked to put the rope into the trace, so I needed to complete it even without any power.
Hard and quite exhausting, but really good training.

/life :: Link / Comments (0)


I've gotten a UK visa.


Which means I will go to the Linux Kernel Summit in Cambridge this year.

I will buy a liter of Russian vodka and bring it there; people who want to discuss things with me will become hostages of Russian amicability.
You have been warned.

/devel/other :: Link / Comments (0)


Tue, 03 Jul 2007

Network related VM deadlock prevention.


There is a fundamental issue when doing VM operations over a network attached storage/device - each operation currently requires at least one additional allocation (for the network protocol headers), and frequently (in case of guaranteed delivery like in TCP) it also has to receive from the remote peer either an acknowledgement that the data was written or the data itself, which is another allocation. So, if the system is out of memory and wants to swap a page over the net, it can deadlock trying to allocate space to send the page or receive an ack.

Peter Zijlstra (and initially Daniel Phillips) proposed several times an approach to fix that issue - they decided that the best way is to create a reserve pool when the system is under initial pressure, and then only provide data for sockets initially marked as 'special', so that it would be possible to make small progress, which would likely be enough in most deadlock cases.
I was opposed to this idea, since in the given implementation it is possible to fail to allocate the reserve, and there is no fair way to mark sockets as 'special' - only a couple of them were set up in the kernel, and if it were exported to userspace, everyone could mark their own sockets as reservable and thus effectively defeat the whole idea of providing the reserve only for the real needs of deadlock avoidance.

Instead I proposed the network allocator, which was specially designed to be used exclusively by network users.
It grabs a number of pages from main memory and uses them for skb allocations, thus effectively not depending on main memory conditions. Such separation is the way to go in a perfect world, but in real life there are problems too (one of them is the very idea of separating main system allocations from networking ones, which raised objections from people). Although the network allocator has a set of features especially useful in a network environment, right now I want to talk not about it, but about deadlock avoidance.

Distributed storage is such a device, which can suffer (actually like any other) from the situation described above, so I need to think about how to solve it without too invasive changes in the rest of the kernel.
The best thing, I think, is to take ideas both from the network allocator and from Peter Zijlstra's work - I plan to create a patch which would allow binding an independent reserve to any socket. Such a reserve can be taken from the socket buffer itself (each socket has a limited socket buffer where packets are allocated from, it accounts both data and control (skb) lengths), so when the main allocation via the common path fails, it would be possible to get data from the socket's own reserve. This allows sending sockets to make progress in case of deadlock.
For receiving the situation is worse, since the system does not know in advance which socket a given packet will belong to, so it must allocate from a global pool (and thus there must be an independent global reserve), and then exchange part of the socket's reserve for the global one (or just copy the packet to a new one allocated from the socket's reserve if it was set up, or drop it otherwise). A global independent reserve is what I proposed when I stopped advertising the network allocator, but it seems that it was not taken into account, and the reserve was always allocated only when the system had serious memory pressure.

Why is this idea better (from my point of view) than the first two?
First, because it is not as invasive as the network allocator.
Second, it allows separating sockets and effectively makes them fair - a system administrator or programmer can limit a socket's buffer a bit and request a reserve for special communication channels, which will have a guaranteed ability to make both sending and receiving progress, no matter how many of them were set up.
Third, it does not require any changes outside of networking.
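
The sending side of this idea (try the common allocation path first, fall back to the socket's own reserve when it fails) can be illustrated with a small userspace sketch. None of this is an existing kernel API: sock_reserve, sock_alloc() and the oom_alloc() stand-in for exhausted main memory are hypothetical, and the reserve is a trivial bump allocator without freeing or alignment handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-socket reserve carved out of the socket buffer budget. */
struct sock_reserve {
	unsigned char *pool;
	size_t size;	/* total bytes in the reserve */
	size_t used;	/* bytes already handed out */
};

/* The "common path" allocator is passed in so exhaustion can be simulated. */
typedef void *(*main_alloc_t)(size_t len);

/* Stand-in for a system that is out of memory: every allocation fails. */
static void *oom_alloc(size_t len)
{
	(void)len;
	return NULL;
}

/* Set up the reserve in advance, while memory is still available. */
static int reserve_init(struct sock_reserve *r, size_t size)
{
	r->pool = malloc(size);
	r->size = r->pool ? size : 0;
	r->used = 0;
	return r->pool ? 0 : -1;
}

/*
 * Allocate len bytes for a packet: common path first, reserve second.
 * *from_reserve tells the caller where the memory came from.
 */
static void *sock_alloc(struct sock_reserve *r, size_t len,
			main_alloc_t main_alloc, int *from_reserve)
{
	void *p = main_alloc(len);

	assert(from_reserve != NULL);
	if (p) {
		*from_reserve = 0;
		return p;
	}
	if (r->size - r->used < len)
		return NULL;	/* reserve exhausted too */
	p = r->pool + r->used;
	r->used += len;
	*from_reserve = 1;
	return p;
}
```

The important property is that the reserve is allocated before the memory pressure starts, so the fallback itself never needs the common path.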

/devel/networking :: Link / Comments (0)


Distributed storage status.


The day is over, so a little information about the status of the storage development will not hurt.
Just a couple of bits - completed the userspace part and fixed a number of bugs, so essentially the simple tasks are over.
After setup I found a very interesting graph of readahead requests:

Initial readahead

or maybe it is not readahead, but some other block layer feature, since these were the first requests received just after the block device was created (actually after the generic disk was created).

Anyway, next step is definitely a polling state machine.
Paid work requires quite a lot of my attention, so I'm not sure I will complete something interesting tomorrow, but nevertheless, stay tuned.

/devel/dst :: Link / Comments (0)


Fucking unbelievable.


I needed to do a little hacking and perform unauthorized steps with pages of the Russian UK visa center site to make it work and finally tell me that my visa has been processed and I can collect my passport, with or without the visa. Their script is so broken that it did not work with Firefox at all, so I downloaded and fixed it a bit (with zero knowledge of JavaScript) and then verified the visa status on a colleague's Windows machine.
The information paper they gave me contains a wrong Moscow telephone number (with a non-existent prefix).
If the UK embassy can not build a site to inform about visa status or print its own telephone number correctly, what else should I expect in the future? :)

Actually tomorrow morning I will go to the visa center and get my passport back; if things go well, I can even participate in glider flying after the kernel summit.

/devel/other :: Link / Comments (0)


Hmm, I was a bit optimistic yesterday about devel time.


I left for climbing around 6 P.M., but finished work only at about 15:00...
So, not that many things got done in such a short period of time.
Let's see what I will complete today.

/devel/dst :: Link / Comments (0)


Mon, 02 Jul 2007

Climbing evening.


That was a damn good training - warm-up traverses, a set of simple old and new traces without rest, a set of simple traces on the negative slope walls and eventually a trace with the bottom rope (or 'to lead'). I managed to finish the trace with the bottom rope with just a couple of falls, which was very surprising for me. I think I will continue to work on the negative slope with bottom rope during the next trainings.

/life :: Link / Comments (0)


Moscow, around 8:00 P.M.


Task has not been completed.

Things to do:

  • polling state machine (complex)
  • async client accepting (part of the above)
  • receiving (part of the above) (complex)
  • userspace code (simple)
  • local disk target (simple)
  • testing (infinite?)
Implemented tasks:
  • moved away from device mapper to raw block device (simple)
  • block layer - disk and request queue allocation, block device initialization (simple)
  • configuration - initial autoconfiguration network protocol (trivial - one structure)
  • networking - sending/receiving/listening per node part (simple)
  • userspace configuration via ioctl (simple and a bit boring - tried to find perfect structures, ended with usual crap)
  • increased code size from several to 20 kbytes (not sure if it is a good sign, but size is already about the same as network block device, which is much simpler)
Not that many bits, but I only worked until dinner - it was too tasty and I had been eating a bit spartan this weekend, so I could not resist having a bit of rest after the meal...

Stay tuned; if there are no urgent tasks at paid work, the initial implementation will be completed very soon.

Likely tomorrow I will write a small draft of the network communication protocol which will be used in the distributed storage. It is simple, but should include all possible cases.
And right now I am moving to the climbing zone.

/devel/dst :: Link / Comments (0)


Moscow, around 8:00 A.M.


I'm in the office (no, I did not sleep here, sometimes I just wake up early - today at about 5:30), looking at my two monitors and trying to set up a plan for the day.
In less than 10 hours I will move to the climbing zone; until then I plan no less than to create the first version of the storage, which is supposed to at least allow connecting several remote storages and forming a single one on the local node (for the initial implementation it will be enough to have a round-robin writing algo without redundancy).
Time has started...

/devel/dst :: Link / Comments (0)


Morning.


Sova

Origin.

/other :: Link / Comments (0)


Sun, 01 Jul 2007

Moscow lights.


City lights

Origin.

/other :: Link / Comments (0)