Zbr's days.

Wed, 31 Oct 2007

Good resting place.

It was a fun evening - we celebrated my boss' birthday at a karting center, where we got bloody outstandingly good home-brew, which flushed brains after 50 grams. Later Grange, Masha and I moved around and found a really good place called 'The last drop' on Strastnoy avenue, where we sat until about 1 o'clock, which was really cool.
Especially good are these excerpts from the safety measures:

  • it is forbidden to start drinking special drinks without the address for delivering the body in the back left pocket
  • it is forbidden to start drinking special drinks without money for delivering the body in the back left pocket
  • it is forbidden to get 'Chikatilo' or 'First drop' mixes without a helmet of any colour
  • it is forbidden to mix in non-alcoholic drinks, to avoid a morning hangover
  • it is forbidden to use barmen to bring your date into heavy alcohol intoxication with selfish ends and without her agreement
I want to go there again :)

/life :: Link / Comments (1)


Distributed storage checksums.

I've changed the code to use the crc32c (Castagnoli) checksum instead of Adler - it is now used in both the kernel and the userspace target. The code is being tested in various interesting configurations (for example, a linear array is exported to a remote node, where it is added to a mirror setup with another remote node running the userspace target), and bugs were fixed, especially in unusual cases (like the one I described yesterday, when mirroring with only one node failed to set up).
Checksumming is also being tested by injecting errors after data has been written by the test client.
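
For reference, a minimal bitwise sketch of the crc32c (Castagnoli) checksum in C. This is not the DST code: it assumes the common CRC-32C parameters (reflected polynomial 0x82F63B78, initial value and final XOR of 0xFFFFFFFF), and the name crc32c_buf is made up for this example. The in-kernel helper and DST itself may seed or finalize the value differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reflected form of the Castagnoli polynomial 0x1EDC6F41. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Hypothetical helper: bit-at-a-time CRC-32C over a buffer. */
static uint32_t crc32c_buf(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t crc = 0xFFFFFFFFu;	/* assumed initial value */
	size_t i;
	int bit;

	assert(data != NULL || len == 0);

	for (i = 0; i < len; i++) {
		crc ^= p[i];
		for (bit = 0; bit < 8; bit++) {
			/* Shift out one bit; apply the polynomial if it was set. */
			crc = (crc >> 1) ^
			      (CRC32C_POLY_REFLECTED & (0u - (crc & 1u)));
		}
	}
	return ~crc;	/* assumed final inversion */
}
```

With these parameters the nine ASCII bytes "123456789" give 0xE3069283, and flipping any single bit of the data, as in the error-injection test above, changes the result. A table-driven or hardware-assisted version would be used where speed matters.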

I would like to finish this today but have to go...

/devel/dst :: Link / Comments (0)


Tue, 30 Oct 2007

Simplified DST protocol.


Also added flags for future extensions, fixed a bug in mirror setup with only a single node, added more aggressive protocol checks and did overall cleanups.
This is a good release candidate, but I want more testing before the final commit and new release, especially because of strong checksums, which require protocol changes. I also expect to have a Windows target tomorrow.

Although tomorrow will be a somewhat shorter day, I expect to make a new release.
Stay tuned.

/devel/dst :: Link / Comments (0)


Asynchronous event notification infrastructure.

Does it sound familiar?

Jeff Garzik proposed async events for SCSI/SATA notifications. It is implemented simply as helpers for kobject uevents, which in turn are netlink messages, so it has its pros and cons.

/devel/other :: Link / Comments (0)


Mon, 29 Oct 2007

Strong checksumming in DST.

I've implemented strong checksumming in DST (I use the Fletcher algorithm from RFC 1146, which is the TCP alternate checksum options RFC) and also fixed a number of bugs, so there is going to be a new release tomorrow.
I also hope that the Windows DST target will be completed tomorrow (this does not depend on me - I'm not its author).
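
A minimal sketch of a Fletcher checksum in C, in the common modulo-255 formulation over bytes (often called Fletcher-16). It is an illustration only: which Fletcher variant DST uses and how it places the two sums on the wire is not stated here, and the name fletcher16 is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Two running sums modulo 255: 'a' is the plain sum of the bytes,
 * 'b' is the sum of all intermediate values of 'a', which makes the
 * checksum sensitive to byte order.
 */
static uint16_t fletcher16(const uint8_t *data, size_t len)
{
	uint32_t a = 0, b = 0;
	size_t i;

	assert(data != NULL || len == 0);

	for (i = 0; i < len; i++) {
		a = (a + data[i]) % 255;
		b = (b + a) % 255;
	}
	/* Packing 'b' into the high byte is a convention of this sketch. */
	return (uint16_t)((b << 8) | a);
}
```

For the ASCII string "abcde" this returns 0xC8F0. Because the second sum depends on the position of each byte, swapping two neighbouring bytes usually changes the result, which a plain additive checksum would miss.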

Stay tuned.

/devel/dst :: Link / Comments (2)


Grange became...

He could have been an artist or a scientist. He could have flown into space or treated people. He could have been a musician or made cookies.

Fortunately that did not happen, and he became a man who never knows how things are going, but wants to tell everyone that the work is being done very badly and does not scale, so when the deadline is reached, everyone will be f...

On the whole, he became the CTO of the software department in the company he works for.
RIP My congratulations! :)

/life :: Link / Comments (2)


Sun, 28 Oct 2007

Museum of 'actual' art in Moscow.

I visited the ART4RU museum today - this is the first private museum devoted to 'actual' (contemporary) art. Well, I never liked modern art, so I was not surprised that the exhibition did not raise a lot of positive emotions, but it was interesting.
Interesting, since it has been quite a while since I last visited an art museum, and interesting, since I never (by my own decision) wanted to see (post)modernist art.
I can recommend it to those who like, for example, Andy Warhol, and who like pictures made of dead (or, I hope, only models of dead) rats (yes, there is such a composition there)...

It was interesting, but it is quite unlikely I will visit it again soon, although its collection is regularly updated and there are a number of items I really liked.

/life :: Link / Comments (0)


Sat, 27 Oct 2007

Decimal calculus.

It appeared when people started to use fingers to count objects.
I'm afraid to imagine how we developed fractions...

/other :: Link / Comments (0)


DST merging plans.

Andrew Morton asked about the status of the distributed storage and noted that there is actually no reason not to think about merging it.
He is concerned about the quite active development he sees via my blog, but DST itself is essentially complete.
It will likely require some additional features after I start distributed filesystem development, but right now it does not. Maybe I will add optional strong checksumming of the transferred data though.

/devel/dst :: Link / Comments (7)


Fri, 26 Oct 2007

Climbing evening.

That was a bloody perfect training - although I did not do that many traces (actually only two - one quite old with a jump, and a new one, which I really wanted to complete on-sight, but failed, since I got a bit tired waiting for another climber, who was on the route), I did quite a lot of bouldering-like starts from new traces.
Although my foot still aches quite strongly, it does not hurt when I climb, which I can not say about the process of getting the shoes off - that is real crap...
But nevertheless it was good - I think I'm slowly getting my old shape back, so eventually I will be in good form; today's training showed just that.

/life :: Link / Comments (0)


Ok, the nearest development plan.

There are two companies that wanted to work with me on distributed storage - both keep silent. I am disappointed about that, but of course can not and do not want to blame them. It is their business...

So, while part of the brain thinks about an interesting on-disk filesystem layout, I decided to devote most of it to the following problem:
let's suppose that we have a map of objects (only points - 2D coordinates of a number of dots), and a small piece of it, possibly at a different scale and rotated around some unknown point.

The problem is to find which part of the original map this piece comes from.
I'm pretty sure this is not a trivial problem, and there are likely some solutions out there, but I want to invent my own.

Stay tuned.

/devel/other :: Link / Comments (0)


Byte range locking algorithm.

I've completed it and tested it with one million users writing to the same 100 Mb area with random offsets and sizes - that was of course done in a simulator, without actual writing, since its main goal is to measure lock contention and how well it is avoided.
Unfortunately this simulation can not say anything about the real performance of such an approach, since there are no timings for each write or for the cost of waiting for completion, so I will either extend this test to support a tree of timers or postpone this until there is a real filesystem to test on.

Right now I'm thinking more about how on-disk structures should be organized for the maximum performance.

/devel/fs :: Link / Comments (0)


Crypto analysis.

Dr. Bernstein offers a prize for analysis of his cipher: no matter if it is partial (a $1k prize was awarded for 5-round differential cryptanalysis of the Salsa20 cipher) or complete, first of all it should be interesting. I personally do not like Dr. Bernstein very much because of his code licenses, difficult communication and occasional self-conceit (likely because I have the same 'problems', except the license :), but he is a real hacker - he does things that are interesting to him for fun (and money, which I believe is not an issue).

Some time ago I even knew what that is: I tried to perform various kinds of cryptanalysis (trivial, of course), and I will likely return to these topics eventually: time is the only limiting factor here.

/devel/other :: Link / Comments (0)


Byte range locking.

For the last couple of days I worked on a tricky algorithm for byte-range locking, which is especially interesting for distributed systems.

Consider an input which consists of an ordered number of items (for example a usual file - an ordered set of bytes), and a (potentially huge) number of users who want to read or write to arbitrary contiguous areas of the input (like a number of threads that want to read or write to the same file with random offsets and data sizes). Without proper byte-range locking this scenario clearly shows a bottleneck with locking contention.

My algorithm is quite simple - each user requests the ability to write to its area and locks only that area during the write. Any subsequent write to a non-overlapping area is performed in parallel with other writes, while a write to an overlapping region suspends the second write (the one which overlaps the current one) until the current one is completed. The second write grabs a lock (actually there are no locks, but a tree of permitted writes) for the region which is the union of the two overlapping regions (i.e. their concatenation), so that a third write, which overlaps one of the first two (current or subsequent), is postponed too. When the underlying write is completed, it schedules the subsequent write, which in turn will schedule the next subsequent write to the overlapping region. Thus all writes become asynchronous and writes to non-overlapping regions are performed in parallel. This does not match POSIX semantics (since writes to overlapping regions are postponed and performed in order, writes to other regions can complete earlier), but at any given time any read from a given region will return correct data for that region.
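
The admission step described above can be sketched in C roughly as follows. This is only an illustration under several assumptions: a small flat array stands in for the tree of permitted writes, completion and the scheduling of postponed writes are left out, and all names (struct range, struct lock_table, request_write, MAX_ACTIVE) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ACTIVE 16	/* hypothetical table capacity */

/* Half-open byte range [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Flat stand-in for the tree of permitted writes; entries never overlap. */
struct lock_table {
	struct range active[MAX_ACTIVE];
	int num;
};

static int ranges_overlap(const struct range *a, const struct range *b)
{
	return a->start < b->end && b->start < a->end;
}

/*
 * Returns 1 if the write can start right away, 0 if it overlaps an
 * in-flight region and has to be postponed, -1 if the table is full.
 */
static int request_write(struct lock_table *t, struct range w)
{
	int i, kept = 0, postponed = 0;

	assert(w.start < w.end);

	for (i = 0; i < t->num; i++) {
		struct range *r = &t->active[i];

		if (ranges_overlap(r, &w)) {
			/* Widen the request to the union of both regions. */
			if (r->start < w.start)
				w.start = r->start;
			if (r->end > w.end)
				w.end = r->end;
			postponed = 1;
		} else {
			t->active[kept++] = *r;
		}
	}

	if (kept == MAX_ACTIVE)
		return -1;

	/* Overlapping entries were folded into 'w'; store the result. */
	t->num = kept;
	t->active[t->num++] = w;
	return postponed ? 0 : 1;
}
```

With writes to [0, 100) and [200, 300) in flight, a request for [50, 150) is postponed and the first region grows to [0, 150), so a later request for [120, 130) is postponed as well, while [150, 200) can still start in parallel.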

This problem arises especially when a number of threads write to the same file on different nodes, which can not rely on local page locking (like in the Linux VFS).

There are some problems with the algorithm though - it (sometimes) uses too deep recursive calls, which should be eliminated before it is used in real life. That is what I'm working on right now.

/devel/fs :: Link / Comments (4)


Wed, 24 Oct 2007

Distributed filesystems overview.

I will write here my personal opinion on filesystems I either worked with, or for which I just read design notes or at least some documentation. It can be completely wrong - feel free to comment and I will fix the mistakes.

Lustre.
This is a distributed filesystem heavily based on the ext3 codebase, with the following issues:

  • since it is heavily based on ext3, it has all its problems with large files, fragmentation and other issues.
  • absence of any redundancy at either the filesystem layer or the block layer. It requires external storage (like RAID arrays) to handle failures.
  • very strong developers (a lot of former reiser3/4 team)
  • I was asked to work with :)
GPFS.
A really strong filesystem, originally created on top of IBM's Tiger Shark fs. The following issues deserve additional highlighting:
  • requires expensive hardware to build shared storage
  • supports filesystem replication to two disks for data and metadata
  • really interesting locking techniques are used to minimize locking contention
  • in-kernel filesystem, but with a large number of management userspace daemons
  • proprietary
PVFS (second version).
I already wrote about it.

GoogleFS also known as GFS.
It was briefly described here already.

GlusterFS.
This is a FUSE-based parallel filesystem with the following features:
  • userspace filesystem
  • filesystem works on top of a usual in-kernel filesystem and thus can suffer from its problems
  • requires shared storage to handle failures (developers say that additional redundancy and data striping is a bad idea), although they implemented file replication
  • very bad documentation about filesystem design, but lots of advertisement words
GFS.
  • shared-disk filesystem, which locks access at the block layer.
  • scales badly
  • second version of the filesystem is in the mainline Linux kernel tree
OCFS (second version).
  • uses in-kernel journalling code (and suffers from its problems)
  • uses in-kernel distributed locking management system (it was created for OCFS2 actually and is not usable by anyone else, requires a bit ugly userspace support).
  • no documentation
  • 32-bit filesystem
  • no word about redundancy support - likely it is not supported
cXFS.
SGI clustered XFS filesystem.
  • no redundancy
  • single master server
  • fibre channel luns only (?)
CODA filesystem.
Development started at CMU before I had even seen a computer, but right now its development is very inactive. The only thing I know about CODA is its really interesting offline capabilities, which are not supported in any other existing network/distributed filesystem.

KosmosFS.
Google-like filesystem with similar limitations:
  • designed for write-once/read-many workloads only
  • good performance for large files only
  • mostly sequential access
  • single metadata server
  • no redundancy
GfarmFS.
  • very small amount of documentation
  • looks like it has only a single metadata server and a caching metadata server
  • userspace filesystem
Ceph.
  • uses interesting pseudo-random distribution of the data among data nodes
  • allows multiple replicas of the same data
  • has a metadata cluster, but metadata is only partitioned between nodes without additional redundancy
  • good documentation of some design notes and very bad on others
  • userspace storage
Various AFS clones.
This is a long-dead filesystem.

Parallel NFS.
Although it has not yet been released by any vendor, its specification allows a proprietary protocol to be embedded into the communication core, which means that non-vendor-certified products will not be able to work with this technology.

ChironFS.
This is a userspace-based filesystem without automatic resync on failure.


P.S. I repeat that this opinion can be damn wrong.

/devel/fs :: Link / Comments (2)


Tue, 23 Oct 2007

DST as shared disk storage.

Yes, it is possible. It is a transport layer for a high-performance parallel filesystem, and everything is already completed. Consider the case when multiple filesystem nodes, i.e. nodes which do not contain data, but only metadata, connect to the same storage nodes (which contain the storage itself): it is possible to connect several remote nodes to a single local export node and perform concurrent read/write access. Similar storage is the base for likely all computing cluster filesystems (for example GPFS, PVFS, GFS).

This of course requires a higher-layer filesystem to manage locks and concurrency to preserve filesystem state, but the storage layer itself is already fully implemented in DST.

/devel/dst :: Link / Comments (0)


Read balancing is not supported in Linux RAID implementation.

I can not believe that, but it looks like this is true.
Even distributed storage supports that.
It looks like after my announcements this was added into kernelnewbies projects :)

/devel/other :: Link / Comments (0)


Studying existing distributed filesystems.

I already wrote short notes about googlefs, hadoop fs (hdfs) and DragonflyBSD hammer as part of the preparation for the new filesystem development.

Now, let's move a bit into a different area: IBM's GPFS (originally Tiger Shark FS) and PVFS (second version of course).

Here are my short notes about PVFS2, which I got from its design notes:

  • virtual filesystem, as it works not as a real filesystem, but as a userspace wrapper on top of a usual filesystem - just like googlefs.
  • not POSIX compliant - what I do not get is its interface, which is heavily MPI oriented. It is possible to use the usual POSIX syscalls with a special kernel module, but it does not have file semantics - there are no files, there are only some references, which can be deleted without thinking about others who have already opened them.
  • no redundancy - this problem is handled either by having shared storage, or by using so-called lazy redundancy, which basically means a new helper for user applications, which allows redundant writes to be forced for a given file.
  • lockless metadata updates - sounds like a really good idea, which is based on a strong state machine of the update process, but in practice it is possible to have complex races and fallbacks, which can be complex enough that locklessness is not worth it.
  • userspace IO daemons: PVFS2 uses a traditional UNIX filesystem to store data and Berkeley DB to store metadata.
  • really bad at serving several types of loads, like executing off the file system, shared mmapping of files, or storing mail in mbox format. PVFS2 was designed for different loads.
This filesystem was designed for the sole purpose of working with heavy dataflows created by huge scientific MPI applications. Most of it works in userspace.
But what I really like is how it was written - with bits of fun, self-irony and an excellent description of what it is for - no empty advertisement words and other pathos crap.

/devel/fs :: Link / Comments (11)


Mon, 22 Oct 2007

Climbing evening.

This one was quite hard - my foot still aches a lot, so it is very challenging to climb and above all to get shoes on and off.
Nevertheless I completed several traverses and then started to do small exercises on the negative slope - climb up and down over several holds (usually about 4-5 holds, with the length of such a 'trace' being about 3-4 meters), and repeat the same several times without rest in between and without moving back to the floor.
Although this looks quite simple, especially if holds for a simple trace were selected, this drains power very fast.
It was hard, but a very good time there.

/life :: Link / Comments (0)


Meanwhile at apartment development side...

I completed the second plaster filling and initial polishing of the first arch and started the second one (for the checkroom's door), which is smaller. Also started a minor electrical (socket setup) task in the bathroom.
Nothing major actually, but something.

/devel/flat :: Link / Comments (0)


Fri, 19 Oct 2007

Sub release of the DST with turned off debug.

The latest release has debug turned on in the mirroring algorithm, so I've released a 'nodebug' version and put it into the archive too.
Enjoy!

/devel/dst :: Link / Comments (0)


New article at kerneltrap.org about DST.

I have serious problems with English articles...

/devel/dst :: Link / Comments (0)


Thu, 18 Oct 2007

Added ability to store age of the node to the disk in mirror algorithm. New DST release.

This is used for cases when a mirror node was updated (disks changed or something like that), so that the media of the failed node does not contain the data which was there previously. In this case the dst core will read the 'age' of the failed node (a unique id stored at the end of the node, which is assigned to the whole storage at initialization time), and if it does not match the current one, the whole node will be marked as dirty, which will force a total resync.
The same applies to the initial setup - if the id of the second or any subsequent node does not match the id of the first one, the nodes will be marked as dirty and will resync eventually.
This is a good step forward, I think. The only missing bit I'm thinking about right now is on-demand resync, i.e. resync as soon as a node is found to be dirty. Right now resync only happens when there are operations on top of the storage. This is quite a minor priority though, as is the new redundancy algorithm.
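
A rough sketch of the age check in C, assuming a per-block dirty bitmap kept in memory. The structure layout, the NODE_BLOCKS size and all function names are invented for the example and do not come from the DST sources.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <string.h>

#define NODE_BLOCKS 64	/* hypothetical node size in blocks */

struct mirror_node {
	uint64_t age;	/* id read from the end of the node */
	unsigned char dirty[NODE_BLOCKS / CHAR_BIT];	/* one bit per block */
};

static void node_mark_all_dirty(struct mirror_node *n)
{
	memset(n->dirty, 0xff, sizeof(n->dirty));
}

static int node_block_dirty(const struct mirror_node *n, unsigned int block)
{
	assert(block < NODE_BLOCKS);
	return (n->dirty[block / CHAR_BIT] >> (block % CHAR_BIT)) & 1;
}

/*
 * Compare the age stored on the node with the age of the whole storage.
 * Returns 1 if they differ and a total resync was forced, 0 otherwise.
 */
static int node_check_age(struct mirror_node *n, uint64_t storage_age)
{
	if (n->age == storage_age)
		return 0;

	/* Foreign or replaced media: nothing on it can be trusted. */
	node_mark_all_dirty(n);
	return 1;
}
```

A node whose age matches keeps its bitmap untouched, so only blocks already marked dirty are resynced. A node with a different age has every block marked dirty, which is the total resync case.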

/devel/dst :: Link / Comments (10)


Filesystem survey. Take 2.

I've started a short survey among admins I know about the filesystem usage statistics they have on different systems, and about how it should be handled in the best case.
To make it clear, I asked them to specify the type of the system (like mail server, e-commerce, database and so on), the type of the load (a lot of small/big reads/writes, sequential/random access and so on), and the expected results (i.e. what load should be preferred and what performance is expected and would be good to have). Together with the results of the previous survey this should create a clear picture of the use cases for any filesystem.

Feel free to drop your ideas and experience with different systems in the comments.

Since those who originally really wanted to participate in the distributed system discussion keep silent, I will postpone the new redundancy code implementation (I already started it in userspace, so maybe I will return to it quite soon).
It also looks like the vast majority of systems of such scale use the simplest mirroring.

For the new release of the distributed storage I will definitely implement the ability to store a dirty/clean bitmap on the disk, mainly for the following purpose: suppose you have a distributed storage with multiple mirror devices, one of them fails, and you want to replace it with a new one, which has new media. The higher layer (dst itself) only knows about pending updates which were not stored on the media, so when the new node 'resurrects', it will read its dirty/clean bitmap from the media, find that it is clear (or that its age is not equal to the age of the currently used system) and thus force the DST core to copy all required blocks from a different node.
This is not that complex a task, so I've started it right now.

Stay tuned.

/devel/fs :: Link / Comments (0)


New release of the userspace network stack.

Short changelog:

  • really fixed leak in raw netchannel reading path
  • changed timestamp setup
  • added retransmit checking timer
  • added sanity checks for addresses and ports processed in the stack - in the case of a packet socket they can sometimes be incorrect (when working over loopback for example)
  • retransmit logic checks - still requires bits of work, it is not 100% correct
This release contains a number of really useful fixes, but the retransmit logic is not yet correct. Since unetstack uses a very aggressive (non-rfc-compliant) congestion control algorithm, this can lead (and I see this in practice) to a complete stall of the dataflow.
I will investigate this problem further later.

/devel/networking :: Link / Comments (0)


Reading userspace network stack code.

	if (!th->ack) {
		ulog("%s: Strange packet.\n", __func__);
		goto out;
	}
Very interesting, what did I mean?

/devel/networking :: Link / Comments (0)


We won! Russia - England 2:1.

Although I did not believe it - I checked the radio at the climbing zone after the first half of the match - 0:1, so I lost hope, but we managed to win.

Of course I'm very glad!

/other :: Link / Comments (0)


Wed, 17 Oct 2007

Climbing evening.

It had been quite a while since I last had a climbing training, so I was definitely not in my best shape, but nevertheless it was quite a good training. Since Grange, also known as lazy-boom-slacker, sits and emulates heavy load by job tasks, I climbed alone. With a rubbed foot, quickly tiring arms, eventually rubbed fingers and an aching body it was not that easy, but I like that. After a couple of hours I stopped torturing my body and went home. I only managed to complete several traverses and a number of boulderings and new starts.
That was a good time.

/life :: Link / Comments (1)


Ole-ole-ole-ole, Russia come on!

Russia-England, football today!
I'm going climbing though, but will try to follow it on the radio there.

Lazy days - did not do anything interesting at all - some cleanups of the userspace network stack, and found that it has horrible retransmit bugs, which likely will not be fixed easily (if at all). I hate such a state, when I have time, but do not want to do anything... Real crap.

/life :: Link / Comments (0)


Tue, 16 Oct 2007

Userspace network stack.

I've released a new version of the userspace network stack, which contains a memory leak fix by Salvatore Del Popolo (delpopolo_dit.unitn.it).
Enjoy!

/devel/networking :: Link / Comments (0)


Sun, 14 Oct 2007

Scaling inodes.

I have in mind an interesting design idea for the inode management structure in my (so far theoretical future) filesystem - so-called scaling inodes. The basic idea is to have inodes of several standard sizes like 256, 512, 1k, 2k and 4k, which will contain meta-information about the data object and its data embedded at the end of the structure. Such a design reduces the number of seeks needed to get data from disk, reduces fragmentation (since bigger inodes will be allocated at new locations as new objects), allows the old inode to be reused for a new one, simplifies copy-on-write semantics, and greatly reduces directory reading time (to a limited degree though - there are a number of factors - for example my current mainline 2.6 git tree contains about 1500 objects, of which files smaller than 4k make up more than 60%; the statistics for /etc are about the same: about 1600 in total, of which more than 1300 are smaller than 4k).
This is of course total speculation, but the idea is worth further investigation...
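
A small C sketch of how a size class could be picked for such an inode. The size list is taken from the text above; the header size passed in, the function name pick_inode_size and the rule "smallest class that fits header plus data" are assumptions of this example, not a design decision.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard inode sizes from the idea above, in bytes. */
static const uint32_t inode_sizes[] = { 256, 512, 1024, 2048, 4096 };
#define NUM_INODE_SIZES (sizeof(inode_sizes) / sizeof(inode_sizes[0]))

/*
 * Returns the smallest inode size that holds 'header_size' bytes of
 * meta-information plus 'data_size' bytes of embedded data, or 0 if
 * the data does not fit even into the largest inode and has to be
 * stored outside of it.
 */
static uint32_t pick_inode_size(uint32_t header_size, uint64_t data_size)
{
	size_t i;

	assert(header_size <= inode_sizes[0]);

	for (i = 0; i < NUM_INODE_SIZES; i++) {
		if (data_size <= inode_sizes[i] - header_size)
			return inode_sizes[i];
	}
	return 0;
}
```

With a hypothetical 128-byte header, a 100-byte file lands in a 256-byte inode, a 129-byte file needs a 512-byte one, and anything above 3968 bytes no longer fits into an inode at all.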

/devel/fs :: Link / Comments (0)


HAMMER filesystem in DragonflyBSD.

Here are two links about an interesting cluster filesystem started by Matt Dillon of DragonflyBSD (design and performance expectations). By design it is supposed to handle multi-machine installations of the filesystem, online backup, recovery of failed nodes (because of full replication of the filesystem on different nodes) and eventually multi-master setup (expected to be completed in a year). So far a number of design documents have been published; I only followed a couple of them. Unfortunately they did not contain enough details for a deep analysis of the idea, so right now I can not say what it will look like (except for a rough description of the underlying structure), but it will be possible to use this filesystem as a usual one, so it guarantees data integrity (compare this with filesystems created for search engines (aka producer-consumer data rings) like googlefs or the hadoop filesystem).
I believe inventing a new filesystem is a good step forward for BSD systems.
So far the only disadvantage, based on my limited knowledge of the design principles, is full filesystem replication (i.e. mirroring mode), which in case of failure of some of the nodes will require a more complex recovery process than block-based replication, but on the other hand it allows easier snapshot support.
I will try to follow its development a bit more closely to learn more about it, especially if I work on my own distributed filesystem.

/devel/fs :: Link / Comments (0)


Meanwhile in Moscow ...

... there is snowfall, 100% humidity, an angry wind and killing cold.
And me of course... sitting and slacking at the office.
I would drink a bit and eat ravioli - I have not eaten them for so long already - but obviously can not cook at home. I tried to cook when I was at Marina's birthday, but made the meat too salty - I'm forgetting these bits already, so I need to finish faster and start cooking again...
Anyway I will not (at least in the nearest future), so I'm just sending you a 'Hello' from this bloody weather, Moscow, Russia (somewhere between Europe and the west coast of the USA).

/other :: Link / Comments (0)


I live in a great country.

Several years ago one of the former ministers of TV/radio broadcasting said the following phrase (after a big license price increase):

We get money not for the paper, but for nature resource usage during transmission of the electro-magnetic waves over the air.
I talked with a person who does some business in that area; he has quite a big notebook of such records.
There will be elections in Russia quite soon...

/other :: Link / Comments (0)


I've returned.

Back to the drawing board - thinking about distributed storage, new redundancy codes and maybe a distributed filesystem...
Enough for now, these are short-term tasks.

/life :: Link / Comments (0)


Thu, 11 Oct 2007

GoogleFS.

It sucks for the vast majority of users; here are a number of reasons why I think so:

  • designed only to efficiently handle huge sequential reads/appends
  • it does not handle small files efficiently. Small is smaller than 64 MB
  • it only reliably handles appends of chunks which do not cross a 64 MB boundary in the file
  • dataflow is unreliable, but since it is mostly append, old data will be returned instead of corrupted data
  • it does not scale to random access at all
  • it forces the client to cache metadata
  • reading bottlenecks (like starting a single application on a huge number of clients) are handled by the ostrich method - increase the number of nodes where the block lives until the bottleneck is not very visible
  • there are no words 'low latency' in its design
  • no (even remotely) POSIX API, clients have to depend heavily on the vendor's helper libs, although the following operations are supported: create, delete, open, close, read and write to file
  • single master
  • googlefs is built on top of usual files in some Linux filesystem
Essentially this can only be used by google itself for its own purposes - I do not see any advantages in this technology compared to the open Apache Hadoop project and its filesystem (the latter has its own disadvantages too).
These systems just can not be used by third-level applications, and the specially crafted ones can operate over consumer-producer data rings, not over persistent data.
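
The 64 MB boundary from the list above is easy to express as a check. The following C sketch only illustrates the arithmetic; the function name is made up, and what the real system does when an append would cross the boundary is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SIZE (64ULL * 1024 * 1024)	/* 64 MB chunk from the list above */

/*
 * Returns 1 if appending 'len' bytes to a file that currently holds
 * 'file_size' bytes would cross a 64 MB chunk boundary, 0 otherwise.
 */
static int append_crosses_chunk(uint64_t file_size, uint64_t len)
{
	/* Bytes still free in the chunk that holds the end of the file. */
	uint64_t room = CHUNK_SIZE - (file_size % CHUNK_SIZE);

	assert(len > 0);
	return len > room;
}
```

An append of 10 bytes to a file that ends 10 bytes before a boundary still fits, while 11 bytes would cross into the next chunk.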

/devel/fs :: Link / Comments (0)


New release of the distributed storage subsystem.

This is a maintenance release, which includes a small number of bug fixes that sat in the tree but had not yet been released.
Check the homepage to get the latest release.

/devel/dst :: Link / Comments (2)


HIFN driver has been imported into cryptodev-2.6 tree.

The driver will eventually be added to either the 2.6.24 or the 2.6.25 tree.
I've also added a patch to export the DES key expansion to catch weak keys, and use it in the HIFN driver. Now it is ready.
Interested readers can find the updated HIFN driver in the archive.

/devel/acrypto/hifn :: Link / Comments (0)


Wed, 10 Oct 2007

GoogleFS aka GFS.

Downloaded a research publication about GFS - let's see what they have, and why they want people with knowledge of distributed systems to rewrite it...

/devel/other :: Link / Comments (0)


Misalignment access handling has been implemented in HIFN driver. New version has been released.

Although it contains some comments that are obscure for the reader, like

	/*
	 * Temporary of course...
	 * Kick author if you will catch this one.
	 */
	printk(KERN_ERR "%s: dlen: %u, nbytes: %u, "
		"slen: %u, offset: %u.\n",
		__func__, dlen, nbytes, slen, offset);
	printk(KERN_ERR "%s: please contact author to fix this "
		"issue, generally you should not catch "
		"this path under any condition but who "
		"knows how did you use crypto code.\n"
		"Thank you.\n",	__func__);
	BUG();
This should not happen in real life, but in theory it is probably a possible condition, so I added a BUG() and the above prints.
It uses quite tricky copying over the source/destination buffers in case of misaligned access, but the driver passed all tests in tcrypt.c, except the DES weak key test (the hardware can not distinguish weak keys).

I've released a new version, which you can find in the archive.
Groovy!

/devel/acrypto/hifn :: Link / Comments (0)


Meanwhile at apartment development side.

Yesterday I went to the development shop, got screws, paint and various other small bits, and then worked late into the night (finished about 2 a.m.), but completed the main part of the arch at the room entrance. It is not finished yet - it requires clean polishing, small plaster fixes where needed and painting, but overall it is ready and looks cool. I will build another arch in place of the checkroom's door later, since such a small room does not need a door at all. This is the last piece on the apartment development side I will work on this week - tomorrow I will go to my sister's birthday and will not be at home for a couple of days.

/devel/flat :: Link / Comments (0)


Tue, 09 Oct 2007

I think the HIFN driver is the most complex one I have ever written.

I have already implemented at least two different designs of the misaligned access handling code. Neither was good enough (basically each failed in one case or another), so now I'm working on a third one. This one should be damn good, but not very fast: no one promised fast access when you put each byte of the word to be encrypted in a different page. It will be limited enough to say that if there are more than a predefined number of pages, then the caller sucks. And it will be complex. It is complex already.
Actually the HIFN driver can handle most misaligned accesses just fine, provided the data is word aligned (sometimes two-word aligned) and the size of the data provided is more than two bytes (one word, or two words in some cases); in other cases a copy is needed.
Neither IPsec nor dm-crypt provides such packets. The protocols just do not allow that, but some obscure crypto users can have such a stupid idea in mind, so we have to copy...
Crap.
But it will be ready this week!
And right now I will go to the development shop. I've done enough at paid work and on the HIFN driver for today, so let's change the business...
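
The "handle it directly or copy" decision described above can be sketched in plain userspace C. This is only an illustration, not the driver code: the 4-byte WORD_SIZE, the struct segment layout and both function names are assumptions made for the example, and the real hardware rules (one word or two words, minimal size) are more subtle than this check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed alignment unit: one 32-bit word. The real hardware rule may differ. */
#define WORD_SIZE 4u

/* Hypothetical description of one piece of a request:
 * byte offset inside its page and length in bytes. */
struct segment {
	size_t offset;
	size_t len;
};

/* True when the segment could be handed to the device as is. */
static bool segment_is_aligned(const struct segment *seg)
{
	return seg->len != 0 &&
	       seg->offset % WORD_SIZE == 0 &&
	       seg->len % WORD_SIZE == 0;
}

/* True when at least one segment has to go through a copy. */
static bool request_needs_copy(const struct segment *segs, size_t count)
{
	assert(segs != NULL || count == 0);

	for (size_t i = 0; i < count; i++)
		if (!segment_is_aligned(&segs[i]))
			return true;
	return false;
}
```

With this check a request made of {offset 0, 16 bytes} and {offset 8, 32 bytes} goes straight to the device, while {offset 2, 14 bytes} forces the copy path.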

/devel/acrypto/hifn :: Link / Comments (0)


Mon, 08 Oct 2007

Climbing evening.

Although it is quite hard to call it 'climbing', since I rubbed my foot so badly that I cannot even walk straight (a live anti-advertisement for the climbing zone), not to mention climbing in shoes several sizes too small. So I tortured myself and my foot for a couple of hours and went home. I will have a break in climbing training until next week; hopefully the foot will be ok by then.

/life :: Link / Comments (0)


Async IPsec support in Linux kernel.

Herbert Xu (the current crypto maintainer) has started preparatory work to make IPsec asynchronous. So far he has moved common code into generic helpers, added an skb shared structure to hold XFRM (the Linux IPsec stack) per-packet state (header and sequence number) during IP processing, which includes the time while the packet is being encrypted, and also removed (or replaced) potentially unused/redundant elements in XFRM processing.
As is, it does not add any async capabilities, since the main output loop, where the crypto (ESP or AH) processing function is called, has not been split into completion parts, but with the above changes that will be much simpler.
For example, the acrypto async IPsec patch, which was introduced for the 2.6.15 kernel tree, did not have common code for IPv6 and IPv4 processing, so it only supported IPv4 ESP mode.
I'm pretty sure Herbert will implement async processing as a callback invocation model (although I saw an implementation of a crypto function that busy-waits for completion), where the skb will contain all information needed for further packet handling (mainly either routing information or the device output function obtained from the routing info), and that callback will be provided to the encryption device. One big problem with such an approach can be that the crypto hardware device (actually the only real async hardware supported by the existing Linux crypto stack is the HIFN 795x adapters) can call the provided callbacks from hardirq context instead of softirq (or process) context as in the current network stack.
We will see how this will be implemented. I really wish Herbert success and hope it will find its way into the 2.6.24 tree (ugh, I need to complete misaligned handling in the HIFN driver, but the new driver policy is a bit less restrictive about time limits after the merge window is opened by Linus, so I will try really hard to kill my laziness so that the driver will be ready this week).
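
To make the callback invocation model mentioned above more concrete, here is a small userspace sketch. It is not Herbert's design and not kernel code: struct crypto_req, struct fake_dev, dev_submit(), dev_irq() and packet_output() are all names invented for the example, and the XOR is just a placeholder for real encryption. The point is only the flow: submit returns before the work is done, and the "interrupt" later calls the callback, which finds everything it needs to continue packet handling in the request itself.

```c
#include <assert.h>
#include <stddef.h>

struct crypto_req;
typedef void (*crypto_done_t)(struct crypto_req *req, int err);

/* Hypothetical request: data plus everything needed to continue afterwards. */
struct crypto_req {
	unsigned char *data;
	size_t len;
	crypto_done_t done;	/* called when the device has finished */
	void *ctx;		/* state for further packet handling */
};

/* Stand-in for a crypto device with a single pending slot. */
struct fake_dev {
	struct crypto_req *pending;
};

/* Queue the request and return at once; the work is not done yet. */
static int dev_submit(struct fake_dev *dev, struct crypto_req *req)
{
	assert(req->done != NULL);

	if (dev->pending)
		return -1;	/* device busy */
	dev->pending = req;
	return 0;
}

/* Stand-in for the interrupt handler: finish the work, run the callback. */
static void dev_irq(struct fake_dev *dev)
{
	struct crypto_req *req = dev->pending;

	if (!req)
		return;
	dev->pending = NULL;
	for (size_t i = 0; i < req->len; i++)
		req->data[i] ^= 0xff;	/* placeholder transform, not a cipher */
	req->done(req, 0);
}

/* Example completion callback: mark the packet as ready for output. */
struct packet_state {
	int sent;
};

static void packet_output(struct crypto_req *req, int err)
{
	struct packet_state *st = req->ctx;

	st->sent = (err == 0);
}
```

In the kernel the equivalent of dev_irq() may run in hardirq context, which is exactly the problem noted above: a callback like packet_output() would then have to defer the real output work to softirq or process context.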

/devel/networking :: Link / Comments (0)


Sun, 07 Oct 2007

Club "Point" aka "Tochka".

I visited this club to listen to a punk group called "Blond Ksyu" ("Blondinka Ksyu") to recall (actually to learn) what punk music is. It is a simple band with quite a young woman as the lead, a guitarist, a bassist and a drummer. Quite simple notes, but the band is attractive because of its vocalist; she acts rather like Avril Lavigne, although a bit older.
I'm pretty sure I do not like punk music and would not go to listen to the "Blond Ksyu" band again, but that was quite a fun evening.

/life :: Link / Comments (0)


You can still choose

If you don't swim to win you'll never lose
Excellent theme from the upcoming OpenBSD release.
That is how I think things should be done: love the process and do it for the sake of the process, not the end goal. Only in that case will you always win.

/devel/other :: Link / Comments (0)


The whole country celebrates ...

... the birthday of Vladimir Vladimirovich Putin, our president.
Or no one celebrates and everyone waits for him to leave, it depends.

/other :: Link / Comments (0)


Fri, 05 Oct 2007

Misaligned access in crypto stack and HIFN driver.

It is possible that data provided to the crypto processor is misaligned, or split between pages so that each page contains a chunk that is not block size aligned. In this case the data has to be relocated.
The Linux crypto stack provides data to the underlying crypto processing driver in scatterlists, which are essentially arrays of pages with the appropriate size/offset information. Thus it is possible that each page in a scatterlist has to be relocated. I will create a 'cache' of preallocated pages, which will be used as temporary storage for crypto data. If the cache is empty, the crypto processing function will allocate new pages in its context, then copy data from the misaligned page into the given page and process it. In the interrupt, once it is confirmed that all pages have been processed, the new data will be copied back to the requested destination buffer if needed (for example, if the source buffer is misaligned but the destination is ok, the second copy is not required). This can also be postponed to process context via a workqueue, but that introduces additional latency, which is quite noticeable (as I tested in the acrypto crypto stack).
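
The page 'cache' plan above can be sketched in userspace C as follows. It is a sketch under assumptions, not the driver: malloc() stands in for kernel page allocation, and BOUNCE_PAGE_SIZE, BOUNCE_CACHE_MAX and all bounce_* names are made up for the illustration. bounce_in() corresponds to the copy done before the request is submitted, bounce_out() to the copy back done on completion.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BOUNCE_PAGE_SIZE 4096u	/* assumed page size */
#define BOUNCE_CACHE_MAX 4u	/* assumed cache capacity */

struct bounce_cache {
	void *pages[BOUNCE_CACHE_MAX];
	unsigned int count;
};

/* Preallocate pages up to the capacity; returns how many are cached. */
static unsigned int bounce_cache_fill(struct bounce_cache *c)
{
	while (c->count < BOUNCE_CACHE_MAX) {
		void *p = malloc(BOUNCE_PAGE_SIZE);

		if (!p)
			break;
		c->pages[c->count++] = p;
	}
	return c->count;
}

/* Take a cached page, or allocate a new one when the cache is empty. */
static void *bounce_get(struct bounce_cache *c)
{
	if (c->count)
		return c->pages[--c->count];
	return malloc(BOUNCE_PAGE_SIZE);
}

/* Return a page to the cache, or free it when the cache is full. */
static void bounce_put(struct bounce_cache *c, void *page)
{
	if (c->count < BOUNCE_CACHE_MAX)
		c->pages[c->count++] = page;
	else
		free(page);
}

/* Before processing: copy misaligned source data into a bounce page. */
static void *bounce_in(struct bounce_cache *c, const void *src, size_t len)
{
	void *page;

	assert(len <= BOUNCE_PAGE_SIZE);
	page = bounce_get(c);
	if (page)
		memcpy(page, src, len);
	return page;
}

/* On completion: copy the result back only if a destination is given
 * (NULL means the second copy is not required), then release the page. */
static void bounce_out(struct bounce_cache *c, void *page, void *dst, size_t len)
{
	if (dst)
		memcpy(dst, page, len);
	bounce_put(c, page);
}
```

Passing NULL as dst models the case where only the source was misaligned, so the device wrote straight into the real destination and the copy back is skipped.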

/devel/acrypto/hifn :: Link / Comments (0)


Uchuu.

I think it is a good idea to wake up and look at it. That is what I will draw on my wall.

Uchuu

/devel/flat :: Link / Comments (3)


Thu, 04 Oct 2007

Linux crypto stack issues.

It was not designed for async operations at all. None of the helpers I described previously work in async context, i.e. when several chunks of the same request can be encrypted in parallel, since the blkcipher_walk* interfaces provide the same destination buffer for all parts of the same request.
So, to handle misaligned data I have to develop my own helpers for the HIFN driver.

/devel/acrypto/hifn :: Link / Comments (0)


Wed, 03 Oct 2007

Instead of climbing evening.

I went to the climbing zone, started to warm up, saw a number of interesting people and then learned about Grange's recent happenings. They really deserved to be celebrated, so I left (although I missed the climbing lotto, with my luck it was pretty useless anyway) and drank with him till the middle of the night. That was really fun.
I'm pretty sure things will definitely go no worse than right now, so I expect some interesting happenings quite soon.

/life :: Link / Comments (0)


HIFN driver addons and crypto stack issues.

I decided to rewrite crypto session setup in the HIFN driver to allow multiple scatter-gather lists (which are now transformed into pages) in a single crypto session (even though multiple descriptor slots are used). The main goal of this step is to allow encryption of buffers which are split across a number of pages, where each chunk is not block size aligned (for example, one page contains 2 bytes and another one 14 for a single data block of 16 bytes). The second revision of the driver does not support such blocks yet.
To simplify this I started to use the generic Linux crypto helpers blkcipher_walk_* from block ciphers. But they cannot be called in interrupt context, although all allocations are performed as if they happen in atomic context.
Roughly, the code looks like this (error processing omitted):

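struct blkcipher_walk walk;
int err;

/* Set up the walk over the source and destination scatterlists. */
blkcipher_walk_init(&walk, dst, src, nbytes);
/* Map the first chunk; walk.nbytes is the number of bytes available in it. */
err = blkcipher_walk_virt(desc, &walk);

while ((nbytes = walk.nbytes)) {
	/* encrypt() stands for the cipher-specific processing of this chunk. */
	u8 *iv = encrypt();
	memcpy(walk.iv, iv, ivsize);
	/* Bytes that do not fill a whole block are left for the next step. */
	nbytes &= blocksize - 1;
	err = blkcipher_walk_done(desc, &walk, nbytes);
}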
The struct blkcipher_walk above contains the source and destination page addresses with the appropriate sizes and offsets. The variable desc is a pointer to struct blkcipher_desc, which contains the original parameters of the crypto request.
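
The 2-byte plus 14-byte case from the beginning of this post can be illustrated without any kernel helpers. The following userspace sketch is only an assumption of how such gathering could look, not what the driver does: struct chunk, gather_blocks() and scatter_blocks() are hypothetical names. Chunks that are not block size aligned are collected into one contiguous buffer, processed there, and the result is spread back over the original chunks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 16u	/* AES block size in bytes */

/* Hypothetical piece of a request living in some page. */
struct chunk {
	unsigned char *data;
	size_t len;
};

/* Gather all chunks into one contiguous buffer.
 * Returns the number of bytes gathered, or 0 when the chunks do not fit
 * into the buffer or do not add up to a whole number of blocks. */
static size_t gather_blocks(const struct chunk *chunks, size_t count,
			    unsigned char *out, size_t cap)
{
	size_t total = 0;

	assert(out != NULL);

	for (size_t i = 0; i < count; i++) {
		if (chunks[i].len > cap - total)
			return 0;
		memcpy(out + total, chunks[i].data, chunks[i].len);
		total += chunks[i].len;
	}
	return (total % BLOCK_SIZE == 0) ? total : 0;
}

/* Spread processed bytes back over the chunks, in the same order. */
static void scatter_blocks(const struct chunk *chunks, size_t count,
			   const unsigned char *in)
{
	for (size_t i = 0; i < count; i++) {
		memcpy(chunks[i].data, in, chunks[i].len);
		in += chunks[i].len;
	}
}
```

For two chunks of 2 and 14 bytes gather_blocks() returns 16, so the whole data block can be given to the cipher in one piece; a lone 2-byte chunk returns 0, since there is nothing block-sized to process.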

/devel/acrypto/hifn :: Link / Comments (0)


Tue, 02 Oct 2007

HIFN 795x driver for Linux kernel 2.6 is ready.

I've fixed a CBC processing bug, added a software queue (which I consider a seriously ugly hack in the existing kernel async crypto stack implementation), added support for the DES and 3DES algorithms (all are limited to block-sized chunks), fixed indentation and watchdog setup.
Everything I wanted to complete is ready in this driver, so I even updated the TODO list a bit.
One can find the patch against the latest git in the archive.

/devel/acrypto/hifn :: Link / Comments (0)


Mon, 01 Oct 2007

Climbing evening.

That was a really hard one. I tried several new complex routes; some of them I managed to complete on-sight, others I failed, since I was tired as hell. I found again that on a vertical wall I can finish really complex routes without too much effort, but on a negative slope even simpler ones require much more effort. Physical endurance is the main problem.
Anyway, that was a really great time at the climbing zone.

/life :: Link / Comments (0)


HIFN driver is ready.

There are a number of nitpicks, but overall it works. It was lightly tested with the tcrypt testing module; the related output is below. Chunks with the 'fail' label require additional work: although decryption works ok in the driver (reverse hifn_test() operations, for example), the tcrypt decryption tests (with 'chunking', whatever that means in the tcrypt module; I need to check) fail. I will investigate this further tomorrow. A patch against the latest 2.6 git tree is available in the archive.

[  628.851890] testing ecb(aes) encryption
[  628.857498] hifn_cra_init: tfm: ffff81003a8739c8, dev: hifn0 [ffff81003dd7c2c8].
[  628.865046] test 1 (128 bit key):
[  628.868505] hifn_setkey: tfm: ffff81003a8739c8, ctx: ffff81003a873a08, dev: hifn0 [ffff81003dd7c2c8], len: 16.
[  628.878679] hifn_setup_crypto: req: ffff81003a873f20, tfm: ffff81003a8739c8, ctx: ffff81003a873a08, keylen: 16.
[  628.888943] hifn_setup_session: start
[  628.892652] cmd: i=1, u=0, k=1
[  628.895752] src: i=1, u=1, k=0
[  628.898852] dst: i=1, u=1, k=0
[  628.901952] res: i=1, u=0, k=1
[  628.905054] hifn0: iv: 0000000000000000 [0], key: ffff81003a873a08 [16], mode: 0, op: 1, type: 0.
[  628.913996] hifn0: 1 dmacsr: 8898888c, dmareg: 22322023, res: 00100000 [2], i: 1.2.2.1, u: 2.2.2.2.
[  628.923104] hifn0: ring cleanup 1: i: 2.2.2.2, u: 1.2.2.1, k: 1.0.0.1.
[  628.929676] hifn0: ring cleanup 2: i: 2.2.2.2, u: 0.2.2.0, k: 2.0.0.2.
[  628.937082] 69c4e0d86a7b0430d8cdb78070b4c55a
[  628.942302] pass
[  628.944280] test 2 (192 bit key):
[  628.947732] hifn_setkey: tfm: ffff81003a8739c8, ctx: ffff81003a873a08, dev: hifn0 [ffff81003dd7c2c8], len: 24.
[  628.957917] hifn_setup_crypto: req: ffff81003a873f20, tfm: ffff81003a8739c8, ctx: ffff81003a873a08, keylen: 24.
[  628.968159] hifn_setup_session: start
[  628.971870] cmd: i=2, u=0, k=2
[  628.974969] src: i=2, u=2, k=0
[  628.978069] dst: i=2, u=2, k=0
[  628.981169] res: i=2, u=0, k=2
[  628.984271] hifn0: iv: 0000000000000000 [0], key: ffff81003a873a08 [24], mode: 0, op: 1, type: 1.
[  628.993214] hifn0: 1 dmacsr: 8898888c, dmareg: 22322023, res: 00100000 [3], i: 1.3.3.1, u: 3.3.3.3.
[  629.002321] hifn0: ring cleanup 1: i: 3.3.3.3, u: 1.3.3.1, k: 2.0.0.2.
[  629.008894] hifn0: ring cleanup 2: i: 3.3.3.3, u: 0.3.3.0, k: 3.0.0.3.
[  629.016167] dda97ca4864cdfe06eaf70a0ec0d7191
[  629.021398] pass
[  629.023376] test 3 (256 bit key):
[  629.026827] hifn_setkey: tfm: ffff81003a8739c8, ctx: ffff81003a873a08, dev: hifn0 [ffff81003dd7c2c8], len: 32.
[  629.037001] hifn_setup_crypto: req: ffff81003a873f20, tfm: ffff81003a8739c8, ctx: ffff81003a873a08, keylen: 32.
[  629.047247] hifn_setup_session: start
[  629.050956] cmd: i=3, u=0, k=3
[  629.054055] src: i=3, u=3, k=0
[  629.057158] dst: i=3, u=3, k=0
[  629.060258] res: i=3, u=0, k=3
[  629.063359] hifn0: iv: 0000000000000000 [0], key: ffff81003a873a08 [32], mode: 0, op: 1, type: 2.
[  629.072302] hifn0: 1 dmacsr: 8898888c, dmareg: 22322023, res: 00100000 [4], i: 1.4.4.1, u: 4.4.4.4.
[  629.081408] hifn0: ring cleanup 1: i: 4.4.4.4, u: 1.4.4.1, k: 3.0.0.3.
[  629.087980] hifn0: ring cleanup 2: i: 4.4.4.4, u: 0.4.4.0, k: 4.0.0.4.
[  629.095353] 8ea2b7ca516745bfeafc49904b496089
[  629.100581] pass
[  629.102558] 
[  629.102558] testing ecb(aes) encryption across pages (chunking)
[  629.110232] 
[  629.110233] testing ecb(aes) decryption
[  629.115822] hifn_cra_init: tfm: ffff81003a873f20, dev: hifn0 [ffff81003dd7c2c8].
[  629.123369] test 1 (128 bit key):
[  629.126833] hifn_setkey: tfm: ffff81003a873f20, ctx: ffff81003a873f60, dev: hifn0 [ffff81003dd7c2c8], len: 16.
[  629.136996] hifn_setup_crypto: req: ffff81003a8739c8, tfm: ffff81003a873f20, ctx: ffff81003a873f60, keylen: 16.
[  629.147271] hifn_setup_session: start
[  629.150976] cmd: i=4, u=0, k=4
[  629.154076] src: i=4, u=4, k=0
[  629.157176] dst: i=4, u=4, k=0
[  629.160277] res: i=4, u=0, k=4
[  629.163379] hifn0: iv: 0000000000000000 [0], key: ffff81003a873f60 [16], mode: 0, op: 0, type: 0.
[  629.172322] hifn0: 1 dmacsr: 8898888c, dmareg: 22322023, res: 00100000 [5], i: 1.5.5.1, u: 5.5.5.5.
[  629.181428] hifn0: ring cleanup 1: i: 5.5.5.5, u: 1.5.5.1, k: 4.0.0.4.
[  629.188000] hifn0: ring cleanup 2: i: 5.5.5.5, u: 0.5.5.0, k: 5.0.0.5.
[  629.195404] 00112233445566778899aabbccddeeff
[  629.200642] pass
[  629.207471] test 2 (192 bit key):
[  629.210924] hifn_setkey: tfm: ffff81003a873f20, ctx: ffff81003a873f60, dev: hifn0 [ffff81003dd7c2c8], len: 24.
[  629.221079] hifn_setup_crypto: req: ffff81003a8739c8, tfm: ffff81003a873f20, ctx: ffff81003a873f60, keylen: 24.
[  629.231323] hifn_setup_session: start
[  629.235034] cmd: i=5, u=0, k=5
[  629.238135] src: i=5, u=5, k=0
[  629.241235] dst: i=5, u=5, k=0
[  629.244336] res: i=5, u=0, k=5
[  629.247437] hifn0: iv: 0000000000000000 [0], key: ffff81003a873f60 [24], mode: 0, op: 0, type: 1.
[  629.256379] hifn0: 1 dmacsr: 8898888c, dmareg: 22322023, res: 00100000 [6], i: 1.6.6.1, u: 6.6.6.6.
[  629.265486] hifn0: ring cleanup 1: i: 6.6.6.6, u: 1.6.6.1, k: 5.0.0.5.
[  629.272059] hifn0: ring cleanup 2: i: 6.6.6.6, u: 0.6.6.0, k: 6.0.0.6.
[  629.279309] 00112233445566778899aabbccddeeff
[  629.284545] pass
[  629.286543] test 3 (256 bit key):
[  629.290001] hifn_setkey: tfm: ffff81003a873f20, ctx: ffff81003a873f60, dev: hifn0 [ffff81003dd7c2c8], len: 32.
[  629.300171] hifn_setup_crypto: req: ffff81003a8739c8, tfm: ffff81003a873f20, ctx: ffff81003a873f60, keylen: 32.
[  629.310424] hifn_setup_session: start
[  629.314131] cmd: i=6, u=0, k=6
[  629.317231] src: i=6, u=6, k=0
[  629.320330] dst: i=6, u=6, k=0
[  629.323430] res: i=6, u=0, k=6
[  629.326532] hifn0: iv: 0000000000000000 [0], key: ffff81003a873f60 [32], mode: 0, op: 0, type: 2.
[  629.335473] hifn0: 1 dmacsr: 8898888c, dmareg: 22322023, res: 00100000 [7], i: 1.7.7.1, u: 7.7.7.7.
[  629.344573] hifn0: ring cleanup 1: i: 7.7.7.7, u: 1.7.7.1, k: 6.0.0.6.
[  629.351146] hifn0: ring cleanup 2: i: 7.7.7.7, u: 0.7.7.0, k: 7.0.0.7.
[  629.358529] 00112233445566778899aabbccddeeff
[  629.363762] pass
[  629.365742] 
[  629.365743] testing ecb(aes) decryption across pages (chunking)
[  629.373421] 
[  629.373422] testing cbc(aes) encryption
[  629.379011] hifn_cra_init: tfm: ffff81003a8739c8, dev: hifn0 [ffff81003dd7c2c8].
[  629.386560] test 1 (128 bit key):
[  629.390012] hifn_setkey: tfm: ffff81003a8739c8, ctx: ffff81003a873a08, dev: hifn0 [ffff81003dd7c2c8], len: 16.
[  629.400170] hifn_setup_crypto: req: ffff81003a873f20, tfm: ffff81003a8739c8, ctx: ffff81003a873a08, keylen: 16.
[  629.410430] hifn_setup_session: start
[  629.414143] cmd: i=7, u=0, k=7
[  629.417241] src: i=7, u=7, k=0
[  629.420342] dst: i=7, u=7, k=0
[  629.423442] res: i=7, u=0, k=7
[  629.426543] hifn0: iv: 0000000000000000 [0], key: ffff81003a873a08 [16], mode: 1, op: 1, type: 0.
[  629.435484] hifn0: 1 dmacsr: 8898888c, dmareg: 22322023, res: 00100000 [8], i: 1.8.8.1, u: 8.8.8.8.
[  629.444585] hifn0: ring cleanup 1: i: 8.8.8.8, u: 1.8.8.1, k: 7.0.0.7.
[  629.451156] hifn0: ring cleanup 2: i: 8.8.8.8, u: 0.8.8.0, k: 8.0.0.8.
[  629.458574] 3b629d77f45eff9817c5849f9a0aba71
[  629.463817] fail
[  629.465795] test 2 (128 bit key):
[  629.469246] hifn_setkey: tfm: ffff81003a8739c8, ctx: ffff81003a873a08, dev: hifn0 [ffff81003dd7c2c8], len: 16.
[  629.479405] hifn_setup_crypto: req: ffff81003a873f20, tfm: ffff81003a8739c8, ctx: ffff81003a873a08, keylen: 16.
[  629.489649] hifn_setup_session: start
[  629.493359] cmd: i=8, u=0, k=8
[  629.496458] src: i=8, u=8, k=0
[  629.499558] dst: i=8, u=8, k=0
[  629.502661] res: i=8, u=0, k=8
[  629.505760] hifn0: iv: 0000000000000000 [0], key: ffff81003a873a08 [16], mode: 1, op: 1, type: 0.
[  629.514704] hifn0: 1 dmacsr: 8898888c, dmareg: 22322023, res: 00100000 [9], i: 1.9.9.1, u: 9.9.9.9.
[  629.523810] hifn0: ring cleanup 1: i: 9.9.9.9, u: 1.9.9.1, k: 8.0.0.8.
[  629.530383] hifn0: ring cleanup 2: i: 9.9.9.9, u: 0.9.9.0, k: 9.0.0.9.
[  629.537784] bd0cb8b2220fab0cf10079d1b48ffde82b8bae025030fb5245010d5b7f1fc8c4
[  629.546619] fail
[  629.548600] 
[  629.548601] testing cbc(aes) encryption across pages (chunking)
[  629.556264] 
[  629.556265] testing cbc(aes) decryption
[  629.561849] hifn_cra_init: tfm: ffff81003a873f20, dev: hifn0 [ffff81003dd7c2c8].
[  629.569412] test 1 (128 bit key):
[  629.572868] hifn_setkey: tfm: ffff81003a873f20, ctx: ffff81003a873f60, dev: hifn0 [ffff81003dd7c2c8], len: 16.
[  629.583026] hifn_setup_crypto: req: ffff81003a8739c8, tfm: ffff81003a873f20, ctx: ffff81003a873f60, keylen: 16.
[  629.593270] hifn_setup_session: start
[  629.596981] cmd: i=9, u=0, k=9
[  629.600081] src: i=9, u=9, k=0
[  629.603181] dst: i=9, u=9, k=0
[  629.606283] res: i=9, u=0, k=9
[  629.609384] hifn0: iv: 0000000000000000 [0], key: ffff81003a873f60 [16], mode: 1, op: 0, type: 0.
[  629.618327] hifn0: 1 dmacsr: 8898888c, dmareg: 22322023, res: 00100000 [10], i: 1.10.10.1, u: 10.10.10.10.
[  629.628039] hifn0: ring cleanup 1: i: 10.10.10.10, u: 1.10.10.1, k: 9.0.0.9.
[  629.635131] hifn0: ring cleanup 2: i: 10.10.10.10, u: 0.10.10.0, k: 10.0.0.10.
[  629.643103] 8d95a3b9e1823aeaff452dc6b285c73c
[  629.648346] fail
[  629.650323] test 2 (128 bit key):
[  629.653776] hifn_setkey: tfm: ffff81003a873f20, ctx: ffff81003a873f60, dev: hifn0 [ffff81003dd7c2c8], len: 16.
[  629.663960] hifn_setup_crypto: req: ffff81003a8739c8, tfm: ffff81003a873f20, ctx: ffff81003a873f60, keylen: 16.
[  629.674202] hifn_setup_session: start
[  629.677914] cmd: i=10, u=0, k=10
[  629.681186] src: i=10, u=10, k=0
[  629.684460] dst: i=10, u=10, k=0
[  629.687734] res: i=10, u=0, k=10
[  629.691008] hifn0: iv: 0000000000000000 [0], key: ffff81003a873f60 [16], mode: 1, op: 0, type: 0.
[  629.699951] hifn0: 1 dmacsr: 8898888c, dmareg: 22322023, res: 00100000 [11], i: 1.11.11.1, u: 11.11.11.11.
[  629.709663] hifn0: ring cleanup 1: i: 11.11.11.11, u: 1.11.11.1, k: 10.0.0.10.
[  629.716946] hifn0: ring cleanup 2: i: 11.11.11.11, u: 0.11.11.0, k: 11.0.0.11.
[  629.725063] 23a975b74c30c4d6ce38d6dcf0f57be6101112131415161718191a1b1c1d1e1f
[  629.733895] fail

/devel/acrypto/hifn :: Link / Comments (0)


My stuff has arrived.

Including the HIFN 7955 adapter. Expect the driver very soon.
Among other things my old server, a lot of books, a football, a tennis racket and sports shoes arrived. There is almost no room for other things in my apartment anymore, so I need to sort things out quickly.
The only missing bit is my bicycle.

/life :: Link / Comments (0)