My cellar now contains only a bottle of Martini - yesterday the last bottle
of black vodka was finished.
And I was told (even suggested/recommended/asked) to drink it not with usual
company, but with completely different people.
Doubts rarely visit me, but right now they have come.
Thinking...
After drinking a bit of vodka with Grange
we discussed the niche of my distributed storage.
Briefly (what it will be): this storage is a block device which connects several disks
over the net. It will allow implementing different redundancy mechanisms to get self-healing
abilities in case the underlying storages die.
That is it.
Now, what are the competitors?
The first and simplest one is the network block device. It allows connecting a remote
storage, but one device can handle only one remote peer, so to implement
a storage one needs to group peers connected via network block devices using, for example, LVM.
Device mapper on top of network block devices allows using existing redundancy mechanisms like
software RAID. This looks simple and good, but there are some problems: one device per remote node,
which essentially does not allow scaling to really big storages, since the number of devices is limited;
static configuration (the local node can not determine the size of the remote storage, that data is provided
during configuration before startup and can not be changed after the device was started); the need for a special
userspace process on behalf of which all transactions are performed; an excessive
acknowledge protocol, which creates a huge overhead when there is a flood of requests; global locking.
High-end SCSI storage with hardware redundancy. This one is limited by price - a two-unit server
can only contain about 14 disks and its price is already too high; if servers can be connected via
a special interconnect, the price increases even more, so for a petabyte (just an example) storage
the price will be about the same as the number of bytes. Hardware RAIDs are complex to work with,
especially when disks start dying. There are no hardware RAIDs with more than RAID6 redundancy,
which is just two dead disks per array. The price of RAID6 is too high. More common RAID5 controllers
only allow one dead disk per array.
High-end storages with different interconnects (like iSCSI), which actually can be cheaper,
but still require LVM/Device mapper/whatever to build a single storage on top of them. They do not scale
to really huge volumes because of the price, just like usual storage systems.
So, essentially, distributed storage just allows having a huge volume using
existing network infrastructure without any special expensive components, as well
as with additional high-performance low-latency interconnects, while providing
a set of features which existing systems either do not have or can not offer with a simple setup and high performance.
The approach I use is quite different from the most known and widely used Lustre
storage by Cluster Filesystem Inc (where I was asked to work some time ago :) (there
are a number of very similar designs out there, which I was pointed to privately a couple of times).
One of the differences is the fact that distributed storage does not require
any special API to work with it, it looks just like a usual huge storage connected locally
to the node.
I was told that I will not be able to complete such a task, that there is no need for it,
that there are better variants and so on, but frankly I do not care - I like the process,
so there are no problems for me.
That was a good training - I completed several new traces, a set of old ones,
and eventually tried a couple of slightly more complex ones than usual, but failed. One
of the traces is over passive yellow holds in the central sector - I do not
understand who gave it a 7a category, since it is much more complex than
another 7a I tried before, so it was not a surprise that I failed. Then I
tried a new 6b trace over blue holds in the central sector, but with part of the
trace over a horizontal negative slope - I generally perform quite badly
on such parts due to insufficient power endurance, and it was the very end of the
training, so I failed a couple of times on that part, but completed the vertical wall
part of the trace without any problems. I'm sure I would have completed that trace
if I had started it earlier, without the previous complex runs.
Anyway, that was a good time there.
And thus will not be used.
It has a serious limitation on the table format - each line in the table
must form a contiguous region, and it is not allowed to change
the limits (start and size) in the constructor, which effectively
breaks dynamic configuration of the storage.
So, no device mapper - I will use a raw block device and create
a proper binary protocol to configure the device from userspace instead
of the string based one used by device mapper.
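As an illustration only, this is how a fixed-size binary control message could look, compared with a device mapper style table string. It is a Python sketch, not the real protocol: the field layout, the command numbers and the names `CTL_FORMAT`, `pack_ctl()` and `unpack_ctl()` are all made up here.

```python
import socket
import struct

# Hypothetical wire format (little-endian, no padding):
#   u32 command, u32 node id, u64 start sector, u64 size in sectors,
#   4 bytes IPv4 address of the remote node, u16 port
CTL_FORMAT = "<IIQQ4sH"
CTL_SIZE = struct.calcsize(CTL_FORMAT)

CTL_ADD_NODE = 1   # hypothetical command numbers
CTL_DEL_NODE = 2


def pack_ctl(cmd, node_id, start, size, addr="0.0.0.0", port=0):
    """Build one control message for the (hypothetical) storage device."""
    return struct.pack(CTL_FORMAT, cmd, node_id, start, size,
                       socket.inet_aton(addr), port)


def unpack_ctl(buf):
    """Parse a control message back into its fields."""
    cmd, node_id, start, size, raw_addr, port = struct.unpack(CTL_FORMAT, buf)
    return cmd, node_id, start, size, socket.inet_ntoa(raw_addr), port


if __name__ == "__main__":
    msg = pack_ctl(CTL_ADD_NODE, 3, 0, 1 << 21, "192.168.0.7", 1025)
    print(len(msg), unpack_ctl(msg))
```

Unlike a table line, such a message has a fixed size, needs no string parsing, and nodes can be added or removed one message at a time while the device is running.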
So far I only added a couple of bits to dynamically add/remove storages and
encoding algorithms, which actually can be revisited later if needed.
Each algorithm is an entity which will send/receive data to/from nodes (and read/write it via the usual block IO path if
the node is local, which is supported too); it will determine if several nodes need
to be read/written (for example to recover if the array is degraded or to update
a checksum) and perform all actions.
The initial trivial algorithm will use a round-robin mechanism to write sequential data to several nodes.
The first codepath is used to flush all in-flight BIOs on demand and is not a common path.
The second one is the common way. It is performed in process context (either via ->readpage()
and do_generic_mapping_read() or via pdflush during writing)
and thus can sleep; there is no serialization lock anywhere in the path,
but the block layer ensures that only one make_request_fn() callback runs at a time for a given queue.
If make_request_fn() is already running, a new BIO will be queued.
This allows not implementing any kind of locking in the target's mapping function,
but there is an issue with dispatching work for different remote storages.
Let's assume we have two remote devices, one of which is connected via a very slow link
(I seriously doubt one can build a distributed storage on top of slow links,
but it is an example - a node can be overloaded or down and the effect will be the same),
so sending/receiving data to/from that device will take a lot of time.
If the processing function sleeps, requests for the fast device are stalled behind it; this essentially means that the target's map callback
should not be allowed to sleep and thus will not perform any processing itself.
Having async AIO or at least in-kernel event dispatching would be great here,
but I closed kevent,
which fits perfectly here, so I will implement some ->poll() based
dispatching/awakening mechanism to process non-blocking requests on behalf of a special thread.
So, the target mapping callback will issue a non-blocking request without queueing (checking ->poll()
first is a good idea); if it is not fully completed, the rest will be done by the dispatching state machine
in a dedicated thread.
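The idea can be modelled in userspace. The sketch below is Python, not the kernel code, and makes several assumptions: sockets stand in for remote nodes, `select.select()` stands in for ->poll(), and the `Dispatcher` class with its `submit()` method is a hypothetical name for the "map callback plus dedicated thread" pair.

```python
import select
import socket
import threading


class Dispatcher:
    """Finishes partially sent requests on behalf of a dedicated thread."""

    def __init__(self):
        self.pending = {}                  # socket -> bytes not sent yet
        self.lock = threading.Lock()
        self.work = threading.Event()      # set while something is pending
        self.running = True
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    @staticmethod
    def _send_some(sock, data):
        """Send what the socket accepts right now, return the unsent tail."""
        try:
            return data[sock.send(data):]
        except BlockingIOError:
            return data

    def submit(self, sock, data):
        """'Map callback' side: never sleeps. True if completed inline."""
        with self.lock:
            if sock not in self.pending:
                data = self._send_some(sock, data)
                if not data:
                    return True
                self.pending[sock] = b""
            self.pending[sock] += data     # keep ordering behind queued data
            self.work.set()
            return False

    def _run(self):
        while self.running:
            if not self.work.wait(timeout=0.1):
                continue
            with self.lock:
                socks = list(self.pending)
            # poll() analogue: which peers can accept more data now?
            _, writable, _ = select.select([], socks, [], 0.1)
            with self.lock:
                for sock in writable:
                    rest = self._send_some(sock, self.pending[sock])
                    if rest:
                        self.pending[sock] = rest
                    else:
                        del self.pending[sock]
                if not self.pending:
                    self.work.clear()

    def close(self):
        self.running = False
        self.thread.join()


if __name__ == "__main__":
    fast_tx, fast_rx = socket.socketpair()
    slow_tx, slow_rx = socket.socketpair()
    fast_tx.setblocking(False)
    slow_tx.setblocking(False)

    disp = Dispatcher()
    big = b"x" * (4 << 20)   # 4 MiB, assumed to exceed the socket buffer
    print("slow peer done inline:", disp.submit(slow_tx, big))
    print("fast peer done inline:", disp.submit(fast_tx, b"hello"))

    got = 0
    while got < len(big):    # the slow peer finally starts reading
        got += len(slow_rx.recv(65536))
    print("slow peer received", got, "bytes")
    disp.close()
```

The request for the fast peer is not held up by the slow one: `submit()` returns immediately in both cases, and only the unsent tail of the big request waits in the dispatcher thread.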
It looks like companies do not want me to work
with them on a part-time/contract basis, so things will be exactly like they were before -
free time after my pain^Wpaid work will be devoted to hacking.
Right now I unfortunately do not have that much time, but I will definitely
make some progress.
Lazy-boom-slacker Grange was not able
to get his ass out of a traffic jam, so I climbed alone.
As a result no new traces, but a number of interesting boulderings and some old
traverses. As a next level result I would mention a damaged shoulder
and leg after wild jumps, and some interesting new starts (with a couple
of heavy falls onto my back from 3-4 meters high).
Since I did not climb high, there was plenty of time to talk with other climbers
on the floor...
It was a fun training.
I have a question about the way data should be organized in distributed storage.
There are two possible ways:
sequential data is stored linearly - the first storage is filled until it is full, then
the next one and so on. Of course, using error recovery codes requires other
storages to be used, but the amount of CRC data is usually noticeably smaller than
the amount of actual data.
sequential data is distributed among several storages according to some algorithm (like round-robin).
The former allows increasing random access speed, the latter - sequential access.
Taking into account that even a 1Gbit network has about two times faster bulk
transfer speed than modern disks, the priority of this issue increases.
Likely round-robin writing is a good candidate for the first implementation.
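To make the difference between the two layouts concrete, here is a small Python sketch (a userspace model, not the storage code). The chunk size, node size and node count are made-up values, and the function names are mine.

```python
def linear_map(offset, node_size):
    """Linear layout: fill node 0 completely, then node 1, and so on."""
    return offset // node_size, offset % node_size


def round_robin_map(offset, chunk_size, num_nodes):
    """Round-robin layout: consecutive chunks go to consecutive nodes."""
    chunk = offset // chunk_size
    node = chunk % num_nodes
    node_offset = (chunk // num_nodes) * chunk_size + offset % chunk_size
    return node, node_offset


def nodes_touched(mapper, start, length, step):
    """Nodes serving a sequential request, sampled every `step` bytes."""
    return {mapper(off)[0] for off in range(start, start + length, step)}


if __name__ == "__main__":
    CHUNK = 4096          # hypothetical chunk size: one page
    NODE_SIZE = 1 << 30   # hypothetical 1 GiB per storage node
    NODES = 4

    def linear(off):
        return linear_map(off, NODE_SIZE)

    def striped(off):
        return round_robin_map(off, CHUNK, NODES)

    # A 64 KiB sequential read stays on one node in the linear layout...
    print(sorted(nodes_touched(linear, 0, 64 * 1024, CHUNK)))   # [0]
    # ...and is spread over all nodes in the round-robin layout.
    print(sorted(nodes_touched(striped, 0, 64 * 1024, CHUNK)))  # [0, 1, 2, 3]
```

With the round-robin layout one sequential request can be served by several nodes in parallel, while in the linear layout unrelated random requests are more likely to land on different nodes without disturbing each other.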
Pretty trivial: I've created a device mapper target which sends
data pages over the net. It can be configured
to work with any kind of network media (including various sockets).
Well, it is similar to the network block device, but does not require
a special process. BIOs remapped in the distributed
storage target operate with pages, so with an appropriate jumbo frame
setup there should be no copies at all (the code uses ->sendpage()).
Right now the code only sends data and does not receive it at all,
so an appropriate protocol should be designed.
There is also an open question about how BIOs/pages should be remapped.
Device mapper supports three types (which are not actually documented):
DM_MAPIO_SUBMITTED - this means that the target code will process the BIO
by itself - either put it and thus call the end_io()
callbacks, or submit it to other layers.
DM_MAPIO_REMAPPED - this must be returned when the target
remapped all fields in place and will not work with the given block IO request anymore,
so generic code can call generic_make_request(), process
that request further and eventually put the BIO.
DM_MAPIO_REQUEUE - supposed to be used when the BIO should be requeued
at end_io() time. It is used only in the multipath target.
There will be a map of the pages for local and remote nodes, according to the appropriate
redundancy self-healing algorithm,
so local BIOs will be moved directly to the next layer via generic_make_request(),
and BIOs/pages for remote nodes will be processed accordingly. In both cases
the DM_MAPIO_SUBMITTED path will be used (like now).
There might be a problem when the same BIO contains pages for both local and remote nodes;
in that case the BIO vector will be changed to only contain pages for the local node.
Another problem I see is how to dispatch reading and writing requests without locking the channel,
so that during a single write request it would be possible to read other ones.
The UK embassy called my work about 30 minutes ago, although they did not ask for me;
instead they talked to the accounts department, which did not know about my plans
and did not answer the embassy's questions, besides confirming that I do work there.
I want to think that it will not be a show-stopper.
Nick Piggin from SuSE announced several days ago
his rework of the buffer_head interface, which is used as a layer between the block layer
and filesystems. Its main goal is to obtain a memory region which directly
represents the content of the storage when reading, or a memory region
which will be written to the given position
on storage. Buffer heads have a number of disadvantages,
mainly high overhead and the possibility to deadlock writeout
(i.e. writing a page to disk in order to free it might require
an additional allocation). The interface is not that good either.
Fsblocks are supposed to fix that. Although they already have a set of advantages over
buffer_heads, not everyone is happy with the approach - namely the XFS guys, who want to be able
to map arbitrary blocks to storage, namely extents, so a better name
was suggested - 'buffer_heads on steroids' - only due to the existing limits
of both buffer_heads and fsblocks. So far, only the minix filesystem has been
converted to fsblocks, so there is quite a lot of work yet to be done.
This discussion forced me to recall my fs
design notes; one of the main things I wanted was/is to avoid buffer_heads
usage at all (well, it is only needed to map a page without the ugly need
to have a buffer_head for each block which can reside in the given page),
i.e. each object on the storage is always aligned to a page size (PAGE_CACHE_SIZE
actually, which is 4k), i.e. each inode will have about 100-200 bytes of control
information and then will have the rest of the page filled with the file's data
or directory entries. That would greatly speed up small file lookup and processing
as well as directory reading (including fairly trivial directory readahead,
the absence of which is a serious limitation of ext2/3/4 filesystems, which leads to major directory reading performance degradation when a directory contains a decent number of files/subdirs).
Such an approach can be described as something which takes the good parts from both
extents and delayed allocation.
So far it is a silent sleeping idea, maybe I will discuss it on KS.
One can track this
tag for information about my distributed storage development.
So, to recall how to write in C (I did not do that for about a week or two),
I'm writing an initial implementation for the multiple device stack (the same that
supports software raid and dm-crypt). It will obviously not be the latest revision,
will not support the interesting configuration techniques I think about,
will not have special failure-aware encoding (like Reed-Solomon or WEAVER),
but will only be used to create a storage on top of two devices - a local
block device and a remote one.
The initial goal is not even to achieve sequential reading speed equal to
the sum of speeds of both devices - only to make it really distributed and recall
what the block layer is. The last time I worked with it was about 4-5 years ago,
when I created an asynchronous block device
to test the acrypto
async crypto layer.
Taking into account that I ported dm-crypt to acrypto fairly quickly, that should not be that
complex a task. Expect some news tomorrow.
Switched off...
That was a great training - several new traces, a bunch of old interesting ones,
jumps at the end - just bloody perfect.
Physical form seems to be in shape now, so I can start new complex traces
and work hard on non-physical tasks.
Excellent mood.
My sister had her school graduation this weekend.
Those were fun days, although a bit nervous. Marina has grown into quite
a striking woman, but she is still a child.
My congratulations and happy way into new life!
Today I got to the office at about 7-30 and then went to the British embassy
to try to give them the documents again. I was in the visa center at 9-30
(it opens at 9-00) and got number 133 in line. Fortunately there were only
about 25 people in front of me with a case like mine, so I only spent
about 1.5 hours and started the process. First they asked me to rewrite the form
(I only tried to hand in the documents), I smiled nicely and said something
to the operator girl, so I was allowed not to stand in the line again;
quickly rewriting the form took about 7 minutes. I do not think that
the new letters are more readable, but that is the rule.
In 5 days I will know if I got the visa or not, so I will either order tickets and book
a hotel or stay at home.
Excellent training - started with warming traverses, then continued
with simple on-sight traces chained together in twos and threes. Eventually
I moved to very interesting, although already quite old traces - there
were a lot of people, so there was no easy way to select a trace.
The end of the training was the culmination - there is a complex enough trace
on a serious negative slope in the climbing zone, which has
a set of holds that are supposed to be used via a jump between them. The dynamic
jump is not that big, but on a negative slope it becomes a problem,
so I ended up fixing a rope above the place and jumped, wildly screaming
(likely I said to myself that I need to fly to the given hold no matter how,
in some strange language known only to me and my alter-me). Eventually
I completed that place, which now contains an additional hold, since no one
wanted to jump there.
Excellent finish to an absolutely stupid day.
I spent more than an hour in a queue at the embassy and during that time
only two people moved forward. I had a choice: either wait there until another 70
people finish, or have dinner. I think the choice is obvious, so tomorrow
I will go there again, this time early in the morning.
To select the perfect CRC coding technique to be used for the distributed
storage I'm about to start designing, I've run a set of tests to determine how
CRC speed depends on the size of the word used in the calculation. I use a simple XOR
CRC which just performs the XOR operation over the provided data block several times
and then shows its performance (i.e. number of operations per timeframe).
Here is CRC performance for the same blocksize and number of iterations (i.e.
number of CRC sums performed over the same block):
As expected, CRC speed improved with increasing word size;
unfortunately, for galois field multiplication the speed increase is not enough.
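The shape of such a test can be sketched as follows. This is Python written for illustration, not the original test program, and the absolute numbers it prints say nothing about the C results; the function names, block size and iteration count are my assumptions.

```python
import os
import time


def xor_checksum(block, word_size):
    """XOR together all word_size-byte little-endian words of the block."""
    if len(block) % word_size:
        raise ValueError("block length must be a multiple of the word size")
    acc = 0
    for i in range(0, len(block), word_size):
        acc ^= int.from_bytes(block[i:i + word_size], "little")
    return acc


def fold_to_byte(value):
    """XOR all bytes of a wider checksum into a single byte."""
    out = 0
    while value:
        out ^= value & 0xFF
        value >>= 8
    return out


def benchmark(block, word_size, iterations):
    """Return checksummed bytes per second for the given word size."""
    start = time.perf_counter()
    for _ in range(iterations):
        xor_checksum(block, word_size)
    elapsed = time.perf_counter() - start
    return iterations * len(block) / elapsed


if __name__ == "__main__":
    block = os.urandom(64 * 1024)      # hypothetical block size
    for word_size in (1, 2, 4, 8):
        rate = benchmark(block, word_size, iterations=20)
        print(f"{word_size * 8:2d}-bit words: {rate / 1e6:8.1f} MB/s")
```

Wider words mean fewer XOR operations per block, which is where the speedup comes from. Folding a wide checksum down to one byte gives the same value as the byte-wise checksum, so the variants really do compute the same kind of sum.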
Let's discuss Reed-Solomon coding, which requires galois field multiplication of checksummed
(or raw) data words.
For a 16 bit CRC sum, galois field multiplication can be performed using a 2^16 table
lookup (2 lookups per multiplication or division operation), but for a 32 bit
CRC, galois field multiplication in GF(2^32) can not be performed using a table,
since it would be too big. To perform galois field multiplication essentially one needs
to solve a system of equations, where the number of equations is equal to the size of the
values being multiplied, i.e. 32 in our case. Even the fastest method of solving
a system of equations will be much much slower than 16 bit CRC plus its 2 lookups.
Even using Cauchy matrices for Reed-Solomon encoding with galois multiplication instead of Vandermonde ones
results in much slower code than table lookup.
So, for maximum performance we either need to limit the Reed-Solomon system to 16 bit CRC (galois field
multiplication is used in RAID coding for example), or to use a different coding method.
I'm studying WEAVER codes,
which are faster by design (and use only XOR operations without complex galois field multiplications),
but they have their own limits, so I'm trying to understand how a distributed system
will suffer from using such erasure coding.
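For reference, here is a Python sketch of the two ways to multiply in GF(2^16): with logarithm/antilogarithm tables and bit by bit. It is one common textbook construction, not necessarily the one used in the tests above. The polynomial 0x1100B is an assumption (a primitive polynomial often used for 16-bit symbols), and the table builder checks that it really generates the whole field.

```python
GF16_POLY = 0x1100B   # x^16 + x^12 + x^3 + x + 1 (assumed choice)


def gf_mul_bitwise(a, b, bits=16, poly=GF16_POLY):
    """Shift-and-XOR multiplication with reduction modulo `poly`."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> bits:          # degree reached `bits`: reduce
            a ^= poly
    return result


def build_tables(bits=16, poly=GF16_POLY):
    """Build antilog (exp) and log tables for the generator x = 2."""
    size = (1 << bits) - 1
    exp = [0] * (2 * size)     # doubled to avoid a modulo in multiplication
    log = [0] * (size + 1)
    x = 1
    for i in range(size):
        exp[i] = x
        log[x] = i
        x <<= 1
        if x >> bits:
            x ^= poly
    if x != 1 or len(set(exp[:size])) != size:
        raise ValueError("polynomial is not primitive")
    for i in range(size, 2 * size):
        exp[i] = exp[i - size]
    return exp, log


EXP, LOG = build_tables()


def gf16_mul(a, b):
    """Table based multiplication: a * b = exp[log a + log b]."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf16_div(a, b):
    """Table based division: a / b = exp[log a - log b]."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^16)")
    if a == 0:
        return 0
    return EXP[LOG[a] - LOG[b] + 65535]


if __name__ == "__main__":
    a, b = 0x1234, 0xBEEF
    print(hex(gf16_mul(a, b)), gf16_mul(a, b) == gf_mul_bitwise(a, b))
    print(gf16_div(gf16_mul(a, b), b) == a)
```

The tables for 16-bit words hold 2^16 entries each, which is small enough to keep in memory. The same construction for 32-bit words would need 2^32 entries per table, so only the slow bit-by-bit style remains there (`gf_mul_bitwise()` works for other widths when given a suitable polynomial).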
I've run several simple tests with a desktop e1000 adapter I managed to find.
The test machine is an AMD Athlon64 3500+ with 1 GB of RAM.
The other end is a desktop Core Duo 3.4 GHz with 2 GB of RAM and the sky2 driver.
The simple test included test -> desktop and vice versa traffic with 128 and 4096 block sizes in a netperf-2.4.3 setup.
The test machine runs 2.6.22-rc5-batch and the mainline tree (there is a test
with 2.6.22-rc4 and there is a noticeable performance win compared to
that tree in the latest git, likely tcp congestion changes resulted in
better utilisation). Batched xmit has better numbers.
Results:
2.6.20-ubuntu -> test machine
  2.6.22-rc4-batch:
    128:  725.43 724.14
    4096: 698.63 712.77
  2.6.22-rc5-mainline:
    128:  760.91 762.04
    4096: 784.32 788.53
  2.6.22-rc5-batch:
    128:  766.70
    4096: 788.24
test machine -> 2.6.20-ubuntu
  2.6.22-rc5-mainline:
    128:  558.16 (Desired confidence was not achieved within the specified iterations.)
    4096: 814.01
  2.6.22-rc5-batch:
    128:  569.14 (Desired confidence was not achieved within the specified iterations.)
    4096: 822.72
The batch patchset's design is quite simple: instead of setting up a DMA transfer for each
new skb, it queues them, and if there are several packets in the queue it sets up DMA
once for the whole batched queue (not more than a threshold though).
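The saving can be illustrated with a toy model in Python. It only counts DMA setups and ignores everything else a driver does; `dma_setups()` and the queue depths are made up for this example and have nothing to do with the real patchset code.

```python
def dma_setups(queue_depths, threshold=None):
    """Count DMA setups over several transmit rounds.

    queue_depths[i] is the number of packets waiting in round i.
    threshold=None models the unbatched case: one setup per packet.
    Otherwise up to `threshold` queued packets share a single setup.
    """
    setups = 0
    for depth in queue_depths:
        if threshold is None:
            setups += depth
        else:
            setups += -(-depth // threshold)   # ceil(depth / threshold)
    return setups


if __name__ == "__main__":
    rounds = [1, 4, 9, 2, 16]                  # hypothetical queue depths
    print("unbatched:", dma_setups(rounds))
    print("batched  :", dma_setups(rounds, threshold=8))
```

When the queue mostly holds a single packet, both variants do the same amount of work, so the gain depends on how often several packets are waiting at once.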
I was too lazy over the last week and actually did not do anything
interesting. I only completed testing and bugfixing of Jens Axboe's
network splice (the bug happened to be in the generic splice code,
but was triggered only by the network splice patchset, likely
it will be included in the next release), and some work on SLAB page
reference counting (that was even interesting - it was groundwork
for the initial network splice mechanism, but eventually I found it is not enough
and a simpler way is to just reference an skb and not free it
until the splice pipe is empty). Right now I'm setting up a testbed for
Jamal's network batching (although I did not want to participate initially -
first because I did not see a big margin in the idea, probably because
I did not understand it fully, and second, work with Jamal was not that productive -
after his request for libevent support of kevent he disappeared
and frequently talks ended not with code, but with empty discussions,
which I do not like). Anyway, I decided to answer, since I was in the Cc: list
(hint: if I will be at the kernel summit and in a good mood, I will raise
a request to always answer if a person is in the Cc list: the original sender
added people on purpose, and if someone has nothing to say, just say
'it is not interesting for me', 'I have no time', 'I will look at it later'
or 'it is in my backlog, I remember this' but never fucking disappear like in a black hole).
So, I found an old e1000 adapter and will test Jamal's network batching code
with a simple netperf setup.
Another testing task is to run btrfs tests in my
automatic filesystem testbed
(well, semi-automatic, graph generation scripts require manual run so far, maybe I will extend
it later). I will start when Chris Mason releases the 0.3 version (I hope quite soon).
But frankly, I'm too bored with paid work testing (I need to write bug reports
for someone else's automatic testing system, which can not start shell scripts), with splice testing
and batching testing.
I want my own project to continue, so expect something new soon.
That was a really good training - I climbed high
on the walls, not only the old, already boring traverses and boulderings.
There was quite a big break in my training (about a week or so),
so it was a bit hard to start, but nevertheless I completed
one new trace and a couple of more complex old ones. Eventually I decided to try
one really complex trace (7a), but I was already too tired, so it
was not completed - my arms were aching so much that sometimes I could
not even make a single movement.
Since I will now climb on the walls and shin up frequently
(the last time I regularly climbed high on the walls was almost half a year ago!),
I expect serious progress. But even if there is nothing of the kind,
it is good as is, since there are a lot of new and interesting traces I will climb.
Instead of doing interesting things I spent the last two days
reading internet forums about sailing yachts. I do understand
that it is a stupid way to spend the time, but I can not resist.
Actually there are at least two problems with it for me: first I need
to understand what type of yacht I like, is it something smaller for local
water (there are a number of various water reservoirs near Moscow) or something
bigger for the sea. And the complete absence of time of course - that is the main problem.
There are several other problems like the complete absence of money and experience,
but they can be solved without major pain.
Actually that is the last time I speak about it - until something changes,
there is absolutely no possibility for me to own a yacht, not because of money,
but because of time.
I can even buy the former, but likely it is not a good idea: I have no time
to work with a yacht and do not have experience. Actually the latter is a matter of
time, but there is no time.
So, for the nearest future let it be a dream...
NILFS is a log-structured filesystem by a Japanese telecom company,
which allows having snapshots and garbage collection over them.
Looks interesting, but it has a fundamental flaw - it writes all metadata updates
as a log in a special file, which means heavy fragmentation when the oldest updates
are removed. It also supports (at least it looks so) copy-on-write (and it looks
like only for data), which allows forgetting about slow recovery after a crash
and can improve performance.
NILFS supports checksums for segment data and related metadata. Superblock and management blocks
maintain their own checksums. Currently it is only crc32. There are no checksums for individual
data blocks.
NILFS uses a simple B-tree structure with 64-bit keys.
Like Btrfs it does not support
sync, direct and mmapped processing and quotas.
It also does not allow writable snapshots.
The first papers about this filesystem appeared two years ago.
There are no benchmarks on the official site.
It looks like log-structured filesystems are becoming popular
(taking into account that when I first wrote
about such filesystems in this blog there were zero projects in that direction).
There are a number of features currently not implemented,
which limit the design:
Absence of delayed allocation.
This means that fragmentation will kill the filesystem, even if it is extent based
(which is just a block size increase).
Currently btrfs allocates a new on-disk chunk for each 8k write.
Bad tree locking,
which does not allow multithreaded writing.
Some minor nits like absence of sync, direct and mapped processing,
which is unlikely to be a design problem though.
Actually, I plan to watch this project, but currently limit my filesystem
interests to the block layer - I still want to implement distributed
storage with raid-like functionality (not exactly raid, since for higher order checksums (like 32 bit crc)
it requires slow Galois-field multiplications, but something like that),
which could be put under the filesystem and allow putting usual filesystems
on top of huge arrays distributed over the net.
That was really interesting - its office building,
tables on the street under the roof, interior - it looks like a
good place to work at.
My main objection against any job except my current one is that
right now I frequently have enough free time after the paid work
projects to do my own, but if I move to some other place,
there will be interesting projects of course, but practice shows
that eventually they are completed and boring work starts, and it will take
all the time, while right now I have quite simple tasks, which
I can complete quickly and then work on my own projects.
That was until I heard (in rough translation): "You will do
whatever projects you like, if they are interesting for us (Yandex),
it will be a plus."
But there is a small problem with my current place - there are several
projects which will suffer a bit without my attention (read: if I
go away (I have two weeks by Russian law), they will suck), and
thus some people will have troubles (besides the fact that my bosses
will think I'm a moron who dropped them in such a situation).
So, I have some kind of moral contract.
Chris Mason has announced
a new filesystem, which has the following feature set:
Extent based file storage (2^64 max file size)
Space efficient packing of small files
Space efficient indexed directories
Dynamic inode allocation
Writable snapshots
Subvolumes (separate internal filesystem roots)
Object level mirroring and striping (not ready)
Checksums on data and metadata (multiple algorithms available)
Strong integration with device mapper for multiple device support (not ready)
Online filesystem check (not ready)
Very fast offline filesystem check
Efficient incremental backup and FS mirroring (not ready)
Quick design notes:
One large Btree per subvolume
Copy on write logging for all data and metadata
Reference count snapshots are the basis of the transaction
system. A transaction is just a snapshot where the old root
is immediately deleted on commit
Subvolumes can be snapshotted any number of times
Snapshots are read/write and can be snapshotted again
Directories are doubly indexed to improve readdir speeds
Impressive - that is a good thing.
Taking into account that it is exactly what I
wanted to implement,
I do not see any interest in continuing that project. And that is a bad thing.
Link.
For me the most interesting was the filesystem part - Linux does need a new filesystem,
which must be simple and fast. None of the existing ones satisfies both requirements even partially - each
one is complex and slow in some or all patterns.
I want to change that by developing my own filesystem,
but due to a total lack of time, progress is minimal. Maybe the approach I decided to take will even turn out
to be totally broken, but I want to find that out myself.
We will see...
Although received data is not valid (the file sometimes contains
several repeated chunks, and sometimes previous pieces of the original file,
likely as a result of incorrect processing of page-boundary crossing),
the kernel does not crash.
That forced me to change the page releasing code in fs/splice.c a bit,
since I think it is not correct that a page can be blindly freed there,
and to clone an skb for each page splice request, which is likely too big an overhead,
but on receive the fast clone is frequently unused, so maybe there is some gain
there.
Splice is an in-kernel mechanism which allows performing
zero-copy transfer of pages between different users: it is possible
to 'move' data between userspace (vmsplice) and/or files (file
descriptors). For example sendfile can be implemented via a splice call
(and it is for some file types). Receive splicing, on the other
hand, is not supported.
There were several attempts to implement receiving zero-copy, I recall
at least three: my work,
a patch by Alexey Kuznetsov and work by Intel folks (the latter is very
similar to what Alexey proposed, but was more generic, since it was
the first splice work, while Alexey's and my works were purely receiving
zero-copy (Alexey implemented a single-copy approach for unaligned data,
while I changed the driver to always properly align data)).
A couple of days ago Jens Axboe from Oracle posted his variant,
which used SLAB pages (those pages are allocated using
the kmalloc() function and contain network data if the driver does not
use pages as fragments), but it was quite broken, since SLAB pages
do not have reference counting (the only page which has a non-zero reference counter
is the first page in the combined set - SLAB uses 0 and higher-order
pages to store objects), and SLAB never changes the reference counter
when storing data in those pages. So, it is impossible to just increase
a reference counter for any SLAB page, since that will end badly when
the page is reclaimed by SLAB. I tried to fix those issues and eventually
completed reference counting for SLAB pages, which was based heavily
on Jens' work, but here comes another problem.
While a SLAB page is not being freed, it can be reused, and thus the same address inside
the page can store different data at different times. So, if the skb which holds a network
packet is freed, but splice has not finished with the given page, it is possible
that the freed pointer will be returned by a subsequent allocation, and the data will be
overwritten by the next packet. When splice finishes its work (for example dumps the page
to disk), incorrect data will be there.
The right way is to prevent skb freeing if the page its data refers to is being used by splice.
Seems simple, but it is not - the same page can contain quite a lot of packets,
so the page must hold a reference for every skb whose data is placed into the given page, but that
task is not that simple - there are no unused members in the page structure.
While I was writing this post, Jens posted a patch which implements exactly the same idea, but with
the introduction of a private field in the splice private structure.
Let's check this out.
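The reuse problem and the fix can be shown with a toy model. The Python below is not kernel code and not Jens' patch: `Page`, `Skb` and the two scenario functions are made up, and the allocator simply hands out the most recently freed slot first, which is an assumption chosen to make the reuse visible.

```python
class Page:
    """A page carved into slots; the most recently freed slot is reused first."""

    def __init__(self, slots):
        self.data = [None] * slots
        self.free_slots = list(range(slots))

    def alloc(self, payload):
        slot = self.free_slots.pop()
        self.data[slot] = payload
        return slot

    def free(self, slot):
        self.free_slots.append(slot)


class Skb:
    """A packet buffer whose data lives in one slot of a page."""

    def __init__(self, page, payload):
        self.page = page
        self.slot = page.alloc(payload)
        self.refs = 1

    def get(self):
        self.refs += 1
        return self

    def put(self):
        self.refs -= 1
        if self.refs == 0:              # last user gone: slot may be reused
            self.page.free(self.slot)


def splice_without_reference():
    """Splice remembers only page + slot; the skb is freed meanwhile."""
    page = Page(slots=2)
    skb = Skb(page, "packet-1")
    spliced_slot = skb.slot
    skb.put()                           # network stack is done with it
    Skb(page, "packet-2")               # next packet lands in the same slot
    return page.data[spliced_slot]      # what splice writes out later


def splice_with_reference():
    """Splice holds a reference on the skb until it has finished."""
    page = Page(slots=2)
    skb = Skb(page, "packet-1")
    held = skb.get()                    # taken when the data enters the pipe
    skb.put()                           # network stack is done with it
    Skb(page, "packet-2")               # has to use another slot
    result = page.data[held.slot]
    held.put()                          # pipe drained: now it can be freed
    return result


if __name__ == "__main__":
    print(splice_without_reference())   # packet-2
    print(splice_with_reference())      # packet-1
```

In the first scenario splice ends up writing the next packet's data; in the second the slot stays busy until the pipe is drained, which is the effect of not freeing the skb while splice still uses its page.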
I declined (as with Google), but they insisted on meeting (not as with Google :),
so I will go to see how they work. Actually, if I ever worked at Google or Yandex,
I would definitely like to create an automatic tracking system over
their maps, which would allow putting marks on the map and selecting
the shortest way between the points, taking into account information
about traffic jams and so on. There are such systems all over the world,
but they are heavily limited to specially crafted vectorized maps,
while I would start with plain pixmaps.
But working in such a company (no matter if it is Yandex, Google, SWSoft
or anything else) requires devoting much of the time to them, while I
prefer my own projects (without any gain though), so that would eventually
end up with cancellation of my own ideas. So, no, at least right now.
Failed to resist giving a link to the absolutely true
article
in "The Independent" UK newspaper - the only straight and
completely unbiased newspaper in the world.
Citation:
In the last months of her life, Diana was severely opposed to Putin's efforts to become the new dictator of the Russian Federation
I think world goes stupid faster than a snawball, and I'm sitting and watching.
This one was quite interesting, although I did not climb high on the walls,
but spent most of the time at the bottom doing various boulder problems, traverses
and starts. There were a couple of new interesting boulder problems, but mostly it
was the same as the usual training.
I've found a training stream which allows me to warm up quickly, but it is too easy
to cool down, and then my feet start aching when I try to push them into the
climbing shoes (minus 3 sizes), so training must be quite aggressive with shorter
rest times. But, frankly, it is unfair to climb at the bottom - you think
that you do something serious and interesting, but as soon as
your arms start aching, you drop down and wait, while on the wall there is no
way to move down, since after that you must start from the very beginning, which is
not interesting, so you squeeze the last bits of power out of the body until
either the trace is completed, or you can not lift your arm at all.
Shame on me, I did not mention the tas.b
instruction, which does exactly what is needed for a spinlock implementation -
it reads a byte at the address specified in a register, then sets a bit in the
value read, and writes it back. If the value read is zero, a special
bit is set. All operations are performed with the bus locked, so it is atomic.
I read the description of that instruction many times, but failed
to translate or understand it. But, whatever it was, spinlocks are perfectly
valid on the SH-4 family, so it would be possible to create an SMP system,
since I even found multiprocessor serial communication in the 7750R,
which allows data to be transferred between multiple CPUs, where each receiver
is addressed via a unique ID.
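The test-and-set behaviour described above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: `ToyMemory`, `SpinLock` and `demo()` are hypothetical names, a `threading.Lock` stands in for the locked bus cycle, and the choice of bit 7 as the bit being set and of "the T bit" as the flag follows my reading of the SH-4 manual rather than anything stated in the post.

```python
import threading
import time


class ToyMemory:
    """Byte-addressable memory with a model of a tas.b-like instruction."""

    def __init__(self, size):
        self.cells = bytearray(size)
        self._bus = threading.Lock()  # stands in for the locked bus cycle

    def tas_b(self, addr):
        """Read a byte, set bit 7, write it back; return True if it was zero."""
        with self._bus:
            old = self.cells[addr]
            self.cells[addr] = old | 0x80
            return old == 0  # models the flag ("T bit") set on a zero read

    def store_b(self, addr, value):
        with self._bus:
            self.cells[addr] = value & 0xFF


class SpinLock:
    """Spinlock built only from tas_b() and a plain byte store."""

    def __init__(self, mem, addr):
        self.mem = mem
        self.addr = addr

    def lock(self):
        # Spin until we are the one who saw the byte as zero.
        while not self.mem.tas_b(self.addr):
            time.sleep(0)  # yield, so the current owner can make progress

    def unlock(self):
        self.mem.store_b(self.addr, 0)


def demo(threads=4, rounds=500):
    """Increment a shared counter from several threads under the spinlock."""
    mem = ToyMemory(1)
    lock = SpinLock(mem, 0)
    counter = [0]

    def worker():
        for _ in range(rounds):
            lock.lock()
            counter[0] += 1
            lock.unlock()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter[0]


print(demo())
```

The point of the sketch is that one atomic read-modify-write on a byte is enough to serialize several executors; no irq disabling is involved in the lock itself.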
The main one is the full absence of locking instructions (or instructions
which automatically lock access to a memory region, like ppc lwarx/stwcx
or the lock prefix for x86),
which means that any atomic operation must contain
irq disable/enable code. Furthermore, it is impossible
to create a spinlock which can serialize parallel execution on
several CPUs.
It is possible to work around the problem using locked PIO access,
but that is very slow.
Likely SuperH SMP will only support the newer CPU family. SuperH SMP support
is scheduled for the 2.6.23 timeframe.
Actually it is a more serious problem.
The existing cryptoapi is very software-centric, i.e. it does
not have any hardware context and thus does not need to initialize one
per request. In hardware (at least in HIFN) each crypto request must
be programmed into the device's ROM and registers, so when
a user creates a new crypto request (read: a new crypto user requests
operations by creating a crypto TFM structure), it is only assigned
a software context, which contains, besides other fields,
crypto functions (encrypt/decrypt/setkey), but does not contain
a pointer to the crypto device (since software does not need it,
it can store and initialize itself at setkey time).
Other crypto devices (padlock and geode) do not have such problems,
since padlock just uses the same software approach, but with different
asm instructions, and geode has only one device, which has a global
pointer accessed from the crypto processing functions.
There are no changes since the previous release, since the API has not changed.
It is a sign that I do care about this project, although if I had
started it from scratch, I would have changed some bits. It is possible
to do that right now, but I do not have feature requests for that.
I have a photo camera, a sleeping bag and a liter of black vodka.
And I'm going on a canoe trip next weekend (previous trip story).
I wish myself seven feet under the keel.
That was a really good training - I even climbed high -
two sets, each one of two traces without rest in between.
The first set was two on-sight traces (which I did not complete fully -
two fails on the second trace, although it is not that complex,
but I am not in perfect shape yet) of low complexity (interesting
traces, but not so much as to repeat them again), and the second was two old
ones - the first one had the same complexity as the first on-sight (6a+) and
the second one was a bit more complex (6c), but it was old, so it was not too hard,
but nevertheless I failed.
Give me two weeks of regular training, and I will complete all traces
on the vertical walls which are less than 7a complexity (and actually
there are a couple of interesting 7a I tried, and partially completed).
I want to climb!
The rest of the training was devoted to old traverses and starts - there are
several new traces (starts) I have not completed yet.
Added a lot of cryptoapi bindings.
The driver now supports AES, 3DES and DES ciphers with all four crypto modes
(ECB, CBC, CFB and OFB).
My cat pissed on my domestic slippers when he saw that.
And believe me, he does know how to distinguish between good and other code.
Well, I do not have a cat, but if I had one, he would definitely
piss on my slippers.
The interested reader can find the bootloader code,
which resides in the MBR and reads the kernel from
an offset of 1 segment using the standard SuperH IPL (aka BIOS),
in the archive.
The size of the image it reads should be less than 1179648 bytes,
otherwise you will need to increase that parameter (see comments).
One can also change the offset from which the kernel image is read.
There is also a compilation and CF writing script,
which is easy to understand.
To write the kernel image (with the default parameters in the bootloader) you just need
to run the following command:
SH IPL+g version 0.9, Copyright (C) 2000 Free Software Foundation, Inc.
This software comes with ABSOLUTELY NO WARRANTY; for details type `w'.
This is free software, and you are welcome to redistribute it under
certain conditions; type `l' for details.
2002/09/09 Making. 2004/09/08 I-O DATA NSU Update.
266:133:33 on base clock 22.22MHz and SDRAM 4 burst. CF boot.
PCIC initialization done.
MASTER:48bit LBA mode non support
Disk drive detected: LEXAR ATA FLASH V1.00 11014102039199095066
LBA: 001EBF10
DiskSize: 1031675904Byte
PIO MODE1
Set Transfer Mode result: 50
> b
Set Transfer Mode result: 50
Initialize Device Parameters result: 50
IDLE result: 50
Starting from MBR
Error
00000000
Jumping to second stage
checksum: 0x4f222fe6
input_len: 0x0010baf4
input_data: 0x8c80300c
Uncompressing Linux...
insize: 0x00000000
inptr: 0x00000000
insize: 0x0010baf4
inptr: 0x00000001 Ok, booting the kernel.
cၲͥrrjj:ͪͪj"Bjɕ
B:ͥrrJсRչҚҒ
ISHV$&'H.]X_pfn = 0xc000, low = 0xe000
Zone PFN ranges:
Normal 49152 -> 57344
early_node_map[1] active PFN ranges
0: 49152 -> 57344
Node 0: mem_map starts at 8c1eb000
I-O DATA DEVICE, INC. "LANDISK Series" support.
Built 1 zonelists. Total pages: 8128
Kernel command line: console=ttySC1,9600 console=tty0 mem=32M
Setting GDB trap vector to 0x80000100
PID hash table entries: 128 (order: 7, 512 bytes)
Using tmu for system timer
Using 8.333 MHz high precision timer.
Console: colour dummy device 80x25
Dentry cache hash table entries: 4096 (order: 2, 16384 bytes)
Inode-cache hash table entries: 2048 (order: 1, 8192 bytes)
Memory: 30520k/30520k available (1345k kernel code, 2248k reserved, 438k data, 72k init)
PVR=04050005 CVR=20480000 PRR=00000113
I-cache : n_ways=2 n_sets=256 way_incr=8192
I-cache : entry_mask=0x00001fe0 alias_mask=0x00001000 n_aliases=2
D-cache : n_ways=2 n_sets=512 way_incr=16384
D-cache : entry_mask=0x00003fe0 alias_mask=0x00003000 n_aliases=4
Mount-cache hash table entries: 512
CPU: SH7751R
NET: Registered protocol family 16
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
Autoconfig PCI channel 0x8c1bacc0
Scanning bus 00, I/O 0xfe240000:0xfe280000, Mem 0xfd000000:0xfe000000
00:00.0 Class 0200: 10ec:8139 (rev 20)
I/O at 0xfe240000 [size=0x100]
Mem at 0xfd000000 [size=0x100]
00:02.0 Class 0c03: 1033:0035 (rev 43)
Mem at 0xfd001000 [size=0x1000]
00:02.1 Class 0c03: 1033:0035 (rev 43)
Mem at 0xfd002000 [size=0x1000]
00:02.2 Class 0c03: 1033:00e0 (rev 04)
Mem at 0xfd003000 [size=0x100]
PCI: Using configuration type 1
Time: SuperH clocksource has been installed.
NET: Registered protocol family 2
IP route cache hash table entries: 1024 (order: 0, 4096 bytes)
TCP established hash table entries: 1024 (order: 1, 8192 bytes)
TCP bind hash table entries: 1024 (order: 0, 4096 bytes)
TCP: Hash tables configured (established 1024 bind 1024)
TCP reno registered
gio: driver initialized
io scheduler noop registered
io scheduler anticipatory registered (default)
io scheduler deadline registered
io scheduler cfq registered
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
SuperH SCI(F) driver initialized
sh-sci: ttySC0 at MMIO 0xffe00000 (irq = 25) is a sci
sh-sci: ttySC1 at MMIO 0xffe80000 (irq = 43) is a scif
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
loop: module loaded
8139too Fast Ethernet driver 0.9.28
8139too 0000:00:00.0: This (id 10ec:8139 rev 20) is an enhanced 8139C+ chip
8139too 0000:00:00.0: Use the "8139cp" driver for improved performance and stability.
eth0: RealTek RTL8139 at 0xfd000000, 00:a0:b0:6c:d0:eb, IRQ 5
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ehci_hcd 0000:00:02.2: EHCI Host Controller
ehci_hcd 0000:00:02.2: new USB bus registered, assigned bus number 1
ehci_hcd 0000:00:02.2: irq 5, io mem 0xfd003000
ehci_hcd 0000:00:02.2: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 5 ports detected
ohci_hcd 0000:00:02.0: OHCI Host Controller
ohci_hcd 0000:00:02.0: new USB bus registered, assigned bus number 2
ohci_hcd 0000:00:02.0: irq 7, io mem 0xfd001000
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 3 ports detected
ohci_hcd 0000:00:02.1: OHCI Host Controller
ohci_hcd 0000:00:02.1: new USB bus registered, assigned bus number 3
ohci_hcd 0000:00:02.1: irq 8, io mem 0xfd002000
usb usb3: configuration #1 chosen from 1 choice
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
usbcore: registered new interface driver libusual
heartbeat: version 0.1.0 loaded
TCP cubic registered
NET: Registered protocol family 1
NET: Registered protocol family 17
VFS: Cannot open root device "" or unknown-block(0,0)
Please append a correct "root=" boot option; here are the available partitions:
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
I'm now accepting congratulations.
Although Linux is well supported on that board, and actually it was preinstalled there
(until an OpenBSD developer's hands murdered it on/from the compact flash), this is a serious step
for me.
It is the second non-x86 CPU I managed to start Linux on.
The first one was
ppc32 (AMCC 405 GPr): I ported Linux to our company's own board about two years ago
(although the initial loader there was not written by me either).
So, I only wrote an initial loader in SuperH asm, which resides in the CF's MBR;
it reads the Linux image from a predefined offset on the CF into RAM and jumps there. It does
not initialize caches, MMU and other parameters, since that was done
by the IPL (Initial Program Loader), which resides on NAND flash; the board actually
boots from that flash, and only later does that code jump into the CF's MBR.
The loader does not even initialize the command line, but it can compute a checksum of the image,
which can be tested against the checksum written to the compact flash.
I will play a bit with this board until Grange
returns from the OpenBSD hackathon. I will try to set up a partition table and filesystem,
so that userspace could be started there.
Actually, when he returns, I will ask for the Xscale board their company
develops. They run Linux there without serious problems (it is based on ixp425(?)),
but Linux in general sucks on that CPU, since it does not have support for scheduling
domains (up to 16), which allow rescheduling latency to be greatly improved
(or they are not turned on, or whatever, I was quite surprised by
that news actually). OpenBSD has such support, and there is a patch for Linux 2.6,
but it is quite broken (for example shared memory does not work).
All the bottles of alcohol in my loft look really ancient - you know,
all those dusty bottles covered with cobwebs, with some cloudy liquid inside.
That is what I have (had until yesterday). Looks cool - that is a
positive side of living in the developed loft.
Actually the only reason I started to watch it was to find out whether I can
understand Linus' and Andrew's speech (btw, does Linus look a bit nervous there?).
I can, although not always.
I have talked with someone in English only once in my life (not counting English
lessons in school and university).
And I'm going to go to the Linux Kernel Summit (not 100% sure, need to sort out
technical details) to talk with core kernel hackers.. Ugh...
You might have quite a bit of trouble talking with me, although for me the only problem
is a very small and far from decent vocabulary.
You have been warned.
And now I am thinking whether I'm stupid or not.
And also thinking about how to bribe my budget
to get some money to buy a camera and to travel to Cambridge
this September for the Kernel Summit.
I want to get a DSLR camera. Although I'm far from being a good photographer,
I like the process of taking shots with DSLR/SLR cameras much more
than with automatic digital cameras. Likely I would start
with something simple like a Nikon D40 with the kit lens; later
I would add a long-focus lens (I like to photograph distant objects, mostly industrial buildings).
As I posted recently,
I managed to start the kernel booting, but the image can not be decompressed and
thus does not boot beyond the initial code. I expected the problem to be in the
code which reads data from compact flash into memory - it uses
IPL (BIOS) code to copy data, which is undocumented and likely broken.
Today I started to play with the data in RAM and decided to compute a simple checksum
of the image I copy and compare it with the on-flash image checksum. I used
a pretty simple checksum - a 4-byte xor over the whole image, using the following SuperH asm code:
Although it prints a reversed hex number, it was enough to find that
after 2^17 bytes had been read the checksum stopped changing
(because of the xor property it means that memory past that limit was not changed by the copy),
and 2^17 was the boundary where the
checksum was equal to the one calculated over the compact flash image on the host system.
So, my idea about wrong code, which reads data from compact flash into the RAM,
is correct; now I need to think (or decipher the sh-lilo second stage loader's code)
about how to correctly read about one megabyte of data from flash. The main problem
is of course LBA/CHS addressing and some unknown feature called 'device number'
in a comment to the reading code in sh-lilo.
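The checksum experiment above is easy to replay on the host side. What follows is a Python sketch of the same idea, not the SuperH asm from the post: `xor32()` and `matching_prefix()` are hypothetical helper names, the little-endian word order is an assumption, and the small buffers at the end are made-up stand-ins for a RAM dump and the flash image.

```python
import struct


def xor32(data):
    """XOR of all 32-bit little-endian words; a short tail is zero-padded."""
    padded = data + b"\0" * (-len(data) % 4)
    acc = 0
    for (word,) in struct.iter_unpack("<I", padded):
        acc ^= word
    return acc


def matching_prefix(ram, flash, step=512):
    """Length of the longest prefix whose running checksums still agree.

    The running checksum is updated every `step` bytes (step must be a
    multiple of 4) for both buffers; the scan stops at the first mismatch.
    """
    limit = min(len(ram), len(flash))
    acc_ram = acc_flash = 0
    good = 0
    for off in range(0, limit, step):
        acc_ram ^= xor32(ram[off:off + step])
        acc_flash ^= xor32(flash[off:off + step])
        if acc_ram != acc_flash:
            break
        good = min(off + step, limit)
    return good


# Made-up data: a 256-byte "flash image" of which only the first half
# reached "RAM"; the rest of RAM is still zero.
flash = b"".join(struct.pack("<I", i + 1) for i in range(64))
ram = flash[:128] + bytes(128)

print(hex(xor32(ram)), hex(xor32(flash[:128])))  # equal: zeros add nothing
print(matching_prefix(ram, flash, step=4))       # where the copy stopped
```

Since xor with zero leaves the accumulator untouched, the checksum of the whole RAM buffer equals the checksum of the part that was actually copied, which is the effect the post relies on to locate the boundary.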
Back to drawing board...
As for Friday, it was general physical endurance training - a lot
of exercises with small weights or without them. This time I started
late and only had about one hour for them, so I only completed
5 rounds of 7 types, each one of 10 exercises in the set. So instead of
10 rounds of 10 exercises for each type I only completed half, but also added
several traverses, mostly warming ones, but one was a bit complex,
although way too old.
During the exercises something was displaced in the right shoulder again,
but since it is not the first time, I got used to the aching parts.
Ancient Latins and Greeks finished their summits with fights, which
frequently were the last argument in a dispute, so if I go to the Linux
Kernel Summit (and likely I will), I need to add punching bag
exercises into my set - Linus and most of the developers are quite big,
although quite slow I think :)
In this image
(500kb) I see a Linux kernel, a broken loader and a blonde in a red dress.
So far it is the only way to determine what I put into the RAM and where the inflating code fails.