Zbr's days.

Wed, 28 Feb 2007

Climbing evening. [ND]


That was an interesting, although not long, training - I started with the usual warming traverses, and after an hour Grange arrived with three amateur friends - I spent quite some time teaching them the basics of climbing, and eventually they got an insurance lesson from the instructor and started to climb by themselves.
I completed one new trace on-sight a couple of times - it was not that complex a trace; they built it using the holds of the previous 'mini-cooper' trace over small holds. I liked 'mini-cooper' more, maybe because it was more complex.
Then I climbed an old, quite complex trace over blue holds in the central sector (6c+), but failed - I am still not in the best form. The closing trace was a 6b+ on-sight - that was a mistake - I completed only half of it without failing. I need to change the way I start traces - I do not know why, but I always start complex and on-sight traces mostly at the end of the training, when there is almost no power left in the muscles.

At home I played my trumpet.
Wellll, not exactly 'played', but instead tried to get somehow not too ugly sounds out of it. I practiced about an hour, but still was not able to make it sound even remotely good. But I have an excuse - I 'played' quite silently, and trumpet IMHO is such an instrument, which works (at least for total newbie) only with full sound.
It is quite good that I do not have neighbours right now...

/life :: Link / Comments (0)


How do run-time kfree() checks depend on APIC on x86_64? [ND].


I created a couple of trivial patches to check at run time for the situation when a wrong pointer is passed to kfree(), like in this example:

skb = alloc_skb(size, flags);
kfree(skb); // instead of correct kfree_skb(skb);
They are based on the fact that every allocated object carries information about which struct kmem_cache was used for its allocation (and kmalloc() is just a set of special pools), so at run time we can detect whether the object must be returned to a kmalloc() pool or not.

First attempt was just to check every kmalloc() pool against data "embedded" in the freed object.
Eric Dumazet suggested another variant - to mark kmalloc() pool with special bit and only check that bit in the kfree().
Although slab bits are generic and not specifically related to particular pools, so it might be wrong, I decided to implement it too.
In theory that would be trivial, but practice shows some absolutely unpredictable results.
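The two variants can be modelled in userspace. This is only a sketch of the idea, not the actual patches: `struct pool`, `POOL_KMALLOC`, `pool_alloc()` and `checked_free()` are hypothetical names invented here, and a small header in front of each object stands in for the slab's knowledge of which cache an object belongs to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names come from the kernel. */
#define POOL_KMALLOC 0x1u	/* assumed flag bit marking kmalloc()-like pools */

struct pool {
	const char *name;
	size_t obj_size;
	unsigned int flags;
};

/* Each object is preceded by a header naming the pool it came from. */
struct obj_header {
	struct pool *owner;
};

void *pool_alloc(struct pool *p)
{
	struct obj_header *h;

	assert(p != NULL);
	h = malloc(sizeof(*h) + p->obj_size);
	if (!h)
		return NULL;
	h->owner = p;
	return h + 1;
}

struct pool *obj_pool(const void *obj)
{
	return ((const struct obj_header *)obj - 1)->owner;
}

/* First variant: compare the owner against every general-purpose pool. */
int owner_in_table(const void *obj, struct pool *const *tbl, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (obj_pool(obj) == tbl[i])
			return 1;
	return 0;
}

/* Second variant: test only the flag bit of the owning pool. */
int owner_is_general(const void *obj)
{
	return (obj_pool(obj)->flags & POOL_KMALLOC) != 0;
}

/* Unchecked release, used by the pool's own free path. */
void pool_free(void *obj)
{
	free((struct obj_header *)obj - 1);
}

/* Returns 0 on success, -1 if the object came from a dedicated pool. */
int checked_free(void *obj)
{
	if (!owner_is_general(obj))
		return -1;
	pool_free(obj);
	return 0;
}
```

In this model an object taken from a dedicated pool (flags == 0) makes checked_free() return -1 instead of being released, which corresponds to catching kfree(skb) at run time.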

When second patch is applied, x86_64 APIC code fails to detect a timer, so system can not boot (it does though if 'noapic' parameter is set):
	..MP-BIOS bug: 8254 timer not connected to IO-APIC
	IO-APIC+timer do not work
So I started to dig into x86_64 APIC code (which looks like it was copied as is from i386 one), and eventually increased timer detection timeout, which allows system to boot with kfree() run-time check (btw, it is additional check, there are plenty of others there, and still system worked), but I think it smells rather badly.

I'm puzzled how those things are related to each other, so I posted a detailed description of the problem and my patch to netdev@ so other hackers could shed some light on it.

/devel/other :: Link / Comments (0)


Tue, 27 Feb 2007

Ext4 extents will (probably) support block checksumming. [ND]


Andreas Dilger wrote:

In the ext4 extents format there is the ability (not implemented yet) to add some extra information into the extent index blocks (previously referred to as the ext3_extent_tail). This is planned to be a checksum of the index block, and a back-pointer to the inode which is using this extent block.

This allows online detection of corrupt index blocks, and also detection of an index block that is written to the wrong location. There is as yet no plan that I'm aware of to have in-filesystem checksums of the extent data.
I plan to have such a block in each inode/data block not only extents, unfortunately it is impossible to implement for existing systems without on-disk format changes, so only extents will be protected by checksums in ext4.

An interesting note was later presented by Oracle engineer Martin Petersen about how data integrity is managed in modern hardware, but filesystem checksums are not about corrupted disks, but about corrupted files on perfectly correct hardware.

/devel/fs :: Link / Comments (0)


All fights about kevent vs. threadlet are over. [ND]


I hope so, since wasting time in completely empty discussions is not a way to make at least some progress.
Eventually we all (Linus, Ingo and me) concluded that it is better to live in peace and have a mixed event and thread design - an event ring can be used for IO completion events, and IO itself can be backed (if it blocks) by a threadlet.

So, let's calculate what happened:

  • we wasted 48 hours in stupid words thrown into each other
  • I ran some syslets tests and found that:
    • in Ingo's web server there are no reschedulings at all in my environment, so threadlets do not work there as cachemiss threads
    • the real disk IO case (using Jens Axboe's FIO tool) shows a 30% speed degradation with the CFQ scheduler with syslets compared to libaio and sync reads; with the deadline scheduler the degradation is about 8-10%
  • threadlets are simpler to program for simple test cases (when the whole logical processing can be put into a single function without any interaction with other parts), otherwise it can end up as a disaster of synchronization problems
  • kevent likely will not be included
  • kevent likely has some bugs or I screwed my aio tree (it includes syslets/threadlets and kevent)
  • Andrew Morton gave a talk at FOSDEM 2007, where he also mentioned kevent as a good thing, but it fails to get attention, since it is a big step in the way people use the kernel. Either kevent or something similar could be merged. (Thanks to Xavier Nicollet for pointing that out).
    My pessimistic prognosis is that kevent will be declined.
Many thanks to Ingo Molnar, Davide Libenzi, Linus Torvalds and all others for (sometimes) interesting discussion.

/devel/kevent :: Link / Comments (0)


Mon, 26 Feb 2007

Dillon's Infinitely Snapshotable And Segmented Transactional Exabyte Repository. [ND]


Excellent!

That is about upcoming DragonflyBSD filesystem.

After I read its design in detail, I can only say that it is not a design but a very initial draft, so it is too early to discuss even the ideas.

What raises a question for me is its segmentation.
Storage is divided into segments, each one has a special header to identify inner blocks, and segments can be transformed into bigger and smaller ones. (One problem I see right now is its indexing, which is only 16 bits, while the segment size is 4 Gb maximum, so we end up with only 48-bit addressing.)

Interesting ideas, which I plan for my FS too, are unlimited snapshotting and the absence of a log structure to store the history of file changes.
I also plan to replicate the superblock (as all filesystems should do anyway), which was not highlighted in Dillon's paper. I also plan to have a plugin layer (which should be able to be turned off so as not to show a red flag to Linux FS developers) and likely a unified cache for directory entries, inodes and pages.

/devel/fs :: Link / Comments (0)


Merde, I've missed climbing training due to endless (and actually empty) discussions.


I'm absolutely sure Ingo will not agree with me (and I even do not say about Linus), so what is the point?
Ingo said - 'fuck kevents', Linus will listen to him - things will be closed in private ring like usual, why do I ever continue?

And I finished bottle of wine (Krimsk's 'Tamansky Heres' - quite tasty thing).

Update: hmm, there is even a (somewhat distant, but nevertheless) positive side to their decision to drop kevent - I have plenty of time for other interesting projects. I just wish they had told me that half a year ago and had not fucked my brain, but from another point of view I got a lot of interesting challenges and support, which is a reward by itself :)

/life :: Link / Comments (0)


Kevent vs. epoll vs. threadlet on VIA EPIA.


Small machine (256 mb of ram, 1Ghz):

kevent:		849.72
epoll:		538.16
threadlet:
# gcc ./evserver_epoll_threadlet.c -o ./evserver_epoll_threadlet
In file included from ./evserver_epoll_threadlet.c:30:
./threadlet.h: In function threadlet_exec:
./threadlet.h:46: error: can't find a register in class GENERAL_REGS while reloading asm
That particular threadlet asm optimization does not work.

/devel/kevent :: Link / Comments (0)


How to get attention to your project (secret plans upto now). Kevent and Ingo's threadlets. Scholastic masturbation about events and execution context. [ND]


As you might have noticed, kevent has not been in favour up to now: it is quite a good subsystem, but it gets no attention. So, here are the steps to get it some.

AIO is a hot topic these days - mainly because of Zach Brown's efforts to create a generic AIO subsystem suitable for database usage on top of a micro-thread design (blessed by Linus Torvalds some time ago).
As you have noticed from this blog, I'm against such a step, so I started to participate in related threads.
Eventually Ingo Molnar implemented own micro-thread AIO design - it has the same generic disadvantages as all other micro-thread implementations from Zach and Linus, but since Ingo is a scheduler-god, he has some tricks (shit, I would look into a dictionary to check how part of the wear, which starts near the shoulder and covers the arm is called, but since it is time to work without dictionary (and admin at my paid work fucked my (hacked a bit due to bugs in firewall setup and social engineering) connection down to frequently miserable bytes/per second)), well, Ingo has some tricks for x86 which allows to create extremely high-performance kernel threads on x86.
So, kevent is in danger - one of its users - mainly kevent AIO, created as a pure state machine over some VFS bits and network items, kevent AIO can be replaced with micro-thread (ugly) design.
I started some participation (mainly in blog) in related threads, so eventually, when first syslet patch was created, I was in Cc list and started to analyze patchset (both in blog without political correctness, and in mail lists with one).

Now we have following things:

  • Ingo-scheduler-god is for his threadlet/syslet model, but agree, that kevents can be superior in some aspects.
  • Linus-linux-god is against me at all, mainly because we are doing purely theoretical even scholastic masturbation about events vs. execution context (read his mail to David Miller, where he argues that it is idiotic to think about network as AIO model, I will return to that item soon, it is too fun to talk with Linus using his words).
As you can see, Linus and Ingo (and a lot of other hackers) talk about kevent and events vs. context in general. It is good and interesting even not because kevent is discussed.
And it is challenging!

Heh, things have ended up in a complete fight against kevent.
Ingo Molnar wrote:
conclusion: currently i dont see a compelling need for the kevents subsystem. epoll is a pretty nice API and it covers most of the event sources and nicely builds upon our existing poll() infrastructure.
I need to say that I have not run any tests yet, since my Intel Core2 test machine has an x86_64 distro, and I can not download a new distro due to bandwidth limitations, and my VIA EPIA machine is still compiling a tree... So, right now I'm cooking up a tree which includes both kevent and syslet/threadlet patches to test kevent/epoll/threadlet. I will do it on a couple of my test machines: VIA EPIA (1 GHz, 256 Mb of RAM, 100 Mbit ethernet) and Intel Core2 CPU (2.40 GHz, 2 Gb of RAM, gigabit ethernet). The client will be 'ab' on my desktop Intel Core Duo 3.7 GHz test machine with gigabit ethernet.

So, stay tuned - if I will complete all setup before I go climbing (and wine will not end), I will post all results here (even if kevent will shitassly suck).

Meanwhile on x86_64 several runs of the same
ab -c8000 -n80000 http://192.168.0.48/

kevent			8447.78
			5248.45
			4792.66
			4689.78
			4978.08
			5207.60

epoll			4572.20
			5024.89
			4580.57
			4800.19
			4708.29
Even in the median kevent is faster than epoll. And sometimes it is about two times faster.
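To make the median claim checkable, here is a small sketch that computes it from the numbers listed above (the array and function names are mine). For these runs it gives a median of about 5092.84 for kevent (the mean of the two middle values, 4978.08 and 5207.60) and 4708.29 for epoll.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Requests per second, copied from the 'ab' runs listed above. */
double kevent_rps[] = { 8447.78, 5248.45, 4792.66, 4689.78, 4978.08, 5207.60 };
double epoll_rps[] = { 4572.20, 5024.89, 4580.57, 4800.19, 4708.29 };

static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a;
	double y = *(const double *)b;

	return (x > y) - (x < y);
}

/* Sorts v in place and returns its median. */
double median(double *v, size_t n)
{
	assert(n > 0);
	qsort(v, n, sizeof(*v), cmp_double);
	if (n % 2)
		return v[n / 2];
	return (v[n / 2 - 1] + v[n / 2]) / 2.0;
}
```

With so few runs the median is only a rough indicator, but it is less sensitive to the single 8447.78 outlier than the mean would be.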

/devel/kevent :: Link / Comments (0)


Debian sucks! [ND].


I'm using debian testing on all my test machines and desktop.
Debian testing (the future Etch) sucks (shit, I'm trying to select a word (but fail without an English dictionary; if you know Russian, I can provide a real poem about how broken it is) which would express the whole universe in a couple of letters, and also my whole mood regarding the regular crashes of gnome-terminal and Iceweasel, but I can not send a bug report, since I can not reproduce the problem on demand in a debugger). Fedora did not allow such bugs, although its X server ate my memory.

Crap. And bottle of the wine (I got a bottle of 'Heres') is almost over.
And I'm going climbing at 5-30 o'clock.

/devel/other :: Link / Comments (0)


Sun, 25 Feb 2007

DragonflyBSD will have a new FS. [ND].


Design notes can be found here.
Interesting-interesting, let's see how it correlates with my FS ideas.

So far, there are some crossing aspects, but there is a new idea of segments - I planned to use write anywhere filesystem layout compared to that.
I will take closer look tomorrow/today and will comment on the whole design.

/devel/fs :: Link / Comments (0)


Who are you in "Futurama"? (in russian, utf8 encoding).


Result of the test "Which Futurama character are you like":

Prof.
Leela
Hermes
Bender
Zoidberg
Fry
Amy
Kif
Zapp
15-52-541-20

If you want to learn more about your own character and about the characters of Futurama, read the article "Psych-o-rama".

Take the test.

/life :: Link / Comments (0)


Pop-art.


/other :: Link / Comments (0)


Sat, 24 Feb 2007

Doctors have fun. [ND].


LOL (in russian).

/other :: Link / Comments (0)


Some speculations about new trie implementation. [ND].


I've created a very memory-optimized trie implementation, which, although it uses a whole additional pointer to store just two bits, still eats about 30 Mb to store the full trie for 2^20 3-dimensional 32-bit elements. Yes, it is 10 Mb per dimension - i.e. the whole trie which covers 2^20 (one million) elements of 32 bits each - while the minimum amount of memory for such a set is 4 Mb.
It is a major win compared to previous implementation.
But let's see how speed has changed.

$ ./trie
Init, num: 1048576.
Lookup random, num: 10000.
num: 1048576, num_search: 10000, diff: 8340, speed: 0.834000, mem: 30 Mb.
Speed is the same.

I did not expect the same speed, so I started an investigation.
I created a simple test application, which allocates 32 structures, each one with three pointers (left, right and parent) just like a usual trie node, and sets all of them to point to the structure after the current one. So essentially I've built a tree of depth 32 and set up the appropriate pointers to point to the next element.

Running that test application ends up almost in the same performance (note that actually I ran 3 times of $num_search number to emulate 3-dimensional lookup, so 1-dimensional lookup takes about 3 times less):
$ ./test
num_search: 10000, diff: 7369, speed: 0.736900.
I used srand()/rand(), so I replaced them with manual calculation for pseudo-random value:
$ ./test
num_search: 10000, diff: 6547, speed: 0.654700.
Slightly better, but how is it ever possible that 32*3 pointer dereferences take 600 nanoseconds? Simple. The CPU runs at 3.7 GHz, so it is several instructions per nanosecond; in our case we have about one or two dereferences per nanosecond.

Just pure:
	gettimeofday(&tm1, NULL);
	for (i=0; i<3*num_search; ++i) {
		int j;
		struct trie_node *n = &root;

		for (j=0; j<32; ++j)
			n = n->right;
		diff += (long)n;
	}
	gettimeofday(&tm2, NULL);
eats about 100 nanoseconds per 3 such lookups, real lookup shifts value and selects a bit to determine which pointer to take (spurious diff added on purpose - compiler will not optimize that away).

So, essentially it is a maximum performance for the full-depth trie.
With caching optimization it is possible to be close to that limit in some situations - when I use pseudo-random generator from rand(3) man page (not rounded to 16 or 15 bits), just 'value = value*1103515245 + 12345;', I get following results for searching for 10k 3-dimensional elements in hash table or trie which contain 2^20 elements:
hash table (table size is 2^20)		195 nanoseconds
trie					130-160 nanoseconds
Memory usage for 3-dimensional trie in that case is about 17Mb.
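For reference, the generator mentioned above can be written as a tiny helper. This is only a sketch of how test keys might be produced with it; `struct key3` and `fill_keys()` are names invented here, not the ones from my test program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 3-dimensional key of 32-bit elements (hypothetical layout). */
struct key3 {
	uint32_t dim[3];
};

/*
 * One step of 'value = value*1103515245 + 12345' with no rounding to
 * 15 or 16 bits; the 32-bit unsigned wraparound is intentional.
 */
uint32_t lcg_next(uint32_t *state)
{
	*state = *state * 1103515245u + 12345u;
	return *state;
}

/* Fill n keys with consecutive generator outputs, starting from seed. */
void fill_keys(struct key3 *keys, size_t n, uint32_t seed)
{
	uint32_t state = seed;
	size_t i;
	int d;

	assert(keys != NULL || n == 0);
	for (i = 0; i < n; i++)
		for (d = 0; d < 3; d++)
			keys[i].dim[d] = lcg_next(&state);
}
```

The same seed always gives the same key sequence, so the hash table and the trie can be fed identical data.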

What were all those speculations about?
I just need to complete trie implementation, put it into kernel and run heavy socket creation/destruction/lookup benchmark.

/devel/networking :: Link / Comments (0)


New lighttpd patch.


I've released a new lighttpd patch for kevent; it fixes a potential stall of the server - due to the complete absence of documentation for programmers (and very good docs for administrators) I made some mistakes when programming the addition and deletion callbacks (and it is possible that there are others, since I still do not understand some fields).
With this version (for the kevent-37 release, although it can be used for 26+ releases too) a benchmark of kevent vs epoll on the lighttpd server does not show any difference (about 4k requests per second); the optimized evserver_kevent.c trivial web server produces up to 7900 req/second with the 'ab' benchmark with a 14k index page.

/devel/kevent :: Link / Comments (0)


Fri, 23 Feb 2007

Quotation of the week.


Question to Alan Cox in the threadlet/syslet thread:

Do you not understand that real user code touches FPU state at unpredictable (to the kernel) junctures?

/devel/other :: Link / Comments (0)


Thu, 22 Feb 2007

New kevent 'take37' release. [ND].


Short changelog:

  • ported to 2.6.21-rc1
  • documentation cleanups by Frederik Deweerdt
And here is my last words on kevent.

If you are somehow related to kevent development, please do not perform any steps about this post. Thank you.

A bit of history of the project.
I first developed it to create asynchronous network IO as a challenge request from Stephen Hemminger. Back in the previous winter I created both kevent as a generic event delivery mechanism and network AIO too. It worked, and it showed a noticeable performance and usability win. I tried to push it upstream but never got a response.
Later in the previous year David Miller gave kevent a good kick and I started to push it upstream again.
There were a number of changes aimed at improving performance and extending the feature list, and eventually we have a subsystem which indeed can be called a generic event handling mechanism.
Set of features it can work with includes file descriptor events - the same as in usual poll()/select() (there is a patch which implements epoll() over kevent), special high-performance socket and pipe notifications, possibility to get timer expiration events, POSIX timers and signal notifications, possibility to have any private userspace notifications and eventually even network AIO (sendfile(), open+sendfile()+close()).
I ported libevent and lighttpd to kevent.
But I'm continuously getting trouble obtaining feedback from kernel developers despite a huge support from several core kernel hackers and I want to thank them and a lot of other people who helped to develop that subsystem.

I do not care much about kevent inclusion - I hacked on it not for that but for the process itself, but such a hung situation takes a lot of time, throwing code and words again and again into a black hole. In response to my last kick to Andrew Morton and Ulrich Drepper to include kevent, I was told that it needs some more review and some more comments.

So, -rc1 is out and I've sent the latest kevent release. If there will be no feedback I will not continue to push it upstream. I will continue to support it on per kernel release basis like I do for many years for acrypto (asynchronous crypto layer). It is not too big work to maintain that, but pushing it into the rock wall is a bit uninteresting and actually stupid time spending.
Please do not kick someone to get a review or something like that - if people do not want it right now, there is no need to force them - eventually it can end up with some better system (let's see how acrypto work resulted (or maybe not) in a upcoming async crypto changes in Linux crypto stack) from some other kernel hacker.

As I expect to get some free time after -rc1 is out, I will continue to work on my interesting projects; the nearest ones are the M:N threading model and the netchannels trie patch to replace socket hash tables.
It looks like it is time to start new generation filesystem implementation too.

/devel/kevent :: Link / Comments (0)


Thinking more on syslets/threadlets and kevent. [ND]


As I showed, the thread-like AIO design is utterly broken when it comes to networking, so if it is accepted, a network programmer who cares about performance will need to wait on two types of events - AIO events, which deliver data from disks, and polling for network socket events. Eventually someone will reinvent/extend kevent to support AIO notifications.
Likely, if I add futex waiting to kevent, things will work just out of the box.
There is a post in my draft box, which will be the last one in the kevent series - I'm only waiting for the -rc1 release so that I can resend kevent for the last time.

/devel/kevent :: Link / Comments (0)


New syslet/threadlet release by Ingo Molnar.[ND]


The feature set includes a new entity called a threadlet - it is a function which will be executed in a sync context, and if it blocks (for example in some syslet) a new thread will be created on behalf of that function, thus this approach allows massively parallel execution.
Unfortunately there are serious caveats in the design - threadlets are not even designed to work with the network, where essentially most of the calls sleep.
The main problem here is the fact that having a whole thread to handle a simple execution context is wrong - it does not scale - rescheduling is damn too slow at the kernelspace->userspace boundary crossing, which is why I am developing an alternative M:N threading model in userspace.
I also doubt its usefulness - people who care will create their own thread for that, since POSIX thread creation overhead is generally much smaller than the time the function executes.

Among other changes:

  • get rid of locked pages for completion ring. (Just as prognosed, what else among implemented in kevent ideas will get its place in syslets?)
  • multiple completion rings
  • removed initialization and cleanup functions, instead parameters are explicitly transferred into execution syscall
  • bug fixes

/devel/kevent/aio :: Link / Comments (0)


Some thoughts on walking in -25 degrees centigrade frost. [ND].


If one of the following issues does take place, then walking in such a frost without a hat in a thin wear will not hurt you:

  • you are drunk (likely very drunk)
  • you are crazy
  • you walk in a sleep
  • you feel yourself too cool for such a frost
  • you are a frog or a bear
  • you do not have a hat and thick wear, and you need to go
I'm still thinking what I have from the above list.

/life :: Link / Comments (0)


Wed, 21 Feb 2007

Grange is a slackass or how to climb without insurance. [ND].


As you might expect, he missed the training again, so I climbed alone. Today I did not run with self-insurance; instead I did old boulderings and traverses. There is some pleasure in doing things without even taking any care about how others react to that - so I always climb in headphones with quite loud music and do not ask for insurance. It looks like doing my own boulderings and traverses attracts some attention, since there were several times when people took a photo of my moves - I never participate in photo sessions or publish my photos at all, but I'm not such a geek as to go to the photographer woman and ask her to delete the photos.
Although Theo de Raadt (OpenBSD founder) does.

That was interesting training, although without new traces - main achievement is quite stable trace of the complex 7a trace over red holds in the central sector (although I do it with replacement of one hold).

Btw, with this post I start week of no dictionary posts - generally I use russian-to-english dictionary extensively, especially for some trick words, but let's see how things will go without it. All posts created without dictionary will be marked as [ND] in the title.

/life :: Link / Comments (0)


New features for trie implementation.


I'm still sticking with the trie even though it was found to be very heavy.
The main advantage is that it can be implemented lockless on the read side, and even deletion is essentially free.
Another one is that I still think it has potential.
So I designed a simple extension for the classical trie implementation, which should drastically reduce memory usage in small setups and improve the situation with both performance and memory usage in all cases.
The extension consists of a kind of caching of the rest of the value in the leaf pointer. So, consider the simplest case - an empty trie to which we add a new 32-bit entry - instead of eating 32 nodes we just put the 32-bit value into the appropriate leaf pointer in the root entry (instead of a pointer it will store a value). When we later add some new entry, which shares some bits with that value, the system will turn the shared bits into real nodes and store the non-shared ones as values in place of leaf pointers.
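A minimal userspace sketch of that extension, under several simplifying assumptions of mine: one dimension only, an explicit tag next to each child slot instead of bits hidden in the pointer, the whole 32-bit key stored in the slot rather than only its remaining bits, and no deletion or locking. All names are invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum slot_kind { SLOT_EMPTY = 0, SLOT_NODE, SLOT_VALUE };

struct tnode;

/* A child slot holds either nothing, a real node, or a cached value. */
struct slot {
	enum slot_kind kind;
	union {
		struct tnode *node;
		uint32_t value;
	} u;
};

struct tnode {
	struct slot child[2];
};

struct trie {
	struct tnode root;
	unsigned long nodes;	/* nodes allocated besides the root */
};

static unsigned int key_bit(uint32_t key, unsigned int depth)
{
	return (key >> (31 - depth)) & 1u;
}

void trie_init(struct trie *t)
{
	memset(t, 0, sizeof(*t));
}

/* Returns 0 on success (or if the key is already there), -1 on ENOMEM. */
int trie_insert(struct trie *t, uint32_t key)
{
	struct tnode *n = &t->root;
	unsigned int depth;

	for (depth = 0; depth < 32; depth++) {
		struct slot *s = &n->child[key_bit(key, depth)];

		if (s->kind == SLOT_EMPTY) {
			/* Cache the value instead of building a chain. */
			s->kind = SLOT_VALUE;
			s->u.value = key;
			return 0;
		}
		if (s->kind == SLOT_VALUE) {
			uint32_t old = s->u.value;
			struct tnode *nn;

			if (old == key)
				return 0;
			/* Distinct keys must differ before the last bit. */
			assert(depth < 31);
			nn = calloc(1, sizeof(*nn));
			if (!nn)
				return -1;
			/* The shared bit becomes a real node, and the
			 * old value moves one level down. */
			nn->child[key_bit(old, depth + 1)].kind = SLOT_VALUE;
			nn->child[key_bit(old, depth + 1)].u.value = old;
			s->kind = SLOT_NODE;
			s->u.node = nn;
			t->nodes++;
		}
		n = s->u.node;
	}
	return -1;	/* not reached for 32-bit keys */
}

/* Returns 1 if the key is stored, 0 otherwise. */
int trie_lookup(const struct trie *t, uint32_t key)
{
	const struct tnode *n = &t->root;
	unsigned int depth;

	for (depth = 0; depth < 32; depth++) {
		const struct slot *s = &n->child[key_bit(key, depth)];

		if (s->kind == SLOT_EMPTY)
			return 0;
		if (s->kind == SLOT_VALUE)
			return s->u.value == key;
		n = s->u.node;
	}
	return 0;
}
```

In this sketch the first key costs no nodes at all, while a second key that differs from it only in the last bit forces 31 nodes to be created, which is the worst case that the classical trie pays for every entry.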

/devel/networking :: Link / Comments (0)


Tue, 20 Feb 2007

New optimized for socket-like scenario trie implementation.


I started and almost completed a new trie implementation, which is designed for a socket-like scenario and thus has no wildcards at all - search time in the 3-dimensional trie was reduced down to about 830 nanoseconds in my userspace simulator, compared to 350-400 nanoseconds for access to a properly sized hash table. Given that the trie itself can be trivially moved to RCU locking, and given its constant access time with any number of values stored, it is a good idea to move away from hash tables.
As bad news - it eats much more memory, since every bit in the address eats a whole structure (currently 8 bytes on x86) if it is a newly allocated one (or is reused), which does not scale that well at all compared to the hash table case, where the same setup eats about 8.5 times less.
For example, for 2^20 entries (3 dimensions of 32 bits each) in the test, I got the following timings per one lookup:

hash table 2^20 entries:		370 nanoseconds
hash table 2^19 entries:		470 nanoseconds
hash table 2^18 entries:		700 nanoseconds
hash table 2^17 entries:		1.14 microseconds
hash table 2^10 entries:		110 microseconds
trie					830 nanoseconds always
So, there is room for improvements yet.
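A rough way to read the table, assuming (my reading) that the sizes listed are the number of hash buckets while 2^20 elements are stored: the average chain a lookup has to walk grows as the table shrinks, whereas the full-depth trie always walks one node per key bit per dimension. The helper names below are made up for the illustration.

```c
#include <assert.h>

/*
 * Average chain length for a chained hash table holding
 * 2^elements_log2 items in 2^buckets_log2 buckets, assuming an even
 * spread; reported as 1 when there is at most one item per bucket.
 */
unsigned long avg_chain_len(unsigned int elements_log2,
			    unsigned int buckets_log2)
{
	assert(elements_log2 < 8 * sizeof(unsigned long));
	if (buckets_log2 >= elements_log2)
		return 1;
	return 1UL << (elements_log2 - buckets_log2);
}

/* Nodes visited by a full-depth bitwise trie lookup. */
unsigned int trie_steps(unsigned int key_bits, unsigned int dimensions)
{
	return key_bits * dimensions;
}
```

For 2^20 elements this gives chains of 1, 2, 4, 8 and 1024 entries for the five table sizes above, against a fixed 96 steps for the 3-dimensional 32-bit trie.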

/devel/networking :: Link / Comments (0)


To move or not to move to OLS?


That is a question...

I can even make a presentation (let's say about kevent or netchannels), likely it will be accepted, since it looks like it definitely does not break rules for being a good topic (I do not say about my talk skills, which are poor, only written proposal yet).
If I were drunk, I would even propose a talk titled "Why linux-kernel community suck" with kevent as an example - but not now.

So, I'm a bit stumbled...
Likely I will not move there, but it is definitely an idea to think about.

/devel/other :: Link / Comments (0)


Mon, 19 Feb 2007

Climbing evening.


It was quite a short training today - I only did several warming traverses, then a couple of very old traces, and eventually started to climb the new (for me) complex 7a trace I began at the previous training. This time Grange was there, so I could fall on insurance - I failed at several moments on the trace, but it is definitely a very interesting path - a lot of special tricky rollings, just like I like.

Later I was told that my monkish style of life is even stated as example for some people. Where does this world roll to?

/life :: Link / Comments (0)


Sun, 18 Feb 2007

M:N threading model news.


I've fixed a number of bugs (actually all known ones) and optimized some paths a bit. A comparison of create/run_empty_function/destroy thread against NPTL is shown in this picture.

NTL vs. NPTL

NTL is about 30% faster, its maximum time is around 18-20 microseconds, while NPTL sometimes spikes up to 40 microseconds.

/devel/threading :: Link / Comments (0)


New table design.


I've found that my original entrance door is made of quite good wood (likely pine, but I'm not yet sure) and not chipboard, so I decided to make a good table out of it. If I find another door I will build a really big two-dimensional table, otherwise a somewhat simpler one.
I plan to make a good table so that it can be used after the development is over as a writing/computer table (actually that is why I started building it at all - HP has returned my laptop from warranty repair (hard drive exchange), so I need a working place at home to do things).
The two designs below do not show vertical elements, since how I will make them fully depends on whether or not I find another door to work with; in the simplest case it will be just several vertical posts. The table will be covered with mordant and varnish at the end to make it look cool.
I could go to the building supplies shop and buy wooden boards (or even a usual computer table in a shop), but that is not interesting.
So, two designs of my future table (note that it will be placed near the window on the left, chair will be placed back to the wall).

Table design

/devel/flat :: Link / Comments (0)


Sat, 17 Feb 2007

Kevent has a git tree now.


Interested party can clone kevent git tree via http (164 Mb currently):

$ git clone http://tservice.net.ru/~s0mbre/archive/kevent/kevent.git/ ./
Here one can find steps needed to create public git repo (which is not that trivial step and it is not described in popular git tutorials).

/devel/kevent :: Link / Comments (0)


Network Appliance rocks!


If this is true then I can only say that NetApp is interesting company to work with.

That is really fun (at least for me), although I'm never a project leader.

Several months ago they asked me to join as a developer, but I doubt that the work would be interesting for me - I'm too individualistic by nature to work in a big company, but I think if the atmosphere is fun, the work is interesting too.

Have a nice time, NetApp and its developers.

P.S. Although they patented write-anywhere-filesystem-layout technique, which is too similar to some of ideas I have in mind for my own fs implementation, I think it is possible to rule that out.

/devel/other :: Link / Comments (0)


Linus and GNOME, kernel and kevent.


Is this a double standard, or just a lack of attention?
Linus proposed some patches to GNOME with this description:

I've sent out patches. The code is actually _cleaner_ after my patches, and the end result is more capable. We'll see what happens.

... Now the question is, will people take the patches, or will they keep their heads up their arses and claim that configurability is bad, even when it makes things more logical, and code more readable.
Doesn't it sound too similar to kevent mails? Damn yes.

/devel/kevent :: Link / Comments (0)


Fri, 16 Feb 2007

Syslet cancellation.


Davide Libenzi wrote:

What about the busy_async_threads list becoming a hash/rb_tree indexed by syslet_atom ptr. A cancel would lookup the thread and send a signal (of course, signal handling of the async threads should be set properly)?
I am tired of repeating that if AIO enters the async path, then it becomes a real event which can be waited for. And that is done through kevent, which already supports kevent cancellation. It also supports waiting for events, and reading them through the kevent queue and a ring buffer. The ring buffer is implemented in userspace - guess what will be added in the v3 release of the syslets?

/devel/kevent :: Link / Comments (0)


Syslets and events.


Ingo Molnar wrote:

We have kaio that is centered around block drivers - then we have epoll that works best with networking, and inotify that deals with some (but not all) VFS events - but neither supports every IO and event disciple well, at once.
I'm disappointed. I would understand if it was written by Linus Torvalds, but not Ingo...

/devel/kevent :: Link / Comments (0)


Thu, 15 Feb 2007

Snowboarding and skiing.


Hmm, there is a not that bad trace just 10 minutes' walk from my house, even with a working ski lift - I need to try it someday, although I have only tried a snowboard a couple of times several years ago, and have never tried alpine skiing at all.
But I think I learn quickly.

/life :: Link / Comments (0)


How to make a grave for your own linux kernel project.


Well, I started to argue against Linus in the manner he usually holds a conversation about syslets - one can find words like 'utterly ugly', 'stupid' and the like.
Not that I'm too happy with syslets - but they can be extended (at least their state machine) to support state machine elements other than syscalls - so they can be used to implement a really good sendfile and the like. And they are in any case better than all other micro-thread designs, which look neat and nice, but are utterly broken in some usage cases.
As you might expect, Linus is against syslets, mainly due to their state machine (actually without it they are the same as any other micro-thread design, with its pros and cons and some simple features).
But hell, if we can set up an iocb structure for AIO, why can we not set up the same for syslets? Although I agree with Linus in that in most cases it will not be used by usual users (in its current form I can only imagine a syslet constructed of some IO hint like fadvise and the actual read).

So, that likely means that Linus will be against me, and it is likely not a very good political step for making new friends, especially in the area I'm interested in. But I am tired of caring - they fucked my brain for the last 6 months and still have not paid attention to things already implemented, so - I do not care.

The more I think, the more sure I am that I know the reason.
I do not talk about anyone in particular. Absolutely. Only about tendency.
Even more, I am definitely guilty too.
But my main idea is that NIH syndrome is the culprit in all such political discussions.
What is NIH syndrome and how does it affect development?
Briefly, NIH (Not Invented Here) syndrome is the case when a new idea can not be moved forward in an existing system because the people behind that system do not want to break their own monopoly there.

In other words - it is the case when some people really suck, and they know that, but they can not change that, so they do not want to look worse when something new is going in and it was not created by them.

So, if you continue to read this blog, that is good. Nothing more.
I will return to this entry in a week when rc1 is ready. Do you know what will happen then? I know and can say in advance. Silence. Nothing. Well, as expected.

But hey, stop crying, girls: in any result there are positive consequences. If one thing got closed, there is plenty of time for other ones - better, faster and more cool.
So, stay tuned.

/devel/other :: Link / Comments (0)


Happy New Year!


Congratulations on the Eastern (lunar) New Year!

The traditional Gong-Xi-Fa-Cai congratulation (be rich).

/other :: Link / Comments (0)


Wed, 14 Feb 2007

Climbing training.


Although my foot is aching badly, that was a great training.
As usual, Grange was a lazyboomslackass, so I climbed alone again. Eventually, after a couple of traverses, I started a new 7a trace over red holds in the central sector with a start on a horizontal negative slope. That was hard, and I failed too many times, but in the end I did it (although I replaced one micro-hold with a huge hold, but without it that start is much more complex than 7a, which was confirmed by instructors). To continue I set up self-insurance - i.e. I put both the insurance device (gri-gri) and a carabiner into my system and pulled the rope while shinning up. It was not that easy and I failed a lot again, but I only wanted to complete a couple of holds with self-insurance, so it was not a big problem.
Wednesday is now officially a members-only day, so there were not too many people.
When I completed wall insulting with self-insurance and had a bit of rest, I tried that with the usual insurance by another climber - but all the power was already lost, so it was not an interesting finish to the training, but nevertheless quite exhausting and definitely good.
I ended up in a sauna with full absence of power.
Ugh, that was a great time.

/devel :: Link / Comments (0)


Tue, 13 Feb 2007

Ingo Molnar's syslets.


Ingo rocks - while we are talking through blogs and indirect links about possible AIO, he just got that implemented.

So, syslets.
A syslet is a so-called set of syscall execution requests, bound into one structure, which can be executed on behalf of a kernel thread or synchronously. The set of syscalls forms a state machine, so it is possible, for example, to asynchronously implement a web server in one syslet (accept()/send()/close()), although access to userspace variables is limited.
Doesn't it look similar? Yes, kevent AIO uses exactly the same design for its state machine, except that it works with simpler calls which have only one argument (a syslet allows up to 6 args), and kevent has a set of global threads while syslets have per-process ones (I have that entry in the TODO for kevents though). I must admit that kevent uses a state machine very similar to that of Tux (the in-kernel web server), although it was designed and implemented independently. Syslets are based on design ideas from Tux.
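
To make the state machine idea more concrete, here is a minimal userspace sketch of a chain of call requests executed one after another. It is not the syslet or kevent API: struct atom, run_chain() and the toy step functions are hypothetical names invented for this illustration, and a plain function pointer stands in for a syscall number.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ARGS 6 /* a syslet atom carries up to 6 syscall arguments */

/* Hypothetical atom: one call request plus a link to the next state. */
struct atom {
	long (*call)(const long *args); /* stands in for a syscall number */
	long args[MAX_ARGS];
	struct atom *next;              /* next state, NULL ends the chain */
	long ret;                       /* result of this step */
};

/* Toy steps standing in for calls like accept()/send()/close(). */
static long step_ok(const long *args)   { return args[0]; }
static long step_fail(const long *args) { (void)args; return -1; }

/* Run the chain until it ends or a step fails; return the last atom run. */
static struct atom *run_chain(struct atom *a)
{
	struct atom *last = NULL;

	while (a) {
		a->ret = a->call(a->args);
		last = a;
		if (a->ret < 0) /* stop the state machine on error */
			break;
		a = a->next;
	}
	return last;
}
```

The sketch runs everything synchronously; the interesting part of the real designs is what happens when one of the steps blocks, which this toy deliberately leaves out.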

That looks really cool.
Ingo created an extremely good subsystem, which

  • works as a proper state machine, and does not spawn a micro-thread per syscall
  • allows to batch syscalls and form a state machine
  • is asynchronous in the sense that when a syscall blocks, the syslet is scheduled into dedicated threads (not exactly - control is returned to userspace from a different thread, so there are problems with per-thread variables like TID). If the syslet does not block, it will not be rescheduled.

Although there are limitations.
The first one - the formed state machine is still formed over syscalls, so if a syscall blocks, the same thread can not execute the next syslet.
In kevent it is solved by breaking blocking rules in some paths (namely block IO waiting), so it is possible to execute several IO-bound tasks on behalf of the same thread, thus increasing performance (think about thousands of pending blocked IO requests and tens of threads at most).
And as mentioned there are per-thread problems like TID and (probably) TLS related kernel data (like the non-exec stack property).
Ingo also mentioned a couple of other open issues besides TID, but I do not think it is too hard to resolve them.

So, it is really good work, but both syslet and kevent can be extended to use syslet's state machine (if it will allow to have per-thread queue of requests).

/devel/other :: Link / Comments (0)


All magic behind segments access has been uncovered.


So, I've written a simple module which dumps the Global Descriptor Table for each CPU in the system and started my test application, which essentially does "movl %%gs:0, %0", so I just present parts of the output so that things become clear and obvious:

# gdt_reader /dev/gdt
...
 0/ 2:  6: TLS segment #1 [ glibc's TLS segment ] : start: b7ff06c0, size: 4294963200, 
 seg_type: 0x3, dpl: 0x8, AVL: 1, SEG_PRESENT: 1, DESC_TYPE: code or data, OP_SIZE: 32 bits.
...
 0/ 2: 14: default user CS : start: 00000000, size: 4294963200, 
 	seg_type: 0xb, dpl: 0x8, AVL: 0, SEG_PRESENT: 1, DESC_TYPE: code or data, OP_SIZE: 32 bits.
 0/ 2: 15: default user DS : start: 00000000, size: 4294963200, 
 	seg_type: 0x3, dpl: 0x8, AVL: 0, SEG_PRESENT: 1, DESC_TYPE: code or data, OP_SIZE: 32 bits.
...

ds: reg: 15, table: GDT, rpl: 0x3, val: should fail.
gs: reg: 6, table: GDT, rpl: 0x3, val: b7f736c0.
As we can see, the %DS register is set up for the whole address space (its size is 4gb minus one page), as well as %CS, so it is impossible to dereference them.
%GS points to some memory (accessible indirectly as "%gs:0"), where thread local storage is set up by the kernel and libc; its size is the same 4gb minus one page.
That page is likely created to cover the GDT itself.
The segmentation fault behind "movl %%ds:0, %0" becomes obvious too - it tries to dereference the zero address, which ends up with SIGSEGV.
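
For reference, here is a minimal sketch of how the fields shown in the dump are packed into an 8-byte x86 segment descriptor. It is not the code of the module above: struct segment and decode_descriptor() are hypothetical names, and the sample values in the usage note are typical flat 32-bit descriptors, not ones read from the dump.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of one 8-byte GDT/LDT segment descriptor. */
struct segment {
	uint32_t base;          /* linear start address */
	uint32_t limit;         /* 20-bit limit, in bytes or in pages */
	unsigned type;          /* 4-bit segment type */
	unsigned dpl;           /* descriptor privilege level, 0..3 */
	unsigned present;       /* P bit */
	unsigned page_granular; /* G bit: limit is counted in 4 KiB pages */
};

static struct segment decode_descriptor(uint64_t d)
{
	struct segment s;

	/* base is split: bits 16-39 hold the low 24 bits, bits 56-63 the top 8 */
	s.base = (uint32_t)((d >> 16) & 0xffffff) |
		 ((uint32_t)((d >> 56) & 0xff) << 24);
	/* limit is split: bits 0-15 hold the low 16 bits, bits 48-51 the top 4 */
	s.limit = (uint32_t)(d & 0xffff) |
		  ((uint32_t)((d >> 48) & 0xf) << 16);
	s.type = (unsigned)((d >> 40) & 0xf);
	s.dpl = (unsigned)((d >> 45) & 0x3);
	s.present = (unsigned)((d >> 47) & 0x1);
	s.page_granular = (unsigned)((d >> 55) & 0x1);
	return s;
}
```

Decoding 0x00cff3000000ffff gives base 0 and limit 0xfffff with the granularity bit set, and 0xfffff << 12 is exactly the 4294963200 printed above. That the dump computes its size this way is an assumption; with the granularity bit set the CPU treats the low 12 bits of the limit as ones, so such a limit would actually cover the full 4gb.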

So, instead of searching for documentation on the "movl %%gs:0, %0" command (and mainly to determine what that magical ':0' is about) I created a kernel module, dumped and analyzed the GDT and finally worked out how gcc inline asm treats that.

Btw, notice that the %GS segment start is the same as the result of the "movl %%gs:0, %0" command. It is impossible to actually get the segment start this way; it just happens that glibc puts a pointer to the TCB (thread control block) as the first pointer in struct pthread (which is stored in %gs). The TCB structure itself has a pointer to itself as its first member.
This is only true on x86; other arches happily have a special real register for that purpose instead of ugly indirect dereferences.

/devel/threading :: Link / Comments (0)


Kevent is not closed.


I've got feedback from Ulrich Drepper (I never replied since kevent always failed to compile, and not because it is that good) and Andrew, who asked to resend kevent again after rc1 is out.
Ok, let's wait for a week.

Meanwhile I committed documentation cleanups from Frederik Deweerdt.

Thanks everyone for support.

/devel/kevent :: Link / Comments (0)


Mon, 12 Feb 2007

Climbing evening.


I managed to get a callus or some other cruft on top of the left foot, which aches noticeably each time I start putting on my climbing shoes or move that leg, so the training was quite challenging.
I completed several old traverses and managed to complete quite heavy boulderings even with the aching leg. It was quite a short training, but quite interesting.

I found that I do not have a mirror at home, so I'm considering growing a beard (i.e. I'm not actually considering that, but while I'm searching for a place to shave with a mirror, or for the mirror itself, the beard grows by itself).

Due to heavy frost last night (not in the apartment, but it was quite cold there as a consequence) it looks like the self-leveling mix did not stick to the ground, so I needed to remove about a square meter of it. Since I do not have that mix anymore, I will fill that with plaster.
So, imagine, yellow (self-leveling mix has something like that colour when it becomes ready) floor with gray circle in the middle, white walls, yellow hinged ceiling; couple of bottles of rum and liquor, bread, small carpet with clothes and hammock - that is how I live.

/life :: Link / Comments (0)


Fun with kernel.


Instead of reading some docs about gcc inline asm syntax (and mainly to understand what "movl %%gs:0, %0" means and why it faults for %%ds) I am writing a module which will allow to dump the GDT on demand - this will allow me to check my hypothesis about the segment address dereferencing idea described previously.

Well, actually I have read a lot of inline asm docs over the last several days, namely the following (for the interested reader - these are mostly small howtos and simple examples):

and am still in doubt.

/devel/threading :: Link / Comments (0)


Fuck soberness.




Full original.
Very interesting livejournal (in russian) of an artist.

/other :: Link / Comments (0)


New kevent 'take36' release. More interesting descriptions.


Short changelog:

  • Fixed typo in Makefile about kevent based replacement for epoll (not included into patchset) which led to compilation failure.
  • Changed AIO description text.
Here are the changes:
 -[take35 10/10] kevent: Kevent based AIO (aio_sendfile()/aio_sendfile_path()).
 +[take36 10/10] kevent: Kevent based generic AIO.

 -Kevent based AIO (aio_sendfile()/aio_sendfile_path()).
 +Kevent based generic AIO.
 +
 +This patch only implements network AIO, which is _COMPLETELY_
 +impossible and broken in _ANY_ micro-thread design. For details
 +and test consider following link:
 +http://tservice.net.ru/~s0mbre/blog/2007/02/10#2007_02_10
 +
 +Designing AIO without network in mind can only be result of heavy hang-over.
 +
 +Kevent AIO is implemented as state machine.
 +There is a patch which implements async open/send_header/sendfile/close.
Changes in main description:
 -[take35 0/10] kevent: Generic event handling mechanism.
 +[take36 0/10] kevent: Generic event handling mechanism [new description text for generic AIO].

 -Generic event handling mechanism.
 +Generic event handling mechanism [new description text for generic AIO].
 +
 + [ Consider reading at least introduction texts for patches ]

  Kevent is a generic subsytem which allows to handle event notifications.
  It supports both level and edge triggered events. It is similar to
  poll/epoll in some cases, but it is more scalable, it is faster and
 +
 + It can serve as a storage for different AIO models as well - in case
 + syscall or other request is ready immediately kevent returns that event
 + in submission point.

 + Number of comments dropped to zero several releases ago -
 + it is a sign that API, design and implementation are perfect.
 +
 Consider for inclusion.
I also asked Andrew Morton (again) about inclusion/declining plans.
If that goes into the void again, I think it is time to close the project.

/devel/kevent :: Link / Comments (0)


Driving in Moscow.


Intuitively obvious left turn (from red to blue pointers).

/other :: Link / Comments (0)


The magic behind %gs access.


Ok, I've caught the bodhi breath and understood what that meant.

Each selector value is actually a descriptor index shifted left by 3 bits; the least significant bits are the LDT/GDT selector bit and the privilege bits, so %gs equal to 0x33 is actually descriptor number 6, which is the first TLS descriptor in the GDT.
As we know, each entry in the GDT describes some region of memory, so the construction "%%gs:123" is offset 123 inside the area described by that descriptor, as I understand it, although I still need to think why similar access to %%ds ends up with a segmentation fault (I could understand the absence of read permissions for %%cs, but why the data segment is not readable I still can not understand).
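
The arithmetic above can be checked with a few lines of C. This is only a sketch of how a 16-bit selector value splits into its fields; struct selector and decode_selector() are names made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Fields packed into a 16-bit x86 segment selector. */
struct selector {
	unsigned index; /* descriptor number inside the table */
	unsigned ldt;   /* table indicator: 0 - GDT, 1 - LDT */
	unsigned rpl;   /* requested privilege level, 0..3 */
};

static struct selector decode_selector(uint16_t sel)
{
	struct selector s;

	s.index = sel >> 3;       /* upper 13 bits */
	s.ldt   = (sel >> 2) & 1; /* bit 2 */
	s.rpl   = sel & 3;        /* bits 0-1 */
	return s;
}
```

With this, decode_selector(0x33) gives index 6 in the GDT with RPL 3, and the value 115 seen in %%cs gives index 14.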

That automagically means that %%gs (which is a privileged descriptor, as are all other segment registers) can not be used to store information about userspace threads in the M:N model, since it always points to the TLS area of the main process. So to get the currently running userspace thread in the M:N model, the stack pointer aligned to the stack size can be used, but that requires that all stacks have the same size (which is true right now anyway).
Having the ability to access the current thread not through an ugly pointer allows to naturally scale on SMP; that is the only reason to invent such a mechanism.
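
A minimal sketch of the stack-alignment trick, assuming every userspace thread stack has the same power-of-two size, is aligned to that size, and keeps its thread descriptor at the lowest address. STACK_SIZE, struct uthread and both functions are hypothetical; they come neither from glibc nor from any M:N library.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: all userspace stacks have this size and are aligned to it. */
#define STACK_SIZE (64UL * 1024)

/* Hypothetical per-thread descriptor kept at the lowest stack address. */
struct uthread {
	int id;
};

/* Round any address inside a stack down to the start of that stack. */
static struct uthread *uthread_from_addr(const void *addr)
{
	assert((STACK_SIZE & (STACK_SIZE - 1)) == 0); /* power of two only */
	return (struct uthread *)((uintptr_t)addr &
				  ~((uintptr_t)STACK_SIZE - 1));
}

/* The current thread is found from the address of any local variable. */
static struct uthread *uthread_self(void)
{
	char marker;

	return uthread_from_addr(&marker);
}
```

No segment register and no global variable is involved, which is why the lookup needs no locking, but it only works while all stacks really do share one size and alignment.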

/devel/threading :: Link / Comments (0)


Sun, 11 Feb 2007

Some thoughts about M:N threading and NPTL.


While I was busy kicking various AIO designs I completely missed time to do real interesting low-level threading stuff. So, some thoughts.

Looking into the glibc NPTL sources I noticed that x86 uses the %gs register as the so-called thread-pointer register. To get the current thread pointer glibc uses the following code:

# define THREAD_SELF \
  ({ struct pthread *__self;						      \
       asm ("movl %%gs:%c1,%0" : "=r" (__self)				      \
   	  : "i" (offsetof (struct pthread, header.self)));		      \
       __self;})
%gs itself points to a descriptor in the GDT, but it is always 51, while the GDT only holds 32 descriptors; %%cs for example contains 115 - that is one question (a last-minute thought though - GDT entries are 8 bytes each, so it is possible that the value stored in the register should be divided by 8 to get the actual entry... Will check it out tomorrow.).
Another one - what does "%%gs:%c1" mean? It will be translated to "%%gs:immediate_parameter_1", i.e. "%%gs:some_num"; for example "%%gs:0" works, but it does not work with a usual register like %%eax, only with %%gs. Similar access to %%cs, %%ds and other segment registers ends up with a segmentation fault.

It looks like walking in the dark room with exact knowledge that there is at least one rake...
But I will find an exit soon.

/devel/threading :: Link / Comments (0)


Linus on AIO. Limitations of the proposal. Practical example of weakness.


Linus Torvalds wrote in reply to the tcp_sendmsg() example:

> Will you create a thread every time tcp_sendmsg() hits the send queue limits?

No. You use epoll() for those.
I.e. we design Asynchronous IO, which is already limited to not be used with network?
I.e. AIO can not be used in anything connected to the network, since even if disc read/write will be asynchronous, sending will block and thus we just lose all possible advantages.

Continue:
There's a reason why a lot of UNIX system calls are blocking: they just don't make sense as event models, because there is no sensible half-way point that you can keep track of (filename lookup is the most common example).
Linus - blocking IS waiting for an event, which will remove that block.
Linux even uses wait_event_*() calls - don't you think that name makes some sense?
Filename lookup is just an inode read from disk - when it is done, the filename is ready, and that is an event.
And actually no one uses async filename lookup - people use the open() syscall, which is perfectly eventable - the block-removal event is the readiness of the opened file descriptor - it is even used in kevent AIO (surprise?) as a part of the async sendfile transfer state machine (but I must admit that opening always happens in async mode as part of the state machine, so it will lose some ticks if things are perfectly in the cache, but practice shows that async sendfile is faster).

In another mail Linus continues to burn things out:
You use the AIO stuff for things that you *expect* to be almost instantaneous. Even if you actually start ten thousand IO's in one go, and they all do IO, you would hopefully expect that the first ones start completing before you've even submitted them all. If that's not true, then you'd just be better off using epoll.
I.e. we should not use AIO for the case when a request really blocks, only when it is synchronous and may sometimes block.
Linus, direct IO (used by databases) blocks all the time, sync IO blocks all the time, network blocks, pipes block, readahead blocks - only the simplest case of reading from VFS cache does not block.

And eventually Linus proposes waiting for AIO events:
for (;;) {
	async(epoll);	/* wait for networking events */
	async_wait();	/* wait for epoll _or_ any of the outstanding file IO events */
	handle_completed_events();
}
Linus - you have just introduced waiting for AIO events - i.e. a new type of event, which is supposed to wrap async completions. And since every async syscall is such a new event, we can wait for them in a userspace loop.
You do not know this, but kevent is supposed to wait on every possible type of event - you do not need to wrap sync-event-waiting calls (like epoll()) into an async helper and then wait for that - just register it with kevent at the point where your patchset currently forks.

And to draw the line: AIO by micro-threads is not even supposed to work in environments where it will block all the time (like network or direct IO); instead, in blocking environments events should be used, since they are much more scalable.

Micro-thread AIO sucks even in reading from a file - a practical example: if a file happens to be on a bad block, reading will block for too long (seconds!), and the system can be just killed by rescheduling when there are a lot of threads waiting for read completion on those blocks.

And to finally kill such design, here is another test I created.
Consider a directory with a high number of inner dirs and files (hundreds), whose total size is 3 times smaller than the amount of RAM (1gb vs. 300 Mb), and several applications which run and randomly copy data from one file to another.
I've put several printks in __lock_page() (i.e. where the requesting application blocks and thus a new thread would be created) and watched a nice picture where up to a hundred blocks happened per second (and that is just for the case when the size of the test dir is 3 times smaller than RAM; what will happen when the size of the dir is more than the amount of RAM I do not even want to imagine):
printk: 84 messages suppressed.
__lock_page: aio_new_thread: 6650.
printk: 118 messages suppressed.
__lock_page: aio_new_thread: 6769.

Conclusion: 'f toppku', i.e. into the furnace.

/devel/kevent/aio :: Link / Comments (0)


Meanwhile on the apartment development side.


I've completed the dirtiest part - I've covered my floor with self-leveling mix, which is hard as a rock and does not create dust. The only remaining really dirty thing is leveling parts of the walls and the hinged ceiling, and applying a second plaster layer to the parts of the ceiling and walls which require that (not that much actually).
After I complete that I will paint the ceiling, which requires

  • going to the construction shop
  • selecting the colour
  • painting the ceiling

I am becoming more and more lazy - previously I worked on the apartment development several hours several days each week (and at least the whole Sunday), but now I only work a couple of hours, and not even every Sunday... I need to create some kind of motivation.

/devel/flat :: Link / Comments (0)


Sat, 10 Feb 2007

Watching '24' series...


It is for real maniacs - maniacs of series!
A little introduction: this series tells us about one day (24 hours) in the life of the USA on the edge of some catastrophe, and CTU agent Jack Bauer (Kiefer Sutherland) saves the world.

I was told about the first season (the first 24 hours - 24 episodes), and the trash all over it was exactly what I like - so I started to watch the second season. I managed to complete it (although I watched only half of the episodes) and was quite exhausted, but the last episode asks for the next season, so I asked Bass to tell me in short about the third season - he replied something like 'trash all over the place - killing, dying, viruses, guns, fights, death, sex, rock-n-roll, main hero is a hero'. Yes, I wanted exactly that description, but again, the third season asks for the next one, and the fourth one was not seen even by Bass, so he jumped into the middle of the fifth, where everyone seems to die except Jack Bauer.
Lazyweb, please, if you know the end of the fifth season, do not hesitate and drop me a mail about it - I do not want to watch the whole season - it is like a drug.
Bass seems to have found the solution - he will record for me the first and the last episodes of each season.
Good.
Looking at the official site, it looks like there is a season number 6.
Bad.

P.S. And the last question - will that bad guy on yacht, who at the end of the second season 'started' third one by saying something like 'you will see' to another bad guy, be killed by Jack and who is he at all? (do not even say he is an unknown illegitimate brother of Jack Bauer or mr. President who revenges them)
Thanks.

/life :: Link / Comments (0)


Test which shows how broken is thread-like AIO design.


Linus has proposed yet another way to do async syscalls.
It is a bit similar to fibrils, but different in that regard, that Linus' patch just creates a new real thread if async call blocks. So, when syscall blocks, system returns to user as a different thread.

There is a huge problem with that - per syscall thread creation/destruction. Linus, why do you think people do not create new thread each time new client has connected to web server?

Artificial example with sys_stat64() does not count - try to have thousands of such threads.

That approach sucks even more than fibrils, imho, although Zach's one has the problem that a fibril is always created, no matter whether the call blocks or not.

Rescheduling is a problem.

To prove that this is a huge problem I've set up a simple test - I changed all socket allocations from process context (actually only in the TCP sending functions) to GFP_ATOMIC, so each time they fail, which is where the process would be put to sleep under the previous GFP_KERNEL allocation policy, a new thread would be created.
So, I got following results:

tcp_sendmsg: sock: ffff810038e57900, wait: 562.
tcp_sendmsg: sock: ffff810038e57340, wait: 563.
tcp_sendmsg: sock: ffff810038e56d80, wait: 564.
tcp_sendmsg: sock: ffff810038e567c0, wait: 565.
tcp_sendmsg: sock: ffff810038e56200, wait: 566.
printk: 20458 messages suppressed.
tcp_sendmsg: sock: ffff81003363d300, wait: 21025.
and the like...

That was a simple couple of seconds test run of ab benchmark against 2.6.20 kernel with lighttpd web server - about 4k connections per second, 80k connections total, trivial index page (got from debian installer) on athlon64 with 1gb of ram connected over 1gbit link.

And during that simple test the system would have created 21k threads?
No way, it is just broken design. It is wrong.

So, read my lips - ev-e-ry-thing con-nec-ted to the net-work sle-eps.

Linus, if you read this (although I doubt it), please, do not make a terrible mistake.
Do not include kevent if you do not want to, but please think about the above test before it is too late.

/devel/kevent/aio :: Link / Comments (0)


Fri, 09 Feb 2007

:))


I've just found that my blog was posted to LWN (likely in 'Kernel fibrillation' topic, which I'm not subscribed to, so I can not say for sure) and to linux-kernel.

People talk about frustration with the kernel development process.

Well, I'm not frustrated myself - I hack not for inclusion, but for the process itself, so it does not matter that much to me, although inclusion is a sign that the work has been done well.

As for the whole kernel development process - yes, I think it has changed significantly over the last several years.
The main issue, IMHO, is that there are no more hackers in the kernel community, but instead there are too many enterprise employees. I.e. I mean the number of the former is much smaller than that of the latter - so the mindset changes.
For example Linus - he has lost his hacker's nature, I think; he prefers to have a consensus between different areas, while only a couple of years ago he would just throw everything away just to create a really good system.
Today I see quite a lot of NIH syndrome too - but well, likely it is due to people's nature - I have it too indeed.
That was a good time, although I like things like they are.

So, personally I find myself and even the situation around kevent quite normal, although it would be better if the whole process were more friendly - the same kevent patchset is not read by anybody, I bet, although it does solve a lot of problems which people hit every day.

So, do not be frustrated, be good, and be cool.

/devel :: Link / Comments (0)


Climbing evening.


I had a one week break in trainings and that influenced me a lot - I tried nothing that complex, although eventually I ran several old traverses over the relief (the simple variant with feet on all holds) without rest in-between, which, in addition to the sauna, completely killed my body. I ran only old things, although I managed to modify some old boulderings and created new moves.
I am tired as hell, and the whole body is in pain, but as you probably noticed - if you have awakened and nothing is aching, then you are likely dead.
The usual after-climbing shower and sauna, then the move home and the hammock.

/life :: Link / Comments (0)


Thu, 08 Feb 2007

Additional filesystem feature.


It is quite good for an administrator to have the possibility to put some files together (i.e. close to each other on disk), which will greatly speed up setups where all of them are needed in a short period of time - like boot up or process startup. It can not be solved by delayed allocation, since the kernel does not know about relations between files, which in turn results in a need for some kind of on-line disk layout changer (in ext4 it is known as a defragmenter), which I was against in my FS design notes.

/devel/fs :: Link / Comments (0)


Glibc function pointer encryption.


It was actually committed years ago - at the end of 2005, but Ulrich Drepper only announced it quite recently. Here are bits of the presentation:

One of the remaining attack vectors in the runtime are function pointers in writable memory. Overwrite the value and you can redirect execution. Of course the pointer must actually be used and randomization must be overcome, but it's theoretically possible.

The remedy I've implemented in libc internally is to encrypt function pointers. I.e., they are not stored as-is but instead in a mangled form. This mangling consists in my code of XOR-ing the pointer value with a random 32/64-bit value. Each process has its own random value. The code was publicly committed back in December 2005 and is in FC6.

What is protected? I hope meanwhile most function pointers in libc. Some are probably still missing and others cannot be handled this way since they are visible to the outside. For some broken programs (including UML) the setjmp change was the biggest. These programs tried to access the stored code address which now is not really useful anymore (program don't know how to decrypt the value). Other pointers which are encrypted are the iconv and atexit structures as well as some function pointer tables people don't really know about, they are completely internal.
But let's see, what exactly is implemented? Searching for the PTR_MANGLE/PTR_DEMANGLE macros shows that only some registers in the setjmp() code and the atexit() and iconv related pointers are covered; all of them are stored in private areas already, and a hacker will crack his head just to find, for example, the list of exit functions to change. It is much easier to overwrite the GOT, for example, or implement another 'return-to-glibc' technique, instead of searching for private function pointers in glibc...
And even that encryption is not that complex - the pointer is XORed with a value stored in the TCB at a fixed location from the start, so if an attacker can access the %gs register, he knows that secret value.
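
A minimal sketch of the XOR scheme as it is described in the quote above. It is not glibc's PTR_MANGLE/PTR_DEMANGLE code: the function names are hypothetical and pointer_guard is a fixed stand-in for the per-process random value.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the per-process secret; the real one is chosen randomly. */
static uintptr_t pointer_guard = (uintptr_t)0x5bd1e995u;

/* Store the pointer XORed with the secret instead of the plain value. */
static uintptr_t ptr_mangle(uintptr_t p)
{
	return p ^ pointer_guard;
}

/* XOR is its own inverse, so demangling is the same operation. */
static uintptr_t ptr_demangle(uintptr_t p)
{
	return p ^ pointer_guard;
}

/* Anyone holding one plain/mangled pair gets the secret back for free. */
static uintptr_t recover_guard(uintptr_t plain, uintptr_t mangled)
{
	return plain ^ mangled;
}
```

The last function shows why the protection is thin: the secret does not even have to be read from the TCB if a single known pointer and its mangled form can be observed.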

It is of course a good step, but it does not provide any real security, which is advertised by RedHat.

Likely it was the only reason I started to upgrade FC5 to FC6 - I wanted to check how pointers are encrypted and tried to determine if it is possible to hack that.

/devel/other :: Link / Comments (0)


Hackish way to upgrade FC5 to FC6 using yum on x86_64.


It does not work as described on the web due to problems in ELF libs dependencies. And you can not get around it with the --nodeps flag, since rpm stops working with the new utils. So, what you need is the following steps:

  • backup content of the following FC5 rpms:
    • elfutils-0.119-1.2.1.x86_64.rpm
    • elfutils-libelf-0.119-1.2.1.x86_64.rpm
    • elfutils-libs-0.119-1.2.1.x86_64.rpm
    By content I mean files and dirs which will be put into /usr dir when installed (use mc or cpio).
  • install following libs from FC6 distro with --nodeps flag:
    • elfutils-0.123-1.fc6.x86_64.rpm
    • elfutils-libelf-0.123-1.fc6.x86_64.rpm
    • elfutils-libs-0.123-1.fc6.x86_64.rpm
  • run yum update (implied you have installed fedora-release-6* rpms)
  • copy content of the elf libs from FC6 and remove old elf libs from FC5
RedHat still lacks good support for x86_64 and again requires some hacks to upgrade using yum.

/devel/other :: Link / Comments (0)


Wed, 07 Feb 2007

New CARP - Common Address Redundancy Protocol - release.


There are no design changes - just maintenance work - CARP has been ported to 2.6.20 kernel. It also includes a trivial one-line fix for the case when you cross-compile it.

/devel/networking :: Link / Comments (0)


New kevent slogan - 'kevent can do everything you ever thought of. Completely.'


What this thread shows me is the fact that I created a quite good kevent subsystem, but unfortunately only a few people know about it. Let's see:

  • most AIO will not block, so we do not need special setup, and those who block need to wait, by Linus Torvalds - kevent does allow non-blocking events, and it does not differentiate between them: if an event is ready immediately, it will be returned as ready at submission time; if it is not ready, one will wait for it in the queue
  • 90% of database AIO will block, by Joel Becker from Oracle - kevent works perfectly ok with such loads. It was designed for them.
  • file binding is too expensive... we want to wait on different kinds of events, by Linus Torvalds and Oracle folks - kevent was designed for that scenario. It works for any kind of event, and its setup cost is way cheaper than the whole file structure.
  • AIO is better implemented as a state machine, by Ingo Molnar. I've done it already in kevent AIO.
  • AIO must be implemented as scheduled-away micro-threads, by Zach Brown from Oracle - kevent allows to wait on those fibrils which block and are scheduled away. Kevent can also be used as a storage for the information of an async syscall - if the syscall does not block, kevent returns at the submission point, otherwise it is possible to wait until it is ready either through the kevent queue or the ring buffer.
  • AIO syscalls should be in the form of a struct with the syscall number and an array of its args - kevent can be a storage for that - see above: when the syscall being executed does not block, kevent will return immediately (if several kevents are submitted in one batch, the number of ready kevents will be returned, and ready kevents will be copied into the submission array from the beginning); if the syscall blocks, kevent allows to wait until it is ready through its queue or ring buffer.
  • what design kevent is going to crack today?
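The batch behaviour mentioned in the list (ready kevents are copied to the beginning of the submission array and their number is returned) can be sketched as a toy userspace model. This is only an illustration: struct toy_event, toy_submit() and the would_block field are hypothetical names and do not come from the kevent patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request, NOT the real kevent structure. */
struct toy_event {
	int id;			/* caller's identifier */
	int would_block;	/* stand-in for "the underlying subsystem is not ready" */
	int ready;		/* set when the request completed at submission time */
};

/*
 * Toy model of batch submission: requests which do not block are completed
 * immediately and moved to the front of the array, the rest stay "queued".
 * Returns the number of ready requests, which occupy ev[0..ret-1].
 */
static size_t toy_submit(struct toy_event *ev, size_t num)
{
	size_t nready = 0;

	assert(ev != NULL || num == 0);

	for (size_t i = 0; i < num; i++) {
		struct toy_event tmp;

		if (ev[i].would_block)
			continue;	/* stays queued, caller waits for it later */

		ev[i].ready = 1;

		/* Swap the ready entry into the next free slot at the front. */
		tmp = ev[nready];
		ev[nready] = ev[i];
		ev[i] = tmp;
		nready++;
	}

	return nready;
}
```

A caller would process ev[0..ret-1] right away and then wait for the remaining entries through the queue or the ring buffer.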

/devel/kevent :: Link / Comments (0)


Continue on AIO thread and kevents.


Linus Torvalds wrote:

Don't be silly. AIO isn't an event. AIO is an *action*.

The event part is hopefully something that doesn't even *happen*.

Why do people ignore this? Look at a web server: I can pretty much guarantee that 99% of all filesystem accesses are cached, and doing them as "events" would be a total and utter waste of time.

You want to do them synchronously, as fast as possible, and you do NOT want to see them as any kind of asynchronous events.
It is up to the AIO interface - it can request a completion event after it has checked that the system blocks; if it can not check that in advance, it can still use kevents perfectly fine, since kevents are not blocked and are returned immediately (if the special flag KEVENT_REQ_ALWAYS_QUEUE is not set) when the underlying subsystem does not block and can provide data immediately. Setup for a kevent is a cache allocation and queueing into the binary tree, so one can calculate how fast it is - it is just bloody faster than process creation or rescheduling.
Actually that is silly - if thread creation and kevent creation had anywhere near similar performance costs, no one would ever use polling - people would use a real thread per event. Fibril allocation (and it always happens) is slower than kevent allocation, and in case of blocking the price becomes just too expensive.

Yeah, in 1% of all cases it will block, and you'll want to wait for them. Maybe the kevent queue works then, but if it needs any more setup than the nonblocking case, that's a big no.
Linus, you never ever read what I posted in all the kevent related threads - shame on you, since it was you who appreciated a similar design (well, it was created by you, but it was much simpler and feature free) several years ago - but emotions must go away - kevent does not differentiate between blocking and non-blocking mode until special flags are set - in the usual case a kevent is allocated, its fields are filled and it is queued into the tree.

Eric Dumazet wrote:
It seems to me that kevent was designed to handle many events sources on a single endpoint, like epoll (but with different internals). Typical load of thousand of sockets/pipes providers glued into one queue.

In the fibril case, I guess a thread wont have many fibrils lying around...

Also, kevent needs a fd lookup/fput to retrieve some queued events, and that may be a performance hit for the AIO case, (fget/fput in a multi-threaded program cost some atomic ops)
Let me clarify what event sources are and how events are delivered to userspace - it is possible to have tons of event sources (timers, sockets, pipes, fibrils, anything) in the same kevent tree, but they are posted back to userspace one-by-one through the ring buffer or through a syscall to preserve order (one can have multiple threads reading the same ring buffer though). If there is not an enormous number of fibrils per process, then it is fine - their number does not play any significant role.

To get events through the kevent queue an fd lookup is needed - but it is much-much-much less costly than rescheduling and thus fibril completion. If there are multiple threads doing completion processing, they should use the ring buffer - it was created exactly for that purpose, so the kevent queue will not have a lot of lookups per event at all and will not have simultaneous access to atomic operations either - think about the compare and exchange operation in modern CPUs, which Ulrich wanted to use in threads reading the kevent ring buffer.
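To show what "compare and exchange in threads reading the ring buffer" means, here is a minimal sketch using C11 atomics. It is not the kevent ring buffer layout: struct toy_ring, TOY_RING_SIZE and the toy_ring_*() helpers are hypothetical, and the model assumes a single producer and any number of readers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_RING_SIZE 8u	/* hypothetical size */

struct toy_ring {
	atomic_uint head;	/* next entry to consume, advanced by readers */
	atomic_uint tail;	/* next entry to fill, advanced by the producer */
	atomic_int events[TOY_RING_SIZE];
};

static void toy_ring_init(struct toy_ring *r)
{
	atomic_init(&r->head, 0u);
	atomic_init(&r->tail, 0u);
	for (unsigned int i = 0; i < TOY_RING_SIZE; i++)
		atomic_init(&r->events[i], 0);
}

/* Single producer: publish one event, returns false when the ring is full. */
static bool toy_ring_put(struct toy_ring *r, int event)
{
	unsigned int tail = atomic_load(&r->tail);
	unsigned int head = atomic_load(&r->head);

	if (tail - head == TOY_RING_SIZE)
		return false;

	atomic_store(&r->events[tail % TOY_RING_SIZE], event);
	atomic_store(&r->tail, tail + 1);
	return true;
}

/* Any number of readers: claim one event, returns false when the ring is empty. */
static bool toy_ring_claim(struct toy_ring *r, int *event)
{
	unsigned int head = atomic_load(&r->head);

	assert(event != NULL);

	for (;;) {
		int val;

		if (head == atomic_load(&r->tail))
			return false;

		val = atomic_load(&r->events[head % TOY_RING_SIZE]);

		/*
		 * Only one reader wins the exchange for a given head value.
		 * On failure 'head' is reloaded with the current value and
		 * the loop retries with the next entry.
		 */
		if (atomic_compare_exchange_weak(&r->head, &head, head + 1)) {
			*event = val;
			return true;
		}
	}
}
```

The point of the sketch is that readers never take a lock and never go through an fd lookup: each event costs a couple of atomic loads and one compare and exchange.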

Kent Overstreet wrote:
An app can have a bunch of cheap, fast user space threads servicing whatever; as they run, they can push their system calls onto a global stack. When no more can run, it does a giant asys_submit (something similar to io_submit), then the io_getevents equivilant, running the user threads that had their syscalls complete.
That is exactly how kevents work - no need to reinvent the wheel. And that is, btw, how AIO sendfile works (except for the underlying state machine) - it gathers a file descriptor or path, socket and length/offset parameters, combines them into a private structure and creates a kevent, which will either be completed immediately if sending does not block, or will return an error that the state machine handles (in case the file descriptor or socket is in non-blocking mode), or the handling thread blocks and is scheduled away, so a new one can pick up the next request.
The whole sub-thread started after this message is speculating about how we could invent new things, which happened to be implemented a year ago and are called kevent. Hey, it is already done.

Davide Libenzi wrote:
Note that it's not a trivial tasks to extract a long enough level of parallelism, that would make you feel pain in having to walk through the submission array. Think about the trivial web server case. Remote HTTP client asks one page, and you may think to batch a few ops together (like a stat, open, send headers, and sendfile for example), but those cannot be vectored since they have to complete in order. The stat would even trigger different response to the HTTP client. You need the open() fd to submit the send-headers and sendfile.
Really? And how do you think it can be solved? By issuing tons of async IO? No, it is way faster to solve it as a state machine, as suggested by Ingo Molnar and me - kevent AIO works that way, AIO sendfile was implemented that way, and it was proven that the state machine (file path and header are passed into the call, it internally opens the file, sends its content (populating pages into the VFS cache if needed), and closes the file) works faster even for pages in the VFS cache (which are not supposed to block). Who is going to eat his own hat?

Possible AIO design:
struct async_submit {
	void *cookie;
	int sysc_nbr;
	int nargs;
	long args[ASYNC_MAX_ARGS];
};
struct async_result {
	void *cookie;
	long result;
};
Silly - if you are going to implement it that way, kevent can already do it for you. It will allow you to wait (btw, Linus, databases which perform direct IO will block on 90% of all their requests, so they do wait, so they do need kevent, but that is another story).
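For illustration, the submit/result pair from the quoted design can be exercised in userspace with a toy dispatcher. Everything except the two structures is an assumption: the value of ASYNC_MAX_ARGS, the toy_table of fake "syscalls" and toy_async_run() are hypothetical, and the negative errno return values follow the neg-errno convention discussed in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define ASYNC_MAX_ARGS 6	/* assumed value, the quoted mail does not define it */

struct async_submit {
	void *cookie;
	int sysc_nbr;
	int nargs;
	long args[ASYNC_MAX_ARGS];
};

struct async_result {
	void *cookie;
	long result;
};

/* Hypothetical "syscalls" for the toy table. */
static long toy_add(const long *a) { return a[0] + a[1]; }
static long toy_neg(const long *a) { return -a[0]; }

struct toy_call {
	int nargs;
	long (*fn)(const long *args);
};

static const struct toy_call toy_table[] = {
	{ 2, toy_add },		/* sysc_nbr 0 */
	{ 1, toy_neg },		/* sysc_nbr 1 */
};

/*
 * Execute one submission synchronously and fill the result: the cookie is
 * passed through untouched, errors are reported as negative errno values.
 */
static void toy_async_run(const struct async_submit *s, struct async_result *r)
{
	size_t nr = sizeof(toy_table) / sizeof(toy_table[0]);

	assert(s != NULL && r != NULL);

	r->cookie = s->cookie;
	if (s->sysc_nbr < 0 || (size_t)s->sysc_nbr >= nr)
		r->result = -ENOSYS;
	else if (s->nargs != toy_table[s->sysc_nbr].nargs)
		r->result = -EINVAL;
	else
		r->result = toy_table[s->sysc_nbr].fn(s->args);
}
```

In this model the cookie is the only thing that ties a result back to its submission, which is the same role the user pointer plays in an event structure.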

/devel/kevent :: Link / Comments (0)


My home is my castle.


The entrance door construction weighs more than 100 kg, and its steel plates total 10 mm in thickness.
Anyone can feel safe there.

/life :: Link / Comments (0)


Tue, 06 Feb 2007

A pipe.


A pipe

/other :: Link / Comments (0)


Linus Torvalds again.


He wrote:

Also, quite frankly, I tend to find Uli over-designs things.
Linus, do not look at the design, look at the fucking implementation, which has been there for a long time already, and was created long before Ulrich created his design, which in its first edition was a subset of kevent (the ring buffer and possible thread id were added later).
One just needs to patch and run - it is already done.
Totally and completely.

Another sentence:
We want less code. The whole (and really, the _only_) point of the fibrils, at least as far as I'm concerned, is to *not* have special code for aio_read/write/whatever.
I tend to agree here, but Linus, you are missing the main problem - if more code means faster processing, more code is better, and fibril rescheduling just can not compete with specially written AIO calls - like a copy written in C can not compete with an asm one - fibrils do not scale with SMP, rescheduling a system with 10k fibrils is impossible, task allocation (even of a smaller task) is slow - there are too many things where fibrils suck.

/devel/kevent :: Link / Comments (0)


Continue on fibril AIO thread. This time Linus Torvalds.


Linus Torvalds wrote:

But if you want to, we could have a *separate* "convert async cookie to fd" so that you can poll for it, or something. I doubt very many people want to do that. It would tend to simply be nicer to do
	async(poll);
	async(waitpid);
	async(.. wait for anything else ..)
followed by a
wait_for_async();
That's just a much NICER approach, I would argue. And it automatically and very naturally solves the "wait for different kinds of events" question, in a way that "poll()" never did (except by turning all events into file descriptors or signals).
Linus, wake up, it is done already. I have been posting a new patchset every week for the last several months (I started more than a year ago), and it implements that already.
Linus even says the magical words: wait for different kinds of events, which is written in every mail about kevent I have sent for the last year.

Kevent already allows to wait on different kinds of events - from file descriptors down to signals or timers. On any kind of events. Trivially.

/devel/kevent :: Link / Comments (0)


Thinking about future...


So, I have the following things in mind for the nearest future (inspired by secretly drinking cranberry liqueur aka nastoyka in the office):

  • move my trumpet from office to home and start practicing more regularly
  • create an anonymous (i.e. not related to the name 'Evgeniy Polyakov') livejournal account, where I will write (in Russian) my thoughts about things - there I will open my other (hidden) nature
  • complete apartment development (ugh...)
  • start making some photos, write essays, paint pictures, play music - i.e. start trying to express myself in non-technical areas (this requires quite a bit of magic(al liquid), which would transfer my mind into a different shape, but that is what I actually like)
  • move away from my current work and start my own - it only requires paying all my debts (about $30k-$40k), so until something happens, it will be postponed to 2009
  • throw away my phone and number, like I did a couple of years ago (I needed a new one because of the apartment construction)
  • world domination, although it is related to the technical area for now

/life :: Link / Comments (0)


It has happened!


I've returned home and (officially) got the keys to my apartment - this forced me to spend a lot of time in queues and pay about $300 (essentially that price is just for two w1 ibuttons with 64bit ids - I, the author of the w1 stack for the Linux kernel, who worked on a project which would allow sniffing w1 traffic and simulating any id using a GPIO based device, needed to pay for them - not to mention the legality of the requested payments. Actually the absence of a car stopped me from developing some tricky scheme to enter the building without an ibutton - it is quite cold, and it requires time, so a car is needed.), but nevertheless it is done, and tomorrow I will replace my entrance door with something more serious, which will allow me to say "My house (apartment) is my castle".
Excellent!

/life :: Link / Comments (0)


Notes about RCU and hash tables.


As you might have noticed, there is some work on moving socket hash tables to RCU. As David Miller correctly wrote, RCU'fication of hash tables is quite a challenging process, since there is always a window when the same entry in the hash table can be used by different sockets.
He also points to some ongoing work in that area (done by some unknown (likely known but currently secret) person), but I still wonder why hash tables with all their problems are used, when we can use a tree (trie)?
There is an entry in the TODO list to replace socket hash tables with netchannel's multidimensional trie - search/insert/delete time in the multidimensional trie is constant (modulo the number of allocations/freeings, which can be trivially reduced to one allocation/freeing) and equal to O(m), where m is the number of dimensions (the symbol O here hides 32 or 128 checks, depending on the length of the checked field, per socket lookup). That's it - it does not depend on the number of sockets/netchannels in the trie unless they are wildcards - in that case things become slower, but socket code does not have an API to implement wildcard sockets, although netchannels have one (the exception is listening sockets bound to 0.0.0.0, which add a constant (not per socket) m additional lookups to the search time; m was described above, usually it is 4 - two addresses and two ports). Trie by its nature allows trivial RCU'fication.
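The claim that lookup time depends on the key length and not on the number of entries can be seen in a much simplified sketch. This is not netchannel's multidimensional trie: it is a plain binary trie over a concatenated key (two IPv4 addresses plus two ports packed into three 32-bit words), it allocates one node per bit instead of doing a single allocation, and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_KEY_WORDS	3	/* saddr, daddr, (sport << 16 | dport) */
#define TOY_KEY_BITS	(TOY_KEY_WORDS * 32)

struct toy_node {
	struct toy_node *child[2];
	void *data;		/* set only in nodes at depth TOY_KEY_BITS */
};

/* Returns bit number i of the key, most significant bit of key[0] first. */
static int toy_bit(const uint32_t *key, unsigned int i)
{
	return (key[i / 32] >> (31 - i % 32)) & 1;
}

/* Walks (and creates) exactly TOY_KEY_BITS levels. Returns 0 or -1 on OOM. */
static int toy_trie_insert(struct toy_node *root, const uint32_t *key, void *data)
{
	struct toy_node *n = root;

	assert(root != NULL && key != NULL);

	for (unsigned int i = 0; i < TOY_KEY_BITS; i++) {
		int b = toy_bit(key, i);

		if (!n->child[b]) {
			n->child[b] = calloc(1, sizeof(*n));
			if (!n->child[b])
				return -1;
		}
		n = n->child[b];
	}

	n->data = data;
	return 0;
}

/* At most TOY_KEY_BITS checks, no matter how many entries are stored. */
static void *toy_trie_lookup(const struct toy_node *root, const uint32_t *key)
{
	const struct toy_node *n = root;

	assert(root != NULL && key != NULL);

	for (unsigned int i = 0; i < TOY_KEY_BITS && n; i++)
		n = n->child[toy_bit(key, i)];

	return n ? n->data : NULL;
}
```

Inner nodes hold only child pointers and no useful payload, which is the property that makes RCU protection of such a structure straightforward.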

It will be challenging to beat socket hash tables, but I take the challenge - so there are two ongoing efforts to change the socket selection code.

As for the fact that RCU sucks, such postponing is not a big surprise - I found similar problems when I studied how RCU affects the network stack, but with a trie, whose entries themselves do not hold any useful information, searching, freeing and deleting protected by RCU results in really good behaviour.

/devel/networking :: Link / Comments (0)


Is kevent perfect?


David Miller wrote:

I'd be quiet if there were some well formulated objections to his work being posted, but people are posting nothing. So either it's a perfect API or people aren't giving it the attention and consideration it deserves.

Obviously, kevent is perfect!

/devel/kevent :: Link / Comments (0)


Interesting, where does Oracle get that?


Scot McKinley at Oracle wrote:

As Joel mentioned earlier, from an Oracle perspective, one of the key things we are looking for is a nice clean *common* wait point. We don't really care whether this common wait point is the old libaio:async-poll, epoll, or "wait_for_async". And if "wait_for_async" has the added benefit of scaling, all the better.
However, it is desirable for that common wait-routine to have the ability to return explicit completions, instead of requiring a follow-on call to some other query/wait for events/completions for each of the different type of async submissions done (poll, pid, i/o, ...). Obviously not a "must-have", but desirable.
It is also desirable (if possible) to have immediate completions (either immediate errs or async submissions that complete synchronously) communicated at submission time, instead of via the common wait-routine.
Finally, it is agreed that neg-errno is a much better approach for the return code. The threading/concurrency issues associated w/ the current unix errno has always been buggy area for Oracle Networking code.
Scot, you did your homework quite badly - just check linux-kernel@ and read the kevent mails I post about weekly - everything you described above was implemented about a year ago already - and as a bonus it has a lot of additional high-performance features, interfaces and usage cases added later.

/devel/kevent :: Link / Comments (0)


Continue fibril AIO discussion.


David Miller has entered fibril discussion with kevent support, thanks a lot David.
Let's see what he got as answers.

Davide Libenzi wrote:

Zab's async syscall interface is a pretty simple one. It accepts the syscall number, the parameters for the syscall, and a cookie. It returns a syscall result code, and your cookie...
Could this submission/retrieval be inglobated inside a "generic" submission/retrieval API? Sure you can. But then you end up having submission/event structures with 17 members, 3 of which are valid at each time. The API becomes more difficult to use IMO, because suddendly you have to know which field are good for each event you're submitting/fetching.

So, what's wrong with all those people? Kevent's structure does get an event type and id as parameters - which is exactly the syscall number - and there is also a user's pointer, which can be used as a cookie or as a pointer to the data where the syscall parameters live.

But thinking some more on it, it becomes a completely wrong discussion - one can use any userspace interface he/she likes - kevent is an event handling mechanism, it is used for AIO completions, it is used in POSIX timers - fibril AIO (whose design I do not agree with) can use kevent as a queue, which will return all its results - just see how it is implemented in kevent AIO or POSIX timers and adapt it to fibrils, or at least read the documentation or mail annotations, which describe kevent usage cases.

What Davide Libenzi misses is the fact that neither kevent nor poll()/epoll() is supposed to be an AIO mechanism - they were created to deliver events - AIO completion is one of the events, which can be perfectly delivered by kevent and poll()/epoll(); the latter requires heavy file-> bindings, while the former does not at all.
Kevent AIO or fibril AIO, or FSAIO - they are different mechanisms of doing AIO, but all of them can use kevent to deliver their own events like completions or errors.

/devel/kevent :: Link / Comments (0)


Mon, 05 Feb 2007

More comments on fibril AIO.


Zach Brown wrote:

Being able to wait on that with file->poll() obviously requires juggling file-> associations which sounds like more weight than we might want. Or it'd be optional and we'd get more moving parts and divergent paths to test.

Davide Libenzi wrote:
Yes, no need for the above. We can just host a poll/epoll in an async() operation, and demultiplex once that gets ready.

If you follow kevent blog and know how it works, you will understand, why I want to crack my head against the wall.

/devel/kevent :: Link / Comments (0)


Kevent compilation is broken - patch has been released.


Blame me for that - I generate patches from my git tree automatically and never test them for correctness in the form of patches, only as a git tree, so a small error sneaked in, which was caught by David M. Lloyd. The fix is trivial.
This hunk is related to my epoll replacement with kevent (which I did not include into main patchset due to simple reason - I do not want to kill someone's work without author's approval).

/devel/kevent :: Link / Comments (0)


I've refused to participate in training courses.


I do not even know why - I did not ask about money, I did not coordinate theses, I did not discuss topics and times - I do not even know why I refused.
My first lecture was somehow fun and maybe interesting, but there was no fire, there was no hold to grab and build something really interesting for me.
So, I refused.
As a lame excuse, I got a new project at my paid job (boring and simple again, but likely, as with the previous and pending ones, it will take a lot of time in politics and related nontechnical stuff...).

/devel/other :: Link / Comments (0)


Sun, 04 Feb 2007

Major step in M:N threading implementation - moved to partial userspace context usage.


As a first step I just started to use struct ucontext instead of sigframe - this allows using getcontext()/makecontext() for initialization, which is a bit faster than signal waiting, but potentially can be much faster (I just need to remove the signal mask initialization through a syscall in getcontext()). It also allows some code refactoring, which hides the per-arch signal frame definition in implementation files instead of headers.
I created a simple converter from the signal stack frame (mainly struct sigcontext) into the ucontext (mcontext_t) structure, which is used as storage. Although they have exactly the same binary layout, x86 segment registers should not be copied. The current code does not properly save/restore FPU state either, but this requires more learning.
The next step is to implement getcontext() by hand so that the code does not initialize the signal mask at all. Then I will think about proper SMP scaling, which will use stack tricks instead of 'current' pointers.
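For readers who have not used this interface, here is a minimal sketch of context initialization with getcontext()/makecontext() from glibc. It is not the M:N library code: the toy_* names and the stack size are hypothetical, and swapcontext() is used here only to make the demo switch back and forth.

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define TOY_STACK_SIZE (64 * 1024)	/* hypothetical stack size */

static ucontext_t toy_main_ctx, toy_thread_ctx;
static int toy_counter;

static void toy_thread_fn(void)
{
	toy_counter++;					/* first slice */
	swapcontext(&toy_thread_ctx, &toy_main_ctx);	/* yield to the caller */
	toy_counter++;					/* second slice */
	/* Returning from here resumes uc_link, i.e. toy_main_ctx. */
}

/* Runs toy_thread_fn() on its own stack in two slices, returns -1 on error. */
static int toy_run(void)
{
	void *stack = malloc(TOY_STACK_SIZE);
	int after_first;

	if (!stack)
		return -1;

	toy_counter = 0;

	/* getcontext() fills the structure (including the signal mask). */
	if (getcontext(&toy_thread_ctx) < 0) {
		free(stack);
		return -1;
	}
	toy_thread_ctx.uc_stack.ss_sp = stack;
	toy_thread_ctx.uc_stack.ss_size = TOY_STACK_SIZE;
	toy_thread_ctx.uc_link = &toy_main_ctx;
	makecontext(&toy_thread_ctx, toy_thread_fn, 0);

	swapcontext(&toy_main_ctx, &toy_thread_ctx);	/* run until the yield */
	after_first = toy_counter;
	swapcontext(&toy_main_ctx, &toy_thread_ctx);	/* resume until return */

	assert(after_first == 1);
	free(stack);
	return after_first * 10 + toy_counter;
}
```

The getcontext() call is where glibc saves the signal mask through a syscall, which is the cost the hand-written version is supposed to avoid.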

There is a problem though - new thread is started only after parent's timeslice is over, i.e. it is possible to create set of threads with empty functions and they will exists quite for a while - for example 500 threads started in a loop still are created too quickly to be