Zbr's days.

Wed, 31 Jan 2007

Climbing evening and recent happenings.


It was quite an easy training, since for the last several days I have been in an extremely bad mood. I do not know exactly why, maybe a seasonal lack of vitamins, or because my guardian angel and muse are on vacation.
Anyway, not a lot has happened recently: I brought my laptop to HP warranty service because of hard drive problems, and I completed a small initial legal action related to my apartment.
As for climbing, I only did simple traces today, although I tried to do them without rest in between, so I completed 5 or 6 simple (5b-5c+) traces and a couple of traverses, and finished with that.

Due to the muse's absence I am not doing interesting tasks right now, but for the near future I have scheduled an update of the M:N threading library, which will fix an SMP bug and add the possibility to create a context by hand, which will eliminate unneeded signal invocation in NTL in some places. I also plan to replace the network stack's hash tables with netchannel's trie, as described in TODO. Likely tomorrow I will release a new kevent version, which will be just a port to the most recent kernel tree.

/life :: Link / Comments (0)


Mon, 29 Jan 2007

Climbing evening.


I've bought myself new climbing shoes: this time I got "La Sportiva" instead of "Boreal", and it looks like my choice was right, although it seems I got a slightly bigger size than the best one (I have a Russian size 43 foot, and my climbing shoes are usually about 40.5, but with "La Sportiva" it could be even smaller due to the way the shoes are designed). I tried several old traces and failed one of them, which is quite complex, but which I had finished without major problems before (green 7a in the central sector in Skala-city). At the end I tried a jump I had discovered in an old trace, but failed again. Probably it is due to the new shoes, which feel quite different on the feet, or because of the slightly different size; time will show.

/life :: Link / Comments (0)


M-on-N threading library got a homepage.


Here one can find some design ideas and notes compiled from threading blog tag, which you are currently reading.

/devel/threading :: Link / Comments (0)


Atomic locks and updated starting route for M-on-N threading model.


After I started to use atomic locks (lock prefix on x86) instead of semaphores, the thread start/empty exec/stop time dropped to 0.3 microseconds, compared to 14 microseconds for the POSIX NPTL case.
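
To illustrate what such an atomic lock can look like, here is a minimal sketch built on C11 atomics. It is not the code from the library: the demo_ names are made up for this example, and a real lock would need some backoff or sleeping under contention.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: a single atomic flag, no kernel involvement. */
struct demo_spinlock {
	atomic_flag locked;
};

static void demo_spin_init(struct demo_spinlock *l)
{
	atomic_flag_clear(&l->locked);
}

/* Returns true if the lock was taken, false if somebody already holds it. */
static bool demo_spin_trylock(struct demo_spinlock *l)
{
	/* Atomic test-and-set; on x86 compilers typically emit an xchg here. */
	return !atomic_flag_test_and_set_explicit(&l->locked,
						  memory_order_acquire);
}

static void demo_spin_lock(struct demo_spinlock *l)
{
	while (!demo_spin_trylock(l))
		; /* busy-wait; the uncontended path never makes a syscall */
}

static void demo_spin_unlock(struct demo_spinlock *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

The uncontended lock/unlock pair stays entirely in userspace, which is where the difference from a futex-backed semaphore comes from.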

But there are problems.
The first one is that I perform the initial context setup through signal invocation, which takes at least two syscalls. They are slow.
Another one is that a thread is really started only after rescheduling, which is another signal, so another two syscalls.
The third one is that there must be different locking primitives for signal context and for process context; the process-context ones must block signals, which in turn adds the overhead of a sigprocmask() syscall.
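
The third problem can be sketched as follows. This is only an illustration under my own assumptions (the demo_ names are hypothetical, and SIGALRM is assumed to be the scheduler signal): a process-context lock has to block the signal first, otherwise the handler could interrupt the lock holder and spin forever on the same lock.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>

/* Hypothetical process-context lock: block the scheduler signal, then spin. */
struct demo_lock {
	atomic_flag locked;
	sigset_t saved_mask;	/* signal mask to restore on unlock */
};

static void demo_lock_init(struct demo_lock *l)
{
	atomic_flag_clear(&l->locked);
}

static void demo_lock_process(struct demo_lock *l)
{
	sigset_t block, old;

	sigemptyset(&block);
	sigaddset(&block, SIGALRM);
	sigprocmask(SIG_BLOCK, &block, &old);	/* the extra syscall */

	while (atomic_flag_test_and_set_explicit(&l->locked,
						 memory_order_acquire))
		;
	l->saved_mask = old;
}

static void demo_unlock_process(struct demo_lock *l)
{
	sigset_t old = l->saved_mask;

	atomic_flag_clear_explicit(&l->locked, memory_order_release);
	sigprocmask(SIG_SETMASK, &old, NULL);	/* and one more on the way out */
}

/* Inspection helper: is SIGALRM blocked right now? */
static int demo_alarm_blocked(void)
{
	sigset_t cur;

	sigprocmask(SIG_BLOCK, NULL, &cur);
	return sigismember(&cur, SIGALRM);
}
```

Inside the signal handler itself the signal is already blocked by the kernel, so a signal-context variant could skip both sigprocmask() calls. In a multithreaded program pthread_sigmask() would be the right call instead.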

After I fixed all the above issues (actually not fixed, but confirmed that they must exist), the time for empty thread creation/destruction rose to 9 microseconds, compared to 14 microseconds for the POSIX NPTL case.

This could be fixed if I created arch-specific getcontext()-like calls whose output could be converted to and from signal context information (the existing getcontext() and friends produce different data than the signal context holds, at least on x86). But I cannot do that right now, since I do not know the x86 ABI well enough (I have learned a lot over the past several days, as you can notice from this blog, but it is still not even remotely enough).

Currently the M-on-N threading model uses ugly arch-specific hacks to start new threads, which are actually something remotely similar to makecontext().
So the solution which will make the M-on-N threading implementation rock is to convert or create getcontext()-like calls which can be used with signal context information.

But let the Linux motto "release early, release often" fly: I plan to release an alpha version today, even without syscall substitution support, and send it to linux-kernel@ and libc-hacker@ for review.

/devel/threading :: Link / Comments (0)


Sat, 27 Jan 2007

New userspace preemptive scheduler for M-on-N threading model is completed.


It is built on the idea that the kernel saves information about the interrupted context on the stack, so it can be extracted from there and changed to anything else. I do not use makecontext() and friends at all now, since their internals are completely different from what is stored on the signal's stack. The system works, but not 100% reliably: there is a race in the scheduler where it is possible to reference a thread which has just exited. I will fix this by introducing atomic operations, which will also reduce the thread creation overhead related to the futex() syscall; futexes are how semaphores are implemented, and semaphores are currently the main locks in my threading library.

This approach can currently only be used when sources are compiled without position-independent code support.

If the timer-signal-based scheduler is disabled, no bugs happen (which is quite obvious, since it becomes synchronous).

I have not tested SMP scalability, and there is no dynamic syscall substitution yet; it will be added after I finish fixing bugs in the scheduler.

The current thread creation speed test (a thread is allocated, its empty function is called, then it is removed) shows that one thread creation/execution/destruction takes about 1.9 microseconds, compared to 14 microseconds for NPTL POSIX threads.
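
For reference, a measurement of this kind can be sketched as below. This is not the actual test program: demo_empty_op() is a hypothetical stand-in for "create a thread, run its empty function, destroy it", and the harness just averages the cost of one call over many iterations.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for one create/execute/destroy cycle. */
static void demo_empty_op(void)
{
}

/* Average cost of one op() call in microseconds over n iterations. */
static double demo_usec_per_op(void (*op)(void), long n)
{
	struct timespec start, end;
	double nsec;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < n; i++)
		op();
	clock_gettime(CLOCK_MONOTONIC, &end);

	nsec = (end.tv_sec - start.tv_sec) * 1e9 +
	       (end.tv_nsec - start.tv_nsec);
	return nsec / 1e3 / n;
}
```

Calling demo_usec_per_op(demo_empty_op, 100000) gives the per-iteration time; plugging a real thread create/join pair in place of demo_empty_op would give numbers comparable to the ones quoted above.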

/devel/threading :: Link / Comments (0)


Thu, 25 Jan 2007

I will run some training sessions this year at Auriga and MSU.


They will be devoted to Linux kernel hacking. The first lecture will be held in about a week at Auriga, where I will talk about Linux USB driver creation using real hardware as an example: the w1 USB bus master driver, which you can find in the kernel sources (drivers/w1/masters/ds2490.c; to get it to work I needed to resolder some pins, so I am not 100% sure that my driver for this hardware in the Linux kernel is correct, but do not tell anyone).
If things go smoothly, I will create a couple of other lectures (I plan to talk about networking) and will give a presentation at MSU in a course which starts there in a month.

I consider this preparation for my possible talks at the Linux Kernel Summit and/or other development conferences, if the stars align and I participate there.

There is a non-zero probability that I will also give some lectures in India this spring with the Auriga team (with a kevent talk as a bonus), but it is too early to talk about it.

/life :: Link / Comments (0)


New kevent version 'take34' has been released.


The only change is a header pointer in aio_sendfile_path(). Now one can use aio_sendfile_path() to send a file with header in one syscall instead of three: send(header), open(file), sendfile().
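
For comparison, the classic three-call sequence mentioned above looks roughly like this. It is a sketch using only standard Linux calls; the function name is made up, and it deliberately does not show aio_sendfile_path() itself, since that syscall exists only in the kevent patchset.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Send a header and then a whole file over sock with separate syscalls.
 * Returns the total number of bytes sent, or -1 on error.
 */
static ssize_t demo_send_header_and_file(int sock, const char *hdr,
					 size_t hdr_len, const char *path)
{
	struct stat st;
	off_t off = 0;
	ssize_t sent, total;
	int fd;

	sent = send(sock, hdr, hdr_len, 0);	/* syscall 1: the header */
	if (sent != (ssize_t)hdr_len)
		return -1;
	total = sent;

	fd = open(path, O_RDONLY);		/* syscall 2: open the file */
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {		/* bookkeeping: file size */
		close(fd);
		return -1;
	}

	while (off < st.st_size) {		/* syscall 3: sendfile() */
		sent = sendfile(sock, fd, &off, st.st_size - off);
		if (sent <= 0) {
			close(fd);
			return -1;
		}
		total += sent;
	}
	close(fd);
	return total;
}
```

Besides the three calls from the text, the sketch also pays for fstat() and close(), which is the kind of overhead a single combined syscall avoids.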

/devel/kevent :: Link / Comments (0)


Wed, 24 Jan 2007

I congratulate Grange on his birthday!


Be good and be cool!

/life :: Link / Comments (0)


Climbing evening.


It was a simple training: just several old traces, and I magically got tired as hell, so I finished with the 'men's start' (climbing without legs), and after this 'trace' there was no power left, so I completed some exercises, had the usual sauna and shower, and left.

/life :: Link / Comments (0)


Executing a different function after returning from a signal handler.


It looks a bit like writing an exploit, actually, but I managed to change the signal execution path to call my own function after the signal handler completes, instead of returning to the previously running context. Currently that function is executed in the previously running context; only %eip was changed.

Code is quite simple and generic enough:

	struct sigframe *frame = (void *)get_ebp + sizeof(void *);
	struct sigcontext *sc = &frame->sc;
The above is correct at least on x86 and x86_64 (except that there the register is %rbp), although the structures are CPU-specific. If %eip (or %rip) is changed in the signal handler to point to a new function, that function will be called instead of the one which was supposed to run in that context.

Here is a log:
main 1814: scheduling first, context size: 88, fpstte: 624.
call_me_func: ebp: 0xb7f19ff8, stack: 0xb7f1a008, diff: 16.
call_me_func: esp: 0xb7f19fb0, stack: 0xb7f1a008, diff: 88.
1169655167:       th: 0x8049c00, stack: 0xb7efa008, id: 1, esp: b7f19fb0, ebp: b7f19ff8.
alarm_sighandler: ebp: 0xb7f19b08, esp: 0xb7f19ad0, func: 0x804868a, frame: 0xb7f19b0c, call_me_func: 0x804856c.
alarm_sighandler: prev: esp: b7f19de8, ebp: b7f19fa8, eip: 3404db.
alarm_sighandler: eip set to 804865a.
sched_return: func: 0x804865a, ebp: b7f19de4, esp: b7f19dcc.
1169655168:       th: 0x8049c00, stack: 0xb7efa008, id: 2, esp: b7f19fb0, ebp: b7f19ff8.
1169655171:       th: 0x8049c00, stack: 0xb7efa008, id: 3, esp: b7f19fb0, ebp: b7f19ff8.
1169655174:       th: 0x8049c00, stack: 0xb7efa008, id: 4, esp: b7f19fb0, ebp: b7f19ff8.
As you can see, sched_return() is called instead of the old function, which prints the next string once sched_return() returns.
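
The same saved registers can also be reached without walking the stack by hand, through the third argument of an SA_SIGINFO handler. The sketch below is not the code used above; it assumes Linux with glibc, the demo_ names are hypothetical, and it only reads the interrupted instruction and stack pointers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <ucontext.h>

static volatile uintptr_t demo_saved_ip;	/* where the code was interrupted */
static volatile uintptr_t demo_saved_sp;	/* its stack pointer */
static volatile sig_atomic_t demo_handled;

static void demo_handler(int sig, siginfo_t *info, void *ctx)
{
	ucontext_t *uc = ctx;

	(void)sig;
	(void)info;
#if defined(__x86_64__)
	demo_saved_ip = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
	demo_saved_sp = (uintptr_t)uc->uc_mcontext.gregs[REG_RSP];
#elif defined(__i386__)
	demo_saved_ip = (uintptr_t)uc->uc_mcontext.gregs[REG_EIP];
	demo_saved_sp = (uintptr_t)uc->uc_mcontext.gregs[REG_ESP];
#else
	(void)uc;	/* other architectures are not covered by this sketch */
#endif
	demo_handled = 1;
}

/* Install the handler for sig; returns 0 on success. */
static int demo_install(int sig)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = demo_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(sig, &sa, NULL);
}
```

On Linux, values written into this structure are what the kernel restores when the handler returns, so storing a new address in the instruction pointer slot would be the equivalent of the %eip change described above.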

To implement correct userspace scheduling I only need to replace the whole contents of struct sigframe with the context from a different thread. So far this looks simple; how it will be in practice I will check tomorrow, and now I need some climbing.

P.S. In the previous story about how signals work I made a mistake saying that a new signal stack is allocated. No, the same process stack is used, or an alternative one if it is available and the corresponding flag is set.

/devel/threading :: Link / Comments (0)


Tue, 23 Jan 2007

How do signals work in Linux?


It is likely the same in all other Unix systems too, but I only checked the Linux kernel.

There are two types of signals: synchronous ones, which are for example the result of an erroneous operation and always happen right after it, and asynchronous ones, which can be delivered at any time, for example using the kill() call.
No matter what type of signal we receive, it is produced in exactly the same way.

When a signal is generated and it is not blocked, the mask of pending signals is updated. Later (when the process is scheduled for execution) the kernel examines the mask of pending signals, and if there are any, it starts to deliver them one by one. If the handler is set to SIG_DFL or SIG_IGN, either the default action takes place (for example, the process is killed) or the signal is dropped. The most interesting case is thus when we have our own handler.
In that case the kernel will eventually call the setup_frame() function, which will set up a new signal stack (or use the existing one if the SA_ONSTACK flag is set), save the current context (copy registers, error value, some thread info and other interesting information), and set up the return call (the function which will be called when the signal handler completes, to return back to the kernel). Saving the context includes filling struct sigframe, which contains all the info needed to continue running the interrupted task and/or schedule it after some time.
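
The pending mask and the later delivery can be observed from userspace. The sketch below is only an illustration with made-up demo_ names: it blocks SIGUSR1 purely to keep the signal in the pending state long enough to look at it, then unblocks it so the kernel delivers it to the handler.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t demo_delivered;

static void demo_on_usr1(int sig)
{
	(void)sig;
	demo_delivered++;
}

/* Returns 1 if SIGUSR1 was seen pending while blocked, 0 if not, -1 on error. */
static int demo_pending_then_deliver(void)
{
	struct sigaction sa;
	sigset_t block, pending;
	int was_pending;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = demo_on_usr1;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR1, &sa, NULL) < 0)
		return -1;

	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);
	sigprocmask(SIG_BLOCK, &block, NULL);

	raise(SIGUSR1);			/* generated, but only marked pending */
	sigpending(&pending);
	was_pending = sigismember(&pending, SIGUSR1);

	/* Unblocking lets the kernel build the frame and run the handler. */
	sigprocmask(SIG_UNBLOCK, &block, NULL);
	return was_pending;
}
```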

After some GNU asm learning and googling, I've managed to run a function on top of its own stack. The code (for x86) is pretty simple:

	asm volatile (
		"mov	%0,%%eax	\n" /* Start address */
		"mov	%1,%%ebx	\n" /* Argument */
		"mov	%%esp,%%edx	\n" /* save old sp to edx */
		"mov	%2,%%esp	\n" /* change stack */

		"push	%%edx		\n" /* keep old sp on the new stack */
		"push	%%ebx		\n" /* copy argument to new stack */

		"call	*%%eax		\n" /* call (*func)(arg); */
		"mov	4(%%esp),%%edx	\n" /* reload old sp saved above */
		"mov	%%edx,%%esp	\n" /* restore old stack */
		:
		: "g"(func), "g"(data), "g"(stack+stack_size)
		/* callee may clobber ecx, flags and memory; esp is restored by hand */
		: "eax", "ebx", "ecx", "edx", "cc", "memory"
		);
I found it in the PTL threading library, which is the only one which supports preemptive userspace scheduling. The provided function is indeed called on its own stack. But when a signal is delivered, its %ebp does not point into that stack; instead it contains an address somewhere in the middle of the new stack. This raised a suspicion that glibc installs its own signal handler and then calls mine, but experiments on an x86_64 test machine showed that the kernel indeed jumps directly into my own signal handler. Unfortunately I do not have a fast x86 test machine and do not want to spread my efforts over two arches currently, so I will set up my VIA C3 test machine and run some signal tests on a modified kernel to detect where exactly the stack pointer of the interrupted context is stored. When this issue is resolved, a correct preemptive scheduler for the M-on-N threading model implementation will be just a matter of hours.

/devel/threading :: Link / Comments (0)


This year Linux Kernel Summit will be held in Cambridge, England.


According to Theodore Ts'o, this will happen on September 5-6.

Before that date I have some projects to complete, so I think I will participate if the stars align.
The main projects are kevent, of course, and netchannels (with the grand network stack breakage and replacement of socket hash tables).
Additionally, I think threading issues can form a good talk, and of course the new filesystem (but only if there are some results by that point).

Although I believe that the probability of my giving talks is quite low (and, as you likely do not know, my spoken English skills are somewhere between zero and void, although that is not a problem for me), it is still possible to go there for a day and meet Abr (although he is in London) and Mephody (although he is in Limerick, Ireland).

/devel/other :: Link / Comments (0)


True context switching in userspace.


After some thinking, I've understood that the setcontext() approach does not allow real context switching: when this function is called, the context is restored from the point where getcontext()/makecontext() was last called, so even if one can restore a new context, the context being switched out must save its state by itself, and obviously no one will do that. So this approach will not be used at all in my context switching implementation (although I will check the getcontext()/makecontext() sources, since they contain the needed bits).
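
The limitation is easy to see in a small sketch with the standard ucontext calls (the demo_ names are made up for this example). The task gets back to where it stopped only because it saved its own state with swapcontext() before giving up the CPU; nothing does that for a task which never yields.

```c
#include <assert.h>
#include <ucontext.h>

static ucontext_t demo_main_ctx, demo_task_ctx;
static char demo_stack[64 * 1024];
static int demo_trace[4];
static int demo_trace_len;

static void demo_task(void)
{
	demo_trace[demo_trace_len++] = 1;
	/* The task itself saves its state into demo_task_ctx before leaving. */
	swapcontext(&demo_task_ctx, &demo_main_ctx);
	demo_trace[demo_trace_len++] = 3;
	/* Returning from here continues at uc_link, i.e. demo_main_ctx. */
}

static void demo_run(void)
{
	demo_trace_len = 0;

	getcontext(&demo_task_ctx);
	demo_task_ctx.uc_stack.ss_sp = demo_stack;
	demo_task_ctx.uc_stack.ss_size = sizeof(demo_stack);
	demo_task_ctx.uc_link = &demo_main_ctx;
	makecontext(&demo_task_ctx, demo_task, 0);

	swapcontext(&demo_main_ctx, &demo_task_ctx);	/* run until it yields */
	demo_trace[demo_trace_len++] = 2;
	swapcontext(&demo_main_ctx, &demo_task_ctx);	/* resume the saved state */
	demo_trace[demo_trace_len++] = 4;
}
```

After demo_run() the trace array holds 1, 2, 3, 4: the switch is purely cooperative, which is exactly why it does not help with preempting a busy task.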

Let's move slightly away from the topic and concentrate on how signals work.
According to my knowledge, a signal is just a call to some function on the stack: the kernel sets up a small region on the stack of the currently executing thread, saves the current context there and calls a special function which ends up invoking the registered handler. (An interesting note for investigation: how do signals work with a non-executable stack? Likely a special page is allocated for that purpose, but if so, then there is a way to write exploits which can run on the stack even when the system sets it up as non-executable.)
A signal handler cannot be reentered; when it exits, it restores the previous context, thus creating real context switching.

Returning to our topic, the scheduler's work thus becomes obvious: the system's signal helper will store the current execution context on the stack, then the registered handler will select a new thread, and that thread's context will be restored instead of the old one when the signal handler exits, so true context switching will be performed.

This looks simple in theory, but in practice there are a couple of small problems:

  • I do not know asm (even x86) well enough to code in it like in C (but I always wanted to hack low-level stuff)
  • I am not 100% sure that signals work as I described (but I surely want to know)
  • I do not know how easy it will be to save/restore context (according to the glibc sources, getcontext()/makecontext() and friends are not too big though)
Actually a context switch is a bit more complex: for example, the virtual mapping for TLS (thread-local storage) should be saved/restored too, on x86 MMU registers must be saved, and more generally FPU state must be saved too, but for the initial implementation I think it is not too relevant.

So, enough theories; it is about 1 A.M. and I need to sleep. I'm sure this day will be very interesting.

/devel/threading :: Link / Comments (0)


Mon, 22 Jan 2007

Signals and contexts.


I was a bit optimistic when I said that scheduling works: it cannot work at all, since it is not allowed to call setcontext()/swapcontext() from a signal handler. Signal-based scheduling is only needed to schedule away CPU-bound tasks which do not perform syscalls, since a syscall will be a rescheduling point too.

To solve this, the system needs to either spawn an additional control thread, or have the kernel provide a different type of signal which is safe for setcontext()/swapcontext(). The former is the platform-independent approach; I would like to implement the latter, but for now the first one will be done.

I've just found that there is no preemptive userspace threading library: IBM's closed NGPT library is based on GNU Pth, and they both provide only non-preemptive scheduling.

Now I understand why I have so many problems with preemptive scheduling, and it does not look like there is an easy solution even with a control thread: all *context() functions work only with the current context.

This requires some deep thinking...

/devel/threading :: Link / Comments (0)


Sun, 21 Jan 2007

Cognac, vodka, friends...


I've visited my friends in Dolgoprudniy - Bass (Andrew Shcheglov), the Silich/Sviridovskaya family, Gora, and the Shtrom family in Lobnya - that was an excellent trip. I will stay at Bass' home in Dolgoprudniy for some time; maybe I will visit my Alma Mater and even meet my old friends who work there.

I found a new, very interesting drink based on cognac, although I generally do not like cognac (for example, today I drank a several-year-old 'Hennessy' but definitely did not find it extremely tasty; I still think that cognac is a bad-tasting coloured vodka, and at least the latter can be drunk with a chaser like juice). This is the so-called "Cranberry in cognac", which is not as strong as vodka, so it can be drunk without a chaser or food, and it tastes extremely good.

/life :: Link / Comments (0)


Sat, 20 Jan 2007

M-on-N threading scheduler is ready.


It works similarly to the O(1) scheduler in that it also has two queues of tasks. When a task runs out of its timeslice, it is moved into the inactive queue, which becomes active when the active queue is empty.
It seems that the system scales well with an increased number of CPUs: at least when the system created several busy-loop threads, they ate both physical CPUs on my Core Duo system (according to 'top'), although it is possible that the distribution is unfair.
The code is not ready yet: I know about at least two nasty bugs there, and it lacks the syscall substitution part, which is a major part, but there is progress indeed.
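
The two-queue idea can be sketched in a few lines. This is a simplified single-CPU illustration with hypothetical names, without locking and without the timer signal; it is not the library code.

```c
#include <assert.h>
#include <stddef.h>

struct demo_task {
	int id;
	struct demo_task *next;
};

struct demo_queue {
	struct demo_task *head, *tail;
};

struct demo_sched {
	struct demo_queue q[2];
	int active;	/* index of the active queue; the other is inactive */
};

static void demo_enqueue(struct demo_queue *q, struct demo_task *t)
{
	t->next = NULL;
	if (q->tail)
		q->tail->next = t;
	else
		q->head = t;
	q->tail = t;
}

static struct demo_task *demo_dequeue(struct demo_queue *q)
{
	struct demo_task *t = q->head;

	if (t) {
		q->head = t->next;
		if (!q->head)
			q->tail = NULL;
	}
	return t;
}

/* New tasks start in the active queue. */
static void demo_add(struct demo_sched *s, struct demo_task *t)
{
	demo_enqueue(&s->q[s->active], t);
}

/* A task which used up its timeslice goes to the inactive queue. */
static void demo_expire(struct demo_sched *s, struct demo_task *t)
{
	demo_enqueue(&s->q[!s->active], t);
}

/* Pick the next task; swap the queues when the active one is empty. */
static struct demo_task *demo_pick(struct demo_sched *s)
{
	if (!s->q[s->active].head)
		s->active = !s->active;
	return demo_dequeue(&s->q[s->active]);
}
```

Both picking a task and expiring it are constant-time operations, and no task can run twice before every other task in the active queue has had its turn.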

/devel/threading :: Link / Comments (0)


Why is kernel.org so slow sometimes (LWN.net article).


Here is an article about the filesystem access problems kernel.org has started to face recently, which come down to the fact that Linux does not scale too well when the number of directory reads becomes too high. Another part of the problem is that ext3 (the current kernel.org fs) does not group directories close to each other to reduce the number of seeks needed to read several subdirectories, while it was pointed out that XFS, for example, does have some methods to group directory information close together on disk. Another problem is the absence of readahead for directories.

/devel/fs :: Link / Comments (0)


RIAA/MPAA and others.


I always wonder how it is even possible to sue someone for getting something for free: how can they sue you for downloading music files? When you go down the street and see a dollar, no one will sue you for stealing if you pick it up. The same applies to a book, which is someone's intellectual property. Then why is the Internet different?

Of course I understand that the amount of money involved on the Internet is not even remotely comparable to the value of the books one can find on the street or in any similar situation.
But why are defendants so inert, why do they only try to say that they did it by accident?
RIAA/MPAA allow selling _information_, which on the Internet means a set of bits. And they want to limit your right to do whatever is allowed by the Constitution.
When someone gets a disk with songs, one has exactly those bits: it is a _product_, not a service. One can do anything with a product, since after it was bought it belongs to the new owner, but a service can only be used according to its rules.

So, when I see freely downloadable music on the Internet, I just see some information which does not belong to RIAA/MPAA, since it was bought by someone as a product (bits of information on a CD are a product, not a service), so one can use it however one wants, including sharing.

I agree that pirates who make a copy in a cinema (always with bad quality) do it illegally, since the cinema provides a service, which has rules. But if information has been sold, it does not belong to the previous owner, no matter what he wants to say. Consider a situation where you are buying a TV and the shop does not permit you to watch some channels on it.

I think there must be an attack on that front, not a swampy defence.

The idea was prompted by reading Slashdot about all those idiotic RIAA/MPAA legal actions...

P.S. It may sound too naive, since I'm not a lawyer at all, but I do not understand what the difference is between information in the form of bits on a disc and any other product, whose EULA becomes invalid if it breaks the law.

/other :: Link / Comments (0)


Fri, 19 Jan 2007

Userspace scheduling in M-on-N threading model.


The main purpose of the scheduler as a separate system is to keep the distribution of CPU clocks between tasks fair.
The current scheduler just gets the next task and gives it its own timeslice (currently equal to 10 milliseconds). Whether that task is sleeping in a syscall or performing CPU-intensive operations, the system does not take that into account. So tasks which are so-called IO-bound, i.e. waiting in a syscall for IO completion (or actually any other sleeping), do not get CPU clocks in a fair manner, since their timeslices are spent mostly sleeping. CPU-bound tasks in that model do not fully utilize the CPU either, since part of the time the CPU is idle because the scheduler put an IO-bound task into execution and that task is waiting, while it would be possible to run a CPU-bound task.

So the solution is to give each task a timeslice which is decreased only while the task is actively executing on the CPU. If a task puts itself to sleep, its timeslice is reduced only by the time which was used for active execution. When rescheduling happens, either at syscall time or in a signal, the scheduler will select the task with the largest timeslice left. The priority of a task will correspond to the length of the timeslice it obtains.

The userspace scheduler also has access to information about what exactly any given task is doing, so if it is known that a task is waiting in a syscall, it will not be awakened at all until the scheduler receives the kevent which that task is waiting for.
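
The selection rule described above can be sketched as follows. The structure and function names are hypothetical, the task list is a plain array, and the kevent wakeup is reduced to a flag that somebody else clears.

```c
#include <assert.h>
#include <stddef.h>

struct demo_task {
	int id;
	long slice_us;	/* timeslice left, in microseconds */
	int waiting;	/* nonzero while the task sleeps in a syscall */
};

/* Charge a task only for the time it actually spent on the CPU. */
static void demo_charge(struct demo_task *t, long ran_us)
{
	t->slice_us = ran_us >= t->slice_us ? 0 : t->slice_us - ran_us;
}

/* Select the runnable task with the largest timeslice left, if any. */
static struct demo_task *demo_pick_next(struct demo_task *tasks, size_t n)
{
	struct demo_task *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Sleeping tasks are skipped until their event arrives. */
		if (tasks[i].waiting || tasks[i].slice_us == 0)
			continue;
		if (!best || tasks[i].slice_us > best->slice_us)
			best = &tasks[i];
	}
	return best;
}
```

A task which slept for most of its turn keeps most of its timeslice, so it wins the next selection as soon as it becomes runnable again. The linear scan is only for brevity; a real scheduler would keep the tasks ordered.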

/devel/threading :: Link / Comments (0)


Kevent feature request for aio_sendfile().


Suparna Bhattacharya (IBM) has requested a new feature in the asynchronous file sending syscall: a header pointer, whose data will be put into the socket queue before the file's data.
Although Linux syscall overhead is extremely small compared to other Unix systems, it is still not zero, so since I already "optimized" (i.e. removed) the open() call in aio_sendfile_path(), I think things will not become worse if I put a header pointer and length there too.
I plan to release a new kevent version with this feature after the M-on-N threading model implementation.

/devel/kevent :: Link / Comments (0)


Brilliant idea on how to break Linux networking code completely.


I'm going to substitute sockets with netchannels while keeping backward compatibility: mainly, I will replace the socket lookup hash tables with netchannel's trie and will make a socket a special type of netchannel, like the netfilter or userspace netchannels currently.
This change is completely transparent to all users, since no API will be changed, only socket allocation/lookup/freeing.
This change is intended to allow unlimited scaling of the number of sockets with constant search time (since there is only one type of wildcard socket, the listening ones, which expect incoming connections from the 0.0.0.0 address, there will be only one wildcard per trie, so search time will be constant).
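
Why the search time stays constant can be shown with a toy bitwise trie. This is only a sketch under strong simplifications: the key is a single 32-bit address instead of the full address/port tuple, all names are made up, and it says nothing about how the netchannel trie is actually laid out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct demo_node {
	struct demo_node *child[2];
	void *sock;		/* set only at depth 32 */
};

struct demo_trie {
	struct demo_node root;
	void *wildcard;		/* the single listening (0.0.0.0) entry */
};

/* Insert an exact-match entry; returns 0 on success, -1 on allocation failure. */
static int demo_trie_insert(struct demo_trie *t, uint32_t addr, void *sock)
{
	struct demo_node *n = &t->root;
	int bit;

	for (bit = 31; bit >= 0; bit--) {
		int b = (addr >> bit) & 1;

		if (!n->child[b]) {
			n->child[b] = calloc(1, sizeof(*n));
			if (!n->child[b])
				return -1;
		}
		n = n->child[b];
	}
	n->sock = sock;
	return 0;
}

/* At most 32 steps, no matter how many entries the trie holds. */
static void *demo_trie_lookup(const struct demo_trie *t, uint32_t addr)
{
	const struct demo_node *n = &t->root;
	int bit;

	for (bit = 31; bit >= 0 && n; bit--)
		n = n->child[(addr >> bit) & 1];

	/* No exact match: fall back to the only wildcard there is. */
	return (n && n->sock) ? n->sock : t->wildcard;
}
```

Because there is only one wildcard to fall back to, a miss costs the same bounded walk as a hit, instead of growing with the length of a hash chain.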

I've put this item into TODO and scheduled these changes for after the M-on-N threading model implementation.

/devel/networking :: Link / Comments (0)


Thu, 18 Jan 2007

iDrink.

iDrink

Got it from here via www.idiot.ru (political news, btw).

/other :: Link / Comments (0)


Wed, 17 Jan 2007

Climbing evening.


It was not that bad a training, although with some negative moments. I tried new traces this day; most of them were quite doable, but they were complex enough, and since I had performed several complex boulderings before, I failed. The most interesting was the yellow trace in the central sector on the vertical wall with completely passive holds: I managed to complete about half of the trace on-sight, and then it started to present surprises, or maybe I was just too tired for on-sight climbing of that complexity. Anyway, at the top I was completely out of power, and even managed to get the rope over my head and hurt my shoulder; eventually I quickly fixed my position, but tore my headphones cord.

/life :: Link / Comments (0)


Breakthrough ideas are not from teams. Hans von Ohain.


Interesting note... I would even say 'ego boosting'. I like it.

/other :: Link / Comments (0)


Threading part of the NTL M-on-N threading library is ready.


Although not without problems: there is no scheduler (well, there is a round-robin one, which is not what I want), and I did not run any kind of benchmarks to test SMP scalability and timer signal overhead (the latter is the most problematic part: although 'top' shows zero CPU usage for a pool of 100 threads sleeping in an infinite loop, it is still possible that actual CPU usage due to signal delivery overhead is noticeable).

The code does not contain kevent syscall wrappers yet. I will think about dynamic library loading tricks which allow 'replacing' syscalls at runtime.

Enough for today - I'm going climbing.

/devel/threading :: Link / Comments (0)


pthread_create() vs. clone().


Have you ever tried to use clone() directly? I bet you never tried it, at least with recent kernels.
First, the exported clone() does not correspond to what the kernel expects; it looks like it is only provided for compatibility. The manpage for that call is utterly obsolete and incorrect (apart from a useful description of flags, it contains a _wrong_ description of parameters, at least for i386).
But I do not look for easy ways: I have the glibc sources and can dig into them.
That was my first impression, that of a man who in theory can climb Everest, fly to space and understand the math behind string theory (the latter only if time permits; it looks like a whole life can be spent there digging into more and more new subtheories).
Now I think that all three tasks described above are much, much, much more solvable than digging in the glibc sources. And those people say that I described kevent poorly: hey, look into the glibc NPTL implementation (and I am not even talking about its coding style) and pray you will never see it again, or just try to start a new thread using clone().

After about an hour of reverse engineering, trying to make __clone() work (note that clone() does not work at all, just forget about this call; only __clone() is correct for i386 and a 2.6 kernel), I managed to start a new thread. It was a win, except for the very small problem that it crashed somewhere in the provided function's calling chain.
I want you to know that I do not know the low-level i386 arch well enough to easily read and understand the asm code found in sysdeps/unix/sysv/linux/i386/clone.S (some years ago I managed to write an asm application which entered protected mode in DOS, but I no longer remember asm, and actually never understood gas semantics well enough), so I miserably failed to proceed.
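
For readers who want to try this today: on a current Linux with a current glibc, the wrapper documented in clone(2) can be used as in the sketch below. That says nothing about the 2007 i386 setup described above; the demo_ names are made up, and the child is a plain process sharing memory (CLONE_VM), not a full thread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

#define DEMO_STACK_SIZE (64 * 1024)

static volatile int demo_shared;

static int demo_child(void *arg)
{
	/* Visible to the parent because CLONE_VM shares the address space. */
	demo_shared = *(int *)arg;
	return 0;
}

/* Run demo_child() through clone() and wait for it; returns 0 on success. */
static int demo_clone_run(int value)
{
	char *stack = malloc(DEMO_STACK_SIZE);
	int status, pid, ok;

	if (!stack)
		return -1;

	/* The stack grows down on x86, so the top of the buffer is passed. */
	pid = clone(demo_child, stack + DEMO_STACK_SIZE,
		    CLONE_VM | SIGCHLD, &value);
	if (pid < 0) {
		free(stack);
		return -1;
	}

	ok = waitpid(pid, &status, 0) == pid &&
	     WIFEXITED(status) && WEXITSTATUS(status) == 0;
	free(stack);
	return ok ? 0 : -1;
}
```

A real threading library would add CLONE_THREAD, CLONE_SETTLS and friends, which is where most of the complexity seen in the glibc sources comes from.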

Yes, I started to use pthread_create() for SMP scalability. I do not hear you screaming 'loser', since you would be there too, but those of you who still lurk here and ironically nod your heads had better point me to something useful for understanding how modern i386 (or actually any other arch) starts and works with threads/processes.

/devel/threading :: Link / Comments (0)


Initial implementation of the ntl (new threading library) M-on-N threading library.


Well, I cannot find a better prefix than ntl, which is an extremely non-ordinary abbreviation for 'new threading library'.

Anyway, the current version is very initial: it does not contain a scheduler and does not contain kevent-driven wrappers on top of the usual IO syscalls, but it already has all the initialization mechanisms, a cache of threads and all the structures required for scheduling.

There are two major problems uncovered by this initial implementation.
The first one is the scheduling problem. Since NTL does not contain a dedicated scheduling thread, it is quite hard to schedule functions which do not perform syscalls, for example those which just run a while(1); loop, eat 100% CPU and never enter the NTL layer. To solve this problem I need to add a timer and an appropriate signal handler where rescheduling will happen, which in theory can lead to performance degradation and to problems with an alarm signal registered in the thread function (although that should be fixed with kevent timer notifications).
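
A timer-driven rescheduling point can be sketched with setitimer() and SIGALRM. The names below are hypothetical, and the handler only counts ticks where a real scheduler would select the next thread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t demo_ticks;

static void demo_tick(int sig)
{
	(void)sig;
	demo_ticks++;	/* a real scheduler would switch threads here */
}

/* Arm a periodic SIGALRM every interval_us microseconds; 0 on success. */
static int demo_start_timer(long interval_us)
{
	struct sigaction sa;
	struct itimerval it;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = demo_tick;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGALRM, &sa, NULL) < 0)
		return -1;

	it.it_interval.tv_sec = interval_us / 1000000;
	it.it_interval.tv_usec = interval_us % 1000000;
	it.it_value = it.it_interval;
	return setitimer(ITIMER_REAL, &it, NULL);
}

static void demo_stop_timer(void)
{
	struct itimerval off;

	memset(&off, 0, sizeof(off));
	setitimer(ITIMER_REAL, &off, NULL);
}

/* CPU-bound loop without syscalls: only the timer can interrupt it. */
static void demo_spin_until(int ticks)
{
	while (demo_ticks < ticks)
		;
}
```

The conflict mentioned above is visible here: the sketch takes over SIGALRM and ITIMER_REAL, so a thread function which installs its own alarm handler would break it.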

Another problem is futex performance.
In the current code there are two locks implemented as semaphores, which in modern Linux are translated into futexes: the scheduler lock, which guards the queue of threads, and the stack cache lock, which guards the list of free thread stacks.
So the usual thread creation, empty function call and thread exit in NTL turn into the following operations:

mmap2(NULL, 8396800, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7684000
sigprocmask(SIG_BLOCK, NULL, [])        = 0
futex(0xb7fdac20, FUTEX_WAKE, 1)        = 0
sigprocmask(SIG_SETMASK, [], [])        = 0
futex(0xb7fdac20, FUTEX_WAKE, 1)        = 0
futex(0xb7fdaaa0, FUTEX_WAKE, 1)        = 0
sigprocmask(SIG_SETMASK, [], NULL)      = 0
futex(0xb7fdaaa0, FUTEX_WAKE, 1)        = 0
...
munmap(0xb7fd7000, 4096)                = 0
so we get four additional futex calls, since two locks are processed: one when the stack is unlinked and returned to the stack cache, and another when the thread is added to and removed from the scheduler's queue.

Performance differs noticeably (the test case creates a thread which exits immediately, repeated the requested number of times):
$ ./ntl_test 100000
num: 100000, diff: 388234, speed: 3.882340.
Compared to 1.793600 microseconds without futex calls.

In this situation there is no concurrency at all, it is a synthetic test, so one _empty_ futex call actually takes about 0.5 microseconds, of which pure syscall overhead is 50% (this is an Intel Core Duo 3.40GHz (running at 3.7 GHz) test machine).
I cannot say whether futex performance is slow or fast, but I would like to avoid this cost, so in practice semaphores should not be used for thread serialization; instead, lightweight locks must be introduced.
In the current code all locks are abstracted and implemented in a separate file, so lock changes are trivial, but I do not want to introduce per-arch code right now.

/devel/threading :: Link / Comments (0)


Bruce Schneier's facts.


Super!

When Bruce Schneier observes a quantum particle, it remains in the same state until he has finished observing it.

Most people use passwords. Some people use passphrases. Bruce Schneier uses an epic passpoem, detailing the life and works of seven mythical Norse heroes.

Bruce Schneier writes his books and essays by generating random alphanumeric text of an appropriate length and then decrypting it.

/other :: Link / Comments (0)


New kevent 'take33' release.


It is a minor release which only contains the following changes:

  • Updated documentation (aio_sendfile_path()).
  • Fixed typo in forward declaration.

/devel/kevent :: Link / Comments (0)


Tue, 16 Jan 2007

The world becomes demented.


I was invited by a Russian company, which held a similar course last year, to participate in a training course at The International Institute of Information Technology (I2IT) in India on the topic "Advanced Linux Kernel Programming" as a lead teacher (or as a knowledge base for teachers?).

Actually I do not even know what to answer. Really.

I do think I'm not that bad a kernel hacker, although definitely not a brilliant one, and I have limited knowledge compared to many of those whose names you can find in the Linux kernel sources, but I would never teach people how to hack the Linux kernel (actually I doubt I would like to teach anyone to do anything at all, maybe only let some advice fly out from time to time).

Kernel hacking is like a breeze - you either like it or not. If you like it, you just start doing it, initially with small steps, later with big ones, just like with any other task; and if you do not, then no matter how good the teacher is, the result is unlikely to be really positive.

/devel/other :: Link / Comments (0)


Mon, 15 Jan 2007

Extremely powerful climbing evening.


That was a great training - a lot of traces, really a lot, most of them combined into pairs or threes with a rest in between - I even managed to complete some new traces and some really old ones. And magically I was not that tired; of course my hands became weaker with time, but that means trace completion became more technical. I completed the trace with the horizontal (negative) slope start with only minor problems (and if you take into account that it was almost the last one I climbed that day, it can be considered good climbing), which was a bit of a surprise to me.
Later the dry sauna almost killed my body - it became so slack that I even sat for several minutes without moving just to get myself back into shape.

It was just perfect training. Excellent time.

/life :: Link / Comments (0)


Initial benchmarking of pthread_create() vs. makecontext().


The benchmark is simple - create a new thread, the thread function immediately exits, the parent thread waits for it to finish and starts again.
One case is pure pthread_create()+pthread_join(); the other one is getcontext(), stack allocation (8 MB as of my current rlimit), makecontext(), swapcontext(), then the thread function immediately exits and the stack is freed.

Obviously I expected makecontext() to be much faster, but (time is the number of microseconds needed to create/destroy one thread, i.e. to perform the sequence described above):

$ ./test_pthread 100000
num: 100000, diff: 1402225, time: 14.022250.
$ ./test_context 100000
num: 100000, diff: 1322459, time: 13.224590.
Impossible - something was completely wrong and some other world's magic was mixed in; that was my first impression.
But when I studied at MIPT, I was told at every physics lab that there is no magic, so I started to think.
The only thing my brain could think of was to run strace.
So I did, and found the following interesting things.

Pthread case:
...
mmap2(NULL, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7622000
...
clone(child_stack=0xb7e214c4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
	CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|CLONE_DETACHED, 
	parent_tidptr=0xb7e21bf8, {entry_number:6, base_addr:0xb7e21bb0, limit:1048575, seg_32bit:1, 
	contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e21bf8) = 24426
clone(child_stack=0xb7e214c4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
	CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|CLONE_DETACHED, 
	parent_tidptr=0xb7e21bf8, {entry_number:6, base_addr:0xb7e21bb0, limit:1048575, seg_32bit:1, 
	contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e21bf8) = 24427
...
and so on - everything looks ok - one stack allocation, and then it was reused, since the stack was not freed but was put into a cache by the NPTL implementation.

Here is ucontext case:
...
mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7659000
sigprocmask(SIG_BLOCK, NULL, [])        = 0
sigprocmask(SIG_SETMASK, [], [])        = 0
sigprocmask(SIG_SETMASK, [], NULL)      = 0
munmap(0xb7659000, 8392704)             = 0
mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7659000
sigprocmask(SIG_BLOCK, NULL, [])        = 0
sigprocmask(SIG_SETMASK, [], [])        = 0
sigprocmask(SIG_SETMASK, [], NULL)      = 0
munmap(0xb7659000, 8392704)
...
So mmap()/munmap() was the culprit - my context allocation code did not use a cache of stacks; instead it allocated and freed a new one for each new context. After I emulated cache usage, I got the following strace for the context case:
mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb75fc000
sigprocmask(SIG_BLOCK, NULL, [])        = 0
sigprocmask(SIG_SETMASK, [], [])        = 0
sigprocmask(SIG_SETMASK, [], NULL)      = 0
sigprocmask(SIG_BLOCK, NULL, [])        = 0
sigprocmask(SIG_SETMASK, [], [])        = 0
sigprocmask(SIG_SETMASK, [], NULL)      = 0
sigprocmask(SIG_BLOCK, NULL, [])        = 0
...
munmap(0xb75fc000, 8392704)             = 0
And the following results:
$ ./test_pthread 100000
num: 100000, diff: 1402225, time: 14.022250.
$ ./test_context 100000
num: 100000, diff: 179360, time: 1.793600.
As expected, there is no magic: userspace context creation and switching is more than 7 times faster than real thread creation (14.02 vs. 1.79 microseconds), and the mmap()/munmap() syscalls cost about as much as clone() does. An empty syscall on this machine takes about 0.25 microseconds, so its overhead is negligible.
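
The cached-stack variant of the context test can be sketched roughly like this. It is not the original test_context source: the names (run_once(), cached_stack) are made up for illustration, the cache holds a single stack, and the stack is a small malloc() buffer instead of the 8 MB mmap() region seen in the strace output.

```c
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)	/* illustrative size, not the 8 MB rlimit */

static ucontext_t main_ctx, thread_ctx;
static void *cached_stack;	/* one-entry "stack cache", allocated once */
static unsigned long runs;	/* how many times the thread function ran */

/* The "thread" body: exits immediately, like in the benchmark. */
static void thread_func(void)
{
	runs++;
	/* Returning resumes main_ctx because of uc_link below. */
}

/* Create a context on the cached stack, switch to it, and come back.
 * Returns 0 on success, -1 on failure. */
static int run_once(void)
{
	if (!cached_stack) {
		cached_stack = malloc(STACK_SIZE);
		if (!cached_stack)
			return -1;
	}

	if (getcontext(&thread_ctx) < 0)
		return -1;

	thread_ctx.uc_stack.ss_sp = cached_stack;
	thread_ctx.uc_stack.ss_size = STACK_SIZE;
	thread_ctx.uc_link = &main_ctx;	/* where to go when thread_func returns */
	makecontext(&thread_ctx, thread_func, 0);

	return swapcontext(&main_ctx, &thread_ctx);
}
```

Calling run_once() in a loop and dividing the elapsed time by the iteration count gives a per-context figure of the same kind as the "time:" value above; the sigprocmask() calls visible in strace come from the context functions saving and restoring the signal mask.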

/devel/threading :: Link / Comments (0)


Bring me candies from Linux.Conf.Au.


If I were there, I would visit the following presentations:

Non-development activities I would like to participate in:

/devel/other :: Link / Comments (0)


Sun, 14 Jan 2007

NPTL thread stack usage.


NPTL (as of glibc-2.5) uses the following tricky stack usage model. When a new thread is created, some stack must be allocated for it, so NPTL keeps a cache of stacks of different sizes. On thread creation that cache is searched for the entry with the nearest but bigger size; if such a stack is found and it is not much bigger than requested (less than 4 times bigger), it is returned to the user (and the global variable which holds the total size of the cached stacks is reduced). If there is no appropriate stack, one is allocated from anonymous memory using mmap() with at least the MAP_PRIVATE | MAP_ANONYMOUS flags, which in particular means that pages are not allocated at all, but the copy-on-write mechanism is used (i.e. real allocation happens only when the user writes to the allocated pages).
The maximum size of the stack cache is 40 MB; the default stack size is taken from rlimits.
The stack is protected by guard pages (if required) and part of it is used as the thread control block.
A simple and good model.
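
The lookup rule described above (nearest bigger stack, but less than 4 times the requested size) can be sketched as follows. This is an illustration only, not glibc code: the structure and function names are hypothetical, and a plain singly linked list stands in for whatever NPTL really uses.

```c
#include <stddef.h>

/* Hypothetical cache entry: one previously used stack. */
struct cached_stack {
	struct cached_stack *next;
	size_t size;
	void *mem;
};

struct stack_cache {
	struct cached_stack *head;
	size_t total;	/* total size of all cached stacks */
};

/* Put a no longer used stack back into the cache. */
static void stack_cache_put(struct stack_cache *c, struct cached_stack *s)
{
	s->next = c->head;
	c->head = s;
	c->total += s->size;
}

/* Find the smallest cached stack that is >= want and < 4 * want.
 * The entry is unlinked and the total is reduced; NULL means the
 * caller has to fall back to mmap(). */
static struct cached_stack *stack_cache_get(struct stack_cache *c, size_t want)
{
	struct cached_stack **pp, **best = NULL;
	struct cached_stack *s;

	for (pp = &c->head; *pp; pp = &(*pp)->next) {
		if ((*pp)->size < want)
			continue;
		if (!best || (*pp)->size < (*best)->size)
			best = pp;
	}

	if (!best || (*best)->size >= 4 * want)
		return NULL;

	s = *best;
	*best = s->next;	/* unlink from the list */
	s->next = NULL;
	c->total -= s->size;
	return s;
}
```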

Exactly the same model will be used for stack allocation in the M-on-N threading model.

There is another interesting memory issue in NPTL - so-called thread-local storage (aka TLS).
It allows, for example, the global errno variable to be per-thread. This requires extending the programming language (C and C++ have __thread variables) and the ELF header. The reasons to have thread-local storage are obvious, and the main one is performance, but it has too many problems with symbol lookups, object allocation and other issues (more details can be found in this article), so I will not implement it.

Furthermore, my library will actually be a _trivial_ layer on top of glibc, which will just provide _new_ symbols for programmers, like new_write(), and if the approach is considered good, the glibc maintainers (namely Ulrich Drepper, with whom I had some discussion about kevent (it looks like we still do not agree 100% on all issues, but there was major progress)) could adopt the design.
My library will use makecontext() and friends from glibc as a basis; these are actually just wrappers which work with registers and other information meaningful for a running process.

/devel/threading :: Link / Comments (0)


Meanwhile on flat development front.


As you have already found out, I have 90% completed the hinged ceiling painting - I selected a shiny orange colour to mark the sleeping area (that is where the hinged ceiling is located); later I will paint the rest of the ceiling, likely in something calm like slightly blue or green. The kitchen will likely have beige colours.
I also completed the self-leveling floor filling everywhere except half of the room, where the building materials are placed - I even started to walk around the room without shoes, although it is quite cold.
It seems that I've fixed the central heating radiator, or at least made it work a bit better - it required draining the water, disassembling and reassembling the flow mechanism (only two parts can be easily removed though), and some physical (and indecently worded) "influence". Now I feel much more comfortable in the apartment, although it is quite cold there, at least without clothes at 6 A.M.
If things keep going in a good direction, I plan to paint the whole ceiling this week and then start to hang wallpaper and lay the floor covering.
Eventually I will start the bathroom development...

/devel/flat :: Link / Comments (0)


A super quest.


What is it?

A quest

The answer can be found in the gallery, starting from this item, but do not look there until your first answer is ready. I am almost sure it will be wrong, but if you are that lucky, feel free to ask me for a beer.
This super orange colour (better seen in the gallery) was selected to mark the sleeping area.

/devel/flat :: Link / Comments (0)


40 centimeters of snow in Moscow!


40 centimeters of snow

/other :: Link / Comments (0)


Sat, 13 Jan 2007

Filesystem quest.


I've asked the Unix admins I know (mostly Solaris and Linux, bits of AIX and *BSD, and a lot of Windows knowledge) which FS features are the most required and requested ones. Here is a short list in order of priority:

  • Clearness of the idea, i.e. no reiserfs-in-reiserfs problem or things like Solaris UFS on a highly fragmented disk.
  • Reliability in operation. Different modes of journalling. Very fast recovery and fsck checks; for example ext3 and reiserfs with their 4 minutes per 250 MB when there are _no_ errors are completely unacceptable when the storage contains 20 or more partitions.
  • ACL support and quotas.
  • Defragmentation. At least a tool to determine how fragmented the current partition is, and possibly to defragment it. Currently it can only be fixed by moving the data around through an empty partition.
    (Although I personally consider defragmentation a hack, a tool for showing its status could be interesting. The main solution for that problem, as I see it, is delayed allocation, and defragmenting through delayed allocation when a file is read into the VFS cache.)
  • Speed. Although admins rank this item differently, it is very important to have a high read speed compared to the hardware disk speed, at least in the most common setups (web/file/mail server) - current hardware easily allows speeds of more than 50 MB/sec, while filesystems rarely get over the 25 MB/sec limit.
  • Snapshots. For backup purposes read-only snapshots are the killer feature. Read-write snapshots are also an interesting feature (when the main partition is mounted for the first time, it would be useful to be able to mount it read-write from some point in the past).
  • Additional data processing units like compression, cryptography and so on, especially with an interface to plug in external modules.
As you have probably already understood, this will be the TODO list for my own FS development.

/devel/fs :: Link / Comments (0)


Hinged ceiling is almost ready.


My hinged ceiling

I think it was the most complex and heavy part of my development process. It has two layers with small boardings; eventually I will place a neon cord into the boarding and dotted lights into the ceiling, but right now it only has three layers of plaster and waits for the last one, which will colour it slightly yellow-beige, similar to my hammock, but smoother. The plaster was applied not as a straight surface, but with random curviness similar to a surface made by spread material (I've found words for that - consider a painting made with oil paints using a brush - that is similar to what I want my ceiling to look like). The stucco plaster is called "Venetian patterns" or something like that, so you can imagine how it will look. If I complete it tomorrow, I will post some photos too.

Some more photos can be found in the gallery.

Meanwhile I've ordered my entrance door (for the second time already) - it is a simple but very hard-to-break steel door. It will be ready around Jan 23-24; when it is installed, I will finally feel like I am living in an apartment, not on a construction site.

/devel/flat :: Link / Comments (0)


I'm pleased to congratulate you with The New Year!


Yes, again. We are so cool that we celebrate New Year two times per year - Jan 1 and Jan 13.
I think you know what to do when the new year has come, so do not look here - go and make something cool.

I also congratulate Abr on his birthday - since he is in London, I can not celebrate with him and his family (Hello, Tanya and small Anton Andreevich), but I transmit my wishes over the Internet (if you ever read my blog).
Happy Birthday!

/life :: Link / Comments (0)


Fri, 12 Jan 2007

Climbing evening.


A short but quite productive training - not too many traces were completed, since the training started one hour after the usual start at 18-30, but nevertheless I completed several simple warm-up traces and four quite complex old traces - I found that some of them had been slightly changed, which made me fail, but only for a moment. I had not 100% recovered from the previous training (the first one this year), so I did not even try anything completely new and really complex, but nevertheless my body is aching and that is a good sign.

/life :: Link / Comments (0)


SMP scalability in M-on-N threading model.


Actually it is simple.
Since only the kernel can schedule a thread (actually not even a thread or process, but the kernel's own representation of it, the so-called kernel virtual process) to run on a specified CPU, the M-on-N threading model should have several real threads (for example several current POSIX threads), whose number should be equal to the number of real CPUs. The library layer will then schedule execution of contexts on the different real threads, each of which in turn can run on a separate CPU.

So userspace will create new real threads when pthread_create() is called as long as their number is less than the number of real CPUs. Each real thread in turn is a context in the global tree of contexts, to which a fake context will be added by every subsequent pthread_create() call, and the userspace scheduler (backed by the real threads) will pick up several contexts from the tree and execute them on the real CPUs.

Simple and looks really good.
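
The creation policy can be sketched in a few lines. This only shows the decision "real thread or fake context"; the mn_sched structure and function names are hypothetical, no threads are actually started, and sysconf(_SC_NPROCESSORS_ONLN) is just one assumed way of asking how many CPUs are online.

```c
#include <unistd.h>

enum ctx_kind {
	CTX_REAL_THREAD,	/* backed by a kernel-visible thread */
	CTX_USER_CONTEXT,	/* fake context, scheduled in userspace */
};

/* Hypothetical scheduler bookkeeping. */
struct mn_sched {
	long ncpus;		/* number of online CPUs */
	long real_threads;	/* real threads created so far */
	long user_contexts;	/* fake contexts created so far */
};

static void mn_sched_init(struct mn_sched *s)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);

	s->ncpus = n > 0 ? n : 1;
	s->real_threads = 0;
	s->user_contexts = 0;
}

/* Decide what a pthread_create()-like call should produce: a real thread
 * while there are fewer of them than CPUs, a fake context afterwards. */
static enum ctx_kind mn_thread_create(struct mn_sched *s)
{
	if (s->real_threads < s->ncpus) {
		s->real_threads++;
		return CTX_REAL_THREAD;
	}

	s->user_contexts++;
	return CTX_USER_CONTEXT;
}
```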

/devel/threading :: Link / Comments (0)


Buy Sealand.


Sealand is a small island/platform situated approximately 7 nautical miles from the coast of the UK in the North Sea.
It has sovereignty.
One can buy (and/or develop) one's own nation there with its own laws, which in turn allows registering firms there without any taxes, even better than in today's off-shores.
Its price is $500 million, but there are other islands too, with prices starting from just $50k.

For example this set of islands with a price of less than $250k, although I'm not sure that one can declare sovereignty there.

/other :: Link / Comments (0)


Winter in Russia.


One year ago the temperature in Moscow was 2 degrees higher than in Antarctica; it was the coldest winter in history (of instrumental measurements). Now we have about +/-1 degrees Centigrade - the hottest winter in history.

Winter is the time of the planned Matrix update. To free calculation resources for garbage collection, the Matrix shortens the daylight, removes leaves from trees, and paints the sky a uniform gray colour. This reduces picture calculation. Previously the Matrix covered the whole surface with uniform white snow, but with the new powerful servers it is not needed anymore. They say that after the next upgrade there will not be any need for a dedicated winter at all.

/other :: Link / Comments (0)


Thu, 11 Jan 2007

Userspace threading and its benefits and drawbacks.


Benefits.

1. Fast scheduling.
There is no need to cross the userspace/kernelspace boundary to schedule new thread execution (just watch what happens with a userspace network stack compared to the kernel's one when a lot of syscalls are performed for receiving/sending small packets).

2. Fast thread creation and destruction.
It just becomes an allocation of a structure in userspace; there is no need for the full creation process which is performed in the clone() syscall.

3. Smaller number of cache misses.
Since there is only one process instead of several threads, cache locality is increased greatly and the number of misses is reduced.

Drawbacks.

1. Scheduling fairness.
Since the kernel does not know about the multiple threads behind a given process, it can not give it the appropriate number of timeslices for execution.
This can be solved either by tighter collaboration between the userspace and kernelspace schedulers or simply by changing the process' nice value.

2. All communications are performed through one kevent pipe.
Which can be problematic (although the interface was specially designed to be scalable).

3. Complex code for good SMP scalability and the userspace scheduler.
I wanted to put it into the 'Benefits' section, since that is exactly why I started this project.

/devel/threading :: Link / Comments (0)


Threading issues and ways to resolve them.


1. Signals.
POSIX requires that signals be delivered on a per-thread basis, but the signal handler, and thus the fact that a signal is ignored or not, is a per-process property. With kevent's ability to deliver signals through its queue the problem can be solved in a very elegant way - the main process receives a signal event notification through its kevent queue and then checks all its threads which have that signal unblocked; all appropriate threads receive the signal through the alternative signal stack.

2. Kevent/poll usage in the threads.
Poll() and select() must be translated into a kevent request in the syscall wrapper, for example the way I implemented epoll on top of kevent, and then that event will be put into the main kevent queue.

3. Sleep and similar system calls.
Kevent has timer notifications which will be used to emulate such calls. Calls for POSIX timers can be emulated through kevent POSIX timers support, but I will probably not consider this for the initial implementation.

4. Blocking inter-process communications like semaphores.
They must be converted to userspace kevent notifications.

All of the above can look like the old LinuxThreads days before NPTL, when there was a special management thread which performed a lot of that functionality (namely signal handling and resource cleaning, which is not a problem for this new implementation, since all resources will be automatically cleaned when the process exits, no process-visible resources like file descriptors are closed on thread cancellation, and signals can be handled perfectly with kevent's capabilities), but now it has moved into a layer between the kernel (or glibc for the initial implementation) and the application (i.e. a scheduler; I think that is the correct name, since the main task of that layer is exactly scheduling). But actually it does not differ at all from what we have right now with NPTL and the 1-on-1 thread model - exactly the same tasks are performed by the kernel, but with additional layer crossing overhead.

/devel/threading :: Link / Comments (0)


Initial thoughts about userspace threads (or M-on-N threading model).


Let's see what we already have.
Glibc provides us with makecontext() and friends, which are essentially a part of the userspace execution mechanism - one can create a context, run it, swap it and so on. That is something I want to implement, except for its problems. A context switch can be performed from an outside thread (that is how IBM NGPT was implemented); that is not the main issue, although I really do not like such an approach. The main problem is that if such a context is going to block, that fact can not be detected from other contexts, and thus it is impossible to swap it with another one. Even if some check is done in each syscall, or even if each syscall is a rescheduling point, that means that either each syscall must be non-blocking, or the whole process will go to sleep in the syscall, since the kernel does not know that there are several contexts in the same process.

So the solution is to have some kind of thin layer between the kernel and userspace (in the real world it is called glibc), which will convert all syscalls into non-blocking operations (including nanosleep() and the like), and keep track of what each context has performed. In practice a glibc rewrite is not what I would like to do; instead some layer on top of it will be implemented, which will convert syscalls into kevent operations and become a rescheduling point. I will even consider not implementing exactly the known syscalls, but instead (at least for the initial implementation) introducing new calls, which will be wrappers for the known ones - for example new_write() will be a wrapper based on kevent and the new threading model, which will set up all appropriate requests (like POLLIN) and, if possible, call write() itself. When all execution contexts are put to sleep, the whole process will park itself in a waiting syscall like kevent_get_events().
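
A rough sketch of what such a new_write() wrapper could look like is below. It is only an illustration under loud assumptions: kevent is not a standard interface, so the rescheduling point is replaced by a stand-in helper, wait_writable() (a hypothetical name), which simply sleeps in poll(); a real M:N library would register the descriptor in the kevent queue there and switch to another context instead.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Stand-in for the rescheduling point (hypothetical helper). A real
 * scheduler would queue a "writable" request and run another context;
 * here we just wait in poll() so that the sketch works on its own. */
static int wait_writable(int fd)
{
	struct pollfd p = { .fd = fd, .events = POLLOUT };

	return poll(&p, 1, -1) < 0 ? -1 : 0;
}

/* write() wrapper which never blocks inside write() itself. */
static ssize_t new_write(int fd, const void *buf, size_t len)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
		return -1;

	for (;;) {
		ssize_t n = write(fd, buf, len);

		if (n >= 0)
			return n;	/* may be a short write, as with write() */
		if (errno == EINTR)
			continue;
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;

		/* The descriptor is not ready: this is the rescheduling point. */
		if (wait_writable(fd) < 0)
			return -1;
	}
}
```

The caller uses it exactly like write(); only the place where the thread would have slept in the kernel changes.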

The main issues with such an approach are the following:

  • scheduling algorithm
  • SMP scalability
  • syscall wrappers in glibc or completely new calls (as described above)
At least the first two issues are interesting technical challenges; the last one will first be implemented with new calls.

/devel/threading :: Link / Comments (0)


Filesystem corruption bug recently found in Linux kernel.


The LWN.net article about it clearly shows how complex the VFS is, but its conclusion and Linus' words about buffer heads are interesting. The conclusion is basically 'do not use buffer heads'. Indeed, in all the time I have worked with the VFS (kevent and AIO, receiving zero-copy, a test block device for acrypto) I never ever tried to use buffer heads - why are they needed, when these days we already operate with pages - and eventually filesystems operate with pages too - they have a special set of callbacks to write pages, read them and so on.
So a filesystem must be simple in that regard - do not split pages into buffer heads, always work with pages and provide appropriate callbacks where they are needed (inode operations at least), and that is how my FS will work.

/devel/fs :: Link / Comments (0)


Wed, 10 Jan 2007

Climbing evening. First one this year.


After a week of continuous drinking, lazy-boom-slacking and other non-sport stuff, I eventually started training. And a miracle happened - Grange took his fifth point to the climbing zone too. So I started to climb on the walls, not the usual boulderings and traverses - although not a lot of traces were completed, and only one of them was new, which I miserably failed on-sight, not because it was complex, but only because it was the last trace after 3 hours of climbing; eventually I completed it.
And now (I write this in the middle of Jan 11, when I eventually brought myself to the office) I feel that every piece of my body is in pain. It is an extremely pleasant masochistic feeling - I like it.
I think if you wake up and nothing aches, then you are likely dead.
Excellent time.

/life :: Link / Comments (0)


Further development. Ideas. Plans. Agenda.


It looks like kevent is going to be frozen (see this thread in linux-kernel@ about the too fast rate of new code creation).
Netchannels are also implemented.
So I have split my TODO list into TODO and DONE parts. Currently it has the following items in the TODO section:

  • Integration of kevents into mainline
  • Some thoughts about FS (especially about journalling and the WAFL approach (I think a log-structured fs is a really progressive idea)), so it could be used with receiving zero-copy, or even be a network filesystem with distributed capabilities.
    Also need to consider a filesystem stress testing and emulation tool.
  • Some thoughts about different threading models. Especially after this analysis.
Things which originally appeared in TODO, but were implemented and essentially completed:
  • Network tree allocator, full sending and receiving zero-copy networking
  • Complete userspace network stack (which actually only requires syncing with the TCP stack used in netchannels) and moving netchannels into userspace (with the help of full zero-copy support and the userspace stack)
  • Fast NAT, which will not use the Linux connection tracking system (which is extremely slow)
  • True asynchronous IO. My thoughts are described here and here.


Actually kevent integration could be put into the DONE section too, since it does not require too much effort to maintain it in a separate tree with regular syncs with mainline - no one requests more, so it will silently live and work in its own repository.

Although netchannels do not work with my network tree allocator, I will not work on this project for a while due to the absence of a real need for zero-copy.

The next item in the TODO list is the filesystem and related development - it is indeed a very interesting task, and my next killer project will definitely be a new filesystem, which, as a bare minimum, will be faster than the speed of light and will scale beyond the universe.
But it is a very complex task and I want to spend some time on something different from the kernel, but no less interesting.

So I select a new threading model implementation - essentially it will be the so-called N:M threading model, implemented on top of the POSIX interface - I want to create a library which could be used instead of glibc's libpthread, with at least the main functionality.

Or maybe just take a vacation and move to the edge of the Earth. Stop, since the Earth is roughly a sphere, I am already on the edge. Crap.

Hmm, or maybe drop my current job and organize something of my own - the more I work on my own projects, the more I like the idea, but not right now.

Will go hacking.

/devel :: Link / Comments (0)


New kevent 'take32' release.


This early morning hack (I hope no one at work reads what I'm doing) contains the following major changes:

  • Added aio_sendfile_path() - this syscall allows asynchronously transferring the file specified by the provided pathname to the destination socket. The opened file descriptor is returned.
  • Added a trivial scheduler which selects the execution thread. It allows specifying a given thread 'by hand', but since kaio provides '-1', it uses round-robin to pick the processing thread. In theory it can be bound to scheduler statistics or gamma-ray receiver data.
  • A number of bug fixes in the kevent based AIO mpage_readpages().
A benchmark of transferring 100 1MB files (the files are in the VFS cache already) using sync sendfile() against aio_sendfile_path() shows about a 10MB/sec performance win (78 MB/s vs 66-72 MB/s over a 1 Gb network; the sendfile sending server is a one-way AMD Athlon 64 3500+) for aio_sendfile_path().

Well, call me a loser, but I have started a 3-day resending timeout.
I know it is annoying and disturbing, and I really doubt it is a good way to tell the world about my work, and I bet you are all tired of those pathos words, but I really would like to get some feedback, since I want to start working on network AIO, but sending mails into an unfeedbackable destination (I use the word 'destination' for hackers and mailing lists, but you actually know what I mean) really does not motivate me for that.

/devel/kevent :: Link / Comments (0)


Mon, 08 Jan 2007

New kevent 'take31' release.


Short changelog:

  • AIO state machine
  • aio_sendfile() implementation
  • moved kevent_user_get/kevent_user_put into header
  • use *zalloc where needed
aio_sendfile() is implemented as described here, except for the destructor callbacks, which are essentially the same, except that they reschedule AIO processing.

Benchmarks of simple sendfile() vs. aio_sendfile() did not show a noticeable win for either approach, but I want to note that the receiving side is not the best in my setup (I managed to create a test where the usual sendfile() got stuck until a signal was received; the same happened with aio_sendfile() too, but the latter can be buggy).
It would be good to use it in lighttpd (I will create a patch after some feedback is received and the approach is considered good).

The AIO state machine is a base for network AIO (which becomes quite trivial), but I will not start the implementation until the roadmap of kevent as a whole and of the AIO implementation becomes more clear.

Patches and the userspace test application are available in the archive on the kevent homepage.

/devel/kevent :: Link / Comments (0)


Merry Christmas!


Although it was yesterday, we celebrated Christmas and Mephody's move from Moscow to Limerick, Ireland. We first sat in "5 oborotov" on Sadovaya-Triumfalnaya, where Schtrom's family, Lyasha with his wife and Mephody's old friend Andrew with his wife joined us. We sat there until midnight, and then moved to the next bar - eventually we landed in "Aero cafe" on Pokrovka, where we spent the whole night and went home around 6 A.M.
What I got from that night, besides a good time with old friends, is a strong feeling that the most interesting and fun moments always happen when I'm drunk. I do not know why. And that I am too old for modern electronic music and clubs.
And it looks like I am finally tired of drinking - and since Meph has returned to Ireland and all the others are too inert to be dragged to a restaurant, no new adventures are on the nearest horizon, so I can concentrate on hacking (which has actually happened only a couple of times this year), climbing, flat development and other healthy things.

/life :: Link / Comments (0)


Sat, 06 Jan 2007

Kevent celebrates its first birthday!


As a present I have created an initial aio_sendfile() implementation.

/devel/kevent :: Link / Comments (0)


Initial aio_sendfile() implementation has been committed into kevent tree.


It is still very rough and definitely must be cleaned up (and some known bugs fixed), but the major part is done.

aio_sendfile() consists of two major parts: the AIO state machine and the page processing code.
The former is just a small subsystem which allows queueing callbacks for invocation in process context by a pool of kernel threads. It allows queueing caches of callbacks to the local thread or to any other specified one. Each cache of callbacks is processed as long as there are callbacks in it; callbacks can requeue themselves into the same cache.
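
The "cache of callbacks processed until it is empty, with callbacks allowed to requeue themselves" idea can be sketched in plain userspace C. This is not the code from kevent_aio.c: all names here are hypothetical, there are no kernel threads or locking, and the requeue decision is simply the callback's return value.

```c
#include <stddef.h>

/* Hypothetical callback descriptor. */
struct aio_cb {
	struct aio_cb *next;
	int (*func)(struct aio_cb *cb);	/* nonzero return: requeue me */
	void *priv;			/* callback private data */
};

/* Hypothetical cache of callbacks: a simple FIFO. */
struct aio_cache {
	struct aio_cb *head, *tail;
};

static void aio_cache_queue(struct aio_cache *c, struct aio_cb *cb)
{
	cb->next = NULL;
	if (c->tail)
		c->tail->next = cb;
	else
		c->head = cb;
	c->tail = cb;
}

/* Process the cache until no callbacks are left in it.
 * Returns the number of callback invocations. */
static unsigned int aio_cache_run(struct aio_cache *c)
{
	unsigned int calls = 0;
	struct aio_cb *cb;

	while ((cb = c->head) != NULL) {
		c->head = cb->next;
		if (!c->head)
			c->tail = NULL;

		calls++;
		if (cb->func(cb))
			aio_cache_queue(c, cb);	/* callback asked to run again */
	}

	return calls;
}

/* Example callback: decrements the counter behind priv and requeues
 * itself until the counter reaches zero. */
static int countdown_cb(struct aio_cb *cb)
{
	int *left = cb->priv;

	return --(*left) > 0;
}
```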

Real work is being done in page processing code - code which populates pages into VFS cache and then sends pages to the destination socket via ->sendpage().
Unlike previous aio_sendfile() implementation, new one does not require low-level filesystem specific callbacks at all, instead I extended struct address_space_operations to contain new member called ->aio_readpages(), which is exactly the same as ->readpage() (read: mpage_readpages()) except different BIO allocation and sumbission routines.
I changed mpage_readpages() to provide mpage_alloc() and mpage_bio_submit() to the new function called __mpage_readpages(), which is exactly old mpage_readpages() with provided callback invocation instead of usage for old functions. mpage_readpages_aio() provides kevent specific callbacks, which calls old functions, but with different destructor callbacks, which are essentially the same, except that if page becomes uptodate, it is not unlocked, so that it could not be removed until it is sent, and only then it is unlocked.

The code does contain a bug (at least one) I know about: subsequent attempts to send pages happen not after the BIO is ready and the pages are thus populated into the VFS cache (i.e. marked as uptodate), but repeatedly in the state machine (rescheduling must happen in the BIO destructor, not in the code which allocates pages). Another issue is that it is currently impossible to receive a kevent notification when aio_sendfile() has really completed.

/devel/kevent/aio :: Link / Comments (0)


Thu, 04 Jan 2007

Visited Alma Mater - walked and searched for the ghosts from the past...


I visited MIPT and walked around the campus this evening, and I did not meet anyone - not even a remotely familiar face or voice.
Things change.

Then I moved on to the place of Ira's parents (Ira is Mephody's wife; it is the house next to the one where I bought my apartment, and they were staying there while visiting Moscow) to drink my new shiny liter of Irish whiskey, which he brought me as a present from Ireland. It was a fun time with Meph and Ira - we stopped at about 5 A.M., recalled friends, talked about life and its changes, about the Alma Mater. Eventually we picked over politics and music. We compared life in Limerick (Ireland) and Moscow (Russia), and found that no matter how strange life here is, it is much-much-much more interesting than in stable old Europe. Meph and Ira concluded that they would like to return to Russia, or move to France - they visited Paris recently and showed a lot of interesting photos from the trip - the Louvre, Versailles, the Eiffel Tower, Notre Dame de Paris - they are beautiful places, and the French are good and interesting people.

/life :: Link / Comments (0)


AIO (sub) state machine has been completed.


It is a small subsystem, which lives in the kernel/kevent/kevent_aio.c file and allows callbacks to be queued and invoked asynchronously. The callbacks are intended to populate pages into the VFS cache, send data to the destination socket, copy data to/from userspace and so on.

The real working callbacks themselves are not implemented yet.

I will only implement three of them - open a file by filename, populate the file's pages into the VFS cache, and send pages to the destination socket.
I will probably also add writing a page to userspace.
This set will allow aio_sendfile() to be implemented as a sequence of those callbacks - open the file by its path, then populate its pages into the VFS cache in chunks or one by one, and eventually send them to the destination socket.
There is a problem with the order of sending one page and populating its neighbour, though: having the whole VFS cache filled with locked pages from one file is not a good idea, but locking is required for the sending itself, so that the page is not swapped out. I will either stop further populating until the pages are sent, or will not sort this out at all - depending on the results of the initial implementation.
Each subtask above, i.e. each callback, is an elementary chunk which will be handled by kevent. Completion of the whole task will be handled by kevent too.
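The first option, to stop populating until the locked pages have been sent, can be sketched as a tiny state machine. This is a userspace toy in Python, assuming a hypothetical max_locked limit and string page names; it models only the ordering of steps, not any real I/O.

```python
from collections import deque


def aio_sendfile_model(file_pages, max_locked=2):
    """Return (trace of steps, peak number of locked pages).

    Steps are chained open -> populate -> send; populating stops once
    max_locked pages are locked and resumes after they have been sent.
    """
    steps = deque(["open"])
    trace = []
    locked = deque()
    next_page = 0
    peak_locked = 0
    while steps:
        step = steps.popleft()
        if step == "open":
            trace.append("open")
            steps.append("populate")
        elif step == "populate":
            # Lock and populate pages only up to the allowed window.
            while next_page < len(file_pages) and len(locked) < max_locked:
                locked.append(file_pages[next_page])
                trace.append(f"populate {file_pages[next_page]}")
                next_page += 1
            peak_locked = max(peak_locked, len(locked))
            if locked:
                steps.append("send")
        elif step == "send":
            # Sending a page is what finally unlocks it.
            while locked:
                trace.append(f"send {locked.popleft()}")
            if next_page < len(file_pages):
                steps.append("populate")
    return trace, peak_locked


trace, peak_locked = aio_sendfile_model(["p0", "p1", "p2"], max_locked=2)
```

With three pages and a window of two, the third page is populated only after the first two have been sent, so the number of locked pages never exceeds the window.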

/devel/kevent/aio :: Link / Comments (0)


Wed, 03 Jan 2007

Initial thoughts on the 'true AIO'.


Here was the first announcement of the idea, and now I will expand on it a bit. This was written after some study of the async copy work by Intel's Dan Williams, found here; the whole thread may also be interesting for those who want to know the status of AIO development and some ideas for its improvement.

A generic solution must be used to select the appropriate device to perform the actual data processing. We had a very brief discussion about the asynchronous crypto layer (acrypto) and how its ideas could be used for async DMA engines - the user should not even know how his data has been transferred. He calls async_copy(), which selects an appropriate device from the list of devices that have exported their functionality (a sync copy is just one more ordinary device in that case). The selection can be done in millions of different ways, from taking the first one in the list (this is essentially how your approach is implemented right now) to using special (including run-time updated) heuristics (like it is done in acrypto).
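The two ends of that range, "take the first device" and "use a heuristic", can be put side by side in a small model. This is a Python sketch only: the device records, the speed and queued fields and the estimated-cost formula are assumptions made for illustration and are not how acrypto or the DMA engine code works.

```python
def pick_first(devices):
    """Simplest policy: take the first device in the list."""
    return devices[0]


def pick_least_loaded(devices):
    """Heuristic policy: lowest estimated cost, (queued + 1) / speed."""
    return min(devices, key=lambda d: (d["queued"] + 1) / d["speed"])


def async_copy(src, selector, devices):
    """Pick a device with the given policy and 'copy' the data on it.

    The caller never names a device; it only gets the data back
    (plus, for this demo, the name of the device that was chosen).
    """
    device = selector(devices)
    device["queued"] += 1
    return device["name"], bytes(src)


# A synchronous CPU copy is just one more device in the list.
devices = [
    {"name": "cpu-memcpy", "speed": 1.0, "queued": 0},
    {"name": "dma-engine", "speed": 4.0, "queued": 0},
]

first_choice, _ = async_copy(b"abc", pick_first, devices)
heuristic_choice, copied = async_copy(b"abc", pick_least_loaded, devices)
```

Because the heuristic reads the queue length at call time, its choice can change as load shifts, which is the run-time updated part of the idea.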

Thinking further, async_copy() is just one particular case of the async class of operations. So the same logic must be applied at this layer too.

But

	layers are the way to design protocols, not implement them.
        	David Miller on netchannels.
So, the user should not even know about layers - he should just say 'copy data from pointer A to pointer B', or 'copy data from pointer A to socket B', or even 'copy it from file "/tmp/file" to "192.168.0.1:80:tcp"', without ever knowing that there are sockets and/or memcpy() calls. If the user asks for it to be performed asynchronously, he must be notified later (one might expect that I will prefer to use kevent :) The same approach can thus be used by NFS/SAMBA/CIFS and other users.

That is how I am starting to implement AIO (it looks like it is becoming popular):
  • the system exports the set of operations it supports (send, receive, copy, crypto, ....)
  • each operation has its own set of suboptions (different crypto types, for example)
  • each operation has a set of low-level drivers which support it (with optional performance or any other parameters)
  • each driver publishes its capabilities when loaded (async copy with speed A, XOR and so on)
From the user's point of view, aio_sendfile() or async_copy() will look as follows:
  • call aio_schedule_pointer(source='0xaabbccdd', dest='0x123456578')
  • call aio_schedule_file_socket(source='/tmp/file', dest='socket')
  • call aio_schedule_file_addr(source='/tmp/file',dest='192.168.0.1:80:tcp')
or any other similar call, and then wait for the returned descriptor in kevent_get_events(), or provide its own cookie in each call.

Each request is then converted into a FIFO of smaller requests like 'open file', 'open socket', 'get in user pages' and so on, each of which should be handled by the appropriate device (hardware or software); completion of each request starts processing of the next one.
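As a rough illustration of the scheme above, here is a userspace toy in Python: drivers register for the operations they support, a request is split into a FIFO of smaller steps, and each step runs only after the previous one has completed. The registry, the step names and the shape of the completion event are all assumptions for this sketch and do not describe the kevent API.

```python
from collections import deque

# Hypothetical registry: operation name -> list of (driver name, handler).
DRIVERS = {}


def register_driver(operation, name, handler):
    """A driver publishes which operation it can handle when it is loaded."""
    DRIVERS.setdefault(operation, []).append((name, handler))


def split_request(kind, source, dest):
    """Convert a high-level request into a FIFO of smaller requests."""
    if kind == "file_socket":
        return deque([("open_file", source),
                      ("read_pages", source),
                      ("send", dest)])
    raise ValueError(f"unsupported request kind: {kind}")


def run_request(kind, source, dest, cookie):
    """Run the steps in order and return a completion event with the cookie."""
    fifo = split_request(kind, source, dest)
    completed = []
    while fifo:
        operation, arg = fifo.popleft()
        # Take the first registered driver; a heuristic could go here instead.
        name, handler = DRIVERS[operation][0]
        handler(arg)
        completed.append((operation, name))
    return {"cookie": cookie, "steps": completed}


work_log = []
register_driver("open_file", "vfs", lambda path: work_log.append(("open", path)))
register_driver("read_pages", "mpage", lambda path: work_log.append(("read", path)))
register_driver("send", "tcp", lambda sock: work_log.append(("send", sock)))

event = run_request("file_socket", "/tmp/file", "socket", cookie=42)
```

The returned event carries the caller's cookie, which stands in for the notification the user would wait for.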

Reading the microthreading design notes created by Zach Brown (Oracle), I recall the comparison of the NPTL and Erlang threading models - they are _completely_ different models. NPTL creates real threads, which is supposedly (I hope NOT) how the microthreading design will be implemented too. It is slow.
(Or is it not, Zach? We are intrigued :)
Creating a thread is damn bloody slow compared to a correct non-blocking state machine. The TUX state machine is similar to what I had in my first kevent-based FS and network AIO patchset, and to what I will use for the current async processing work.

A bit of empty words actually, but it can provide some food for thought.

/devel/kevent/aio :: Link / Comments (0)


Tue, 02 Jan 2007

Happy New Year!


I congratulate you on this excellent day and wish you to be even cooler than you were!

I celebrated the New Year perfectly well in a small nice cellar - we had excellent company, tasty food and drinks, a lot of fun and cool things.
There are some things I cannot stop, and get a bit mad, when doing - these include drinking in good company (actually I do not drink too much), good-idea-hacking (well, I do hack a lot, which is frequently frowned upon by those who lose my attention because of it) and some other things I will not write about here (indeed), and a lot of that happened during the New Year vacation.
It was just a bloody cool celebration!

Since I'm quite alive and even feel good, I will write a few bits about myself the way Rusty did.

Here we go:

  • My first contribution to free software was a kernel patch for the ELF or 'misc' binary format loader. Sounds too cool, doesn't it? But... - it was a patch which added an error check (three lines!) for some function, copied from another loader. But it was my first one.
  • When I was 25 I did not take ballroom or any other dance classes, instead I played guitar (actually I already do not recall it completely, although not that much time has passed). Now I have a trumpet, but cannot play it (yet), I can only torment my neighbours' ears.
  • When I was in primary school, I played chess. I was good at it, so good that I even still recall how the pieces move, and managed (several years ago) to win a couple of times against the current second-prize winner of the World Youth Chess Championship (although among kids under 12). When we played, Illya was 6 though.
  • When I was in high school, I did karate and aviation modelling (the former only for about half a year, although I have kicked asses a couple of times since then and can still almost do the splits; the latter for about two or three years). When I was at university I did football, skiing and gym. The latter for about half a year (I do not even know why), the former two for about 3-4 years (since then I hate racing skis). Now I do (and like) climbing.
  • I am not brilliant, although quite cool, and I have a lot of good friends all over the world. I do not have a girl, instead they have me.
I would like to feature Grange and David Miller. Most likely the others do not have blogs.

I expect this year to be my year. Although I have no money at all, have debts, boring work and other uninteresting stuff, I do expect to achieve world domination. At least in some specific areas.
So, stay tuned and have a nice year!

/life :: Link / Comments (0)