Wed, 31 Jan 2007
Climbing evening and recent happenings.
It was quite an easy training, since for the last several days I have been in an extremely bad mood.
I do not know exactly why, maybe a seasonal lack of vitamins, or
because my guardian angel and muse are on vacation.
Anyway, not a lot of things happened recently - I brought my laptop
to HP warranty service due to hard drive problems, and I completed
some small initial legal action related to my apartment.
As for climbing, I only did simple traces today, although
I tried to do them without rest in between, so I completed
5 or 6 simple (5b-5c+) traces, a couple of traverses, and finished
with that.
Since the muse is absent I am not doing interesting tasks right now, but
for the nearest future I have scheduled an M:N threading library
update, which will fix an SMP bug and add the possibility to create a context
by hand, which will eliminate unneeded signal invocation in
some places in NTL. I also plan to replace the network stack's hash
tables with netchannel's
trie, as described in TODO.
Most likely tomorrow I will release a new kevent version, which will be just
a port to the most recent kernel tree.
/life :: Link / Comments (0)
Mon, 29 Jan 2007
Climbing evening.
I've bought myself new climbing shoes - this time I got
"La Sportiva" instead of "Boreal", and it looks like
my choice was right, although it seems I got a slightly
bigger size than ideal (I have Russian foot size 43, and my climbing
shoes are usually about 40.5, but with "La Sportiva" it could be even
smaller due to the way the shoes are designed). I tried several old traces,
and failed one of them, which is quite complex, but which I had finished without
major problems before (green 7a in the central sector in Skala-city).
At the end I tried a jump I had discovered in an old trace, but failed again -
probably it is due to the new shoes, which feel quite different on the feet,
or because of the slightly different size; time will show.
/life :: Link / Comments (0)
M-on-N threading library got a homepage.
Here one can find some design ideas and notes compiled from the threading
blog tag, which you are currently reading.
/devel/threading :: Link / Comments (0)
Atomic locks and updated starting route for M-on-N threading model.
After I started to use atomic locks (lock prefix on x86)
instead of semaphores, thread start/empty exec/stop time dropped
to 0.3 microseconds, compared to 14 microseconds for the POSIX NPTL
case.
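To illustrate what replacing a semaphore with an atomic lock can look like, here is a minimal sketch of a spinlock built on C11 atomics. The ntl_spinlock name and its functions are hypothetical and are not taken from the library; on x86 compilers typically turn the test-and-set into an atomic exchange, so no syscall is involved on either the lock or the unlock path.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration; not the library's real structure. */
struct ntl_spinlock {
    atomic_flag locked;
};

static inline void ntl_spin_init(struct ntl_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Returns true if the lock was free and is now held by the caller. */
static inline bool ntl_spin_trylock(struct ntl_spinlock *l)
{
    /* Atomic test-and-set: returns the previous value of the flag. */
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

static inline void ntl_spin_lock(struct ntl_spinlock *l)
{
    while (!ntl_spin_trylock(l))
        ; /* busy-wait in userspace, no futex() call */
}

static inline void ntl_spin_unlock(struct ntl_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

A busy-waiting lock like this only pays off when critical sections are very short, which is the case for queue and list manipulation.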
But there are problems.
The first one is that I perform initial context setup through signal invocation,
which is at least two syscalls. They are slow.
Another one is that a thread is really started only after rescheduling, which
is another signal, so another two syscalls.
The third one is that there must exist different locking primitives - for signal context
and for process context; the latter must block signals, which in turn adds
the overhead of a sigprocmask() syscall.
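The process-context side of that third problem can be sketched as follows. This is only an illustration under the assumption that the scheduler is driven by SIGALRM; the helper names are made up for the example. Each critical section costs two sigprocmask() calls, which is exactly the overhead mentioned above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stddef.h>

/* Block the (assumed) scheduler signal before taking a lock in process
 * context, so the signal handler can not interrupt the critical section. */
static int sched_signals_block(sigset_t *saved)
{
    sigset_t block;

    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    return sigprocmask(SIG_BLOCK, &block, saved);
}

/* Restore the mask saved by sched_signals_block() after unlocking. */
static int sched_signals_restore(const sigset_t *saved)
{
    return sigprocmask(SIG_SETMASK, saved, NULL);
}

/* Returns 1 if SIGALRM is currently blocked, 0 if not, -1 on error. */
static int sched_signal_is_blocked(void)
{
    sigset_t cur;

    /* A NULL set leaves the mask unchanged and only reports it. */
    if (sigprocmask(SIG_BLOCK, NULL, &cur) != 0)
        return -1;
    return sigismember(&cur, SIGALRM);
}
```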
After I fixed all the above issues (actually not fixed, but confirmed that they must
exist), the time went up to 9 microseconds, compared to 14 microseconds for the POSIX NPTL
case, for empty thread creation/destruction.
This could be fixed if I created arch-specific getcontext()-like
calls whose output would be mutually transformable into signal context information
(existing getcontext() and friends produce different data than
the signal context has, at least on x86). But I can not do it right now, since I do not know
the x86 ABI well enough (I have learned a lot over the past several days, as you can notice from
this blog, but it is still not even remotely enough).
Currently the M-on-N threading model uses ugly arch-specific hacks to start new threads,
which are actually something remotely similar to makecontext().
So the solution, which will make the M-on-N threading implementation rock, is to
convert or create getcontext() and friends so that they can be used with
signal context information.
But let the Linux motto "release early, release often" fly - I plan to release
an alpha version today even without syscall substitution support and send it to linux-kernel@ and
libc-hacker@ for review.
/devel/threading :: Link / Comments (0)
Sat, 27 Jan 2007
New userspace preemptive scheduler for M-on-N threading model is completed.
It is built on the idea that the kernel saves information about the interrupted
context on the stack, so it can be extracted from there and changed
to anything else. I do not use makecontext() and friends at all
now, since their internals are completely different from what is stored on
the signal's stack. The system works, but not 100% reliably - there is a race
in the scheduler where it is possible to reference a thread which has just
exited. I will fix this by introducing atomic operations, which will
also reduce the thread creation overhead related to the futex() syscall,
which is how semaphores are implemented, and semaphores are currently the main locks
in my threading library.
This approach can currently only be used when sources are compiled
without position-independent code support.
If the timer-signal-based scheduler is disabled, no bugs happen (which is quite obvious,
since it becomes synchronous).
I have not tested SMP scalability and there is no dynamic syscall substitution,
which will be added after I complete bug fixing in the scheduler.
The current thread creation speed test (a thread is allocated,
its empty function is called, then it is removed) shows that
one thread creation/execution/destruction takes about 1.9 microseconds,
compared to 14 microseconds for NPTL POSIX threads.
/devel/threading :: Link / Comments (0)
Thu, 25 Jan 2007
I will run some trainings this year in Auriga and MSU.
They will be devoted to Linux kernel hacking. The first lecture will be given
in about a week in Auriga, where I will talk
about Linux USB driver creation with an example of real hardware - the w1 USB bus master driver,
which you can find in the kernel sources (drivers/w1/masters/ds2490.c,
although to get it to work I needed to resolder some pins, so I am not 100% sure that
my driver for this hardware in the Linux kernel is correct, but do not tell anyone).
If things go smoothly, I will create a couple of other lectures
(I plan to talk about networking) and will
have a presentation in MSU
in a course which starts there in a month.
I consider this a preparation for my possible talks at the Linux kernel summit
and/or other development conferences, if the stars line up and I participate there.
There is non-zero probability that I will run some lectures in India this spring too
with Auriga team (with kevent talk as a bonus),
but it is too early to talk about it.
/life :: Link / Comments (0)
New kevent version 'take34' has been released.
The only change is a header pointer in aio_sendfile_path().
Now one can use aio_sendfile_path() to send a file with header
in one syscall instead of three: send(header), open(file), sendfile().
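For comparison, the conventional three-step sequence that the new call is meant to replace can be sketched with standard Linux calls. The signature of aio_sendfile_path() itself is not shown here, since it is not given in this post; the function name below is made up for the example, and the extra fstat()/close() calls only make the sketch complete.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Sends hdr followed by the whole file at path over sock.
 * Returns the total number of bytes sent, or -1 on error. */
static ssize_t send_header_then_file(int sock, const void *hdr, size_t hdr_len,
                                     const char *path)
{
    struct stat st;
    ssize_t sent, total;
    off_t off = 0;
    int fd;

    sent = send(sock, hdr, hdr_len, 0);         /* step 1: send(header) */
    if (sent != (ssize_t)hdr_len)
        return -1;
    total = sent;

    fd = open(path, O_RDONLY);                  /* step 2: open(file) */
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    while (off < st.st_size) {                  /* step 3: sendfile() */
        sent = sendfile(sock, fd, &off, (size_t)(st.st_size - off));
        if (sent <= 0) {
            close(fd);
            return -1;
        }
        total += sent;
    }
    close(fd);
    return total;
}
```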
/devel/kevent :: Link / Comments (0)
Wed, 24 Jan 2007
I congratulate Grange on his birthday!
Be good and be cool!
/life :: Link / Comments (0)
Climbing evening.
It was a simple training - just several old traces, and I magically got tired as hell,
so I finished with the 'men's start' - climbing without legs - and after this 'trace'
there was no power left, so I completed some exercises, got the usual sauna and shower,
and left.
/life :: Link / Comments (0)
Executing a different function after returning from a signal handler.
It looks a bit like writing an exploit actually, but I managed to
change signal execution path to call my own function after signal handler
is completed instead of returning to previously running context. Currently
that function is executed in previously running context, only %eip was changed.
Code is quite simple and generic enough:
struct sigframe *frame = (void *)get_ebp + sizeof(void *);
struct sigcontext *sc = &frame->sc;
The above is correct at least on x86 and x86_64 (except that the register there is %rbp),
although the structures above are CPU-specific. If %eip (or %rip)
is changed in the signal handler to point to a new function, that function will be called instead
of the function which was supposed to continue running in that context.
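The same saved instruction pointer can also be reached without walking up from %ebp: on Linux a handler installed with SA_SIGINFO receives a pointer to a ucontext_t as its third argument. The sketch below only reads the saved value; it is not the code used in this post, and the probe_* names are made up. Storing a function address into the same slot is what would redirect execution on return, and that is left out here because the redirected function then runs on the interrupted stack with no return address of its own.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes REG_RIP / REG_EIP in <ucontext.h> */
#endif
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <ucontext.h>

static volatile uintptr_t interrupted_ip;
static volatile sig_atomic_t handler_ran;

static void probe_handler(int sig, siginfo_t *info, void *ctx)
{
    ucontext_t *uc = ctx;

    (void)sig;
    (void)info;
#if defined(__x86_64__)
    interrupted_ip = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
#elif defined(__i386__)
    interrupted_ip = (uintptr_t)uc->uc_mcontext.gregs[REG_EIP];
#else
    (void)uc;
    interrupted_ip = 0; /* other architectures name this field differently */
#endif
    handler_ran = 1;
}

/* Installs the handler, raises SIGUSR1 once and restores the old handler.
 * Returns 0 if the handler ran, -1 otherwise. */
static int probe_interrupted_ip(void)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = probe_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, &old) != 0)
        return -1;

    handler_ran = 0;
    raise(SIGUSR1);

    sigaction(SIGUSR1, &old, NULL);
    return handler_ran ? 0 : -1;
}
```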
Here is a log:
main 1814: scheduling first, context size: 88, fpstte: 624.
call_me_func: ebp: 0xb7f19ff8, stack: 0xb7f1a008, diff: 16.
call_me_func: esp: 0xb7f19fb0, stack: 0xb7f1a008, diff: 88.
1169655167: th: 0x8049c00, stack: 0xb7efa008, id: 1, esp: b7f19fb0, ebp: b7f19ff8.
alarm_sighandler: ebp: 0xb7f19b08, esp: 0xb7f19ad0, func: 0x804868a, frame: 0xb7f19b0c, call_me_func: 0x804856c.
alarm_sighandler: prev: esp: b7f19de8, ebp: b7f19fa8, eip: 3404db.
alarm_sighandler: eip set to 804865a.
sched_return: func: 0x804865a, ebp: b7f19de4, esp: b7f19dcc.
1169655168: th: 0x8049c00, stack: 0xb7efa008, id: 2, esp: b7f19fb0, ebp: b7f19ff8.
1169655171: th: 0x8049c00, stack: 0xb7efa008, id: 3, esp: b7f19fb0, ebp: b7f19ff8.
1169655174: th: 0x8049c00, stack: 0xb7efa008, id: 4, esp: b7f19fb0, ebp: b7f19ff8.
As you can see, sched_return() is called instead of the old function,
which prints its next line once sched_return() returns.
To implement correct userspace scheduling I only need to replace the whole struct sigframe
content with the context of a different thread. So far this looks simple;
how it will be in practice I will check tomorrow, and now I need some climbing.
P.S. In the previous story
about how signals work I made a mistake saying that a new signal stack is allocated - no, the same
process stack is used, or the alternate one if it is available and the corresponding flag is set.
/devel/threading :: Link / Comments (0)
Tue, 23 Jan 2007
How do signals work in Linux?
Likely it is the same in all other Unix systems too,
but I only checked the Linux kernel.
There are two types of signals - synchronous ones, which are,
for example, the result of a faulting operation
and always happen synchronously right after the wrong operation,
and asynchronous ones, which can be delivered at any time,
for example using the kill() call.
No matter what type of signal is received, it is processed in exactly the same way.
When a signal is generated and it is not blocked, the mask of pending signals is updated.
Later (when the process is scheduled for execution) the kernel will examine the mask of pending signals
and, if there are any, it will start to deliver them one by one.
If the handler is set to SIG_DFL or SIG_IGN,
either the default action will take place (usually the process is killed)
or the signal will be dropped. The most interesting case thus is when there is
our own handler.
In that case the kernel will eventually call the setup_frame() function,
which will set up a new signal stack (or use an existing one if the
SA_ONSTACK flag is set), save the current context (copy registers,
error value, some thread info and other interesting information), and set up
the return call (the function which will be called when the signal handler completes,
to return back to the kernel). The context saving procedure includes filling
struct sigframe, which contains all the info needed to continue running
the interrupted task and/or schedule it after some time.
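Which stack the handler actually runs on can be checked from userspace. As far as the documented interface goes, the handler runs on the interrupted thread's own stack unless an alternate stack was registered with sigaltstack() and the handler was installed with SA_ONSTACK. The sketch below tests both cases by looking at the address of a local variable inside the handler; the names and the 64 KB size are arbitrary choices for the example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stdint.h>
#include <string.h>

#define ALT_STACK_SIZE (64 * 1024)

static char alt_stack[ALT_STACK_SIZE];
static volatile sig_atomic_t ran_on_alt_stack;

static void where_am_i(int sig)
{
    char marker; /* lives on whatever stack the handler runs on */
    uintptr_t p = (uintptr_t)&marker;

    (void)sig;
    ran_on_alt_stack = p >= (uintptr_t)alt_stack &&
                       p < (uintptr_t)alt_stack + ALT_STACK_SIZE;
}

/* flags is 0 or SA_ONSTACK. Returns 1 if the handler ran on alt_stack,
 * 0 if it ran on the normal stack, -1 on error. */
static int handler_stack_test(int flags)
{
    stack_t ss, old_ss;
    struct sigaction sa, old_sa;

    memset(&ss, 0, sizeof(ss));
    ss.ss_sp = alt_stack;
    ss.ss_size = ALT_STACK_SIZE;
    if (sigaltstack(&ss, &old_ss) != 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = where_am_i;
    sa.sa_flags = flags;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, &old_sa) != 0)
        return -1;

    ran_on_alt_stack = -1;
    raise(SIGUSR1);

    sigaction(SIGUSR1, &old_sa, NULL);
    sigaltstack(&old_ss, NULL);
    return ran_on_alt_stack;
}
```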
After some GNU asm learning and googling, I've managed to run
a function on top of its own stack. The code (for x86) is pretty simple:
asm volatile (
    "mov %0,%%eax \n"       /* Start address */
    "mov %1,%%ebx \n"       /* Argument */
    "mov %%esp,%%edx \n"    /* save old sp to edx */
    "mov %2,%%esp \n"       /* change stack */
    "push %%edx \n"         /* keep old sp on the new stack */
    "push %%ebx \n"         /* copy argument to new stack */
    "call *%%eax \n"        /* call (*func)(arg); */
    "mov 4(%%esp),%%edx \n" /* reload old sp stored above the argument */
    "mov %%edx,%%esp\n"     /* restore old stack */
    :
    : "g"(func), "g"(data), "g"(stack+stack_size)
    /* the call may clobber caller-saved registers, flags and memory;
       %esp is restored by the code itself, so it is not listed */
    : "eax", "ebx", "ecx", "edx", "memory", "cc"
);
I found it in the PTL threading library,
which is the only one which supports preemptive userspace scheduling. The provided
function is indeed called on its own stack. But when a signal is delivered,
its %ebp does not contain that stack, but instead contains an address
somewhere in the middle of the new stack, which raised a suspicion that glibc
installs its own signal handler and then calls mine, but experiments on an x86_64 test machine
showed that the kernel indeed jumps directly into my own signal handler. Unfortunately
I do not have a fast x86 test machine, and I do not want to split my effort between two arches currently,
so I will set up my VIA C3 test machine and run some signal tests on a modified
kernel to detect where exactly the stack pointer of the interrupted context is stored.
When this issue is resolved, a correct preemptive scheduler for the M-on-N threading
model implementation will be just a matter of hours.
/devel/threading :: Link / Comments (0)
This year Linux Kernel Summit will be held in Cambridge, England.
According to Theodore Ts'o
this will happen September 5-6.
By that date I have some projects to complete, so I think I would participate,
if the stars line up.
The main projects are kevent,
of course, and netchannels
(with grand network stack breakage and replacement of the socket hash tables).
Additionally I think threading
issues can form a good talk, and of course the new filesystem
(but only if there are some results by that point).
Although I believe that the probability of my talks is quite low (and in case you do not know,
my English speaking skills are somewhere between zero and void, although it is
not a problem for me), it is still possible to go there for a day and meet Abr
(although he is in London) and Mephody (although he is in Limerick, Ireland).
/devel/other :: Link / Comments (0)
True context switching in userspace.
After some thinking, I've understood that the setcontext()
approach does not allow real context switching - when this function is called,
the context is restored from the point where getcontext()/makecontext()
was called last time, so even if one has the possibility to restore a new context, the context being switched out
must save its state by itself - obviously no one will do it. So this approach will not be used
at all in my context switching implementation (although I will check the getcontext()/makecontext()
sources, since they contain needed bits).
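The limitation is easy to see in a small cooperative example: a switch happens only when the running context itself calls swapcontext(), which saves its own state before resuming the other one. This is a sketch with made-up names and an arbitrary 64 KB stack, not code from the library; the trace array just records the order of execution.

```c
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;
static int trace[8];
static int trace_len;

static void coroutine(void)
{
    trace[trace_len++] = 1;
    /* Voluntary switch: save our own state, resume the main context. */
    swapcontext(&co_ctx, &main_ctx);
    trace[trace_len++] = 3;
    /* Returning from here resumes uc_link, i.e. main_ctx. */
}

/* Returns the number of recorded steps (5 on success), or -1 on error. */
static int run_ping_pong(void)
{
    size_t stack_size = 64 * 1024;
    void *stack = malloc(stack_size);

    if (!stack)
        return -1;
    trace_len = 0;
    if (getcontext(&co_ctx) != 0) {
        free(stack);
        return -1;
    }
    co_ctx.uc_stack.ss_sp = stack;
    co_ctx.uc_stack.ss_size = stack_size;
    co_ctx.uc_link = &main_ctx;
    makecontext(&co_ctx, coroutine, 0);

    trace[trace_len++] = 0;
    swapcontext(&main_ctx, &co_ctx); /* run coroutine until it yields */
    trace[trace_len++] = 2;
    swapcontext(&main_ctx, &co_ctx); /* resume it after its yield */
    trace[trace_len++] = 4;

    free(stack);
    return trace_len;
}
```

Nothing here can take the CPU away from coroutine() if it never calls swapcontext() itself, which is why a preemptive scheduler needs something else.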
Let's move slightly away from the topic and concentrate on how signals work.
As far as I know, a signal is just a call to some function on the stack -
the kernel sets up a small region on the stack of the currently executing thread,
saves the current context there
and calls a special function which ends up invoking the registered handler.
(Interesting note for investigation - how do signals work with a non-executable stack?
Likely a special page is allocated for that purpose, but if so, then there is a way
to write exploits which can run on the stack even when the system sets it up as non-executable.)
A signal handler can not be reentered; when it exits, it restores the previous context,
thus creating real context switching.
Returning to our topic - the scheduler's work thus becomes obvious:
the system's signal helper will store the current execution context on the stack,
then a new thread will be selected by the registered handler,
and its context will be restored when the signal handler exits instead of
the old one, so true context switching will be performed.
This looks simple in theory, but in practice there are a couple of small problems:
- I do not know asm (even x86) well enough to code in it like in C (but I always wanted to hack low-level stuff)
- I am not 100% sure that signals work like I described (but I surely want to know)
- I do not know how easy it will be to save/restore context (according to glibc sources,
getcontext()/makecontext() and friends are not too big though)
Actually a context switch is a bit more complex - for example the virtual mapping should be saved/restored too
for TLS (thread-local storage), on x86 MMU registers must be saved, and more generally FPU state
must be saved too, but for the initial implementation I think it is not too relevant.
So, enough of theories, it is about 1 A.M. and I need to sleep - I'm sure this day will be very interesting.
/devel/threading :: Link / Comments (0)
Mon, 22 Jan 2007
Signals and contexts.
I was a bit optimistic when I said that scheduling works -
it can not work at all, since it is not allowed
to call setcontext()/swapcontext() from a signal handler.
That is only needed to schedule away CPU-bound tasks which do not
perform syscalls, since a syscall will be a rescheduling point too.
To solve this the system needs to either spawn an additional
control thread, or have the kernel provide a different signal type
which would be safe for setcontext()/swapcontext().
The former is the platform-independent approach; I would like
to implement the latter, but for now the first one will be done.
I've just found that there is no preemptive userspace
threading library - IBM's closed NGPT library is based on GNU Pth
and they both provide only non-preemptive scheduling.
Now I understand why I have so many problems with preemptive scheduling,
and actually it does not look like there is an easy solution even with
a control thread - all *context() functions work only with
the current context.
This requires some deep thinking...
/devel/threading :: Link / Comments (0)
Sun, 21 Jan 2007
Cognac, vodka, friends...
I've visited my friends in Dolgoprudniy - Bass (Andrew Shcheglov),
the Silich/Sviridovskaya family, Gora, and the Shtrom family in Lobnya -
that was an excellent trip. I will stay at Bass' home in Dolgoprudniy
for some time, maybe I will visit Alma Mater
and even meet my old friends who work there.
I found a new, very interesting drink based on cognac, although I generally do not like cognac
(for example today I drank a several-years-old 'Hennessy', but definitely did not find it
to be extremely tasty; I still think that cognac is a bad-tasting coloured
vodka (at least the latter can be drunk with a chaser like juice)).
It is the so-called "Cranberry in cognac", which is not as strong as vodka,
and thus can be drunk without a chaser or food, and has an extremely
good taste.
/life :: Link / Comments (0)
Sat, 20 Jan 2007
M-on-N threading scheduler is ready.
It works similarly to the O(1) scheduler in that it has two
queues of tasks too.
When a task runs out of its timeslice, it is moved into the inactive queue,
which becomes active when the active queue is empty.
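The two-queue idea can be sketched in a few lines. This is a simplified single-threaded illustration with made-up names and a fixed-size ring buffer, not the scheduler's real data structures; locking and per-CPU details are left out.

```c
#include <stddef.h>

#define MAX_TASKS 16

struct run_queue {
    int task[MAX_TASKS];
    size_t head, count;
};

struct two_queue_sched {
    struct run_queue q[2];
    int active; /* index of the currently active queue */
};

static int rq_push(struct run_queue *rq, int id)
{
    if (rq->count == MAX_TASKS)
        return -1;
    rq->task[(rq->head + rq->count) % MAX_TASKS] = id;
    rq->count++;
    return 0;
}

static int rq_pop(struct run_queue *rq)
{
    int id;

    if (rq->count == 0)
        return -1;
    id = rq->task[rq->head];
    rq->head = (rq->head + 1) % MAX_TASKS;
    rq->count--;
    return id;
}

/* A new task starts in the active queue. */
static int sched_add(struct two_queue_sched *s, int id)
{
    return rq_push(&s->q[s->active], id);
}

/* A task that used up its timeslice goes to the inactive queue. */
static int sched_expire(struct two_queue_sched *s, int id)
{
    return rq_push(&s->q[s->active ^ 1], id);
}

/* Picks the next task; when the active queue is drained the queues
 * swap roles. Returns -1 if there is nothing to run. */
static int sched_pick(struct two_queue_sched *s)
{
    if (s->q[s->active].count == 0)
        s->active ^= 1;
    return rq_pop(&s->q[s->active]);
}
```

Swapping the roles of the queues is a single index flip, so picking the next task stays constant-time no matter how many tasks have expired.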
It seems that the system scales well with an increased number of CPUs -
at least when the system created several busy-loop threads, they ate both physical CPUs
on my Core Duo system (according to 'top'), although it is possible that the distribution is unfair.
The code is not yet ready - I know about at least two nasty bugs there,
and it lacks the syscall substitution part, which is a major part, but
there is progress indeed.
/devel/threading :: Link / Comments (0)
Why is kernel.org so slow sometimes (LWN.net article).
Here is an article
about the filesystem access problems kernel.org started to face recently, which
comes down to the fact that Linux does not scale too well when the number of
directory reads becomes too high. Another part of the problem is
that ext3 (the current kernel.org fs) does not group directories close to each other
to reduce the number of seeks needed to read several subdirectories,
while it was pointed out that XFS, for example, does have some methods to group directory information
close together on disk. Another problem is the absence of readahead for directories.
/devel/fs :: Link / Comments (0)
RIAA/MPAA and others.
I always wonder how it is even possible to sue someone
for getting something for free - how can they sue you
for downloading music files? When you go down the street
and see a dollar, no one will sue you for stealing if you pick it up.
The same applies to a book, which is someone's intellectual property.
Then why is the Internet different?
Of course I understand that the amount of money on the Internet is not even remotely
comparable to the number of books one can find on the street, or to any other similar
situation.
But why are defendants so inert, why do they only try to say that they did it by
accident?
RIAA/MPAA allow _information_ to be sold, which on the Internet means a set of bits.
And they want to limit your rights to do whatever is allowed by the Constitution.
When someone gets a disc with songs, one has exactly those bits - it is a _product_,
it is not a service. One can do anything with a product, since after it was bought, it belongs
to the new owner, but a service can only be used according to its rules.
So, when I see freely downloadable music on the Internet, I just see some information which
does not belong to RIAA/MPAA, since it was bought by someone as a product (bits of
information on a CD are a product, not a service), so one can use it however one wants,
including sharing.
I agree that pirates who make a copy in a cinema (always of bad quality) do it illegally, since
a cinema provides a service, which has rules. But if information has been sold, it does not belong
to the previous owner no matter what he wants to say - consider a situation where you are buying a TV,
and the shop does not permit you to watch some channels on it.
I think there must be an attack on that front, not a swampy defence.
The idea was inspired by reading Slashdot about all those idiotic RIAA/MPAA legal actions...
P.S. It may sound too naive, since I'm not a lawyer at all, but I do not understand
what the difference is between information in the form of bits on a disc and any other
product, whose EULA becomes invalid if it breaks the law.
/other :: Link / Comments (0)
Fri, 19 Jan 2007
Userspace scheduling in M-on-N threading model.
The main purpose of the scheduler as a separate system is to
maintain fairness in distributing CPU time between tasks.
The current scheduler just gets the next task and gives it its own timeslice (currently equal to 10 milliseconds).
Whether that task is sleeping in a syscall or performing CPU-intensive operations,
the system does not take that into account.
So tasks which are so-called IO-bound,
i.e. waiting in a syscall for IO completion (or actually sleeping for any other reason),
do not get CPU time in a fair manner, since their timeslices
are spent mostly sleeping. CPU-bound tasks in that model do not fully utilize
the CPU either, since part of the time the CPU is idle because the scheduler put an IO-bound
task into execution, and that task is waiting, while it would be possible to
start a CPU-bound task.
So the solution is to give each task a timeslice which
is decreased only while the task is actively executing on the CPU. If a task
puts itself to sleep, its timeslice is reduced only by the time
which was used for active execution. When rescheduling happens,
either at syscall time or in a signal, the scheduler will select
the task with the largest timeslice left. The priority of a task will
correspond to the length of the timeslice it obtains.
The userspace scheduler also has access to information about what exactly any given task
is doing, so if it is known that a task is waiting in a syscall, it will not be awakened
at all until the scheduler receives the kevent which that task is waiting for.
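The selection rule described above can be sketched as follows. All names are made up for the example, and the refill step (giving every runnable task a fresh timeslice once all of them are exhausted) is an assumption added to keep the sketch complete; it is not described in the post.

```c
#include <stddef.h>

enum task_state { TASK_RUNNABLE, TASK_WAITING /* sleeping in a syscall */ };

struct task {
    int id;
    enum task_state state;
    long slice_left_us; /* remaining timeslice, microseconds */
    long slice_full_us; /* fresh timeslice length, i.e. the priority */
};

/* Charge a task only for the CPU time it actually used. */
static void task_charge(struct task *t, long ran_us)
{
    t->slice_left_us -= ran_us;
    if (t->slice_left_us < 0)
        t->slice_left_us = 0;
}

/* Pick the runnable task with the most timeslice left. Waiting tasks are
 * skipped until their event arrives. If every runnable task is exhausted,
 * refill them all and pick again (assumed policy). */
static struct task *sched_select(struct task *tasks, size_t n)
{
    struct task *best = NULL;
    size_t i;
    int pass;

    for (pass = 0; pass < 2 && !best; pass++) {
        for (i = 0; i < n; i++) {
            struct task *t = &tasks[i];

            if (t->state != TASK_RUNNABLE)
                continue;
            if (pass == 1)
                t->slice_left_us = t->slice_full_us;
            if (t->slice_left_us > 0 &&
                (!best || t->slice_left_us > best->slice_left_us))
                best = t;
        }
    }
    return best;
}
```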
/devel/threading :: Link / Comments (0)
Kevent feature request for aio_sendfile().
Suparna Bhattacharya (IBM) has requested a new feature in the asynchronous
file sending syscall - a header pointer, whose data will be put into the socket
queue before the file's data.
Although Linux syscall overhead is extremely small compared to other Unix
systems, it is still not zero, so since I already "optimized" (i.e. removed)
the open() call in aio_sendfile_path(), I think
things will not become worse if I put a header pointer and length there
too.
I plan to release a new kevent version with this feature implemented after
the M-on-N threading model implementation.
/devel/kevent :: Link / Comments (0)
Brilliant idea on how to break Linux networking code completely.
I'm going to substitute sockets with netchannels,
while keeping backward compatibility - mainly I will replace the socket lookup hash tables
with netchannel's trie and will make a socket a special type of netchannel,
like netfilter or userspace netchannels currently are.
This change is completely transparent to all users, since no API will be changed,
only socket allocation/lookup/freeing.
This change is intended to allow unlimited scaling of the number of sockets with
constant search time (since there is only one type of wildcard sockets - listening
ones, which expect an incoming connection from the 0.0.0.0 address - there will be
only one wildcard per trie, thus search time will be constant).
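To show why a trie gives a search time that does not depend on the number of sockets, here is a deliberately simplified sketch: a fixed-depth binary trie keyed by a 48-bit (IPv4 address, port) pair, with one fallback lookup for the 0.0.0.0 wildcard. The key layout and all names are assumptions made for the example; the real netchannel trie is not shown in this post and a real socket lookup uses a larger key.

```c
#include <stdint.h>
#include <stdlib.h>

#define KEY_BITS 48 /* 32-bit IPv4 address + 16-bit port (assumed key) */

struct trie_node {
    struct trie_node *child[2];
    void *sock; /* set only on the last level */
};

static uint64_t make_key(uint32_t addr, uint16_t port)
{
    return ((uint64_t)addr << 16) | port;
}

static int trie_insert(struct trie_node *root, uint32_t addr, uint16_t port,
                       void *sock)
{
    uint64_t key = make_key(addr, port);
    struct trie_node *n = root;
    int bit;

    for (bit = KEY_BITS - 1; bit >= 0; bit--) {
        int b = (int)((key >> bit) & 1);

        if (!n->child[b]) {
            n->child[b] = calloc(1, sizeof(*n));
            if (!n->child[b])
                return -1;
        }
        n = n->child[b];
    }
    n->sock = sock;
    return 0;
}

/* At most KEY_BITS steps, regardless of how many sockets are stored. */
static void *trie_find(const struct trie_node *root, uint32_t addr,
                       uint16_t port)
{
    uint64_t key = make_key(addr, port);
    const struct trie_node *n = root;
    int bit;

    for (bit = KEY_BITS - 1; bit >= 0 && n; bit--)
        n = n->child[(key >> bit) & 1];
    return n ? n->sock : NULL;
}

/* Exact match first, then the single wildcard entry for 0.0.0.0. */
static void *socket_lookup(const struct trie_node *root, uint32_t addr,
                           uint16_t port)
{
    void *s = trie_find(root, addr, port);

    return s ? s : trie_find(root, 0, port);
}
```

With only one kind of wildcard, a lookup is at most two bounded walks, which is what keeps the search time constant.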
I've put this item into
TODO and scheduled these changes after
the M-on-N threading model implementation.
/devel/networking :: Link / Comments (0)
Thu, 18 Jan 2007
iDrink.

Got from here
via www.idiot.ru (political news, btw).
/other :: Link / Comments (0)
Wed, 17 Jan 2007
Climbing evening.
It was not that bad a training, although with some negative moments.
I tried new traces this day - most of them were quite doable, but since they
were complex enough and I had done several complex boulder problems before,
I failed. The most interesting was a yellow trace in the central sector
on the vertical wall with all completely passive holds - I managed to complete
about half of the trace on-sight, and then it started to present surprises,
or maybe I was just too tired for on-sight climbing of that complexity.
Anyway, at the top I was completely out of power, and even managed to
get the rope over my head and hurt my shoulder, but eventually I quickly fixed my position,
though I tore my headphones cord.
/life :: Link / Comments (0)
Breakthrough ideas are not from teams. Hans von Ohain.
Interesting note... I would even say 'ego boosting'. I like it.
/other :: Link / Comments (0)
Threading part of the NTL M-on-N threading library is ready.
Although not without problems - there is no scheduler (well, there is a round-robin one,
which is not what I want), and I did not run any kind of benchmarks to test
SMP scalability and timer signal overhead (the latter is the most
problematic part - although 'top' shows zero CPU
usage for a pool of 100 threads sleeping in an infinite loop,
it is still possible that actual CPU usage due to
signal delivery overhead can be noticeable).
The code does not contain kevent syscall wrappers yet. I will think about dynamic
library loading tricks, which allow one to 'replace' syscalls at runtime.
Enough for today - I'm going climbing.
/devel/threading :: Link / Comments (0)
pthread_create() vs. clone().
Have you ever tried to use clone() directly?
I bet you have never tried it, at least with recent kernels.
First, the exported clone() does not correspond to
what the kernel expects; it looks like it is only provided for
compatibility. The manpage for that call is utterly obsolete
and incorrect (apart from the useful flag description it contains a
_wrong_ description of the parameters, at least for i386).
But I do not look for easy ways - I have the glibc sources and can
dig into them.
That was my first impression, that of a man who in theory can climb Everest,
fly to space and understand the math behind string theory (the latter only if time permits,
it looks like a whole life can be spent there digging into more and more
new subtheories).
Now I think that all three tasks described above are much, much, much
more solvable than digging in the glibc sources. And those people
say that I described kevent poorly - hey, look into the glibc NPTL
implementation (and I am not even talking about its coding style)
and pray you will never see it again,
or just try to start a new thread using clone().
After about an hour of reverse engineering trying to make __clone()
work (note that clone() does not work at all, just forget
about this call, only __clone() is correct for i386 and the 2.6 kernel),
I managed to start a new thread. It was a win, except for a very small problem:
it crashed somewhere in the provided function's call chain.
I want you to know that I do not know the low-level i386 arch well enough
to easily read and understand the asm code (some years ago I managed to
write an asm application which entered protected mode in DOS,
but I do not remember asm anymore, and actually never understood
gas semantics well enough) found in sysdeps/unix/sysv/linux/i386/clone.S,
so I miserably failed to proceed.
Yes, I started to use pthread_create() for SMP scalability.
I do not hear you screaming 'loser', since you would be there too, but those of you
who still lurk here and ironically nod your heads had better point
me to something useful for understanding how modern i386 (or actually any other arch)
starts and works with threads/processes.
/devel/threading :: Link / Comments (0)
Initial implementation of the ntl (new threading library) M-on-N threading library.
Well, I can not find a better prefix than ntl,
which is an extremely non-ordinary abbreviation for 'new threading library'.
Anyway, the current version is very initial: it does not contain a scheduler
and does not contain kevent-driven wrappers on top of the usual IO syscalls,
but it already has all initialization mechanisms, a cache of threads
and all structures required for scheduling.
There are two major problems uncovered by this initial implementation.
The first one is the scheduling problem. Since NTL does not contain a dedicated scheduling thread,
it is quite hard to schedule functions which do not
make syscalls, for example those which just run a while(1); loop,
eat 100% CPU and never enter the NTL layer. To solve this problem I need
to add a timer and an appropriate signal handler, where rescheduling will happen,
which in theory can lead to performance degradation and to problems with an alarm
signal registered in the thread function (although that should be fixed with kevent
timer notifications).
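The timer-plus-handler idea can be tried in isolation. The sketch below arms a one-shot ITIMER_REAL timer and then spins in a loop that never makes a syscall; the SIGALRM handler only sets a flag, at the point where a real scheduler would pick another thread. The names are made up for the example and are not taken from NTL.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t need_resched;

static void tick(int sig)
{
    (void)sig;
    need_resched = 1; /* a real scheduler would switch threads here */
}

/* Arms a one-shot timer of slice_us microseconds and busy-loops until the
 * timer signal arrives. Returns 0 on success, -1 on error. */
static int spin_until_preempted(long slice_us)
{
    struct sigaction sa, old;
    struct itimerval it;
    sigset_t unblock;

    if (slice_us <= 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = tick;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old) != 0)
        return -1;

    /* Make sure the timer signal can actually be delivered. */
    sigemptyset(&unblock);
    sigaddset(&unblock, SIGALRM);
    sigprocmask(SIG_UNBLOCK, &unblock, NULL);

    need_resched = 0;
    memset(&it, 0, sizeof(it));
    it.it_value.tv_sec = slice_us / 1000000;
    it.it_value.tv_usec = slice_us % 1000000;
    if (setitimer(ITIMER_REAL, &it, NULL) != 0)
        return -1;

    while (!need_resched)
        ; /* CPU-bound loop that never enters the library */

    sigaction(SIGALRM, &old, NULL);
    return 0;
}
```

Calling spin_until_preempted(10000) uses a 10 millisecond slice. It also shows the conflict mentioned above: the sketch takes over SIGALRM, so a thread function that wants its own alarm would clash with it.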
Another problem is futex performance.
In the current code there are two locks implemented as semaphores, which
in modern Linux are translated into futexes - the scheduler lock,
which guards the queue of threads, and the stack cache lock, which guards
the list of free thread stacks.
So the usual thread creation, empty function and thread exit in NTL changes
from these
operations to:
mmap2(NULL, 8396800, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7684000
sigprocmask(SIG_BLOCK, NULL, []) = 0
futex(0xb7fdac20, FUTEX_WAKE, 1) = 0
sigprocmask(SIG_SETMASK, [], []) = 0
futex(0xb7fdac20, FUTEX_WAKE, 1) = 0
futex(0xb7fdaaa0, FUTEX_WAKE, 1) = 0
sigprocmask(SIG_SETMASK, [], NULL) = 0
futex(0xb7fdaaa0, FUTEX_WAKE, 1) = 0
...
munmap(0xb7fd7000, 4096) = 0
so we get four additional futex calls - two locks are processed: one when the stack is unlinked from
and returned to the stack cache, and another when the thread is added to and removed from the scheduler's queue.
Performance differs noticeably (the test case creates a thread which exits immediately,
repeated the requested number of times):
$ ./ntl_test 100000
num: 100000, diff: 388234, speed: 3.882340.
Compared to 1.793600 microseconds without futex calls.
In this situation there is no concurrency at all - it is a synthetic test,
so actually one _empty_ futex call takes about 0.5 microseconds, of which
pure syscall overhead is 50% (this is an Intel Core Duo 3.40GHz (running at 3.7 GHz) test machine).
I can not say if futex performance is slow or fast - but I would like to avoid this cost,
so in practice semaphores should not be used for thread serialization; instead
lightweight locks must be introduced.
In the current code all locks are abstracted and implemented in a separate file, so
lock changes are trivial, but I do not want to introduce per-arch code right now.
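As an illustration of a stack cache guarded by a lightweight lock instead of a semaphore, here is a sketch using a C11 atomic_flag as the lock. The names, the 64 KB stack size and the use of malloc() are assumptions for the example; the library itself maps stacks with mmap2(), as the strace output shows. Free stacks are chained through their own first bytes, so the cache needs no extra memory.

```c
#include <stdatomic.h>
#include <stdlib.h>

#define STACK_SIZE (64 * 1024) /* illustrative; the real stacks are larger */

/* A free stack stores the list link in its own first bytes. */
struct cached_stack {
    struct cached_stack *next;
};

static atomic_flag cache_lock = ATOMIC_FLAG_INIT;
static struct cached_stack *cache_head;

static void cache_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&cache_lock, memory_order_acquire))
        ; /* spin in userspace instead of calling futex() */
}

static void cache_lock_release(void)
{
    atomic_flag_clear_explicit(&cache_lock, memory_order_release);
}

/* Returns a cached stack if there is one, otherwise allocates a new one. */
static void *stack_get(void)
{
    struct cached_stack *s;

    cache_lock_acquire();
    s = cache_head;
    if (s)
        cache_head = s->next;
    cache_lock_release();

    return s ? (void *)s : malloc(STACK_SIZE);
}

/* Returns a stack to the cache when its thread exits. */
static void stack_put(void *stack)
{
    struct cached_stack *s = stack;

    cache_lock_acquire();
    s->next = cache_head;
    cache_head = s;
    cache_lock_release();
}
```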
/devel/threading :: Link / Comments (0)
Bruce Schneier's facts.
Super!
When Bruce Schneier observes a quantum particle, it remains in the same state until he has finished observing it.
Most people use passwords. Some people use passphrases.
Bruce Schneier uses an epic passpoem, detailing the life and works of seven mythical Norse heroes.
Bruce Schneier writes his books and essays by generating random alphanumeric text of an appropriate length and then decrypting it.
/other :: Link / Comments (0)
New kevent 'take33' release.
It is a minor release which only contains the following changes:
- Updated documentation (aio_sendfile_path()).
- Fixed typo in forward declaration.
/devel/kevent :: Link / Comments (0)
Tue, 16 Jan 2007
The world becomes demented.
I was invited to participate in a training course
at The International Institute of Information Technology (I2IT) of India
with the topic "Advanced Linux Kernel Programming" as a lead teacher
(or as a knowledge base for teachers?)
by a Russian company which held a similar course the previous year.
Actually I do not even know what to answer. Really.
I do think I'm not that bad a kernel hacker, although definitely not brilliant,
and I have limited knowledge compared to many of those whose names you can find in
the Linux kernel sources, but I would never teach people how to hack the Linux kernel
(actually I doubt I would like to teach anyone to do anything at all,
maybe only sometimes you can hear some advice fly out).
Kernel hacking is like a breeze - you either like it or not - if you like
it you just start doing it, initially with small steps, later with big ones,
just like with any other task, and if you do not - no matter how good the teacher is,
the result is unlikely to be really positive.
/devel/other :: Link / Comments (0)
Mon, 15 Jan 2007
Extremely powerful climbing evening.
That was a great training - a lot of traces, really a lot, most of them combined
into pairs or threes with a rest in between - I even managed to complete
some new and some really old traces. And magically I was not that tired;
of course my hands became weaker with time, but that means trace completion
became more technical. I completed a trace with a horizontal (negative)
slope start with only minor problems (and if you take into account that
it was almost the last one I climbed this day, it can be considered
good climbing), which was a bit of a surprise to me.
Later the dry sauna almost killed my body - it became so slack
that I even sat several minutes without moving just to get myself
back into shape.
It was just perfect training. Excellent time.
/life :: Link / Comments (0)
Initial benchmarking of pthread_create() vs. makecontext().
The benchmark is simple - create a new thread, the thread function immediately exits,
the parent thread waits for its completion and starts again.
One case is pure pthread_create()+pthread_join(),
the other one is getcontext(), stack allocation (8 MB as of my current rlimit),
makecontext(), swapcontext(); the thread function immediately exits and the stack is freed.
Obviously I expected makecontext() to be
much faster, but (time is the number of microseconds to create/destroy one thread,
i.e. to perform the sequence described above):
$ ./test_pthread 100000
num: 100000, diff: 1402225, time: 14.022250.
$ ./test_context 100000
num: 100000, diff: 1322459, time: 13.224590.
Impossible, something was completely wrong and magic from another world was mixed in - that was my first impression.
But when I studied at MIPT, I was told in every physics lab that there is no magic, so I started to think.
The only thing my brain could think of was to run strace.
So I did, and found the following interesting moments.
Pthread case:
...
mmap2(NULL, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7622000
...
clone(child_stack=0xb7e214c4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|CLONE_DETACHED,
parent_tidptr=0xb7e21bf8, {entry_number:6, base_addr:0xb7e21bb0, limit:1048575, seg_32bit:1,
contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e21bf8) = 24426
clone(child_stack=0xb7e214c4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|CLONE_DETACHED,
parent_tidptr=0xb7e21bf8, {entry_number:6, base_addr:0xb7e21bb0, limit:1048575, seg_32bit:1,
contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e21bf8) = 24427
...
and so on - everything looks ok - one stack allocation, and then it was reused, since the stack was not freed
but was put into a cache by the NPTL implementation.
Here is the ucontext case:
...
mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7659000
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_SETMASK, [], []) = 0
sigprocmask(SIG_SETMASK, [], NULL) = 0
munmap(0xb7659000, 8392704) = 0
mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7659000
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_SETMASK, [], []) = 0
sigprocmask(SIG_SETMASK, [], NULL) = 0
munmap(0xb7659000, 8392704)
...
So, mmap()/munmap() was the culprit - my context allocation code
did not use a cache of stacks; instead it allocated and freed a new one for each new context.
After I emulated cache usage, I got the following strace for the context case:
mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb75fc000
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_SETMASK, [], []) = 0
sigprocmask(SIG_SETMASK, [], NULL) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_SETMASK, [], []) = 0
sigprocmask(SIG_SETMASK, [], NULL) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
...
munmap(0xb75fc000, 8392704) = 0
And the following results:
$ ./test_pthread 100000
num: 100000, diff: 1402225, time: 14.022250.
$ ./test_context 100000
num: 100000, diff: 179360, time: 1.793600.
As expected - there is no magic: creating and running a userspace context is almost 8 times faster
(1.79 vs. 14.02 microseconds) than real thread creation,
and the mmap()/munmap() syscalls cost about as much as clone() itself.
An empty syscall on this machine takes about 0.25 microseconds, so its overhead is negligible.
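The cached-stack variant of the benchmark loop can be sketched roughly like this. It is not the original test_context source: run_contexts(), thread_func() and the 64 KB stack size are assumptions made for illustration (the post used an 8 MB mmap()ed stack), and the "cache" here is just one stack allocated before the loop and reused for every context:

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024) /* assumption: much smaller than the 8 MB in the post */

static ucontext_t main_ctx, child_ctx;
static int runs;

/* "Thread" body: does nothing useful and returns, which resumes uc_link. */
static void thread_func(void)
{
    runs++;
}

/* Creates and runs num contexts one after another on a single reused stack.
 * Returns how many times the thread function ran, or -1 on error. */
static int run_contexts(int num)
{
    void *stack = malloc(STACK_SIZE); /* allocated once: the emulated stack cache */
    int i;

    if (!stack)
        return -1;
    runs = 0;
    for (i = 0; i < num; i++) {
        if (getcontext(&child_ctx) < 0)
            break;
        child_ctx.uc_stack.ss_sp = stack;
        child_ctx.uc_stack.ss_size = STACK_SIZE;
        child_ctx.uc_link = &main_ctx; /* where to continue when thread_func returns */
        makecontext(&child_ctx, thread_func, 0);
        if (swapcontext(&main_ctx, &child_ctx) < 0)
            break;
    }
    assert(runs == i);
    free(stack);
    return runs;
}
```

Timing a call like run_contexts(100000) with gettimeofday() around it would give a per-context number comparable to the one above, although the exact value depends on the machine and glibc version.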
/devel/threading :: Link / Comments (0)
Bring me candies from Linux.Conf.Au.
If I were there, I would visit the following presentations:
- Dr. Andrew S. Tanenbaum's presentation - just to see how he looks; I seriously doubt
his talk has something outstanding - maybe it is even incorrect, for example my last TV (which I threw out of the window
when I was a student) did not show most of the channels and had other problems - it is like
my Gnome terminal crashing once per week - both did/do not work properly.
- The Kernel Report by Jonathan Corbet - interesting
to know how life is in the environment I'm hacking in.
- The Linux kernel hacker generations by Andi Kleen -
the name speaks for itself.
- nouveau - reverse engineered nvidia drivers by Dave Airlie -
I never did reverse engineering of such a monster as an Nvidia video adapter, so it would
be a very interesting talk.
Or in parallel "A new kind of real-time: enterprise real-time"
by Theodore Ts'o.
- Garbage Collection in LogFS by Jorn Engel - it is somehow
related to my upcoming filesystem development -
I need to know potential competitors (although his FS is completely different - it is for flash, but nevertheless).
- Routing and IPSEC Lookup Scaling in the Linux Kernel by David Miller -
I am particularly interested in the "Grand Unified Flow Cache" designs. And I have never seen David.
- CFQ IO Scheduler by Jens Axboe - it is related
to my upcoming filesystem development and is just interesting.
- Concurrency and Erlang by Andre Pang - highly related
to my threading library work
and is just interesting.
Or in parallel - "Ext4: The next generation of ext2/3"
by Theodore Ts'o.
- Writing an x86 hypervisor: all the cool kids are doing it! by
Zachary Amsden, Jeremy Fitzhardinge, Rusty Russell, Chris Wright - yes, I want to be a cool kid who can
write a hypervisor; unfortunately I have no time and knowledge yet, so at least I would like to listen about it.
- Choosing and Tuning Linux File Systems by Valerie Henson -
given what she presented at the FS workshop, this can be extremely interesting for my upcoming
filesystem development.
Non-development activities I would like to participate in:
/devel/other :: Link / Comments (0)
Sun, 14 Jan 2007
NPTL thread stack usage.
NPTL (as of glibc-2.5) uses the following tricky stack usage model:
when a new thread is created, some stack must be allocated for it,
so NPTL keeps a cache of stacks, each of which may have a different size.
When a thread is created, that cache is searched for the entry
with the nearest but bigger size; if such a stack is found and it is
not much bigger than requested (it is less than 4 times bigger than requested),
it is returned to the user (and the global variable which holds the size
of the cached stacks is reduced). If there is no appropriate stack,
it is allocated from anonymous memory using mmap()
with at least the MAP_PRIVATE | MAP_ANONYMOUS flags,
which in particular means that pages are not allocated up front at all;
instead the copy-on-write mechanism is used (i.e. real allocation happens
only when the user writes to the allocated pages).
Maximum size of the thread stack is 40 MB, the default value is taken
from rlimits.
The stack is guarded by some pages (if required) and part of it is used
as a control block.
Simple and good model.
Exactly the same model will be used for stack allocation in the M-on-N
threading model.
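The cache lookup described above ("nearest but bigger, but less than 4 times bigger") can be sketched as follows. This is not the glibc code: struct cached_stack and stack_cache_find() are hypothetical names, and the cache is a flat array instead of the list NPTL really keeps:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache entry: one previously allocated stack. */
struct cached_stack {
    void *mem;
    size_t size;
    int in_use;
};

/* Picks the smallest free stack which is not smaller than wanted.
 * Returns its index and marks it used, or -1 when nothing fits or the
 * best candidate is 4 or more times bigger than requested; in that
 * case the caller would mmap() a fresh stack. */
static int stack_cache_find(struct cached_stack *cache, int n, size_t wanted)
{
    int i, best = -1;

    assert(wanted > 0);
    for (i = 0; i < n; i++) {
        if (cache[i].in_use || cache[i].size < wanted)
            continue;
        if (best < 0 || cache[i].size < cache[best].size)
            best = i;
    }
    if (best < 0 || cache[best].size >= 4 * wanted)
        return -1;
    cache[best].in_use = 1;
    return best;
}
```

The 4x limit keeps a small thread from pinning a huge cached stack that a later, bigger request could have used.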
There is another interesting memory issue in NPTL - so-called thread-local storage (aka TLS).
It allows, for example, the global errno variable to be per-thread.
This requires extending the programming language (C and C++ have __thread variables)
and the ELF header. The reasons to have thread-local storage are obvious, and the main one is performance,
but it has too many problems with symbol lookups, object allocation and other issues (more
details can be found in this article),
so I will not implement it.
Furthermore, my library will actually be a _trivial_ layer on top
of glibc, which will just provide _new_ symbols for programmers,
like new_write(), and if the approach is considered good,
glibc maintainers (namely Ulrich Drepper,
with whom I had some discussion about kevent
(it looks like we still do not 100% agree on all issues, but there was major progress))
could adopt the given design.
My library will use makecontext() and friends from
glibc as a basis,
which are actually just wrappers that work with registers and other information
meaningful for a running process.
/devel/threading :: Link / Comments (0)
Meanwhile on flat development front.
As you have already found out, I have 90% completed the hinged ceiling painting -
I selected a shiny orange colour to mark the sleeping area (that is where
the hinged ceiling is located); later I will paint the rest of the ceiling,
likely in something calm like light blue or green. The kitchen
will likely have beige colours.
I also completed the self-leveling floor filling everywhere except
half of the room, where building materials are placed - I even
started to walk around the room without shoes, although it is quite cold.
It seems that I've fixed the central heating radiator, or at least made it work
a bit better - it required draining the water, disassembling and reassembling
the flow mechanism (only two parts can be easily removed though),
and some physical (and indecent wording) "influence". Now I feel
much more comfortable in the apartment, although it is quite cold there,
at least without clothes at 6 A.M.
If things keep going in a good direction, I plan to paint the whole ceiling this
week and then start to hang wallpaper and lay the floor covering.
Eventually I will start bathroom development...
/devel/flat :: Link / Comments (0)
A super quest.
What is it?

The answer can be found in the gallery, starting from
this item,
but do not look there until your first answer is ready; I am almost sure it will be wrong,
but if you are that lucky, feel free to ask me for a beer.
This super orange colour (better to look in the gallery) was selected to mark the sleeping area.
/devel/flat :: Link / Comments (0)
40 centimeters of snow in Moscow!
/other :: Link / Comments (0)
Sat, 13 Jan 2007
Filesystem quest.
I've asked the Unix admins I know (mostly Solaris and Linux, bits of AIX and *BSD
and a lot of Windows knowledge) which FS features are
the most required and requested ones. Here is a short list in order of priority:
- Clarity of the idea. I.e. no reiserfs-in-reiserfs problem or things like
Solaris UFS on a highly fragmented disk.
- Reliability in operation. Different modes of journalling. Very fast
recovery and fsck checks; for example ext3 and reiserfs with their 4 minutes
per 250 MB when there are _no_ errors are completely unacceptable when the storage
contains 20 or more partitions.
- ACL support and quotas.
- Defragmentation. At least a tool to determine how high the fragmentation of the
current partition is, and possibly defragmentation itself. Currently it can be fixed
by moving the data round through an empty partition.
(Although I personally consider defragmentation a hack, a tool for showing
its status could be interesting. The main solution for that problem
I see in delayed allocation, and in defragmenting through delayed allocation when
a file is read into the VFS cache).
- Speed. Although admins rank this item differently, it is a very
significant issue to have high read speed compared to the hardware disk speed, at least
in the most required setups (web/file/mail server) -
current hardware easily allows speeds of more than 50 MB/sec, while filesystems
rarely get over the 25 MB/sec limit.
- Snapshots. For backup purposes read-only snapshots are the killer feature.
Read-write snapshots are also an interesting feature (when the
main partition is mounted for the first time, it would be useful to allow mounting it
read-write from some point in the past).
- Additional data processing units like compression, cryptography and so on,
especially with an interface to plug in external modules.
As you have probably already understood, this will be the TODO list for my own FS development.
/devel/fs :: Link / Comments (0)
Hinged ceiling is almost ready.

I think it was the most complex and heavy part of my development process.
It has two layers with small boardings; eventually I will place a neon cord
into the boarding and spot lights into the ceiling, but right now it has only
three layers of plaster and waits for the last one, which will colour it
a slightly yellow-beige colour, similar to my hammock, but smoother. The plaster
was applied not as a straight surface, but with random curviness similar
to the surface made by a spread material (I've found words for that -
consider a painting made with oil paints and a brush - that is similar
to what I want my ceiling to look like). The stucco plaster is called
"Venetian patterns" or something like that, so you can imagine how
it will look. If I complete it tomorrow, I will post some photos too.
Some more photos can be found in the gallery.
Meanwhile I've ordered my entrance door (for the second time already) - it is a simple but very
hard-to-break steel door. It will be ready about Jan 23-24; when it is
installed, I will finally feel like I am living in an apartment, not
on a construction site.
/devel/flat :: Link / Comments (0)
I'm pleased to congratulate you on The New Year!
Yes, again. We are so cool that we celebrate New Year
two times per year - Jan 1 and Jan 13.
I think you know what to do when the new year has come, so
do not look here - go and make something cool.
I also congratulate Abr on his birthday - since he is in London,
I can not celebrate with him and his family (Hello, Tanya and small
Anton Andreevich), but I transmit my wishes over the Internet
(if you ever read my blog).
Happy Birthday!
/life :: Link / Comments (0)
Fri, 12 Jan 2007
Climbing evening.
Short but quite productive training - not too many traces
were completed, since training started one hour after the usual start at 18-30,
but nevertheless I completed several simple warm-up traces and four quite
complex old traces - I found that some of them were slightly changed, which
forced me to fail, but only for a moment. I was not 100% recovered
from the previous (first this year) training, so I did not even try
something completely new and really complex, but nevertheless my body
is aching and that is a good sign.
/life :: Link / Comments (0)
SMP scalability in M-on-N threading model.
Actually it is simple.
Since only the kernel can schedule a thread (actually
not even a thread or process, but the kernel's own representation of it,
the so-called kernel virtual process) to run on a specified CPU,
the M-on-N threading model should have several real threads
(for example several current POSIX threads), whose number
should be equal to the number of real CPUs, and then the library
layer will schedule execution of contexts on different
real threads, each of which in turn can run on a separate CPU.
So, userspace will create new real threads when pthread_create()
is called while their number is less than the number of real CPUs;
each real thread in turn is a context in the global tree of contexts,
where a fake context will be added by every subsequent pthread_create()
call, and the userspace scheduler (backed by real threads) will pick up several contexts
from the tree and execute them on the real CPUs.
Simple and looks really good.
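The decision made on each pthread_create() call could be sketched like this. The names choose_spawn() and online_cpus() are hypothetical, and sysconf(_SC_NPROCESSORS_ONLN) is assumed as the way to learn the number of CPUs (it is a glibc extension, not plain POSIX):

```c
#include <assert.h>
#include <unistd.h>

enum spawn_kind {
    SPAWN_REAL_THREAD, /* backed by a kernel-visible thread */
    SPAWN_CONTEXT      /* "fake" userspace context added to the tree */
};

/* Number of CPUs currently online, never less than 1. */
static long online_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);

    return n > 0 ? n : 1;
}

/* Real threads are created only while there are fewer of them than CPUs;
 * every later request becomes a userspace context. */
static enum spawn_kind choose_spawn(long real_threads, long ncpus)
{
    assert(ncpus > 0 && real_threads >= 0);
    return real_threads < ncpus ? SPAWN_REAL_THREAD : SPAWN_CONTEXT;
}
```

A real library would also have to count the initial thread of the process as one of the real threads.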
/devel/threading :: Link / Comments (0)
Buy Sealand.
Sealand is a small island/platform
situated approximately 7 nautical miles from the coast
of the UK in the North Sea.
It has sovereignty.
One can buy (and/or develop) her/his own nation there with its own laws, which in turn
allows registering firms there without any taxes, even better
than today's offshores.
Its price is $500 million, but there are other islands too, whose
prices start from just $50k.
For example this
set of islands with a price of less than $250k,
although I'm not sure that one can declare sovereignty there.
/other :: Link / Comments (0)
Winter in Russia.
One year ago the temperature in Moscow was 2 degrees higher
than in Antarctica; it was the coldest winter
in history (with instrumental measurements).
Now we have about +/-1 degrees Centigrade - the warmest
winter in history.
Winter is a time of the planned Matrix update. To get calculation
resources for garbage collection, the Matrix
shortens daylight, removes leaves from trees,
and paints the sky a uniform gray colour. This reduces picture calculation.
Previously the Matrix covered the whole surface with uniform
white snow, but with new powerful servers it is not needed
anymore. They say that after the next upgrade there will not be
any need for a dedicated winter at all.
/other :: Link / Comments (0)
Thu, 11 Jan 2007
Userspace threading: its benefits and drawbacks.
Benefits.
1. Fast scheduling.
There is no need to cross the userspace/kernelspace boundary to schedule
a new thread for execution (just watch what happens with a userspace network stack
compared to the kernel's one when a lot of syscalls are performed
for receiving/sending small packets).
2. Fast thread creation and destruction.
It just becomes an allocation of a structure in userspace,
with no need for the full creation process which is performed
in the clone() syscall.
3. Smaller number of cache misses.
Since there is only one process instead of several threads,
cache locality is increased greatly, with a reduced number
of misses.
Drawbacks.
1. Scheduling fairness.
Since the kernel does not know about the multiple threads behind a given process,
it can not give it an appropriate number of timeslices for execution.
This can be solved either by tighter collaboration of the userspace and kernelspace
schedulers or simply by raising the process' priority via its nice value.
2. All communications are performed through one kevent pipe.
This can be problematic (although the interface was specially designed to be scalable).
3. Complex code for good SMP scalability and the userspace scheduler.
I wanted to put it into the 'Benefits' section, since that is exactly why I started
this project.
/devel/threading :: Link / Comments (0)
Threading issues and ways to resolve them.
1. Signals.
POSIX requires that signals be delivered on a per-thread basis,
but the signal handler, and thus the fact that a signal is ignored or not,
is a per-process property. With kevent's ability
to deliver signals through its queue, the problem can be solved
in a very elegant way - the main process receives a signal
event notification through its kevent queue and then checks
all its threads which have that signal unblocked; all appropriate
threads receive the signal through the alternative signal stack.
2. Kevent/poll usage in the threads.
poll() and select() must be translated into kevent requests in the syscall wrapper,
for example the way I implemented epoll on top of kevent,
and then that event will be put into the main kevent queue.
3. Sleep and similar system calls.
Kevent has timer notifications which will be used to emulate such calls.
Calls for POSIX timers can be emulated through kevent POSIX timers support,
but I will probably not consider this for the initial implementation.
4. Blocking inter-process communications like semaphores.
They must be converted to userspace kevent notifications.
All of the above can look like the old LinuxThreads days before NPTL, when
there was a special management thread which performed a lot of
that functionality (namely signal handling and resource cleaning; the latter is not
a problem for this new implementation, since all resources will be automatically
cleaned when the process exits, and no process-visible resources like file
descriptors are closed on thread cancellation, and signals can be
handled perfectly with kevent's capabilities), but now it has moved
into a layer between the kernel (or glibc for the initial implementation)
and the application (i.e. a scheduler; I think it is the correct name, since the main
task of that layer is exactly scheduling). But actually it does not differ at all from what we
have right now with NPTL and the 1-on-1 thread model - exactly the same
tasks are performed by the kernel, but with additional layer crossing
overhead.
/devel/threading :: Link / Comments (0)
Initial thoughts about userspace threads (or M-on-N threading model).
Let's see what we already have.
Glibc provides us with makecontext()
and friends, which are essentially a
part of the userspace execution mechanism -
one can create a context, run it, swap it and so on.
That is something I want to implement, except for its
problems - a context switch can be performed from an
outside thread (that is how IBM NGPT was implemented);
it is not the main issue, although I really do not like
such an approach. The main problem is the fact that
if such a context is going to block, that fact can not be
detected from other contexts, and thus it is impossible
to swap the context with another one. Even if some check is
done in each syscall, or even if each syscall is
a rescheduling point, that means that either each syscall
must be non-blocking, or the whole process will go to sleep
in the syscall, since the kernel does not know that there are
several contexts in the same process.
So, the solution is to have some kind of a thin layer
between the kernel and userspace (in the real world
it is called glibc), which will convert all syscalls
into non-blocking operations (including nanosleep()
and the like), and keep track of what each context performed.
In practice a glibc rewrite is not what I would like to do;
instead some layer on top of it will be implemented,
which will convert syscalls into kevent operations
and become a rescheduling point. I will even consider
not reimplementing the known syscalls themselves, but instead (at least
for the initial implementation) introducing new calls
which will be wrappers for the known ones - like new_write(),
a wrapper based on kevent and the new threading model,
which will set up all appropriate requests (like POLLIN)
and, if possible, call write() itself. When all
execution contexts are put to sleep, the whole
process will park itself in a waiting syscall like
kevent_get_events().
The main issues with such an approach are the following:
- scheduling algorithm
- SMP scalability
- syscall wrappers in glibc or completely new calls (as described above)
At least the first two issues are interesting technical challenges;
the last one will first be implemented with new calls.
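To make the new_write() idea a bit more concrete, here is a rough sketch under strong assumptions. Kevent is not available in a stock kernel, so plain poll() stands in for the place where the real library would queue a kevent request and switch to another context; wait_writable() is a hypothetical helper and only the name new_write() comes from the text above:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical rescheduling point. In the real library this would register
 * a readiness request and swap to another context; here poll() is only a
 * stand-in, so the whole process sleeps. */
static int wait_writable(int fd)
{
    struct pollfd p = { .fd = fd, .events = POLLOUT };

    return poll(&p, 1, -1) < 0 ? -1 : 0;
}

/* Wrapper around write(): makes the descriptor non-blocking, tries the real
 * write() and waits at the rescheduling point only when it would block. */
static ssize_t new_write(int fd, const void *buf, size_t len)
{
    int flags = fcntl(fd, F_GETFL);

    assert(buf != NULL || len == 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;
    for (;;) {
        ssize_t n = write(fd, buf, len);

        if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
            return n;
        if (wait_writable(fd) < 0 && errno != EINTR)
            return -1;
    }
}
```

Note that this sketch leaves O_NONBLOCK set on the descriptor, which a real wrapper would have to hide from the application.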
/devel/threading :: Link / Comments (0)
Filesystem corruption bug recently found in Linux kernel.
The LWN.net article about it
clearly shows how complex VFS is, but its conclusion and Linus'
words about buffer heads are interesting. The conclusion is basically
'do not use buffer heads'. Indeed, all the time I worked with VFS
(kevent and AIO, receiving zero-copy, test block device for acrypto)
I never ever tried to use buffer heads - why are they needed, when
these days we operate with pages already - and eventually
filesystems operate with pages too - they have a special set of
callbacks to write pages, read them and so on.
So, a filesystem must be simple
in that regard - do not split pages into buffer heads, always work with pages and provide
appropriate callbacks where they are needed (inode operations at least),
and that is how my FS will work.
/devel/fs :: Link / Comments (0)
Wed, 10 Jan 2007
Climbing evening. First one this year.
After a week of continuous drinking, lazy-boom-slacking
and other non-sport stuff, I eventually started training.
And a miracle has happened - Grange brought his fifth point
to the climbing zone too. So I started to climb
on the walls, not the usual boulderings and traverses -
although not a lot of traces were completed (and only one of
them was new, which I miserably failed on-sight,
not because it was complex, but only due to the fact
that it was the last trace after 3 hours of climbing, and
I tried it on-sight), eventually I completed it.
And now (I write this in the middle of Jan 11,
when I eventually brought myself to the office) I feel that
every piece of my body is in pain. It is an extremely
pleasant masochistic feeling - I like it.
I think if you have awakened and nothing aches, then
you are likely dead.
Excellent time.
/life :: Link / Comments (0)
Further development. Ideas. Plans. Agenda.
It looks like
kevent
is going to be frozen (see this
thread in linux-kernel@ about the too fast rate of new code creation).
Netchannels
are also implemented.
So I have split my TODO list
into TODO and DONE parts. Currently it has following items in TODO section:
- Integration of kevents into mainline
- Some thoughts about FS (especially about journalling and WAFL approach (I think log-structured
fs is really progressive idea)), so it could be used with receiving zero-copy,
or even be a network filesystem with distributed capabilities.
Also need to consider filesystem stress testing and emulation tool.
- Some thoughts about different threading models. Especially after this
analysis.
Things which originally appeared in TODO, but were implemented and essentially completed:
- Network tree allocator,
full sending and receiving zero-copy networking
- Complete userspace network stack
(which actually requires just to sync with TCP stack used in netchannels)
and move netchannels into userspace (with help of full zero-copy support and userspace stack)
- Fast NAT, which will not use Linux connection tracking system (which is extremely slow)
- True asynchronous IO. My thoughts are described here
and here.
Actually kevent integration could be put into the DONE section too, since
it does not require too much effort to maintain it in a separate tree
with regular syncs with mainline - no one requests more, so it will silently
live and work in its own repository.
Although netchannels do not work with my
network tree allocator,
I will not work on this project for a while due to
the absence of a real need for zero-copy.
The next item in the TODO list is the filesystem and related development - it is indeed a very interesting
task, and my next killer project will definitely be a new filesystem, which, as a bare minimum,
will be faster than the speed of light and will scale beyond the universe.
But it is a very complex task, and for some time I want to do something
different from the kernel, but no less interesting.
So I selected a new threading model implementation - essentially it will be the so-called N:M threading model,
implemented on top of the POSIX interface - I want to create a library which
could be dropped in instead of glibc's libpthread with
at least the main functionality.
Or maybe just take a vacation and move to the edge of the Earth.
Stop, since the Earth is roughly a sphere, I am already on the edge. Crap.
Hmm, or maybe drop my current work and organize something of my own - the more I
do my own projects, the more I like the idea, but not right now.
Will go hacking.
/devel :: Link / Comments (0)
New kevent 'take32' release.
This early morning hack (I hope no one at work reads what I'm doing) contains the following major changes:
- Added aio_sendfile_path() - this syscall allows asynchronously transferring
a file specified by the provided pathname to the destination socket.
The opened file descriptor is returned.
- Added a trivial scheduler which selects the execution thread. It allows
specifying a given thread 'by hand', but since kaio provides '-1' it uses
round-robin to get the processing thread. In theory it can be bound to
scheduler statistics or gamma-ray receiver data.
- A number of bug fixes in the kevent based AIO
mpage_readpages().
A benchmark of transferring 100 1MB files (the files are in the VFS cache already) using
sync sendfile() against aio_sendfile_path()
shows about a 10 MB/sec performance win (78 MB/s vs 66-72 MB/s over a 1 Gb network;
the sending server is a single-CPU AMD Athlon 64 3500+)
for aio_sendfile_path().
Well, call me a loser, but I started a 3-day resending timeout.
I know it is annoying and disturbing, and
I really doubt it is a good way to tell the world about my work, and I bet you
are all tired of those pathos words, but I really would like to get some feedback,
since I want to start working on network AIO, but sending mails into
the unfeedbackable (I used the word 'destination' for hackers and maillists,
but you actually know what I mean) really does not motivate me for that.
/devel/kevent :: Link / Comments (0)
Mon, 08 Jan 2007
New kevent 'take31' release.
Short changelog:
- AIO state machine
aio_sendfile() implementation
- moved kevent_user_get/kevent_user_put into header
- use *zalloc where needed
aio_sendfile() is implemented as described
here,
except for the destructor callbacks,
which are essentially the same, except that they reschedule AIO processing.
Benchmarks of simple sendfile() vs. aio_sendfile() did not show a noticeable
win for either approach, but I want to note that the receiving side is
not the best in my case (I managed to create a test where usual
sendfile() got stuck until a signal was received; the same happened with aio_sendfile()
too, but the latter can be buggy).
It would be good to use it in lighttpd (I will create a patch after
some feedback is received and the approach is considered good).
The AIO state machine is a base for network AIO (which becomes
quite trivial), but I will not start the implementation until
the roadmap of kevent as a whole and of the AIO implementation becomes clearer.
Patches and a userspace test application are available in the archive
from the kevent homepage.
/devel/kevent :: Link / Comments (0)
Merry Christmas!
Although it was yesterday, we celebrated Christmas and Mephody's move from Moscow to Limerick, Ireland.
We sat first in "5 oborotov" on Sadovaya-Triumfalnaya,
where Schtrom's family, Lyasha with his wife and Mephody's old friend Andrew with his wife
joined us. We sat there until midnight, and then moved to the next bar - eventually we landed
in "Aero cafe" on Pokrovka, where we spent the whole night and went home around 6 A.M.
What I got from that time, besides a good time with old friends,
is a strong feeling that the most interesting and fun moments always happen when I'm drunk.
I do not know why.
And that I am too old for modern electronic music and clubs.
And it looks like I am finally tired of drinking - and since Meph returned to Ireland
and all the others are too inert to be dragged to a restaurant, no new adventures
are on the nearest horizon, so I can concentrate on hacking (which has actually happened only
a couple of times this year), climbing, flat development and other healthy things.
/life :: Link / Comments (0)
Sat, 06 Jan 2007
Kevent celebrates its first birthday!
As a present I have created
an initial aio_sendfile() implementation.
/devel/kevent :: Link / Comments (0)
Initial aio_sendfile() implementation has been committed into kevent tree.
It is still very rough and definitely must be cleaned up (and some known bugs fixed),
but the major part is done.
aio_sendfile() consists of two major parts: the AIO state machine and
the page processing code.
The former is just a small subsystem which allows queueing callbacks for invocation
in process context by a pool of kernel threads. Caches of callbacks can be queued
to the local thread or to any other specified one. Each cache of callbacks
is processed until no callbacks remain in it, and callbacks can requeue themselves into
the same cache.
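As an illustration only (this is not the kevent code), the "cache of callbacks" idea can be sketched in plain userspace C. All names here (cb_cache, cache_queue(), cache_process(), countdown_cb()) are hypothetical, and the real subsystem runs callbacks from a pool of kernel threads rather than from a simple loop:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_MAX 16

struct cb_cache;

/* Hypothetical callback type: gets the cache it runs from plus private data. */
typedef void (*aio_cb_t)(struct cb_cache *cache, void *data);

struct cb_entry {
    aio_cb_t fn;
    void *data;
};

/* A fixed-size ring buffer standing in for one cache of callbacks. */
struct cb_cache {
    struct cb_entry ring[CACHE_MAX];
    size_t head;
    size_t count;
};

static void cache_init(struct cb_cache *c)
{
    c->head = 0;
    c->count = 0;
}

/* Append a callback to the tail of the cache. */
static void cache_queue(struct cb_cache *c, aio_cb_t fn, void *data)
{
    assert(c->count < CACHE_MAX);
    size_t tail = (c->head + c->count) % CACHE_MAX;
    c->ring[tail].fn = fn;
    c->ring[tail].data = data;
    c->count++;
}

/* Process the cache until it is empty; returns the number of invocations. */
static size_t cache_process(struct cb_cache *c)
{
    size_t calls = 0;

    while (c->count > 0) {
        struct cb_entry e = c->ring[c->head];
        c->head = (c->head + 1) % CACHE_MAX;
        c->count--;
        e.fn(c, e.data);    /* the callback may requeue itself here */
        calls++;
    }
    return calls;
}

/* Example callback: decrements a counter and requeues itself until it hits zero. */
static void countdown_cb(struct cb_cache *c, void *data)
{
    int *left = data;

    if (--*left > 0)
        cache_queue(c, countdown_cb, data);
}
```

Queueing countdown_cb() once with a counter of 3 makes cache_process() invoke it three times before the cache drains.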
The real work is done in the page processing code - code which populates pages into
the VFS cache and then sends them to the destination socket via ->sendpage().
Unlike the previous
aio_sendfile() implementation, the new one does not require low-level filesystem-specific
callbacks at all; instead I extended struct address_space_operations with a
new member called ->aio_readpages(), which is exactly the same as
->readpage() (read: mpage_readpages()) except for
different BIO allocation and submission routines.
I changed mpage_readpages() to pass mpage_alloc()
and mpage_bio_submit() to a new function called __mpage_readpages(),
which is exactly the old mpage_readpages() but invokes the provided callbacks
instead of calling the old functions directly. mpage_readpages_aio() provides
kevent-specific
callbacks, which call the old functions but with different destructor callbacks.
Those are essentially the same, except that when a page becomes uptodate it is not unlocked,
so that it cannot be removed until it is sent; only then is it unlocked.
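The difference between the two completion paths can be modelled in a few lines of userspace C. This is only a toy model under loud assumptions: struct page_model, read_done_fn and all function names below are hypothetical, the "I/O" completes immediately, and the completion hook is passed directly instead of going through separate allocation and submission callbacks as the kernel patch does:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page cache page; not the kernel's struct page. */
struct page_model {
    bool locked;
    bool uptodate;
};

/* Plays the role of the BIO destructor: runs when the read has finished. */
typedef void (*read_done_fn)(struct page_model *page);

/* Ordinary read path: mark the page uptodate and unlock it right away. */
static void read_done_sync(struct page_model *page)
{
    page->uptodate = true;
    page->locked = false;
}

/* AIO path: mark the page uptodate but keep it locked until it is sent. */
static void read_done_aio(struct page_model *page)
{
    page->uptodate = true;
}

/* Shared worker, in the spirit of __mpage_readpages(): same logic, pluggable hook. */
static void readpages_common(struct page_model *pages, size_t nr, read_done_fn done)
{
    assert(done != NULL);
    for (size_t i = 0; i < nr; i++) {
        pages[i].locked = true;     /* page is locked while I/O is in flight */
        done(&pages[i]);            /* pretend the read completed immediately */
    }
}

static void readpages_sync(struct page_model *pages, size_t nr)
{
    readpages_common(pages, nr, read_done_sync);
}

static void readpages_aio(struct page_model *pages, size_t nr)
{
    readpages_common(pages, nr, read_done_aio);
}

/* "Send" every uptodate page that is still locked, unlocking it afterwards. */
static size_t send_pages(struct page_model *pages, size_t nr)
{
    size_t sent = 0;

    for (size_t i = 0; i < nr; i++) {
        if (pages[i].uptodate && pages[i].locked) {
            pages[i].locked = false;
            sent++;
        }
    }
    return sent;
}
```

After readpages_aio() the pages are uptodate but still locked, and only send_pages() releases them, which mirrors the ordering described above.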
The code does contain a bug (at least one) that I know about - the subsequent attempt to send pages happens not
after the BIO is ready and thus pages are populated into the VFS cache (i.e. pages are marked as uptodate),
but repeatedly in the state machine (rescheduling must happen in the BIO destructor, not in the code
which allocates pages). Another issue is that it is currently impossible to receive a kevent notification
when aio_sendfile() has really completed.
/devel/kevent/aio :: Link / Comments (0)
Thu, 04 Jan 2007
Visited Alma Mater - walked and searched for the ghosts from the past...
I visited MIPT and walked around the campus this evening, and
I did not meet anyone - not even a remotely familiar face or voice.
Things change.
Then I moved to the place of Mephody's wife Ira's parents (it is the house next to
the one where I bought my apartment, and they stay there while visiting Moscow) to drink my new shiny
liter of Irish whiskey, which he brought me from Ireland.
It was a fun time with Meph and Ira - we stopped at about 5 A.M. after recalling
friends and talking about life, its changes, and Alma Mater. Eventually
we gossiped about politics and music. We compared life in Limerick (Ireland) and
Moscow (Russia), and found that no matter how strange life here is,
it is much-much-much more interesting than that in stable
old Europe. Meph and Ira concluded that they would like to return to Russia,
or move to France - they visited Paris recently and showed a lot of interesting photos
from the trip - the Louvre, Versailles, the Eiffel Tower, Notre Dame de Paris - they are
beautiful places, and the French are good and interesting people.
/life :: Link / Comments (0)
The AIO (sub) state machine has been completed.
It is a small subsystem, which lives in the kernel/kevent/kevent_aio.c
file and allows queueing and asynchronously invoking
callbacks, which are intended to populate pages into the VFS cache, send data
to the destination socket, copy data to/from userspace and so on.
The real working callbacks themselves are not implemented yet.
I will only implement three of them - open a file by filename,
populate the file's pages into the VFS cache, send pages to the destination socket.
I will probably also add writing a page to userspace.
This set will allow implementing aio_sendfile()
as a sequence of those callbacks - open the file by its path, then populate
its pages into the VFS cache in chunks or one by one, and eventually
send them to the destination socket.
There is a problem with the order of sending one page and populating
its neighbour though, since having the whole VFS cache filled with
locked pages from one file is not a good idea, but locking is required to
allow the sending itself - so that the page is not swapped out. I will
either stop further populating until pages are sent, or will
not deal with this at all - depending on results from the initial
implementation.
Each subtask above - i.e. each callback - is an elementary chunk, which
will be handled by kevent. Completion of the whole task
will be handled by kevent too.
/devel/kevent/aio :: Link / Comments (0)
Wed, 03 Jan 2007
Initial thoughts on the 'true AIO'.
Here was
the first announcement of the idea, and now I will expand on it a bit.
This was written after some study of Intel's Dan Williams' work
on async copy found here;
the whole thread
may also be interesting for those who want to know the status of AIO development
and some ideas about its improvement.
A generic solution must be used to select appropriate device to perform
actual data processing.
We had a very brief discussion about asynchronous crypto layer
(acrypto)
and how its ideas could be used for async dma engines - user should not
even know how his data has been transferred - it calls async_copy(),
which selects appropriate device (and sync copy is just an additional
usual device in that case) from the list of devices which export their
functionality. Selection can be done in millions of different ways, from
getting the first one from the list (this is essentially how your
approach is implemented right now) to using special (including run-time
updated) heuristics (like it is done in acrypto).
Thinking further, async_copy() is just one case of the async class of
operations. So the same logic as above must be applied on this layer too.
But
layers are the way to design protocols, not implement them.
David Miller on netchannels.
So, user should not even know about layers - it should just say 'copy
data from pointer A to pointer B', or 'copy data from pointer A to
socket B' or even 'copy it from file "/tmp/file" to "192.168.0.1:80:tcp"',
without ever knowing that there are sockets and/or memcpy() calls,
and if user requests to perform it asynchronously, it must be later
notified (one might expect, that I will prefer to use kevent :)
The same approach thus can be used by NFS/SAMBA/CIFS and other users.
That is how I start to implement AIO (it looks like it becomes popular):
- system exports set of operations it supports (send, receive, copy,
crypto, ....)
- each operation has subsequent set of suboptions (different crypto
types, for example)
- each operation has set of low-level drivers, which support it (with
optional performance or any other parameters)
- each driver when loaded publishes its capabilities (async copy with
speed A, XOR and so on)
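The registration and selection scheme from the list above could look roughly like the following userspace sketch. Everything in it is an assumption made for illustration: the aio_op values, struct aio_driver, the numeric "speed" score and the pick-the-fastest policy are hypothetical, and real selection could just as well take the first driver or use run-time updated heuristics:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical set of operations exported by the system. */
enum aio_op { AIO_OP_COPY, AIO_OP_SEND, AIO_OP_RECEIVE, AIO_OP_CRYPTO };

/* What a driver publishes about itself when it is loaded. */
struct aio_driver {
    const char *name;
    enum aio_op op;
    unsigned int speed;     /* made-up relative performance score */
};

#define MAX_DRIVERS 8

struct aio_registry {
    struct aio_driver drv[MAX_DRIVERS];
    size_t nr;
};

/* Register a driver; returns 0 on success, -1 when the table is full. */
static int aio_register(struct aio_registry *r, const char *name,
                        enum aio_op op, unsigned int speed)
{
    assert(name != NULL);
    if (r->nr == MAX_DRIVERS)
        return -1;
    r->drv[r->nr].name = name;
    r->drv[r->nr].op = op;
    r->drv[r->nr].speed = speed;
    r->nr++;
    return 0;
}

/* Pick the fastest driver that supports the operation, or NULL if none does. */
static const struct aio_driver *aio_select(const struct aio_registry *r,
                                           enum aio_op op)
{
    const struct aio_driver *best = NULL;

    for (size_t i = 0; i < r->nr; i++) {
        const struct aio_driver *d = &r->drv[i];

        if (d->op == op && (best == NULL || d->speed > best->speed))
            best = d;
    }
    return best;
}
```

With a plain memcpy-style driver and a faster DMA-style driver both registered for AIO_OP_COPY, aio_select() returns the DMA one, and the synchronous copy stays available as just another device in the list.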
From the user's point of view, aio_sendfile() or async_copy() will look
as follows:
- call aio_schedule_pointer(source='0xaabbccdd', dest='0x123456578')
- call aio_schedule_file_socket(source='/tmp/file', dest='socket')
- call aio_schedule_file_addr(source='/tmp/file',dest='192.168.0.1:80:tcp')
or any other similar call, and then wait for the returned descriptor in kevent_get_events() or
provide its own cookie in each call.
Each request is then converted into a FIFO of smaller requests like 'open file',
'open socket', 'get in user pages' and so on, each of which should be
handled on the appropriate device (hardware or software); completion of
each request starts processing of the next one.
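A minimal sketch of such a FIFO of sub-requests, in userspace C and under stated assumptions: struct request, the step functions and request_run() are hypothetical names, each step here completes synchronously, and a real implementation would advance to the next step from the device's completion path instead of from a loop:

```c
#include <assert.h>
#include <stddef.h>

enum step_id { STEP_OPEN_FILE, STEP_OPEN_SOCKET, STEP_GET_PAGES, STEP_SEND };

struct request;

/* A sub-request handler; returns 0 on success, nonzero on failure. */
typedef int (*step_fn)(struct request *req);

#define MAX_STEPS 8

struct request {
    step_fn steps[MAX_STEPS];   /* FIFO of smaller requests */
    size_t nr_steps;
    size_t next;                /* index of the step to run next */
    int completed;              /* set once every step has finished */
    int log[MAX_STEPS];         /* order in which steps actually ran */
    size_t nr_log;
};

/* Append a sub-request to the tail of the FIFO. */
static void request_add(struct request *r, step_fn fn)
{
    assert(r->nr_steps < MAX_STEPS);
    r->steps[r->nr_steps++] = fn;
}

static int step_record(struct request *r, enum step_id id)
{
    r->log[r->nr_log++] = id;
    return 0;
}

static int step_open_file(struct request *r)   { return step_record(r, STEP_OPEN_FILE); }
static int step_open_socket(struct request *r) { return step_record(r, STEP_OPEN_SOCKET); }
static int step_get_pages(struct request *r)   { return step_record(r, STEP_GET_PAGES); }
static int step_send(struct request *r)        { return step_record(r, STEP_SEND); }

/* A step that always fails, to show that later steps are never started. */
static int step_fail(struct request *r)
{
    (void)r;
    return -1;
}

/* Run steps in order; completion of one step starts the next one. */
static int request_run(struct request *r)
{
    while (r->next < r->nr_steps) {
        int err = r->steps[r->next](r);

        if (err)
            return err;         /* stop at the first failed sub-request */
        r->next++;
    }
    r->completed = 1;           /* this is where the user would be notified */
    return 0;
}
```

The point where completed is set is where a notification to the user (for example through kevent) would naturally be raised.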
Reading microthreading design notes
created by Zach Brown (Oracle), I recalled the
comparison of the NPTL and
Erlang threading models - they are _completely_ different
models, NPTL creates real threads, which is supposed (I hope NOT)
to be implemented in microthreading design too. It is slow.
(Or is it not, Zach, we are intrigued :)
It's damn bloody slow to create a thread compared to the correct non-blocking
state machine. TUX state machine is similar to what I had in my first kevent
based FS and network AIO patchset, and what I will use for current async
processing work.
A bit of empty words actually, but it can provide some food for
thought.
/devel/kevent/aio :: Link / Comments (0)
Tue, 02 Jan 2007
Happy New Year!
I congratulate you on this excellent day and wish you to be even cooler than
you were!
I celebrated the New Year perfectly well in a small nice cellar - we had
excellent company, tasty food and drinks, a lot of fun and cool things.
There are some things I can not stop and get a bit mad when doing -
they include drinking in good company (actually I do not drink too much),
good-idea-hacking (well, I do hack a lot, which is frequently frowned
upon by those who lose my attention because of it) and some other
issues I will not write about here (indeed), and a lot of that happened
during the New Year vacations.
It was just a bloody cool celebration!
Since I'm quite alive and even feel good, I will write bits about myself
in the way Rusty did.
Here we go:
- My first contribution to free software was a kernel patch for the ELF or 'misc'
binary format loader. Sounds too cool, doesn't it? But... - it was a patch which added
an error check (three lines!) for some function, copied from another loader.
But it was my first one.
- When I was 25 I did not take ballroom or any other dance classes; instead
I played guitar (actually I already do not recall it completely, although
not that much time has passed). Now I have a trumpet, but can not play it (yet),
can only torture my neighbours' ears.
- When I was in primary school, I played chess. I did it well, so well that
I even recall how the pieces move and managed (several years ago) to win a couple of times against
the current second-place winner of
the World Youth Chess Championship (although among kids younger than 12).
When we played, Illya was 6 though.
- When I was in high school, I did karate and aviation modelling (the former
only for about half a year, although I have kicked asses a couple
of times since then and still can almost do the splits, the latter for about two or three years).
When I was in university I did football, skiing and gym. The latter for about
half a year (do not even know why), the former two for about 3-4 years (since then
I hate racing skis). Now I do (and like) climbing.
- I am not brilliant, although quite cool and have a lot of good friends
all over the world. I do not have a girl, instead they have me.
I would like to feature Grange
and David Miller.
Likely others do not have blogs.
I expect this year to be my year. Although I have no money at all, have debts,
a boring job and other uninteresting stuff, I do expect to achieve world domination.
At least in some specific areas.
So, stay tuned and have a nice year!
/life :: Link / Comments (0)