Zbr's days.
October
Sun Mon Tue Wed Thu Fri Sat
     
11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29 30 31  
2008
Months
OctNov Dec

About :: TODO :: Blog :: RSS :: Old blog :: Projects :: GIT :: Gallery :: Notes

Sun, 18 Feb 2007

M:N threading model news.


I've fixed number of bugs (actually all known) and optimized some pathes a bit. Comparison of the create/run_empty_function/destroy thread against NPTL can be shown on this picture.

NTL vs. NPTL

NTL is about 30% faster, its maximum time is around 18-20 microsecods, NPTL sometimes fires upto 40 microseconds.

/devel/threading :: Link / Comments (0)


Tue, 13 Feb 2007

All magic behind segments access has been uncovered.


So, I've written simple module which dumps Global Descriptor Table for each CPU in system and started my test application, which essentialy does "movl %%gs:0, %0", so I just present parts of the output so things just become clear and obious:

# gdt_reader /dev/gdt
...
 0/ 2:  6: TLS segment #1 [ glibc's TLS segment ] : start: b7ff06c0, size: 4294963200, 
 seg_type: 0x3, dpl: 0x8, AVL: 1, SEG_PRESENT: 1, DESC_TYPE: code or data, OP_SIZE: 32 bits.
...
 0/ 2: 14: default user CS : start: 00000000, size: 4294963200, 
 	seg_type: 0xb, dpl: 0x8, AVL: 0, SEG_PRESENT: 1, DESC_TYPE: code or data, OP_SIZE: 32 bits.
 0/ 2: 15: default user DS : start: 00000000, size: 4294963200, 
 	seg_type: 0x3, dpl: 0x8, AVL: 0, SEG_PRESENT: 1, DESC_TYPE: code or data, OP_SIZE: 32 bits.
...

ds: reg: 15, table: GDT, rpl: 0x3, val: should fail.
gs: reg: 6, table: GDT, rpl: 0x3, val: b7f736c0.
As we can see, %DS register is setup for the whole adress space (its size is 4gb minus one page), as long as %CS, so it is impossible to dereference them.
%GS points to some memory (accessible indirectly as "%gs:0"), where thread local storage is setup by kernel and libc, its size is the same 4gb minus one page.
That page is likely created to cover GDT itself.
Segmentation fault behind "movl %%ds:0, %0" becomes obvious too - it tries to dereference zero address, which ends up with SIGSEGV.

So, instead of searching for documentation on "movl %%gs:0, %0" command (and mainly to determine what is that magical ':0' is about) I created kernel module, dumped and analyzed GDT and finally detected how gcc inline asm treats that.

Btw, notice that %GS segment start is the same as result of the "movl %%gs:0, %0" command. It is impossible to actually get segment start, it just happens that glibc puts pinter to TCB (thread control block) as first pointer in the struct pthread (which is stored in %gs). TCB structure itself has pointer to itself as a first member.
It is only true on x86, other arches happily have special real register for that purpose instead of ugly indirect dereferencies.

/devel/threading :: Link / Comments (0)


Mon, 12 Feb 2007

Fun with kernel.


Instead of read some doc about inline gcc syntax (and mainly to understand what does it mean "movl %%gs:0, %0" and why it faults for %%ds) I write a module which will allow to dump GDT on demand - this will allow to check my hypotesis about segment address dereferencing idea described previously.

Well, actually I read a lot of inline asm docs for the last several days, namely following (for interested reader - that is mostly small howtos and simple examples):

and still in doubt.

/devel/threading :: Link / Comments (0)


The magic behind %gs access.


Ok, I've cought bodhi breath and understood, what did that mean.

Each descriptor is actually index shifted to 3 bits left, least significant bits are LDT/GDT selector and permissions bits, so %gs equal to 0x33 is actually descriptor number 6, which is first TLS descriptor in GDT.
As we know, each entry in GDT describes some region of memory, so construction "%%gs:123" is a 123 offset inside the area described by that descriptor, as I understand that, although it still requires to think why similar access to %%ds ends up with segmentation fault (I could understand absence of reading permissions for %%cs, but why data segment is not readable, I still can not understand).

That automagically means, that %%gs (which is priveledged descriptor as all other segment registers) can not be used to store information about userspace threads in M:N model, since it always points to the TLS area for the main process, so to get currently running userspace thread in M:N model stack pointer aligned to stack size can be used, but that requires that all stacks have the same size (which is true right now anyway).
Having ability to access curent thread not through ugly pointer allows to naturally scale on SMP, that is the only reason to invent such mechanism.

/devel/threading :: Link / Comments (0)


Sun, 11 Feb 2007

Some thoughts about M:N threading nad NPTL.


While I was busy kicking various AIO designs I completely missed time to do real interesting low-level threading stuff. So, some thoughts.

Locking into glibc NPTL sources I noticed that x86 uses %gs register as so called thread-pointer register. To get current thread pointer glibc uses following code:

# define THREAD_SELF \
  ({ struct pthread *__self;						      \
       asm ("movl %%gs:%c1,%0" : "=r" (__self)				      \
   	  : "i" (offsetof (struct pthread, header.self)));		      \
       __self;})
%gs itself shows to the descriptor in GDT, but it is always 51, while GDT only holds 32 descriptors, %%cs for example contains 115 - one question (the last minute though - GDT entries are of 8 bytes each, so it is possible, that value stored in register should be divided to 8 to get actual entry... Will check out tomorrow.).
Another one - what does it mean "%%gs:%c1" - it will be transferred as "%%gs:immediate_parameter_1", i.e. "%%gs:some_num", for example "%%gs:0" works, but it does not work with usual register like %%eax, only %%gs. Similar access to %%cs, %%ds and other segment registers ends up with segmentation fault.

It looks like walking in the dark room with exact knowledge that there is at least one rake...
But I will find an exit soon.

/devel/threading :: Link / Comments (0)


Sun, 04 Feb 2007

Major step in M:N threading implementation - moved to partial userspace context usage.


As a first step I just started to use struct ucontext instead of sigframe - this allows to use getcontext()/makecontext() for initalization, which is a bit faster than signal waiting, but potentially can be much faster (I just need to remove a signal mask initialization through syscall in getcontext()). It also allows some code refactoring, which hides per-arch signal frame definition into implementation files from headers.
I created simple convertor from signal stack frame (mainly sigcontext() structure) into ucontext (mcontext_t) structure, which is used as storage. Although they have exactly the same binary layout, x86 segment registers should not be copied. Current code does not properly saves/restores FPU state too, but this requires more learning.
Next step is to implement getcontext() by hands so that code would not initialize signal mask at all. Then I will think about proper SMP scaling, which will use stack tricks instead of 'current' pointers.

There is a problem though - new thread is started only after parent's timeslice is over, i.e. it is possible to create set of threads with empty functions and they will exists quite for a while - for example 500 threads started in a loop still are created too quickly to be called and stopped to release memory, but for example 100 NTL threads creations are about 1.6 times faster in that synthetic benchmark than NPTL ones. I will think about resolution.
API is also very limited in this version yet.

/devel/threading :: Link / Comments (0)


Thu, 01 Feb 2007

Generic AIO by scheduling stacks by Zach Brown.


Zach Brown has announced his AIO patchset which works as multistack - i.e. each process has set of AIO related stacks (read: it has set of functions blocked in syscalls), which then are switched by scheduler in kernelspace.

This idea is exactly how M:N therading library works but in kernelspace. Another difference is that there is only one scheduler, but in M:N threading library userspace needs to have its own.
A you may noticed, userspace M:N threading works faster in all aspects than kernelspace scheduling/creation, so likely Zach's approach, although there are some nitpicks - it uses too much calls to schedule() to make my something-needs-clarification counter low. Exactly the same problem exists in NTL - it needs rescheduling for start, so I'm implementing better context switching methods similar to setcontext() calls. Another caveat is per-syscall allocation of the new structure, likely it can be worked out by having cache of allocated, but unused ones.
So, number of rescheduling and per-syscall allocations scare me in that approach.

AS ingo Molnar noted "all these LWP, fibre, firbril, micro-thread or N:M concepts fit? Most of the time they are just a /weakening/ of the [the most easy to program one] concept".
And he is right, state machine must be used to achieve the maximum peroformance, that is what I apln to put into M:N threading library, which will just replace blocked syscalls with kevent-driven requests, which in turn form some kind of a state machine, which, by the help of the library, will be removed from higher userspace layer, so programmer would never knew about it.
Although he is not right that kernel->user context switch is a way about user->user times, at least with thread creation/allocation (which is some way is used in this AIO model) its price is about 2-20 times slower, so it is not acceptible to have such latency.

Another couple of words for my work - kevent AIO, which only works for aio_sendfile() right now, was designed in the state machine way - there are several execution threads indeed, but each request is actually a set of smaller ones, which are executed one-by-one in completion of previous one. Since completion happens in different contexts, special machine is designed. It works similar to Tux state machine though, but its tasks so far only included pages reading into VFS cache and theirs sending.

Shit, I would say, that kevent based AIO works that way (and can be even better with network AIO), but 1. i'm not subscribed to linux-kernel@, and 2. I'm modest enough to only say how cool I am in my blog :)

/devel/threading :: Link / Comments (0)


Mon, 29 Jan 2007

M-on-N threading library got a homepage.


Here one can find some design ideas and notes compiled from threading blog tag, which you are currently reading.

/devel/threading :: Link / Comments (0)


Atomic locks and updated starting route for M-on-N threading model.


After I started to use atomic locks (lock prefix on x86) instead of semaphores, thread start/empty exec/stop was reduced down to 0.3 microseconds compared to 14 microsecods for POSIX NPTL case.

But there are problems.
First one is that I perform initial context setup through signal invokation, which is at least two syscalls. They are slow.
Another one is that thread is really started only after rescheduling, which is another signal, so another two syscalls.
Third on is that there must exist different locking primitives - for signal context and for process context, which must block signals, which in turn adds additional overhead of sigprocmask() syscall.

After I fixed all above issues (actually not fixed, but confirmed that they must exist), performance reduced to 9 microseconds compared to 14 microsecods for POSIX NPTL case for empty thread creation/destruction.

This can be fixed, if I would have created arch-specific getcontext()-like calls, which would be mutually transformable into signal context information (existing getcontext() and friends produces different data than signal context has at least on x86). But I can not right now, since I do not know enough x86 ABI (I learned a lot for past several days, as you can notice from this blog, but it is still even remotely not enough).

Currently M-on-N threading model uses ugly arch-specific hacks to start new threads, which actually are something remotely similar to makecontext().
So, the solution, which will rock M-on-N threading implementation is to convert or create getcontext() and friends calls which can be used with signal context information.

But let's Linux motto "release early, release frequently" fly - I plan to release alpha version today even without syscall substitution support and send it to linux-kernel@ and libc-hacker@ for review.

/devel/threading :: Link / Comments (0)


Sat, 27 Jan 2007

New userspce preemptive scheduler for M-on-N threading model is completed.


It was built on the idea, that kernel saves in stack information about interrupted context, so it can be extracted there and changed to anything else. I do not use makecontext() and friends at all now, since its internals are completely different from what is stored in signal's stack. System works, but not 100% reliable - there is a race in scheduler when it is possible to reference a thread which was just exited, I will fix this with introduction of atomic operations, which will also reduce thread creation overhead related to futex() syscall, which is how semapphores are implemented, which are currently main locks in my threading library.

Such approach currently can only be used, when sources are compiled without position-independent code support.

If timer signal based scheduler is disabled, no bugs happens (which is quite obvous, since it becomes synchronous).

I did not tested SMP scalability and there is no syscall dynamic substitution, which will be added after I complete bug fixing in the scheduler.

Current test for speed of the thread creation shows (thread is allocated, its empty function is called, then it is removed), that speed for one thread creation/execution/destruction is about 1.9 microseconds, compared to 14 microseconds for NPTL POSIX threads.

/devel/threading :: Link / Comments (0)


Wed, 24 Jan 2007

Execution different function after returning from signal handler.


It looks a bit like writing an exploit actually, but I managed to change signal execution path to call my own function after signal handler is completed instead of returning to previously running context. Currently that function is executed in previously running context, only %eip was changed.

Code is quite simple and generic enough:

	struct sigframe *frame = (void *)get_ebp + sizeof(void *);
	struct sigcontext *sc = &frame->sc;
Above is correct at least on x86 and x86_64 (except that register is %rbp), although above structures are cpu-specific. If %eip (or %rip) is changed in signal handler to pointer to the new function, it will be called instead of function, which is supposed to run in that context.

Here is a log:
main 1814: scheduling first, context size: 88, fpstte: 624.
call_me_func: ebp: 0xb7f19ff8, stack: 0xb7f1a008, diff: 16.
call_me_func: esp: 0xb7f19fb0, stack: 0xb7f1a008, diff: 88.
1169655167:       th: 0x8049c00, stack: 0xb7efa008, id: 1, esp: b7f19fb0, ebp: b7f19ff8.
alarm_sighandler: ebp: 0xb7f19b08, esp: 0xb7f19ad0, func: 0x804868a, frame: 0xb7f19b0c, call_me_func: 0x804856c.
alarm_sighandler: prev: esp: b7f19de8, ebp: b7f19fa8, eip: 3404db.
alarm_sighandler: eip set to 804865a.
sched_return: func: 0x804865a, ebp: b7f19de4, esp: b7f19dcc.
1169655168:       th: 0x8049c00, stack: 0xb7efa008, id: 2, esp: b7f19fb0, ebp: b7f19ff8.
1169655171:       th: 0x8049c00, stack: 0xb7efa008, id: 3, esp: b7f19fb0, ebp: b7f19ff8.
1169655174:       th: 0x8049c00, stack: 0xb7efa008, id: 4, esp: b7f19fb0, ebp: b7f19ff8.
As you can see, sched_return() is called instead of old function, which prints next string since sched_return() returns.

To implement correct userspace scheduling I only need to replace the whole struct sigframe function with context from different thread. So far this looks simple, how it will be in practice I will check tomorrow, and now I need some climbing.

P.S. in previous story about how signals work I made a mistake saying that new signal stack is allocated - no, the same process' stack is used, or alternative one if it is available and thus special flag is set.

/devel/threading :: Link / Comments (0)


Tue, 23 Jan 2007

How do signals work in Linux?


Likely it is the same in all other Unix systems too, but I only checked Linux kernel.

There are two types of signals - synchronous, which are for example result of error operation, they always happen synchronously after the wrong operation, and asynchronous ones - they can be delivered at any time, for example using kill() call.
No matter what type of signal we received, it was produced exactly the same way.

When signal is generated and it is not blocked, mask of pending signals is updated. Later (when process is scheduled for execution) kernel will examine mask of pending signals and if there are any, it will start to deliver them one-by-one. If handler is set to SIG_DFL or SIG_IGN either process will be killed (actually default action will take place), or signal will be dropped. The most interesting case thus is when there is our own handler.
In that case kernel will eventually call setup_frame() function, which will setup new signal stack (or use existing if SA_ONSTACK flag is set), save current context (copy registers, error value, some thread info and other interesting information), setup return call (function which will be called when signal handler is completed to return back to kernel). Save context procedure includes filling struct sigframe, which contains all info needed to continue to run interrupted task and/or schedule it after some time.

After some GNU asm learning and googling, I've managed to run function on top of its own stack. Code (for x86) is pretty simple:

	asm volatile (
		"mov	%0,%%eax	\n" /* Start address */
		"mov	%1,%%ebx	\n" /*  Arguments */
		"mov	%%esp,%%edx	\n" /* save old sp to edx */
		"mov	%2,%%esp	\n" /* change stack */

		"push	%%edx		\n"
		"push	%%ebx		\n" /* copy arguments to new stack */

		"call	*%%eax		\n" /* call (*func)(arg); */
		"mov	4(%%esp),%%edx	\n"
		"mov	%%edx,%%esp\n" /* restore old stack */
		:
		: "g"(func), "g"(data), "g"(stack+stack_size)
		: "eax", "ebx", "edx", "esp"
		);
It was lurked in PTL threading library, which is the only one which does support preemptive userspace scheduling. Provided function is indeed called on its own stack. But when signal is delivered, its $ebp does not contain that stack, but instead it contains address somewhere in the middle of the new stack, which rised a suspicion, that glibc installs own signal handler, and then calls my, but experiments with x86_64 test machine showed, that kernel indeed jumps directly into my own signal handler. Unfortunately I do not have fast x86 test machine, and do not want to spread power to two arches currently, so I will setup my VIA C3 test machine and will run some signal tests on its modified kernel to detect, where exactly stack pointer for interrupted context is stored. When this issue will be resolved, correct preemptible scheduler for M-on-N threading model implementation will be just a matter of hours.

/devel/threading :: Link / Comments (0)


True context switching in userspace.


After some thinking, I've understood, that setcontext() approach does not allow to have real context switching - when this function is called, context is restored from the point where getcontext()/makecontext() was called last time, so even if one have possibility to restore new context, context being removed must save its state by itself - obviously no one will do it. So this approach will not be used at all in my context switching implementation (although I will check getcontext()/makecontext()s sources, since they contain needed bits).

Let's slightly move away from topic and concentrate on how signals work.
According to my knowledge, signal is just a call for some function in stack - kernel saves needed context, setups small region on stack of currently executed thread, saves current context and calls a special function which ends up with registered handler invocation. (Interesting note for investigation - how do signals work with non-executable stack - likely special page is allocated for that purpose, but if it is so, then there is a way to write explits which can run on stack even when system setups it to be non-executable).
Signal handler can not be reentered, when it is exiting, it restores previous context, thus creating real context switching.

Returning back to our topic - scheduler's work thus become obvious - system's signal helper will store current execution context on stack, then new thread will be selected by registered handler and its context will be restored when signal will exit instead of old one, thus true context switching will be performed.

This looks simple in theory, but on practice there are couple of small problems:

  • I do not know (even x86) asm enough to code like in C (but I always wanted to hack low-level stuff)
  • I do not 100% sure that signals work like I described (but I surely want to know)
  • I do not know how it will be easy or not to save/restore context (according to glibc sources, getcontext()/makecontext() and friends are not too big though)
Actually context switch is a bit more complex - for example virtual mapping should be restored/saved too for TLS (thread-local storage), on x86 MMU registers must be saved and more generally FPU state must be saved too, but for initial implementation I think it is not too relevant.

So, enough for theories, it is about an 1 A.M. and I need to sleep - I'm sure this day will be very interesting.

/devel/threading :: Link / Comments (0)


Mon, 22 Jan 2007

Signals and contexts.


I was a bit optimistic when said that scheduling works - it can not work at all, completely, since it is not allowed to call setcontext()/swapcontext() from signal handler. It is only needed to schedule away CPU-bound tasks which do not perform syscalls, since syscall will be a rescheduling point too.

To solve this situation system needs to either spawn additional control thread, or allow kernel to create different signal types, which will be safe for setcontext()/swapcontext(). The former is platform-independent approach, although I would like to implement the latter, but for now first one will be done.

I've just found that there is no preemptive userspace threading library - IBM's closed NGPT library is based on GNU Pth and they both proide only non-preemptive scheduling.

Now I understand why I have so much problems with preemptive scheduling, and actually it does not look like there is easy solution even with control thread - all *context() functions work only with current context.

This requires some deep thinking...

/devel/threading :: Link / Comments (0)


Sat, 20 Jan 2007

M-on-N threading scheduler is ready.


It works similar to O(1) scheduler in that regard, that it has two queues of tasks too. When task runs out of its timeslice, it is moved into inactive queue, which becomes active, when active queue is empty.
It seems, that system scales good with increased number of CPUs, at least when system created several busy-loop threads, they ate both physical CPUs on my Core Duo system (according to 'top'), although it is possible that distribution is unfair.
Code is not yet ready, I know about at least two nasty bugs there, and it lacks syscall substitution part, which is a major part, but there is progress indeed.

/devel/threading :: Link / Comments (0)


Fri, 19 Jan 2007

Userspce scheduling in M-on-N threading model.


The main purpose of the scheduler as a saparated system is to hold fairness in spreading CPU clocks between tasks.
Current scheduler just gets next task and gives it its own timeslice (currently equal to 10 milliseconds). If that task is sleeping in syscall or performing CPU intensive operations, system will not take that into account. So tasks which are so called IO-bound, i.e. waiting in syscall for IO comlpetion (or actually any other sleeping), are not getting CPU clocks in a fair manner, since theirs timeslices are spent mostly sleeping. CPU-bound tasks in that model does not fully utilize CPU too, since part of the time CPU is idle since scheduler put IO-bound task into execution, and that task is waiting, while it would be possible to start CPU-bound task.

So, the solution is to provide each task a given timeslice, which will be decreased when task is actively executed on CPU. If task puts itself into sleep, its timeslice is reduced according to time, which was used for active execution. When rescheduling happens either in syscall time or in a signal, scheduler will select task with the highest timeslice left. Priority of the task will correspond to the length of the timeslice each task obtains.

Userspace scheduler also has access to the information, what exactly any task in question is doing, so if it is known that it is waiting in syscall, it will not be awakened at all until scheduler receives kevent which given task is waiting for.

/devel/threading :: Link / Comments (0)


Wed, 17 Jan 2007

Threading part of the NTL M-on-N threading library is ready.


Although not without problems - there is no scheduler (well, there is round-robin one, which is not what I want), I did not run any kind of benchmarks to test SMP scalability and timer signal overhead (the latter is the most problematic part - although 'top' shows zero CPU usage for pool of 100 threads sleeping in infinite loop, it is still possible that actual CPU usage due to signal delivery overhead can be noticeble).

Code does not contain kevent syscall wrappers yet. I will think about dynamic library loading tricks, which allow to 'replace' syscalls in runtime.

Enough for today - I'm going climbing.

/devel/threading :: Link / Comments (0)


pthread_create() vs. clone().


Did you ever tried to use clone() directly? I bet you never tried it at least with recent kernels.
First, exported clone() does not correspond to what kernel expects, it looks like it is only provided for compatibility. Manpage for that call is utterly obsoleted and incorrect (except useful flag description it contains _wrong_ descrition of parameters at least for i386).
But I do not search for easy ways - I have glibc sources and can dig into them.
That was my first impression of the man, who in theory can climb the Everest, fly to the space and understand math behind string theory (the latter only if time permits, it looks like the whole life can be spent there digging into more and more new subtheories).
Now I think that all three tasks described above can be much-much-much more solvable than digging in the glibc sources. And those people says that I poorly described kevent - hey, look into glibc NPTL implementation (and I even do not talk about its coding style) and pray you will never see this again, or just try to start a new thread using clone().

After about an hour of reverse engineering process trying to make __clone() work (note, that clone() does not work at all, just forget about this call, only __clone() is correct for i386 and 2.6 kernel), I managed to start new thread. It was a win, except very small problem, that it crashed somewhere in the provided function calling chain.
I want you to know, that I do not know low-level i386 arch enough to easily read and understand asm code (some years ago I managed to write asm application which entered protected mode in DOS, but I do not recall asm already, and actually never understood gas semantic good enough) found in sysdeps/unix/sysv/linux/i386/clone.S, so I miserably failed to proceed.

Yes, I started to use pthread_create() for SMP scalability. I do not hear how you scream 'loser', since you would be there too, but those of you, who still lurks here and ironically nod your head would better point me to something useful for understanding of how modern i386 (or actually any other arch) starts and works with threads/processes.

/devel/threading :: Link / Comments (0)


Initial implementation of the ntl (new threading library) M-on-N threading library.


Well, I can not find better prefix than ntl, which is extremely non-ordinary abbreviation for 'new threading library'.

Anyway, current version is very initial, it does not contain scheduler and does not contain kevent-driven wrappers on top os usual IO syscalls, but it already has all initialization mechanisms, cache of threads and all structures required for scheduling.

There are two major problems uncovered with this initial implementation.
First one is scheduling problem. Since NTL does not contain dedicated schduling thread, it is quite hard to perfrom scheduling of the functions which does not do syscalls, for example with those which just do while(1); loop and eat 100% CPU and never enters NTL layer. To solve this problem I need to add timer and appropriate signal handler, where reschduling will happen, which in theory can lead to performance degradation and to problem with alarm signal registered in thread function (although that should be fixed with kevent timer notifications).

Another problem is futex performance.
In current code there are two locks implemented as semaphores, which in modern Linux are transfered into futexes - schduler lock, which guards queue of threads, and stack cache lock, which guards list of free thread stacks.
So usual thread creation, empty function and thread exiting in NTL changes from this operations to:

mmap2(NULL, 8396800, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7684000
sigprocmask(SIG_BLOCK, NULL, [])        = 0
futex(0xb7fdac20, FUTEX_WAKE, 1)        = 0
sigprocmask(SIG_SETMASK, [], [])        = 0
futex(0xb7fdac20, FUTEX_WAKE, 1)        = 0
futex(0xb7fdaaa0, FUTEX_WAKE, 1)        = 0
sigprocmask(SIG_SETMASK, [], NULL)      = 0
futex(0xb7fdaaa0, FUTEX_WAKE, 1)        = 0
...
munmap(0xb7fd7000, 4096)                = 0
so we get aditional four futex calls - two locks are processed: one when stack is unlinked and returned to stack cache, and another when thread is added and removed from scheduler's queue.

Performance differs noticebly (test case includes creation of the thread, which exits immediately, which is repeated requested number of times):
$ ./ntl_test 100000
num: 100000, diff: 388234, speed: 3.882340.
Compared to 1.793600 microseconds without futex calls.

In this situation there is no concurency at all - it is synthetic test, so actually one _empty_ futex call gets about 0.5 microseconds, where pure syscall overhead is 50% (this is Intel Core Duo 3.40GHz (running 3.7 Ghz) test machine).
I can not say if futex performance is slow of fast - but I would like to avoid this, so in practice semaphores should not be used for thread serialization, instead lightweight locks must be introduced.
In current code all locks are abstracted and implemented in separate file, so lock changes are trivial, but I do not want to introduce per-arch usage right now.

/devel/threading :: Link / Comments (0)


Mon, 15 Jan 2007

Initial benchmarking of pthread_create() vs. makecontext().


Benchmark is simple - allocate new thread, thread function immediately exits, parent thread waits for cancellation and starts again.
One case is pure pthread_create()+pthread_join(), another one is getcontext(), stack allocation (8mb as of my current rlimit), makecontext(), swapcontext() thread function immediately exits, stack is being freeing.

Obviously I expected that makecontext() will be much faster, but (time is a number of microseconds to create/destroy one thread, i.e. perform sequence described aboved):

$ ./test_pthread 100000
num: 100000, diff: 1402225, time: 14.022250.
$ ./test_context 100000
num: 100000, diff: 1322459, time: 13.224590.
Impossible, something was completely wrong and another world's magic was mixed, that was my first impression.
But when I studied in MIPT, I was told on every physics lab, that there is no magic, so I started to think.
The only thing my brain could think about, was to run strace.
So I did, and found following interesting moments.

Pthread case:
...
mmap2(NULL, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7622000
...
clone(child_stack=0xb7e214c4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
	CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|CLONE_DETACHED, 
	parent_tidptr=0xb7e21bf8, {entry_number:6, base_addr:0xb7e21bb0, limit:1048575, seg_32bit:1, 
	contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e21bf8) = 24426
clone(child_stack=0xb7e214c4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
	CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|CLONE_DETACHED, 
	parent_tidptr=0xb7e21bf8, {entry_number:6, base_addr:0xb7e21bb0, limit:1048575, seg_32bit:1, 
	contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e21bf8) = 24427
...
and so on - everything looks ok - one stack allocation, and then it was reused since stack was not freed, but was put into cache by nptl implmentation.

Here is ucontext case:
...
mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7659000
sigprocmask(SIG_BLOCK, NULL, [])        = 0
sigprocmask(SIG_SETMASK, [], [])        = 0
sigprocmask(SIG_SETMASK, [], NULL)      = 0
munmap(0xb7659000, 8392704)             = 0
mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7659000
sigprocmask(SIG_BLOCK, NULL, [])        = 0
sigprocmask(SIG_SETMASK, [], [])        = 0
sigprocmask(SIG_SETMASK, [], NULL)      = 0
munmap(0xb7659000, 8392704)
...
So, mmap()/munmap() was the culprit - my context allocation code did not used cache of stacks, instead it allocated/freed new one for each new context. After I emulated cache usage, I got following strace for context case:
mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb75fc000
sigprocmask(SIG_BLOCK, NULL, [])        = 0
sigprocmask(SIG_SETMASK, [], [])        = 0
sigprocmask(SIG_SETMASK, [], NULL)      = 0
sigprocmask(SIG_BLOCK, NULL, [])        = 0
sigprocmask(SIG_SETMASK, [], [])        = 0
sigprocmask(SIG_SETMASK, [], NULL)      = 0
sigprocmask(SIG_BLOCK, NULL, [])        = 0
...
munmap(0xb75fc000, 8392704)             = 0
And following results:
$ ./test_pthread 100000
num: 100000, diff: 1402225, time: 14.022250.
$ ./test_context 100000
num: 100000, diff: 179360, time: 1.793600.
As expected - there is no magic, userspace context switching is about 7 times faster than real thread creation, and mmap()/munmap() syscalls provide exactly clone() overhead. Empty syscall on this machine is about 0.25 microseconds, so its overhead is negligible.

/devel/threading :: Link / Comments (0)


Sun, 14 Jan 2007

NPTL thread stack usage.


NPTL (as of glibc-2.5) uses following tricky stack usage model: when new thread is created some stack must be allocated for it, so NPTL stores cache of stacks, each of which has different size, so when thread is created that cache is searched for the entry with the nearest but bigger cache, if cache is found and it is not that bigger than requested (it is less than 4 times bigger than requested), it is returned to the user (and global variable which holds size of the cached stacks is reduced). If there is no appropriate stack, it is allocated from anonymous memory using mmap() with at least MAP_PRIVATE | MAP_ANONYMOUS flags, which in particular means that pages are not allocated at all, but copy-on-write mechanism is used (i.e. real allocation happens only when user writes to allocated pages).
Maximum size of the thread stack is 40 Mb, default value is taken from rlimits.
Stack is guarded by some pages (if required) and part of it is used as control block.
Simple and good model.

Exactly the same model will be used for stack allocation for M-on-N threading model.

There is another interesting memory issue in NPTL - so called thread-local storage (aka TLS).
It allows to have for example global errno variable to be per-thread. This requires to extend programming language (C and C++ have __thread variables) and ELF header. Reasons to have thread-local storage are obvious and the main one is performance, but it has too many problems with symbol lookups, object allocation and other issues (more details can be found in this article), so I will not implement it.

Furthermore, my library will be actually _trivial_ layer on top of glibc, which will just provide _new_ symbols for programmers, like new_write(), and if approach will be considered as good, glibc maintainers (namely Ulrich Drepper, with whom I had some discussion about kevent (it looks like we still not 100% agree on all issues, but there was a major progress)) could adopt given design.
My library will use makecontext() and friends functions from glibc as a basis, which are actually just wrappers, which work with registers and other meaningfull for process running information.

/devel/threading :: Link / Comments (0)


Fri, 12 Jan 2007

SMP scalability in M-on-N threading model.


Actually it is simple.
Since only kernel can schedule thread (actually not even thread or process, but its own kernel's representation, so called kernel's virtual process) to run on specified CPU, M-on-N threading model should have several real threads (for example several current POSIX threads), its number should be equal to number of real CPUs, and then library layer will schedule execution of context of different real threads, each of which in turn can run on separate CPU.

So, userspace will create new real threads when pthread_create() is called until number of them is less than number of real CPUs, each real thread in turn is a context in the global tree of contexts, where fake context will be added with all subsequent pthread_create() calls, and userspace scheduler (backed by real threads) will pick up several contexts from the tree and execute them on the real CPUs.

Simple and looks really good.

/devel/threading :: Link / Comments (0)


Thu, 11 Jan 2007

Userspace threading and theirs benefits and drawbacks.


Benefits.

1. Fast scheduling.
There is no need to cross userspace/kernelspace boundary to schedule new thread execution (just watch what happens with userspace network stack compared to kernel's one when there are a lot of syscalls performed for small packets receiving/sending).

2. Fast thread creation and destruction.
It just becomes an allocation of the structure in the userspace, no need for full creation process which is performed in clone() syscall.

3. Smaller number of cache misses.
Since there is only one process instead of several threads, cache locality is increased greatly with reduced number of misses.

Drawbacks.

1. Scheduling fairness.
Since kernel does not know about multiple threads behind given process, it can not add it appropriate number of timeslices for execution.
Can be solved either by more tight collaboarion of the userspace nad kernelspace schedulers or simply by increasing process' nice value.

2. All communications are performed through one kevent pipe.
Which can be problematic (although interface was specially designed to be scalable).

3. Complex code for good SMP scalability and userspace scheduler.
I wanted to put it into 'Benefits' section, since that is exactly why I started this project.

/devel/threading :: Link / Comments (0)


Threading issues and ways to resolve them.


1. Signals.
POSIX requires that signal must be delivered on per-thread basis, but signal handler, and thus the fact that signal is ignored or not, is per-process property. With kevent's possibility to deliver signals through its queue problem can be solved in the very elegant way - main process receives a signal event notification through its kevent queue and then check all its threads, which have that signal unblocked, all appropriate threads receives signal through the alternative signal stack.

2. Kevent/poll usage in the threads.
Poll() and select() must be translated into kevent request in syscall wrapper, for example how I implemented epoll on top of kevent, and then that event will be put into main kevent queue.

3. Sleep and the list system calls.
Kevent has timer notification which will be used to emulate such calls. Call for POSIX timers can be emulated through kevent POSIX timers support, but probably I will not consider this for initial implementation.

4. Blocking inter-process communications like semaphores.
It must be converted to userspace kevent notifications.

All above can look like it is old LinuxThreads days before NPTL, when there was a special management thread which performed a lot of that functionality (namely signal handling, resource cleaning, which is not a problem for this new implementaion, since all resources will be automatically cleaned when process exits, and no process-visible resources like file descriptors are closed on thread cancellation, and signals can be handled perfectly with kevent's capabilities), but now it has moved into layer between kernel (or glibc for initial implementation) and application (i.e. scheduler, I think it is correct name, since main task of that layer is exactly scheduling). But actually it completely does not differ from what we have right now with NPTL and 1-on-1 thread model - exactly the same tasks are performed by kernel, but with additional layer crossing overhead.

/devel/threading :: Link / Comments (0)


Initial thoughs about userspace threads (or M-on-N threading model).


Let's see, what we already have.
Glibc provides us makecontext() and friends functions, which are essentially a part of the userspace execution mechanism - one can create context, run it, swap it and so one. That is something I want to implement, except its problems - context switch can be performed from the outside thread (that is how IBM NGPT was implemented), it is not the main issue, although I really do not like such an approach, the main problem is the fact, that if such a context is going to block, that fact can not be detected from another contexts, and thus it is impossible to swap context with another one. Even if some check will be done in each syscall, or even if each syscall will be a rescheduling point, that means that either each syscall must be non-blocking, or the whole process will go to sleep in syscall, since kernel does not know that there are several context in the same process.

So, the solution is to have some kind of a thin layer between kernel and userspace (in a real world it is called glibc), which will convert all syscalls into non-blocking operations (including nanosleep() and the like), and keep a track of what each context performed. In practice glibc rewrite is not what I would like to do, but instead some layer on top of it will be implemented, which will convert syscalls into kevent operations, and become a rescheduling point. I will even consider to implement not exactly known syscalls, but instead (at least for the initial implementation) introduce new calls, which will be a wrapper to known ones - like new_write() will be a kevent and new threading model based wrapper, which will setup all appropriate requests (like POLLIN) and if possible, call write() itself. When all execution context are put into the sleep, the whole process will park itself in the waiting syscall like kevent_get_events().

Main issues with such approach are following:

  • scheduling algorithm
  • SMP scalability
  • syscall wrapper in the glibc or completely new calls (like described above)
At least first two issues are interesting technical challenges, the last one will be first implemented with new calls.

/devel/threading :: Link / Comments (0)