|
|
M-on-N threading library.
|
Development status can be tracked here.
Userspace M-on-N threading model is based on the idea,
that when signal is delivered, kernel saves all information
related to previous context in stack, so it is possible to find
it and replace.
Userspace threading and theirs benefits and drawbacks.
Benefits.
1. Fast scheduling.
There is no need to cross userspace/kernelspace boundary to schedule new
thread execution (just watch what happens with userspace network stack
compared to kernel's one when there are a lot of syscalls performed for
small packets receiving/sending).
2. Fast thread creation and destruction.
It just becomes an allocation of the structure in the userspace, no need
for full creation process which is performed in clone() syscall.
3. Smaller number of cache misses.
Since there is only one process instead of several threads, cache locality
is increased greatly with reduced number of misses.
Drawbacks.
1. Scheduling fairness.
Since kernel does not know about multiple threads behind given process,
it can not add it appropriate number of timeslices for execution.
Can be solved either by more tight collaboarion of the userspace and
kernelspace schedulers or simply by increasing process' nice value.
2. All communications are performed through one kevent pipe.
All syscalls are going to be converted into non-blocking operations
(including nanosleep() and the like), and keep a track of
what each context performed. In practice glibc rewrite is not what I would
like to do, but instead some layer on top of it will be implemented, which will
convert syscalls into kevent operations, and become a rescheduling point.
3. Complex code for good SMP scalability and userspace scheduler.
Not a problem.
SMP scalability in M-on-N threading model.
Since only kernel can schedule thread (actually not even thread or process,
but its own kernel's representation, so called kernel's virtual process) to
run on specified CPU, M-on-N threading model should have several real threads
(for example several current POSIX threads), its number should be equal to number
of real CPUs, and then library layer will schedule execution of context of different
real threads, each of which in turn can run on separate CPU.
So, userspace will create new real threads when pthread_create() is called until
number of them is less than number of real CPUs, each real thread in turn is a context in the
global set of contexts, where fake context will be added with all subsequent pthread_create()
calls, and userspace scheduler (backed by real threads) will pick up several contexts from the
tree and execute them on the real CPUs.
I would be possible to use existing Linux clone() syscall, but due to
complete absence of hte documentation (which is sometimes plain wrong) and ery strong
encryption of glibc sources it is quite complex task.
As NPTL, M-on-N threading library uses stack rlimit for thread stack allocation.
Benchmarks.
I only ran simple benchmark of empty thread creation (its function just exits).
After I started to use atomic locks ("lock" prefix on x86) instead of semaphores,
thread start/empty exec/stop was reduced down to 0.3 microseconds compared to 14 microsecods for POSIX NPTL case.
But there are problems.
First one is that I perform initial context setup through signal invokation,
which is at least two syscalls. They are slow.
Another one is that thread is really started only after rescheduling, which
is another signal, so another two syscalls.
Third on is that there must exist different locking primitives - for signal context
and for process context, which must block signals, which in turn adds
additional overhead of sigprocmask() syscall.
After I fixed all above issues (actually not fixed, but confirmed that they must
exist), performance reduced to 9 microseconds compared to 14 microsecods for POSIX NPTL
case for empty thread creation/destruction.
This can be fixed, if I would have created arch-specific getcontext()-like
calls, which would be mutually transformable into signal context information
(existing getcontext() and friends produces different data than
signal context has at least on x86). But I can not right now, since I do not know
enough x86 ABI (I learned a lot for past several days, as you can notice from
this blog, but it is still even remotely not enough).
Currently M-on-N threading model uses ugly arch-specific hacks to start new threads,
which actually are something remotely similar to makecontext().
So, the solution, which will rock M-on-N threading implementation is to
convert or create getcontext() and friends calls which can be used with
signal context information.
|
|
|