Zbr's days.

About :: TODO :: Blog :: RSS :: Old blog :: Projects :: GIT :: Gallery :: Notes

Wed, 17 Jan 2007

Initial implementation of the ntl (new threading library) M-on-N threading library.


Well, I can not find better prefix than ntl, which is extremely non-ordinary abbreviation for 'new threading library'.

Anyway, current version is very initial, it does not contain scheduler and does not contain kevent-driven wrappers on top os usual IO syscalls, but it already has all initialization mechanisms, cache of threads and all structures required for scheduling.

There are two major problems uncovered with this initial implementation.
First one is scheduling problem. Since NTL does not contain dedicated schduling thread, it is quite hard to perfrom scheduling of the functions which does not do syscalls, for example with those which just do while(1); loop and eat 100% CPU and never enters NTL layer. To solve this problem I need to add timer and appropriate signal handler, where reschduling will happen, which in theory can lead to performance degradation and to problem with alarm signal registered in thread function (although that should be fixed with kevent timer notifications).

Another problem is futex performance.
In current code there are two locks implemented as semaphores, which in modern Linux are transfered into futexes - schduler lock, which guards queue of threads, and stack cache lock, which guards list of free thread stacks.
So usual thread creation, empty function and thread exiting in NTL changes from this operations to:

mmap2(NULL, 8396800, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7684000
sigprocmask(SIG_BLOCK, NULL, [])        = 0
futex(0xb7fdac20, FUTEX_WAKE, 1)        = 0
sigprocmask(SIG_SETMASK, [], [])        = 0
futex(0xb7fdac20, FUTEX_WAKE, 1)        = 0
futex(0xb7fdaaa0, FUTEX_WAKE, 1)        = 0
sigprocmask(SIG_SETMASK, [], NULL)      = 0
futex(0xb7fdaaa0, FUTEX_WAKE, 1)        = 0
...
munmap(0xb7fd7000, 4096)                = 0
so we get aditional four futex calls - two locks are processed: one when stack is unlinked and returned to stack cache, and another when thread is added and removed from scheduler's queue.

Performance differs noticebly (test case includes creation of the thread, which exits immediately, which is repeated requested number of times):
$ ./ntl_test 100000
num: 100000, diff: 388234, speed: 3.882340.
Compared to 1.793600 microseconds without futex calls.

In this situation there is no concurency at all - it is synthetic test, so actually one _empty_ futex call gets about 0.5 microseconds, where pure syscall overhead is 50% (this is Intel Core Duo 3.40GHz (running 3.7 Ghz) test machine).
I can not say if futex performance is slow of fast - but I would like to avoid this, so in practice semaphores should not be used for thread serialization, instead lightweight locks must be introduced.
In current code all locks are abstracted and implemented in separate file, so lock changes are trivial, but I do not want to introduce per-arch usage right now.

/devel/threading :: Link / Comments ()