Thu, 22 Feb 2007
New syslet/threadlet release by Ingo Molnar.
The feature set includes a new entity called a threadlet: a function
which is executed in a sync context, and if it blocks (for example
in some syslet), a new thread is created on behalf of that function,
so this approach allows massively parallel execution.
Unfortunately there are serious caveats in the design - threadlets are
not even designed to work with the network, where essentially most of
the calls sleep.
The main problem here is that having a whole thread to handle a simple execution
context is wrong - it does not scale. Rescheduling is damn too slow
when crossing the kernelspace->userspace boundary, which is why I am developing
an alternative M:N threading model
in userspace.
I also doubt its usefulness - people who care will create their own
thread for that, since POSIX thread creation overhead is generally much smaller
than the time the function executes.
Among other changes:
- got rid of locked pages for the completion ring.
(Just as predicted. Which other ideas implemented in kevent
will find their place in syslets?)
- multiple completion rings
- removed initialization and cleanup functions; parameters are instead passed
explicitly to the execution syscall
- bug fixes
/devel/kevent/aio :: Link / Comments (0)
Sun, 11 Feb 2007
Linus on AIO. Limitations of the proposal. Practical example of weakness.
Linus Torvalds wrote
in reply to the tcp_sendmsg() example:
> Will you create a thread every time tcp_sendmsg() hits the send queue
> limits?
No. You use epoll() for those.
I.e. we are designing asynchronous IO which is limited from the start to not work with the network?
I.e. AIO cannot be used in anything connected to the network, since even
if disk reads and writes are asynchronous, sending will block, and thus we lose all
possible advantages.
He continues:
There's a reason why a lot of UNIX system calls are blocking: they just
don't make sense as event models, because there is no sensible half-way
point that you can keep track of (filename lookup is the most common
example).
Linus - blocking IS waiting for an event which will remove that block.
Linux even uses wait_event_*() calls - don't you think that name makes some sense?
Filename lookup is just reading an inode from disk - when it is done, the filename is ready,
and that is an event.
And actually no one uses async filename lookup - people use the open()
syscall, which is perfectly eventable - the block-removal event is readiness of the opened file
descriptor. It is even used in kevent AIO (surprise?) as a part of the async sendfile transfer
state machine (but I must admit that opening always happens in async mode as part of the
state machine, so it will lose some ticks if things are perfectly in the cache,
but practice shows
that async sendfile is faster).
In another mail Linus continues to burn things out:
You use the AIO stuff for things that you *expect* to be almost
instantaneous. Even if you actually start ten thousand IO's in one go, and
they all do IO, you would hopefully expect that the first ones start
completing before you've even submitted them all. If that's not true,
then you'd just be better off using epoll.
I.e. we should not use AIO when a request really blocks,
only when it is mostly synchronous and blocks only occasionally.
Linus, direct IO (used by databases) blocks all the time, sync IO blocks all the time,
network blocks, pipes block, readahead blocks - only the simplest case of reading from
VFS cache does not block.
And eventually Linus proposes waiting for AIO events:
for (;;) {
    async(epoll); /* wait for networking events */
    async_wait(); /* wait for epoll _or_ any of the outstanding file IO events */
    handle_completed_events();
}
Linus - you have just introduced waiting for AIO events, i.e. a new type of event
which is supposed to wrap async completions. And since every async syscall is such a new
event, we can wait for them in a userspace loop.
You may not know it, but kevent is supposed to wait on every possible type of event - you do not
need to wrap sync event-waiting calls (like epoll()) into an async helper
and then wait for that - just register it with kevent at the point where your patchset currently forks.
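The "one wait for every type of event" idea can be illustrated in plain userspace. The sketch below is only an illustration and does not use kevent: it substitutes POSIX poll() and two pipes as stand-ins for a network source and a file IO completion source. The names wait_for_any(), demo_single_wait(), SRC_NETWORK and SRC_FILE_IO are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical event sources: in a real server one would be a socket and
 * the other a file IO completion channel; here both are plain pipes. */
enum { SRC_NETWORK = 0, SRC_FILE_IO = 1, SRC_COUNT = 2 };

/* Sleep once on all sources and return a bitmask of the readable ones,
 * or -1 on error. timeout_ms is passed straight to poll(). */
int wait_for_any(const int fds[SRC_COUNT], int timeout_ms)
{
    struct pollfd pfd[SRC_COUNT];
    int i, ready = 0;

    assert(fds != NULL);
    for (i = 0; i < SRC_COUNT; i++) {
        pfd[i].fd = fds[i];
        pfd[i].events = POLLIN;
        pfd[i].revents = 0;
    }

    if (poll(pfd, SRC_COUNT, timeout_ms) < 0)
        return -1;

    for (i = 0; i < SRC_COUNT; i++)
        if (pfd[i].revents & POLLIN)
            ready |= 1 << i;

    return ready;
}

/* Demo: only the "file IO" source becomes ready; returns the ready mask. */
int demo_single_wait(void)
{
    int net[2], io[2], fds[SRC_COUNT], mask;

    if (pipe(net) < 0 || pipe(io) < 0)
        return -1;
    fds[SRC_NETWORK] = net[0];
    fds[SRC_FILE_IO] = io[0];

    if (write(io[1], "x", 1) != 1)
        return -1;
    mask = wait_for_any(fds, 1000);

    close(net[0]); close(net[1]);
    close(io[0]); close(io[1]);
    return mask;
}
```

The caller sleeps in exactly one place and learns which source fired, with no helper thread wrapping either wait.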
And to draw the line: AIO by micro-threads is not even supposed to work
in environments where it will block all the time (like network or direct IO); in
blocking environments events should be used instead, since they are much more scalable.
Micro-thread AIO sucks even when reading from a file. Practical example: if a file happens to sit on a bad block,
reading will block for too long (seconds!), and the system can simply be killed by
rescheduling when there are a lot of threads waiting for read completion on those
blocks.
And to finally kill such a design, here is another test I created.
Consider a directory with a high number of inner dirs and files (hundreds),
whose total size is 3 times smaller than the amount of RAM (300 MB vs. 1 GB of RAM),
and several applications which run and randomly copy data from one file to another.
I've put several printks in __lock_page() (i.e. where the requesting
application blocks and thus a new thread would be created) and watched a nice picture:
up to a hundred blocks happened per second (and that is just for the case when the size
of the test dir is 3 times smaller than RAM; what will happen when the dir
is bigger than the amount of RAM I do not even want to imagine):
printk: 84 messages suppressed.
__lock_page: aio_new_thread: 6650.
printk: 118 messages suppressed.
__lock_page: aio_new_thread: 6769.
Conclusion: 'f toppku', i.e. into the furnace.
Sat, 10 Feb 2007
A test which shows how broken the thread-like AIO design is.
Linus has proposed
yet another way to do async syscalls.
It is a bit similar to fibrils, but differs in that
Linus' patch just creates a new real thread
if the async call blocks. So, when a syscall blocks, the system returns to the user as a different
thread.
There is a huge problem with that - per-syscall
thread creation/destruction. Linus, why do you think people
do not create a new thread each time a new client connects to
a web server?
An artificial example with sys_stat64() does not count -
try to have thousands of such threads.
That approach sucks even more than fibrils, imho, although Zach's one has the problem
that a fibril is always created, even if the call does not block.
Rescheduling is a problem.
To prove that this is a huge problem I've set up a simple test - I changed
all socket allocations from process context (actually only the TCP sending
functions) to GFP_ATOMIC, so whenever they fail,
i.e. whenever the process would have been put to sleep under the previous
GFP_KERNEL allocation policy, a new thread would be created.
So, I got the following results:
tcp_sendmsg: sock: ffff810038e57900, wait: 562.
tcp_sendmsg: sock: ffff810038e57340, wait: 563.
tcp_sendmsg: sock: ffff810038e56d80, wait: 564.
tcp_sendmsg: sock: ffff810038e567c0, wait: 565.
tcp_sendmsg: sock: ffff810038e56200, wait: 566.
printk: 20458 messages suppressed.
tcp_sendmsg: sock: ffff81003363d300, wait: 21025.
and the like...
That was a simple couple-of-seconds test run of the ab benchmark
against a 2.6.20 kernel with the lighttpd web server - about 4k connections per second,
80k connections total, a trivial index page (taken from the debian installer) on an athlon64 with 1gb of ram
connected over a 1gbit link.
And during that simple test the system would have created 21k threads?
No way, it is just a broken design. It is wrong.
So, read my lips - ev-e-ry-thing con-nec-ted to the net-work sle-eps.
Linus, if you read this (although I doubt it), please do not make a terrible mistake.
Do not include kevent if you do not want to, but please think about the above test before it is too late.
Sun, 04 Feb 2007
Quotation of the week:
One of the big problems today is that you can either sleep for your I/O in
io_getevents() or for your connect()/accept() in poll()/epoll(), but
there is nowhere you can sleep for all of them at once. That's why the
aio list continually proposes aio_poll() or returning aio events
via epoll().
(c) Oracle.
Original
(thread about
fibrils and AIO by scheduling stacks).
What can I say - NIH syndrome is washing brains (I must say I have it too), or people are just too lazy
to bother reading something created by others.
Sat, 06 Jan 2007
The initial aio_sendfile() implementation has been committed into the kevent tree.
It is still very rough and definitely must be cleaned up (and some known bugs fixed),
but the major part is done.
aio_sendfile() consists of two major parts: the AIO state machine and
the page processing code.
The former is just a small subsystem which allows callbacks to be queued for invocation
in process context on behalf of a pool of kernel threads. It allows caches
of callbacks to be queued to the local thread or to any other specified one. Each cache of callbacks
is processed as long as there are callbacks in it; callbacks can requeue themselves into
the same cache.
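As a rough userspace illustration of such a callback cache (not the code from the kevent tree), the sketch below keeps callbacks in a fixed-size ring and processes it until it is empty, letting a callback ask to be requeued. All names here (cb_cache, cb_queue(), cb_process(), countdown_cb()) and the "return nonzero to requeue" convention are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SIZE 16

struct cb_cache;

/* Assumed convention: a callback returns nonzero to be requeued. */
typedef int (*aio_cb_t)(struct cb_cache *cache, void *data);

struct cb_entry {
    aio_cb_t fn;
    void *data;
};

/* Fixed-size ring of pending callbacks (hypothetical layout). */
struct cb_cache {
    struct cb_entry ring[CACHE_SIZE];
    size_t head;   /* index of the oldest pending callback */
    size_t count;  /* number of pending callbacks */
};

/* Append a callback; returns -1 if the ring is full. */
int cb_queue(struct cb_cache *c, aio_cb_t fn, void *data)
{
    size_t tail;

    assert(c != NULL && fn != NULL);
    if (c->count == CACHE_SIZE)
        return -1;
    tail = (c->head + c->count) % CACHE_SIZE;
    c->ring[tail].fn = fn;
    c->ring[tail].data = data;
    c->count++;
    return 0;
}

/* Process the cache until it is empty; returns the number of invocations. */
size_t cb_process(struct cb_cache *c)
{
    size_t calls = 0;

    assert(c != NULL);
    while (c->count > 0) {
        struct cb_entry e = c->ring[c->head];

        c->head = (c->head + 1) % CACHE_SIZE;
        c->count--;
        calls++;
        if (e.fn(c, e.data))
            (void)cb_queue(c, e.fn, e.data); /* a full ring drops it in this sketch */
    }
    return calls;
}

/* Example callback: counts down and asks to be requeued until it hits zero. */
int countdown_cb(struct cb_cache *cache, void *data)
{
    int *left = data;

    (void)cache;
    return --(*left) > 0;
}

/* Demo: one callback that needs three passes to finish. */
int demo_requeue(void)
{
    struct cb_cache cache = { .head = 0, .count = 0 };
    int left = 3;

    cb_queue(&cache, countdown_cb, &left);
    return (int)cb_process(&cache);
}
```

A callback that is not ready yet simply goes back to the tail of the same cache, so no thread has to sleep on its behalf.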
The real work is done in the page processing code, which populates pages into
the VFS cache and then sends them to the destination socket via ->sendpage().
Unlike the previous
aio_sendfile() implementation, the new one does not require low-level filesystem-specific
callbacks at all. Instead I extended struct address_space_operations with a
new member called ->aio_readpages(), which is exactly the same as
->readpages() (read: mpage_readpages()) except for
different BIO allocation and submission routines.
I changed mpage_readpages() to pass mpage_alloc()
and mpage_bio_submit() to a new function called __mpage_readpages(),
which is exactly the old mpage_readpages() but invokes the provided callbacks
instead of calling the old functions directly. mpage_readpages_aio() provides
kevent-specific
callbacks, which call the old functions but with different destructor callbacks.
Those are essentially the same, except that when a page becomes uptodate it is not unlocked,
so that it cannot be removed until it is sent; only then is it unlocked.
The code does contain a bug (at least one) that I know about - subsequent attempts to send pages happen not
after the BIO is ready and the pages are thus populated into the VFS cache (i.e. marked as uptodate),
but repeatedly in the state machine (rescheduling must happen in the BIO destructor, not in the code
which allocates pages). Another issue is that it is currently impossible to receive a kevent notification
when aio_sendfile() has really completed.
Thu, 04 Jan 2007
The AIO (sub) state machine has been completed.
It is a small subsystem, living in the kernel/kevent/kevent_aio.c
file, which allows callbacks to be queued and asynchronously invoked.
The callbacks are intended to populate pages into the VFS cache, send data
to the destination socket, copy data to/from userspace and so on.
The real working callbacks themselves are not implemented yet.
I will only implement three of them - open a file by filename,
populate the file's pages into the VFS cache, send pages to the destination socket.
I will probably also add writing a page to userspace.
This set will allow aio_sendfile() to be implemented
as a sequence of those callbacks - open the file by its path, then populate
its pages into the VFS cache in chunks or one by one, and eventually
send them to the destination socket.
There is a problem with the order of sending one page and populating
its neighbour though, since having the whole VFS cache filled with
locked pages from one file is not a good idea, but locking is required to
allow the sending itself - so that the page is not swapped out. I will
either stop further populating until pages are sent, or will
not figure this out at all - depending on results from the initial
implementation.
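The first option (stop populating until pages are sent) amounts to a cap on the number of pages locked at once. The following is a toy userspace model of that ordering only, not kernel code; struct transfer, transfer_step(), run_transfer() and the max_locked limit are all hypothetical names and parameters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one aio_sendfile()-like transfer. */
struct transfer {
    unsigned total;       /* pages in the file */
    unsigned populated;   /* pages read into the cache and locked so far */
    unsigned sent;        /* pages already sent and unlocked */
    unsigned max_locked;  /* cap on pages locked at the same time */
    unsigned peak_locked; /* highest number of locked pages observed */
};

/* Run one step of the transfer; returns nonzero while pages remain unsent. */
int transfer_step(struct transfer *t)
{
    unsigned locked;

    assert(t != NULL && t->max_locked > 0);
    locked = t->populated - t->sent;

    if (t->populated < t->total && locked < t->max_locked) {
        t->populated++;            /* populate and lock one more page */
        locked++;
        if (locked > t->peak_locked)
            t->peak_locked = locked;
    } else if (locked > 0) {
        t->sent++;                 /* send the oldest page, then unlock it */
    }
    return t->sent < t->total;
}

/* Drive a whole transfer and report the peak number of locked pages. */
unsigned run_transfer(unsigned total, unsigned max_locked)
{
    struct transfer t = { total, 0, 0, max_locked, 0 };

    while (transfer_step(&t))
        ;
    return t.peak_locked;
}
```

With a cap of 8, a 100-page file never holds more than 8 locked pages, while a 3-page file is simply populated in full before sending starts.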
Each subtask above - i.e. each callback - is an elementary chunk which
will be handled by kevent. Completion of the whole task
will be handled by kevent too.
Wed, 03 Jan 2007
Initial thoughts on the 'true AIO'.
Here was
the first announcement of the idea, and now I will open it up a bit more.
This was written after some study of Dan Williams' (Intel) work
on async copy found here;
the whole thread
may also be interesting for those who want to know the status of AIO development
and some ideas about its improvement.
A generic solution must be used to select the appropriate device to perform
the actual data processing.
We had a very brief discussion about the asynchronous crypto layer
(acrypto)
and how its ideas could be used for async dma engines - the user should not
even know how his data has been transferred. He calls async_copy(),
which selects an appropriate device (sync copy is just one more
ordinary device in that case) from the list of devices which have exported their
functionality. Selection can be done in millions of different ways, from
taking the first one from the list (this is essentially how your
approach is implemented right now) to using special (including run-time
updated) heuristics (like it is done in acrypto).
Thinking further, async_copy() is just one ordinary case of the async class of
operations, so the same logic must be applied on this layer too.
But
layers are the way to design protocols, not implement them.
David Miller on netchannels.
So, the user should not even know about layers - he should just say 'copy
data from pointer A to pointer B', or 'copy data from pointer A to
socket B' or even 'copy it from file "/tmp/file" to "192.168.0.1:80:tcp"',
without ever knowing that there are sockets and/or memcpy() calls,
and if the user requests that it be performed asynchronously, he must be
notified later (one might expect that I will prefer to use kevent :)
The same approach can thus be used by NFS/SAMBA/CIFS and other users.
This is how I am starting to implement AIO (it looks like it is becoming popular):
- the system exports the set of operations it supports (send, receive, copy,
crypto, ....)
- each operation has its own set of suboptions (different crypto
types, for example)
- each operation has a set of low-level drivers which support it (with
optional performance or any other parameters)
- each driver, when loaded, publishes its capabilities (async copy with
speed A, XOR and so on)
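The list above can be sketched as a small capability table plus a selection routine. This is a minimal illustration, not acrypto or kevent code: struct aio_driver, enum aio_op, aio_select() and the demo driver entries are all hypothetical, and "pick the fastest advertised driver" is just one possible heuristic among the many mentioned earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operation identifiers exported by the system. */
enum aio_op { AIO_OP_COPY, AIO_OP_XOR, AIO_OP_CRYPTO };

/* Hypothetical descriptor a driver would publish when loaded. */
struct aio_driver {
    const char *name;
    unsigned ops;    /* bitmask of supported enum aio_op values */
    unsigned speed;  /* advertised throughput, arbitrary units */
};

/* Pick the fastest driver that supports op; NULL if none does. */
const struct aio_driver *aio_select(const struct aio_driver *drv, size_t n,
                                    enum aio_op op)
{
    const struct aio_driver *best = NULL;
    size_t i;

    assert(drv != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (!(drv[i].ops & (1u << op)))
            continue;               /* driver does not support this operation */
        if (best == NULL || drv[i].speed > best->speed)
            best = &drv[i];
    }
    return best;
}

/* Made-up drivers: a synchronous fallback and a faster copy-only engine. */
const struct aio_driver demo_drivers[] = {
    { "sync-memcpy", (1u << AIO_OP_COPY) | (1u << AIO_OP_XOR), 10 },
    { "dma-engine",  (1u << AIO_OP_COPY),                      40 },
};
```

Here a copy request would go to the DMA engine, an XOR request would fall back to the synchronous driver, and a crypto request would find no driver at all.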
From the user's point of view, aio_sendfile() or async_copy() will look
as follows:
- call aio_schedule_pointer(source='0xaabbccdd', dest='0x123456578')
- call aio_schedule_file_socket(source='/tmp/file', dest='socket')
- call aio_schedule_file_addr(source='/tmp/file',dest='192.168.0.1:80:tcp')
or any other similar call, and then wait for the returned descriptor in kevent_get_events() or
provide your own cookie in each call.
Each request is then converted into a FIFO of smaller requests like 'open file',
'open socket', 'get in user pages' and so on, each of which should be
handled on an appropriate device (hardware or software); completion of
each request starts processing of the next one.
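The FIFO of sub-requests can be modelled in a few lines of userspace C. The sketch below runs the steps synchronously, so "completion starts the next one" collapses into a loop; in a real asynchronous design the advance would happen from a completion notification instead. struct aio_request, req_add(), req_run() and the three demo steps are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STEPS 8

/* One elementary sub-request; returns 0 on success. */
typedef int (*step_fn)(void *ctx);

/* A user request broken into an ordered FIFO of sub-requests. */
struct aio_request {
    step_fn steps[MAX_STEPS];
    size_t nsteps;  /* number of queued sub-requests */
    size_t next;    /* index of the next sub-request to start */
};

/* Append a sub-request; returns -1 if the FIFO is full. */
int req_add(struct aio_request *r, step_fn fn)
{
    assert(r != NULL && fn != NULL);
    if (r->nsteps == MAX_STEPS)
        return -1;
    r->steps[r->nsteps++] = fn;
    return 0;
}

/* Run sub-requests in order; returns how many completed. */
size_t req_run(struct aio_request *r, void *ctx)
{
    assert(r != NULL);
    while (r->next < r->nsteps) {
        if (r->steps[r->next](ctx) != 0)
            break;      /* a failed step stops the chain */
        r->next++;      /* completion of this step starts the next one */
    }
    return r->next;
}

/* Demo steps record the order in which they ran as decimal digits. */
struct trace { int order; };

static int step_open_file(void *ctx)   { struct trace *t = ctx; t->order = t->order * 10 + 1; return 0; }
static int step_open_socket(void *ctx) { struct trace *t = ctx; t->order = t->order * 10 + 2; return 0; }
static int step_get_pages(void *ctx)   { struct trace *t = ctx; t->order = t->order * 10 + 3; return 0; }

/* Demo: returns the recorded order, or -1 if the chain did not finish. */
int demo_chain(void)
{
    struct aio_request r = { .nsteps = 0, .next = 0 };
    struct trace t = { 0 };

    req_add(&r, step_open_file);
    req_add(&r, step_open_socket);
    req_add(&r, step_get_pages);
    if (req_run(&r, &t) != 3)
        return -1;
    return t.order;
}
```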
Reading the microthreading design notes
created by Zach Brown (Oracle), I recall
a comparison of the NPTL and
Erlang threading models - they are _completely_ different
models. NPTL creates real threads, which is supposed (I hope NOT)
to be implemented in the microthreading design too. It is slow.
(Or is it not, Zach? We are intrigued :)
It's damn bloody slow to create a thread compared to a correct non-blocking
state machine. The TUX state machine is similar to what I had in my first kevent-based
FS and network AIO patchset, and to what I will use for the current async
processing work.
A bit of empty words actually, but it can provide some food for
thought.