Zbr's days.


Sat, 19 Apr 2008

hbukittbd: Andrew Morton proposes new userspace/kernelspace interface.

Rusty Russell is the author of vringfd() (the name says it all), a new interface for event ring buffer management.
Quotation from Andrew Morton:

This is may be our third high-bandwidth user/kernel interface to transport bulk data ("hbukittbd") which was implemented because its predecessors weren't quite right. In a year or two's time someone else will need a hbukittbd and will find that the existing three aren't quite right and will give us another one. One day we need to stop doing this ;)
...
So I think it would be good to plonk the proposed interface on the table and have a poke at it. Is it compat-safe? Is it extensible in a backward-compatible fashion? Are there future-safe changes we should make to it? Can Michael Kerrisk understand, review and document it? etc.

You know what I'm saying ;) What is the proposed interface?
Just for reference, I've filed it under the kevent tag :)

/devel/kevent :: Link / Comments ()


Thu, 07 Feb 2008

Memory notification events.

Jake Edge posted an article at LWN.net about various memory pressure notifications that userspace applications may be interested in.
For example, an application can wait for swap in/out notifications or for an OOM condition long before it is killed by the OOM killer, so it can free some unused RAM (Firefox, for instance, could free its cache of recently viewed pages).

Notifications are delivered to userspace via the /dev/mem_notify file, which is readable and pollable. An alternative is to have a SIGIO signal sent to the process when the device becomes readable.
The patch will likely be accepted soon.
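
As a rough illustration of what "readable and pollable" means for a consumer, here is a minimal Python sketch. Since /dev/mem_notify comes from a proposed patch and may not exist on a given kernel, a pipe stands in for the device; the helper name wait_readable is made up for this example.

```python
import os
import select

def wait_readable(fd, timeout_ms):
    """Return True if fd becomes readable before the timeout expires."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    events = poller.poll(timeout_ms)
    return any(mask & select.POLLIN for _, mask in events)

# Assumption: a pipe plays the role of the notification device here.
read_end, write_end = os.pipe()

idle = wait_readable(read_end, 0)         # nothing has been posted yet
os.write(write_end, b"x")                 # pretend a notification arrived
notified = wait_readable(read_end, 1000)  # now the descriptor is readable
```

With a real pollable device, the application would open it and pass its descriptor to the same helper, then free caches when the call returns True.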

This is another example of the real need for a unified event management subsystem in the Linux kernel.

/devel/kevent :: Link / Comments ()


Tue, 14 Aug 2007

Kevent has been removed from kernel summit agenda.

So have the other event delivery mechanisms. It seems things have settled down and everyone is happy.

/devel/kevent :: Link / Comments ()


Tue, 15 May 2007

Kevent/eventfd/pollfs discussion thread.


Here one can find the discussion I referred to a week ago.
Actually I do not understand Davide Libenzi's claims that kevent is:

Monolitic and interface-centric solutions, or better, solutions in search of a problem
taking into account that each kevent kernel user is about 300 lines of code including comments, that its registration is completely pluggable even at run time, and that its memory overhead is several times lower. I'm sitting and wondering...

And to start the day, two citations from that thread:
Yes, of course. If we're heading to yet-another monolitic interface, we're heading with no valid reasons given if other than some handwaving. While there are quite a few (modularity, compatibilty, plus the other ones that came in my mind and that I explained in the way-too-many emails) to back a file-based approach.
...
That's my point. I think we ultimately have to have something like kevent and then all this *fd() work is unnecessary and just adds code to the kernel which has to be kept around and which might hinder further work in this area.
But actually who cares? I've just looked into the mainline git tree and found that the whole eventfd patchset is already committed there.
Personally I do not care at all.

Actually my massive kevent spambombing resulted in some changes in people's minds - I see that a lot of kernel projects have started to add 'takeN' to their short descriptions to show the iteration number, which I did not see before.

/devel/kevent :: Link / Comments ()


Thu, 10 May 2007

Kevent strikes back?


Xavier Nicollet has sent me an LWN article about yet another round of discussions about event delivery mechanisms in the kernel.
After Davide Libenzi started his signalfd/timerfd/*fd patchset after eventfs and kevent, there was quite a bit of discussion about it, and eventually something new called pollfs (I did not follow those threads, since I'm not subscribed to linux-kernel@) appeared and started to fight for a place under the sun. Andrew Morton highlighted that eventfs by Davide will likely be included into 2.6.22, but strange things started to appear here...
Ulrich Drepper, the monarch of glibc land, said that he opposes eventfd and similar patchsets because they lack functionality for high-performance servers (mainly because of the absence of the special-purpose ring buffer implemented in kevent; there are also words about kevent's ability to carry more information and other small bits) and wants something similar to kevent to go in. Davide seems to be opposed to adding such functionality (and actually it is not that simple a task with the old poll design).
As before, there is no active discussion and there are no developers supporting Ulrich's position, but there are no developers against it either, so likely nothing will be included into mainline in the 2.6.22 round.

Frankly, I removed the kevent git tree from my machine, but I have backups...
So I do not know whether I should dust off the kevent patchset for yet another round of (empty, as I predict) discussion or just forget it completely.
Let's see if new patchsets appear in the near future. I will follow the LWN kernel page; if nothing is changed or added, I will have a talk with Andrew about whether this step is needed at all.
We will see...

/devel/kevent :: Link / Comments ()


Fri, 16 Mar 2007

Asyncfd by Davide Libenzi got ring buffer.


The implementation allows posting events into a userspace ring buffer.
There are several disadvantages though:

  • it cannot be called from interrupt context.
  • events can be lost when the ring is full, since events are posted unconditionally.
  • it has redundant fields in the API.
Briefly, it goes the way kevent went, but with additional problems.
For example, I do not see how to solve the second problem in the list above. Kevent has (had) its own queue of events, which were copied from the queue when userspace updated the indexes, so if the userspace ring was full they would not be lost. The existing mechanism does not have such a queue at all: events are just blindly copied and thus can be lost when the ring is full.
The first problem must be solved too to support tricky usage (as Davide mentioned in the patch description, USB calls might end up calling it from interrupt context).
One of the problems Davide solves either by using locked userspace memory (which is a big no-no), or by using kernel buffers and read(), which is inconvenient and just breaks the whole idea of a ring buffer: having multiple events in userspace without additional syscalls (only waiting ones) and having syscalls as thread cancellation points without additional structures shared between threads.
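
The lost-event problem can be shown with a toy model. This is only a sketch under my own assumptions: BlindRing and QueuedRing are hypothetical classes, not code from asyncfd or kevent, and "blind posting" is modelled as overwriting the oldest unread slot.

```python
from collections import deque

class BlindRing:
    """Fixed-size ring where the producer posts unconditionally."""
    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0   # next position the producer writes
        self.tail = 0   # next position the consumer reads
        self.lost = 0

    def post(self, event):
        if self.head - self.tail == len(self.slots):
            # Ring is full: the oldest unread event gets overwritten.
            self.tail += 1
            self.lost += 1
        self.slots[self.head % len(self.slots)] = event
        self.head += 1

    def consume(self):
        out = []
        while self.tail < self.head:
            out.append(self.slots[self.tail % len(self.slots)])
            self.tail += 1
        return out

class QueuedRing(BlindRing):
    """Same ring, but events wait in a producer-side queue while it is full."""
    def __init__(self, size):
        super().__init__(size)
        self.backlog = deque()

    def _flush(self):
        while self.backlog and self.head - self.tail < len(self.slots):
            super().post(self.backlog.popleft())

    def post(self, event):
        self.backlog.append(event)
        self._flush()

    def consume(self):
        out = super().consume()
        self._flush()   # the consumer advanced its index: refill the ring
        return out

def drain(ring, n_events):
    """Post n_events in a burst, then read until the ring stays empty."""
    for i in range(n_events):
        ring.post(i)
    seen = []
    while True:
        batch = ring.consume()
        if not batch:
            return seen
        seen.extend(batch)

blind_seen = drain(BlindRing(4), 10)
queued_seen = drain(QueuedRing(4), 10)
```

With a burst of 10 events into a 4-slot ring, the blind variant delivers only the last four, while the queued variant delivers all ten in order.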

The timerfd implementation will likely raise the same questions from Ulrich Drepper as my kevent timers did - they are not needed, and POSIX timers support should be extended instead (although I always provided two patches, one for POSIX timers and one for my own API; maybe Davide will use both too). But POSIX timer events become ready in softirq context, so it is impossible to use the above ring buffer implementation for them.
The timerfd code also uses the long type in structures shared between userspace and kernelspace, which does not work on x86_64 when userspace is 32-bit.
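
To see why a long in a shared structure breaks a 32-bit userspace on a 64-bit kernel, here is a small sketch with ctypes. The two structures are hypothetical stand-ins (they are not the real timerfd layout): one uses 4-byte fields, as a 32-bit process sees long, and the other 8-byte fields, as an x86_64 kernel sees it.

```python
import ctypes

class SpecUser32(ctypes.LittleEndianStructure):
    # What a 32-bit userspace sees: "long" is 4 bytes.
    _fields_ = [("sec", ctypes.c_int32), ("nsec", ctypes.c_int32)]

class SpecKernel64(ctypes.LittleEndianStructure):
    # What an x86_64 kernel sees: "long" is 8 bytes.
    _fields_ = [("sec", ctypes.c_int64), ("nsec", ctypes.c_int64)]

sizes = (ctypes.sizeof(SpecUser32), ctypes.sizeof(SpecKernel64))

# The kernel fills in its own layout...
kernel_view = SpecKernel64(sec=1, nsec=500)
raw = bytes(kernel_view)

# ...and the 32-bit process reads the same bytes with its layout.
user_view = SpecUser32.from_buffer_copy(raw)
```

The structures differ in size (8 vs. 16 bytes), and the 32-bit reader picks up the upper half of the kernel's sec field where it expects nsec, so it sees 0 instead of 500. Fixed-width types in the shared structure avoid this.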

Yes, I know, I'm a bastard moron, since I write only bad things in my blog, but hey, I was not on the Cc: list :)
Actually I'm glad people started to work in that area after kevent; it means that my ideas were correct and eventually Linux will adopt a good solution for these problems.
I think Davide will resolve all issues quickly.

/devel/kevent :: Link / Comments ()


Sun, 04 Mar 2007

Kevent.


In case you were confused by this post, saying that I never used any other project to advertise kevent, and this post about how I used the threadlet thread to advertise kevent.
The latter is a very ironic post about how the only way to get attention is to enter some other (remotely related) thread and start aggressively stating your own point of view. If you think otherwise - that is your problem and basically you suck.
I indeed did that, but that was not done to show that threadlets suck and kevent is cool.
Absolutely.
I like Ingo's threadlet approach and agree that it is a very simple and quite unproblematic way to create a highly parallelized environment, but I just think (based on my experience with threads in Linux) that a thread per IO (or per block) is not a good solution for the AIO model. I do not say that it is worse than kevent (or any other event-driven model); I just want to say that, as is, it is not the best solution from the performance point of view, and both event driven (kevent or epoll; the former will almost surely be declined, so I proposed another idea: a new filesystem for each type of event, i.e. signalfs, timerfs and so on; that approach does have problems and scales worse than kevent (it is bound to the file structure, which is quite heavy), but it allows having epoll as the controlling interface) model.
Although it does require writing new code both in kernelspace and userspace and changing existing applications, just like kevent, and although it likely scales worse than kevent, which is there already, it will probably be blessed by developers.
But I do not care actually :)
There are a lot of other interesting things (without politics).
Stay tuned.

/devel/kevent :: Link / Comments ()


Fri, 02 Mar 2007

How things can end up after some trivial initial misunderstanding. [ND]


(you are never ever wrong, and if you are proven wrong on topic A you claim it is an irrelevant topic (without even admitting you were wrong about it) and you point to topic B claiming it's the /real/ topic you talked about all along. And along the way you are slandering other projects like epoll and threadlets, distorting the discussion. This kind of keep-the-ball-moving discussion style is effective in politics but IMO it's a waste of time when developing a kernel.)
Sigh, we started a very interesting topic, but ended up with personal insults, things attributed to people who never said them, and assumptions that are not even remotely grounded - and all just because of some trivial misunderstanding of the simplest context.

That's a pity, but well, if things have moved to that point, fuck them, there will be tons of others.
It seems I broke possibly good relations with some very interesting kernel hackers, but I never wanted to insult any person or project for my own advertisement. If people cannot (or do not) understand that, then, well, let's just move on.

/devel/kevent :: Link / Comments ()


Quotation of the week. [ND]


Ingo Molnar wrote in linux-kernel@ in threadlet/syslet thread:

sure, if we debate its virtualization driven market penetration via self promoting technologies that also drive customer satisfaction, then we'll be able to increase shareholder value by improving the user experience and we'll also succeed in turning this vision into a supply/demand marketplace. Or not?
I'm going to write that mantra in some piece of the memory.

/devel/kevent :: Link / Comments ()


Tue, 27 Feb 2007

All fights about kevent vs. threadlet are over. [ND]


I hope so, since wasting time on completely empty discussions is not a way to make any progress.
Eventually we all (Linus, Ingo and me) concluded that it is better to live in peace and have a mixed event and thread design: the event ring can be used for IO completion events, and the IO itself can be backed by a threadlet if it blocks.
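
A toy sketch of that mixed design, under loud assumptions: a Python queue stands in for the event ring, a plain thread stands in for a threadlet, and the would_block flag is supplied by the caller instead of being detected by the kernel. None of the names come from the real patches.

```python
import queue
import threading

completion_ring = queue.Queue()   # stands in for the kernel event ring

def submit(name, work, would_block):
    """Complete inline when possible; otherwise back the request with a thread."""
    if not would_block:
        completion_ring.put((name, work()))
        return None
    helper = threading.Thread(
        target=lambda: completion_ring.put((name, work())))
    helper.start()
    return helper

helpers = [
    submit("cached-read", lambda: b"from page cache", would_block=False),
    submit("disk-read", lambda: b"from disk", would_block=True),
]
for helper in helpers:
    if helper is not None:
        helper.join()

# The caller sees only completion events, whichever path served the request.
results = dict(completion_ring.get() for _ in range(2))
```

The point of the sketch is that the consumer has a single place to wait: both the inline and the thread-backed request end up as entries in the same completion ring.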

So, let's count up what happened:

  • we wasted 48 hours on stupid words thrown at each other
  • I ran some syslet tests and found that:
    • in Ingo's web server there are no reschedulings at all in my environments, so threadlets do not work there as cachemiss threads
    • the real disk IO case (using Jens Axboe's FIO tool) shows a 30% speed degradation for syslets with the CFQ scheduler compared to libaio and sync reads; with the deadline scheduler the degradation is about 8-10%
  • threadlets are simpler to program for simple test cases (when the whole logical processing can be put into a single function without any interaction with other parts); otherwise it can end up as a disaster of synchronization problems
  • kevent likely will not be included
  • kevent likely has some bugs, or I screwed up my aio tree (it includes syslets/threadlets and kevent)
  • Andrew Morton gave a talk at FOSDEM 2007, where he also mentioned kevent as a good thing which fails to get attention, since it is a big step in the way people use the kernel. Either kevent or something similar could be merged. (Thanks to Xavier Nicollet for pointing that out.)
    My pessimistic prognosis is that kevent will be declined.
Many thanks to Ingo Molnar, Davide Libenzi, Linus Torvalds and all the others for a (sometimes) interesting discussion.

/devel/kevent :: Link / Comments ()


Mon, 26 Feb 2007

Kevent vs. epoll vs. threadlet on VIA EPIA.


Small machine (256 MB of RAM, 1 GHz):

kevent:		849.72
epoll:		538.16
threadlet:
# gcc ./evserver_epoll_threadlet.c -o ./evserver_epoll_threadlet
In file included from ./evserver_epoll_threadlet.c:30:
./threadlet.h: In function threadlet_exec:
./threadlet.h:46: error: can't find a register in class GENERAL_REGS while reloading asm
That particular threadlet asm optimization does not work.

/devel/kevent :: Link / Comments ()


How to get attention to your project (secret plans up to now). Kevent and Ingo's threadlets. Scholastic masturbation about events and execution context. [ND]


As you might have noticed, kevent has not been in favour up to now: it is quite a good subsystem, but it gets no attention. So, here are the steps to get it.

AIO is a hot topic these days - mainly because of Zach Brown's efforts to create a generic AIO subsystem suitable for database usage on top of a micro-thread design (blessed by Linus Torvalds some time ago).
As you have noticed from this blog, I'm against such a step, so I started to participate in the related threads.
Eventually Ingo Molnar implemented his own micro-thread AIO design. It has the same generic disadvantages as all the other micro-thread implementations from Zach and Linus, but since Ingo is a scheduler-god, he has some tricks (shit, I would look into a dictionary to check what the part of clothing that starts near the shoulder and covers the arm is called, but it is time to work without a dictionary (and the admin at my paid work fucked my connection (hacked a bit thanks to bugs in the firewall setup and social engineering) down to frequently miserable bytes per second)); well, Ingo has some tricks which allow creating extremely high-performance kernel threads on x86.
So, kevent is in danger: one of its users, kevent AIO, created as a pure state machine over some VFS bits and network items, can be replaced with the (ugly) micro-thread design.
I started some participation (mainly in the blog) in the related threads, so eventually, when the first syslet patch was created, I was on the Cc list and started to analyze the patchset (both in the blog, without political correctness, and on the mailing lists, with it).

Now we have following things:

  • Ingo-scheduler-god is for his threadlet/syslet model, but agrees that kevents can be superior in some aspects.
  • Linus-linux-god is against me entirely, mainly because we are doing purely theoretical, even scholastic, masturbation about events vs. execution context (read his mail to David Miller, where he argues that it is idiotic to think about the network as an AIO model; I will return to that item soon, it is too much fun to talk with Linus using his own words).
As you can see, Linus and Ingo (and a lot of other hackers) talk about kevent and about events vs. context in general. It is good and interesting, and not only because kevent is discussed.
And it is challenging!

Heh, things have ended up in a complete fight against kevent.
Ingo Molnar wrote:
conclusion: currently i dont see a compelling need for the kevents subsystem. epoll is a pretty nice API and it covers most of the event sources and nicely builds upon our existing poll() infrastructure.
I need to say that I have not run any tests yet, since my Intel Core2 test machine has an x86_64 distro, I cannot download a new distro due to bandwidth limitations, and my VIA EPIA machine is still compiling a tree... So right now I'm cooking up a tree which includes both the kevent and syslet/threadlet patches to test kevent/epoll/threadlet. I will do it on a couple of my test machines: the VIA EPIA (1 GHz, 256 MB of RAM, 100 Mbit Ethernet) and the Intel Core2 CPU (2.40 GHz, 2 GB of RAM, gigabit Ethernet). The client will be 'ab' on my desktop Intel Core Duo 3.7 GHz test machine with gigabit Ethernet.

So, stay tuned - if I complete all the setup before I go climbing (and the wine does not run out), I will post all results here (even if kevent shitassly sucks).

Meanwhile, here are several runs of the same command on x86_64:
ab -c8000 -n80000 http://192.168.0.48/

kevent			8447.78
			5248.45
			4792.66
			4689.78
			4978.08
			5207.60

epoll			4572.20
			5024.89
			4580.57
			4800.19
			4708.29
Even by the median, kevent is faster than epoll. And sometimes it is almost two times faster.
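
The claim can be checked directly from the numbers listed above. A small sketch; I assume the figures are the requests-per-second values reported by 'ab', since the runs above do not label the unit.

```python
from statistics import median

# Results of the runs listed above (unit assumed to be requests per second).
kevent = [8447.78, 5248.45, 4792.66, 4689.78, 4978.08, 5207.60]
epoll = [4572.20, 5024.89, 4580.57, 4800.19, 4708.29]

kevent_median = median(kevent)            # mean of the two middle runs
epoll_median = median(epoll)              # middle run of five
median_ratio = kevent_median / epoll_median
best_ratio = max(kevent) / min(epoll)     # best kevent run vs. worst epoll run
```

The medians come out at about 5092.84 for kevent and 4708.29 for epoll, roughly an 8% advantage. The "almost two times" figure corresponds to the best kevent run against the slowest epoll run, a ratio of about 1.85.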

/devel/kevent :: Link / Comments ()


Sat, 24 Feb 2007

New lighttpd patch.


I've released a new lighttpd patch for kevent; it fixes a potential stall of the server. Due to the complete absence of documentation for programmers (and very good docs for administrators) I made some mistakes when programming the addition and deletion callbacks (and it is possible that there are others, since I still do not understand some fields).
With this version (for the kevent-37 release, although it can be used with 26+ releases too) a benchmark of kevent vs. epoll on the lighttpd server does not show any difference (about 4k requests per second), while the optimized trivial web server evserver_kevent.c produces up to 7900 req/second with the 'ab' benchmark and a 14k index page.

/devel/kevent :: Link / Comments ()


Thu, 22 Feb 2007

New kevent 'take37' release. [ND].


Short changelog:

  • ported to 2.6.21-rc1
  • documentation cleanups by Frederik Deweerdt
And here are my last words on kevent.

If you are somehow related to kevent development, please do not take any action because of this post. Thank you.

A bit of history of the project.
I first developed it to create asynchronous network IO, as a challenge from Stephen Hemminger. Last winter I created both kevent as a generic event delivery mechanism and network AIO on top of it. It worked, and it showed a noticeable performance and usability win. I tried to push it upstream but never got a response.
Later last year David Miller gave kevent a good kick and I started to push it upstream again.
There were a number of changes aimed at improving performance and extending the feature list, and eventually we have a subsystem which indeed can be called a generic event handling mechanism.
The set of features it can work with includes file descriptor events, the same as in the usual poll()/select() (there is a patch which implements epoll() over kevent), special high-performance socket and pipe notifications, timer expiration events, POSIX timers and signal notifications, arbitrary private userspace notifications and eventually even network AIO (sendfile(), open+sendfile()+close()).
I ported libevent and lighttpd to kevent.
But I continuously have trouble obtaining feedback from kernel developers, despite huge support from several core kernel hackers, and I want to thank them and a lot of other people who helped to develop that subsystem.

I do not care much about kevent inclusion - I hacked on it not for that but for the process itself - but such a suspended situation costs a lot of time, throwing code and words again and again into a black hole. To my last kick to Andrew Morton and Ulrich Drepper about including kevent I got the response that it needs some more review and some more comments.

So, -rc1 is out and I've sent the latest kevent release. If there is no feedback I will not continue to push it upstream. I will continue to support it on a per-kernel-release basis, like I have done for many years for acrypto (asynchronous crypto layer). It is not too much work to maintain that, but pushing it into a rock wall is a bit uninteresting and actually a stupid way to spend time.
Please do not kick anyone to get a review or something like that - if people do not want it right now, there is no need to force them - eventually it can end up with some better system from some other kernel hacker (let's see how the acrypto work resulted (or maybe not) in the upcoming async crypto changes in the Linux crypto stack).

As I expect to get some free time after -rc1 is out, I will continue to work on my interesting projects; the nearest ones are an M:N threading model and the netchannels trie patch to replace socket hash tables.
It looks like it is time to start a new generation filesystem implementation too.

/devel/kevent :: Link / Comments ()


Thinking more on syslets/threadlets and kevent. [ND]


As I showed, the thread-like AIO design is utterly broken when it comes to networking, so if it is accepted, a network programmer who cares about performance will need to wait on two types of events: AIO events, which deliver data from disks, and polling for network socket events. Eventually someone will reinvent/extend kevent to support AIO notifications.
Likely, if I add futex waiting to kevent, things will work just out of the box.
There is a post in my draft box, which will be the last one in the kevent series - I'm only waiting for the -rc1 release so that I can resend kevent one last time.

/devel/kevent :: Link / Comments ()


New syslet/threadlet release by Ingo Molnar.[ND]


The feature set includes a new entity called a threadlet - it is a function which is executed in a sync context, and if it blocks (for example in some syslet) a new thread is created on behalf of that function, so this approach allows massively parallel execution.
Unfortunately there are serious caveats in the design - threadlets are not even designed to work with the network, where essentially most of the calls sleep.
The main problem here is the fact that having a whole thread to handle a simple execution context is wrong - it does not scale - rescheduling is damn too slow when crossing the kernelspace->userspace boundary, and that is why I am developing an alternative M:N threading model in userspace.
I also doubt its usefulness - people who care will create their own thread for that, since POSIX thread creation overhead is generally much smaller than the time the function executes.

Among other changes:

  • got rid of locked pages for the completion ring. (Just as predicted. Which other ideas implemented in kevent will find their place in syslets?)
  • multiple completion rings
  • removed initialization and cleanup functions; instead, parameters are explicitly passed into the execution syscall
  • bug fixes

/devel/kevent/aio :: Link / Comments ()


Sat, 17 Feb 2007

Kevent has a git tree now.


Interested parties can clone the kevent git tree via http (164 MB currently):

$ git clone http://tservice.net.ru/~s0mbre/archive/kevent/kevent.git/ ./
Here one can find the steps needed to create a public git repo (which is not that trivial and is not described in popular git tutorials).

/devel/kevent :: Link / Comments ()


Linus and GNOME, kernel and kevent.


Is this a double standard, or just a lack of attention?
Linus proposed some patches to GNOME with this description:

I've sent out patches. The code is actually _cleaner_ after my patches, and the end result is more capable. We'll see what happens.

... Now the question is, will people take the patches, or will they keep their heads up their arses and claim that configurability is bad, even when it makes things more logical, and code more readable.
Doesn't it sound too similar to kevent mails? Damn yes.

/devel/kevent :: Link / Comments ()


Fri, 16 Feb 2007

Syslet cancellation.


Davide Libenzi wrote:

What about the busy_async_threads list becoming a hash/rb_tree indexed by syslet_atom ptr. A cancel would lookup the thread and send a signal (of course, signal handling of the async threads should be set properly)?
I am tired of repeating that if AIO enters the async path, then it becomes a real event which can be waited for. And that is done through kevent, which already supports cancellation. It also supports waiting for events and reading them through the kevent queue and ring buffer. The ring buffer is implemented in userspace - guess what will be added in the v3 release of the syslets?

/devel/kevent :: Link / Comments ()


Syslets and events.


Ingo Molnar wrote:

We have kaio that is centered around block drivers - then we have epoll that works best with networking, and inotify that deals with some (but not all) VFS events - but neither supports every IO and event disciple well, at once.
I'm disappointed. I would understand if it was written by Linus Torvalds, but not Ingo...

/devel/kevent :: Link / Comments ()


Tue, 13 Feb 2007

Kevent is not closed.


I've got feedback from Ulrich Drepper (he never replied because kevent always failed to compile for him, not because it is that good) and from Andrew, who asked me to resend kevent after rc1 is out.
Ok, let's wait for a week.

Meanwhile I committed documentation cleanups from Frederik Deweerdt.

Thanks everyone for support.

/devel/kevent :: Link / Comments ()


Mon, 12 Feb 2007

New kevent 'take36' release. More interesting descriptions.


Short changelog:

  • Fixed typo in Makefile about kevent based replacement for epoll (not included into patchset) which led to compilation failure.
  • Changed AIO description text.
Here are the changes:
 -[take35 10/10] kevent: Kevent based AIO (aio_sendfile()/aio_sendfile_path()).
 +[take36 10/10] kevent: Kevent based generic AIO.

 -Kevent based AIO (aio_sendfile()/aio_sendfile_path()).
 +Kevent based generic AIO.
 +
 +This patch only implements network AIO, which is _COMPLETELY_
 +impossible and broken in _ANY_ micro-thread design. For details
 +and test consider following link:
 +http://tservice.net.ru/~s0mbre/blog/2007/02/10#2007_02_10
 +
 +Designing AIO without network in mind can only be result of heavy hang-over.
 +
 +Kevent AIO is implemented as state machine.
 +There is a patch which implements async open/send_header/sendfile/close.
Changes in main description:
 -[take35 0/10] kevent: Generic event handling mechanism.
 +[take36 0/10] kevent: Generic event handling mechanism [new description text for generic AIO].

 -Generic event handling mechanism.
 +Generic event handling mechanism [new description text for generic AIO].
 +
 + [ Consider reading at least introduction texts for patches ]

  Kevent is a generic subsytem which allows to handle event notifications.
  It supports both level and edge triggered events. It is similar to
  poll/epoll in some cases, but it is more scalable, it is faster and
 +
 + It can serve as a storage for different AIO models as well - in case
 + syscall or other request is ready immediately kevent returns that event
 + in submission point.

 + Number of comments dropped to zero several releases ago -
 + it is a sign that API, design and implementation are perfect.
 +
 Consider for inclusion.
I also asked Andrew Morton (again) about his plans to include or decline it.
If that goes into the void again, I think it is time to close the project.

/devel/kevent :: Link / Comments ()


Sun, 11 Feb 2007

Linus on AIO. Limitations of the proposal. Practical example of weakness.


Linus Torvalds wrote in reply to the tcp_sendmsg() example:

> Will you create a thread every time tcp_sendmsg() hits the send queue
> limits?

No. You use epoll() for those.
I.e. we are designing asynchronous IO which is limited from the start to not being used with the network?
I.e. AIO cannot be used in anything connected to the network, since even if the disk read/write is asynchronous, sending will block and thus we just lose all possible advantages.

He continues:
There's a reason why a lot of UNIX system calls are blocking: they just don't make sense as event models, because there is no sensible half-way point that you can keep track of (filename lookup is the most common example).
Linus - blocking IS waiting for an event which will remove that block.
Linux even uses wait_event_*() calls - don't you think that name makes some sense?
Filename lookup is just reading an inode from disk - when it is done, the filename is ready, and that is an event.
And actually no one uses async filename lookup - people use the open() syscall, which is perfectly eventable - the block-removal event is the readiness of the opened file descriptor - it is even used in kevent AIO (surprise?) as a part of the async sendfile transfer state machine (but I must admit that opening always happens in async mode as part of the state machine, so it will lose some ticks if things are perfectly in the cache, but practice shows that async sendfile is faster).

In another mail Linus continues to burn things out:
You use the AIO stuff for things that you *expect* to be almost instantaneous. Even if you actually start ten thousand IO's in one go, and they all do IO, you would hopefully expect that the first ones start completing before you've even submitted them all. If that's not true, then you'd just be better off using epoll.
I.e. we should not use AIO for the case when a request really blocks, only when it is synchronous and maybe sometimes blocks.
Linus, direct IO (used by databases) blocks all the time, sync IO blocks all the time, the network blocks, pipes block, readahead blocks - only the simplest case of reading from the VFS cache does not block.

And eventually Linus proposes waiting for AIO events:
for (;;) {
	async(epoll);	/* wait for networking events */
	async_wait();	/* wait for epoll _or_ any of the outstanding file IO events */
	handle_completed_events();
}
Linus - you have just introduced waiting for AIO events, i.e. a new type of event, which is supposed to wrap async completions. And since every async syscall is such a new event, we can wait for them in a userspace loop.
You may not know it, but kevent is supposed to wait on every possible type of event - you do not need to wrap sync event-waiting calls (like epoll()) into an async helper and then wait for that - just register the event with kevent at the point where your patchset currently forks.

And to draw the line: AIO by micro-threads is not even supposed to work in environments where it will block all the time (like network or direct IO), instead in blocking environments events should be used, since they are much more scalable.

Micro-thread AIO sucks even when reading from a file - a practical example: if a file happens to sit on a bad block, reading will block for too long (seconds!), and the system can simply be killed by rescheduling when there are a lot of threads waiting for read completion on those blocks.

And to finally kill such design, here is another test I created.
Consider a directory with a high number of inner dirs and files (hundreds), whose total size is 3 times smaller than the amount of RAM (1 GB of RAM vs. 300 MB of data), and several applications which run and randomly copy data from one file to another.
I've put several printks in __lock_page() (i.e. where the requesting application blocks and thus a new thread would be created) and watched a nice picture with up to a hundred blocks happening per second (and that is just for the case when the size of the test dir is 3 times smaller than RAM; what will happen when the size of the dir is larger than the amount of RAM I do not even want to imagine):
printk: 84 messages suppressed.
__lock_page: aio_new_thread: 6650.
printk: 118 messages suppressed.
__lock_page: aio_new_thread: 6769.

Conclusion: 'f toppku', i.e. into the furnace.

/devel/kevent/aio :: Link / Comments ()


Sat, 10 Feb 2007

Test which shows how broken is thread-like AIO design.


Linus has proposed yet another way to do async syscalls.
It is a bit similar to fibrils, but differs in that Linus' patch just creates a new real thread if the async call blocks. So, when a syscall blocks, the system returns to the user as a different thread.

There is a huge problem with that - per-syscall thread creation/destruction. Linus, why do you think people do not create a new thread each time a new client connects to a web server?

Artificial example with sys_stat64() does not count - try to have thousands of such threads.
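
To get a feeling for the per-call thread cost on a given machine, the comparison can be sketched in userspace. This is only an illustration with POSIX threads, not Linus' patch or any kernel code: `time_direct()` and `time_thread_per_call()` are hypothetical helpers which run a cheap stat() either directly or in a freshly created and joined thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sys/stat.h>
#include <time.h>

/* The call being made "asynchronous": a cheap, normally cached stat(). */
static void *do_stat(void *arg)
{
    struct stat st;

    (void)stat((const char *)arg, &st);
    return NULL;
}

static double now_sec(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Run stat() n times directly; returns elapsed seconds. */
double time_direct(const char *path, int n)
{
    double start = now_sec();

    assert(path != NULL);
    for (int i = 0; i < n; i++)
        do_stat((void *)path);
    return now_sec() - start;
}

/*
 * Run stat() n times, each in a newly created and joined thread.
 * Returns elapsed seconds, or -1.0 if a thread could not be created.
 */
double time_thread_per_call(const char *path, int n)
{
    double start = now_sec();

    assert(path != NULL);
    for (int i = 0; i < n; i++) {
        pthread_t t;

        if (pthread_create(&t, NULL, do_stat, (void *)path) != 0)
            return -1.0;
        pthread_join(t, NULL);
    }
    return now_sec() - start;
}
```

Calling both helpers with the same path and the same n (say, 10000) and comparing the two results gives a rough idea of what thread creation and destruction add on top of the call itself. The actual numbers depend on the machine and the libc, so none are quoted here.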

That approach sucks even more than fibrils, imho, although Zach's one has the problem that a fibril is always created, no matter whether the call blocks or not.

Rescheduling is a problem.

To prove that this is a huge problem I've set up a simple test - I changed all socket allocations made from process context (actually only in the TCP sending functions) to GFP_ATOMIC, so every failed allocation marks a point where the process would have been put to sleep under the previous GFP_KERNEL policy, and thus a new thread would have been created.
So, I got the following results:

tcp_sendmsg: sock: ffff810038e57900, wait: 562.
tcp_sendmsg: sock: ffff810038e57340, wait: 563.
tcp_sendmsg: sock: ffff810038e56d80, wait: 564.
tcp_sendmsg: sock: ffff810038e567c0, wait: 565.
tcp_sendmsg: sock: ffff810038e56200, wait: 566.
printk: 20458 messages suppressed.
tcp_sendmsg: sock: ffff81003363d300, wait: 21025.
and the like...

That was a simple test run, a couple of seconds long, of the ab benchmark against the lighttpd web server on a 2.6.20 kernel - about 4k connections per second, 80k connections in total, a trivial index page (taken from the debian installer), on an athlon64 with 1gb of ram connected over a 1gbit link.

And during that simple test the system would have created 21k threads?
No way, it is just a broken design. It is wrong.

So, read my lips - ev-e-ry-thing con-nec-ted to the net-work sle-eps.

Linus, if you read this (although I doubt it), please, do not make a terrible mistake.
Do not include kevent if you do not want to, but please think about the above test before it is too late.

/devel/kevent/aio :: Link / Comments ()


Wed, 07 Feb 2007

New kevent slogan - 'kevent can do everything you ever thought of. Completely.'


What this thread shows me is the fact that I created quite a good kevent subsystem, but unfortunately only a few people know about it. Let's see:

  • most AIO will not block, so we do not need special setup, and those calls that do block need to wait (Linus Torvalds) - kevent does allow non-blocking events and does not differentiate between the two cases: if an event is ready immediately, it is returned as ready at submission time; if it is not ready, one waits for it in the queue.
  • 90% of database AIO will block (Joel Becker from Oracle) - kevent works perfectly well with such loads. It was designed for them.
  • file binding is too expensive... we want to wait on different kinds of events (Linus Torvalds and Oracle folks) - kevent was designed for that scenario. It works for any kind of event, and its setup cost is far lower than that of a whole file structure.
  • AIO is better implemented as a state machine (Ingo Molnar) - I've done it already in kevent AIO.
  • AIO must be implemented as scheduled-away micro-threads (Zach Brown from Oracle) - kevent allows waiting on those fibrils, which block and are scheduled away. Kevent can also be used as storage for the information about an async syscall - if the syscall does not block, kevent returns at the submission point, otherwise it is possible to wait until it is ready either through the kevent queue or through the ring buffer.
  • AIO syscalls should take the form of a struct with the syscall number and an array of its args - kevent can be the storage for that - see above: when the syscall being executed does not block, kevent returns immediately (if several kevents are submitted in one batch, the number of ready kevents is returned and the ready kevents are copied to the beginning of the submission array); if the syscall blocks, kevent allows waiting until it is ready through its queue or ring buffer.
  • what design is kevent going to crack today?
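
The batch behaviour mentioned in the list (the number of ready events is returned and the ready ones end up at the beginning of the submission array) can be sketched in plain userspace C. This is a simulation under stated assumptions, not the kevent code: `struct demo_req`, its `would_block` flag and `demo_submit()` are hypothetical names, and the real `struct ukevent` layout is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request, standing in for one submitted event. */
struct demo_req {
    int id;          /* user cookie */
    int would_block; /* stand-in for "the underlying operation must wait" */
    int ready;       /* set when the request completed at submission time */
};

/*
 * Submit a batch. Requests that complete immediately are moved to the
 * front of the array (keeping their relative order) and counted; the
 * others stay behind them, as if they were queued for later delivery.
 * Returns the number of ready requests.
 */
size_t demo_submit(struct demo_req *reqs, size_t n)
{
    size_t nready = 0;

    assert(reqs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct demo_req tmp;

        if (reqs[i].would_block)
            continue;               /* remains pending */

        reqs[i].ready = 1;
        tmp = reqs[nready];         /* swap into the "ready" prefix */
        reqs[nready] = reqs[i];
        reqs[i] = tmp;
        nready++;
    }
    return nready;
}
```

The caller then only has to look at the first `nready` entries after submission and waits on the queue or ring buffer for the rest, which is the single-path behaviour for blocking and non-blocking requests argued for above.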

/devel/kevent :: Link / Comments ()


Continuing the AIO thread and kevents.


Linus Torvalds wrote:

Don't be silly. AIO isn't an event. AIO is an *action*.

The event part is hopefully something that doesn't even *happen*.

Why do people ignore this? Look at a web server: I can pretty much guarantee that 99% of all filesystem accesses are cached, and doing them as "events" would be a total and utter waste of time.

You want to do them synchronously, as fast as possible, and you do NOT want to see them as any kind of asynchronous events.
It is up to the AIO interface - it can request a completion event after it has checked that the operation blocks; if it can not check that in advance, it can still use kevents perfectly fine, since kevents do not block and are returned immediately (if the special flag KEVENT_REQ_ALWAYS_QUEUE is not set) when the underlying subsystem does not block and can provide data immediately. Setup for a kevent is a cache allocation and queueing into a binary tree, so one can calculate how fast it is - it is just bloody faster than process creation or rescheduling.
Actually that is silly - if thread creation and kevent creation had anywhere near similar performance costs, no one would ever have used polling - people would use a real thread per event. Fibril allocation (and it always happens) is slower than kevent allocation, and in case of blocking the price becomes just too high.

Yeah, in 1% of all cases it will block, and you'll want to wait for them. Maybe the kevent queue works then, but if it needs any more setup than the nonblocking case, that's a big no.
Linus, you have never ever read what I posted in all the kevent related threads - shame on you, since it was you who appreciated a similar design (well, it was created by you, but it was much simpler and feature-free) several years ago - but emotions must go away - kevent does not differentiate between blocking and non-blocking modes unless special flags are set - in the usual case a kevent is allocated, its fields are filled in and it is queued into the tree.

Eric Dumazet wrote:
It seems to me that kevent was designed to handle many events sources on a single endpoint, like epoll (but with different internals). Typical load of thousand of sockets/pipes providers glued into one queue.

In the fibril case, I guess a thread wont have many fibrils lying around...

Also, kevent needs a fd lookup/fput to retrieve some queued events, and that may be a performance hit for the AIO case, (fget/fput in a multi-threaded program cost some atomic ops)
Let me clarify what event sources are and how events are delivered to userspace - it is possible to have tons of event sources (timers, sockets, pipes, fibrils, anything) in the same kevent tree, but they are posted back to userspace one by one, through the ring buffer or through a syscall, to preserve order (one can have multiple threads reading the same ring buffer though). If there is not an enormous number of fibrils per process, then it is fine - their number does not play any significant role.

To get events through the kevent queue an fd lookup is needed - but it is much-much-much less costly than rescheduling and thus fibril completion. If there are multiple threads doing completion processing, they should use the ring buffer - it was created exactly for that purpose, so the kevent queue will not see a lot of lookups per event at all and there will be no simultaneous access to atomic operations either - think about the compare and exchange operation in modern CPUs, which Ulrich wanted to implement in threads reading the kevent ring buffer.
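
How several threads can read one ring buffer with a compare and exchange operation can be sketched with C11 atomics. This is a minimal illustration, assuming a single producer and a made-up layout: `struct demo_ring`, `ring_push()` and `ring_pop()` are hypothetical and do not reproduce the real kevent ring buffer format.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8u    /* power of two, chosen only for the demo */

struct demo_ring {
    _Atomic int events[RING_SIZE];
    atomic_uint head;   /* next slot to consume, advanced by readers */
    atomic_uint tail;   /* next slot to fill, advanced by the producer */
};

void ring_init(struct demo_ring *r)
{
    assert(r != NULL);
    for (unsigned i = 0; i < RING_SIZE; i++)
        atomic_init(&r->events[i], 0);
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Single producer (the kernel side in the real design); false when full. */
bool ring_push(struct demo_ring *r, int ev)
{
    unsigned tail = atomic_load(&r->tail);
    unsigned head = atomic_load(&r->head);

    if (tail - head == RING_SIZE)
        return false;
    atomic_store(&r->events[tail % RING_SIZE], ev);
    atomic_store(&r->tail, tail + 1);
    return true;
}

/* Consumers: any number of threads may call this concurrently. */
bool ring_pop(struct demo_ring *r, int *ev)
{
    unsigned head = atomic_load(&r->head);

    for (;;) {
        int val;

        if (head == atomic_load(&r->tail))
            return false;           /* ring is empty */
        val = atomic_load(&r->events[head % RING_SIZE]);
        /* Claim the slot; on failure 'head' is reloaded and we retry. */
        if (atomic_compare_exchange_weak(&r->head, &head, head + 1)) {
            *ev = val;
            return true;
        }
    }
}
```

Each event is handed to exactly one reader: whichever thread wins the compare and exchange on `head` owns the slot, and the losers retry with the updated index. No lock and no per-event syscall is needed on the reading side.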

Kent Overstreet wrote:
An app can have a bunch of cheap, fast user space threads servicing whatever; as they run, they can push their system calls onto a global stack. When no more can run, it does a giant asys_submit (something similar to io_submit), then the io_getevents equivilant, running the user threads that had their syscalls complete.
That is exactly how kevents work - no need to reinvent the wheel. And that is, btw, how AIO sendfile works (except for the underlying state machine) - it gathers the file descriptor or path, the socket and the length/offset parameters, combines them into a private structure and creates a kevent, which will either be completed immediately if sending does not block, or return an error which the state machine will handle (in case the file descriptor or socket is in non-blocking mode), or the handling thread blocks and is scheduled away, so a new one can pick up the next request.
The whole sub-thread started after this message is speculating about how we could invent new things, which happen to have been implemented a year ago and are called kevent. Hey, it is already done.

Davide Libenzi wrote:
Note that it's not a trivial tasks to extract a long enough level of parallelism, that would make you feel pain in having to walk through the submission array. Think about the trivial web server case. Remote HTTP client asks one page, and you may think to batch a few ops together (like a stat, open, send headers, and sendfile for example), but those cannot be vectored since they have to complete in order. The stat would even trigger different response to the HTTP client. You need the open() fd to submit the send-headers and sendfile.
Really? And how do you think it can be solved? By issuing tons of async IO? No, it is way faster to solve it with a state machine, as suggested by Ingo Molnar and me - kevent AIO works that way, AIO sendfile was implemented that way, and it was proven that the state machine (the file path and header are passed into the call, which internally opens the file, sends its content (populating pages into the VFS cache if needed) and closes the file) works faster even for pages already in the VFS cache (which are not supposed to block). Who is going to eat his own hat?

Possible AIO design:
struct async_submit {
	void *cookie;
	int sysc_nbr;
	int nargs;
	long args[ASYNC_MAX_ARGS];
};
struct async_result {
	void *cookie;
	long result;
};
Silly, if you are going to implement it that way - kevent already can do it for you. It will also allow you to wait (btw, Linus, databases which perform direct IO will block on 90% of all their requests, so they do wait, so they do need kevent, but that is another story).
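
To make the proposed structures concrete, here is a small userspace sketch that executes one submission synchronously with the Linux syscall() wrapper and fills in the result. It only illustrates the data flow, not any of the proposed kernel implementations: `demo_run_one()` is a hypothetical helper, `ASYNC_MAX_ARGS` is assumed to be 6 (the quoted mail does not define it), and errors are reported as a negative errno value.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ASYNC_MAX_ARGS 6    /* assumption: at most six syscall arguments */

struct async_submit {
    void *cookie;
    int sysc_nbr;
    int nargs;
    long args[ASYNC_MAX_ARGS];
};

struct async_result {
    void *cookie;
    long result;
};

/*
 * Execute one submission synchronously. The cookie is passed through
 * untouched; the result is the syscall return value or -errno.
 */
void demo_run_one(const struct async_submit *s, struct async_result *r)
{
    long a[ASYNC_MAX_ARGS] = { 0 };
    long ret;

    assert(s != NULL && r != NULL);
    r->cookie = s->cookie;
    if (s->nargs < 0 || s->nargs > ASYNC_MAX_ARGS) {
        r->result = -EINVAL;
        return;
    }
    for (int i = 0; i < s->nargs; i++)
        a[i] = s->args[i];

    errno = 0;
    ret = syscall(s->sysc_nbr, a[0], a[1], a[2], a[3], a[4], a[5]);
    r->result = (ret == -1 && errno != 0) ? -errno : ret;
}
```

For example, a submission with `sysc_nbr = SYS_getpid` and `nargs = 0` produces the caller's pid in `result`, and `SYS_close` on descriptor -1 produces `-EBADF`. An asynchronous implementation would return such a result later through a queue or ring buffer instead of filling it in on the spot.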

/devel/kevent :: Link / Comments ()


Tue, 06 Feb 2007

Linus Torvalds again.


He wrote:

Also, quite frankly, I tend to find Uli over-designs things.
Linus, do not look at the design, look at the fucking implementation, which has been there for a long time already and was created long before Ulrich created his design, which in its first edition was a subset of kevent (the ring buffer and a possible thread id were added later).
One just needs to patch and run - it is already done.
Totally and completely.

Another sentence:
We want less code. The whole (and really, the _only_) point of the fibrils, at least as far as I'm concerned, is to *not* have special code for aio_read/write/whatever.
I tend to agree here, but Linus, you are missing the main problem - if more code means faster processing, it is better, so fibril rescheduling just can not compete with specially written AIO calls - just like a copy written in C can not compete with an asm one - fibrils do not scale with SMP, rescheduling a system with 10k fibrils is impossible, task allocation (even of a smaller task) is slow - there are too many things where fibrils suck.

/devel/kevent :: Link / Comments ()


Continuing the fibril AIO thread. This time Linus Torvalds.


Linus Torvalds wrote:

But if you want to, we could have a *separate* "convert async cookie to fd" so that you can poll for it, or something. I doubt very many people want to do that. It would tend to simply be nicer to do
	async(poll);
	async(waitpid);
	async(.. wait for anything else ..)
followed by a
wait_for_async();
That's just a much NICER approach, I would argue. And it automatically and very naturally solves the "wait for different kinds of events" question, in a way that "poll()" never did (except by turning all events into file descriptors or signals).
Linus, wake up, it is done already. I have been posting a new patchset each week for the last several months (I started more than a year ago), and it implements that already.
Linus even says the magic words: wait for different kinds of events, which is written in every mail about kevent I have sent for the last year.

Kevent already allows waiting on different kinds of events - from file descriptors down to signals or timers. On any kind of event. Trivially.

/devel/kevent :: Link / Comments ()


Is kevent perfect?


David Miller wrote:

I'd be quiet if there were some well formulated objections to his work being posted, but people are posting nothing. So either it's a perfect API or people aren't giving it the attention and consideration it deserves.

Obviously, kevent is perfect!

/devel/kevent :: Link / Comments ()


Interesting, where does Oracle get that?


Scot McKinley at Oracle wrote:

As Joel mentioned earlier, from an Oracle perspective, one of the key things we are looking for is a nice clean *common* wait point. We don't really care whether this common wait point is the old libaio:async-poll, epoll, or "wait_for_async". And if "wait_for_async" has the added benefit of scaling, all the better.
However, it is desirable for that common wait-routine to have the ability to return explicit completions, instead of requiring a follow-on call to some other query/wait for events/completions for each of the different type of async submissions done (poll, pid, i/o, ...). Obviously not a "must-have", but desirable.
It is also desirable (if possible) to have immediate completions (either immediate errs or async submissions that complete synchronously) communicated at submission time, instead of via the common wait-routine.
Finally, it is agreed that neg-errno is a much better approach for the return code. The threading/concurrency issues associated w/ the current unix errno has always been buggy area for Oracle Networking code.
Scot, you did your homework quite badly - just check linux-kernel@ and read the kevent mails I post about weekly - everything you described above was implemented about a year ago already - and as a bonus it has a lot of additional high-performance features, interfaces and use cases added later.

/devel/kevent :: Link / Comments ()


Continuing the fibril AIO discussion.


David Miller has entered the fibril discussion in support of kevent, thanks a lot David.
Let's see what answers he got.

Davide Libenzi wrote:

Zab's async syscall interface is a pretty simple one. It accepts the syscall number, the parameters for the syscall, and a cookie. It returns a syscall result code, and your cookie...
Could this submission/retrieval be inglobated inside a "generic" submission/retrieval API? Sure you can. But then you end up having submission/event structures with 17 members, 3 of which are valid at each time. The API becomes more difficult to use IMO, because suddendly you have to know which field are good for each event you're submitting/fetching.

So, what is wrong with all those people? Kevent's structure does take an event type and id as parameters - which is exactly the syscall number - and there is also a user pointer, which can be used as a cookie or as a pointer to the data where the syscall parameters live.

But thinking some more about it, this becomes a completely wrong discussion - one can use any userspace interface he/she likes - kevent is an event handling mechanism, it is used for AIO completions, it is used in POSIX timers - fibril AIO (whose design I do not agree with) can use kevent as a queue which will return all its results - just see how it is implemented in kevent AIO or POSIX timers and adapt it to fibrils, or at least read the documentation or mail annotations, which describe kevent use cases.

What Davide Libenzi gets wrong is the fact that neither kevent nor poll()/epoll() is supposed to be an AIO mechanism - they were created to deliver events - AIO completion is one of the events which can be perfectly delivered by kevent and poll()/epoll(); the latter requires heavy file-> bindings, while the former does not need them at all.
Kevent AIO, fibril AIO or FSAIO - they are different mechanisms for doing AIO, but all of them can use kevent to deliver their own events, like completions or errors.

/devel/kevent :: Link / Comments ()


Mon, 05 Feb 2007

More comments on fibril AIO.


Zach Brown wrote:

Being able to wait on that with file->poll() obviously requires juggling file-> associations which sounds like more weight than we might want. Or it'd be optional and we'd get more moving parts and divergent paths to test.

Davide Libenzi wrote:
Yes, no need for the above. We can just host a poll/epoll in an async() operation, and demultiplex once that gets ready.

If you follow the kevent blog and know how it works, you will understand why I want to crack my head against the wall.

/devel/kevent :: Link / Comments ()


Kevent compilation is broken - patch has been released.


Blame me for that - I generate patches from my git tree automatically and never test them for correctness in patch form, only as a git tree, so a small error sneaked in, which was caught by David M. Lloyd. The fix is trivial.
This hunk is related to my replacement of epoll with kevent (which I did not include in the main patchset for a simple reason - I do not want to kill someone's work without the author's approval).

/devel/kevent :: Link / Comments ()


Sun, 04 Feb 2007

Quotation of the week:

One of the big problems today is that you can either sleep for your I/O in io_getevents() or for your connect()/accept() in poll()/epoll(), but there is nowhere you can sleep for all of them at once. That's why the aio list continually proposes aio_poll() or returning aio events via epoll().
(c) Oracle.

Original (thread about fibrils and AIO by scheduling stacks).

What can I say - NIH syndrome is washing brains (I must say I have it too), or people are just too lazy to bother reading something created by others.

/devel/kevent/aio :: Link / Comments ()


Sat, 03 Feb 2007

Fibrils and threads.


Let me briefly describe how I understood Ingo Molnar (and his idea of having a set of working threads, each of which works with caches of requests, which form the state machine of a given AIO operation), Linus Torvalds and Zach Brown (who invented fibrils - small process stacks which are created for each AIO call and scheduled when the operation blocks).

Zach's main idea is to clone (actually create a limited clone of) the current process' stack and issue the call in it, so essentially he creates a new lightweight thread per AIO call. That thread is called a fibril; it is limited in that it is strictly bound to the executing process, thus a fibril can not be scheduled while the main process is not running. It is exactly userspace threads, described here, but without SMP scalability, and in kernelspace.
So, when an AIO call is performed, a special stack is allocated on behalf of the calling process, and that stack works like a micro-thread, which can sleep, can be scheduled away (and the main process will continue its execution right after the call), can be awakened and so on. The main disadvantages are the impossibility to scale on SMP, since a fibril is bound to its process and is not a real thread; per-syscall stack allocation (which is quite heavy, about 1.5k) and initialization; and the rescheduling cost (consider thousands of pending AIO calls, each of which is a small thread).

Another approach was invented by Ingo Molnar and implemented in his Tux in-kernel http/ftp server and in kevent AIO (I thought of it independently, but did look into the Tux code later).
This approach embeds a limited set of real working threads, say 10 or 2 or 64, per process (in kevent it is a global limit), which execute operation requests, each of which forms some state machine; for example, async sendfile consists of opening the file (name lookup), sending its data (which in turn consists of populating data into the VFS cache and actually sending it over the network) and closing the file. Such a cache (in kevent notation it is also called a 'request') of functions, which form a state machine, is executed on one of the real threads until some operation blocks; when that happens, other threads can start executing their requests, so this approach scales well on SMP.
The main disadvantage is the complexity of state machine programming, but it is quite doable, as kevent AIO and Tux show.
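
The request-as-state-machine idea can be sketched as a toy simulation. Everything here is hypothetical and only mirrors the open/send/close example from the text: `struct demo_request`, its `blocks_left` counter (how many times the send step reports that it would block) and `demo_worker()` are illustration names, not code from kevent AIO or Tux, and a single worker is shown instead of a pool.

```c
#include <assert.h>
#include <stddef.h>

enum demo_state { ST_OPEN, ST_SEND, ST_CLOSE, ST_DONE };
enum demo_status { STEP_OK, STEP_WOULD_BLOCK };

struct demo_request {
    enum demo_state state;
    int blocks_left;    /* simulated: times ST_SEND still "would block" */
    int steps_run;      /* how many steps were attempted for this request */
};

/* Advance a request by one step of its state machine. */
static enum demo_status demo_step(struct demo_request *rq)
{
    rq->steps_run++;
    switch (rq->state) {
    case ST_OPEN:
        rq->state = ST_SEND;
        return STEP_OK;
    case ST_SEND:
        if (rq->blocks_left > 0) {
            rq->blocks_left--;
            return STEP_WOULD_BLOCK;   /* worker moves on to other work */
        }
        rq->state = ST_CLOSE;
        return STEP_OK;
    case ST_CLOSE:
        rq->state = ST_DONE;
        return STEP_OK;
    case ST_DONE:
        break;
    }
    return STEP_OK;
}

/*
 * One worker thread's loop: run every request until it finishes or
 * would block, then pick up the next one. Returns the number of
 * passes over the request array that were needed.
 */
int demo_worker(struct demo_request *rqs, size_t n)
{
    int passes = 0;
    size_t done = 0;

    assert(rqs != NULL || n == 0);
    while (done < n) {
        done = 0;
        passes++;
        for (size_t i = 0; i < n; i++) {
            while (rqs[i].state != ST_DONE &&
                   demo_step(&rqs[i]) == STEP_OK)
                ;
            if (rqs[i].state == ST_DONE)
                done++;
        }
    }
    return passes;
}
```

A request that never blocks runs through all three steps in a single pass, while a blocking one costs the worker only one failed step per pass. No stack is allocated and no context is switched per request, which is the point of the approach.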

So, for kevent, I've put into the TODO list extending it to create a set of threads on behalf of each process which issues AIO requests (instead of having global threads), and to spread work between them without blocking the main process. It would look like implicit creation of several threads using the clone() syscall, so that all threads would have the same process ID and would obey the process' rlimits.

/devel/kevent :: Link / Comments ()


Fibril AIO, kevent AIO and Ingo's ideas.


Are you fucking kidding?
It does exist in kevents already.
Everything.
From the beginning to the end.
Completely.

I sometimes feel that kevent mails are not even read by anybody, but until now I thought that only the diff itself was not read (I deliberately put a couple of fancy printks in the code, which were only noticed by Jonathan Corbet at LWN), but that at least the annotations were.
Now it looks like they were not read either.

And another one.

What is really good in that discussion is the fact that kevent AIO should have a pool of threads per process, not a global one. That will solve any rlimit problems and allow scaling to any number of processes issuing AIO calls. In that regard kevent is currently limited, since the number of working threads does not scale with the number of processes that started AIO work.

/devel/kevent :: Link / Comments ()


Fri, 02 Feb 2007

I could not resist. New kevent feature - async file open.


Async open/send/close sequence. I posted a patch for the existing kevent AIO support, which adds async open to the existing send/close sequence (the latter is implicit). Open was sync before.
With the existing kevent AIO state machine it becomes really trivial; although the patch is rather broken (there is no need to set up a file descriptor at all for that case), it clearly shows how simple AIO becomes with the right state machine involved. It would be possible to add networking there too, but then I expect Andrew will never ever look into the patchset.

Yes, I'm a loser, I shamelessly advertise myself on every corner, but I do not care, really. If developers do not review the work otherwise, I do not feel bad at all about showing them what I think about it.

/devel/kevent :: Link / Comments ()


Thu, 01 Feb 2007

New kevent 'take35' release.


Short changelog:

  • Ported to the 2.6.20-rc7 (9be5b038b1c9d1927c367bf91683458e10d5d4eb) tree

Jeff Garzik asked Andrew Morton whether kevent will be included in the upcoming -mm release but, like David Miller, Ingo Molnar and many others before him, failed to get any definitive answer.

Andrew suggested that someone perform a line-by-line review of the patchset (it grew to 208 Kb after most major features were implemented) and that someone provide a diff between the current implementation and what Ulrich Drepper has in mind for this project.
Unfortunately this requires either a mind-reading machine or some feedback from Ulrich. The last mail from him about kevent was related to release 25, which implemented most of the changes he wanted; although I do not like them, it was at least some progress.
The remaining differences are the signal mask in syscalls, which the nature of kevent signal delivery does not require, since the mask of pending signals is not updated if a special flag is set, and extra functionality (like hrtimers accessible through the kevent interface and as a POSIX addon).
But as practice shows, that is not enough to convince anyone, since the people I argued with are not going to listen and understand why and how kevent works.
But I will try. Again.

As for the current kevent status, it is somewhere in a feature-freeze state, awaiting (maybe forever) a decision.

/devel/kevent :: Link / Comments ()


Thu, 25 Jan 2007

New kevent version 'take34' has been released.


The only change is a header pointer in aio_sendfile_path(). Now one can use aio_sendfile_path() to send a file with a header in one syscall instead of three: send(header), open(file), sendfile().
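
For reference, the traditional multi-call sequence that the new syscall replaces can be sketched with standard Linux calls. The signature of aio_sendfile_path() itself is not shown in this post, so it is not reproduced here; `demo_send_with_header()` is a hypothetical helper that only demonstrates the send(header), open(file), sendfile() path, with loops added for short writes.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send 'hdr' followed by the whole contents of the file at 'path' over
 * the connected socket 'sock'. Returns the total number of bytes sent,
 * or -1 on error.
 */
ssize_t demo_send_with_header(int sock, const void *hdr, size_t hdr_len,
                              const char *path)
{
    const char *p = hdr;
    ssize_t total = 0;
    ssize_t n;
    int fd;

    assert(hdr != NULL && path != NULL);

    /* 1. send(header), repeated in case of short writes */
    while (hdr_len > 0) {
        n = send(sock, p, hdr_len, 0);
        if (n < 0)
            return -1;
        p += n;
        hdr_len -= (size_t)n;
        total += n;
    }

    /* 2. open(file) */
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* 3. sendfile(), repeated until it reports end of file */
    while ((n = sendfile(sock, fd, NULL, 1 << 20)) > 0)
        total += n;

    close(fd);
    return n < 0 ? -1 : total;
}
```

Even in the best case this is three transitions into the kernel plus a close(), and each of them can block the caller, which is what a single asynchronous call taking the path and the header avoids.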

/devel/kevent :: Link / Comments ()


Fri, 19 Jan 2007

Kevent feature request for aio_sendfile().


Suparna Bhattacharya (IBM) has requested a new feature in the asynchronous file sending syscall - a header pointer, whose data will be put into the socket queue before the file's data.
Although Linux syscall overhead is extremely small compared to other Unix systems, it is still not zero, so since I already "optimized" (i.e. removed) the open() call in aio_sendfile_path(), I think things will not become worse if I put a header pointer and length there too.
I plan to release a new kevent version with this feature after the M-on-N threading model implementation.

/devel/kevent :: Link / Comments ()

