Tue, 28 Feb 2006
Kevent based AIO.
I have to admit that switching to the kevent requeueing
design was a mistake. A kevent can not be
removed from the storage list, used, and then queued back, since
it can be removed and freed in that period by the appropriate ioctl/syscall.
Due to the nature of kevents, they can not have fine-grained reference counters,
since they can only be freed in one place (a one-shot
kevent can be removed in two places, which are
synchronized by kevent_user->ctl_mutex) in process
context. But a kevent can be queued into three different lists:
two of them are synchronized by the kevent_user->ctl_mutex
mutex and per-list locks, and the third one, kevent_storage->list, was
synchronized by kevent_storage->lock, so the removal
path could get full access to the list once that lock was grabbed.
Now that is not possible, since the lock is dropped by kevent_storage_ready()
before it calls kevent->callback().
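The window described above can be illustrated with a small userspace model. This is only a sketch in Python, not kernel code: the Kevent and Storage classes, the freed flag and the removal_path() helper are hypothetical stand-ins, and the "concurrent" removal is simulated by running it from inside the callback.

```python
class Kevent:
    def __init__(self, name):
        self.name = name
        self.freed = False   # stands in for kfree(): set when memory would be released


class Storage:
    """Hypothetical stand-in for kevent_storage: just the list of kevents."""
    def __init__(self, kevents):
        self.list = list(kevents)


def storage_ready_requeue(storage, callback):
    """Requeueing variant: detach each kevent, run its callback with the
    storage lock dropped, then put it back on the list."""
    used_after_free = []
    for kevent in list(storage.list):
        storage.list.remove(kevent)   # detached; the lock is dropped from here on
        callback(kevent)              # another context may run during this window
        if kevent.freed:
            # Real kernel code can not detect this; it would requeue freed memory.
            used_after_free.append(kevent.name)
        else:
            storage.list.append(kevent)
    return used_after_free


def removal_path(kevent):
    """Stands in for the ioctl/syscall that removes and frees a kevent."""
    kevent.freed = True


a, b = Kevent("a"), Kevent("b")
storage = Storage([a, b])
# Simulate the removal path hitting kevent "a" while its callback runs.
victims = storage_ready_requeue(storage, lambda k: removal_path(k) if k is a else None)
```

In this model "a" ends up in victims: it was freed while detached from the storage list, which is exactly the case that the storage lock used to exclude.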
So, I'm going to revert that change and start thinking about how to
reconcile network callback invocation, which happens with softirqs disabled,
with the block layer, which assumes that IRQs are disabled when kevents become ready,
i.e. when BIOs are completed.
Fixed a tricky bug in the queueing procedure. It was racy all along,
but huge network transfers did not catch it because of their callback nature:
it is very likely that a networking kevent on a fast machine with a fast LAN
will be processed immediately in the ->enqueue()
method, so the race did not show up.
The next step is to finish AIO reading support, which just requires the reading offset
to be provided from userspace. That will allow me to run real benchmarks and start the aio_sendfile()
implementation.
:: Link / Comments (0)
Mon, 27 Feb 2006
Kevent based AIO.
Ok, it looks like it is more or less ready, but there is a big problem:
kevent setup parameters are provided through struct ukevent from userspace,
but in the AIO case there is no place there to put the file offset to start reading from,
so currently it only reads from file offset zero.
Here is a brief benchmark.
Reading directly into userspace pages from disk using 40k buffers.
hdparm shows a maximum reading speed of about 41 Mbytes/sec.
AIO CPU usage is always zero, speed varies from 16 to 40 Mbytes/sec.
Synchronous read shows a steady 26 Mbytes/sec with 5% CPU usage.
I do not take buffered reading into account, since it is not implemented for kevent based AIO yet.
:: Link / Comments (0)
Sun, 26 Feb 2006
Kevent based AIO.
Ok, first troubles:
- kevent calls its callbacks with BH disabled, which is perfectly ok for socket notifications, but the block layer
calls its bio->bi_end_io() callback with IRQs disabled, so kevent will trigger a warning when
it re-enables BHs.
- it looks like in the reading case one BIO can in general contain only one page, since a BIO is a request to physical media,
but FS blocks in general are not laid out contiguously on physical media.
Actually the first (locking) problem should be solved by switching to a different kind of locking.
kevent_storage->lock protects the storage list of kevents from modifications,
which can only happen in process context, and actually it is perfectly legal to reenter
kevent_storage_ready(), since it does not modify the storage list.
Another solution is requeueing of kevents to/from the storage, i.e. remove the kevent from the storage
list, call its callback and then enqueue it back, so the callback will always be called in the right context.
Ok, after some digging into kevent internals, I've implemented the second approach, and things do not look worse
from a performance point of view,
although one regression was introduced with the aio_* syscalls, which is fixed now.
And there is still one regression in the receiving code. I do not know when it was broken, so I will spend the rest of the
day searching for it.
It looks like there is some major error in the network AIO design:
performance is 2 times slower than synchronous sending or receiving
if only one network flow is used, although CPU usage is 2-6 times lower.
There must be some recent regression, since at least receiving behaved completely differently
one month ago.
Hmm, but after the receiving buffer was increased from 4k to 40k, the asynchronous test starts to show comparable
performance with zero CPU usage, while the classical synchronous test eats about 6-7% CPU.
Receiving speed is about 20-30 Mbytes/sec over gigabit LAN with some jumps
from 15 to 40 Mbytes/sec; the sender is a Realtek 8169, the receiver is an e1000.
:: Link / Comments (0)
Sat, 25 Feb 2006
Kevent based AIO.
Ok, kevent based AIO reading has its first stage implemented -
BIOs are allocated and a read request can be completed.
Currently it allows filling one userspace page with file data per call;
the next step is implementing multi-page reading.
:: Link / Comments (0)
There is one terrible thing I really do not like and probably even fear.
It is a visit to the teethbreaker, aka the dentist.
I've done it today: quite a long time there,
some scary imaginings beforehand and actually nothing
awful at all. Modern medicine can do magical things.
:: Link / Comments (0)
Fri, 24 Feb 2006
Kevent, network AIO.
Hacked a little on the kevent based AIO reading mechanism.
For a start I've decided to implement reading data using only the block layer,
without touching VFS at all, so it is somewhat similar to direct IO.
Data will be copied directly into userspace pages from the block layer.
There are several places where kevent insertion can sleep:
- when allocating various objects
- when getting userspace pages
- when getting FS blocks
Currently I do not see how to remove those possible sleeps, but since
it happens in the ->enqueue() callback in process context,
I do not think it is worth changing.
Each submitted BIO has a private ->bi_end_io() callback which
invokes the kevent subsystem by calling kevent_storage_ready().
All the code is actually do_mpage_readpage() and its environment,
so no rocket science here.
And since it is completely untested, I can not even say if it is the right approach or not.
:: Link / Comments (0)
Thu, 23 Feb 2006
Congratulations on Defender of the Fatherland Day.
It is an interesting holiday in Russia, since, from one point of view,
it is a holiday of all men, but from another point of view it is a
holiday of professional soldiers only.
Anyway, I've visited the restaurant "5 oborotov"
with Grange, where we had a very good time.
I definitely recommend this place to anyone.
We took part in an interesting competition devoted to "Liebenweiss" beer,
where everyone had to create a short story in which each word starts with the letter "L".
Grange's and my broken brains gave birth to something like this cruft:
Лазарь, ласкавший лошадь в лепрозории с лавандой,
И льющийся лубочный легочный лишай...
(roughly: "Lazarus, caressing a horse in a leper colony with lavender, and the pouring, lubok-style lung lichen...")
and so on...
And we got the first place!
:: Link / Comments (0)
Wed, 22 Feb 2006
Kevent, network asynchronous IO.
I decided to try to implement an asynchronous sendfile()
syscall the hard way, i.e. by implementing AIO reading using
kevent.
No one needs an excuse to reinvent the wheel.
The main difference between network data receiving and FS reading is that
in networking there is a remote side which pushes data to us, while in
the FS case the initiator must fetch that data itself. So in networking it is easy
to put a kevent callback somewhere in the RX path, which will be automatically
called when an RX interrupt happens in the NIC.
In the FS case there is no remote side, but there is a block layer, which actually
has the bio->bi_end_io() callback, invoked each time a
BIO (block IO request) has been processed, so in theory we could place our kevent callback
somewhere in the codepath waiting for BIO completion.
But in practice we need to read the file, not raw data from the block device,
so we must follow the FS specific path to the block device and call FS specific
functions to obtain the next block of data, and these functions can sleep and
behave in an absolutely unpredictable manner.
And we also need exclusive rights to the VFS pages holding data from
our file, so the pages must be locked, and page locking can sleep too.
block_read_full_page(), the generic "read page" function for
block devices that have the normal get_block functionality,
i.e. most of the block device filesystems, sleeps waiting for the buffer_head
to become uptodate.
So, given the above observations, asynchronous reading based on kevent
is going to be developed with the following in mind:
- if pages are in the VFS cache, then they are just copied to the user's buffer, like the synchronous code does.
- if pages are not present in VFS, then the process will look similar to
do_mpage_readpage():
- the required number of pages is allocated.
- blocks are set up using the FS specific
get_block() callback.
- a new BIO is constructed using the above information.
- the BIO is submitted for IO.
- the bio->bi_end_io() callback calls the kevent subsystem.
- the kevent subsystem copies data into the user's buffer.
Many of the steps above require sleeping and process context, so it still requires a lot of
thinking.
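The planned flow can be sketched as a small userspace model. This is Python, not kernel code, and everything in it is an assumption made for illustration: the Bio and AsyncReader classes are hypothetical, the disk is a dictionary from block number to data, get_block is any function mapping a page index to a block number, and "IO completion" is triggered by hand with complete_all().

```python
from collections import deque


class Bio:
    """Hypothetical stand-in for a block IO request covering one page."""
    def __init__(self, block, end_io):
        self.block = block
        self.end_io = end_io


class AsyncReader:
    def __init__(self, disk, get_block, page_cache=None):
        self.disk = disk                    # block number -> bytes, plays the block device
        self.get_block = get_block          # page index -> block number, FS specific
        self.page_cache = page_cache or {}  # page index -> bytes already in memory
        self.submitted = deque()            # BIOs waiting for completion

    def read_page(self, index, user_buf):
        if index in self.page_cache:
            # Cached page: copy to the user's buffer right away.
            user_buf[index] = self.page_cache[index]
            return "copied"
        block = self.get_block(index)       # may sleep in a real filesystem

        def end_io(data):
            # Completion side: the data lands in the user's buffer.
            user_buf[index] = data

        self.submitted.append(Bio(block, end_io))
        return "submitted"

    def complete_all(self):
        """Pretend the hardware finished every outstanding BIO."""
        while self.submitted:
            bio = self.submitted.popleft()
            bio.end_io(self.disk[bio.block])


disk = {10: b"page0", 11: b"page1"}
reader = AsyncReader(disk, get_block=lambda i: 10 + i, page_cache={0: b"page0"})
buf = {}
states = [reader.read_page(0, buf), reader.read_page(1, buf)]
before = dict(buf)          # only the cached page is present so far
reader.complete_all()       # now the second page arrives through end_io
```

The sketch leaves out page allocation and page locking, which are exactly the steps that need process context in the plan above.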
:: Link / Comments (0)
Monthly scheduling.
One of the things I do roughly each month is news reading.
I've just spent several hours reading popular Russian news sites,
and can say that absolutely nothing interesting has happened in the world.
USA is going to start searching for democracy in Iran, terror in the East,
fuel and energy crisis. All was the same one month ago. A solid negative flow
of various problems, which I can not solve and can not even help to solve.
The only exception is the Olympic Games.
Very likely I've just broken some good opinion about myself.
:: Link / Comments (0)
Tue, 21 Feb 2006
Linux has been ported to Sun Niagara T200 box.
David Miller, being under the influence of red-eyed people all over the world,
has started
Ubuntu 6.04 "Dapper Drake" on a T200 system.
:: Link / Comments (0)
Mon, 20 Feb 2006
Climbed with Grange.
That was very cool training; I've done a couple of old routes and several traverses.
It's time to start new ones; there are quite a lot of new complex routes in skala-city.
:: Link / Comments (0)
Terminal.
It looks like the word "terminal" takes its roots from the ancient Roman word "Terminalibus"
and the Roman god Terminus, who was a boundary protector. Feb 23-28 were Terminus' holidays,
when ancient Romans checked their land boundaries and celebrated, probably because the boundaries
were still the same or had even been extended.
:: Link / Comments (0)
Updated w1.
This update removes old-style kernel thread initialization and changes w1 to use the kthread API.
It is based on Christoph Hellwig's work.
New version is available in archive.
:: Link / Comments (0)
Sat, 18 Feb 2006
A link to kevent and network AIO can be found on Dan Kegel's The C10K problem page.
:: Link / Comments (0)
Kevent, network asynchronous IO.
I've finished the aio_send() and aio_recv() system calls implementation
and sent them to netdev@ for review.
It's time to go to the drawing board to design the aio_sendfile() method.
The main problem of asynchronous file sending is that reading from disk is always blocking,
so we can not read data in softirq context, where all asynchronous network processing happens.
One idea I've come up with is to initiate some workqueue task on behalf of the kernel's keventd thread, which will
read several pages from the given file and then simply start the kevent sending mechanism. It is not quite fair to
call such a design "asynchronous" sendfile, since it just moves work from one place (userspace)
into another (the kernel's keventd thread), although sending itself is supposed to be completely
asynchronous with respect to data reading.
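The workqueue idea can be sketched in userspace as follows. This is a Python model, not the kernel mechanism: a single-threaded ThreadPoolExecutor plays the role of keventd, and read_and_send(), sendfile_via_worker() and the send callback are hypothetical names introduced only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def read_and_send(path, chunk_size, send):
    """Runs in the worker: blocking reads, each chunk is handed to send()."""
    sent = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)   # may block on disk; fine in a worker thread
            if not data:
                return sent
            send(data)                  # stands in for starting the async send
            sent += len(data)


def sendfile_via_worker(executor, path, send, chunk_size=4096):
    """Returns at once with a future; the caller never blocks on the disk."""
    return executor.submit(read_and_send, path, chunk_size, send)
```

A caller would create the worker once, for example with ThreadPoolExecutor(max_workers=1), pass it to sendfile_via_worker() together with a file path and a send callback, and later check the returned future for the number of bytes handed over. As noted above, the blocking read does not go away; it only moves out of the caller's context.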
Another idea is to use asynchronous IO to read the data and then initiate the asynchronous sending mechanism,
but according to the OLS 2002 Proceedings,
where one can find a comparison of epoll read vs. aio read, AIO behaves worse than synchronous reading,
and at first glance I do not see how AIO read can end up invoking the kevent subsystem
without major digging into the AIO subsystem.
And the most ambitious idea is to implement AIO read using the kevent subsystem, or at least create
some kind of stub for kevent callback invocation.
All of the above requires a lot of thinking...
:: Link / Comments (0)
Fri, 17 Feb 2006
Linux kernel on SUN Niagara.
David Miller has announced
the first SMP boot of Linux kernel 2.6 on a SUN Fire Niagara T200 system.
:: Link / Comments (0)
Kevent and network asynchronous IO.
I've added syscalls for the kevent control mechanism and started
writing aio_* syscalls for network asynchronous IO.
System calls are definitely much better from the user's point of view than ioctl().
Climbed with Grange today.
Excellent training, I've finished my first 7a route, and actually it was not
that hard after several training sessions. Good weekend after a good week.
:: Link / Comments (0)
Wed, 15 Feb 2006
Network asynchronous IO.
Ok, I've fixed locking issues with zero-copy asynchronous sending.
The next step is to implement system calls for the receiving and sending functionality,
so it would not be as ugly as it is now.
Proposed system calls are
aio_recv(), aio_send() and aio_sendfile();
the last two are actually almost the same, since asynchronous sending support
was implemented using page pinning and a zero-copy approach, so they will differ
only in which pages are used - either userspace or VFS.
Climbed with Grange. Good training today - I'm tired as hell, but that was cool.
:: Link / Comments (0)
Tue, 14 Feb 2006
Zero-copy asynchronous sending support.
Network asynchronous sending support is implemented in a zero-copy way
similar to how sendpage() works.
Here is a benchmark of a system with 10 concurrent sockets used for
sending. The receiving server and the synchronous sender use epoll() to select the active socket.
Benchmark results.

Patch is available on project's
homepage.
Sending support is not 100% ready and requires a lot of testing and locking changes.
I've announced this project on the netdev@ mailing list.
:: Link / Comments (0)
Mon, 13 Feb 2006
Network asynchronous IO.
I've implemented a zero-copy asynchronous sending mechanism for Linux kernel 2.6.
Userspace pages are locked in memory and provided to the asynchronous callback,
which works similarly to the sendpage() method used in sendfile().
Performance is the same for asynchronous and synchronous sending processes, and stays at about 80 Mb/sec.
A much more interesting benchmark would be some kind of web server which uses
the asynchronous sending mechanism for each client. This test will be run
after I finish the asynchronous sendfile interface.
:: Link / Comments (0)
Climbed with Grange.
It was hard training today, but very good. I tried my new favorite
and, for me, the most complex route several times. There is
only one place on that green 7a where I currently fail,
although I can finish it only in pieces.
That was definitely a really excellent training.
:: Link / Comments (0)
Sun, 12 Feb 2006
Some notes about author.
You likely know nothing about me, so if you are interested,
I've created an about page.
:: Link / Comments (0)
Sat, 11 Feb 2006
Network asynchronous IO.
Initial design notes on asynchronous sending support.
Things are going to be very similar to what asynchronous receiving support does:
the sending callback will be based on the appropriate sendmsg()/sendpage()
callback from the synchronous codebase.
To understand how async sending will work, I will briefly describe the synchronous network sending mechanism here.
When userspace calls the send() syscall, it is passed to the protocol specific
sendmsg() callback.
Let's see how tcp_sendmsg() works.
The socket is locked and the function tries to send the user's data segment by segment from
a set of iovecs, which are constructed a bit earlier from the user's data.
If the socket's write queue is full and the process requested non-blocking IO, -EAGAIN will be returned here;
otherwise the process will be put to sleep.
When the write queue has free space, the process is woken up and the allocation operation is performed.
The system gets the last skb from the socket's write queue or,
if that skb does not have enough space, allocates a new one and adds it
to the end of the socket's write queue,
where the socket accounting code is called, which performs various checks of the socket's queue size and
memory pressure and increases the size of data put in the socket's write queue.
Then it copies data into the skb fragment list or linear space.
The tcp_sendmsg() callback then tries to push the first segment from the queue into the net.
This process ends up calling tcp_transmit_skb(), which creates the TCP header and checksum,
appends various TCP options and calls the AF_INET{4,6} specific queue_xmit() callback,
which is the ip_queue_xmit() function for IPv4, and either tcp_v6_xmit()
for usual IPv6, or ip_queue_xmit() for IPv6 mapped to IPv4.
ip_queue_xmit() will resolve the destination address through the routing tables
and IPsec if needed, or use skb->dst if the packet is already routed, like in the case of SCTP.
Then the IP header is allocated and built with all required options, and the whole packet
is pushed into netfilter's NF_IP_LOCAL_OUT hook, which ends up calling a callback
from the stackable dst_entry structure. This technique allows creating various paths
for various packet types: for example, IPsec will encrypt the packet and then push it for further
processing, while the usual IPv4 codepath will put ip_output() there.
We can even break the calling chain here and postpone further processing, like it was done for
asynchronous IPsec processing using
acrypto.
All methods will end up calling ip_output(),
which will either fragment the skb in ip_fragment() if the data size is more than the MTU of the given output device,
or call ip_finish_output(), which in turn runs into netfilter's NF_IP_POST_ROUTING hook,
which will eventually end up calling ip_finish_output2().
ip_finish_output2() performs several checks on header sizes and ends up
calling the dst->neighbour->output() callback, which is dev_queue_xmit()
in our case.
dev_queue_xmit() queues a buffer for transmission to a network device.
It is done through the qdisc interface if the device has a queue, or by calling dev->hard_start_xmit() directly
if the device does not have a queue; for example, the loopback device and various tunnels do not have queues.
The qdisc interface will call its ->enqueue() callback, which is specific to the traffic control
mechanism we are using, and ends up calling qdisc_restart(), which dequeues the first skb
by calling the ->dequeue() callback and calls dev->hard_start_xmit(),
which sends data to the wire and then must free the skb by calling dev_kfree_skb() or friends.
dev_kfree_skb() destroys the skb when its reference counter drops to zero.
Destruction is done in the __kfree_skb() function, where routing, security context,
connection tracking and other references are released and skb->destructor() is called, if it exists.
In our case it will be sock_wfree(), which will perform some socket accounting
and release the socket's reference.
kevent_naio_callback() will call the protocol specific async_send() callback,
which will be very similar to the sendmsg() callback described above, but always nonblocking.
It must be called when the socket's queue is not full
and we can put some data at the end. As we can see above, the socket's accounting code is called
in two places: when new data is added to the queue, where sk_wmem_alloc, the size of data put into the socket's write queue,
is increased, and in sock_wfree(), which is called from the skb destruction path,
where sk_wmem_alloc is decreased, which means the queue has some free space
where data can be copied.
I plan to place kevent_socket_notify() with the KEVENT_SOCKET_SEND event
in sock_wfree(), which will call kevent_naio_callback() if asynchronous
sending support is enabled for the given socket.
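The accounting-driven design above can be modelled in userspace. The following Python sketch is not kernel code: the Sock and AsyncSend classes, the sndbuf limit, the send_notify hook and the fixed segment size are all assumptions made for the illustration, and wmem_alloc counts plain payload bytes, which is a simplification of the real accounting.

```python
from collections import deque


class Sock:
    """Hypothetical socket model with write-queue accounting only."""
    def __init__(self, sndbuf):
        self.sndbuf = sndbuf          # write queue limit in bytes
        self.wmem_alloc = 0           # bytes currently queued (cf. sk_wmem_alloc)
        self.in_flight = deque()      # queued "skbs" not yet freed
        self.send_notify = None       # hook called when queue space is released

    def queue_skb(self, data):
        """Nonblocking enqueue: refuse instead of sleeping when the queue is full."""
        if self.wmem_alloc + len(data) > self.sndbuf:
            return False
        self.wmem_alloc += len(data)
        self.in_flight.append(data)
        return True

    def wfree(self):
        """Plays sock_wfree(): the oldest skb is destroyed after transmission."""
        data = self.in_flight.popleft()
        self.wmem_alloc -= len(data)
        if self.send_notify:
            self.send_notify()        # the place where a send event would fire
        return data


class AsyncSend:
    """Pushes a payload segment by segment whenever the queue has room."""
    def __init__(self, sock, payload, seg=1024):
        self.sock, self.payload, self.seg, self.off = sock, payload, seg, 0
        sock.send_notify = self.push
        self.push()                   # first attempt, like the initial enqueue

    def push(self):
        while self.off < len(self.payload):
            chunk = self.payload[self.off:self.off + self.seg]
            if not self.sock.queue_skb(chunk):
                return                # queue full: wait for the next wfree()
            self.off += len(chunk)

    @property
    def done(self):
        return self.off == len(self.payload)
```

With a 2048-byte queue and 1024-byte segments, the first push() queues two segments and stops; every later wfree() releases one segment and immediately lets the sender queue the next one, without any context ever sleeping.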
:: Link / Comments (0)
Fri, 10 Feb 2006
Network asynchronous IO.
One can find design and implementation details, benchmarks
and some TODO on the just created project's
homepage.
:: Link / Comments (0)
Thu, 09 Feb 2006
Network asynchronous IO.
I've run several benchmarks of asynchronous receiving versus stock recv().
Hardware.
Receiving side: Xeon 2.4 GHz, HT disabled, 1Gb RAM, 1Gbps Intel 8254IPI (PCI-X 133MHz slot) e1000 adapter.
Sending side: AMD64 3500+ 2.2 GHz, 1Gb RAM, 1Gbps RealTek 8169 adapter integrated into nVidia nForce3 chipset (MSI K8N Neo2).
Connection: D-Link DGS-1216T gigabit switch.
Receiving software (naio_recv.c) can be found in
archive.
Sending software is a simple sendfile() based server.
Receiving side runs 2.6.15-rc7-event FC3 system. Default settings.
Sending side runs 2.6.15-1.1830_FC4 FC4 system. Default settings.
Results.
The client receives 1Gb of data on each of 8 runs (4 asynchronous and 4 synchronous).
Each of the 4 graphs shows the speed of both types and CPU usage during the test.
Performance reported by netperf-2.3 is about 400Mbit/sec; the graphs have a Mbytes/sec vertical axis
for the speed test and CPU usage in percent for the CPU test.

Kevent and network AIO have been announced on the netdev@ mailing list [introduction, kevent and network AIO].
:: Link / Comments (0)
Wed, 08 Feb 2006
Network asynchronous IO.
Ok, there are first results - I implemented asynchronous data receiving
directly into userspace pages from socket's receiving queue.
Things are not ready yet, it requires some cleanups, performance tuning
and a lot of testing.
:: Link / Comments (0)
Mon, 06 Feb 2006
Climbed a lot with Grange.
It was excellent climbing today - new routes, new challenges.
At the end I tried the most complex route I have ever climbed; obviously I failed
it on-sight, but it does not matter, such a challenge is like moving to the
major league, which is always exciting. It was really good training.
:: Link / Comments (0)
Sun, 05 Feb 2006
Kevent.
The new kevent
subsystem uses RCU now. httperf does not show a significant performance increase,
only about a ~1% increase in the number of requests and a ~8% decrease in the number of errors.
But... There is one tricky problem with RCU: if call_rcu() is used
for synchronization, then the callback will be invoked in softirq context, so
it is not allowed to sleep there, but, for example, in the case of socket/inode
notifications the inode's reference counter must be dropped, which may sleep
if the inode gets freed. And the synchronize_rcu() RCU synchronization mechanism,
which blocks until the grace period has elapsed, introduces extremely high latencies.
Moving the release path into a workqueue drops performance to the epoll()/kevent_poll() level,
although keventd does not eat CPU at all; sometimes it
increases up to 2200 requests per second, but the number of errors is too high.
I think such behaviour can be explained by the fact
that inodes associated with sockets are released later, and thus accepting new sockets can fail...
Ok, I'm going to remove RCU protection from the
kevent subsystem
and think more about kevent's callback invocation, so that it allows
softirq-only protected processing.
After some optimisations in kevent, I've obtained a new maximum request rate value:
httperf --timeout=1 --client=0/1 --server=pcix --port=80 --uri=/ --rate=3000 --send-buffer=4096 \
--recv-buffer=16384 --num-conns=30000 --num-calls=1
Request rate: 2623.7 req/s (0.4 ms/req)
Errors: total 1964 client-timo 555 socket-timo 0 connrefused 0 connreset 0
Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 1409
Unfortunately such high rates can not be obtained all the time due to port/socket rollover;
both epoll and kevent_poll show about 1600-1800 requests per second in this setup
with a much higher number of errors (up to 7 times more).
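The figures in the httperf output above can be cross-checked with a few lines of arithmetic. This is only a sanity check in Python using the numbers quoted in the output; treating every reported error as one failed connection out of the 30000 requested is an assumption, not something httperf states.

```python
request_rate = 2623.7        # req/s, from the "Request rate" line
num_conns = 30000            # from --num-conns
client_timo = 555            # from the first "Errors" line
other = 1409                 # from the second "Errors" line

total_errors = client_timo + other            # should match "total 1964"
ms_per_request = 1000.0 / request_rate        # httperf rounds this to 0.4 ms/req
error_share = total_errors / num_conns        # assumption: one error per connection

summary = (total_errors, round(ms_per_request, 1), round(error_share * 100, 1))
```

Under that assumption about 6.5% of the connections ended in an error during this run.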
:: Link / Comments (0)
Sat, 04 Feb 2006
Network asynchronous IO.
While cleaning up the receiving part, I've decided to switch
traversal of the storage's (inode/socket, for example) list of interest to RCU.
In the previous revision of the kevent subsystem each kevent's callback
was called under the lock with interrupts turned off, but network asynchronous IO
copies data in that callback, which in turn performs TCP state machine processing,
so the tcp_send_ack() call, which assumes that only softirqs are
turned off, fires WARN_ON(irqs_disabled()); in local_bh_enable().
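The context mismatch can be shown with a toy model. This Python sketch does not reproduce kernel internals: the Cpu class, its counters and the two call_with_*() helpers are hypothetical, and tcp_callback() merely stands in for any code path that disables and re-enables bottom halves on its way down.

```python
class Cpu:
    """Hypothetical model of one CPU's interrupt state."""
    def __init__(self):
        self.irqs_disabled = False
        self.bh_disabled = 0          # nesting depth of local_bh_disable()
        self.warnings = []

    def local_bh_disable(self):
        self.bh_disabled += 1

    def local_bh_enable(self):
        # Mirrors the check named in the text: BHs must not be re-enabled
        # while hardware interrupts are off.
        if self.irqs_disabled:
            self.warnings.append("WARN_ON(irqs_disabled())")
        self.bh_disabled -= 1


def tcp_callback(cpu):
    """Stands in for code such as the tcp_send_ack() path, which expects
    that only softirqs are off."""
    cpu.local_bh_disable()
    cpu.local_bh_enable()


def call_with_irqs_off(cpu, callback):
    """Previous revision: callback runs under a lock with interrupts off."""
    cpu.irqs_disabled = True
    try:
        callback(cpu)
    finally:
        cpu.irqs_disabled = False


def call_with_bh_off(cpu, callback):
    """Softirq-only protection: callback runs with just BHs disabled."""
    cpu.local_bh_disable()
    try:
        callback(cpu)
    finally:
        cpu.local_bh_enable()
```

Running tcp_callback through call_with_irqs_off() records one warning on the Cpu object, while call_with_bh_off() records none, which is the difference the switch in locking is meant to achieve.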
:: Link / Comments (0)
Fri, 03 Feb 2006
Network asynchronous IO.
Finally I have something to say about this.
Initial part of asynchronous receiving has been implemented.
Userspace can receive data into its pages directly from
softirq context, where the TCP state machine is handled. Tomorrow I plan
to clean this stuff up and create some kind of usable application
to play with.
:: Link / Comments (0)
Thu, 02 Feb 2006
Van Jacobson's network channels.
LWN article
and netdev@ discussion.
I do not like to advertise myself, but I will...
- packet mmap socket copies data from skb into mapped buffers.
In af_tlb project [1] it was found, that 4kb copying is 5-20 times slower than
VM remap tricks.
[1]. http://tservice.net.ru/~s0mbre/old/?section=projects&item=af_tlb
It is a zero-copy sniffer, which remaps skbs directly from socket queue
into userspace.
- Receive classification.
It is already there in the socket lookup code, it just must be protected
from hardware irqs. The receiving zero-copy project [2] was designed in that
way. It has a simple classifier in hard irq context, which selects
a zero-copy socket (so called struct zsock) and then calls a driver-provided
function to copy/dma/anything into the area specified in the zsock,
which can be either the VFS cache or a userspace pinned page. It is described
in detail in [2].
[2].
http://tservice.net.ru/~s0mbre/old/?section=projects&item=recv_zero_copy
:: Link / Comments (0)