Zbr's days.

Sat, 02 Sep 2006

Zero-copy sniffer.


I was wrong about the magical ab symbols found in the sniffer dump: it is ordinary data being sent. Since the sending side reserves MAX_TCP_HEADER bytes, in my setup the ethernet header starts at an offset of 190 bytes from the beginning of the allocated buffer when sending, and at an offset of 16 bytes when receiving.

Sending sequence number graph for tcpdump and zero-copy sniffers.

As you can see, there are no gaps in the graphs (although it is just an scp transfer over a 100Mb NIC (3c59x)). CPU usage for the zero-copy sniffer when data is written into /dev/null is about half of that for tcpdump.

I've released a new version of the network allocator and zero-copy sniffer and sent it to netdev@ with a question about the possibility of inclusion into mainline.

The zero-copy sniffer has the following overheads:

  • several atomic operations (in the worst case one atomic_set(), one atomic_inc() and one or two atomic_dec_and_test())
  • one lock (a bad global lock per sniffer device), which is held while information about a new packet is put into the sniffer's queue when the skb is freed
  • delayed freeing, which can lead to increased memory usage or, as currently implemented with a cap on the amount of data "locked" by the sniffer, to some packets being dropped by the sniffer.
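
The lock and the cap on "locked" data can be pictured with a small userspace sketch. It is not the code from the patch: struct zc_sniffer, zc_queue_on_free() and zc_commit() are hypothetical names, and a C11 atomic_flag spinlock stands in for the kernel lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-sniffer state; all names are illustrative. */
struct zc_sniffer {
    atomic_flag lock;        /* stands in for the per-device global lock */
    size_t locked_bytes;     /* data currently held for userspace */
    size_t max_locked;       /* cap on the amount of "locked" data */
    unsigned long queued;    /* chunks handed to the sniffer queue */
    unsigned long dropped;   /* chunks the sniffer had to skip */
};

static void zc_lock(struct zc_sniffer *s)
{
    while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
        ; /* spin until the flag is released */
}

static void zc_unlock(struct zc_sniffer *s)
{
    atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

void zc_init(struct zc_sniffer *s, size_t max_locked)
{
    atomic_flag_clear(&s->lock);
    s->locked_bytes = 0;
    s->max_locked = max_locked;
    s->queued = 0;
    s->dropped = 0;
}

/*
 * Called from the freeing path. Returns true when the chunk was queued
 * for the sniffer (its freeing is delayed), false when the cap is hit
 * and the sniffer drops the packet (the chunk can be freed right away).
 */
bool zc_queue_on_free(struct zc_sniffer *s, size_t size)
{
    bool queued = false;

    zc_lock(s);
    /* locked_bytes never exceeds max_locked, so the subtraction is safe */
    if (size <= s->max_locked - s->locked_bytes) {
        s->locked_bytes += size;
        s->queued++;
        queued = true;
    } else {
        s->dropped++;
    }
    zc_unlock(s);
    return queued;
}

/* Userspace has finished with a chunk and commits it back. */
void zc_commit(struct zc_sniffer *s, size_t size)
{
    zc_lock(s);
    assert(size <= s->locked_bytes);
    s->locked_bytes -= size;
    zc_unlock(s);
}
```

With a 4096-byte cap, two 1828-byte chunks are queued and a third is dropped until userspace commits one of them back.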

Limitations of the current version (introduced intentionally to test various special usage cases, not because of design problems):
  • NTA is used only for netdev_alloc_skb() and sk_stream_alloc_pskb(), i.e. only for allocations of traffic received by the NIC and sent through the send() syscall over a stream socket.
  • the zero-copy sniffer is always compiled in, which increases memory usage and adds the overhead described above.
  • skb_copy() always allocates data from the SLAB allocator, although it could check whether the original skb's data was allocated through NTA; I think skb_copy() is completely incompatible with high performance anyway.
  • it is possible to eliminate several atomic operations (I'm lazy).
  • debug code (poisoning of the tail of the buffer and an additional reference counter) is always compiled in.

/devel/networking/zcs :: Link / Comments (0)


Fri, 01 Sep 2006

Zero-copy sniffer.


I've fixed the mapping bug and forced the network stack to use the network allocator only for packets created either by a network device (receiving) or through the send() syscall over a stream socket, so the current version does not catch netlink messages, unix sockets and so on. Here is a typical zero-copy sniffer log:

dump  447.1024: ptr: 0xc19b0f80, start: 0xc19b0000, size: 1956, off: 200576: entry: 0, cpu: 0: 
	ab:ab:ab:ab:ab:ab -> ab:ab:ab:ab:ab:ab, type: abab, 
dump  448.1024: ptr: 0xc19fa880, start: 0xc19f8000, size: 1828, off: 501888: entry: 0, cpu: 0: 
	00:11:09:61:eb:0e -> 00:10:22:fd:c4:d6, type: 0800, 192.168.0.48:57758 -> 192.168.4.78:5632, proto: 6, 
dump  449.1024: ptr: 0xc1a01080, start: 0xc1a00000, size: 1828, off: 528512: entry: 0, cpu: 0: 
	00:11:09:61:eb:0e -> 00:10:22:fd:c4:d6, type: 0800, 192.168.0.48:57758 -> 192.168.4.78:5632, proto: 6, 
dump  450.1024: ptr: 0xc19f4800, start: 0xc19f4000, size: 1828, off: 477184: entry: 0, cpu: 0: 
	00:11:09:61:eb:0e -> 00:10:22:fd:c4:d6, type: 0800, 192.168.0.48:57758 -> 192.168.4.78:5632, proto: 6, 
dump  451.1024: ptr: 0xc1a01f80, start: 0xc1a00000, size: 1828, off: 532352: entry: 0, cpu: 0: 
	00:11:09:61:eb:0e -> 00:10:22:fd:c4:d6, type: 0800, 192.168.0.48:57758 -> 192.168.4.78:5632, proto: 6,
dump  318.1024: ptr: 0xc1b80780, start: 0xc1b80000, size: 1828, off: 1920: entry: 0, cpu: 1: 
	02:30:9b:0c:89:e8 -> ff:ff:ff:ff:ff:ff, type: 0800, 192.168.4.9:43281 -> 255.255.255.255:43281, proto: 17, 
dump  330.1024: ptr: 0xc1b86580, start: 0xc1b84000, size: 1828, off: 25984: entry: 0, cpu: 1: 
	02:00:63:1f:2d:81 -> 01:00:5e:00:01:14, type: 0800, 192.168.5.231:43281 -> 224.0.1.20:43281, proto: 17, 
dump  331.1024: ptr: 0xc1b86d00, start: 0xc1b84000, size: 1828, off: 27904: entry: 0, cpu: 1: 
	02:3a:d1:7e:6e:65 -> 01:00:5e:00:01:14, type: 0800, 192.168.5.232:43281 -> 224.0.1.20:43281, proto: 17,

Look at the strange line with ab symbols instead of the ethernet fields: this is an skb which was freed in tcp_clean_rtx_queue() when an ACK was received. The network allocator fills the allocated area with ab bytes for debugging purposes, and it looks like the TCP state machine preallocates some packets and then frees them without actually using them. The number of such empty allocations is actually not that small.
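
The debug fill can be sketched in a few lines of userspace C. This is an illustration only: poison_fill() and still_poisoned() are hypothetical helpers, and the only detail taken from the log is the 0xab fill value. Note that an all-ab header only says that nothing was written at the offset the sniffer looked at.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define POISON_BYTE 0xab   /* fill value visible in the log above */
#define ETH_HEADER_LEN 14  /* two 6-byte MAC addresses and a 2-byte type */

/* Fill a freshly allocated area with the debug pattern. */
void poison_fill(unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    memset(buf, POISON_BYTE, len);
}

/* True if the first len bytes still hold nothing but the pattern. */
bool still_poisoned(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        if (buf[i] != POISON_BYTE)
            return false;
    return true;
}
```

A dump analyzer could call still_poisoned(chunk, ETH_HEADER_LEN) and flag such entries instead of printing ab:ab:ab:ab:ab:ab as a MAC address.
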
I plan to run an interesting benchmark tomorrow: the test machine will generate traffic using different packet sizes and the sniffers will log TCP sequence numbers on that sending machine; then I will plot a graph of sent and missed packets for the zero-copy sniffer and tcpdump.



Sat, 26 Aug 2006

Zero-copy sending and receiving support.


tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
16:47:27.233768 IP10 truncated-ip - 256 bytes missing! 192.168.4.78 > 192.168.0.48: udp
	0x0000:  abab 0578 abab abab ab11 abab c0a8 044e
	0x0010:  c0a8 0030 abab abab abab abab abab abab
	0x0020:  abab abab abab abab abab abab abab abab
	0x0030:  abab abab abab abab abab abab abab abab
	0x0040:  abab abab abab abab abab abab abab abab
	0x0050:  abab

This is a zero-copy sent datagram captured on the receiving side. As you can see, it is perfectly correct (i.e. it contains exactly those IP and higher-layer contents which were filled in userspace on the sending side).

I've also cleaned up the zero-copy mapping support a lot, so there should no longer be situations where an allocation is not caught due to mmap troubles (like mappings crossing different CPUs and so on).
I also moved the notification about new packet arrival in the zero-copy sniffer into the freeing function: when it is placed in the allocation path, userspace can find the new buffer before it has even been filled by the kernel. When a buffer is being freed, it obviously already contains data (except in cases when the allocated object was not used at all).
In general the zero-copy sniffer can not catch data changes that happen somewhere inside the main processing code. For example, an IPsec packet can not be caught decrypted, since the packet is in its transient state only for a very short time after receiving and decryption. In such cases the transient state must be copied, for example using a new allocation (whose freeing will be caught by the sniffer), a memory copy and an immediate free.

There is still a small problem with freeing, due to the addition of struct skb_shared_info, but it is not really that complex, so I will postpone it for a while and try to implement a trivial dump analyzer.

Almost forgot: you can find the current patch and userspace utilities in the archive.



Tue, 22 Aug 2006

Zero-copy networking.


I've implemented initial zero-copy sending support based on the network allocator. Here is a tcpdump dump:

tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 96 bytes
20:55:13.709761 IP0 [|ip]
	0x0000:  0000 0000 0000 0000 0000 0000 0800       ..............

There is a problem though: I mmap data only for the CPU0-bound allocator, but an allocation can happen on a different CPU, in which case the data will be incorrect (that is what you see above: there should not be any zeroes).
This problem should be fixed by a proper protocol between the userspace sniffer and the network allocator (what is currently used can not be called a protocol at all). Since I introduced an ->ioctl() method anyway, I will use appropriate commands there.



Mon, 21 Aug 2006

Zero-copy sniffer.


I've completed an entirely zero-copy sniffer based on the network (formerly tree) allocator and sent the whole patchset to netdev@ for review. One can find it and the userspace utility in the archive.

Design notes.
The network allocator steals pages from the main system allocator and uses them for all network allocations (its benefits are beyond the scope of the zero-copy sniffer description; one can find the network allocator's features on the project's homepage). It is thus possible to mmap all stolen pages from userspace and to provide userspace with a special structure for each allocated chunk, which includes the offset from the beginning of the node (each node contains a contiguous page-aligned memory region), the node number and other info. Since the network allocator tracks the number of users for each memory region, the last one to finish processing the data (for example the userspace sniffer) must commit that area back to the allocator, so NTA relies on correct values being returned from userspace. If a chunk returned from userspace is not valid, it will not be freed; if userspace does not "free" chunks (by sending info about them back to the kernel), eventually the maximum allowed number of shared free regions is reached and no more data will be sent to (or shared with) userspace.
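
As a rough illustration of the descriptor idea, here is a sketch of how a chunk description could be resolved against mapped nodes, with the kind of bounds check needed before trusting values that cross the kernel/userspace boundary. The layouts of struct zc_chunk and struct zc_node_map and the zc_resolve() helper are assumptions made for this example; the real structures are in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-chunk descriptor passed between kernel and userspace. */
struct zc_chunk {
    uint32_t node;  /* node number */
    uint32_t off;   /* offset from the beginning of the node */
    uint32_t size;  /* chunk size in bytes */
};

/* Hypothetical record of where one node has been mmap()ed. */
struct zc_node_map {
    unsigned char *base;  /* start of the mapped region */
    size_t len;           /* length of the contiguous, page-aligned region */
};

/*
 * Translate a descriptor into a pointer inside the mapping.
 * Returns NULL when the node number is unknown or the chunk does not
 * fit inside the node, i.e. when the descriptor is not valid.
 */
const unsigned char *zc_resolve(const struct zc_node_map *maps, size_t nr_maps,
                                const struct zc_chunk *c)
{
    assert(maps != NULL && c != NULL);

    if (c->node >= nr_maps)
        return NULL;

    const struct zc_node_map *m = &maps[c->node];

    /* written to avoid overflow in off + size */
    if (c->off > m->len || c->size > m->len - c->off)
        return NULL;

    return m->base + c->off;
}
```

The same range check applies in the other direction, when a chunk "freed" by userspace has to be validated before it is given back to the allocator.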

Since by default the network tree allocator is used for all network allocations (including unix sockets and netlink), the sniffer will get all that data and must somehow differentiate between them. That task is out of scope for this mail though; a simple solution is just to attach the network allocator to the network device (i.e. call NTA allocation functions from netdev_alloc_skb() only).

I have never run any special performance tests, but a simple "top" command shows much smaller CPU usage for the zero-copy sniffer (although it gets all data from every skb in the machine) compared to tcpdump: 17% vs. 33% maximum on my test machine.
Both sniffers dump received data into /dev/null.
The server side (where the sniffers run) runs a trivial epoll() based web server, the client side runs httperf.
Machines are connected over 100mbit LAN (e100 server NIC, 8169 client NIC).

For zero-copy userspace netchannels I plan to send to userspace only information about allocations which really belong to the created netchannel, instead of info for each chunk.

Zero-copy sending support is in TODO.



Sat, 19 Aug 2006

Zero-copy sniffer. First results.


add@/class/mem/kmsg.ACTION=add.DEVPATH=/class/mem/kmsg.SUBSYSTEM=mem.SEQNUM=105.MAJOR=1.MINOR=11
................................................................................................
................................................................................................
....

add@/devices/system/timer/timer0.ACTION=add.DEVPATH=/devices/system/timer/timer0.SUBSYSTEM=timer.SEQNUM=106
...........................................................................................................
..............................................................................

oc->avl_node_list);...alloc->avl_container_array = kzalloc(sizeof(struct list_head) * AVL_CONTAINER_ARRAY_S
IZE, GFP_KERNEL);..if (!alloc->avl_container_array)...goto err_out_exit;...for (i=0; i<AVL_CONTAINER_ARRAY_
SIZE; ++i)...INIT_LIST_HEAD(&alloc->avl_container_array[i]);...entry = avl_node_entry_alloc(GFP_KERNEL, AVL
_ORDER);..if (!entry)...goto err_out_free_container;...avl_node_entry_commit(entry, cpu);...return 0;..err_
out_free_container:..kfree(alloc->avl_container_array);.err_out_exit:..return -ENOMEM;.}../*. * Initialize 
network allocator.. */.int avl_init(void).{..int err, cpu;...for_each_possible_cpu(cpu) {...err = avl_init_
cpu(cpu);...if (err)....goto err_out;..}...err = avl_init_zc();...printk(KERN_INFO "Network tree allocator 
has been initialized.\n");..return 0;..err_out:..panic("Failed to initialize network allocator.\n");...retu
rn -ENOMEM;.}..............................................................................................
.........................................................................................................

.k........ ................................................................................................
......................................................................................a.....*.c..E....,@.@.
. ...N...0.P.3Epe_;........_...........1......width: 47%;.....padding-right: 3%;.....float: left;.....paddi
ng-bottom: 2em;....}.....content-column-left hr {.....display: none;....}.....content-column-right {...../*
Values for IE/Win; will be overwritten for other browsers */.....width: 47%;.....padding-left: 3%;.....floa
t: left;.....padding-bottom: 2em;....}.....content-columns>.content-column-left, .content-columns>.content-
column-right {...../* Non-IE/Win */....}....img {.....border: 2px solid #fff;.....padding: 2px;.....margin:
2px;....}....a:hover img {.....border: 2px solid #f50;....}..../*]]>*/...</style>..</head>...<body>...<h1>F
edora Core <strong>Test Page</strong></h1>....<div class="content">....<div class="content-middle">.....<p>
This page is used to test the proper operation of the Apache HTTP server after it has been installed. If yo
u can read this page, it means that the Apache HTTP server installed at this site is working properly.</p>.
...</div>....<hr />.....<div class="content-columns">.....<div class="content-column-left">......<h2>If you
are a member of the general public:</h2>.......<p>The fact that you are seeing this page indicates that the
website you just visited is either experiencing problems, or is undergoing routine maintenance.</p>.......<
p>If you would like to let the administrators of this website know that you've seen this page instead of th
e page you expected, you should send them e-mail. In general, mail sent to the name "webmaster"............
...........................................................................................................
...........................................................................................................
..........................

The junk above was obtained from the zero-copy sniffer running with an epoll based web server on my test machine (I manually replaced all "<" symbols with "&lt;" in the dump to not break HTML formatting).
The first two dumps are kobject_uevent messages during startup ('.' means an unprintable symbol, i.e. some binary data); then you can see part of my network tree allocator code being transferred over ssh (decrypted text being sent over a unix socket), and at the end there are some pieces of the default web page (copied from the Fedora Core apache default index.html) and some unknown symbols all over the place.
Binary data at the end of each chunk is added for alignment, binary data at the beginning is a header, and that in the middle corresponds to tabs, line breaks and so on.

It works, although there are issues yet to resolve. For example, the mapping code only maps the initial cache, so userspace can not yet see when it has grown; the sniffer also does not know how many pages are inside each new cache chunk. I will resolve these issues soon and send the code to netdev@ for review.



Zero-copy sniffer.


An implementation of the design with an additional bitmask wastes too much space per node, so I decided to create a much simpler solution: attach a tag to each allocated chunk, containing a canary and a reference counter. The former is just 4 bytes of special data, used in the freeing function to check that the object being freed is valid and that there was no memory corruption. The reference counter is used to mark mapped objects as used, so freeing will not destroy them. The only thing left to implement is the ->nopage() method for the zero-copy sniffer's underlying char device, so that when the network allocator cache grows, the user automatically gets the new pages into the mapping.
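
A minimal userspace sketch of such a tag, under several assumptions: the tag is placed in front of the data (where the real allocator puts it is not stated here), malloc() stands in for the allocator, the canary value is arbitrary, and all names (struct chunk_tag, chunk_alloc(), chunk_get(), chunk_put()) are made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_CANARY 0xc0d0e0f0u  /* arbitrary illustrative value */

/* Hypothetical tag attached to every allocated chunk. */
struct chunk_tag {
    uint32_t canary;    /* 4 bytes checked when the chunk is freed */
    atomic_int refcnt;  /* one reference per user of the chunk */
};

struct chunk_tag *chunk_to_tag(void *data)
{
    assert(data != NULL);
    return (struct chunk_tag *)data - 1;
}

/* Allocate size bytes of data preceded by a tag; one reference is held. */
void *chunk_alloc(size_t size)
{
    struct chunk_tag *t = malloc(sizeof(*t) + size);

    if (!t)
        return NULL;
    t->canary = CHUNK_CANARY;
    atomic_init(&t->refcnt, 1);
    return t + 1;
}

/* Take an extra reference, e.g. when the chunk is shared with a sniffer. */
void chunk_get(void *data)
{
    atomic_fetch_add(&chunk_to_tag(data)->refcnt, 1);
}

/*
 * Drop one reference.
 * Returns  1 if this was the last reference and the memory was released,
 *          0 if somebody else still uses the chunk,
 *         -1 if the canary does not match (invalid pointer or corruption);
 *            nothing is freed in that case.
 */
int chunk_put(void *data)
{
    struct chunk_tag *t = chunk_to_tag(data);

    if (t->canary != CHUNK_CANARY)
        return -1;
    if (atomic_fetch_sub(&t->refcnt, 1) != 1)
        return 0;
    t->canary = 0;  /* a later double free fails the canary check */
    free(t);
    return 1;
}
```

A chunk that is mapped by the sniffer gets a second reference through chunk_get(), so the first chunk_put() from the network stack returns 0 and the data survives until the sniffer drops its reference too.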
