Zbr's days.


Tue, 07 Oct 2008

Valgrind support for netchannels.

Alexandre Lissy (alexandre.lissy_smartjog.com) made a patch for the latest to date Valgrind version (3.2.1).
Now one can analyze performance bottlenecks with netchannels applications using standard techniques.

/devel/networking :: Link / Comments ()


Fri, 26 Sep 2008

New failed ipw2100 interrupt and its races.

During my testing I managed to beat the following interrupts out of the chip:

[41773.200686] ipw2100: Fatal interrupt. Scheduling firmware restart.
[41773.200707] eth1: Fatal error value: 0x500185B8, address: 0x08004501, inta: 0x40000000
[41773.200810] ipw2100 0000:02:04.0: PCI INT A disabled
[41773.203110] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.224446] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.245781] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.249360] ipw2100 0000:02:04.0: enabling device (0000 -> 0002)
[41773.249384] ipw2100 0000:02:04.0: PCI INT A -> Link[C0C8] -> GSI 11 (level, low) -> IRQ 11
[41773.249426] ipw2100 0000:02:04.0: restoring config space at offset 0x1
	(was 0x2900002, writing 0x2900006)
This happens while the reset handler disables the ipw2100 PCI device, so when the interrupt handler sees that, it bails out. That should generally be ok, but I found a different thing: there is a race between the interrupt handler (the handler itself and the related processing tasklet) and the reset code. The latter disables interrupts before it starts to turn the adapter on, but the interrupt handler can be running at that very moment on a given CPU and can schedule the tasklet, so disabling interrupts does not prevent parallel reading and writing of the various registers.
The IRQ processing tasklet reads and writes registers under the lock with interrupts turned off, but the reset tasklet does not protect the initialization path against it, so I wonder what may happen in this case. Since registers are read and written at absolute addresses (I mean there is no need to write an address register first), this may not be a problem, but the race still exists and theoretically can harm the system. Similar unguarded accesses exist in the ipw2100_wx_event_work() handler, and the status field is also set without protection in various places in the driver, which can harm the driver's behaviour too.

So maybe I blamed the firmware a little bit too early, although the things I found may be harmless. I will try to figure this out tomorrow.

/devel/networking/ipw2100 :: Link / Comments ()


Thu, 25 Sep 2008

ipw2100 fatal interrupt: playing with power states.

I was not able to make the card stop sending or receiving packets with ping tests, although I definitely was able to generate lots of fatal interrupts with completely different values and addresses.
Frequently the card generates fatal interrupts with different values at the same address, like below:

eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
They did not follow one after another though.
Different error values at the same address likely mean that there is no correlation between values and addresses, so this information is useless.
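
To check this kind of claim against a longer log, the fatal error lines can be grouped by address. This is only a minimal sketch: the regular expression follows the line format shown above, while the function name group_by_address and the two-line sample are made up for illustration.

```python
import re
from collections import defaultdict

# Matches the driver's "Fatal error value" line as it appears in the log above.
FATAL_RE = re.compile(
    r"Fatal error value: (0x[0-9A-Fa-f]+), "
    r"address: (0x[0-9A-Fa-f]+), inta: (0x[0-9A-Fa-f]+)"
)


def group_by_address(lines):
    """Map every reported address to the set of error values seen at it."""
    seen = defaultdict(set)
    for line in lines:
        match = FATAL_RE.search(line)
        if match:
            value, address, _inta = match.groups()
            seen[address].add(value)
    return dict(seen)


# Two lines copied from the log above (hypothetical usage).
sample = [
    "eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000",
    "eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta: 0x40000000",
]
groups = group_by_address(sample)
for address, values in groups.items():
    print(address, sorted(values))
```

An address that maps to more than one value is exactly the case described above.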

I added power state changes to the reset function, so now it does something like this:
[  897.661002] ipw2100: Fatal interrupt. Scheduling firmware restart.
[  897.661021] eth1: Fatal error value: 0x30016C44, address: 0x601F7C00, inta: 0x40000000
[  897.664712] ipw2100 0000:02:04.0: PCI INT A disabled
[  897.712041] ipw2100 0000:02:04.0: enabling device (0000 -> 0002)
[  897.713549] ipw2100 0000:02:04.0: PCI INT A -> Link[C0C8] -> GSI 11 (level, low) -> IRQ 11
[  897.713595] ipw2100 0000:02:04.0: restoring config space at offset 0x1
			(was 0x2900002, writing 0x2900006)
[  954.646319] ipw2100: Fatal interrupt. Scheduling firmware restart.
[  954.646338] eth1: Fatal error value: 0x5000CF10, address: 0x61A00000, inta: 0x40000000
[  954.646429] ipw2100 0000:02:04.0: PCI INT A disabled
[  954.692041] ipw2100 0000:02:04.0: enabling device (0000 -> 0002)
[  954.692063] ipw2100 0000:02:04.0: PCI INT A -> Link[C0C8] -> GSI 11 (level, low) -> IRQ 11
[  954.692103] ipw2100 0000:02:04.0: restoring config space at offset 0x1
			(was 0x2900002, writing 0x2900006)
[  968.585409] ipw2100: Fatal interrupt. Scheduling firmware restart.
[  968.585429] eth1: Fatal error value: 0x5000C9D0, address: 0x57E00500, inta: 0x40000000
[  968.585517] ipw2100 0000:02:04.0: PCI INT A disabled
[  968.632037] ipw2100 0000:02:04.0: enabling device (0000 -> 0002)
[  968.632059] ipw2100 0000:02:04.0: PCI INT A -> Link[C0C8] -> GSI 11 (level, low) -> IRQ 11
[  968.632099] ipw2100 0000:02:04.0: restoring config space at offset 0x1
			(was 0x2900002, writing 0x2900006)
[  972.269514] ipw2100 0000:02:04.0: PCI INT A disabled
[  972.316041] ipw2100 0000:02:04.0: enabling device (0000 -> 0002)
[  972.316400] ipw2100 0000:02:04.0: PCI INT A -> Link[C0C8] -> GSI 11 (level, low) -> IRQ 11
[  972.316446] ipw2100 0000:02:04.0: restoring config space at offset 0x1
			(was 0x2900002, writing 0x2900006)
As we can see, fatal interrupts did not disappear, and are actually as frequent as before.

I also got these lines:
[ 2032.560413] ipw2100: exit - failed to send CARD_DISABLE command
[ 2032.560449] ipw2100: exit - failed to send CARD_DISABLE command
[ 2032.560491] ipw2100: exit - failed to send CARD_DISABLE command
[ 2032.560593] ipw2100: exit - failed to send CARD_DISABLE command
They came one after another, which does not give me any clue though.

I've started several big torrent downloads/seeds as a big load; maybe the card somehow differentiates between flows, so this test should be heavier than lots of pings. I first noticed the fatal interrupt problem with this kind of load, when the card not only stopped working, but also printed some goodbye message.

So far the conclusion is not very optimistic: fatal interrupts always happen, no matter what magic is enabled in the reset, which already tells me that the firmware is broken.
Hopefully additional reset games with power management will allow the card to work, even with those interrupts. Time will tell.

/devel/networking/ipw2100 :: Link / Comments ()


Wed, 24 Sep 2008

First ipw2100 testing: fatal interrupt.

I managed to compile a small enough kernel which boots on my laptop (I do not know how long it took, since I fell asleep), and managed to trigger the fatal interrupt error after just several seconds of ping -f 192.168.1.1 -s 8192 on a freshly booted machine. 192.168.1.1 is my gateway address.
Here is the result with the patch I posted to the mailing lists, which was not acked, replied to or commented on though (well, I have to admit that if I had sent it a couple of mails earlier, it could probably have found its way into the tree, but I still believe that it would not have resulted in anything, since everyone knows about this bug, it just is not fixed for some reason). Intel developers (at least those who maintain the driver) continue to keep silent.

[  613.960164] ipw2100: exit - failed to send CARD_DISABLE command
[  624.456033] eth1: no IPv6 routers present
[  690.721534] ipw2100: Fatal interrupt. Scheduling firmware restart.
[  690.721554] eth1: Fatal error value: 0x5000C97C, address: 0x100E201C, inta: 0x40000000
[  690.721580] ------------[ cut here ]------------
[  690.721587] WARNING: at drivers/net/wireless/ipw2100.c:3188
	ipw2100_irq_tasklet+0x8fe/0x9b0 [ipw2100]()
[  690.721736] Pid: 0, comm: swapper Not tainted 2.6.27-rc7-mainline #2
[  690.721744]  [] warn_on_slowpath+0x5f/0x90
[  690.721763]  [] up+0x11/0x40
[  690.721773]  [] release_console_sem+0x190/0x1d0
[  690.721786]  [] enqueue_hrtimer+0x72/0xf0
[  690.721795]  [] printk+0x1b/0x20
[  690.721805]  [] ipw2100_irq_tasklet+0x8fe/0x9b0 [ipw2100]
[  690.721831]  [] hrtick_start_fair+0x157/0x170
[  690.721844]  [] enqueue_hrtimer+0x72/0xf0
[  690.721855]  [] snd_intel8x0_interrupt+0x1d7/0x250 [snd_intel8x0]
[  690.721875]  [] tasklet_action+0x46/0xb0
[  690.721886]  [] __do_softirq+0x75/0xf0
[  690.721897]  [] do_softirq+0x37/0x40
[  690.721906]  [] do_IRQ+0x40/0x70
[  690.721917]  [] getnstimeofday+0x37/0xe0
[  690.721927]  [] common_interrupt+0x23/0x28
[  690.721937]  [] sys_setpgid+0xd8/0x190
[  690.721955]  [] acpi_idle_enter_simple+0x15a/0x1c1 [processor]
[  690.721980]  [] cpuidle_idle_call+0x7b/0xc0
[  690.721991]  [] cpu_idle+0x46/0xe0
[  690.722000]  =======================
[  690.722006] ---[ end trace 70268f59a00d957c ]---
[  695.271318] ipw2100: Fatal interrupt. Scheduling firmware restart.
[  695.271337] eth1: Fatal error value: 0x50014148, address: 0x60207E04, inta: 0x40000000

writing this note and starting over

[ 1520.709136] ipw2100: Fatal interrupt. Scheduling firmware restart.
[ 1520.709156] eth1: Fatal error value: 0x5000C96C, address: 0x538E7E40, inta: 0x40000000
[ 1550.954315] ipw2100: Fatal interrupt. Scheduling firmware restart.
[ 1550.954334] eth1: Fatal error value: 0x5000C99C, address: 0x08418004, inta: 0x40000000
[ 1592.175473] ipw2100: Fatal interrupt. Scheduling firmware restart.
[ 1592.175492] eth1: Fatal error value: 0x50018588, address: 0x57E77A00, inta: 0x40000000
So, these fatal error values and addresses do not tell me anything, but since the values are always different and come at different addresses, I think the firmware just loses its mind and stops responding.
The first line, where ipw2100 fails to send a command, was obtained during ifdown of the interface. I never saw it before, but I do not think it is related.

So, I need to move to the office and want to make some distributed storage changes, namely fix a name collision (the kernel already has a DVB card driver whose module is called dst.ko), and implement a better minor number allocation scheme for the imported devices. Right now, after a node has been created and destroyed, a new one will not get the same number but a continuously increasing one, which looks confusing and may bring a sysfs initialization error (when the system tries to register a kobject with an existing name).
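
One way such an allocation scheme could look is to always hand out the lowest free number and to put released numbers back into the pool. The sketch below is a plain userspace model of that idea, not the DST code; the class name MinorAllocator and its methods are hypothetical.

```python
import heapq


class MinorAllocator:
    """Hands out the lowest free minor number and reuses released ones.

    Hypothetical model: the real allocation would live in the kernel and
    need its own locking.
    """

    def __init__(self):
        self._next = 0      # smallest number never handed out so far
        self._free = []     # min-heap of released numbers
        self._used = set()  # numbers currently in use

    def alloc(self):
        if self._free:
            minor = heapq.heappop(self._free)
        else:
            minor = self._next
            self._next += 1
        self._used.add(minor)
        return minor

    def release(self, minor):
        if minor not in self._used:
            raise ValueError("minor %d is not allocated" % minor)
        self._used.remove(minor)
        heapq.heappush(self._free, minor)


minors = MinorAllocator()
first = minors.alloc()
second = minors.alloc()
minors.release(first)      # the node with the first number is destroyed
reused = minors.alloc()    # a new node gets the released number back
third = minors.alloc()
print(first, second, reused, third)
```

With this scheme a recreated node gets the number of the destroyed one instead of an ever increasing value.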

I will continue the ipw2100 experiments tonight, if I do not fall asleep again because of jetlag. Stay tuned!

/devel/networking/ipw2100 :: Link / Comments ()


Tue, 09 Sep 2008

Userspace network stack git tree is now open.

One can check it via web interface.

/devel/networking/unetstack :: Link / Comments ()


Sun, 07 Sep 2008

New netchannels release.

A network channel is a peer-to-peer, protocol-agnostic communication channel between hardware and userspace. It uses a unified cache to store its channels. All protocol processing happens in process context.

This release brings reworked (and very simple) unified storage for all kinds of protocols (a netchannel can be created for any kind of protocol), completely lockless data processing (data queueing into the netchannel and its lookup in the global storage are protected by RCU), and a simplified interface.

Feature list:

  • Very high bulk performance with small packets (check userspace network stack for more details).
  • Completely lockless netchannel processing (packet queueing and netchannel lookup in the global storage are protected by RCU).
  • Unified storage for all kinds of protocols: TCP/UDP, IP/IPv6, whatever you decide to implement on top of hardware layer you use.
  • No protocol processing. This is pushed to the peer itself. For example to the userspace network stack.
  • Ability to inject packets into the network without root privileges.
Userspace network stack is the main user of the new netchannel subsystem.

Todo list includes:
  • Ability to improve receiving latencies (queue packets from the hardware interrupt handler and not the software interrupt).
  • Automatically scale netchannel hash table on demand.

/devel/networking :: Link / Comments ()


New userspace network stack release.

Unetstack is an extremely small and fast TCP/UDP/IP stack implementation on top of packet socket or netchannels interface.

This release includes a sync with the new netchannels interface and drops routing table support: the userspace network stack is designed for netchannels, so effectively a single opened object works with a single source and destination peer. There is no need to introduce unneeded caches, since all needed information can be stored in the userspace network stack object itself.

/devel/networking/unetstack :: Link / Comments ()


Sat, 06 Sep 2008

Latencies in netchannels and sockets in the receiving path.

When a NIC's interrupt fires in Linux, the driver's handler does not process the packet: it either schedules the NAPI handler, which will push the packet higher up the stack, or submits the packet to the software interrupt handler, which will do the same. This is the first queue: interrupt -> software interrupt (or NAPI, which happens in the same context).

When the NAPI polling handler (or networking software interrupt) fires, it searches for the appropriate receiving socket, adds the data packet to its queue and wakes up the receiving process. This is the second queue.

Netchannels currently work the same way, since their receive processing happens in netif_receive_skb(), which may already be too late for some low-latency applications.

As was noticed by Salvatore Del Popolo, it is possible to queue a packet into a netchannel in netif_rx(), but that would limit netchannels to non-NAPI drivers only. Instead I am thinking about creating a special helper which will be invoked from the interrupt handler and, if there is no appropriate netchannel to queue data into, will schedule NAPI or the network softirq. So far this is only in the todo list though.
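
The control flow of such a helper can be modelled in a few lines. This is only an illustration of the decision described above, written as a userspace model: the Netchannel class, the receive_in_irq function, the (protocol, port) lookup key and the backlog queue are all hypothetical, and nothing here is kernel code.

```python
from collections import deque


class Netchannel:
    """Hypothetical stand-in for a kernel netchannel: just a packet queue."""

    def __init__(self):
        self.queue = deque()


def receive_in_irq(packet, channels, softirq_backlog):
    """Queue the packet into a matching netchannel, else defer it.

    Returns the name of the path the packet took.
    """
    key = (packet["proto"], packet["dport"])  # assumed lookup key
    channel = channels.get(key)
    if channel is not None:
        # Fast path: the packet skips the softirq queue entirely.
        channel.queue.append(packet)
        return "netchannel"
    # Slow path: what the driver does today (NAPI or network softirq).
    softirq_backlog.append(packet)
    return "softirq"


channels = {("udp", 53): Netchannel()}
backlog = deque()
first = receive_in_irq({"proto": "udp", "dport": 53, "data": b"a"}, channels, backlog)
second = receive_in_irq({"proto": "tcp", "dport": 80, "data": b"b"}, channels, backlog)
print(first, second)
```

Only packets without a matching netchannel pay for the second queue in this model.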

What was really done is a complete rework of the initialization process, netchannel creation and allocation, and netchannel processing. Essentially I rewrote most of the netchannels subsystem for good. It became lockless (RCU protected; there is a hash bucket lock, which is only used when a netchannel is added to or removed from the bucket, searching is lockless), but the allocation process is slower, since a netchannel now contains an array of skb pointers, which is allocated at creation time. The size of the array is limited to the maximum number of packets the netchannel can hold, a kind of queue size.
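
The array of packet pointers allocated at creation time behaves like a bounded ring buffer. Below is a minimal single-threaded sketch of that data structure; the class name PacketRing is hypothetical, and the sketch says nothing about RCU or about how the kernel code orders its reads and writes.

```python
class PacketRing:
    """Fixed-size queue backed by an array allocated once at creation."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity  # allocated once, never resized
        self._head = 0   # next slot to read
        self._tail = 0   # next slot to write
        self._count = 0  # packets currently queued

    def push(self, packet):
        """Queue a packet; returns False when the ring is full."""
        if self._count == len(self._slots):
            return False
        self._slots[self._tail] = packet
        self._tail = (self._tail + 1) % len(self._slots)
        self._count += 1
        return True

    def pop(self):
        """Return the oldest packet, or None when the ring is empty."""
        if self._count == 0:
            return None
        packet = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return packet


ring = PacketRing(2)
results = [ring.push("a"), ring.push("b"), ring.push("c")]  # third one does not fit
oldest = ring.pop()
print(results, oldest)
```

The price is paid at creation time, which fits the remark above that allocation became slower while queueing needs no further memory allocation.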

/devel/networking :: Link / Comments ()


Fri, 05 Sep 2008

Netchannels come to the start line.

Or finish one. Depending on the point to look from.

zbr@gavana$ make SUBDIRS=net/core/netchannel/

  WARNING: Symbol version dump /home/zbr/aWork/git/linux-2.6/linux-2.6.netchannels/Module.symvers
             is missing; modules will have no dependencies and modversions.

  CC      net/core/netchannel/netchannel.o
  CC      net/core/netchannel/storage.o
  CC      net/core/netchannel/user.o
  LD      net/core/netchannel/built-in.o
  Building modules, stage 2.
  MODPOST 0 modules
zbr@gavana$ wc -l net/core/netchannel/*.c include/linux/netchannel.h
  430 net/core/netchannel/netchannel.c
  140 net/core/netchannel/storage.c
  244 net/core/netchannel/user.c
  92 include/linux/netchannel.h
  906 total
I want to make a new netchannels release this weekend. It will not contain a dynamically resizable hash table though, but if there are no major bugs in the core, I will consider completing it for the new release.

I also plan to convert the userspace network stack into a libtcp.so or libunetstack.so library, so it will be much easier to create applications with this stack, no matter whether it is implemented on top of netchannels or packet socket, but so far this is only a plan.

/devel/networking :: Link / Comments ()


Mon, 01 Sep 2008

Netchannels strike back.

A while ago I implemented Van Jacobson's idea of netchannels: a peer-to-peer connection module, which pushes all protocol processing as close to the end peers as possible.
In my first implementation, TCP processing was done on behalf of the running process (instead of mostly bottom-half context), which resulted in slightly better performance. Then I implemented the userspace network stack as a continuation of this idea. Despite its huge performance improvement, I do not think the particular reason is the netchannels architecture, but rather the number of syscalls that have to be made to process a bulk traffic flow of small packets. Nevertheless it can also be considered a netchannels architecture improvement, which resulted in such exceptionally good batching abilities.

Now I want to move further: the kernel side of netchannels will be made completely lockless and simultaneously very cache-friendly. As in the first implementation, the idea is not completely mine: the approach I will test is based on Van Jacobson's array design for storing network buffers.

During their lifetime, netchannels got NAT support (actually just to show those people who do not believe in the netchannels architecture that it is possible to implement filtering and packet mangling), but now I drop it from the project. Netchannels also got a tricky multidimensional trie-based storage, which, after being ported to the socket core, resulted in a noticeable performance win, although I did not complete it to support statistics. Actually the netchannels implementation of this trie is broken, and it required quite a few steps in the socket code to be fixed.
Now I drop it from the netchannels patchset too and move to the usual hash tables.
I will make RCU locking for them and make the netchannels hash table optionally automatically resizable. This feature does not exist in socket hash tables, but right now I want to experiment with a smaller code base, since the algorithm I have in mind is a bit tricky.

So, there are lots of interesting ideas, which I've started to work on and plan to finish sooner rather than later. But since I will go to the USA consular department for the interview, and then want to finish apartment development tasks, and then, hopefully, move on to the Kernel Summit and Plumbers conference, it can take quite long... Please note that I have not forgotten about other projects.
Code is not dead if not marked appropriately in the TODO list :)
Stay tuned nevertheless!

/devel/networking :: Link / Comments ()


Sun, 10 Aug 2008

A bored Russian.

Yes, that is what I was called by The Inquirer. The magazine even put it in bold capital letters :) The rest of the article is quite wrong though (i.e. it is not what was written in my blog).

Slashdot also got an entry; I was called a hacker and then a physicist there.

What next? It is really very fun! :)

/devel/networking/dns :: Link / Comments ()


Sat, 09 Aug 2008

Russian physicist.

That is what I was called in the New York Times with all this hype about the DNS poisoning attack.

Unfortunately I no longer remember what the electron charge is or how to describe the Higgs boson even to myself. Things moved away almost 10 years ago :)

The article says that DJBDNS does not suffer from this attack. It does. Everyone does. With some tweaks it can take longer than with BIND, but overall the problem is there.

But that's enough for this story. I'm moving on to other interesting developments.

/devel/networking/dns :: Link / Comments ()


Fri, 08 Aug 2008

Successfully poisoned the latest BIND with fully randomized ports!

The exploit had to send more than 130 thousand requests for fake records like 131737-4795-15081.blah.com to be able to match the port and ID and insert a poisoned entry for poisoned_dns.blah.com.

# dig @localhost www.blah.com +norecurse

; <<>> DiG 9.5.0-P2 <<>> @localhost www.blah.com +norecurse
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6950
;; flags: qr ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; QUESTION SECTION:
;www.blah.com.                  IN      A

;; AUTHORITY SECTION:
www.blah.com.           73557   IN      NS      poisoned_dns.blah.com.

;; ADDITIONAL SECTION:
poisoned_dns.blah.com.  73557   IN      A       1.2.3.4

# named -v
BIND 9.5.0-P2
BIND used the fully randomized source port range, i.e. around 64000 ports. Two attacking servers, connected to the attacked one via a GigE link, were used; each one attacked 1-2 ports with the full ID range. Usually an attacking server is able to send about 40-50 thousand fake replies before the remote server returns the correct one, so if the port was matched, the probability of successful poisoning is more than 60%.
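
The "more than 60%" figure follows from simple counting: the DNS ID is 16 bits wide, so 40-50 thousand distinct IDs cover that fraction of the 65536 possible values. The sketch below redoes this arithmetic under explicit assumptions that are not in the measurements above: every fake reply carries a distinct ID, the resolver picks IDs uniformly, each attempt is independent, and each attempt hits exactly one guessed port (PORTS_PER_ATTEMPT is a made-up parameter).

```python
ID_SPACE = 2 ** 16          # DNS transaction IDs are 16 bits wide
PORT_RANGE = 64_000         # "around 64000 ports", as stated above
PORTS_PER_ATTEMPT = 1       # assumption: one guessed port per attempt


def id_hit_probability(fake_replies, id_space=ID_SPACE):
    """Chance that one of the distinct fake IDs equals the real one."""
    return min(fake_replies, id_space) / id_space


def expected_attempts(p_per_attempt):
    """Mean of a geometric distribution with success probability p."""
    return 1.0 / p_per_attempt


low = id_hit_probability(40_000)
high = id_hit_probability(50_000)

# Probability that a single attempt both guesses the port and matches the ID.
per_attempt = PORTS_PER_ATTEMPT / PORT_RANGE * low
attempts = expected_attempts(per_attempt)

print("ID match if the port is right: %.2f .. %.2f" % (low, high))
print("expected attempts at the low end: %.0f" % attempts)
```

Under these assumptions the expected number of attempts comes out around a hundred thousand; covering more ports per attempt lowers it proportionally.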

The attack took about half a day, i.e. a bit less than 10 hours.
So, if you have a GigE LAN, any trojaned machine can poison your DNS in one night...

/devel/networking/dns :: Link / Comments ()


Wed, 06 Aug 2008

Additional note on DNS poisoning attack IN A entry injection.

Actually I did inject an 'IN A' entry for poisoned_dns.blah.com into the cache.

So, to inject an arbitrary 'A' entry for attacked.domain.com into the cache, one has to bruteforce the ID (and match the source port if needed) for any other subdomain of the same level, i.e. subdomain-123.domain.com, and put into the additional section of that message an 'IN NS' record, which points to attacked.domain.com, and an 'IN A' record with a fake IP address for that 'IN NS' one, i.e. an 'IN A' record for attacked.domain.com pointing to 1.2.3.4.

This method is a bit less flexible than just poisoning any subdomain with an NS record which points to the controlled DNS server, but it does not require that server to exist, so it can route traffic directly to your site without first asking your DNS server where the given subdomain lives.

# ping poisoned_dns.blah.com -c100 > /dev/null 2>&1 &
# tcpdump -nn icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
11:27:20.422124 IP devfs1 > 1.2.3.4: ICMP echo request, id 55367, seq 5, length 64
11:27:20.422333 IP gw > devfs1: ICMP host 1.2.3.4 unreachable, length 36
11:27:21.422126 IP devfs1 > 1.2.3.4: ICMP echo request, id 55367, seq 6, length 64
11:27:21.422310 IP gw > devfs1: ICMP host 1.2.3.4 unreachable, length 36
11:27:22.422123 IP devfs1 > 1.2.3.4: ICMP echo request, id 55367, seq 7, length 64
11:27:22.422286 IP gw > devfs1: ICMP host 1.2.3.4 unreachable, length 36
11:27:23.423122 IP devfs1 > 1.2.3.4: ICMP echo request, id 55367, seq 8, length 64
11:27:23.423311 IP gw > devfs1: ICMP host 1.2.3.4 unreachable, length 36

/devel/networking/dns :: Link / Comments ()


More interesting (and complete) hack of the DNS.

I managed to inject the following poisoning information:

# dig @localhost +norecurse www.blah.com any

;; ANSWER SECTION:
www.blah.com.		123452	IN	NS	poisoned_dns.blah.com.

;; AUTHORITY SECTION:
www.blah.com.		123452	IN	NS	poisoned_dns.blah.com.

;; ADDITIONAL SECTION:
poisoned_dns.blah.com.	123452	IN	A	1.2.3.4

# dig @localhost www.blah.com
The last command results in the following dump:
01:36:14.567622 IP devfs1.5301 > 1.2.3.4.53: 42416% [1au] A? www.blah.com. (41)
01:36:15.067816 IP devfs1.5301 > 1.2.3.4.53: 29011% [1au] A? www.blah.com. (41)
01:36:15.568013 IP devfs1.5301 > 1.2.3.4.53: 30586 A? www.blah.com. (30)
01:36:16.568182 IP devfs1.5301 > 1.2.3.4.53: 38101 A? www.blah.com. (30)
01:36:18.568429 IP devfs1.5301 > 1.2.3.4.53: 64596 A? www.blah.com. (30)
01:36:22.568634 IP devfs1.5301 > 1.2.3.4.53: 59943 A? www.blah.com. (30)
01:36:30.568960 IP devfs1.5301 > 1.2.3.4.53: 39614 A? www.blah.com. (30)
01:36:40.569163 IP devfs1.5301 > 1.2.3.4.53: 13769 A? www.blah.com. (30)
So, effectively, if I controlled the 1.2.3.4 machine I would be able to answer those queries with a controlled address. I was not able to inject an 'A' record for any domain except the one which happened to match the ID in my fake responses, and it looks like 'A' records are not accepted at all (I'm far from being a DNS expert).

So, actually I consider this exploit complete: it is capable of arbitrary NS record poisoning. Its performance is rather good: a poisoning attack requires 1-3 (sometimes more, it heavily depends on link capacity and auth DNS server performance) queries from the client to the authoritative DNS server. An attacking server connected via a gigabit link can easily exhaust the whole DNS ID space while the attacked resolver waits for a reply from the remote server. Math tells me that a 100 Mbit connection will require about two times more requests to be sent by the client, which is still not that much.

The server side of the exploit requires root privileges to run, since it uses a raw socket to create datagrams with the IP addresses of the attacked server and the appropriate authoritative name server. The client connects to one or more attacking servers, sends them the appropriate response message and issues a DNS request for that response to the attacked server. Poisoning servers start to flood the attacked server with replies, until the client sends them the next reply to bomb with. When the client receives a fake answer from the poisoned DNS server, the attack stops. The exploit allows you to specify the name server to attack, the NS query to inject and the DNS name to have that NS record.

Having hard GigE performance numbers, I can say that port randomization does not solve the DNS poisoning attack at all (although it makes it harder), since with such link capacity the attacker only needs to guess the port, and the ID space will be bruteforced before the reply is received from the authoritative name server.

So far I can not test randomized-port BIND, since the local Debian mirror somehow has an unsigned package for it, so I will not install it right now, but will do it later and provide numbers with a randomized server. I expect to be able to poison even that server, although not as fast as with a constant port.

Have fun!

/devel/networking/dns :: Link / Comments ()


Tue, 05 Aug 2008

DNS cache poisoning attack succeeded for the constant port.

Hacking rox!

# dig @devfs1 3-c13a-15729.paypal.com.

; <<>> DiG 9.5.0-P2 <<>> @devfs1 3-c13a-15729.paypal.com.
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 18330
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 0

;; QUESTION SECTION:
;3-c13a-15729.paypal.com.	IN	A

;; ANSWER SECTION:
3-c13a-15729.paypal.com. 123405	IN	A	1.2.3.4

# dig 1-71b2-16080.money.paypal.com.
...
;; ANSWER SECTION:
1-71b2-16080.money.paypal.com. 123421 IN A	1.2.3.4

# dig @localhost 29-07f3-16098.test.com
...
;; ANSWER SECTION:
29-07f3-16098.test.com.	123411	IN	A	1.2.3.4
Although it is not a complete win yet: the additional section from the poisoning packet was parsed, and the entry looks like it was inserted into the DNS server database, but a subsequent request ends up querying the remote server. Probably because my fake requests do not contain an authority section, so I will extend them and continue this game :)

Ugh, 4 A.M. My body, soul and what else wants to sleep will all hate me tomorrow.

/devel/networking/dns :: Link / Comments ()


Mon, 04 Aug 2008

Got back testing machines.

I was called a saboteur, although no one was able to answer what will happen if the same load is generated by some virus or trojan.
Nevertheless I played some political games, had some talks, which I managed to cool down from an angry to a fun tone, and eventually got access again.

I installed BIND on one of the servers, which by coincidence does not have the port randomization fix, so it issues all requests from port 5301. I fixed IP header initialization, so now the attacking servers send their fake DNS replies not with their own IP address as the source (that was likely one of the main reasons, if not the main one, the machines were disabled), but with the appropriate auth DNS server IP address.

I also found an interesting thing about DNS server traffic: the resolver server's network channel is so loaded with small fake UDP DNS replies that other packets almost can not sneak in, so effectively the real reply arrives only after almost the whole ID range has been bruteforced. I remind you that these are GigE-linked machines, and the attacking servers send about 200-300 thousand packets per second on average; the drop rate is about 30% (only about 45 thousand packets are received out of more than 65,000 sent).

This basically means that in this particular case the probability of successful poisoning with port randomization is limited only by the random port number, and the random ID plays almost no role (since traffic generated by the attacking server will eat the bandwidth and will not allow the real reply to come first), so one just has to guess the port number and the attack will succeed.
I will try to prove this theory tomorrow, as well as confirm that my exploit works.

/devel/networking/dns :: Link / Comments ()


Sat, 02 Aug 2008

DNS cache poisoning attack results.

Disabled account and turned off access to the servers.

And it is just because of several minutes of 200+ kpps UDP DNS response storms from three machines to one of the corporate DNS servers (I think there are hundreds of them, I just got access to a couple). Who the hell monitors it on Saturday night at 2 A.M.? I specially selected a time when normal people sleep, drink or have sex, but do not work and watch DNS server load.

The only problem actually is that those servers were also used for POHMELFS development and testing. I am still able to work with two Xen domains though (where I actually develop and test initial implementations without various stress loads for all my current projects), so development will not stop.

I will pretend to be an idiot and to have viruses there. Linux kernel viruses.
And of course I will promise I will install all updates and will be careful next time.
Next time I will not attack a known nameserver, but install my own.
It is all about science and not about doing harm (I even poisoned a non-existent domain).

Or they will take away my toys and kick my ass, but I will resist, so there will be no interesting notes about the DNS cache poisoning attack (although no, I will be able to run one on my desktop via loopback, it is quite a fast machine) and nice benchmark graphs :)

/devel/networking/dns :: Link / Comments ()


Fri, 01 Aug 2008

DNS cache poisoning attack exploit completed.

I believe I've completed quite a distributed client/server network exploit, which is capable of poisoning a given DNS cache whether it works with a single source port or randomizes it over some port range.
I already described the client-server architecture, so only short notes here.
The client broadcasts a set of ports and fake queries to a number of poisoning servers, and then sends the attacked name server a specially crafted query for a name which does not exist in the attacked domain. Poisoning servers send lots of replies to the attacked DNS server with fake IP addresses and ports, pretending to be the address/port of the authoritative DNS server. Each reply contains an answer section for the current client query and an additional section, which contains information about the attacked domain: the former is a subdomain of the latter, like querying the 'IN A' record for '123-456.www.blahblah.com' while the reply contains 'IN A' data for '123-456.www.blahblah.com' in the answer section and 'IN A' data for 'www.blahblah.com' in the additional section.
The client then checks the reply (or falls back on a timeout), and if it does not contain the given record for the query, sends the next packet to the poisoning servers and the appropriate request to the attacked caching name server.

So far I did not succeed in this attack, but managed to load the network (and actually the main name server) so much that really lots of people around started to complain that they have troubles... This is also a result actually, but not the one I expected, so I will postpone the attack until late tonight.

Tcpdumps show that the broadcasted data is valid, but there was no actual poisoning, so I will probably install my own server and configure it to use a single port. Currently the attacked server has a not very random port distribution, but still not a constant one. My poisoning servers (two servers connected via a GigE link to the same network as the attacked server) use 100% CPU each, since they need to calculate the UDP checksum for each packet (since it has a different ID and/or port number) and use a raw socket to transmit data (to specify the source and destination addresses of the authoritative and attacked servers). Each server is usually capable of transmitting about 30k-130k packets per second, which corresponds to 1-20 ports (and the whole 64k ID range per port) during the 5 second timeout interval before the next request. This is of course not enough for a 100% guarantee, but I think after quite a long time the attack may succeed, so I will put it in action for the next weekend or at least a night.
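
The per-packet cost mentioned above comes from the fact that the UDP checksum covers the payload, so every change of the ID or port invalidates it. The sketch below shows the standard Internet checksum (RFC 1071) that UDP uses, applied to a generic byte string; it does not build the pseudo-header or a full UDP datagram, and the sample bytes are the example from the RFC rather than anything taken from the exploit.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement of the ones' complement sum (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Example data from RFC 1071.
sample = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
checksum = internet_checksum(sample)

# Changing a single 16-bit field (think of the DNS ID) changes the result,
# which is why it has to be recomputed for every packet.
changed = bytes([0x00, 0x02]) + sample[2:]
other = internet_checksum(changed)

print(hex(checksum), hex(other))
```

A useful property for checking an implementation: running the function over the data with its own checksum appended yields zero.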

Bert Hubert did some math on this kind of attack; the result is not very promising for the attacker, but the probability is still far from zero.
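The numbers above can be turned into a rough back-of-the-envelope estimate. This is only a sketch under simplifying assumptions, not Bert Hubert's calculation: it assumes the resolver picks port and ID uniformly, that the flooder gets the whole timeout to itself (in practice the real answer arrives earlier, so this is an upper bound), and that rounds are independent. The function names and the port range argument are made up for illustration.

```python
def ports_covered(packets_per_second, timeout_s, ids_per_port=65536):
    """How many source ports can be swept with every ID within one timeout."""
    return packets_per_second * timeout_s / ids_per_port


def round_success(packets_per_second, timeout_s, port_range, ids_per_port=65536):
    """Chance that some forged reply in one round matches both port and ID."""
    guesses = packets_per_second * timeout_s
    return min(1.0, guesses / (port_range * ids_per_port))


def success_after(rounds, p_round):
    """Chance that at least one of `rounds` independent rounds succeeds."""
    return 1.0 - (1.0 - p_round) ** rounds
```

With 30k-130k packets per second and a 5 second timeout, ports_covered() gives roughly 2 to 10 ports per flooding server, which sits inside the 1-20 ports quoted above. Something like success_after(1000, round_success(130000, 5, 15000)) then estimates the chance after a thousand rounds for an assumed range of 15000 ports.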

I do not promise success, but I would like to know if I'm on the right track, so the attack has been started...

P.S. DNS has its own tag in the blog now.
P.P.S. The source code of the distributed cache poisoning exploit (it may be completely incorrect!) can be found in the archive. Sorry, no usage details, but you can use the '-h' command line parameter :)

/devel/networking/dns :: Link / Comments ()


Thu, 31 Jul 2008

DNS cache poisoning client/server architecture.

So far I have only implemented a simple request flooder, which takes the number of destination ports as a parameter, plus two names and addresses to put into the answer and additional sections of the DNS reply. It uses a UDP socket, so the source address does not belong to the server which is supposed to answer the given query. This application will therefore not actually work, and I need to implement sending via a packet socket and substitute the source IP address with the authoritative DNS server's one.
The poison flooder also should not use only one name/address in the answer section; instead it should iterate together with the client, so that the appropriate request and answer are synchronized.

So far, the initial design of the client/server architecture of this small project looks like this: depending on flags, either the client connects to multiple flood servers or vice versa. The client then sends each server a message in which it specifies the port and ID ranges to attack, the attacked DNS server's IP, the query name, the source address (pretending to be an authoritative name server) and the additional resource record data to put into replies (which will poison the cache).
Each server starts sending that data to the specified name server, with the source address changed to the authoritative name server's one and with the ID and port varied within the given range. When the client has finished broadcasting the request data to all flood servers, it sends a request to the attacked DNS server to resolve the given query name. Now the flood servers race with the authoritative one to provide an answer. When the client receives the answer, it checks whether it looks like the poisoned data we want to get, or the real answer (which should be NX domain, since we resolve non-existing names). In the former case we exit the process and enjoy the result, otherwise the client specifies the next name to resolve and the same starts again.
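The client side of that loop boils down to two small pieces: producing the next name to ask for, and deciding what the returned answer means. Below is a minimal sketch of just that decision logic, with no networking in it. The function names, the tuple layout of `answers` and the "unknown" outcome are assumptions made for illustration and are not taken from the actual exploit source.

```python
import itertools

NXDOMAIN = 3  # DNS RCODE for "no such name"


def candidate_names(domain, start=0):
    """Yield an endless series of names that should not exist in `domain`."""
    for number in itertools.count(start):
        yield f"{number}.{domain}"


def classify(rcode, answers, poison_ip):
    """Decide what the cache returned for one probe name.

    `answers` is a list of (name, ip) pairs taken from the answer section.
    """
    if rcode == NXDOMAIN:
        return "real"        # the authoritative server won this round
    if any(ip == poison_ip for _, ip in answers):
        return "poisoned"    # a forged reply was accepted
    return "unknown"         # timeout, SERVFAIL or something unexpected
```

A "real" or "unknown" result means moving on to the next candidate name; only "poisoned" ends the loop.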

Looks interesting...

/devel/networking/dns :: Link / Comments ()


Wed, 30 Jul 2008

Simple DNS server/resolver.

The right time to hack on a DNS server is the middle of the night: it is 3 A.M. here and I've just completed an initial draft of a trivial DNS server, which is only capable of receiving a datagram on a predefined port, parsing it and filling in a reply with a static "IN A" record (I think I will add a config file). This record is placed into the 'answer' and 'additional' resource record sections, then the whole request is sent back to the client.

That's how it looks with the standard UNIX dig command:

$ dig @localhost -p 1025 www.google.com
;; Warning: query response not set

; <<>> DiG 9.4.2-P1 <<>> @localhost -p 1025 www.google.com
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51486
;; flags: rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;www.google.com.			IN	A

;; ANSWER SECTION:
www.google.com.		123456	IN	A	195.178.208.66

;; ADDITIONAL SECTION:
www.google.com.		123456	IN	A	195.178.208.66

;; Query time: 15 msec
;; SERVER: 127.0.0.1#1025(127.0.0.1)
;; WHEN: Wed Jul 30 02:56:23 2008
;; MSG SIZE  rcvd: 64
There are several warnings, which I will fix later, but the main part is the section content: www.google.com obviously does not have the IP address of my blog site. The TTL is usually not equal to 123456 either.
The game continues, but I need some sleep...
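For reference, a reply like the one dig printed above can be assembled in a few lines. This is a sketch of the wire format and not the server described in this entry: it assumes name compression (a pointer back to the question name at offset 12) and, unlike the dump above, it sets the QR bit so dig would not warn about it. All function names are invented for the example.

```python
import struct


def encode_name(name):
    """Encode 'www.example.com' as length-prefixed labels plus a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def a_record(ttl, ip):
    """One 'IN A' record; its owner name is a compression pointer to offset 12,
    which is where the question name starts right after the header."""
    rdata = bytes(int(octet) for octet in ip.split("."))
    return struct.pack("!HHHIH", 0xC00C, 1, 1, ttl, len(rdata)) + rdata


def build_reply(query_id, name, ip, ttl=123456):
    """Header, question, and the same A record in answer and additional."""
    flags = 0x8100  # QR=1 (this is a response), RD=1 as in the query
    header = struct.pack("!HHHHHH", query_id, flags, 1, 1, 0, 1)
    question = encode_name(name) + struct.pack("!HH", 1, 1)  # type A, class IN
    record = a_record(ttl, ip)
    return header + question + record + record
```

For www.google.com this layout comes to 64 bytes, the same size dig reports above, which hints at (but does not prove) a similar layout in the real server.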

/devel/networking/dns :: Link / Comments ()


Tue, 29 Jul 2008

Some DNS port distribution data.

Gathered late at night today, so that the DNS server would not be disturbed too much by other users.
The graphs below show the source port cloud and distribution of some BIND (I do not know the version) for a thousand runs. Each request asked for a non-existent subdomain of a domain whose server I control, so I was able to capture dumps and analyze them a bit.

[Figures: DNS source ports cloud; DNS source ports distribution]

These graphs show the source port cloud and its distribution. Each histogram bar corresponds to the number of hits in a 100-port range; the start of the range is shown on the X axis labels.
First, ports are randomly selected within the 50k-65k range, so one needs to guess among a much smaller number of ports.
Second, even in 1 thousand requests there are lots of requests with the same port (stats show that there are 149 ports which were used 2 or more times in the above 1000 runs, and there is even a single port which was used 4 times). If we select a range of 100 ports, then the appropriate distribution is shown on the graph.
Such behaviour allows the source port range to be limited even more.
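The stats quoted above are simple to recompute from a list of captured source ports. A small sketch, assuming the ports have already been extracted from the dumps into a plain Python list (the function names are made up here):

```python
from collections import Counter


def bucket_histogram(ports, width=100):
    """Count hits per `width`-port bucket, keyed by the first port of the bucket."""
    return Counter((port // width) * width for port in ports)


def reuse_stats(ports):
    """Return (number of ports seen 2+ times, highest hit count of any port)."""
    hits = Counter(ports)
    repeated = sum(1 for count in hits.values() if count >= 2)
    return repeated, max(hits.values(), default=0)
```

bucket_histogram() gives the data behind the distribution graph, and reuse_stats() gives the "used 2 or more times" and "used 4 times" style numbers.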

Now, DNS IDs.

[Figures: DNS ID cloud; DNS ID distribution]

The whole range of IDs is used, and their distribution (each histogram bar corresponds to the number of IDs in the appropriate 100-ID range) is more uniform. There were only 9 IDs used twice per 1000 runs.

But since I do not know the exact load of the analyzed DNS server (and it can be high even at 3 A.M.), I can not say whether those numbers are due to the implementation of the port/ID selection algorithm or just because the load was high and there were actually not only my 1000 requests.
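One way to get a feeling for these counts is to compare them with what perfectly uniform selection would produce. The sketch below computes the expected number of values hit two or more times; it assumes independent uniform picks, and the exact port range (50000-65535, i.e. 15536 values) is a guess based on the "50k-65k" observation above.

```python
def expected_repeated(draws, values):
    """Expected number of distinct values hit 2+ times in `draws` uniform picks."""
    p = 1.0 / values
    none = (1.0 - p) ** draws                     # a given value is never picked
    once = draws * p * (1.0 - p) ** (draws - 1)   # a given value is picked exactly once
    return values * (1.0 - none - once)
```

Under these assumptions expected_repeated(1000, 15536) is about 31 for ports and expected_repeated(1000, 65536) is about 8 for IDs, against the 149 and 9 observed. So the ID reuse is roughly what uniform selection would give, while the port reuse is well above it, at least within this simplified model.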

To play further with DNS caches I decided to install a local DNS server first and test things with it.

/devel/networking/dns :: Link / Comments ()


Sun, 27 Jul 2008

Lots of talks about DNS cache poisoning attack.

There are two types of this attack: DNS query ID guessing, and request source port guessing for servers which use a randomized source port, which should have been turned on after Dan Kaminsky's alert.

The DNS ID is only 16 bits, so it can be guessed rather fast; one just needs to force someone who uses the attacked DNS cache to issue the appropriate requests. When an answer is received by a DNS resolver, it is stored there for a predefined amount of time (the TTL parameter provided by a higher-level DNS resolver or eventually the authoritative name server). Dan found that the attacker can actually ask not for the attacked domain, but for some subdomain of it (if the attacker tries to point www.microsoft.com to his own IP, he can force sending DNS requests for 1.microsoft.com, 2.microsoft.com and so on), and put data about the actual target into additional resource records attached to all datagrams. So, when he eventually wins the race, he can store (among lots of subdomains) the needed pointers in the attacked DNS cache.

I've just thought that this attack would not be possible if all queries from DNS resolvers to higher-level resolvers and/or authoritative name servers happened over TCP instead of the more common UDP. There would be no need to issue requests from random ports anymore, and no need to parse and drop additional resource records. There would be no problems with truncation of large messages... But to play a bit with the whole idea I'm implementing a simple DNS query/response processor. Maybe I will play a bit with poisoning the local cache (the ISP at the office uses only 6 different ports to send requests), although its main goal is an IP-over-DNS tunnel.
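To give an idea of what such a query/response processor deals with, here is a small sketch that builds a query and decodes the fixed 12-byte header into its flag bits. It is not the ./query tool from this entry, only an illustration of the wire format; the function names and dictionary keys are chosen here for the example.

```python
import struct


def build_query(name, query_id, qtype=1, qclass=1):
    """Build a minimal DNS query: 12-byte header plus one question, RD set."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, qclass)


def parse_header(message):
    """Split the fixed 12-byte header into its fields and flag bits."""
    qid, flags, qd, an, ns, ar = struct.unpack("!HHHHHH", message[:12])
    return {
        "id": qid,
        "resp": flags >> 15,
        "opcode": (flags >> 11) & 0xF,
        "auth": (flags >> 10) & 1,
        "trunc": (flags >> 9) & 1,
        "rd": (flags >> 8) & 1,
        "ra": (flags >> 7) & 1,
        "rcode": flags & 0xF,
        "question": qd,
        "answer": an,
        "auth_rr": ns,
        "addon": ar,
    }
```

The bytes from build_query("tservice.net.ru", 0x1234) can be sent over a UDP socket to port 53 of a resolver, and parse_header() applied to whatever comes back.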

This is kind of a real rest after the visa/hotel paperwork. I was told that if I am called to the embassy for an interview, chances are high the visa will be declined because of my sense of humor :)

Update:

zbr@gavana:~/aWork/tmp/dns$ ./query -a 195.178.208.66 -i 0x1234 -q tservice.net.ru
query: 'tservice.net.ru', class: 1, type: 1, server: 195.178.208.66:53, protocol: 17, id: 1234.
Connected to 195.178.208.66:53.
id: 1234: flags: resp: 0, opcode: 0, auth: 0, trunc: 0, RD: 1, RA: 0, rcode: 8.
        : question: 1, answer: 1, auth: 2, addon: 2.
	: question: name: 'tservice.net.ru.', type: 1, class: 1.
	: name: 'tservice.net.ru.', type: 1, class: 1, ttl: 86400, rdlen: 4, rdata: 195.178.208.66
	: name: 'tservice.net.ru.', type: 2, class: 1, ttl: 86400, rdlen: 14, rdata: ns.tservice.ru.
	: name: 'tservice.net.ru.', type: 2, class: 1, ttl: 86400, rdlen: 7, rdata: dns2.tservice.ru.
	: name: 'ns.tservice.ru.', type: 1, class: 1, ttl: 86400, rdlen: 4, rdata: 195.178.208.66
	: name: 'dns2.tservice.ru.', type: 1, class: 1, ttl: 86400, rdlen: 4, rdata: 62.141.76.164
And the DNS protocol gets the first prize among the ugliest crappies.
Now it's time to create the DNS server itself, which will get requests (the above dump shows a BIND session), parse them and perform appropriate actions, like sending a reply with specially crafted additional resource records: either a NULL one, for example (it can contain up to 64k of data), or TXT (a length byte followed by a character string; there may be multiple strings as long as the total length, including the length bytes themselves, is less than 64k). Or an additional A resource record, which may contain information about the domain to poison...
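The TXT layout described above (length byte, then up to 255 bytes, repeated) is easy to show in code. A minimal sketch of packing arbitrary tunnel data into TXT RDATA and getting it back, with function names invented for the example:

```python
def txt_rdata(payload):
    """Pack bytes into TXT RDATA: a series of <length byte><up to 255 bytes>."""
    chunks = [payload[i:i + 255] for i in range(0, len(payload), 255)] or [b""]
    rdata = b"".join(bytes([len(chunk)]) + chunk for chunk in chunks)
    if len(rdata) > 0xFFFF:  # RDLENGTH is a 16-bit field
        raise ValueError("TXT RDATA does not fit into 16-bit RDLENGTH")
    return rdata


def txt_payload(rdata):
    """Inverse of txt_rdata(): concatenate the character strings back."""
    out, pos = b"", 0
    while pos < len(rdata):
        length = rdata[pos]
        out += rdata[pos + 1:pos + 1 + length]
        pos += 1 + length
    return out
```

Every 255 bytes of payload cost one extra length byte, which is the overhead an IP-over-DNS tunnel pays for using TXT instead of NULL records.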

/devel/networking/dns :: Link / Comments ()


Tue, 01 Jul 2008

Why is blocking sending considered harmful?

I frequently hear that whatever server you implement, it has to be non-blocking, since in the case of parallel sending this allows multiple requests to be sent to fast servers while not sending data to a slow server, since a non-blocking socket will return EAGAIN.

This is only a half-right solution: when we have to put the given data on all servers, and can not free it until all servers have replied with an acknowledgement, non-blocking mode can bring more damage than gain.

Mainly because it allows all the memory to be eaten by requests which are still in the queue to be sent to the slow server, and which were already sent to the fast ones. In this case the higher-level application (consider a simple application which generates some data and writes it into a file in a distributed filesystem, which writes the file to several servers) will never block, since the transfer to the fast servers completes quickly, and it will provide more and more data, which will consume all RAM.

It is possible to deadlock the system in this case, since to send some data to a remote server we always have to allocate at least some memory to put network headers into. With the non-blocking solution we will consume all memory and kick ourselves into a coma.
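The effect is easy to see in a toy model. The sketch below is not taken from any real filesystem code; it just counts requests that have to stay in memory until the slowest server acknowledges them, once with an unbounded (non-blocking) producer and once with a producer that blocks when a fixed window of unacknowledged requests is full. All names and rates are invented for the illustration.

```python
def peak_pending(ticks, produced_per_tick, slow_acks_per_tick, window=None):
    """Simulate requests that stay in memory until the slowest server acks them.

    window=None models the non-blocking case: the producer is never stopped.
    window=N models a blocking sender: at most N requests may be unacked.
    Returns the largest number of requests held in memory at any tick.
    """
    pending = peak = 0
    for _ in range(ticks):
        room = produced_per_tick if window is None else max(0, window - pending)
        pending += min(produced_per_tick, room)   # producer adds what it may
        peak = max(peak, pending)
        pending -= min(pending, slow_acks_per_tick)  # slow server frees some
    return peak
```

With a producer ten times faster than the slow server, peak_pending(100, 10, 1) keeps growing with the number of ticks, while peak_pending(100, 10, 1, window=32) never exceeds the window. The blocking variant trades throughput to the fast servers for bounded memory.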

/devel/networking :: Link / Comments ()


Passive OS fingerprinting.

I've updated the OSF modules to xtables, so you have to enable its support in the kernel config and get a recent iptables (I tested with 1.4.1.1, which is the latest release to date).

OSF allows you to match incoming packets against different SYN packet signatures and determine which system is on the remote end, so you can make decisions based on the OS type and, to some degree, even its version.

Installation instructions, an example and the source code can be found on the homepage.

I've also sent it to the netfilter-devel@ and netdev@ mailing lists, since my previous mails never appeared there, likely because of spam filters.

/devel/networking :: Link / Comments ()


Sat, 14 Jun 2008

Passive OS fingerprinting.

Ever dreamt of blocking all Linux users in your network from accessing the internet and giving full bandwidth to Windows worms? We have to care about our smaller brothers, so this iptables extension module allows you to do so. OSF, which stands for OS Fingerprint, allows you to make the usual iptables decisions on incoming TCP packets; the initial handshake packet with the SYN bit set is enough to understand what the remote OS is. The original idea belongs to Michal Zalewski.
This iptables module was implemented almost 5 years ago and lived in the patch-o-matic iptables tree (the userspace library is still there). Now I've updated it to Xtables and sent it for review.

Installation steps are described on the homepage, but they are trivial and include the usual make/make lib building and loading rules into the module via a procfs file.

# insmod ./ipt_osf.ko
# ./load ./pf.os /proc/sys/net/ipv4/osf
# iptables -I INPUT -j ACCEPT -p tcp -m osf --genre Linux --log 0 --ttl 2 --connector
You will find something like this in syslog:
ipt_osf: Windows [2000:SP3:Windows XP Pro SP1, 2000 SP3]: 11.22.33.55:4024 -> 11.22.33.44:139

/devel/networking :: Link / Comments ()


New userspace network stack release.

Fixed a bug found by Salvatore Del Popolo (delpopolo_dit.unitn.it) in the TCP implementation: the system checked the sending window, determined that the packet was not allowed to be sent, and nevertheless tried to send it in some cases.

The userspace network stack is a very fast (when working on top of netchannels; a packet socket is also supported) and very small network stack (TCP/UDP/IP/ethernet) implemented entirely in userspace. Because it lives at the very end of the peer (i.e. very close to or even embedded into the application), it allows much faster processing of some workloads, namely small packet sending and receiving, where it outperforms the vanilla Linux TCP/IP stack 3 times in performance and 4 times in CPU usage (sending and receiving vary).

[Figure: ATCP gigabit test]

Compare netchannels+unetstack versus Linux sockets (numbers from 2006).

It is not about problems in the Linux stack, but about the overhead of syscalls, which in turn results from data sending and reply processing being too separated in the existing model.

/devel/networking/unetstack :: Link / Comments ()


CARP: Common Address Redundancy Protocol for Linux kernel.

I've finally made a new release of the CARP for Linux kernel.

CARP is an improved version of the Virtual Router Redundancy Protocol (VRRP) standard. The latest protocol to help provide high availability and network redundancy, it was developed because router giant Cisco Systems believes that its Hot Standby Router Protocol (HSRP) patent covers some of the same technical areas as VRRP.

This project allows you to build highly available clusters of multiple machines with balanced master selection between them. Installation and setup are pretty trivial:

$ tar -zxf carp_latest.tar.gz
$ cd carp
$ make

# insmod ip_carp.ko
# modprobe cn
# insmod carp_conn.ko
# ifconfig carp0 up
# carp_conn_daemon -m master.sh -b backup.sh
And the same on all other machines.
Each script, as you can guess from its name, is executed when the node becomes a master or a backup one; you can put there firewall rule changes, traffic shaping setup, network daemon start/stop scripts and whatever you like.

Its main advantage over other existing open master/backup solutions (like Heartbeat or userspace CARP; well, it also behaves much more robustly than Cisco VRRP) is the ability to set up a multicast address (via the usual /sbin/ifconfig command) and thus not confuse some crappy Cisco hardware, which would not understand that the node changed.

One can get the latest sources from the CARP homepage.
Enjoy!

/devel/networking :: Link / Comments ()


Tue, 01 Apr 2008

Fix for the fundamental network/block layer race in sendfile().

Summary of the previous series with this pompous header: when sendfile() returns, the pages which it sent can still be queued in the tcp stack or hardware, so a subsequent write into them will end up corrupting the data which will eventually be sent. This concerns all ->sendpage() users, namely sendfile() and splice().

We can safely reuse those pages only when an ack is received from the remote side, which forces the network stack to release the pages. My simple extension allows hooking into the data releasing path and performing any actions we want. This is achieved by replacing skb->destructor with an own callback registered by the interested user, for example the splice/sendfile code. Splice (the pipe info structure) in turn is extended to hold an atomic counter of pages in flight (without a structure size change, because of the alignment issues it has right now), so the splice code will sleep once a full pipe info (->nrbufs pages) has been sent: it waits until the number of pages in flight, which is decremented in the private splice callback, hits zero.
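The counting scheme itself is independent of the kernel and can be shown with a userspace analogy. The sketch below is not the patch: it only mimics the idea of "count what was handed to the network, decrement in the release callback, let the sender sleep until the count hits zero" with a condition variable. The class and method names are invented.

```python
import threading


class PagesInFlight:
    """Counts pages handed to the network stack and not yet released."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def sent(self, pages=1):
        """Called by the sender for every page queued for transmission."""
        with self._cond:
            self._count += pages

    def released(self, pages=1):
        """Plays the role of the destructor callback run once data is acked."""
        with self._cond:
            self._count -= pages
            if self._count == 0:
                self._cond.notify_all()

    def wait_idle(self, timeout=None):
        """Block until nothing is in flight; returns False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout)
```

The sender calls sent() for each page and wait_idle() before touching the pages again; whatever learns about the acknowledgement calls released().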

The patch was tested with simple send and recv applications, which can be found in the archive.

One has to run them on different machines, since loopback uses a slightly different scheme (namely the page is _never_ copied, so when it is received by the 'remote' side, it still exists on the 'local' side, so modifications will end up in data corruption).

devfs1# ./recv -a 0.0.0.0 -p 1025 -c 1024
devfs2# ./send -a devfs1 -p 1025 -f /tmp/test -c 1024
In case of failure you will get this:
Connected to devfs1:1025.
/tmp/test/1024 -> devfs1:1025
Data was corrupted: ab.
after a short period of time, where the above 'ab' is a hex byte written into the mapped file, which has been sent, immediately after sendfile() returns to userspace. Data is supposed to be always zero, and the applications should run forever.
-c parameter specifies number of bytes to be sent in each run of the sendfile(). It has to be the same on both machines.

This idea was first thought of as soft barriers in distributed storage.

/devel/networking :: Link / Comments ()


Fri, 29 Feb 2008

Debugging undebuggable.

If something looks undebuggable at first view, then take a second one. Better from a different angle. Some problems require a third look.

A bit of history of the problem. Pohmelfs has extremely large latencies when syncing a local inode to the remote server. This involves sending a command to the server to create an object with the given name and receiving back a response with its real inode information (like the inode number and other fields cached for faster stat() and similar workloads). Pohmelfs then changes the local inode info to match the real data.
Syncing a small tree of 500 files takes about 40 (!) seconds. Well, in the Xen environment where I develop these things, local creation of 500 files in a single ext3 directory takes more than 15 seconds, but the other 25 are pure overhead.
That was a short description of the previous series.

Next, the problems of fixing the problems.
First, the Xen version used on that testing machine is old enough that oprofile does not work. Second, I do not know VFS internals well enough (this is my first filesystem; the interested reader can find how I managed to step on likely every possible rake in that field, some of them were even small kid rakes...) to determine where those long delays could be caught. A Linux filesystem is actually not that complex a system, just a set of callbacks, and the implementation is not really outstanding, but knowing under which conditions each callback can be invoked and which problems can lurk here or there is kind of a magic... Third, the remote userspace pohmelfs server was not actually written by me; instead its bytecode was blown out under the inspiration of some substances, so it could very much be the reason for all the problems. Given that it is as trivial as pretty much all my userspace code, even a total rewrite would not fix the issue.

So, the latency problem in pohmelfs looked really undebuggable. But you know, a cup of excellent tea (from a tea bag) with lemon can fix any problem (or high temperature and substances, or a fair amount of alcohol, everyone has fun the way he likes), so it was first decided to implement a simple network kernel module which would connect to the remote userspace server and exchange messages in a similar fashion to pohmelfs.
Such a module was implemented, started and showed excellent performance (about 1 thousand messages per second sent and received back in the test network, which is several orders of magnitude faster than pohmelfs). So, move back to VFS and pray for inspiration.

Inspiration was met today (thanks Arnaldo, likely it is because I'm getting healthier :).
I always thought that a number of subsequent recv() calls is not a good idea no matter where, in kernel or userspace, since each takes the socket lock, which in turn can introduce the latencies found. So I eliminated subsequent recvs in the pohmelfs code (the testing module was written better and does sending and receiving without such 'fragments'), which resulted in... nothing, the results did not change at all. So, a wrong step, but having several sending calls in a row is not a good idea either, so I replaced them with an allocation and copy, so that there would be only a single kernel_sendmsg() call. As you might expect, performance... changed by 30 times. Just having a single send call instead of two for as few as 500 invocations forced the whole network exchange to behave completely differently.
So, to debug the problem further I extended the testing module and introduced the ability to send and receive data not as a single packet but as two fragments: 4 bytes and the rest of the packet (60 bytes). Here is the result table for 1000 messages sent and received back by the testing module:

no fragments:				1.43 seconds
send fragments (4 and 60 bytes):	40.43 seconds
recv fragments (4 and 60 bytes):	1.43 seconds
both fragmentations:			40.43 seconds
That is about a 30 times difference from just a simple application change!
tcpdump on the receiving side shows that sending subsequent fragments results in a real message being sent every time kernel_sendmsg() is invoked, which results in an ack for each such message (both 4 and 60 bytes), which completely degrades the tcp window, and the connection just can not recover with such behaviour.
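The table works out to about 1.4 ms per message without send fragments and about 40 ms with them. A per-message cost in that range is typical of small writes running into delayed ACKs, although nothing in this entry verifies that explanation. The fix described above (copy header and body into one buffer, send once) looks like this in userspace terms. This is a sketch with an invented 4-byte length header, not the pohmelfs protocol:

```python
import socket
import struct


def send_framed(sock, payload):
    """Send a 4-byte length header and the payload with ONE send call."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, size):
    """Read exactly `size` bytes or raise if the peer closed the connection."""
    data = b""
    while len(data) < size:
        chunk = sock.recv(size - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        data += chunk
    return data


def recv_framed(sock):
    """Read header, then body; two-step receiving was harmless in the table above."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```

The extra copy in send_framed() is cheap compared to what two separate small sends cost in the measurements above.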

So, all those words were written just to show that even problems which look undebuggable at first view can be easily solved, and that harmless (at first view again) programming mistakes can lead to very interesting results...

Now back to drawing board to think how to improve pohmelfs protocol even more to get the last bits out of the wire.

Btw, the interested reader can get my network testing module and userspace from their newly created homepage.

/devel/networking :: Link / Comments ()


Fri, 14 Dec 2007

New release of the userspace network stack.

Changed the data reading function: now it does not copy the TCP header into the user's buffer, only data. Also forced the packet socket reading path to limit the maximum number of packets read which do not match the created netchannel.
As usual, the new release is available from the project homepage.

/devel/networking/unetstack :: Link / Comments ()


Tue, 04 Dec 2007

The 22nd century netchannels release.

This is the 22nd release of netchannels, a peer-to-peer protocol agnostic communication channel between hardware and users. It uses a unified cache to store channels, and allows buffers for data to be allocated from a userspace mapped area or from another preallocated set of pages (like the VFS cache). All protocol processing happens in process context.

Users of the system can be, for example, userspace: it allows traffic to be received from and sent to the wire without any kernel interference, own protocols to be implemented and their processing offloaded to the hardware.

This idea was originally proposed and implemented by Van Jacobson. This patchset (with the userspace network stack) is a logical continuation of the idea with a move to full peer-to-peer processing.

Short changelog:

  • update cached route in the netchannel when it expires
Thanks to Salvatore Del Popolo (delpopolo_dit.unitn.it) for testing.

You can get the latest sources from the netchannels homepage.

The userspace network stack is available from its own homepage.

/devel/networking :: Link / Comments ()


Thu, 29 Nov 2007

The 21st netchannels release.

Netchannel is a peer-to-peer protocol agnostic communication channel between hardware and users. It uses a unified cache to store channels, and allows buffers for data to be allocated from a userspace mapped area or from another preallocated set of pages (like the VFS cache). All protocol processing happens in process context.

Users of the system can be, for example, userspace: it allows traffic to be received from and sent to the wire without any kernel interference, own protocols to be implemented and their processing offloaded to the hardware.

This idea was originally proposed and implemented by Van Jacobson. This patchset (with the userspace network stack) is a logical continuation of the idea with a move to full peer-to-peer processing.

One of its users is userspace network stack.

Short changelog:

  • fixed queue length usage
  • fixed dst release path. Both problems reported by Salvatore Del Popolo (delpopolo_dit.unitn.it)
  • removed nat user
More details can be found on project homepage.

/devel/networking :: Link / Comments ()


Wed, 07 Nov 2007

iWARP port sharing problem.

I read Roland Dreier's post about the iWARP port sharing problem and want to shed some light on it.
Besides the fact that Roland described the basics of the technology very well, he skipped that the problem was discussed and a solution was found: introducing iWARP-specific aliases, assigned by the administrator, so that the network stack gets a new ifindex and an application bound to a different device does not get the same port as the iWARP ones.
Roland also skipped the part where some improvements were suggested which were not implemented (error propagation and fallback, automation of the process (like alias creation) and other bits); most of the time essentially the same answer was received, that it is not needed... Maybe that is so, but why was this discussion missing from Roland's presentation of the evil empire of the network developers?
So, I think RDMA people do not need a discussion: you want your own ideas to get merged just because you believe they are cool, no matter how things are in real life and what others tell you about it.

I know that, because it was me, who performed first review of the alias patches for iWARP.

/devel/networking :: Link / Comments ()


Tue, 06 Nov 2007

New release of the userspace network stack.

It is based on patches by Holger Schurig (holgerschurig_gmx.de).
Short changelog for this unetstack release:

  • added netchannel.h, which allows to compile userspace network stack without netchannels support in the kernel
  • killed warnings about unused variables

/devel/networking/unetstack :: Link / Comments ()


Saving the universe from the thermal death

or decreasing world entropy. I.e. fixing bugs in the kernel.

My small contribution - fixed sch_teql bug.

/devel/networking :: Link / Comments ()


Thu, 01 Nov 2007

Network hash tables for socket lookups.

The topic of moving hash tables to RCU comes up regularly on the netdev@ mailing list, but so far there is no solution for the hash resizing problem because of the nature of RCU. Likely it can not be fixed at all without some additional (maybe optional) synchronization.
It was pointed out that Robert Olsson's hashed trie can be a good solution.

The interested reader can also check my multidimensional trie algorithm, which I implemented for network socket lookup and originally took from netchannels. It was announced at netdev@ but I got quite a passive response, so I froze the project for a while (it can be resurrected though)...
At the links above you can find performance testing compared to hash tables in the kernel with different sizes. Testing was performed by running a simple web server and a huge number of clients, which frequently connect to and disconnect from the server.

/devel/networking :: Link / Comments ()


Thu, 18 Oct 2007

New release of the userspace network stack.

Short changelog:

  • really fixed leak in raw netchannel reading path
  • changed timestamp setup
  • added retransmit checking timer
  • added sanity checks for addresses and ports processed in the stack - in case of a packet socket they can sometimes be incorrect (when working over loopback for example)
  • retransmit logic checks - still requires bits of work, it is not 100% correct
This release contains a number of really useful fixes, but the retransmit logic is not yet correct. Since unetstack uses a very aggressive (non-rfc-compliant) congestion control algorithm, this can lead (and I see this in practice) to the dataflow stalling completely.
I will investigate this problem further later.

/devel/networking :: Link / Comments ()


Reading userspace network stack code.

	if (!th->ack) {
		ulog("%s: Strange packet.\n", __func__);
		goto out;
	}
Very interesting, what did I mean?

/devel/networking :: Link / Comments ()


Tue, 16 Oct 2007

Userspace network stack.

I've released new version of the userspace network stack, which contains a memory leak fix by Salvatore Del Popolo (delpopolo_dit.unitn.it).
Enjoy!

/devel/networking :: Link / Comments ()

