Tue, 07 Oct 2008
Valgrind support for netchannels.
Alexandre Lissy (alexandre.lissy_smartjog.com) made a patch
for the latest Valgrind version to date (3.2.1).
Now one can analyze performance bottlenecks in
netchannels applications
using standard techniques.
/devel/networking :: Link / Comments ()
Fri, 26 Sep 2008
New failed ipw2100 interrupt and its races.
During my testing I managed to beat the following interrupts out of the chip:
[41773.200686] ipw2100: Fatal interrupt. Scheduling firmware restart.
[41773.200707] eth1: Fatal error value: 0x500185B8, address: 0x08004501, inta: 0x40000000
[41773.200810] ipw2100 0000:02:04.0: PCI INT A disabled
[41773.203110] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.224446] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.245781] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.249360] ipw2100 0000:02:04.0: enabling device (0000 -> 0002)
[41773.249384] ipw2100 0000:02:04.0: PCI INT A -> Link[C0C8] -> GSI 11 (level, low) -> IRQ 11
[41773.249426] ipw2100 0000:02:04.0: restoring config space at offset 0x1
(was 0x2900002, writing 0x2900006)
This happens while the ipw2100 PCI device is being disabled in the reset handler,
so when the interrupt handler sees that, it bails out. That should generally be OK,
but I found a different thing: there is a race between the interrupt handler (the handler
itself and the related processing tasklet) and
the reset code. The latter disables interrupts before it starts to turn the adapter on,
but the interrupt handler may be running at that very moment on a given CPU and can schedule
the tasklet, so disabling interrupts does not prevent parallel reading and writing of the
various registers.
The IRQ processing tasklet reads and writes registers under the lock with interrupts
turned off, but the reset tasklet does not protect the initialization path against it, so I wonder
what may happen in this case. Since registers are read and written at absolute
addresses (I mean there is no need to write an address register first), this may not be a problem,
but the race still exists and can theoretically harm the system. Similar unguarded accesses exist
in the ipw2100_wx_event_work() handler, and the status field is also set without protection
in various places in the driver, which can harm the driver's behaviour too.
So maybe I decided to blame the firmware a little bit early, although the things I found may
be harmless. I will try to figure this out tomorrow.
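The difference between a guarded and an unguarded path can be illustrated with a small userspace analogy. This is only a sketch and not driver code: FakeAdapter, set_bits(), clear_bits() and run() are hypothetical names, and a Python lock stands in for the driver's spinlock. The point is that both the IRQ side and the reset side take the same lock around every read-modify-write of the shared status word.

```python
import threading


class FakeAdapter:
    """Userspace stand-in for a device with one shared status word."""

    def __init__(self):
        self.status = 0
        self.lock = threading.Lock()

    def set_bits(self, mask):
        # Read-modify-write done under the lock.
        with self.lock:
            self.status |= mask

    def clear_bits(self, mask):
        # Same lock, so this cannot interleave with a half-done update.
        with self.lock:
            self.status &= ~mask


def run(adapter, rounds=10000):
    """Run an 'IRQ side' and a 'reset side' in parallel, return final status."""

    def irq_side():
        for _ in range(rounds):
            adapter.set_bits(0x1)
            adapter.clear_bits(0x1)

    def reset_side():
        for _ in range(rounds):
            adapter.set_bits(0x2)
            adapter.clear_bits(0x2)

    threads = [threading.Thread(target=irq_side),
               threading.Thread(target=reset_side)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return adapter.status


if __name__ == "__main__":
    # Every update is atomic, so both bits end up cleared.
    print(hex(run(FakeAdapter())))
```

If one of the two sides skipped the lock, an update from the other side could be lost between its read and its write, which is the kind of interleaving the reset path currently allows.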
/devel/networking/ipw2100 :: Link / Comments ()
Thu, 25 Sep 2008
ipw2100 fatal interrupt: playing with power states.
I was not able to force the card to stop sending or receiving packets
with ping tests, although I definitely was able to generate lots
of fatal interrupts with completely different values and addresses.
Frequently the card generates fatal interrupts with different values on the same
address, like below:
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
They did not follow one after another though.
Different error values at the same address likely mean that there is no correlation between
values and addresses, so this information is useless.
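One quick way to check the value/address relationship over a longer dmesg capture is to group the reported values by address. The sketch below assumes the log format shown above; values_by_address() and the regular expression are illustrative helpers, not part of the driver or any existing tool.

```python
import re
from collections import defaultdict

# Matches the "Fatal error value" lines printed by the driver (format taken
# from the dumps in this note).
FATAL_RE = re.compile(
    r"Fatal error value: (0x[0-9A-Fa-f]+), "
    r"address: (0x[0-9A-Fa-f]+), inta: (0x[0-9A-Fa-f]+)"
)


def values_by_address(lines):
    """Map each reported address to the set of error values seen at it."""
    seen = defaultdict(set)
    for line in lines:
        match = FATAL_RE.search(line)
        if match:
            value, address, _inta = match.groups()
            seen[address].add(value)
    return dict(seen)


if __name__ == "__main__":
    sample = [
        "eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000",
        "eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta: 0x40000000",
        "eth1: Fatal error value: 0x5000CF10, address: 0x61A00000, inta: 0x40000000",
    ]
    for address, values in sorted(values_by_address(sample).items()):
        print(address, sorted(values))
```

An address that maps to more than one value, as 0x61C00000 does here, is what suggests the pair carries no stable meaning.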
I added power state changes to the reset function, so now it does something like this:
[ 897.661002] ipw2100: Fatal interrupt. Scheduling firmware restart.
[ 897.661021] eth1: Fatal error value: 0x30016C44, address: 0x601F7C00, inta: 0x40000000
[ 897.664712] ipw2100 0000:02:04.0: PCI INT A disabled
[ 897.712041] ipw2100 0000:02:04.0: enabling device (0000 -> 0002)
[ 897.713549] ipw2100 0000:02:04.0: PCI INT A -> Link[C0C8] -> GSI 11 (level, low) -> IRQ 11
[ 897.713595] ipw2100 0000:02:04.0: restoring config space at offset 0x1
(was 0x2900002, writing 0x2900006)
[ 954.646319] ipw2100: Fatal interrupt. Scheduling firmware restart.
[ 954.646338] eth1: Fatal error value: 0x5000CF10, address: 0x61A00000, inta: 0x40000000
[ 954.646429] ipw2100 0000:02:04.0: PCI INT A disabled
[ 954.692041] ipw2100 0000:02:04.0: enabling device (0000 -> 0002)
[ 954.692063] ipw2100 0000:02:04.0: PCI INT A -> Link[C0C8] -> GSI 11 (level, low) -> IRQ 11
[ 954.692103] ipw2100 0000:02:04.0: restoring config space at offset 0x1
(was 0x2900002, writing 0x2900006)
[ 968.585409] ipw2100: Fatal interrupt. Scheduling firmware restart.
[ 968.585429] eth1: Fatal error value: 0x5000C9D0, address: 0x57E00500, inta: 0x40000000
[ 968.585517] ipw2100 0000:02:04.0: PCI INT A disabled
[ 968.632037] ipw2100 0000:02:04.0: enabling device (0000 -> 0002)
[ 968.632059] ipw2100 0000:02:04.0: PCI INT A -> Link[C0C8] -> GSI 11 (level, low) -> IRQ 11
[ 968.632099] ipw2100 0000:02:04.0: restoring config space at offset 0x1
(was 0x2900002, writing 0x2900006)
[ 972.269514] ipw2100 0000:02:04.0: PCI INT A disabled
[ 972.316041] ipw2100 0000:02:04.0: enabling device (0000 -> 0002)
[ 972.316400] ipw2100 0000:02:04.0: PCI INT A -> Link[C0C8] -> GSI 11 (level, low) -> IRQ 11
[ 972.316446] ipw2100 0000:02:04.0: restoring config space at offset 0x1
(was 0x2900002, writing 0x2900006)
As we can see, fatal interrupts did not disappear and are actually as frequent as before.
I also got these lines:
[ 2032.560413] ipw2100: exit - failed to send CARD_DISABLE command
[ 2032.560449] ipw2100: exit - failed to send CARD_DISABLE command
[ 2032.560491] ipw2100: exit - failed to send CARD_DISABLE command
[ 2032.560593] ipw2100: exit - failed to send CARD_DISABLE command
They came one after another, which does not give me any clue though.
I've started several big torrent downloads/seeds as a heavy load; maybe the card somehow
treats different flows differently, so this test should be heavier than lots
of pings. I first noticed the fatal interrupt problem with this kind of load,
when the card not only stopped working, but also printed some goodbye message.
So far the conclusion is not very optimistic: fatal interrupts always happen, no matter
what magic is enabled in the reset, which already tells me that the firmware is broken.
Hopefully additional reset games with power management will allow the card to work
even with those interrupts. Time will tell.
/devel/networking/ipw2100 :: Link / Comments ()
Wed, 24 Sep 2008
First ipw2100 testing: fatal interrupt.
I managed to compile a small enough kernel that boots on
my laptop (I do not know how long it took, since I fell asleep),
and managed to trigger the fatal interrupt error after just several seconds
of ping -f 192.168.1.1 -s 8192 on a freshly booted
machine. 192.168.1.1 is my gateway address.
Here is the result with the patch I posted to the mailing lists,
which was not acked, replied to or commented on though (well, I have to admit
that if I had sent it a couple of mails earlier, it could probably have found its
way into the tree, but I still believe that it would not have resulted in anything,
since everyone knows about this bug, it just is not fixed for some reason).
Intel developers (at least those who maintain the driver) continue to keep silent.
[ 613.960164] ipw2100: exit - failed to send CARD_DISABLE command
[ 624.456033] eth1: no IPv6 routers present
[ 690.721534] ipw2100: Fatal interrupt. Scheduling firmware restart.
[ 690.721554] eth1: Fatal error value: 0x5000C97C, address: 0x100E201C, inta: 0x40000000
[ 690.721580] ------------[ cut here ]------------
[ 690.721587] WARNING: at drivers/net/wireless/ipw2100.c:3188
ipw2100_irq_tasklet+0x8fe/0x9b0 [ipw2100]()
[ 690.721736] Pid: 0, comm: swapper Not tainted 2.6.27-rc7-mainline #2
[ 690.721744] [] warn_on_slowpath+0x5f/0x90
[ 690.721763] [] up+0x11/0x40
[ 690.721773] [] release_console_sem+0x190/0x1d0
[ 690.721786] [] enqueue_hrtimer+0x72/0xf0
[ 690.721795] [] printk+0x1b/0x20
[ 690.721805] [] ipw2100_irq_tasklet+0x8fe/0x9b0 [ipw2100]
[ 690.721831] [] hrtick_start_fair+0x157/0x170
[ 690.721844] [] enqueue_hrtimer+0x72/0xf0
[ 690.721855] [] snd_intel8x0_interrupt+0x1d7/0x250 [snd_intel8x0]
[ 690.721875] [] tasklet_action+0x46/0xb0
[ 690.721886] [] __do_softirq+0x75/0xf0
[ 690.721897] [] do_softirq+0x37/0x40
[ 690.721906] [] do_IRQ+0x40/0x70
[ 690.721917] [] getnstimeofday+0x37/0xe0
[ 690.721927] [] common_interrupt+0x23/0x28
[ 690.721937] [] sys_setpgid+0xd8/0x190
[ 690.721955] [] acpi_idle_enter_simple+0x15a/0x1c1 [processor]
[ 690.721980] [] cpuidle_idle_call+0x7b/0xc0
[ 690.721991] [] cpu_idle+0x46/0xe0
[ 690.722000] =======================
[ 690.722006] ---[ end trace 70268f59a00d957c ]---
[ 695.271318] ipw2100: Fatal interrupt. Scheduling firmware restart.
[ 695.271337] eth1: Fatal error value: 0x50014148, address: 0x60207E04, inta: 0x40000000
writing this note and starting over
[ 1520.709136] ipw2100: Fatal interrupt. Scheduling firmware restart.
[ 1520.709156] eth1: Fatal error value: 0x5000C96C, address: 0x538E7E40, inta: 0x40000000
[ 1550.954315] ipw2100: Fatal interrupt. Scheduling firmware restart.
[ 1550.954334] eth1: Fatal error value: 0x5000C99C, address: 0x08418004, inta: 0x40000000
[ 1592.175473] ipw2100: Fatal interrupt. Scheduling firmware restart.
[ 1592.175492] eth1: Fatal error value: 0x50018588, address: 0x57E77A00, inta: 0x40000000
So, these fatal error values and addresses do not tell me anything,
but since they are always different and at different addresses, I think the firmware
just loses its mind and stops responding.
The first line, where ipw2100 fails to send a command, was obtained during
ifdown of the interface. I never saw it before, but I do not think
it is related.
So, I need to move to the office and want to make some
distributed storage
changes, namely fix an issue with a name collision (the kernel already has a DVB card
module called dst.ko), and implement a better minor number allocation
scheme for the imported devices, since right now, after a node is created and destroyed,
a new one does not get the same number but a continuously increasing one, which looks
confusing and may cause a sysfs initialization error (when the system tries to
register a kobject with an existing name).
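One possible shape for such an allocation scheme is to hand out the lowest released number first and only grow the counter when nothing is free. The sketch below is an illustration under that assumption, not the actual DST code; MinorAllocator and its methods are hypothetical names.

```python
import heapq


class MinorAllocator:
    """Hands out the lowest free number, reusing released ones first."""

    def __init__(self):
        self._next = 0        # next never-used number
        self._free = []       # min-heap of released numbers
        self._used = set()    # numbers currently handed out

    def alloc(self):
        if self._free:
            number = heapq.heappop(self._free)
        else:
            number = self._next
            self._next += 1
        self._used.add(number)
        return number

    def release(self, number):
        # Refuse to free a number that is not in use, so the same number
        # can never be handed out twice.
        if number not in self._used:
            raise ValueError("minor %d is not allocated" % number)
        self._used.remove(number)
        heapq.heappush(self._free, number)


if __name__ == "__main__":
    minors = MinorAllocator()
    first = minors.alloc()
    second = minors.alloc()
    minors.release(first)
    # The node created after the release reuses the freed number.
    print(first, second, minors.alloc())
```

Because a number is only reused after it has been released, two live nodes can never share a name.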
I will continue the ipw2100 experiments tonight if I do not fall asleep again
because of jetlag. Stay tuned!
/devel/networking/ipw2100 :: Link / Comments ()
Tue, 09 Sep 2008
Userspace network stack git tree is now open.
One can check it via the web interface.
/devel/networking/unetstack :: Link / Comments ()
Sun, 07 Sep 2008
New netchannels release.
A network channel
is a peer-to-peer, protocol-agnostic communication channel
between hardware and userspace. It uses a unified cache to store its
channels. All protocol processing happens in process context.
This release brings a reworked (and very simple) unified
storage for all kinds of protocols (a netchannel can be created for any kind
of protocol), completely lockless data processing
(data queueing into the netchannel and its lookup in the global
storage are protected by RCU), and a simplified interface.
Feature list:
- Very high bulk performance with small packets (check userspace network stack for more details).
- Completely lockless netchannel processing (packet queueing and netchannel lookup in the global storage are protected by RCU).
- Unified storage for all kinds of protocols: TCP/UDP, IP/IPv6, whatever you decide to implement on top of the hardware layer you use.
- No protocol processing. This is pushed to the peer itself, for example to the userspace network stack.
- Ability to inject packets into the network without root privileges.
Userspace network stack
is the main user of the new netchannel subsystem.
The todo list includes:
- Ability to improve receiving latencies (queue packets from the hardware interrupt handler and not the software interrupt).
- Automatically scale the netchannel hash table on demand.
/devel/networking :: Link / Comments ()
New userspace network stack release.
Unetstack
is an extremely small and fast TCP/UDP/IP stack implementation on top of the packet socket or
netchannels interface.
This release includes a sync with the new netchannels interface and
drops routing table support: the userspace network stack is designed
around netchannels, so effectively a single opened object operates
with a single source and destination peer, and there is no need to
introduce unneeded caches, since all needed information can be stored in
the userspace network stack object itself.
/devel/networking/unetstack :: Link / Comments ()
Sat, 06 Sep 2008
Latencies in netchannels and sockets in the receiving path.
When a NIC's interrupt fires in Linux, the driver's handler
does not process the packet; it either schedules the NAPI handler,
which will push the packet higher up the stack, or submits the packet to
the software interrupt handler, which will do the same. This is
the first queue: interrupt -> software interrupt (or NAPI, which
happens in the same context).
When the NAPI polling handler (or networking software interrupt) fires,
it searches for the appropriate receiving socket, adds the data packet
to its queue and wakes up the receiving process. This is the second queue.
Netchannels currently work the same way, since their receive processing
happens in netif_receive_skb(), which may already be too late
for some low-latency applications.
As was noticed by Salvatore Del Popolo, it is possible to queue a packet
into a netchannel in netif_rx(), but that would limit netchannels to
working only with non-NAPI drivers. Instead I am thinking about creating a special
helper which will be invoked from the interrupt handler and, if there is no
appropriate netchannel to queue data into, will schedule NAPI or the network
softirq. So far this is only in the todo list though.
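The decision such a helper would make can be modelled in a few lines. This is a userspace illustration of the idea only: rx_from_irq(), the channels dictionary keyed by (protocol, port) and the backlog list are all hypothetical, and a real helper would live in the kernel and work on skbs.

```python
def rx_from_irq(packet, channels, softirq_backlog):
    """Queue straight into a matching netchannel, else defer to softirq.

    channels maps a (protocol, destination port) key to a list used as the
    netchannel queue; softirq_backlog models the usual NAPI/softirq path.
    """
    key = (packet["proto"], packet["dst_port"])
    channel = channels.get(key)
    if channel is not None:
        # Fast path: one queue between the interrupt and the reader.
        channel.append(packet)
        return "netchannel"
    # Slow path: nobody claimed this flow, fall back to the normal stack.
    softirq_backlog.append(packet)
    return "softirq"


if __name__ == "__main__":
    channels = {("udp", 53): []}
    backlog = []
    print(rx_from_irq({"proto": "udp", "dst_port": 53}, channels, backlog))
    print(rx_from_irq({"proto": "tcp", "dst_port": 80}, channels, backlog))
```

Packets that match a netchannel skip the second queue entirely, while everything else keeps the existing behaviour, so NAPI drivers are not excluded.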
What was really done is a complete rework of the initialization process,
netchannel creation and allocation, and its processing. Essentially I rewrote most
of the netchannels subsystem for good. It became lockless (RCU protected;
there is a hash bucket lock, which is only used when a netchannel is added to or removed
from the bucket, while searching is lockless), but the allocation process
is slower, since a netchannel now contains an array of skb pointers, which is allocated
at creation time. The size of the array is limited to the maximum number of packets the netchannel
can hold, a kind of queue size.
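The array-based queue can be pictured as a fixed ring of slots allocated once, when the channel is created. The sketch below only shows the bounded-array behaviour; SkbRing is a hypothetical name, the code is single-threaded, and it makes no attempt to reproduce the lockless kernel implementation.

```python
class SkbRing:
    """Fixed-size ring of packet slots, allocated once at creation time."""

    def __init__(self, size):
        self.slots = [None] * size  # the only allocation the queue ever makes
        self.head = 0               # next slot to read
        self.tail = 0               # next slot to write
        self.count = 0              # number of occupied slots

    def put(self, skb):
        """Store a packet; return False when the ring is full."""
        if self.count == len(self.slots):
            return False
        self.slots[self.tail] = skb
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return True

    def get(self):
        """Return the oldest packet, or None when the ring is empty."""
        if self.count == 0:
            return None
        skb = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return skb


if __name__ == "__main__":
    ring = SkbRing(2)
    print(ring.put("pkt-a"), ring.put("pkt-b"), ring.put("pkt-c"))
    print(ring.get(), ring.get(), ring.get())
```

Creation pays for the whole array up front, which matches the slower allocation mentioned above, and queueing afterwards never allocates.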
/devel/networking :: Link / Comments ()
Fri, 05 Sep 2008
Netchannels come to the start line.
Or finish one. Depending on the point to look from.
zbr@gavana$ make SUBDIRS=net/core/netchannel/
WARNING: Symbol version dump /home/zbr/aWork/git/linux-2.6/linux-2.6.netchannels/Module.symvers
is missing; modules will have no dependencies and modversions.
CC net/core/netchannel/netchannel.o
CC net/core/netchannel/storage.o
CC net/core/netchannel/user.o
LD net/core/netchannel/built-in.o
Building modules, stage 2.
MODPOST 0 modules
zbr@gavana$ wc -l net/core/netchannel/*.c include/linux/netchannel.h
430 net/core/netchannel/netchannel.c
140 net/core/netchannel/storage.c
244 net/core/netchannel/user.c
92 include/linux/netchannel.h
906 total
I want to make a new netchannels
release this weekend. It will not contain a dynamically resizable hash table though, but if there are no major
bugs in the core, I will consider completing it for the new release.
I also plan to convert the userspace network stack
into a libtcp.so or libunetstack.so library, so it would be much easier to create applications
with this stack, no matter whether it is implemented on top of netchannels or the packet socket, but so far this is only a plan.
/devel/networking :: Link / Comments ()
Mon, 01 Sep 2008
Netchannels strike back.
A while ago I implemented Van Jacobson's idea
of netchannels - a peer-to-peer
connection module, which pushes all protocol processing as close to the end peers as possible.
In my first implementation, TCP processing was done on behalf of the running process (instead of mostly in bottom-half context),
which resulted in slightly better performance. Then I implemented the
userspace network stack
as a continuation of this idea. Despite its huge performance improvement, I do not think the particular reason
is the netchannels architecture, but rather the number of syscalls that have to be made to process a bulk traffic flow
of small packets. Nevertheless it can also be considered a netchannels architecture improvement, which
resulted in such exceptionally good batching abilities.
Now I want to move further: the kernel side of netchannels will be made completely lockless and at the same time
very cache-friendly. As in the first implementation, the idea is not completely mine: the approach I will test
is based on Van Jacobson's array design for storing network buffers.
During its lifetime, netchannels got NAT support (actually just to show those people who do not believe
in the netchannels architecture that it is possible to implement filtering and packet mangling), but now I am dropping it
from the project. Netchannels also got a tricky multidimensional trie-based storage, which, after being ported
to the socket core, resulted in a noticeable performance
win, although I did not complete
it to support statistics. Actually the netchannels implementation of this trie is broken, and it required
quite a few steps in the socket code to be fixed.
Now I am dropping it from the netchannels patchset too and moving to the usual hash tables.
I will add RCU locking for them and make the netchannels hash table optionally automatically resizable.
This feature does not exist in the socket hash tables, but right now I want to experiment with a smaller code base,
since the algorithm I have in mind is a bit tricky.
So, there are lots of interesting ideas which I've started to work on and plan to finish sooner rather than later.
But since I have to go to the US consulate for an interview, then want to finish apartment development tasks,
and then, hopefully, go to the Kernel Summit and Plumbers conference, it can take quite long... Please
note that I have not forgotten about other projects.
Code is not dead unless marked as such in the TODO list :)
Stay tuned nevertheless!
/devel/networking :: Link / Comments ()
Sun, 10 Aug 2008
A bored Russian.
Yes, that is what I was called by The Inquirer.
The magazine even put it in bold capital letters :) The rest of the article is quite wrong though (i.e. it is not what was written in my
blog).
Slashdot also got an entry; I was called a hacker and then
a physicist there.
What next? It is really great fun! :)
/devel/networking/dns :: Link / Comments ()
Sat, 09 Aug 2008
Russian physicist.
That is what I was called in the
New York Times
amid all this hype about the DNS poisoning attack.
Unfortunately I no longer remember what the electron charge is
or how to describe the Higgs boson even to myself. Things moved away almost 10 years
ago :)
The article says that DJBDNS does not suffer from this attack. It does. Everyone does.
With some tweaks it can take longer than with BIND, but overall the problem is there.
But that's enough for this story. I'm moving on to other interesting developments.
/devel/networking/dns :: Link / Comments ()
Fri, 08 Aug 2008
Successfully poisoned the latest BIND with fully randomized ports!
The exploit
had to send more than 130 thousand requests for fake
records like 131737-4795-15081.blah.com to be able to match the port
and ID and insert a poisoned entry for poisoned_dns.blah.com.
# dig @localhost www.blah.com +norecurse
; <<>> DiG 9.5.0-P2 <<>> @localhost www.blah.com +norecurse
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6950
;; flags: qr ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; QUESTION SECTION:
;www.blah.com. IN A
;; AUTHORITY SECTION:
www.blah.com. 73557 IN NS poisoned_dns.blah.com.
;; ADDITIONAL SECTION:
poisoned_dns.blah.com. 73557 IN A 1.2.3.4
# named -v
BIND 9.5.0-P2
BIND used a fully randomized source port range, i.e. around 64000 ports. Two attacking servers,
connected to the attacked one via a GigE link, were used;
each one attacked 1-2 ports with the full ID range. Usually an attacking server is able to send about 40-50 thousand
fake replies before the remote server returns the correct one, so if the port is matched, the probability of
successful poisoning is more than 60%.
The attack took about half a day, i.e. a bit less than 10 hours.
So, if you have a GigE LAN, any trojaned machine can poison your DNS in one night...
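The "more than 60%" figure follows from simple arithmetic on the numbers above, and the same arithmetic gives a rough feel for how many queries are needed. The sketch below is only a back-of-the-envelope model: it assumes a uniformly random port out of 64000, a 16-bit ID, fake replies that never repeat an ID, and independent attempts. The function names are made up for this example.

```python
PORT_RANGE = 64000   # "around 64000 ports", as stated in the note
ID_SPACE = 65536     # 16-bit DNS transaction ID


def id_hit_probability(fake_replies):
    """Chance that the right ID is among the fake replies sent to one port,
    given that the port already matches and no ID is sent twice."""
    return min(fake_replies, ID_SPACE) / ID_SPACE


def attempt_probability(ports_covered, fake_replies):
    """Chance that a single query gets poisoned (uniform port assumption)."""
    return (ports_covered / PORT_RANGE) * id_hit_probability(fake_replies)


def success_within(attempts, p):
    """Chance of at least one success in `attempts` independent tries."""
    return 1.0 - (1.0 - p) ** attempts


if __name__ == "__main__":
    # 40000 of 65536 IDs is about 0.61, hence "more than 60%".
    print(id_hit_probability(40000))
    # Two servers covering one port each, 130 thousand queries in total.
    p = attempt_probability(2, 40000)
    print(p, success_within(130000, p))
```

The real attack is messier (replies are dropped, ports are not perfectly uniform), so these numbers are an estimate rather than a measurement.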
/devel/networking/dns :: Link / Comments ()
Wed, 06 Aug 2008
Additional note on DNS poisoning attack IN A entry injection.
Actually I did
inject an 'IN A' entry for poisoned_dns.blah.com into the cache.
So, to inject an arbitrary 'A' entry for attacked.domain.com into the cache,
one has to bruteforce the ID (and match the source port if needed) for any other subdomain of the
same level, i.e. subdomain-123.domain.com, and put into the additional section
of that message an 'IN NS' record, which points to attacked.domain.com,
and an 'IN A' record with a fake IP address for that 'IN NS' one,
i.e. an 'IN A' record for attacked.domain.com pointing to 1.2.3.4.
This method is a bit less flexible than just poisoning any subdomain with an NS
record which points to a controlled DNS server, but it does not require that server
to exist, so it can route traffic directly to your site without first asking
your DNS server where the given subdomain lives.
# ping poisoned_dns.blah.com -c100 > /dev/null 2>&1 &
# tcpdump -nn icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
11:27:20.422124 IP devfs1 > 1.2.3.4: ICMP echo request, id 55367, seq 5, length 64
11:27:20.422333 IP gw > devfs1: ICMP host 1.2.3.4 unreachable, length 36
11:27:21.422126 IP devfs1 > 1.2.3.4: ICMP echo request, id 55367, seq 6, length 64
11:27:21.422310 IP gw > devfs1: ICMP host 1.2.3.4 unreachable, length 36
11:27:22.422123 IP devfs1 > 1.2.3.4: ICMP echo request, id 55367, seq 7, length 64
11:27:22.422286 IP gw > devfs1: ICMP host 1.2.3.4 unreachable, length 36
11:27:23.423122 IP devfs1 > 1.2.3.4: ICMP echo request, id 55367, seq 8, length 64
11:27:23.423311 IP gw > devfs1: ICMP host 1.2.3.4 unreachable, length 36
/devel/networking/dns :: Link / Comments ()
More interesting (and complete) hack of the DNS.
I managed to inject the following poisoning information:
# dig @localhost +norecurse www.blah.com any
;; ANSWER SECTION:
www.blah.com. 123452 IN NS poisoned_dns.blah.com.
;; AUTHORITY SECTION:
www.blah.com. 123452 IN NS poisoned_dns.blah.com.
;; ADDITIONAL SECTION:
poisoned_dns.blah.com. 123452 IN A 1.2.3.4
# dig @localhost www.blah.com
The last command results in the following dump:
01:36:14.567622 IP devfs1.5301 > 1.2.3.4.53: 42416% [1au] A? www.blah.com. (41)
01:36:15.067816 IP devfs1.5301 > 1.2.3.4.53: 29011% [1au] A? www.blah.com. (41)
01:36:15.568013 IP devfs1.5301 > 1.2.3.4.53: 30586 A? www.blah.com. (30)
01:36:16.568182 IP devfs1.5301 > 1.2.3.4.53: 38101 A? www.blah.com. (30)
01:36:18.568429 IP devfs1.5301 > 1.2.3.4.53: 64596 A? www.blah.com. (30)
01:36:22.568634 IP devfs1.5301 > 1.2.3.4.53: 59943 A? www.blah.com. (30)
01:36:30.568960 IP devfs1.5301 > 1.2.3.4.53: 39614 A? www.blah.com. (30)
01:36:40.569163 IP devfs1.5301 > 1.2.3.4.53: 13769 A? www.blah.com. (30)
So, effectively, if I controlled the 1.2.3.4 machine I would be able to
answer those queries with a controlled address. I was not able
to inject an 'A' record for any domain except the one which happened to
match the ID in my fake responses, and it looks like 'A' records are not accepted
at all (I'm far from being a DNS expert).
So, actually I consider this exploit
complete: it is capable of arbitrary
NS record poisoning. Its performance is rather good: a poisoning attack
requires 1-3 (sometimes more, it heavily depends on link capacity and auth DNS server
performance) queries from the client to the authoritative DNS server. An attacking server
connected via a gigabit link
is easily capable of saturating the whole DNS ID space while the attacked resolver waits for
a reply from the remote server. Math tells me that a 100 Mbit connection will require
about two times more requests to be sent by the client, which is still not that much.
The server side of the exploit requires root privileges to run, since it uses a raw socket
to create a datagram with the IP addresses used by the attacked server and the appropriate authoritative
name server. The client connects to one or more attacking servers, sends them the appropriate response message
and issues a DNS request for that response to the attacked server. The poisoning servers start to
flood the attacked server with replies until the client sends them the next reply to bomb with. When the client receives
a fake answer from the poisoned DNS server, the attack stops. The exploit allows you to specify
the name server to attack, the NS query to inject and the DNS name to have that NS record.
Having hard GigE performance numbers, I can say that port randomization does not completely
solve the DNS poisoning attack (although it makes it harder), since with such link capacity the attacker only needs to guess
the port, and the ID space will be bruteforced before a reply is received from the authoritative name server.
So far I cannot test a randomized-port BIND, since the local Debian mirror somehow has an unsigned package
for it, so I will not install it right now, but will do it later and provide numbers for a randomized
server. I expect to be able to poison even that server, although not as fast as with a constant port.
Have fun!
/devel/networking/dns :: Link / Comments ()
Tue, 05 Aug 2008
DNS cache poisoning attack succeeded for the constant port.
Hacking rox!
# dig @devfs1 3-c13a-15729.paypal.com.
; <<>> DiG 9.5.0-P2 <<>> @devfs1 3-c13a-15729.paypal.com.
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 18330
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 0
;; QUESTION SECTION:
;3-c13a-15729.paypal.com. IN A
;; ANSWER SECTION:
3-c13a-15729.paypal.com. 123405 IN A 1.2.3.4
# dig 1-71b2-16080.money.paypal.com.
...
;; ANSWER SECTION:
1-71b2-16080.money.paypal.com. 123421 IN A 1.2.3.4
# dig @localhost 29-07f3-16098.test.com
...
;; ANSWER SECTION:
29-07f3-16098.test.com. 123411 IN A 1.2.3.4
It is not a complete win yet, though: the additional section from the poisoning packet
was parsed, and the entry looks like it was inserted into the DNS server database, but a subsequent
request ends up querying the remote server. Probably because my fake requests do not
contain an authority section, so I will extend them and continue this game :)
Ugh, 4 A.M. My body, soul and whatever else wants to sleep will all hate me tomorrow.
/devel/networking/dns :: Link / Comments ()
Mon, 04 Aug 2008
Got back testing machines.
I was called a saboteur, although no one was able
to answer what will happen if the same load is
generated by some virus or trojan.
Nevertheless I played some political games, had some talks,
which I managed to cool down from an angry to a fun tone,
and eventually got access again.
I installed BIND on one of the servers, which by coincidence
does not have the port randomization fix, so it issues all requests from
port 5301. I fixed IP header initialization, so now the attacking
servers send their fake DNS replies not with their own IP address as the source
(that was likely one of the main reasons, if not the main one, the machines
were disabled), but using the appropriate auth DNS server IP address.
I also found an interesting effect in the DNS server traffic: the resolver's network
channel is so heavily loaded with small fake UDP DNS replies that other packets
can hardly sneak in, so effectively the real reply arrives almost after the whole
ID range has been bruteforced. I remind you that these are GigE linked
machines, and the attacking servers send about 200-300 thousand packets per second
on average; the drop rate is about 30% (only about 45 thousand packets are received
out of more than 65,000 sent).
This basically means that in this particular case the probability of successful poisoning
with port randomization is limited only by the random port number, and the random ID
plays almost no role (since traffic generated by the attacking server will
eat the bandwidth and will not allow the real reply to come first), so one just has to
guess the port number and the attack will succeed.
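The figures in this note can be cross-checked with two one-line formulas. This is just the arithmetic behind the quoted numbers, with rounded inputs taken from the text; drop_rate() and sweep_seconds() are names invented for the example.

```python
ID_SPACE = 65536  # 16-bit DNS transaction ID


def drop_rate(sent, received):
    """Fraction of packets that never reached the resolver."""
    return 1.0 - received / sent


def sweep_seconds(pps, ids=ID_SPACE):
    """Time needed to send one reply per ID at the given packet rate."""
    return ids / pps


if __name__ == "__main__":
    # About 45 thousand received out of about 65 thousand sent.
    print(round(drop_rate(65000, 45000), 3))
    # One full ID sweep at the lower and upper quoted send rates.
    print(sweep_seconds(200000), sweep_seconds(300000))
```

At the quoted rates a full sweep of the ID space takes a fraction of a second, which is consistent with the real reply arriving only after the whole range has been tried.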
I will try to prove this theory tomorrow, as well as confirm that my
exploit works.
/devel/networking/dns :: Link / Comments ()
Sat, 02 Aug 2008
DNS cache poisoning attack results.
My account was disabled and access to the servers turned off.
And that is just because of several minutes of 200+ kpps
UDP DNS response storms from three machines to one of the corporate DNS servers
(I think there are hundreds of them, I just got access to a couple).
Who the hell monitors it on Saturday night at 2 A.M.? I specially selected
a time when normal people sleep, drink or have sex, but do not work and watch DNS server load.
The only problem actually is that those servers were also used for
POHMELFS
development and testing. I am still able to work with two Xen domains though
(where I actually develop and test initial implementations of all my current projects
without various stress loads), so development will not stop.
I will pretend to be an idiot and to have viruses there. Linux kernel viruses.
And of course I will promise to install all updates and to be careful next time.
Next time I will not attack a known nameserver, but will install my own.
It is all about science and not about doing harm (I even poisoned a non-existent domain).
Or they will take away my toys and kick my ass, but I will resist,
so there will be no interesting notes about the DNS cache poisoning
attack (although no, I will be able to run one on my desktop
via loopback, it is quite a fast machine) and nice benchmark graphs :)
/devel/networking/dns :: Link / Comments ()
Fri, 01 Aug 2008
DNS cache poisoning attack exploit completed.
I believe I've completed a fairly distributed client/server network exploit, which is capable of poisoning
a given DNS cache whether it works with a single source port or randomizes it over some port range.
I already described
the client-server architecture, so only short notes here.
The client broadcasts a set of ports and fake queries to a number of poisoning servers, and then asks the attacked
name server a specially crafted query for a name which does not exist in the attacked domain. The poisoning servers send
lots of replies to the attacked DNS server with fake IP addresses and ports, which pretend to be the address/port
of the authoritative DNS server. Each reply contains an answer section for the current client query and an additional
section, which contains information about the attacked domain: the former is a subdomain of the latter, like
querying the 'IN A' record for '123-456.www.blahblah.com' while the reply contains 'IN A' data for '123-456.www.blahblah.com'
in the answer section and 'IN A' data for 'www.blahblah.com' in the additional section.
The client then checks the reply (or times out), and if it does not contain the given record for the query, sends the next packet
to the poisoning servers and the appropriate request to the attacked caching name server.
So far I have not succeeded in this attack, but I managed to load the network (and actually the main name server) so much that really lots of people around started to complain
that they had troubles... This is also a result actually, but not the one I expected, so I will postpone the attack to
late tonight.
Tcpdumps show that the broadcast data is valid, but there was no actual poisoning, so I will probably install my own
server and configure it to use a single port. Currently the attacked server has a not very random
port distribution, but still not a constant one. My poisoning servers (two servers connected via a gige link to the same network as the attacked server)
use 100% CPU each, since they need to calculate the UDP checksum for each packet (since it has a different ID and/or port number) and
use a raw socket to transmit data (to specify the source and destination addresses of the authoritative and attacked servers). Each server is
usually capable of transmitting about 30k-130k packets per second, which corresponds to 1-20 ports (and the whole 64k ID range per port)
during the 5 second timeout interval before the next request. This is of course not enough for a 100% guarantee, but I think that after quite a long
time the attack may succeed, so I will put it in action for the next weekend or at least a night.
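The per-packet CPU cost mentioned above comes from the fact that the UDP checksum covers a pseudo-header (source and destination IP), the UDP header and the payload, so every change of DNS ID or port needs a new checksum. Below is a minimal Python sketch of that calculation as defined in RFC 768. The function names are illustrative only and are not taken from the exploit source.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum used by the Internet checksum."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of words
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src_ip: str, dst_ip: str, udp_length: int) -> bytes:
    """IPv4 pseudo-header: checksummed, but never put on the wire."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, socket.IPPROTO_UDP, udp_length))


def build_udp_segment(src_ip, dst_ip, src_port, dst_port, payload: bytes) -> bytes:
    """Return UDP header + payload with the checksum field filled in."""
    length = 8 + len(payload)
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    total = ones_complement_sum(
        pseudo_header(src_ip, dst_ip, length) + header + payload)
    checksum = (~total & 0xFFFF) or 0xFFFF   # 0 is reserved for "no checksum"
    return struct.pack("!HHHH", src_port, dst_port, length, checksum) + payload
```

A receiver verifies the segment by summing the pseudo-header and the whole segment again; a valid segment sums to 0xFFFF.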
Bert Hubert did some math on this kind
of attack; the result is not very promising for the attacker, but the probability is still far from zero.
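A much cruder estimate than Bert Hubert's can be made from the packet rates quoted above. The sketch below assumes that every guessed port needs the whole 16-bit ID range, that spoofed replies are spread evenly, and that the resolver picks its port uniformly from a known range; the function names and the 15000-port range in the usage note are assumptions for illustration.

```python
ID_SPACE = 1 << 16  # the DNS query ID is 16 bits


def ports_covered(packets_per_second, timeout_seconds, servers=1):
    """How many ports can be swept over all IDs within one timeout window."""
    total_packets = packets_per_second * timeout_seconds * servers
    return total_packets / ID_SPACE


def hit_probability(packets_per_second, timeout_seconds, port_range, servers=1):
    """Chance that one window contains the right (port, ID) pair,
    assuming the resolver picks uniformly among port_range ports."""
    covered = ports_covered(packets_per_second, timeout_seconds, servers)
    return min(1.0, covered / port_range)
```

With a 5 second window this gives about 2.3 ports at 30k packets per second and about 9.9 ports at 130k for one server, or up to about 19.8 ports for two servers, which is in the same ballpark as the 1-20 ports mentioned above. For a hypothetical 15000-port range, `hit_probability(130_000, 5, 15_000, servers=2)` is roughly 0.0013 per attempt.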
I do not promise success, but I would like to know if I'm on the right track, so the attack has been started...
P.S. DNS has its own tag in the blog now.
P.P.S. The source code of the distributed cache poisoning exploit (it may be completely incorrect!) can be found in the archive. Sorry,
no usage details, but you can use the '-h' command line parameter :)
/devel/networking/dns :: Link / Comments ()
Thu, 31 Jul 2008
DNS cache poisoning client/server architecture.
So far I have only implemented a simple flooder of the requests,
which takes the number of destination ports as a parameter and two
names and addresses to put into the answer and additional sections
of the DNS reply. It uses a UDP socket, so the source address is not
that of the server which is supposed to answer the given query, so
this application will not actually work, and I need to implement
sending via a packet socket and substitute the source IP address with
the authoritative DNS server's one.
The poison flooder also should not use only one name/address in the answer section,
but instead it should iterate together with the client, so that the corresponding request
and answer are synchronized.
So far, the initial design of the client/server architecture of this
small project looks like this: depending on flags, either the client
connects to multiple flood servers or vice versa, then the client
sends a message to each server which specifies the port and ID ranges to attack,
the attacked DNS server IP, the query name, the source address
(pretending to be an authoritative name server) and the additional resource
record data to put into replies (which will poison the cache).
Each server starts sending that data to the specified name server
with the source address changed to the authoritative name server's one
and with the ID and port varied over the given ranges. When the client has finished
broadcasting the request data to all flood servers, it sends a request
to the attacked DNS server to resolve the given query name. Now the
flood servers race with the authoritative one to provide an answer. When
the client receives the answer, it checks whether it looks like the poisoned data
we want to get, or the real answer (which should be NXDOMAIN, since we
resolve non-existent names). In the former case we exit the process and
enjoy the result, otherwise the client specifies the next name to resolve and
the same starts again.
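The control message described above could be serialized roughly as follows. This is only a sketch of one possible layout: the class name, field names and wire format are all hypothetical and are not the format used by the actual code.

```python
import socket
import struct
from dataclasses import dataclass

# Fixed part: port range, ID range, three IPv4 addresses, two name lengths.
_FIXED = struct.Struct("!HHHH4s4s4sHH")


@dataclass
class FloodRequest:
    """Hypothetical client-to-flood-server message (layout is an assumption)."""
    port_first: int
    port_last: int
    id_first: int
    id_last: int
    target_ip: str      # attacked DNS server
    spoofed_ip: str     # authoritative server whose address is imitated
    poison_ip: str      # address to put into the additional record
    query_name: str     # non-existent name the client is going to ask for
    poison_name: str    # name carried in the additional section

    def pack(self) -> bytes:
        query = self.query_name.encode("ascii")
        poison = self.poison_name.encode("ascii")
        return _FIXED.pack(
            self.port_first, self.port_last, self.id_first, self.id_last,
            socket.inet_aton(self.target_ip),
            socket.inet_aton(self.spoofed_ip),
            socket.inet_aton(self.poison_ip),
            len(query), len(poison),
        ) + query + poison

    @classmethod
    def unpack(cls, data: bytes) -> "FloodRequest":
        (p0, p1, i0, i1, target, spoofed, poison_ip,
         qlen, plen) = _FIXED.unpack_from(data)
        names = data[_FIXED.size:]
        if len(names) != qlen + plen:
            raise ValueError("truncated or oversized message")
        return cls(p0, p1, i0, i1,
                   socket.inet_ntoa(target), socket.inet_ntoa(spoofed),
                   socket.inet_ntoa(poison_ip),
                   names[:qlen].decode("ascii"), names[qlen:].decode("ascii"))
```

Keeping the fixed-size part first and the variable-length names last means a receiver can validate the total length before touching the names.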
Looks interesting...
/devel/networking/dns :: Link / Comments ()
Wed, 30 Jul 2008
Simple DNS server/resolver.
The right time to hack on a DNS server is the middle of the night: it is 3 A.M. here
and I've just completed the initial draft of a trivial DNS server, which
is only capable of receiving a datagram on a predefined port, parsing it and
filling in a reply with a static "IN A" record (I think I will add a config file).
This record is placed into the 'answer' and 'additional' resource record sections,
then the whole message is sent back to the client.
That's how it looks with the standard UNIX dig command:
$ dig @localhost -p 1025 www.google.com
;; Warning: query response not set
; <<>> DiG 9.4.2-P1 <<>> @localhost -p 1025 www.google.com
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51486
;; flags: rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;www.google.com. IN A
;; ANSWER SECTION:
www.google.com. 123456 IN A 195.178.208.66
;; ADDITIONAL SECTION:
www.google.com. 123456 IN A 195.178.208.66
;; Query time: 15 msec
;; SERVER: 127.0.0.1#1025(127.0.0.1)
;; WHEN: Wed Jul 30 02:56:23 2008
;; MSG SIZE rcvd: 64
There are several warnings, which I will fix later, but the main part is
the section content: www.google.com obviously does not have the IP address
of my blog site. Its TTL is usually not equal to 123456 either.
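For illustration, here is a minimal Python sketch of how such a static reply can be assembled from a query according to the RFC 1035 message layout. It is not the server's actual code: the function names are made up, it assumes exactly one question with an uncompressed name, and unlike the draft above it sets the QR (response) bit.

```python
import socket
import struct

HEADER = struct.Struct("!HHHHHH")  # id, flags, qdcount, ancount, nscount, arcount


def question_end(packet: bytes, offset: int = 12) -> int:
    """Return the offset just past the first question (name, type, class)."""
    while packet[offset] != 0:
        if packet[offset] > 63:              # compression pointer or bad label
            raise ValueError("compressed or invalid name in question")
        offset += 1 + packet[offset]         # skip the length byte and the label
    return offset + 1 + 4                    # zero byte + QTYPE + QCLASS


def build_static_reply(query: bytes, address: str, ttl: int = 123456) -> bytes:
    """Answer any single-question query with the same static 'IN A' record."""
    query_id, flags, qdcount, _, _, _ = HEADER.unpack_from(query)
    if qdcount != 1:
        raise ValueError("exactly one question is expected")
    question = query[12:question_end(query)]
    reply_flags = 0x8000 | (flags & 0x0100)  # QR=1, keep the RD bit of the query
    # 0xC00C is a compression pointer to the question name at offset 12.
    record = (struct.pack("!HHHIH", 0xC00C, 1, 1, ttl, 4)
              + socket.inet_aton(address))
    header = HEADER.pack(query_id, reply_flags, 1, 1, 0, 1)
    return header + question + record + record   # answer + additional
```

With these choices a reply to the www.google.com query is 64 bytes long, the same size dig reports above, although that alone does not prove the original server lays out its reply in the same way.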
The game continues, but I need some sleep...
/devel/networking/dns :: Link / Comments ()
Tue, 29 Jul 2008
Some DNS port distribution data.
Gathered late at night today, so that the DNS server would
not be disturbed too much by other users.
The graphs below show the source port cloud and distribution
of some BIND (I do not know the version) for a thousand
runs. Each request asked for a non-existent subdomain of
a domain whose server I control, so I was able to capture dumps
and analyze them a bit.

These graphs show the source port cloud and its distribution.
Each histogram bar corresponds to the number of hits in a 100-port range,
the start of the range is shown in the X axis labels.
First, the port is randomly selected in the 50k-65k range,
so one needs to guess among a much smaller number of ports.
Second, even in 1 thousand requests there are lots of
requests with the same port (stats show that there are 149 ports
which were used 2 or more times in the above 1000 runs,
there is even a single port which was used 4 times).
If we select a range of 100 ports, then the corresponding distribution
is shown on the graph.
Such behaviour allows one to limit the source port range even more.
Now, DNS IDs.

The whole range of IDs is used, and their distribution (each histogram bar
corresponds to the number of IDs in the corresponding range of 100 IDs) is more uniform.
There were only 9 IDs used twice per 1000 runs.
But since I do not know the exact load of the analyzed DNS server (and it can be
high even at 3 A.M.), I can not say whether these numbers are due to the port/ID
selection algorithm implementation or just because the load was high and there were
actually not only my 1000 requests.
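One way to judge the repeat counts above is to compare them with what a uniform random choice would give. The sketch below computes the expected number of values seen two or more times, plus the 100-wide bucketing used for the histograms. The helper names are made up, and treating the 50k-65k range as 15000 equally likely ports is an assumption.

```python
from collections import Counter


def expected_repeated_values(draws: int, space: int) -> float:
    """Expected number of distinct values drawn two or more times when
    `draws` values are picked uniformly from `space` possibilities."""
    p = 1.0 / space
    p_never = (1 - p) ** draws
    p_once = draws * p * (1 - p) ** (draws - 1)
    return space * (1 - p_never - p_once)


def bucket_histogram(values, width=100):
    """Count hits per `width`-sized range, keyed by the start of the range."""
    return Counter((v // width) * width for v in values)


def repeated_values(values):
    """Number of distinct values which occur at least twice."""
    return sum(1 for count in Counter(values).values() if count >= 2)
```

For 1000 draws from 65536 IDs the expectation is between 7 and 8 repeated values, so the 9 repeated IDs look consistent with a uniform choice. For 1000 draws from a hypothetical 15000-port range it is roughly 32, so 149 repeated ports would be well above what a uniform choice over that range predicts.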
To play further with DNS caches I decided to install a local
DNS server first and test things with it.
/devel/networking/dns :: Link / Comments ()
Sun, 27 Jul 2008
Lots of talks about DNS cache poisoning attack.
There are two types of this attack: DNS query ID guessing and
request source port guessing for servers which use a randomized source
port, which should be turned on after Dan Kaminsky's
alert.
The DNS ID is only 16 bits, so it can be guessed rather fast, one just needs to force someone
who uses the attacked DNS cache to issue the appropriate requests. When a reply is received by
the DNS resolver, it is stored there for a predefined amount of time (the TTL parameter provided
by a higher-level DNS resolver or eventually the authoritative name server). Dan found that
the attacker can actually ask not for the attacked domain, but for some subdomain of it
(if the attacker tries to point www.microsoft.com to its own IP, it can force sending DNS
requests for 1.microsoft.com, 2.microsoft.com and so on), and put data about the actual
target into additional resource records attached to all datagrams. So, when it eventually
wins the race, it can store (among lots of subdomains) the needed pointers in the attacked DNS cache.
I've just thought that this attack would not be possible if all queries from DNS
resolvers to higher-level resolvers and/or authoritative name servers happened over
TCP instead of the more common UDP. There would be no need to issue requests from random ports anymore,
and no need to parse and drop additional resource records. There would be no problems with truncation
of large messages... But to play a bit with the whole idea I'm implementing a simple DNS
query/response processor. Maybe I will play a bit with poisoning a local cache (the ISP at the office uses only 6 different
ports to send requests), although its main goal is an
IP-over-DNS tunnel.
This is kind of a real rest after VISA/hotel paperwork. I was told that if I am
called to the embassy for the interview, chances are high that the VISA will be declined because of my
sense of humor :)
Update:
zbr@gavana:~/aWork/tmp/dns$ ./query -a 195.178.208.66 -i 0x1234 -q tservice.net.ru
query: 'tservice.net.ru', class: 1, type: 1, server: 195.178.208.66:53, protocol: 17, id: 1234.
Connected to 195.178.208.66:53.
id: 1234: flags: resp: 0, opcode: 0, auth: 0, trunc: 0, RD: 1, RA: 0, rcode: 8.
: question: 1, answer: 1, auth: 2, addon: 2.
: question: name: 'tservice.net.ru.', type: 1, class: 1.
: name: 'tservice.net.ru.', type: 1, class: 1, ttl: 86400, rdlen: 4, rdata: 195.178.208.66
: name: 'tservice.net.ru.', type: 2, class: 1, ttl: 86400, rdlen: 14, rdata: ns.tservice.ru.
: name: 'tservice.net.ru.', type: 2, class: 1, ttl: 86400, rdlen: 7, rdata: dns2.tservice.ru.
: name: 'ns.tservice.ru.', type: 1, class: 1, ttl: 86400, rdlen: 4, rdata: 195.178.208.66
: name: 'dns2.tservice.ru.', type: 1, class: 1, ttl: 86400, rdlen: 4, rdata: 62.141.76.164
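A query like the one sent above can be built in a few lines. This is a Python sketch of the RFC 1035 wire format, not the code of the `./query` tool; the function names are illustrative.

```python
import struct


def encode_name(name: str) -> bytes:
    """Encode 'a.b.c' as length-prefixed labels ending with a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) < 64:
            raise ValueError("each label must be 1..63 bytes long")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, query_id: int, qtype: int = 1, qclass: int = 1) -> bytes:
    """Build a one-question query; type 1 is A and class 1 is IN."""
    flags = 0x0100                           # only RD (recursion desired) set
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    return header + encode_name(name) + struct.pack("!HH", qtype, qclass)
```

The resulting bytes can be sent as a single UDP datagram to port 53 of the server; `build_query("tservice.net.ru", 0x1234)` is 33 bytes long.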
And the DNS protocol gets the first prize among the ugliest pieces of crap.
Now it's time to create the DNS server itself, which will get requests (the above dump shows a BIND session),
parse them and perform appropriate actions, like sending a reply with specially crafted additional resource
records, either a NULL one for example (it can contain up to 64k of data) or TXT (a length byte followed by a
character string, there may be multiple strings as long as the total length (including the length bytes themselves)
is less than 64k). Or an additional A resource record, which may contain information about the domain to poison...
/devel/networking/dns :: Link / Comments ()
Tue, 01 Jul 2008
Why is blocking sending considered harmful?
I frequently hear that whatever server you implement, it has to
be non-blocking, since in the case of parallel sending this allows one to
send multiple requests to fast servers while not sending data to a
slow server, since the non-blocking socket will return EAGAIN.
This is only a half-right solution: when we have to put given data on
all servers, and can not free it until all servers have replied with an acknowledgement,
non-blocking mode can bring more damage than gain.
Mainly because it
allows requests which are still queued for the slow server,
but were already sent to the fast ones, to eat all the memory.
In this case the higher-level application (consider a simple application which generates
some data and writes it into a file in a distributed filesystem, which writes
the file to several servers) will never block, since the transfer
to the fast servers completes quickly, and it will provide more and more data,
which will consume all RAM.
It is possible to deadlock the system in this case,
since to send some data to a remote server we always have to allocate at least some
memory to put network headers into. With the non-blocking solution the system will consume
all memory and kick itself into a coma.
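The memory growth described above can be shown with a toy simulation. This is only a model under simplifying assumptions (one request produced per tick, the fast server acknowledging immediately, the slow server acknowledging one request every few ticks); the function name and parameters are made up.

```python
def peak_pending(ticks, slow_every, limit=None):
    """Simulate a producer which emits one request per tick.

    A request stays in memory until the slow server has acknowledged it.
    The slow server acknowledges one request every `slow_every` ticks.
    With limit=None the producer never blocks; otherwise it stops
    producing while `limit` requests are pending (a blocking send).
    Returns the peak number of requests held in memory.
    """
    pending = 0
    peak = 0
    for tick in range(1, ticks + 1):
        if limit is None or pending < limit:
            pending += 1                 # new request queued for the slow server
        peak = max(peak, pending)
        if tick % slow_every == 0 and pending:
            pending -= 1                 # the slow server acknowledged one request
    return peak
```

Without a limit, `peak_pending(1000, 10)` keeps about nine out of every ten requests in memory and grows with the run time, while `peak_pending(1000, 10, limit=16)` never holds more than 16 requests, at the price of slowing the producer down to the speed of the slowest server.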
/devel/networking :: Link / Comments ()
Passive OS fingerprinting.
I've updated the OSF
modules to xtables, so you have to enable its support in the kernel config and get a
recent iptables (I tested with 1.4.1.1, which is the latest release to date).
OSF allows you to match incoming packets by different sets of SYN packet parameters and determine
which system is on the remote end, so you can make decisions based on the OS type
and, to some degree, even its version.
Installation instructions, an example and the source code can be found on the
homepage.
I've also sent it to the netfilter-devel@ and netdev@ mailing lists, since my previous mails never appeared
there, likely because of spam filters.
/devel/networking :: Link / Comments ()
Sat, 14 Jun 2008
Passive OS fingerprinting.
Ever dreamt of blocking all Linux users in your network from accessing the
internet and allowing full bandwidth to a Windows worm? We have to care about
our smaller brothers, so this iptables extension module allows you to do
so.
OSF stands for OS Fingerprint and allows you to build the usual iptables
decisions on incoming TCP packets; only the initial handshake packet containing the SYN
bit is enough to understand what the remote OS is. The original idea belongs to
Michal Zalewski.
This iptables module was
implemented almost 5 years ago and lived in the patch-o-matic (the userspace
library is still there) iptables tree. Now I've updated it to Xtables
and sent it for review.
Installation steps are described on the
homepage,
but are trivial and include the usual make/make lib building and loading rules into the module
via a procfs file.
# insmod ./ipt_osf.ko
# ./load ./pf.os /proc/sys/net/ipv4/osf
# iptables -I INPUT -j ACCEPT -p tcp -m osf --genre Linux --log 0 --ttl 2 --connector
You will find something like this in syslog:
ipt_osf: Windows [2000:SP3:Windows XP Pro SP1, 2000 SP3]: 11.22.33.55:4024 -> 11.22.33.44:139
/devel/networking :: Link / Comments ()
New userspace network stack release.
Fixed a bug found by Salvatore Del Popolo (delpopolo_dit.unitn.it)
in the TCP implementation: the system checked the sending window, determined
that a packet was not allowed to be sent, and nevertheless tried to send it in some
cases.
Userspace network stack
is a very fast (if working on top of
netchannels,
a packet socket is also supported) and very small network stack (TCP/UDP/IP/ethernet) implemented
entirely in userspace. Because it lives at the very end of the peer (i.e. very close to
or even embedded into the application), it allows much faster processing of some workloads, namely
small packet sending and receiving, where
it
outperforms
the vanilla Linux TCP/IP stack by 3 times in performance and 4 times in CPU usage (sending and receiving vary).

Compare netchannels+unetstack versus Linux sockets (2006 numbers).
It is not about problems in the Linux stack, but about the overhead of syscalls, which in turn
results from data sending and reply processing being too separate in the existing model.
/devel/networking/unetstack :: Link / Comments ()
CARP: Common Address Redundancy Protocol for Linux kernel.
I've finally made a new release of
CARP
for the Linux kernel.
CARP is an improved version of the Virtual Router Redundancy Protocol (VRRP) standard.
The latest protocol to help provide high availability and network redundancy, it was
developed because router giant Cisco Systems believes that its Hot Standby Router
Protocol (HSRP) patent covers some of the same technical areas as VRRP.
This project allows you to build highly available clusters of multiple machines with
balanced master selection between them. Installation and setup are pretty trivial:
$ tar -zxf carp_latest.tar.gz
$ cd carp
$ make
# insmod ip_carp.ko
# modprobe cn
# insmod carp_conn.ko
# ifconfig carp0 up
# carp_conn_daemon -m master.sh -b backup.sh
And the same on all other machines.
Each script, as you can guess from its name, is executed when the node becomes the master or the backup one;
you can put there firewall rule changes, traffic shaping setup, network daemon start/stop
scripts and whatever you like.
Its main advantage over any other existing open (well, it behaves much more robustly than Cisco VRRP though)
master/backup solutions (like Heartbeat or userspace CARP) is the ability to set up a multicast address (via the usual
/sbin/ifconfig command) and thus not to confuse some crappy Cisco
hardware, which will not understand that the node changed.
One can get the latest sources from CARP homepage.
Enjoy!
/devel/networking :: Link / Comments ()
Tue, 01 Apr 2008
Fix for the fundamental network/block layer race in sendfile().
Summary of the
previous
series
with this pompous header:
when sendfile() returns, the pages which it sent can still be queued in the tcp
stack or hardware, so a subsequent write into them will end up
corrupting the data which will eventually be sent. This concerns all
->sendpage() users, namely sendfile() and splice().
We can safely reuse those pages only when the ack is received from the
remote side, which will force the network stack to release the pages.
My simple extension allows one to hook into the data releasing path and perform
any actions we want. This is achieved by replacing skb->destructor with
a callback registered by the interested user, for example the splice/sendfile
code. Splice (the pipe info structure) in turn is extended to hold an atomic
counter of the pages in flight (without a structure size change because of
the alignment issues it has right now), so the splice code will sleep when a full
pipe info (->nrbufs pages) has been sent: it will wait until the number of pages
in flight, which is decremented in the private splice callback, hits zero.
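The waiting scheme can be modelled in userspace with a counter and a condition variable. This is a toy Python model and not the kernel patch: the class and method names are invented, and `released()` merely stands in for the destructor callback which runs once the data has been acked.

```python
import threading


class InFlightPages:
    """Toy model of a pages-in-flight counter; this is not kernel code."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def sent(self, pages=1):
        """Account for pages handed over to the network stack."""
        with self._cond:
            self._count += pages

    def released(self, pages=1):
        """Stand-in for the destructor callback run after the ack."""
        with self._cond:
            self._count -= pages
            if self._count == 0:
                self._cond.notify_all()

    def wait_idle(self, timeout=None):
        """Sleep until no pages are in flight; False means the wait timed out."""
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout)
```

A sender would call `sent()` for each queued page and `wait_idle()` before touching the pages again, so a write into the mapped file can not race with data which is still queued.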
The patch was tested with simple send and recv applications, which can be
found in the archive.
One has to run them on different machines, since loopback uses a slightly
different scheme (namely the page is _never_ copied, so when it is received
by the 'remote' side, it still exists on the 'local' side, so modifications
will end up in data corruption).
devfs1# ./recv -a 0.0.0.0 -p 1025 -c 1024
devfs2# ./send -a devfs1 -p 1025 -f /tmp/test -c 1024
In case of failure you will get this:
Connected to devfs1:1025.
/tmp/test/1024 -> devfs1:1025
Data was corrupted: ab.
after a short period of time, where the above 'ab' is a hex byte written into the
mapped file, which has been sent, immediately after sendfile()
returns to userspace.
Data is supposed to be always zero, and the applications should run forever.
The -c parameter specifies the number of bytes to be sent in each run of
sendfile(). It has to be the same on both machines.
This idea was first thought of as soft barriers in
distributed storage.
/devel/networking :: Link / Comments ()
Fri, 29 Feb 2008
Debugging undebuggable.
If something looks undebuggable at first sight, then take a second look.
Better from a different angle. Some problems require a third look.
Bits of history of the problem.
Pohmelfs
has extremely large latencies
when syncing a local inode to the remote server. This involves sending
a command to the server to create an object with the given name and receiving back
a response with its real inode information (like the inode number and other
fields cached for faster stat() and similar workloads). Pohmelfs
then changes the local inode info to match the real data.
Syncing a small tree of 500 files takes about 40 (!) seconds. Well, in the Xen
environment where I develop these things local creation of 500 files in a single
ext3 directory takes more than 15 seconds, but the other 25 are pure overhead.
That was a short description of the previous series.
Next, the problems of fixing the problems.
First, the Xen version used on that testing machine is old enough that oprofile
does not work. Second, I do not know VFS internals well enough (this is my first filesystem,
the interested reader can find how I managed to
step
on likely every possible rake
in that field, some of them were even small kid rakes...) to determine where
it is possible to catch those long delays. A linux filesystem is actually
not that complex a system, just a set of callbacks, so the implementation is not really outstanding,
but knowing in which conditions each callback can be invoked and which problems can be
here or there is kind of magic... Third, the remote userspace pohmelfs server was not actually
written by me, instead its bytecode was blown out under the inspiration of some substances,
so it can very much be the reason for all the problems; but given that it is as trivial as
pretty much all my userspace code, even a total rewrite will not fix the issue.
So, the latency problem in pohmelfs looked really undebuggable. But you know, a cup of excellent
tea (from a tea bag) with lemon can fix any problem (or high temperature and substances,
or a fair amount of alcohol, everyone has fun the way he likes), so it was first
decided to implement
a simple network kernel module which would connect to the remote userspace server and exchange
messages in a similar fashion to pohmelfs.
Such a module was implemented, started and showed excellent performance (about 1 thousand messages
per second sent and received back in the test network, which is several orders of magnitude faster
than pohmelfs). So, back to VFS and praying for inspiration.
Inspiration came today (thanks Arnaldo, likely it is because I'm getting healthier :).
I always thought that a number of subsequent recv() calls is not a good idea no matter
where, in kernel or userspace, since each takes the socket lock, which in turn can introduce the latencies found,
so I eliminated subsequent recvs in the pohmelfs code (the testing module was written better and does
sending and receiving without such 'fragments'), which resulted in... nothing, the results did not change
at all. So, a wrong step, but having subsequent sending calls in a row is not a good idea either,
so I replaced them with an allocation and a copy, so that there would be only a single kernel_sendmsg()
call. As you might expect performance... changed by 30 times. Just having a single send call instead of two
for some 500 invocations forced the whole network exchange to behave completely differently.
So, to debug the problem further I extended the testing module and introduced the ability to send and receive
data not as a single packet but via two fragments: 4 bytes and the rest of the packet (60 bytes). Here is the result
table for 1000 messages sent and received back by the testing module:
no fragments: 1.43 seconds
send fragments (4 and 60 bytes): 40.43 seconds
recv fragments (4 and 60 bytes): 1.43 seconds
both fragmentations: 40.43 seconds
It is a 30 times difference just for a simple application change!
tcpdump on the receiving side shows that sending subsequent fragments results in a real message being sent
every time kernel_sendmsg() is invoked, which results in an ack for each such message (both the 4 and the 60
byte ones), which completely degrades the tcp window, and the connection just can not recover with such behaviour.
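The same fix (copy the header and the body into one buffer and issue a single send call) looks like this in a userspace Python sketch. The 4-byte length prefix mirrors the 4/60 byte split of the test above, but the framing and the function names are assumptions and not the pohmelfs protocol.

```python
import socket
import struct


def send_fragmented(sock, payload: bytes) -> None:
    """Two send calls per message: the 4-byte length, then the body."""
    sock.sendall(struct.pack("!I", len(payload)))
    sock.sendall(payload)


def send_coalesced(sock, payload: bytes) -> None:
    """One send call per message: length and body copied into one buffer."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, size: int) -> bytes:
    """Read exactly `size` bytes or fail if the peer closes early."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return bytes(buf)


def recv_message(sock) -> bytes:
    """Receive one length-prefixed message."""
    (size,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, size)
```

Both senders produce the same byte stream, so the receiver does not change; only the number of send calls per message differs, which is exactly what the table above measures.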
So, all those words were written just to show that even problems which look undebuggable at first sight can be easily
solved, and that harmless (at first sight again) programming mistakes can lead to very interesting results...
Now back to the drawing board to think how to improve the pohmelfs protocol even more to get the last bits out of the wire.
Btw, the interested reader can get my network testing module and its userspace part from their
just created homepage.
/devel/networking :: Link / Comments ()
Fri, 14 Dec 2007
New release of the userspace network stack.
Changed the data reading function: now it does not copy the TCP header into the
user's buffer, only data. Also forced the packet socket reading path
to limit the maximum number of packets to be read which do not match the
created netchannel.
As usual, the new release is available from the project
homepage.
/devel/networking/unetstack :: Link / Comments ()
Tue, 04 Dec 2007
The 22nd century netchannels release.
This is the 22nd release of netchannels, a peer-to-peer protocol
agnostic communication channel between hardware and users. It uses a
unified cache to store channels, and allows one to allocate buffers for data
from a userspace mapped area or from another preallocated set of pages
(like the VFS cache). All protocol processing happens in process context.
Users of the system can be, for example, userspace: it allows one to receive
and send traffic from the wire without any kernel interference, to
implement one's own protocols and to offload their processing to the hardware.
This idea was originally proposed and implemented by Van Jacobson.
This patchset (with the userspace network stack) is a logical continuation
of the idea with a move to full peer-to-peer processing.
Short changelog:
- update cached route in the netchannel when it expires
Thanks to Salvatore Del Popolo (delpopolo_dit.unitn.it) for testing.
You can get the latest sources from the netchannels homepage.
The userspace network stack is available from its own homepage.
/devel/networking :: Link / Comments ()
Thu, 29 Nov 2007
The 21st netchannels release.
Netchannel is a peer-to-peer protocol agnostic communication channel between hardware and users.
It uses a unified cache to store channels, and allows one to allocate buffers for data
from a userspace mapped area or from another preallocated set of pages
(like the VFS cache). All protocol processing happens in process context.
Users of the system can be, for example, userspace: it allows one to receive
and send traffic from the wire without any kernel interference, to
implement one's own protocols and to offload their processing to the hardware.
This idea was originally proposed and implemented by Van Jacobson.
This patchset (with the userspace network stack) is a logical continuation
of the idea with a move to full peer-to-peer processing.
One of its users is the userspace network stack.
Short changelog:
- fixed queue length usage
- fixed dst release path. Both problems reported by Salvatore Del Popolo (delpopolo_dit.unitn.it)
- removed nat user
More details can be found on project homepage.
/devel/networking :: Link / Comments ()
Wed, 07 Nov 2007
iWARP port sharing problem.
I read Roland Dreier's
post
about the iWARP port sharing problem and want to shed some light on it.
Although Roland described the basics of the technology very well, he skipped the fact
that the problem was discussed and a solution was found with the introduction of iWARP specific
aliases, which should be assigned by the administrator, so that the network stack gets a new
ifindex and an application bound to a different device does not get the same port as iWARP ones.
Roland also skipped the part where some improvements were suggested which were
not implemented (error propagation and fallback, automation of the process
(like alias creation) and other bits); most of the time essentially the same
answer was received, that it is not needed... Maybe so, but why was this discussion
missing from Roland's presentation of the evil empire of the network developers?
So, I think, RDMA people do not need a discussion: you want your own ideas to be merged
just because you believe they are cool, no matter how things
are in real life and what others tell you about it.
I know that, because it was me who performed the first review of the alias patches for iWARP.
/devel/networking :: Link / Comments ()
Tue, 06 Nov 2007
New release of the userspace network stack.
It is based on patches by Holger Schurig (holgerschurig_gmx.de).
Short changelog for this
unetstack release:
- added
netchannel.h, which allows to compile userspace network stack without
netchannels
support in the kernel
- killed warnings about unused variables
/devel/networking/unetstack :: Link / Comments ()
Saving the universe from the thermal death
or decreasing world entropy. I.e. fixing bugs in the kernel.
My small contribution -
fixed sch_teql bug.
/devel/networking :: Link / Comments ()
Thu, 01 Nov 2007
Network hash tables for socket lookups.
The topic of moving hash tables to RCU comes up regularly on the netdev@ mailing list,
but so far there is no solution for the hash resizing problem because
of the nature of RCU. Likely it can not be fixed at all without some additional
(maybe optional) synchronization.
It was pointed out that Robert Olsson's hashed
trie
can be a good solution.
The interested reader can also check my
multidimensional trie
algorithm, which I implemented for network socket lookup and originally
took from netchannels. It was announced
on netdev@ but I got a quite passive response, so I froze the project for a while
(it can be resurrected though)...
At the links above you can find performance testing compared
to in-kernel hash tables of different sizes. Testing was performed
by running a simple web server and a huge number of clients, which
frequently connect to and disconnect from the server.
/devel/networking :: Link / Comments ()
Thu, 18 Oct 2007
New release of the userspace network stack.
Short changelog:
- really fixed leak in raw netchannel reading path
- changed timestamp setup
- added retransmit checking timer
- added sanity checks for addresses and ports processed in the stack - in the case
of a packet socket they can sometimes be incorrect (when working over loopback for example)
- retransmit logic checks - still requires a bit of work, it is not 100% correct
This release contains a number of really useful fixes, but the retransmit logic
is not yet correct. Since unetstack
uses a very aggressive (non-RFC-compliant) congestion control algorithm, this can lead
(and I see this in practice) to a complete stall of the dataflow.
I will investigate this problem further later.
/devel/networking :: Link / Comments ()
Reading userspace network stack code.
if (!th->ack) {
	ulog("%s: Strange packet.\n", __func__);
	goto out;
}
Very interesting, what did I mean?
/devel/networking :: Link / Comments ()
Tue, 16 Oct 2007
Userspace network stack.
I've released new version of the
userspace network stack,
which contains a memory leak fix by Salvatore Del Popolo (delpopolo_dit.unitn.it).
Enjoy!
/devel/networking :: Link / Comments ()