Zbr's days.

Sun, 10 Aug 2008

A bored russian.

Yes, that is what The Inquirer called me. The magazine even put it in bold capital letters :) The rest of the article is quite wrong though (i.e. it is not what I wrote in my blog).

Slashdot also got an entry; I was called a hacker and then a physicist there.

What's next? It is really great fun! :)

/devel/networking/dns :: Link / Comments (9)


Sat, 09 Aug 2008

Russian physicist.

That is what the New York Times called me amid all this hype about the DNS poisoning attack.

Unfortunately I no longer remember what the electron charge is or how to describe the Higgs boson even to myself. I moved away from that almost 10 years ago :)

The article says that DJBDNS does not suffer from this attack. It does. Everyone does. With some tweaks the attack can take longer than against BIND, but the overall problem is there.

But that's enough for this story. I'm moving on to other interesting developments.

/devel/networking/dns :: Link / Comments (11)


Fri, 08 Aug 2008

Successfully poisoned the latest BIND with fully randomized ports!

The exploit had to send more than 130 thousand requests for fake records like 131737-4795-15081.blah.com before it matched the port and ID and inserted a poisoned entry for poisoned_dns.blah.com.

# dig @localhost www.blah.com +norecurse

; <<>> DiG 9.5.0-P2 <<>> @localhost www.blah.com +norecurse
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6950
;; flags: qr ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; QUESTION SECTION:
;www.blah.com.                  IN      A

;; AUTHORITY SECTION:
www.blah.com.           73557   IN      NS      poisoned_dns.blah.com.

;; ADDITIONAL SECTION:
poisoned_dns.blah.com.  73557   IN      A       1.2.3.4

# named -v
BIND 9.5.0-P2
BIND used the fully randomized source port range, i.e. around 64000 ports. Two attacking servers, connected to the attacked one via a GigE link, were used; each one attacked 1-2 ports with the full ID range. Usually an attacking server is able to send about 40-50 thousand fake replies before the remote server returns the correct one, so if the port is matched, the probability of successful poisoning is more than 60%.

The attack took about half a day, i.e. a bit less than 10 hours.
So, if you have a GigE LAN, any trojaned machine can poison your DNS in one night...
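
The numbers above can be turned into a rough back-of-the-envelope estimate. This is only a sketch under stated assumptions: the source port is taken to be uniform over about 64000 ports, the two attackers together cover 3 ports per attempt (the post says 1-2 each), every attempt is independent, and the 0.6 factor is the "more than 60%" chance of winning the ID race once the port matches. The function names are made up for illustration.

```python
def per_attempt_success(ports_covered, port_range=64000, id_race_win=0.6):
    """Chance that one burst of fake replies poisons the cache.

    ports_covered: how many source ports the attackers flood per attempt
                   (assumed value, not measured in the post).
    id_race_win:   chance of matching the ID before the real reply arrives,
                   given that the port was guessed.
    """
    return (ports_covered / port_range) * id_race_win


def cumulative_success(attempts, p):
    """Chance that at least one of `attempts` independent bursts succeeds."""
    return 1.0 - (1.0 - p) ** attempts


p = per_attempt_success(ports_covered=3)
for attempts in (10_000, 50_000, 130_000):
    print(attempts, round(cumulative_success(attempts, p), 3))
```

Under these assumptions, a run of 130 thousand requests before the first hit is on the unlucky side but not implausible; the real numbers depend on how many ports were actually covered per attempt.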

/devel/networking/dns :: Link / Comments (62)


Wed, 06 Aug 2008

Additional note on the DNS poisoning attack: 'IN A' entry injection.

Actually I did inject an 'IN A' entry for poisoned_dns.blah.com into the cache.

So, to inject an arbitrary 'A' entry for attacked.domain.com into the cache, one has to bruteforce the ID (and match the source port if needed) for any other subdomain of the same level, e.g. subdomain-123.domain.com, and put into the additional section of that message an 'IN NS' record which points to attacked.domain.com, plus an 'IN A' record with a fake IP address for that name server, i.e. an 'IN A' record for attacked.domain.com pointing to 1.2.3.4.

This method is a bit less flexible than just poisoning any subdomain with an NS record which points to a controlled DNS server, but it does not require that server to exist, so it can route traffic directly to your site without first asking your DNS server where the given subdomain lives.

# ping poisoned_dns.blah.com -c100 > /dev/null 2>&1 &
# tcpdump -nn icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
11:27:20.422124 IP devfs1 > 1.2.3.4: ICMP echo request, id 55367, seq 5, length 64
11:27:20.422333 IP gw > devfs1: ICMP host 1.2.3.4 unreachable, length 36
11:27:21.422126 IP devfs1 > 1.2.3.4: ICMP echo request, id 55367, seq 6, length 64
11:27:21.422310 IP gw > devfs1: ICMP host 1.2.3.4 unreachable, length 36
11:27:22.422123 IP devfs1 > 1.2.3.4: ICMP echo request, id 55367, seq 7, length 64
11:27:22.422286 IP gw > devfs1: ICMP host 1.2.3.4 unreachable, length 36
11:27:23.423122 IP devfs1 > 1.2.3.4: ICMP echo request, id 55367, seq 8, length 64
11:27:23.423311 IP gw > devfs1: ICMP host 1.2.3.4 unreachable, length 36

/devel/networking/dns :: Link / Comments (1)


More interesting (and complete) hack of the DNS.

I managed to inject the following poisoning information:

# dig @localhost +norecurse www.blah.com any

;; ANSWER SECTION:
www.blah.com.		123452	IN	NS	poisoned_dns.blah.com.

;; AUTHORITY SECTION:
www.blah.com.		123452	IN	NS	poisoned_dns.blah.com.

;; ADDITIONAL SECTION:
poisoned_dns.blah.com.	123452	IN	A	1.2.3.4

# dig @localhost www.blah.com
The last command results in the following dump:
01:36:14.567622 IP devfs1.5301 > 1.2.3.4.53: 42416% [1au] A? www.blah.com. (41)
01:36:15.067816 IP devfs1.5301 > 1.2.3.4.53: 29011% [1au] A? www.blah.com. (41)
01:36:15.568013 IP devfs1.5301 > 1.2.3.4.53: 30586 A? www.blah.com. (30)
01:36:16.568182 IP devfs1.5301 > 1.2.3.4.53: 38101 A? www.blah.com. (30)
01:36:18.568429 IP devfs1.5301 > 1.2.3.4.53: 64596 A? www.blah.com. (30)
01:36:22.568634 IP devfs1.5301 > 1.2.3.4.53: 59943 A? www.blah.com. (30)
01:36:30.568960 IP devfs1.5301 > 1.2.3.4.53: 39614 A? www.blah.com. (30)
01:36:40.569163 IP devfs1.5301 > 1.2.3.4.53: 13769 A? www.blah.com. (30)
So, effectively, if I controlled the 1.2.3.4 machine I would be able to answer those queries with an address of my choice. I was not able to inject an 'A' record for any domain except the one which happened to match the ID in my fake responses, and it looks like 'A' records are not accepted at all (I'm far from being a DNS expert).

So, I actually consider this exploit complete: it is capable of arbitrary NS record poisoning. Its performance is rather good: the poisoning attack requires 1-3 (sometimes more; it heavily depends on link capacity and auth DNS server performance) queries from the client to the authoritative DNS server. An attacking server connected via a gigabit link can easily cover the whole DNS ID space while the attacked resolver waits for a reply from the remote server. Math tells me that a 100 Mbit connection will require about two times more requests to be sent by the client, which is still not that much.

The server side of the exploit requires root privileges to run, since it uses a raw socket to create a datagram with the IP addresses of the attacked server and the appropriate authoritative name server. The client connects to one or more attacking servers, sends them the appropriate response message and issues a DNS request for that response to the attacked server. The poisoning servers start to flood the attacked server with replies until the client sends them the next reply to bomb with. When the client receives a fake answer from the poisoned DNS server, the attack stops. The exploit allows you to specify the name server to attack, the NS record to inject and the DNS name which should get that NS record.

Having hard GigE performance numbers, I can say that port randomization does not completely solve the DNS poisoning attack (although it makes it harder), since with such link capacity the attacker only needs to guess the port, and the ID space will be bruteforced before the reply is received from the authoritative name server.

So far I cannot test BIND with randomized ports, since the local Debian mirror somehow has an unsigned package for it, so I will not install it right now, but will do it later and provide numbers for the randomized server. I expect to be able to poison even that server, although not as fast as with a constant port.

Have fun!

/devel/networking/dns :: Link / Comments (0)


Tue, 05 Aug 2008

DNS cache poisoning attack succeeded for the constant port.

Hacking rox!

# dig @devfs1 3-c13a-15729.paypal.com.

; <<>> DiG 9.5.0-P2 <<>> @devfs1 3-c13a-15729.paypal.com.
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 18330
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 0

;; QUESTION SECTION:
;3-c13a-15729.paypal.com.	IN	A

;; ANSWER SECTION:
3-c13a-15729.paypal.com. 123405	IN	A	1.2.3.4

# dig 1-71b2-16080.money.paypal.com.
...
;; ANSWER SECTION:
1-71b2-16080.money.paypal.com. 123421 IN A	1.2.3.4

# dig @localhost 29-07f3-16098.test.com
...
;; ANSWER SECTION:
29-07f3-16098.test.com.	123411	IN	A	1.2.3.4
Although it is not a complete win yet: the additional section from the poisoning packet was parsed, and the entry looks like it was inserted into the DNS server database, but a subsequent request ends up querying the remote server. Probably that is because my fake replies do not contain an authority section, so I will extend them and continue this game :)

Ugh, 4 A.M. My body, soul and whatever else wants to sleep will all hate me tomorrow.

/devel/networking/dns :: Link / Comments (4)


Mon, 04 Aug 2008

Got the testing machines back.

I was called a saboteur, although no one was able to answer what will happen if the same load is generated by some virus or trojan.
Nevertheless I played some political games, had some talks, which I managed to cool down from an angry to a fun tone, and eventually got access again.

I installed BIND on one of the servers, which by coincidence does not have the port randomization fix, so it issues all requests from port 5301. I fixed the IP header initialization, so now the attacking servers send their fake DNS replies not with their own IP address as the source (that was likely one of the main reasons, if not the main one, the machines were disabled), but with the appropriate auth DNS server IP address.

I also found an interesting effect with DNS server traffic: the resolver's network channel is so heavily loaded with small fake UDP DNS replies that other packets can hardly sneak in, so effectively the real reply arrives only after almost the whole ID range has been bruteforced. I remind you that these are GigE-linked machines, and the attacking servers send about 200-300 thousand packets per second on average; the drop rate is about 30% (only about 45 thousand packets are received out of more than 65,000 sent).

This basically means that in this particular case the probability of successful poisoning with port randomization is limited only by the random port number, and the random ID plays almost no role (since traffic generated by the attacking server will eat the bandwidth and will not allow the real reply to come first), so one just has to guess the port number and the attack will succeed.
I will try to prove this theory tomorrow, as well as confirm that my exploit works.

/devel/networking/dns :: Link / Comments (0)


Sat, 02 Aug 2008

DNS cache poisoning attack results.

My account was disabled and access to the servers was turned off.

And it is just because of several minutes of 200+ kpps UDP DNS response storms from three machines to one of the corporate DNS servers (I think there are hundreds of them, I just got access to a couple). Who the hell monitors it on Saturday night at 2 A.M.? I specially selected a time when normal people sleep, drink or have sex, but do not work and watch DNS server load.

The only real problem is that those servers were also used for POHMELFS development and testing. I am still able to work with two Xen domains though (where I actually develop and test initial implementations, without various stress loads, for all my current projects), so development will not stop.

I will pretend to be an idiot and to have viruses there. Linux kernel viruses.
And of course I will promise to install all updates and to be careful next time.
Next time I will not attack a known nameserver, but install my own.
It is all about science and not about doing harm (I even poisoned a non-existent domain).

Or they will take away my toys and kick my ass, but I will resist, so there will be no interesting notes about the DNS cache poisoning attack (although no, I will be able to run one on my desktop via loopback, it is quite a fast machine) and no nice benchmark graphs :)

/devel/networking/dns :: Link / Comments (2)


Fri, 01 Aug 2008

DNS cache poisoning attack exploit completed.

I believe I've completed a quite distributed client/server network exploit, which is capable of poisoning a given DNS cache whether it works with a single source port or randomizes it over some port range.
I already described the client/server architecture, so only short notes here.
The client broadcasts a set of ports and fake queries to a number of poisoning servers, and then asks the attacked name server a specially crafted query for a name which does not exist in the attacked domain. The poisoning servers send lots of replies to the attacked DNS server with fake IP addresses and ports, which pretend to be the address/port of the authoritative DNS server. Each reply contains an answer section for the current client query and an additional section, which contains information about the attacked domain: the former is a subdomain of the latter, like querying the 'IN A' record for '123-456.www.blahblah.com' while the reply contains 'IN A' data for '123-456.www.blahblah.com' in the answer section and 'IN A' data for 'www.blahblah.com' in the additional section.
The client then checks the reply (or hits a timeout), and if it does not contain the given record for the query, sends the next packet to the poisoning servers and the appropriate request to the attacked caching name server.

So far I did not succeed in this attack, but managed to load the network (and actually the main name server) so much that really lots of people around started to complain that they had troubles... This is also a result, actually, but not the one I expected, so I will postpone the attack to late night today.

Tcpdumps show that the broadcasted data is valid, but there was no actual poisoning, so probably I will install my own server and configure it to use a single port. Currently the attacked server has a not very random port distribution, but still not a constant one. My poisoning servers (two servers connected via a GigE link to the same network as the attacked server) use 100% CPU each, since they need to calculate the UDP checksum for each packet (since it has a different ID and/or port number) and use a raw socket to transmit data (to specify the source and destination addresses of the authoritative and attacked servers). Each server is usually capable of transmitting about 30k-130k packets per second, which corresponds to 1-20 ports (and the whole 64k ID range per port) during the 5 second timeout interval before the next request. This is not enough of course for a 100% guarantee, but I think after quite a long time the attack may succeed, so I will put it in action for the next weekend or at least a night.

Bert Hubert did some math on this kind of attack; the result is not very promising for the attacker, but the probability is still far from zero.

I do not promise success, but would like to know if I'm on the right track, so the attack has been started...

P.S. DNS has its own tag in the blog now.
P.P.S. The source code of the distributed cache poisoning exploit (it may be completely incorrect!) can be found in the archive. Sorry, no usage details, but you can use the '-h' command line parameter :)

/devel/networking/dns :: Link / Comments (0)


Thu, 31 Jul 2008

DNS cache poisoning client/server architecture.

So far I have only implemented a simple flooder, which takes the number of destination ports as a parameter, plus two names and addresses to put into the answer and additional sections of the DNS reply. It uses a UDP socket, so the source address does not belong to the server which should pretend to answer the given query; so this application will not actually work, and I need to implement sending via a packet socket and substitute the source IP address with the authoritative DNS server's one.
The poison flooder also should not use only one name/address in the answer section; instead it should iterate together with the client, so that the appropriate request and answer are synchronized.

So far, the initial design of the client/server architecture of this small project looks like this: depending on flags, either the client connects to multiple flood servers or vice versa, then the client sends a message to each server where it specifies the port and ID ranges to attack, the attacked DNS server IP, the requested query name, the source address (pretending to be an authoritative name server) and the additional resource record data to put into replies (which will poison the cache).
Each server starts sending that data to the specified name server, with the source address changed to the authoritative name server's one and with the ID and port varied over the given ranges. When the client has finished broadcasting the request data to all flood servers, it sends a request to the attacked DNS server with the given query name to resolve. Now the flood servers race with the authoritative one to provide an answer. When the client receives the answer, it checks whether it looks like the poisoned data we want to get or the real answer (which should be NXDOMAIN, since we resolve non-existent names). In the former case we exit the process and enjoy the result; otherwise the client specifies the next name to resolve and the same starts again.

Looks interesting...

/devel/networking/dns :: Link / Comments (0)


Wed, 30 Jul 2008

Simple DNS server/resolver.

The right time to hack a DNS server is the middle of the night: it is 3 A.M. here and I've just completed an initial draft of a trivial DNS server, which is only capable of receiving a datagram on a predefined port, parsing it and filling in a reply with a static "IN A" record (I think I will add a config file). This record is placed into the 'answer' and 'additional' resource record sections, then the whole message is sent back to the client.

That's how it looks for the standard UNIX dig command:

$ dig @localhost -p 1025 www.google.com
;; Warning: query response not set

; <<>> DiG 9.4.2-P1 <<>> @localhost -p 1025 www.google.com
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51486
;; flags: rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;www.google.com.			IN	A

;; ANSWER SECTION:
www.google.com.		123456	IN	A	195.178.208.66

;; ADDITIONAL SECTION:
www.google.com.		123456	IN	A	195.178.208.66

;; Query time: 15 msec
;; SERVER: 127.0.0.1#1025(127.0.0.1)
;; WHEN: Wed Jul 30 02:56:23 2008
;; MSG SIZE  rcvd: 64
There are several warnings, which I will fix later, but the main part is the section content: www.google.com obviously does not have the IP address of my blog site. The TTL usually is not equal to 123456 either.
The game continues, while I need some sleep...
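
The reply construction described in this post can be sketched roughly as follows. This is not the actual server code (which is not shown here), just a minimal illustration of the DNS wire format for a single-question query; the helper names are invented, name compression is not used, and unlike the draft above this sketch does set the response (QR) bit.

```python
import struct


def encode_name(name):
    """Encode 'www.example.com' as DNS labels: length byte + bytes, then a 0 byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def a_record(name, ttl, ip):
    """One 'IN A' resource record: name, type=1, class=1, TTL, RDLENGTH, RDATA."""
    rdata = bytes(int(part) for part in ip.split("."))
    return encode_name(name) + struct.pack("!HHIH", 1, 1, ttl, len(rdata)) + rdata


def build_reply(query, ttl, ip):
    """Answer a one-question query with the same A record in answer and additional."""
    qid, flags = struct.unpack("!HH", query[:4])
    # Walk the question name (it starts right after the 12-byte header).
    pos = 12
    labels = []
    while query[pos] != 0:
        length = query[pos]
        labels.append(query[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    question = query[12:pos + 5]  # name, terminating 0, QTYPE and QCLASS
    # QR=1 marks a response; the RD bit is copied from the query.
    header = struct.pack("!HHHHHH", qid, 0x8000 | (flags & 0x0100), 1, 1, 0, 1)
    rr = a_record(".".join(labels), ttl, ip)
    return header + question + rr + rr
```

A real server would read the datagram with recvfrom() on the predefined port and send the result of build_reply() back to the sender's address.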

/devel/networking/dns :: Link / Comments (0)


Tue, 29 Jul 2008

Some DNS port distribution data.

Gathered late tonight, so that the DNS server would not be disturbed too much by other users.
The graphs below show the source port cloud and distribution of some BIND (I do not know the version) for a thousand runs. Each request asked for a non-existent subdomain of a domain whose name server I control, so I was able to capture dumps and analyze them a bit.

[Figures: DNS source ports cloud; DNS source ports distribution]

These graphs show the source port cloud and its distribution. Each histogram bar corresponds to the number of hits in a 100-port range; the start of the range is shown in the X axis labels.
First, ports are randomly selected from the 50k-65k range, so one needs to guess among a much smaller number of ports.
Second, even in 1 thousand requests there are lots of requests with the same port (stats show that there are 149 ports which were used 2 or more times in the above 1000 runs; there is even a single port which was used 4 times). The distribution over 100-port ranges is shown on the graph.
Such behaviour allows limiting the source port range even more.
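
The statistics above are simple to reproduce from a list of captured source ports. The snippet below is a sketch of that kind of post-processing, not the script actually used; the sample ports are made up and the function names are hypothetical.

```python
from collections import Counter


def bucket_histogram(values, width=100):
    """Count hits per [start, start + width) range, keyed by the range start."""
    return Counter((v // width) * width for v in values)


def repeated(values, min_hits=2):
    """Values that were seen at least `min_hits` times, with their hit counts."""
    return {v: n for v, n in Counter(values).items() if n >= min_hits}


ports = [50123, 50187, 50123, 61500, 61599, 61600]  # made-up sample, not real data
print(sorted(bucket_histogram(ports).items()))
print(repeated(ports))
```

The same two functions work for DNS IDs if the list of captured IDs is passed instead of ports.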

Now, DNS IDs.

[Figures: DNS ID cloud; DNS ID distribution]

The whole range of IDs is used, and their distribution (each histogram bar corresponds to the number of IDs in the appropriate 100-ID range) is more uniform. Only 9 IDs were used twice per 1000 runs.

But since I do not know the exact load of the analyzed DNS server (and it can be high even at 3 A.M.), I cannot say if those numbers are due to the port/ID selection algorithm implementation or just because the load was high and there were actually not only my 1000 requests.

To play further with DNS caches I decided to install a local DNS server first and test things with it.
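
As a sanity check, the repeat counts can be compared with what a uniformly random choice would give. This is a birthday-style estimate and only a sketch: it counts expected colliding pairs, which for sparse data is close to (but not exactly) the number of values used twice, and it assumes the 1000 observed requests were the only draws.

```python
def expected_colliding_pairs(draws, space):
    """Expected number of pairs of draws that share a value, for a uniform choice."""
    return draws * (draws - 1) / (2 * space)


# 1000 requests over the full 16-bit ID space
print(round(expected_colliding_pairs(1000, 65536), 1))
# 1000 requests over a roughly 15000-port range (50k-65k)
print(round(expected_colliding_pairs(1000, 15000), 1))
```

This gives roughly 7.6 expected pairs for IDs, in line with the 9 repeated IDs observed, and roughly 33 for ports, well below the 149 repeated ports mentioned earlier. Under these assumptions the port selection looks less uniform than the ID selection, although extra load on the server could also explain part of the difference.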

/devel/networking/dns :: Link / Comments (0)


Sun, 27 Jul 2008

Lots of talks about DNS cache poisoning attack.

There are two types of this attack: DNS query ID guessing, and request source port guessing for servers which use randomized source ports, which should be turned on after Dan Kaminsky's alert.

The DNS ID is only 16 bits, so it can be guessed rather fast; one just needs to force someone who uses the attacked DNS cache to issue appropriate requests. When a reply is received by a DNS resolver, it is stored there for a predefined amount of time (the TTL parameter provided by a higher-level DNS resolver or eventually the authoritative name server). Dan found that the attacker can actually ask not for the attacked domain, but for some subdomain of it (if the attacker tries to point www.microsoft.com to its own IP, it can force sending DNS requests for 1.microsoft.com, 2.microsoft.com and so on), and put data about the actual target into additional resource records attached to all datagrams. So, when the attacker eventually wins the race, it can store (among lots of subdomains) the needed pointers in the attacked DNS cache.

I've just thought that this attack would not be possible if all queries from DNS resolvers to higher-level resolvers and/or authoritative name servers happened over TCP instead of the more common UDP. There would be no need to issue requests from random ports anymore, and no need to parse and drop additional resource records. There would be no problems with truncation of large messages... But to play a bit with the whole idea I'm implementing a simple DNS query/response processor. Maybe I will play a bit with poisoning the local cache (the ISP at the office uses only 6 different ports to send requests), although its main goal is an IP-over-DNS tunnel.

This is kind of a real rest after visa/hotel paperwork. I was told that if I am called to the embassy for an interview, chances are high the visa will be declined because of my sense of humor :)
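
To put "guessed rather fast" into numbers, here is a small sketch of the odds for a resolver with a fixed, known source port. It assumes each fake reply in a race carries a distinct ID, the real ID is uniform over 16 bits, and races for different subdomains are independent; the function names are illustrative only.

```python
ID_SPACE = 1 << 16  # a DNS ID is 16 bits


def race_win(fake_replies, id_space=ID_SPACE):
    """Chance to hit the right ID in one race with `fake_replies` distinct guesses."""
    return min(fake_replies, id_space) / id_space


def win_within(races, fake_replies):
    """Chance to win at least one of `races` independent races."""
    return 1.0 - (1.0 - race_win(fake_replies)) ** races


for races in (1, 10, 100, 1000):
    print(races, round(win_within(races, fake_replies=100), 4))
```

Asking for a fresh subdomain each time is what makes repeated races possible: a lost race only caches the answer for that one throwaway name.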

Update:

zbr@gavana:~/aWork/tmp/dns$ ./query -a 195.178.208.66 -i 0x1234 -q tservice.net.ru
query: 'tservice.net.ru', class: 1, type: 1, server: 195.178.208.66:53, protocol: 17, id: 1234.
Connected to 195.178.208.66:53.
id: 1234: flags: resp: 0, opcode: 0, auth: 0, trunc: 0, RD: 1, RA: 0, rcode: 8.
        : question: 1, answer: 1, auth: 2, addon: 2.
	: question: name: 'tservice.net.ru.', type: 1, class: 1.
	: name: 'tservice.net.ru.', type: 1, class: 1, ttl: 86400, rdlen: 4, rdata: 195.178.208.66
	: name: 'tservice.net.ru.', type: 2, class: 1, ttl: 86400, rdlen: 14, rdata: ns.tservice.ru.
	: name: 'tservice.net.ru.', type: 2, class: 1, ttl: 86400, rdlen: 7, rdata: dns2.tservice.ru.
	: name: 'ns.tservice.ru.', type: 1, class: 1, ttl: 86400, rdlen: 4, rdata: 195.178.208.66
	: name: 'dns2.tservice.ru.', type: 1, class: 1, ttl: 86400, rdlen: 4, rdata: 62.141.76.164
And the DNS protocol gets the first prize among the ugliest crappies.
Now it's time to create the DNS server itself, which will get requests (the dump above shows a BIND session), parse them and perform appropriate actions, like sending a reply with specially crafted additional resource records: either a NULL one, for example (it can contain up to 64k of data), or TXT (a length byte followed by a character string; there may be multiple strings as long as the total length, including the length bytes themselves, is less than 64k). Or an additional A resource record, which may contain information about the domain to poison...
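
The TXT layout mentioned above (a length byte followed by the string, repeated) is easy to get subtly wrong, so here is a minimal encoder/decoder sketch for the RDATA part only. The function names are invented for illustration; the 65535 limit comes from the 16-bit RDLENGTH field.

```python
def txt_rdata(payload, chunk=255):
    """Split payload into character-strings of at most 255 bytes,
    each prefixed with its one-byte length."""
    out = bytearray()
    for i in range(0, len(payload), chunk):
        piece = payload[i:i + chunk]
        out.append(len(piece))
        out += piece
    if len(out) > 0xFFFF:
        raise ValueError("RDATA does not fit into the 16-bit RDLENGTH field")
    return bytes(out)


def parse_txt_rdata(rdata):
    """Inverse of txt_rdata(): join all character-strings back together."""
    out = bytearray()
    pos = 0
    while pos < len(rdata):
        length = rdata[pos]
        out += rdata[pos + 1:pos + 1 + length]
        pos += 1 + length
    return bytes(out)
```

For a tunnel, this means each 255 bytes of payload cost one extra byte of framing inside the record.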

/devel/networking/dns :: Link / Comments (0)


Tue, 01 Jul 2008

Why is blocking sending considered harmful?

I frequently hear that whatever server you implement, it has to be non-blocking, since with parallel sending this allows sending multiple requests to fast servers while not sending data to a slow server, because a non-blocking socket will return EAGAIN.

This is only a half-right solution: when we have to put the given data on all servers and cannot free it until all servers have replied with an acknowledgement, non-blocking mode can bring more damage than gain.

Mainly because it allows eating all the memory with requests which are still in the queue to be sent to the slow server but which were already sent to the fast ones. In this case the higher-level application (consider a simple application which generates some data and writes it into a file in a distributed filesystem, which writes the file to several servers) will never block, since transfer to the fast servers completes quickly, and will provide more and more data, which will consume all RAM.

It is possible to deadlock the system in this case, since to send some data to a remote server we always have to allocate at least some memory to put network headers into. With the non-blocking solution we will consume all memory and kick ourselves into a coma.
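
The memory growth is easy to see in a toy model. The sketch below is a deliberately simplified simulation, not a model of any real filesystem: time advances in ticks, the application offers `produce` buffers per tick, each server acknowledges a fixed number of buffers per tick, and a buffer can be freed only when both servers have acknowledged it. With `max_inflight=None` the writer is never throttled (the non-blocking case); with a limit it is held back once too many buffers are unacknowledged (the blocking case). All names and numbers are assumptions.

```python
def simulate(ticks, produce, fast_rate, slow_rate, max_inflight=None):
    """Return the peak number of buffers held in memory at once."""
    produced = fast_acked = slow_acked = peak = 0
    for _ in range(ticks):
        held = produced - min(fast_acked, slow_acked)
        if max_inflight is None:
            room = produce  # the writer is never throttled
        else:
            room = max(0, min(produce, max_inflight - held))  # the writer blocks
        produced += room
        fast_acked = min(produced, fast_acked + fast_rate)
        slow_acked = min(produced, slow_acked + slow_rate)
        peak = max(peak, produced - min(fast_acked, slow_acked))
    return peak


print("unthrottled:", simulate(1000, produce=10, fast_rate=10, slow_rate=1))
print("throttled:  ", simulate(1000, produce=10, fast_rate=10, slow_rate=1, max_inflight=64))
```

In the unthrottled run the backlog grows linearly with time, while the throttled run stays under the limit at the price of the writer going at the slow server's pace.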

/devel/networking :: Link / Comments (2)


Passive OS fingerprinting.

I've updated the OSF modules to xtables, so you have to enable its support in the kernel config and get a recent iptables (I tested with 1.4.1.1, which is the latest release to date).

OSF allows you to match incoming packets against different sets of SYN-packet signatures and determine which system is on the remote end, so you can make decisions based on OS type and, to some degree, even version.

Installation instructions, an example and source code can be found on the homepage.

I've also sent it to the netfilter-devel@ and netdev@ mailing lists, since my previous mails never appeared there, likely because of spam filters.

/devel/networking :: Link / Comments (0)


Sat, 14 Jun 2008

Passive OS fingerprinting.

Ever dreamt of blocking all Linux users in your network from accessing the internet and giving full bandwidth to Windows worms? We have to care about our smaller brothers, so this iptables extension module allows you to do so. OSF, which stands for OS Fingerprint, allows you to build usual iptables decisions on incoming TCP packets; just the initial handshake packet with the SYN bit is enough to understand what the remote OS is. The original idea belongs to Michal Zalewski.
This iptables module was implemented almost 5 years ago and lived in the patch-o-matic iptables tree (the userspace library is still there). Now I've updated it to Xtables and sent it for review.

Installation steps are described on the homepage, but they are trivial: the usual make/make lib build and loading the rules into the module via a procfs file.

# insmod ./ipt_osf.ko
# ./load ./pf.os /proc/sys/net/ipv4/osf
# iptables -I INPUT -j ACCEPT -p tcp -m osf --genre Linux --log 0 --ttl 2 --connector
You will find something like this in syslog:
ipt_osf: Windows [2000:SP3:Windows XP Pro SP1, 2000 SP3]: 11.22.33.55:4024 -> 11.22.33.44:139

/devel/networking :: Link / Comments (0)


New userspace network stack release.

Fixed a bug found by Salvatore Del Popolo (delpopolo_dit.unitn.it) in the TCP implementation: the system checked the sending window, determined that the packet was not allowed to be sent, and nevertheless tried to send it in some cases.

The userspace network stack is a very fast (when working on top of netchannels; packet sockets are also supported) and very small network stack (TCP/UDP/IP/ethernet) implemented entirely in userspace. Because it lives at the very end of the path (i.e. very close to or even embedded into the application), it allows much faster processing of some workloads, namely small packet sending and receiving, where it outperforms the vanilla Linux TCP/IP stack 3 times in performance and 4 times in CPU usage (sending and receiving vary).

[Figure: ATCP gigabit test]

Compare netchannels+unetstack versus Linux sockets (2006 numbers).

It is not about problems in the Linux stack, but about the overhead of syscalls, which in turn results from data sending and reply processing being too separate in the existing model.

/devel/networking/unetstack :: Link / Comments (0)


CARP: Common Address Redundancy Protocol for Linux kernel.

I've finally made a new release of CARP for the Linux kernel.

CARP is an improved version of the Virtual Router Redundancy Protocol (VRRP) standard. The latest protocol to help provide high availability and network redundancy, it was developed because router giant Cisco Systems believes that its Hot Standby Router Protocol (HSRP) patent covers some of the same technical areas as VRRP.

This project allows you to build highly available clusters of multiple machines with balanced master selection between them. Installation and setup are pretty trivial:

$ tar -zxf carp_latest.tar.gz
$ cd carp
$ make

# insmod ip_carp.ko
# modprobe cn
# insmod carp_conn.ko
# ifconfig carp0 up
# carp_conn_daemon -m master.sh -b backup.sh
And the same on all other machines.
Each script, as you can guess from its name, is executed when the node becomes a master or a backup one; you can put there firewall rule changes, traffic shaping setup, network daemon start/stop scripts and whatever you like.

Its main advantage over other existing open master/backup solutions like Heartbeat or userspace CARP (well, it also behaves much more robustly than Cisco VRRP) is the ability to set up a multicast address (via the usual /sbin/ifconfig command) and thus not confuse some crappy Cisco hardware, which would otherwise not understand that the node has changed.

One can get the latest sources from the CARP homepage.
Enjoy!

/devel/networking :: Link / Comments (0)


Tue, 01 Apr 2008

Fix for the fundamental network/block layer race in sendfile().

Summary of the previous series with this pompous header: when sendfile() returns, the pages which it sent can still be queued in the TCP stack or hardware, so a subsequent write into them will end up corrupting the data which will eventually be sent. This concerns all ->sendpage() users, namely sendfile() and splice().

We can only safely reuse those pages when an ack is received from the remote side, which will force the network stack to release the pages. My simple extension allows hooking into the data releasing path and performing any actions we want. This is achieved by replacing skb->destructor with a callback registered by the interested user, for example the splice/sendfile code. Splice (the pipe info structure) in turn is extended to hold an atomic counter of the pages in flight (without a structure size change, because of the alignment padding it has right now), so the splice code will sleep when a full pipe info (->nrbufs pages) has been sent; it will wait until the number of pages in flight, which is decremented in the private splice callback, hits zero.

The patch was tested with simple send and recv applications, which can be found in the archive.

One has to run them on different machines, since loopback uses a slightly different scheme (namely, the page is _never_ copied, so when it is received by the 'remote' side, it still exists on the 'local' side, so modifications will end up in data corruption).

devfs1# ./recv -a 0.0.0.0 -p 1025 -c 1024
devfs2# ./send -a devfs1 -p 1025 -f /tmp/test -c 1024
In case of failure you will get this:
Connected to devfs1:1025.
/tmp/test/1024 -> devfs1:1025
Data was corrupted: ab.
after a short period of time, where 'ab' above is a hex byte written into the mapped file, which has been sent, immediately after sendfile() returns to userspace. Data is supposed to be always zero, and the applications should run forever.
The -c parameter specifies the number of bytes to be sent in each run of sendfile(). It has to be the same on both machines.

This idea was first thought of as soft barriers in distributed storage.

/devel/networking :: Link / Comments (0)


Fri, 29 Feb 2008

Debugging undebuggable.

If something looks undebuggable from the first view, than take a secon one. Better from different angle. Some problems require third look.

A bit of history of the problem. Pohmelfs has extremely large latencies when syncing a local inode to the remote server. This involves sending a command to the server to create an object with the given name and receiving back a response with its real inode information (like the inode number and other fields cached for faster stat() and similar workloads). Pohmelfs then changes the local inode info to match the real data.
Syncing a small tree of 500 files takes about 40 (!) seconds. Well, in the Xen environment where I develop these things, local creation of 500 files in a single ext3 directory takes more than 15 seconds, but the other 25 are pure overhead.
That was a short recap of the previous episodes.

Next, the problems of fixing the problems.
First, the Xen version used on that testing machine is so old that oprofile does not work. Second, I do not know VFS internals well enough to determine where such long delays could be caught (this is my first filesystem; the interested reader can find how I managed to step on likely every possible rake in that field, some of them even small kid rakes...). A Linux filesystem is actually not that complex a system but a set of callbacks, so the implementation is not really outstanding, but knowing under which conditions each callback can be invoked and which problems can arise here or there is kind of magic... Third, the remote userspace pohmelfs server was not actually written by me; instead its bytecode was blown out under the inspiration of some substances, so it could very well be the reason for all the problems. But given that it is trivial, like pretty much all my userspace code, even a total rewrite will not fix the issue.

So, the latency problem in pohmelfs looked really undebuggable. But you know, a cup of excellent tea (from a tea bag) with lemon can fix any problem (or high temperature and substances, or a fair amount of alcohol, everyone has fun the way he likes), so it was first decided to implement a simple network kernel module which would connect to the remote userspace server and exchange messages in a fashion similar to what pohmelfs does.
Such a module was implemented, started, and showed excellent performance (about 1 thousand messages per second sent and received back in the test network, which is several orders of magnitude faster than pohmelfs). So, move back to VFS and pray for inspiration.

Inspiration arrived today (thanks Arnaldo, likely it is because I'm getting healthier :).
I always thought that a number of subsequent recv() calls is not a good idea no matter where, in the kernel or in userspace, since each one takes the socket lock, which in turn can introduce the latencies observed. So I eliminated subsequent recvs in the pohmelfs code (the testing module was written better and does sending and receiving without such 'fragments'), which resulted in... nothing, the results did not change at all. So, a wrong step, but having several sending calls in a row is not a good idea either, so I replaced them with an allocation and a copy, so that there would be only a single kernel_sendmsg() call. As you might expect, performance... changed by 30 times. Just having a single send call instead of two, for some 500 invocations, forced the whole network exchange to behave completely differently.
So, to debug the problem further I extended the testing module and introduced the ability to send and receive data not as a single packet but as two fragments: 4 bytes and the rest of the packet (60 bytes). Here is a result table for 1000 messages sent and received back by the testing module:

no fragments:				1.43 seconds
send fragments (4 and 60 bytes):	40.43 seconds
recv fragments (4 and 60 bytes):	1.43 seconds
both fragmentations:			40.43 seconds
That is a roughly 30-fold difference (40.43 / 1.43 is about 28) from just a simple application change!
tcpdump on the receiving side shows that with fragmented sending a real packet goes out on the wire every time kernel_sendmsg() is invoked, which results in an ack for each such packet (both the 4-byte and the 60-byte one). This completely degrades the TCP window, and the connection just cannot recover under such behaviour.
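
The "allocation and copy" fix can be sketched in userspace C as follows. This is a minimal illustration, not the pohmelfs code: `msg_coalesce()` is a hypothetical helper, and the 4-byte header is copied in host byte order purely for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build one contiguous message out of a 4-byte header and a payload, so
 * that it can be handed to the socket in a single send call. Returns a
 * malloc()ed buffer (the caller frees it) and stores the total length in
 * *len, or returns NULL when the allocation fails.
 */
static unsigned char *msg_coalesce(uint32_t hdr, const void *payload,
				   size_t payload_len, size_t *len)
{
	unsigned char *buf;

	assert(len != NULL);

	buf = malloc(sizeof(hdr) + payload_len);
	if (!buf)
		return NULL;

	memcpy(buf, &hdr, sizeof(hdr));		/* header, host byte order */
	if (payload_len)
		memcpy(buf + sizeof(hdr), payload, payload_len);

	*len = sizeof(hdr) + payload_len;
	return buf;
}
```

The caller then issues one send() for the whole buffer instead of one per fragment; in userspace, writev() would give the same single call without the copy. As a side note, 40.43 seconds for 1000 messages is about 40 ms per message, which looks like the classic interaction between Nagle's algorithm and delayed acks, but that is only a guess and is not confirmed by the post.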

So, all those words were written just to show that even problems that look undebuggable at first glance can be solved easily, and that programming mistakes that look harmless (at first glance again) can lead to very interesting results...

Now back to the drawing board to think about how to improve the pohmelfs protocol even more and get the last bits out of the wire.

Btw, the interested reader can get my network testing module and its userspace part from their newly created homepage.

/devel/networking :: Link / Comments (4)


Fri, 14 Dec 2007

New release of the userspace network stack.

Changed the data reading function: it no longer copies the TCP header into the user's buffer, only the data. Also forced the packet socket reading path to limit the maximum number of packets read that do not match the created netchannel.
As usual, the new release is available from the project homepage.

/devel/networking/unetstack :: Link / Comments (0)


Tue, 04 Dec 2007

The 22nd century netchannels release.

This is the 22nd release of netchannels, a peer-to-peer, protocol-agnostic communication channel between hardware and users. It uses a unified cache to store channels and allows allocating data buffers from a userspace mapped area or from another preallocated set of pages (like the VFS cache). All protocol processing happens in process context.

Users of the system can be, for example, userspace: it can receive and send traffic from the wire without any kernel interference, implement its own protocols and offload their processing to the hardware.

This idea was originally proposed and implemented by Van Jacobson. This patchset (with the userspace network stack) is a logical continuation of the idea, moving towards full peer-to-peer processing.

Short changelog:

  • update cached route in the netchannel when it expires
Thanks to Salvatore Del Popolo (delpopolo_dit.unitn.it) for testing.

You can get the latest sources from the netchannels homepage.

The userspace network stack is available from its own homepage.

/devel/networking :: Link / Comments (0)


Thu, 29 Nov 2007

The 21st netchannels release.

Netchannel is a peer-to-peer, protocol-agnostic communication channel between hardware and users. It uses a unified cache to store channels and allows allocating data buffers from a userspace mapped area or from another preallocated set of pages (like the VFS cache). All protocol processing happens in process context.

Users of the system can be, for example, userspace: it can receive and send traffic from the wire without any kernel interference, implement its own protocols and offload their processing to the hardware.

This idea was originally proposed and implemented by Van Jacobson. This patchset (with the userspace network stack) is a logical continuation of the idea, moving towards full peer-to-peer processing.

One of its users is the userspace network stack.

Short changelog:

  • fixed queue length usage
  • fixed dst release path. Both problems reported by Salvatore Del Popolo (delpopolo_dit.unitn.it)
  • removed nat user
More details can be found on project homepage.

/devel/networking :: Link / Comments (0)


Wed, 07 Nov 2007

iWARP port sharing problem.

I read Roland Dreier's post about the iWARP port sharing problem and want to shed some light on it.
Although Roland described the basics of the technology very well, he skipped the fact that the problem was discussed and a solution was found: the introduction of iWARP-specific aliases, assigned by the administrator, so that the network stack gets a new ifindex and an application bound to a different device does not get the same port as the iWARP ones.
Roland also skipped the part where some improvements were suggested but not implemented (error propagation and fallback, automation of the process (like alias creation) and other bits); most of the time essentially the same answer was received, that it is not needed... Maybe so, but why was this discussion missing from Roland's presentation of the evil empire of network developers?
So, I think, RDMA people do not need a discussion: you want your own ideas to get merged just because you believe they are cool, no matter how things are in real life and what others tell you about it.

I know that, because it was me who performed the first review of the alias patches for iWARP.

/devel/networking :: Link / Comments (0)


Tue, 06 Nov 2007

New release of the userspace network stack.

It is based on patches by Holger Schurig (holgerschurig_gmx.de).
Short changelog for this unetstack release:

  • added netchannel.h, which allows compiling the userspace network stack without netchannels support in the kernel
  • killed warnings about unused variables

/devel/networking/unetstack :: Link / Comments (0)


Saving the universe from the thermal death

or decreasing world entropy. I.e. fixing bugs in the kernel.

My small contribution - fixed sch_teql bug.

/devel/networking :: Link / Comments (0)


Thu, 01 Nov 2007

Network hash tables for socket lookups.

The topic of moving hash tables to RCU comes up regularly on the netdev@ mailing list, but so far there is no solution for the hash resizing problem because of the nature of RCU. Likely it cannot be fixed at all without some additional (maybe optional) synchronization.
It was pointed out that Robert Olsson's hashed trie can be a good solution.

The interested reader can also check my multidimensional trie algorithm, which I implemented for network socket lookup and originally took from netchannels. It was announced on netdev@, but I got quite a passive response, so I froze the project for a while (it can be resurrected though)...
At the links above you can find performance tests compared to in-kernel hash tables of different sizes. Testing was performed by running a simple web server and a huge number of clients, which frequently connect to and disconnect from the server.

/devel/networking :: Link / Comments (0)


Thu, 18 Oct 2007

New release of the userspace network stack.

Short changelog:

  • really fixed leak in raw netchannel reading path
  • changed timestamp setup
  • added retransmit checking timer
  • added sanity checks for addresses and ports processed in the stack - in the case of packet sockets they can sometimes be incorrect (when working over loopback, for example)
  • retransmit logic checks - still requires a bit of work, it is not 100% correct
This release contains a number of really useful fixes, but the retransmit logic is not yet correct. Since unetstack uses a very aggressive (non-RFC-compliant) congestion control algorithm, this can lead (and I see this in practice) to the data flow stalling completely.
I will investigate this problem further later.

/devel/networking :: Link / Comments (0)


Reading userspace network stack code.

	if (!th->ack) {
		ulog("%s: Strange packet.\n", __func__);
		goto out;
	}
Very interesting, what did I mean?

/devel/networking :: Link / Comments (0)


Tue, 16 Oct 2007

Userspace network stack.

I've released a new version of the userspace network stack, which contains a memory leak fix by Salvatore Del Popolo (delpopolo_dit.unitn.it).
Enjoy!

/devel/networking :: Link / Comments (0)


Mon, 08 Oct 2007

Async IPsec support in Linux kernel.

Herbert Xu (the current crypto maintainer) has started preparatory work to make IPsec asynchronous. So far he has moved common code into generic helpers, added an skb shared structure to hold XFRM (the Linux IPsec stack) per-packet state (header and sequence number) during IP processing, which includes the time while the packet is being encrypted, and also removed (or replaced) potentially unused/redundant elements in XFRM processing.
As is, it does not add any async capabilities, since the main output loop, where the crypto (ESP or AH) processing function is called, has not been broken into completion parts, but with the above changes that will be much simpler.
For example, the acrypto async IPsec patch, which was introduced for the 2.6.15 kernel tree, did not have common code for IPv6 and IPv4 processing, so it only supported IPv4 ESP mode.
I'm pretty sure Herbert will implement async processing as a callback invocation model (although I have seen an implementation of a crypto function as busy waiting for completion), where the skb contains all the information needed for further packet handling (mainly either routing information or the device output function obtained from the routing info), and that callback is provided to the encryption device. One big problem with such an approach can be that a crypto hardware device (actually the only real async hardware supported by the existing Linux crypto stack is the HIFN 795x adapters) can call the provided callbacks from hardirq context instead of softirq (or process) context as in the current network stack.
We will see how this will be implemented. I really wish Herbert success and hope it will find its way into the 2.6.24 tree (ugh, I need to complete misaligned handling in the HIFN driver, but the new driver policy is a bit less restrictive about time limits after the merge window is opened by Linus, so I will try really hard to kill my laziness so that the driver will be ready this week).

/devel/networking :: Link / Comments (0)


Tue, 25 Sep 2007

New release of the userspace network stack.

You think I forgot it? No.
Many thanks to Salvatore Del Popolo for kicking me with bugs.

This release contains a number of bug fixes and the ability to be used with packet sockets instead of netchannels.
It also has an extended README with bits of documentation and examples.

The new release is available at the project's homepage.

/devel/networking :: Link / Comments (0)


Sat, 01 Sep 2007

Memory defragmentation.

Christoph Lameter from SGI has posted a patchset that implements memory defragmentation in his SLUB memory allocator.
The main idea of the new version is to find pages which are in the slab but are not referenced, free them, and combine them into bigger chunks (compound pages).

SLAB/SLUB/SLOB still do not support in-page defragmentation, which was one of the main points of my network (tree) allocator: it combined any freed objects lying next to each other, which allowed bigger allocations, and not only pages into compound pages (which it can do too). The main feature of the network allocator was the idea that an object should never be freed on a different CPU than the one where it was allocated. In SLAB this approach is never used, and objects can be freed on CPUs other than the allocating one, which leads to fragmentation.

/devel/networking/nta :: Link / Comments (0)


Thu, 02 Aug 2007

Fixing bug in Linux network stack.

There was an interesting bug posted to netdev@ today by a user named John (actually the same one had been posted several times already, but this time he included a simple application to trigger it). It was possible that at the end of the connection the last TCP segment was sent with the wrong port number, as in the example below:

17:50:43.414212 IP 127.0.0.1.50000 > 127.0.0.1.10250: S 1312601602:1312601602(0) win 1500
17:50:43.452081 IP 127.0.0.1.10250 > 127.0.0.1.50000: S 864201221:864201221(0) ack 1312601603 win 32792 
17:50:43.414364 IP 127.0.0.1.50000 > 127.0.0.1.10250: . ack 1 win 1500
17:50:43.452649 IP 127.0.0.1.50000 > 127.0.0.1.10250: P 1:17(16) ack 1 win 1500
17:50:43.452666 IP 127.0.0.1.10250 > 127.0.0.1.50000: . ack 17 win 32792
17:50:43.452735 IP 127.0.0.1.50000 > 127.0.0.1.10250: R 1312601619:1312601619(0) win 1500
17:50:43.564760 IP 127.0.0.1.54076 > 127.0.0.1.50000: R 1:1(0) ack 17 win 32792
As you can see, the last RST segment contains the wrong port number 54076 instead of 10250. This does not actually break anything, but it is bad as is, so I decided to spend this day helping the world instead of hacking the hash.
The number of tricks I used is more than enormous: debug printks all over the place, padding red zones to catch overflows, delayed operations and even a total rename of the given variable.
The bug scenario was only 'detectable' when the socket is closed while there is unread data, so that an RST should be sent (according to RFC 2525), so it is quite a rare condition.
Eventually I tracked it down to the fact that when the socket is being closed, it already contains the wrong port field. Work with timers showed that in such a short-lived connection none of the three TCP timers fired, but processing of the last RST in the connection above shows that the port number is still valid there.
Stumped. How is it ever possible that nothing happened, but something was broken?
At MIPT, especially in the physics laboratories, I was continuously told that there are no miracles, and I think that is true, so I audited every single usage of the port field in the inet socket structure (surprisingly not that many cases, maybe several dozen only), and eventually found that the inet_autobind() function can change the given field. After a number of debug prints the problem was completely localized and fixed by a simple patch, which checks if the socket is really alive and thus requires binding. The problem described above can only happen for a semi-alive socket, one which was partially released (namely, its port value is freed) and thus smells bad, but can still be accessed from userspace (the socket reference itself is not released), so that any subsequent sending call could end up changing the port number by binding to a new port. My simple fix checks if the socket is only partially alive, and if so, it does not allow sending (which is not allowed later in tcp_sendmsg() anyway) and does not perform autobinding.
That's all, but it sucked up about 6 hours. I do not even know if this number is good or bad.
The bug submitter John reports that this bug exists even in the 2.4.0 kernel, and that it was mentioned on the web multiple times, but that did not push anyone to fix it.
Now it is gone. Even if my fix is not correct, I have provided enough information for the real fix to be simple.

/devel/networking :: Link / Comments (1)


Tue, 03 Jul 2007

Network related VM deadlock prevention.


There is a fundamental issue when doing VM operations over network attached storage/devices: right now each operation requires at least one additional allocation (for the network protocol headers), and frequently (in the case of guaranteed delivery, as in TCP) the sender must also receive from the remote peer either an acknowledgement that the data was written or the data itself, which is another allocation. So, if the system is out of memory and wants to swap a page out over the net, it can deadlock trying to allocate space to send the page or to receive the ack.

Peter Zijlstra (and initially Daniel Phillips) proposed an approach to fix that issue several times: they decided that the best way is to create a reserve pool when the system comes under initial pressure, and then only provide data from it to sockets marked as 'special' beforehand, so that it would be possible to make a little progress, which would likely be enough in most deadlock cases.
I was somewhat opposed to this idea, since in the given implementation it is possible to fail to allocate the reserve, and there is no fair way to mark sockets as 'special': only a couple of them were set up in the kernel, and if the flag were exported to userspace, everyone could put their own sockets into the reservable set and thus effectively defeat the whole idea of providing the reserve only for the real needs of deadlock avoidance.

Instead I proposed the network allocator, which was specially designed to be used exclusively by network users.
It grabs a number of pages from main memory and uses them for skb allocations, thus effectively not depending on main memory conditions. Such separation is the way to go in a perfect world, but in real life there are problems too (one of them is the very idea of separating main system allocations from networking ones, which raised objections from people). Although the network allocator has a set of features especially useful in a network environment, right now I want to talk not about it, but about deadlock avoidance.

Distributed storage is just such a device, which can suffer (as can any other, actually) from the situation described above, so I need to think about how to solve it without too invasive changes in the rest of the kernel.
The best thing, I think, is to take ideas both from the network allocator and from Peter Zijlstra's work: I plan to create a patch which would allow binding an independent reserve to any socket. Such a reserve can be stolen from the socket buffer itself (each socket has a limited socket buffer which packets are allocated from; it accounts both data and control (skb) lengths), so when the main allocation via the common path fails, it would be possible to get data from the socket's own reserve. This allows sending sockets to make progress in case of deadlock.
For receiving the situation is worse, since the system does not know in advance which socket a given packet will belong to, so it must allocate from a global pool (and thus there must be an independent global reserve), and then exchange part of the socket's reserve for the global one (or just copy the packet to a new one allocated from the socket's reserve if it was set up, or drop it otherwise). A global independent reserve is what I proposed when I stopped advertising the network allocator, but it seems that it was not taken into account, and the reserve was always allocated only when the system is under serious memory pressure.
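
The fallback logic of the two paths can be sketched with a toy model that tracks byte budgets instead of real memory. Everything here is an assumption made for illustration: `struct pool`, `struct toy_sock` and the function names are hypothetical and do not come from any existing patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy byte budget standing in for a memory pool. */
struct pool {
	size_t free;
};

/* Toy socket: only the per-socket reserve is modelled. */
struct toy_sock {
	size_t reserve;	/* bytes set aside for this socket alone */
};

enum alloc_src { ALLOC_FAILED, ALLOC_MAIN, ALLOC_RESERVE };

/* Take len bytes from a pool, or fail without changing it. */
static bool pool_take(struct pool *p, size_t len)
{
	if (p->free < len)
		return false;
	p->free -= len;
	return true;
}

/* Sending: common path first, the socket's own reserve only on failure. */
static enum alloc_src sock_alloc_send(struct pool *main_pool,
				      struct toy_sock *sk, size_t len)
{
	assert(main_pool && sk);

	if (pool_take(main_pool, len))
		return ALLOC_MAIN;
	if (sk->reserve >= len) {
		sk->reserve -= len;
		return ALLOC_RESERVE;
	}
	return ALLOC_FAILED;
}

/*
 * Receiving: the packet was taken from the global reserve before its
 * owner was known. Once the owner is found, the global reserve gets its
 * bytes back; the packet is kept only if the owner's reserve can pay for
 * it, otherwise it is dropped. Returns true when the packet is kept.
 */
static bool sock_charge_recv(struct pool *global_reserve,
			     struct toy_sock *sk, size_t len)
{
	assert(global_reserve && sk);

	global_reserve->free += len;
	if (sk->reserve < len)
		return false;
	sk->reserve -= len;
	return true;
}
```

In this model a socket without a reserve simply loses packets under memory pressure, while a socket with one keeps making progress in both directions.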

Why is this idea better (from my point of view) than the first two?
First, because it is not as invasive as the network allocator.
Second, it allows separating sockets and effectively makes them fair: a system administrator or programmer can limit the socket's buffer a bit and request a reserve for special communication channels, which will then be guaranteed to make both sending and receiving progress, no matter how many of them were set up.
Third, it does not require any changes outside the network code.

/devel/networking :: Link / Comments (0)


Sat, 12 May 2007

TCP congestion avoidance algorithms.


Stephen Hemminger ran some simple tests to show how modern Linux congestion algorithms work; his graphs can be found here.
He tested a 1Mbit DSL link with a 100ms RTT.
As we can see from the graphs, they all follow the RFC, created more than 10 years ago, which requires halving the sending window in case of congestion (duplicate acks received by the sender).
Speeds have changed hugely since those times...

So, I'm thinking about a high-speed congestion control suitable for the small RTTs of modern ethernet networks, which will not halve the sending window, but decrease it exponentially until the decrease rate reaches 50% of the current window size. Maybe the word 'exponentially' is a bit scary, so let me describe the idea in more detail.
Let's say the sender has received a duplicate ack, which means that some segment arrived out of order (either there is reordering in the network or, more likely, some segment was lost). Fast retransmit suggests resending the missing segment and then halving the window. But what if the sender decreases the window only slightly, say by 10%; if the rate is still too high and new duplicate acks keep arriving, the window is decreased by 20%, and so on, until we reach the 50% limit dictated by the old RFC.
Actually such a congestion control algorithm is implemented in my userspace network stack, but it is much smaller, since my stack does not support extended states like the Linux kernel has (and a lot of congestion controls beyond old Reno use them).
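
The stepped decrease described above (10%, then 20%, and so on up to the 50% cap) can be written down as a small sketch. It is an illustration of the idea only, not the code from the userspace network stack; `cut_percent()` and `cwnd_after_dupack()` are hypothetical names, and the window is counted in segments as an assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Percentage cut for the n-th consecutive congestion signal (n starts
 * at 1): 10%, 20%, 30%, ... capped at the 50% of the classic halving.
 */
static unsigned int cut_percent(unsigned int n)
{
	if (n >= 5)
		return 50;
	return 10 * n;
}

/*
 * Window (in segments) after the n-th consecutive duplicate-ack event.
 * The window never drops below one segment.
 */
static uint32_t cwnd_after_dupack(uint32_t cwnd, unsigned int n)
{
	uint32_t next;

	assert(n >= 1);

	next = (uint32_t)((uint64_t)cwnd * (100 - cut_percent(n)) / 100);
	return next ? next : 1;
}
```

Starting from a window of 100 segments, three consecutive events give 90, then 72, then 50 segments, whereas plain halving would already be at 12 after three cuts.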

Need to think about it...

/devel/networking :: Link / Comments (0)


Tue, 08 May 2007

Unified socket storage.


I've just released a second patchset which implements a unified cache of sockets for the network stack instead of the old hash tables. It stores all types of sockets (although I have only implemented af_inet, unix, netlink and raw ones for now) in a single object structure called a multidimensional trie (which is similar to a Judy array in some ways).

I performed a simple performance test with a handmade client and httperf. The former is just an epoll-driven client which issues the requested number of requests one by one (or with some concurrency, which has not yet been proven to work correctly). With mpm apache on the test machine I got a sustained 2k requests/s for mdt and about 1200/s for the (untuned) hash. With lighttpd and httperf (10k max, 1k rate) I got a sustained 1k/s for mdt and 550-1000/s for the untuned hash. With the tuned hash (thash_entries=1000000) I got 1k/s for both; with httperf at 30k max, 3k rate I got 1650 for mdt and 1k for the tuned hash. The server was lighttpd 1.4.13 (a handmade server, as well as 'echo -en "GET / HTTP/1.0\n\n" | nc server 80', does not work for an unknown reason, I did not investigate). The results are quite small for that machine (AMD Athlon64 3500+ with 1GB of RAM and a gigabit r8169 adapter), but I have all debug options turned on (including heavy slab/vm).

The patch has been sent to netdev@ for review. I asked for a discussion about the future of this project before making any further steps (mainly statistics code).

/devel/networking :: Link / Comments (0)


Sun, 06 May 2007

Unified socket storage testing.


I've written a utility which managed to heavily crash my testing system. I expected that, so it is currently being investigated. The utility is a quite small and simple web request generator, which can work with different concurrency levels and maximum numbers of requests. I wrote my own instead of using Apache benchmark or httperf because both are quite heavy and do not allow testing the remote side fairly (ab is client limited: when I tested kevent it used 100% of the client's CPU, which should not happen; httperf uses old poll, so it does not scale to thousands of simultaneous requests). My application uses epoll, is quite small, and is not intended to replace either of the above, but was created only for my own tests. So far it is not 100% complete, but it already crashes the unified socket trie cache.

/devel/networking :: Link / Comments (0)


Van Jacobson's talk about modern networking and related problems.


@Google.video.

/devel/networking :: Link / Comments (0)


Fri, 20 Apr 2007

Power-of-two allocators.


Eric Dumazet has raised an interesting question about the existing power-of-two allocator in relation to the no-mmu implementation: is it possible to allocate a higher-order page and then return part of it as unused (for example, if someone has allocated an order-10 page and then returns an order-8 part)?
As far as I can tell (I'm not absolutely sure though) it is impossible with the SLAB one: each page can only be 'split' into same-sized parts, so either one order-10, or two order-9, or four order-8, but not one order-8 plus an order-10 minus an order-8.
That was one of the reasons I created the network allocator, which I proved does fix such power-of-two overhead within a single page, i.e. blocks are combined when freed to form bigger ones, and it is possible to allocate exactly the requested block, not aligned to a power-of-two boundary.
But my allocator did not get enough attention (did I say that already about something unrelated? :), so it was postponed for a bit.
Let's see if there will be some interesting suggestions in the thread.
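
The page-count arithmetic behind the question can be checked with a few helpers. This is just arithmetic on block sizes, not the logic of any kernel allocator, and the function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Smallest order whose block (1 << order pages) can hold `pages` pages. */
static unsigned int order_for(unsigned long pages)
{
	unsigned int order = 0;

	assert(pages > 0 && pages <= (ULONG_MAX >> 1) + 1);

	while ((1UL << order) < pages)
		order++;
	return order;
}

/* Pages wasted when a request is rounded up to a power-of-two block. */
static unsigned long pow2_waste(unsigned long pages)
{
	return (1UL << order_for(pages)) - pages;
}

/*
 * Express `pages` as a sum of power-of-two blocks, largest first. Up to
 * `max` orders are stored in `orders`; the return value is the number of
 * blocks needed, which may be larger than `max`.
 */
static size_t split_orders(unsigned long pages, unsigned int *orders,
			   size_t max)
{
	size_t n = 0;
	unsigned int bit = sizeof(pages) * CHAR_BIT;

	while (bit-- > 0) {
		if (pages & (1UL << bit)) {
			if (n < max)
				orders[n] = bit;
			n++;
		}
	}
	return n;
}
```

An order-10 block is 1024 pages; giving back an order-8 part (256 pages) leaves 768 pages, which is one order-9 plus one order-8 block and not a single power-of-two block. Conversely, a 768-page request rounded up to order-10 wastes 256 pages.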

Update: David Howells of RedHat seems to be sure that it is possible to allocate an order-10 page and then release part of it (say an order-8 subpage). But from the whole thread it seems that he is talking about the no-mmu case, which can work on top of the SLOB allocator in some embedded system.

/devel/networking/nta :: Link / Comments (0)

