Zbr's days.
June
Sun Mon Tue Wed Thu Fri Sat
         
19
2007
Months
Jun

About :: TODO :: Blog :: RSS :: Old blog :: Projects :: GIT :: Gallery :: Notes

Tue, 19 Jun 2007

CRC performance depending on the word size.


To select perfect CRC coding technique which will be used for distributed storage I'm about to start designing, I've ran set of tests to determine how CRC speed depends on size of the word used in calculation. I use simple XOR CRC which just performs XOR operation over provided data block several times and then shows its performance (i.e. number of operations per timeframe).

Here is CRC performance for the same blocksize and number of iterations (i.e. number of CRC sums performed over the same block):

size: 4096, num: 100000, bits: 64, speed: 459.630646 kop/sec.
size: 4096, num: 100000, bits: 32, speed: 348.913483 kop/sec.
size: 4096, num: 100000, bits: 16, speed: 176.067184 kop/sec.
As expected, CRC speed was improved with increasing size of the word, unfortunately for galois field multiplication speed increase is not enough.
Let's discuss Reed-Solomon coding, which requires galois-field multiplication of checksummed (or raw) data words.
For 16 bit CRC sum galois filed multiplication can be performed using 2^16 table (2 lookups per multiplication or division operation) lookup, but for 32 bit CRC galois field multiplication in GF(2^32) can not be performed using table, since it will be too big. To perform galois field multiplication essentially one needs to solve a system of equations, number of equations is equal to size of the values being multiplied, i.e. 32 in our case. Even the fastest method of solving system of equations will be much much slower than 16 bit CRC plus its 2 lookups. Even using Cauchy matrixes for Reed-Solomon encoding with galois multiplication instead of Vandermonde one results in much slower code than table lookup.

So, for the maximum performance we either need to limit Reed-Solomon system to 16 bit CRC (galois field multiplication is used in RAID coding for example), or to use different coding method.
I'm studing WEAVER codes which are faster by design (and use only XOR operations without complex galois field multiplications), but they have own limits, so I'm trying to understand how distributed system with such erasure coding will suffer from using it.

/devel/math/codes :: Link / Comments (0)


Network batched xmit testing.


I've ran several simple tests with desktop e1000 adapter I managed to find.

Test machine is amd athlon64 3500+ with 1gb of ram. Another point is dektop core duo 3.4 ghz with 2 gb of ram and sky2 driver.

Simple test included test -> desktop and vice versa traffic with 128 and 4096 block size in netperf-2.4.3 setup.

Test machine runs 2.6.22-rc5-batch and mainline tree (there is a test with 2.6.22-rc4 and there is a noticeble performance win compared to that tree in the latest git, likely tcp congestion changes resulted in better utilisation). Batched xmit has better numbers.

Results:

2.6.20-ubuntu -> test machine

2.6.22-rc4-batch:
128: 725.43 724.14
4096: 698.63 712.77

2.6.22-rc5-mainline:
128: 760.91 762.04
4096: 784.32 788.53

2.6.22-rc5-batch:
128: 766.70
4096: 788.24


test machine -> 2.6.20-ubuntu

2.6.22-rc5-mainline:
128: 558.16 (Desired confidence was not achieved within the specified iterations.)
4096: 814.01

2.6.22-rc5-batch:
128: 569.14 (Desired confidence was not achieved within the specified iterations.)
4096: 822.72
Batch patchset's design is quite simple: instead of setting up DMA transfer for each new skb, it queues them and if there are several packets in the queue sets up DMA once for whole batched queue (not more than a threshold though).

Update: here is a result for pktgen UDP testing:
batch (batch_min: 0):   241319pps 115Mb/sec
batch (batch_min: 400): 182607pps 87Mb/sec
mainline:               497520pps 238Mb/sec

/devel/other :: Link / Comments (0)


Recent developments.


I was too lazy for the latest week and actually did not perform anything interesting. I can only complted testing and bugfixing of Jens Axboe's network splice (bug happend to be in generic splice code, but was triggered only by network splice patchset, likely it will be included in the next release), some work on SLAB page reference counting (that was even interesting - it was a ground work for initial network splice mechanism, but eventualy I found it is not enough and simpler way is to just reference an skb and do not free it until splice pipe is empty). Right now I'm setting up a testbed for Jamal's network batching (although I did not want to participate initially - first because I did not see a big margin in the idea, probably because I did not understand it fully, and second, work with Jamal was not that productive - after his request for libevent support of kevent he dissapeared and frequently all talks were never ended with code, but with empty discussions, which I do not like). Anyway, I decided to answer, since I was in Cc: list (hint: if I will be in kernel summit and in good mood, I will rise a request to always answer if person is in Cc list, original sender added people on purpose, and if someone has nothing to say, just say that 'it is not interesting for me', 'I have no time', 'I will look at it later' or 'it is in my backlog, I remember this' but never fucking dissapear like in black hole).
So, I found old e1000 adapter and will test Jamal's network batching code with simple netperf setup.

Another testing task is to run btrfs tests in my automatic filesystem testbed (well, semi-automatic, graph generation scripts require manual run so far, maybe I will extend it later). Will start, when Chris Mason released 0.3 version (home quite soon).

But frankly, I'm too bored with paid work testing (I need to write a bug reports for other's automatic testing system, which can not start shell scripts), with splice testing and batching testing.

I want my own project to continue, so expect something new soon.

/devel/other :: Link / Comments (0)