Mon, 30 Jun 2008
Filesystem development rumors.
Rumor number one. SWsoft
aka Parallels is actively searching for Linux kernel hackers in
the leading Moscow universities, namely MSU and MIPT. I saw their
posters, where distributed filesystem knowledge is listed among
the other (wanted) requirements.
Rumor number two. Alexey Kuznetsov (if you do not know,
he is the guy who wrote the major part of the Linux network stack,
namely the TCP/UDP/IP and socket implementations, and although
there have been lots of changes in the stack since then, I think it will not
be an exaggeration to call him the author), who also worked
on Virtuozzo and OpenVZ (and their interesting VFS parts, which
AFAICS are not in the kernel, maybe yet), is working on some
filesystem too. The last time we 'confronted' each other was a couple
of years ago, when I first implemented
netchannels
and tried to convince the network community (namely Alexey Kuznetsov
and David Miller)
that the netchannel idea was worth further investigation and implementation.
IIRC I did not succeed, although the results were very
impressive.
Let's see what will happen with filesystems :)
Rumor number three. SWsoft recently started to actively search
for a kernel hacker for a 'new interesting open source project'. They
have always searched for kernel programmers, but never said anything
about the projects; now something has changed.
Rumor number four. OpenVZ and Virtuozzo have serious problems with NFS
(especially when the server dies), probably because of the very ugly NFS protocol
(yes it is), so it's hard to properly virtualize it (or not?). There are
no alternatives to NFS right now in major production setups, but you all know about
POHMELFS
which right now can be used as a really good replacement.
Rumor number five. SWsoft has a long history of PhD defenses (at least at MIPT) based on
a theoretical FS called TorFS (namely Tormasov FileSystem). A year ago it was still
not a very alive project in practice,
but I heard that it was very impressive in theory. This rumor has been around
for many years.
So, I have quite a clear picture: SWsoft has started development of a new
distributed filesystem, which is aimed first of all at replacing NFS in virtualized
environments. I can also imagine very interesting distributed parallel facilities
needed for virtualized systems. And they are trying to attract lots of people to the
project, as well as really heavy artillery like Alexey Kuznetsov.
Which basically means that sooner or later my development will meet strong
competition from this company, which has lots of really good professionals.
And that's very interesting and cool :)
P.S. Or it may be complete bullshit and a delirium of my fevered consciousness.
And one fact about
POHMELFS:
today I finished client support for padded crypto processing of all requests
and started to work on the server bits. I expect to finish them in a day or so,
so the new release is very close.
/devel/fs :: Link / Comments (3)
Sat, 28 Jun 2008
Listened to how my trumpet can sound.
It was really interesting. Although it is a very simple student
model, a friend produced very good sounds. He has not practiced
for many years, but nevertheless it was not that bad.
My everyday half-hour to hour exercises usually produce a worse sound, although
sometimes I do find really cool notes. Unfortunately I still do not
know the magic bit about how to hold on to that sound: it is born and
disappears on its own, but I'm sure I will find it, and I think I'm close
to where it hides :)
/other :: Link / Comments (0)
Need to rethink POHMELFS crypto a bit.
1. Encryption alignment problem: data to be encrypted has to be
blocksize aligned, so some information about padding has to
be added to the network command, as well as the crypto data size.
2. IV generation. I decided to extend the network command and put a
64-bit IV for the given packet there. Using a simple sequence number is enough
to protect against replay attacks.
3. Encrypting/hashing data. I decided not to encrypt/hash network headers,
and only do it for transmitted data. If a transaction contains several
commands, data for all commands will be encrypted/hashed; in the case of a hash,
a single digest/HMAC will be generated and placed into the transaction header.
4. It is possible that I will add a strong header checksum, which will be generated
only for the header and placed into a special field. It will be calculated
assuming the checksum field is zero. This step is optional so far, but the network header
has 32 reserved bits, which can be used for it.
Right now hashing and encryption work, but are not checked on the server (although generated);
because of the crypto alignment ugliness I decided to rethink the approach a bit.
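Items 1, 2 and 4 of the list above can be sketched roughly as follows. This is only an illustration in Python, not POHMELFS code: the header layout, the field order, the 16-byte block size and the use of CRC32 as a stand-in for a "strong" checksum are all assumptions made for the example.

```python
import struct
import zlib

BLOCK_SIZE = 16  # assumed cipher block size in bytes

# Hypothetical wire header, network byte order:
# command (u32), padded data size (u32), padding length (u32),
# IV / sequence number (u64), header checksum (u32).
HEADER = struct.Struct("!IIIQI")


def pad(data: bytes, block_size: int = BLOCK_SIZE):
    """Zero-pad data to a multiple of block_size; return (padded, padding)."""
    padding = -len(data) % block_size
    return data + b"\0" * padding, padding


def header_checksum(cmd: int, size: int, padding: int, iv: int) -> int:
    """CRC32 of the header packed with its checksum field set to zero."""
    return zlib.crc32(HEADER.pack(cmd, size, padding, iv, 0)) & 0xFFFFFFFF


def build_command(cmd: int, data: bytes, seq: int):
    """Return (header, padded_data); the IV is just the sequence number."""
    padded, padding = pad(data)
    csum = header_checksum(cmd, len(padded), padding, seq)
    return HEADER.pack(cmd, len(padded), padding, seq, csum), padded


def verify_header(raw: bytes) -> bool:
    """Recompute the checksum over the header with the field zeroed."""
    cmd, size, padding, iv, csum = HEADER.unpack(raw)
    return header_checksum(cmd, size, padding, iv) == csum
```

The receiver needs the padding length from the header to strip the zero bytes after decryption, which is why it travels next to the data size.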
Evolution process in action...
/devel/fs :: Link / Comments (0)
Fri, 27 Jun 2008
0:3
That really sucked - yes, we played badly. Just like before.
It is not really surprising.
But what was that fucking abnormal thing a week ago against Holland? That
was new, was cool, was bloody great, but not today. Tired or whatever...
What's the difference now, we lost.
Yes, Spain played really well, my congratulations.
But our team showed that it is possible.
That there is nothing impossible.
We can, when we want. You can, when you want.
Thanks a lot for the games!
/other :: Link / Comments (0)
Thu, 26 Jun 2008
POHMELFS server got initial crypto processing capabilities.
The POHMELFS server is able to negotiate hash/cipher names and operation
modes during the handshake, to initialize the appropriate algorithms and to perform basic operations
(like a more generic hash_update() instead of different
functions with different arguments used to hash data depending on the operation mode,
either simple digest or HMAC: EVP_DigestUpdate()/HMAC_Update()).
I'm working on the right way of doing crypto processing, i.e. one without serious changes in the code,
since how it is done right now is a bit hairy.
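The idea of one generic hash_update() that hides the digest/HMAC difference can be illustrated like this. It is a Python sketch using hashlib and hmac rather than the OpenSSL C API the server actually uses; make_hasher() is a hypothetical helper name, and only hash_update() comes from the text.

```python
import hashlib
import hmac
from typing import Optional


def make_hasher(name: str, key: Optional[bytes] = None):
    """Create a plain digest context, or an HMAC context when a key is given."""
    if key is None:
        return hashlib.new(name)
    return hmac.new(key, digestmod=name)


def hash_update(hasher, data: bytes) -> None:
    """Single entry point: callers do not care which mode was negotiated."""
    hasher.update(data)
```

Both kinds of context expose the same update()/digest() pair, so the mode only matters once, when the context is created after the handshake.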
I already hate the OpenSSL API: EVP_get_cipherbyname(), EVP_MD_CTX, EVP_DigestFinal_ex().
It looks like the above functions were written by three different people who
never actually talked to each other about how to make them look similar... But it is
a minor issue of course.
So, when things have settled down, I will make a new release; likely it will see the light this week.
/devel/fs :: Link / Comments (0)
Hacking your ISP for fun and profit.
My ISP has again blocked my account and cannot unblock it although there
is money on the deposit. There are serious problems in its billing
system which require manual intervention by an operator. Unfortunately
it is a real challenge to call them: it already took more than half an hour
yesterday, without success.
So, I decided to implement an interesting idea on how to bypass the blocking.
It is based on a security 'hole' in its DNS configuration (and I think the vast majority
of ISPs do the same), which allows
requesting any DNS record even if the account is blocked. The record will be fetched from
a remote DNS server if it is not in the ISP's cache.
Thus the attack vector becomes visible: implement an IP over DNS tunnel network device
and set up local routing to use it by default. One has to control at least one
remote machine which hosts the DNS records for a given domain name, since it is required
to parse incoming DNS requests and process them accordingly.
There are at least two known IP over DNS tunnel solutions:
NSTX
(howto) and
OzymanDNS
(howto). Both solutions require that you own a
server to run the IP-over-DNS tunnel server on.
Unfortunately I have only a single machine with a static IP address which is not protected
by lots of firewalls and allows incoming connections.
The simplest solution for this problem is to create an iptables input target rule
on the server, which will parse incoming DNS requests, pass the usual queries up
the network stack to the userspace server, and handle 'poisoned' queries as tunnel traffic.
The client can be TUN/TAP based, but can also be a tunnel network device.
I believe the weirder it looks, the more interesting it is, so I will likely think
more about kernel based tunnels.
DNS queries are too limited to carry raw binary data (IIRC,
the most interesting are DNS TXT records), but the data can be appropriately
encoded and enciphered. So, I will put it into the
todo list.
I even think that it is not that bad an idea to have such modules in the kernel :)
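To make the "appropriately encoded" part concrete, here is a minimal sketch of packing binary payload into the name of a DNS query. It is not taken from NSTX or OzymanDNS: the function names and the `t.example.com` domain in the usage note are hypothetical, and base32 is chosen only because DNS names are case-insensitive. The 63-byte label limit comes from RFC 1035; 253 is the usual limit for a full name in text form.

```python
import base64

MAX_LABEL = 63   # maximum length of one DNS label (RFC 1035)
MAX_NAME = 253   # maximum length of a full name in dotted text form


def encode_query(payload: bytes, domain: str) -> str:
    """Encode payload as base32 labels in front of the tunnel domain."""
    text = base64.b32encode(payload).decode("ascii").rstrip("=").lower()
    labels = [text[i:i + MAX_LABEL] for i in range(0, len(text), MAX_LABEL)]
    name = ".".join(labels + [domain])
    if len(name) > MAX_NAME:
        raise ValueError("payload does not fit into a single DNS name")
    return name


def decode_query(name: str, domain: str) -> bytes:
    """Reverse encode_query(): strip the domain, join labels, restore padding."""
    suffix = "." + domain
    if not name.endswith(suffix):
        raise ValueError("not a query for the tunnel domain")
    text = name[:-len(suffix)].replace(".", "").upper()
    text += "=" * (-len(text) % 8)
    return base64.b32decode(text)
```

With a domain such as `t.example.com` one query name can carry somewhat over a hundred bytes, so an IP packet would normally have to be split across several queries; replies could travel back in TXT records.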
/devel/other :: Link / Comments (8)
Wed, 25 Jun 2008
POHMELFS input crypto processing engine is ready for testing.
But testing cannot be done without appropriate server support, which
is now the main task. POHMELFS uses a lazy crypto engine: each network state
(it represents a connection between the client and one server) contains
a number of fields used exclusively for semi-lockless input data processing
(it locks the state when it performs the actual reading, but does not
hold that lock when processing incoming messages, since it is the only
path which receives data). Now it also has crypto information about
how to manage reply messages (they include the read page reply, for example),
so it does not queue work to be done by crypto threads, but does it itself
instead. It may or may not be the bottleneck of the input path; tests will
provide facts. So far I do not have plans to change it, but it can be done
of course if performance sucks.
After I finish crypto processing in both the client (it has been written, but requires lots
of testing with the server) and the server (I have just started to recall how to work with
OpenSSL. Well, I've read how HMAC works in OpenSSL, found it to be simple enough
and then started to read how to parse binary data in LISP :)
But anything which is interesting for me now ends up in good results for all other
projects), I will switch to something different for a while.
Some voices in my brain ask me to spread out in lots of interesting directions :)
/devel/fs :: Link / Comments (0)
POHMELFS crypto performance.
I ran the read/reread and write/rewrite tests as described
in the previous run,
now with HMAC(SHA1) of all outgoing transactions (note that read response data is not yet
encrypted and does not contain a digital signature; the server also does not support either operation).
Essentially only writing should be affected by this, but I also ran the reading tests for completeness.
Results show zero performance overhead for full data SHA1 hashing, but note that quite fast
machines were used (two 3 GHz Xeons (2 physical and 2 logical CPUs, HT enabled) with 1 GB of RAM). The whole time only
two crypto threads were actively hashing data, since there are only two pdflush threads on this machine.


Writing is even faster with hashing, but results drifted around, so essentially performance is the same.
/devel/fs :: Link / Comments (0)
Tue, 24 Jun 2008
VM gotcha: forbidden double kmapping.
I've just learned that it is impossible to map the same page
twice: for example the first time using kmap()/kunmap()
and the second time via kmap_atomic()/kunmap_atomic().
Although the mechanisms are a bit different in the two mappings, it is
forbidden, and the system will panic like this:
IP: [] kmap_atomic_prot+0x1b/0xc5
*pdpt = 0000000031c79001 *pde = 0000000000000000
Oops: 0000 [#1] SMP
Pid: 6478, comm: pohmelfs-crypto Not tainted (2.6.25 #27)
EIP: 0060:[] EFLAGS: 00010202 CPU: 2
EIP is at kmap_atomic_prot+0x1b/0xc5
EAX: ebc7c000 EBX: 00000003 ECX: 00000000 EDX: 00000003
ESI: 00000fdc EDI: 00000163 EBP: 80000000 ESP: ebc7dee4
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process pohmelfs-crypto (pid: 6478, ti=ebc7c000 task=f25040b0 task.ti=ebc7c000)
Stack: 00000000 00000003 00000fdc f7cf4078 00000fdc c0114144 00000163 80000000
c01991b1 ebc7df44 f70e3580 00000000 ebc7dfa8 ebc7df40 f70e3580 00000003
00000000 f7cf4000 f70e3580 f70ff8b0 f70ff880 f7096c00 c019a771 f70e3580
Call Trace:
[] kmap_atomic+0x11/0x14
[] update2+0x7c/0x13f
[] hmac_update+0x49/0x50
[] pohmelfs_crypto_thread_func+0x304/0x3e8 [pohmelfs]
[] hrtick_set+0x7a/0xd7
[] autoremove_wake_function+0x0/0x2b
[] pohmelfs_crypto_thread_func+0x0/0x3e8 [pohmelfs]
[] kthread+0x38/0x5f
[] kthread+0x0/0x5f
[] kernel_thread_helper+0x7/0x10
This happened in exactly the above case, when a page was first mapped via
kmap() in POHMELFS and then via
kmap_atomic() in the HMAC crypto processing code.
I wonder what will happen if we ever try to send kmapped pages
over an IPsec tunnel. Likely it will oops too...
This can happen for example when pages are mapped in
tcp_sendpage() when calling sendfile()
over an interface which does not support hardware checksumming
and scatter-gather: mapped pages are pushed down the network stack
where they will eventually be encrypted/hashed in IPsec, which
will in turn call kmap_atomic().
So, if you find an obscure oops in kmap_atomic()
and friends, first check that the calling stack did not map the page
earlier.
/devel/other :: Link / Comments (0)
Mon, 23 Jun 2008
POHMELFS client got initial part of multithreaded crypto/checksum processing.
So far it only includes encryption and hash calculation for outgoing
transactions. The system has a number of threads per superblock (set by a mount option),
which are responsible for encryption/hashing (each thread has its own crypto structure,
so there are no additional allocations in the fast path, although I think
they would not harm performance since they should be a small enough
fraction on top of the crypto processing overhead) and subsequent data sending,
so the original caller (like writeback/readahead code) will not block if there
are ready threads; otherwise it will wait until some thread finishes its current crypto work.
I decided to implement a kind of continuation for such transactions: the network sending
code (which is supposed to be started after crypto processing) will be invoked from the threads
which performed the crypto operations, without returning back to the original caller context.
For massively multiqueue NICs that should be a benefit, but so far I have not tested its performance.
The next step is crypto support on the receiving side and userspace changes.
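The thread-plus-continuation scheme described above can be modelled in userspace roughly like this. It is a Python sketch under stated assumptions, not the kernel implementation: the function names are hypothetical, SHA1 stands in for whatever hash was configured, and a bounded queue only approximates "the caller waits when no thread is ready".

```python
import hashlib
import queue
import threading


def crypto_worker(jobs, send):
    """Hash each queued transaction, then send it from this same thread."""
    while True:
        data = jobs.get()
        if data is None:  # shutdown marker
            return
        digest = hashlib.sha1(data).digest()
        # Continuation: the network send runs here, in the crypto thread,
        # instead of handing the result back to the submitting caller.
        send(data, digest)


def start_crypto_threads(count, send):
    """Start `count` workers; the bounded queue makes submitters wait when full."""
    jobs = queue.Queue(maxsize=count)
    threads = [threading.Thread(target=crypto_worker, args=(jobs, send))
               for _ in range(count)]
    for thread in threads:
        thread.start()
    return jobs, threads


def stop_crypto_threads(jobs, threads):
    """Queue one shutdown marker per worker and wait for all of them."""
    for _ in threads:
        jobs.put(None)
    for thread in threads:
        thread.join()
```

A submitter just calls `jobs.put(data)` and moves on; it never sees the digest, because the worker that computed it also performs the send.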
/devel/fs :: Link / Comments (0)
Crypto processing in POHMELFS. OpenSSL vs GNU TLS.
If I did not miss something,
GNU TLS (I have never worked with it)
supports a very limited number of ciphers and hashes, so it is not appropriate for
a filesystem data protection layer.
According to its
documentation
GNU TLS only supports the AES, RC4 and 3DES ciphers and the SHA1 and MD5 hashes. There is also only the CBC
chaining mode and several hash/cipher schemes.
So, the POHMELFS server will use OpenSSL for data protection. Sooner or later OpenSSL
will get hardware crypto support on Linux too (well, the Linux crypto stack should first
implement a userspace API, which does not exist yet, although there is
work
by Loc Ho from AMCC to add such support).
So far I have decided to implement the following protection scheme: checksum or encryption
will cover the full transaction data, but will be applied in chunks:
- Transaction 'first-level' data, i.e. the header and data placed immediately after the transaction
header. For all commands except page writing that will be all.
- For the write pages command, each header is generated dynamically and does not exist
until data is really being sent, so crypto code will run over all pages and update the checksum,
processing headers and data pages separately. Checksum update should be simple enough, since
there are crypto helpers to update and finalize a checksum, but encryption is more complex:
it requires all chunks to be set up in advance in a single scatterlist chain, and with dynamic header
generation that is too big an overhead (it requires not only scatterlist allocation, but also
header allocation just for encryption), so encryption will be done separately for headers and pages,
and I will have to create some IV propagation scheme (like the last bytes of the previous unencrypted chunk
becoming the IV for the next chunk, or something like that). I understand that it may not be a very
secure approach though.
- Reading data back from the server is simpler, since there are no transactions,
and data will be encrypted/checksummed like in the first step above. It is possible that it will
force me to enlarge the network header structure a bit (32 or 16 bits to store the size of the attached checksum).
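The checksum half of the write-pages case above (headers generated on the fly, pages fed separately, one digest at the end) can be sketched as follows. This is a Python illustration only: sign_write_transaction() and the header format in the usage note are made up, and HMAC-SHA1 is just one possible choice of checksum.

```python
import hashlib
import hmac


def sign_write_transaction(key: bytes, chunks) -> bytes:
    """Feed (header, page) pairs into one running HMAC and finalize it once.

    `chunks` may be a generator, so each header can be built only at the
    moment it is needed and never has to be kept around afterwards.
    """
    mac = hmac.new(key, digestmod=hashlib.sha1)
    for header, page in chunks:
        mac.update(header)
        mac.update(page)
    return mac.digest()
```

For example, `sign_write_transaction(key, ((b"hdr%d" % i, page) for i, page in enumerate(pages)))` gives the same digest as hashing all headers and pages concatenated into one buffer, which is why no scatterlist-like structure has to be prepared in advance for the checksum. Encryption does not have this property, hence the IV propagation problem described in the list.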
/devel/fs :: Link / Comments (4)
Sun, 22 Jun 2008
3:1. It is fucking unbelievable, but we wooooon!
We won against Holland, one of the strongest football teams!
There was blood on the field, rule breaks and other crap, but we made it!
3:1 against Holland. "Russia has risen from its knees", as they screamed.
We are in the semifinal. You have been warned, Russians are coming! :)
Pivo and vodka, we will not sleep today!
/other :: Link / Comments (3)
We do not believe in penalties!
It is fucking unbelievable, but Russia is playing Holland
and the score is 1:1. Not only is it a draw, we are playing cool football!
And Holland equalized in the 87th minute, we were so close, but
it is not over yet. We can win. We will win!
I do not understand how in the hell our team started to play that
well, but we can. We fucking can, when we want. We play not for the goal, not
for the money, not for fucking anything, we play just for the game.
And the game wins!
The first half of extra time has ended. Russia vs Holland 1:1.
We can. Just because we can.
/other :: Link / Comments (0)
Thu, 19 Jun 2008
POHMELFS and HMAC/crypto operations.
As I found with the
distributed storage
project, any communication channel which involves huge amounts of data transfer
has to have an additional strong checksum embedded in the protocol, since the TCP one is not
enough in some cases. There are some options, like TCP MD5 signatures or IPsec transformations,
but they are not always available.
POHMELFS
will include the ability to encrypt the whole data channel and/or only digitally
sign all messages. This will be implemented at the transaction level, so no higher layer code
(like the data reading/writing functions) will ever be affected.
POHMELFS will also have mount-time self-configuration, i.e. the client will send the server
information about the capabilities requested by the administrator, and if the server does not
support some of them (for example it can only do HMAC and not encryption, and both operations were
requested at mount time), they will be dropped (or, optionally, the mount will fail).
In the future it will be possible to extend it with additional flags if needed.
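The mount-time negotiation described above boils down to intersecting two capability sets. A minimal sketch in Python, assuming a simple bit-flag encoding; the `Cap` names, the bit values and the `strict` switch are hypothetical and only mirror the "dropped, or mount fails optionally" behaviour from the text.

```python
from enum import IntFlag


class Cap(IntFlag):
    """Hypothetical capability bits exchanged at mount time."""
    HMAC = 1
    ENCRYPTION = 2


def negotiate(requested: Cap, supported: Cap, strict: bool = False) -> Cap:
    """Keep only what both sides can do; in strict mode refuse to downgrade."""
    agreed = requested & supported
    if strict and agreed != requested:
        raise RuntimeError("server does not support all requested capabilities")
    return agreed
```

New flags can be added to the enumeration later without changing negotiate(), which matches the remark about extending the scheme with additional flags.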
mount is not a very convenient command for transferring crypto information (like binary keys)
to the kernel, so I use the same infrastructure as for the initial server group initialization (i.e. the
existing POHMELFS configuration utility).
Support for HMAC and encryption will force the server to depend on OpenSSL,
but I do not think it is a problem. At some future time I can write autoconfiguration, which will
allow compiling the server without crypto support (and thus not accepting encrypted clients and
not checking signatures) if there is no OpenSSL.
After crypto operations are implemented (I expect that to be finished this week), I will release, as promised, a
new netchannel
version (and will remove unneeded functionality like NAT), and add some interesting bits (like async
processing) to distributed storage,
so expect its new release soon too.
Stay tuned!
/devel/fs :: Link / Comments (2)
CLISP socket streams.
Excellent documentation with examples.
I expect that it is implementation (i.e. CLISP) specific and will not work with SBCL or Allegro,
for example, but nevertheless I want to learn it and use it a bit.
If it turns out to be good for my use cases, guess what my next userspace server will be written in? :)
/devel/other :: Link / Comments (0)
POHMELFS, NFS, Ext4 and XFS in iozone benchmark. Graphs.
Hardware used in testing: 4-way Intel E7520 system (two logical and two physical CPUs),
3 GHz 32-bit Xeons with 1 GB of RAM, Adaptec AIC7902 Ultra320 SCSI adapter with a SEAGATE
ST3300007LC 10k rpm 300 GB testing disk. Its linear reading speed is about 90 MB/s.
Software used in testing: 2.6.25 kernels (on server and client), in-kernel async NFS server,
userspace POHMELFS server.
Tests were performed with 8 GB files (the amount of RAM was reduced to 1 GB to eliminate caching
influence) with different record sizes (from 8 to 1024 KB). I ran write/rewrite, read/reread and
random read and write tests.


/devel/fs :: Link / Comments (0)
CRFS got metadata cache coherency support.
Zach Brown has
committed
cache coherency support into the CRFS repository.
The cache coherency protocol works by broadcasting special messages from the
server, and each client invalidates the appropriate inodes (and dentries if needed)
before sending back a reply.
POHMELFS
uses a slightly different mechanism: the client does not send acks back to the server,
so all such messages are kind of advisory-only, but I have not yet completed the locking design (well,
I did not even think about this problem this week), so it can change.
The main problem with sync cache coherency support is its absolute non-scalability.
While a number of use cases might require such behaviour, I expect that a noticeable part of users,
if not the majority, do not want performance degradation as the price for
POSIX-like coherency expectations. This approach is worse than a write-through cache,
since there is a whole round-trip of the cache coherency request instead of just
sending data while it is being written. Single direction sending is faster than sending+waiting,
so for me it is still a questionable approach.
I will think a lot about this problem later this week(end), so that the solution
satisfies both the high-performance and safety camps (although only to some degree, I think).
/devel/fs :: Link / Comments (0)
Wed, 18 Jun 2008
LISP macros rox!
(defmacro with-output-dir ((out pos dir flags) &body form)
  ;; OPERATION is deliberately visible to the body; POS is the column counter.
  `(let ((,pos 2))
     (dolist (operation (nthcdr 2 *iozone-tests*))
       (let* ((dir-path (pathname-as-directory ,dir))
              (output-file (make-pathname
                            :directory (pathname-directory dir-path)
                            :name operation
                            :type "gnuplot")))
         (with-open-file (,out output-file :direction :output :if-exists ,flags)
           ,@form))
       (incf ,pos))))
(defun write-gnuplot-headers (dir)
  (with-output-dir (out pos dir :supersede)
    (format out "set title \"Iozone performance: ~a, KB/s\"~%" operation)
    (format out "set terminal png small size 450 350~%")
    (format out "set logscale x~%")
    (format out "set xlabel \"Record size in KBytes\"~%")
    (format out "set ylabel \"Kbytes/sec\"~%")
    (format out "set output \"~a.png\"~%" (elt *iozone-tests* pos))
    (format out "plot ")))
(defun update-gnuplot-headers (dir file)
  (with-output-dir (out pos dir :append)
    (unless *first-file-p*
      (format out ", "))
    (let* ((fstype (pathname-name file))
           (name (make-output-name file)))
      (format out "\"~a\" using 1:~d title \"~a\" with lines" name (1+ pos) fstype))))
Macros are really the coolest feature of LISP. Now I believe I have started to understand LISP kung-fu.
The iozone parser is essentially ready. I was a bit pessimistic yesterday: it took only half of that day and several
hours today. The code itself is rather ugly (and frequently really ugly, likely far from the LISP way), but it works:
it runs over the given dir, searches there for files with the given extensions, parses them (removes unneeded iozone information)
and writes the result to the specified directory. It also runs over the iozone test strings and generates gnuplot scripts for them, which
will build a graph based on the filesystem info gathered while traversing the tree above, so the results look like this:
$ ./parser.lisp
Processing: /tmp/iozone/tmpfs/nfs.out ... done
Processing: /tmp/iozone/tmpfs/pohmelfs.out ... done
$ cat /tmp/iozone/tmpfs/out/read.gnuplot
set title "Iozone performance: read, KB/s"
set terminal png small size 450 350
set logscale x
set xlabel "Record size in KBytes"
set ylabel "Kbytes/sec"
set output "read.png"
plot "/tmp/iozone/tmpfs/nfs.out.data" using 1:5 title "nfs" with lines,
"/tmp/iozone/tmpfs/pohmelfs.out.data" using 1:5 title "pohmelfs" with lines
/devel/other :: Link / Comments (2)
Tue, 17 Jun 2008
LISP development zen.
(defun string_to_list (str)
  (let ((num 0) (ret '()) (string_len (length str)))
    (dotimes (i string_len)
      (let ((sym (elt str i)))
        (cond
          ((not (char-number-p sym))
           (unless (eql num 0)
             ;(format t ": ~d~%" num)
             (push num ret)
             (setf num 0)))
          (t (setf num (+ (* num 10) (to_number sym)))
             (when (eql i (- string_len 1))
               (push num ret))))))
    (nreverse ret)))
This is a part of my LISP parser for iozone output files. So far it is able to convert the output numbers (performance in KB/sec)
into LISP lists (one list per record), so a single line of iozone output becomes a single list of numbers
(ugh, I was forced to write a string-to-number conversion function).
It is likely not that serious an achievement, and it took the whole day, but nevertheless I like it,
although I would write the same in C much faster :)
The main problem with Lisp for me is its functional style of nested conditions. Converted to C it looks like:
if (a) {
    if (b) {
        if (c) {
            do_stuff();
        }
    }
}
While I would write:
if (!a)
    return;
if (!b)
    return;
if (!c)
    return;
do_stuff();
So far I have not used macros at all, and the whole time I kept looking into the
Practical Common Lisp book
(and frankly took the directory processing functions from there, although
I modified them a bit), but what would you expect from a first project. Tomorrow I will extend it to
write a gnuplot-compatible file and finally generate some graphs (I do not know
how to call external programs from LISP though).
Frankly, I'm not yet excited about how cool LISP is, but I like it, since it is different.
Just like I like my neverending
apartment development process.
Ugh, and with proper automatic vim highlighting I am not afraid of parentheses.
The interested reader can grab my sources
and comment on the ugliness.
Also found an 'interesting' article at IEEE about LISP:
Migration of Common Lisp Programs to the Java Platform -The Linj Approach :)
/devel/other :: Link / Comments (2)
Sun, 15 Jun 2008
Meanwhile at the apartment development side.
Decided to work in a completely different area than usual
today, so: neverending apartment development.
Today I painted the whole ceiling in the kitchen and I want to believe
that it is the last time. It was not that quick, but it took a noticeably smaller
part of the day.
The main task was the floor in the hall. I finally covered it with ceramic granite.
It was supposed to be a seamless granite installation, but... the tiles have such precise
dimensions that the difference between them was never more than half a centimeter
on each side, so I was forced to make small seams and move tiles around for quite
a while before they formed somewhat straight lines, although there are
lots of non-straight crosses.
Nevertheless it looks cool, I'm glad I finished this part.
/devel/flat :: Link / Comments (0)
Sat, 14 Jun 2008
Passive OS fingerprinting.
Ever dreamt of blocking all Linux users in your network from accessing the
internet and giving full bandwidth to Windows worms? We have to care about
our smaller brothers, so this iptables extension module allows you to do
so.
OSF (which stands for OS Fingerprint) allows you to make the usual iptables
decisions on incoming TCP packets: the initial handshake packet with the SYN
bit set is enough to understand what the remote OS is. The original idea belongs to
Michal Zalewski.
This iptables module was
implemented almost 5 years ago and lived in the patch-o-matic iptables tree (the userspace
library is still there). Now I've updated it to Xtables
and sent it for review.
Installation steps are described on the
homepage,
but are trivial and include the usual make/make lib building and loading rules into the module
via a procfs file.
# insmod ./ipt_osf.ko
# ./load ./pf.os /proc/sys/net/ipv4/osf
# iptables -I INPUT -j ACCEPT -p tcp -m osf --genre Linux --log 0 --ttl 2 --connector
You will find something like this in syslog:
ipt_osf: Windows [2000:SP3:Windows XP Pro SP1, 2000 SP3]: 11.22.33.55:4024 -> 11.22.33.44:139
/devel/networking :: Link / Comments (0)
New userspace network stack release.
Fixed a bug found by Salvatore Del Popolo (delpopolo_dit.unitn.it)
in the TCP implementation: the system checked the sending window, determined
that a packet was not allowed to be sent, and nevertheless tried to send it in some
cases.
Userspace network stack
is a very fast (when working on top of
netchannels;
packet sockets are also supported) and very small network stack (TCP/UDP/IP/ethernet) implemented
entirely in userspace. Because it lives very near the end of the peer (i.e. very close to
or even embedded into the application), it allows much faster processing of some workloads, namely
small packet sending and receiving, where
it
outperforms
the vanilla Linux TCP/IP stack 3 times in performance and 4 times in CPU usage (sending and receiving vary).

Compare netchannels+unetstack versus Linux sockets (2006 numbers).
It is not about problems in the Linux stack, but about the overhead of syscalls, which in turn
results from data sending and reply processing being too separate in the existing model.
/devel/networking/unetstack :: Link / Comments (0)
CARP: Common Address Redundancy Protocol for Linux kernel.
I've finally made a new release of the
CARP
for Linux kernel.
CARP is an improved version of the Virtual Router Redundancy Protocol (VRRP) standard.
The latest protocol to help provide high availability and network redundancy, it was
developed because router giant Cisco Systems believes that its Hot Standby Router
Protocol (HSRP) patent covers some of the same technical areas as VRRP.
This project allows you to build highly available clusters of multiple machines with
balanced master selection between them. Installation and setup are pretty trivial:
$ tar -zxf carp_latest.tar.gz
$ cd carp
$ make
# insmod ip_carp.ko
# modprobe cn
# insmod carp_conn.ko
# ifconfig carp0 up
# carp_conn_daemon -m master.sh -b backup.sh
And the same on all other machines.
Each script, as you can guess from its name, is executed when the node becomes a master or a backup one;
you can put firewall rule changes, traffic shaping setup, network daemon start/stop
scripts and whatever you like there.
Its main advantage over other existing open master/backup solutions (like Heartbeat or userspace CARP; well, it also behaves much more robustly than Cisco VRRP)
is the ability to set up a multicast address (via the usual
/sbin/ifconfig command) and thus not confuse some crappy Cisco
hardware, which would not understand that the node changed.
One can get the latest sources from the CARP homepage.
Enjoy!
/devel/networking :: Link / Comments (0)
Fri, 13 Jun 2008
The latest iozone benchmark of POHMELFS, NFS, XFS and Ext4.
1 GB of RAM, 8 GB files. SEAGATE ST3300007LC 10k rpm 300 GB on Adaptec AIC7902 Ultra320 SCSI adapter.
Performance in KB/s.
NFS:
      KB  reclen   write  rewrite    read  reread  random read  random write
 8388608       8   53210    57769   24304   24448         1360          4775
 8388608      16   54577    57481   23871   24080         2592          7937
 8388608      32   54736    56203   24015   24114         4738         12637
 8388608      64   52075    54051   23653   23555         7610         18475
 8388608     128   52307    54636   23305   23375        13017         26584
 8388608     256   52189    53030   23585   23531        15615         34390
 8388608     512   52938    54063   23709   23882        17524         42781
 8388608    1024   57458    57006   24187   24292        29701         43892
POHMELFS:
      KB  reclen   write  rewrite    read  reread  random read  random write
 8388608       8   66473    63721   74232   74288         1103          4953
 8388608      16   52604    62339   73423   74259         2001          8438
 8388608      32   53278    62283   73497   74115         3360         13849
 8388608      64   56931    61370   73135   74077         5076         21063
 8388608     128   59419    62743   72736   74122         8068         30279
 8388608     256   60861    63094   73284   74554        10848         38869
 8388608     512   59438    62081   73329   74441        17290         48722
 8388608    1024   62790    62130   73322   74100        27741         46470
POHMELFS write speed is about 10% faster, and read speed is 3-3.5 times faster
(essentially the disk/local fs IO limit, see below).
POHMELFS random read speed is lower, and fixing that is the task with the highest priority now,
especially compared to the local FS results. POHMELFS random write is slightly faster than NFS.
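The summary figures can be rechecked from the two tables above with a few lines of Python. This is just arithmetic over the sequential write and read columns; averaging across record sizes is an assumption about how to condense eight rows into one number, not necessarily how the estimates in the text were made.

```python
from statistics import mean

# Sequential write and read columns copied from the NFS and POHMELFS
# tables above, in KB/s, for record sizes 8..1024 KB.
NFS_WRITE = [53210, 54577, 54736, 52075, 52307, 52189, 52938, 57458]
NFS_READ = [24304, 23871, 24015, 23653, 23305, 23585, 23709, 24187]
POHMELFS_WRITE = [66473, 52604, 53278, 56931, 59419, 60861, 59438, 62790]
POHMELFS_READ = [74232, 73423, 73497, 73135, 72736, 73284, 73329, 73322]


def speedup(new, old):
    """Ratio of mean throughputs: values above 1 mean `new` is faster."""
    return mean(new) / mean(old)


if __name__ == "__main__":
    print("write speedup: %.2f" % speedup(POHMELFS_WRITE, NFS_WRITE))
    print("read speedup:  %.2f" % speedup(POHMELFS_READ, NFS_READ))
```

Averaged this way the write ratio comes out at roughly 1.10 and the read ratio at roughly 3.08, in line with the "about 10%" and "3 times" estimates.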
For comparison, local filesystem, used for tests.
mkfs.xfs -d agcount=75 -l size=64m /dev/sdc1;
mount -o logbufs=8,nobarrier,noatime,nodiratime,osyncisdsync /dev/sdc1 /mnt/:
      KB  reclen   write  rewrite    read  reread  random read  random write
 8388608       8   75124    60560   77672   77797         1860          5059
 8388608      16   75044    60036   77754   77775         3601          8772
 8388608      32   75958    62038   77593   77765         6821         14781
 8388608      64   74728    59384   77688   77782        12475         23228
 8388608     128   74889    59676   77731   77736        21734         32241
 8388608     256   75022    59285   77676   77718        28833         40324
 8388608     512   74885    59187   77653   77713        40013         48057
 8388608    1024   74838    64217   77796   77765        55100         46104
And Ext4 joins the group (mount options: rw,noatime,data=writeback,extents):
                                                random  random
     KB reclen   write rewrite    read  reread    read   write
8388608      8   72107   73017   77276   77335    1849    5015
8388608     16   72276   73849   77304   77287    3577    8666
8388608     32   72680   73647   77284   77326    6755   14394
8388608     64   71965   74287   77327   77288   12366   22513
8388608    128   72660   73864   77207   77343   21617   31160
8388608    256   72813   74058   77296   77338   28652   42003
8388608    512   72985   73317   77284   77343   40572   50619
8388608   1024   72184   74131   77264   77250   55649   50365
Nice graphs will follow once I write a Lisp (no less :) parser for these results.
Stay tuned!
/devel/fs :: Link / Comments (3)
New POHMELFS release: doing it wrong fast is at least better than doing it wrong slowly.
Via Ashleigh Brilliant and bits of Tullamore Dew.
Here we go, short changelog for this release:
- Balancing of read requests (data reads, directory listings, lookups) between multiple servers.
- Write requests are sent to multiple servers and complete only when all of them have sent an ack.
- Ability to add and/or remove servers from the working set at run-time from userspace (via netlink;
the same command could also be processed when received from the real network, but since the server does not support it
yet, I dropped the network part).
- Documentation (overall view and protocol commands)!
- Rename command (oops, forgot it in previous releases :)
- Several new mount options to control client behaviour instead of hardcoded numbers.
- Bug fixes.
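The first three changelog items can be modelled in a few lines. This is only a toy sketch in Python and not the kernel client code: the class names are hypothetical, and the round-robin read policy is my assumption, since the changelog only says that reads are balanced.

```python
class WorkingSet:
    """Hypothetical in-memory model of the client's set of servers."""

    def __init__(self, servers=()):
        self.servers = list(servers)
        self._next = 0  # round-robin cursor (assumed balancing policy)

    def add(self, server):
        if server not in self.servers:
            self.servers.append(server)

    def remove(self, server):
        self.servers.remove(server)

    def pick_reader(self):
        """Choose the server for the next read request."""
        if not self.servers:
            raise LookupError("working set is empty")
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        return server


class WriteRequest:
    """A write sent to every server; complete only when all have acked."""

    def __init__(self, working_set):
        self.pending = set(working_set.servers)

    def ack(self, server):
        """Record an ack and report whether the write is now complete."""
        self.pending.discard(server)
        return self.completed

    @property
    def completed(self):
        return not self.pending
```

With two servers, reads alternate between them, while a write stays pending until both acks have arrived.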
I will complete the documentation in a few moments and send this release to the mailing lists.
Very likely this is the last non-bug-fixing release of the kernel client side. The next release will incorporate
features needed for distributed parallel data processing (like the ability to add new servers via a network
command from other servers), so most of the work will be devoted to the server code.
/devel/fs :: Link / Comments (0)
Wed, 11 Jun 2008
Preparing for the next (last non-bug-fixing?) release.
Essentially that's it: I believe most of the features I wanted
from a network distributed parallel filesystem, and which should live
in the client, are already implemented in POHMELFS.
The client has the following features (if I did not forget something;
only those interesting from the parallel point of view are listed):
- Automatic failover reconnect to the same server.
- Run-time addition/removal of servers from the working set
(only via a userspace command, since the server does not support that yet,
but adding it is trivial).
- Coherent data and metadata cache.
- Transaction support. Full failover for all operations. Resending transactions to different servers on timeout or error.
- Load balancing of read requests (including directory reading and lookups) and
simultaneous writing to all servers in the current working set.
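The "resend to a different server on timeout or error" item is the core of the failover logic, so here is a minimal sketch of the idea. It is plain Python and not the client implementation: `send` is a hypothetical callable standing in for the real network path, and the retry limit is an arbitrary assumption.

```python
class TransactionFailed(Exception):
    """Raised when no server in the working set accepted the transaction."""


def send_with_failover(transaction, servers, send, max_rounds=2):
    """Try each server in turn until one of them completes the transaction.

    `send(server, transaction)` is a hypothetical transport function that
    returns the reply or raises OSError on error (TimeoutError and
    ConnectionError are both subclasses of OSError).
    """
    last_error = None
    for _ in range(max_rounds):
        for server in servers:
            try:
                return send(server, transaction)
            except OSError as err:
                last_error = err  # remember the failure and try the next server
    raise TransactionFailed(f"all servers failed for {transaction!r}") from last_error
```

The caller never sees an individual server failure unless every server in the set has failed `max_rounds` times in a row.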
It is damn fast (but remember that random reading
is not yet optimal, and in
the last tests it was slower than NFS).
Meanwhile, the userspace server does not support lots of features it has to support
to be called a complete parallel distributed solution, and the main work should now
be concentrated on it.
The main missing (and most complex) features are:
- A distributed data coherency protocol like Paxos for server data stored on multiple machines.
- Ability to mirror the data itself on multiple machines.
So the release will likely see the light tomorrow or on Friday.
/devel/fs :: Link / Comments (0)
Tue, 10 Jun 2008
Sun and water. Sasha and Masha.

Thank you, that was great!
/life :: Link / Comments (2)
Fri, 06 Jun 2008
Contributors we are losing and kernel summit talk about it.
By 'we' I mean the kernel community, although I do not think
I personally win or lose if someone decides not to hack
on the Linux kernel.
I even found myself in a
'contributors we are losing'
list :)
And yes, very likely Linux kernel community lost me (and I do believe
none cares as long as me).
But not Linux kernel, it is definitely the place I like.
People who want to hack on the Linux kernel will do so without all
those empty talks and brilliant ideas, all of which aim in
a single direction: do what we ask you to do for us. Be fair and
admit that you do not want new ideas implemented; you only want old bugs (introduced
by someone else) fixed, so that the kernel gets more respect without
possible additional work for you.
That is not how interested people work; they just decide for themselves
how and what to do. That's why the kernel janitors project did not succeed:
it is not interesting for anyone. The same applies to its refocus on bugfixes.
And I do know what kernel janitorial work is: I started with it not long ago, fixing
trivial error checks like request_region()/check_region() code
and other minor things like PCI remap errors.
That was a hell of crap. Frequently
I fixed lots of drivers (20 or more) in one go and submitted a patch,
only to be asked to split it into separate patches, add each driver maintainer
to the copy, wait for their ACKs, resubmit and so on. And it frequently happened
(especially when a new feature was introduced and lots of small pieces of code had to be changed
a little) that while I did that, some other well-known kernel hacker did the same, and his
patch was immediately applied.
Janitorial work and all the hypocrisy about 'we want more developers' just suck.
My advice for those who really want to hack on the kernel: just do what you like,
try yourself in whatever subsystem you want, implement your ideas, be creative and do
whatever you like with the kernel, not what all those kernel heads tell you to do.
The only way to succeed is to move forward!
Argh, and do not listen to any advice of this kind at all :)
/devel/other :: Link / Comments (3)
POHMELFS development status.
POHMELFS
got the ability to add/remove servers at run-time (not via a network command,
since I do not yet know how to test that, but via the netlink interface). The same
message can be passed via the network though, so it will be simple to extend.
Also, POHMELFS got readahead support via the ->readpages()
callback. I removed AIO reading from POHMELFS in favour of readahead
and got an excellent result in sequential reading: 3-3.5 times faster than NFS
and essentially reaching disk IO bandwidth (a bit less though),
but random reading dropped to miserable numbers.
Also, the rewritten reading method should provide better balancing between multiple servers,
but it will not show any benefit in the single-threaded
iozone benchmark, since it reads data via a single call to read(),
which yields sequential data access, which in turn is faster than the network bandwidth.
So a multithreaded load should greatly benefit from read balancing, but I have not
tested that yet.
I ran sequential read/reread, write/rewrite and random read/write tests for
XFS, Ext4, NFS (over XFS) and POHMELFS (over XFS) with 1 GB of RAM and 8 GB
of test files (to eliminate VFS caching influence) with record sizes from 8 KB to 1 MB.
The results exist as text files in the standard iozone output format, but since I'm learning
Lisp I decided to write a graph generator (via gnuplot) using my very basic
knowledge of the language, so nice graphs can take a while...
Also, tomorrow morning I will fly away to my friend's wedding and will only
return on Monday the 9th. I will not have internet access there, only lots of fun.
/devel/fs :: Link / Comments (0)
Thu, 05 Jun 2008
Travelling to Uganda.
A friend invites me to travel to Uganda this September, promising beautiful nature and a trip that is very interesting in itself.
Thinking...
/life :: Link / Comments (2)
Wed, 04 Jun 2008
Optimized POHMELFS transactions.
Now they eat less memory, and a single write transaction can accumulate
up to 1024 pages. This can be tuned further, especially for small requests
mixed with sync. Currently a write transaction is allocated at its maximum
size, and then page pointers are written into the allocated area, so
if the number of dirty pages requiring writeback is small, quite a lot of
space is wasted.
That is a task for the next optimization. Nevertheless, sequential
writing is currently limited only by disk throughput, or by network bandwidth in the case of
multiple servers: since the link
is shared between the machines, effective bandwidth becomes equal
to GigE/number of servers, or about 60 MB/s in my environment with two servers
and a single client.
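The "GigE/number of servers" estimate is simple arithmetic, sketched here. It assumes a raw 1 Gbit/s link, counts 1 MB as 10^6 bytes and ignores all protocol overhead, so it gives an upper bound; the function name is only illustrative.

```python
GIGE_BITS_PER_SECOND = 1_000_000_000  # raw Gigabit Ethernet line rate


def per_server_write_bandwidth_mb(link_bits_per_second, servers):
    """Upper bound in MB/s for the data stream reaching each server.

    Every write is sent to all servers over the single client link,
    so the link is divided equally between them. Protocol overhead
    is ignored (assumption).
    """
    bytes_per_second = link_bits_per_second / 8
    return bytes_per_second / servers / 1_000_000


if __name__ == "__main__":
    for servers in (1, 2, 3):
        mb = per_server_write_bandwidth_mb(GIGE_BITS_PER_SECOND, servers)
        print(f"{servers} server(s): {mb:.1f} MB/s")
```

For two servers this gives 62.5 MB/s, which matches the "about 60 MB/s" above once real-world overhead is taken into account.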
Also, the reading path was not changed at all (only the transaction
internals): there is still no readahead,
and a new transaction is allocated for each page to be read. Nevertheless,
see how reading improved: POHMELFS not only outperformed NFS again,
but reached the disk bandwidth limit already for 16 KB requests (almost two
times faster than NFS). The table shows IO throughput in KB/s.
                                                random  random
     KB reclen   write rewrite    read  reread    read   write
8388608      8   74058   68392   40130   79509   43588    4818
8388608     16   62332   66978   73714  122074   42160    8434
8388608     32   64775   67073  109357  171139  145416   14183
8388608     64   66962   66602  147350  217323  227962   22257
8388608    128   67724   67133  185574  266855  321060   32681
8388608    256   68233   67922  201591  283567  474657   40944
8388608    512   68339   66514  213513  295995  646897   50303
8388608   1024   67744   67384  220858  297748  676582   48796
I will create nice graphs out of these tables and will also include
optimized reading tests (likely tomorrow) and results with two data servers.
What should also be done is testing with either bigger files or a smaller
amount of RAM and thus a smaller VFS cache. As you saw in all the tests, when
lots of reads start to hit the cache, the picture becomes completely uninformative
about filesystem behaviour. So I want to limit all three testing machines
to 1 GB of RAM (booting with the mem=1G parameter) and perform the same iozone
bench for an 8 GB file. The results should be more realistic.
In parallel I will implement the userspace run-time server addition/removal
command, which will also be used as-is for a network message from one
or another server connected before. With optimized reading transactions
it will be a good ground for the next POHMELFS release. So I plan to schedule
it for Thursday or the middle of next week, since I will be on a small vacation
on June 6-9.
/devel/fs :: Link / Comments (0)
Mon, 02 Jun 2008
AppArmor and path-based security approaches vs object bound policies.
- So again, can you offer an alternative?
- Just give up on this dumb idea completely.
This is not about AppArmor in general (although maybe about it too), but about security hooks which provide
path information to inode callbacks. There are pros and cons to this decision,
but it looks like path-based security hooks will not be accepted.
There is a really trivial way to fix it. No kidding, it is simple: create your own
name cache and do not bind it to dentries, but instead index it by inode number.
This allows you to have whatever callbacks and information you want in strictly
bound VFS operations. Need to have path info in ->inode_create()?
Put it into your own tree indexed by the parent's inode number, look that data up in the
security hook and make a decision. Yes, it is slower, but active security was never
a fast solution. It still goes against the rules others created for security-based
systems, but formally it stays within all the boundaries of the existing (maybe ugly
for someone) interfaces.
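To show what such a private name cache could look like, here is a userspace sketch in Python. It is only an illustration of the data structure, not kernel code: the class, the `may_create()` helper and the prefix-based policy are all hypothetical, and hard links (one inode with several names) are deliberately ignored.

```python
class NameCache:
    """Hypothetical name cache: inode number -> (parent inode number, name).

    Simplification: one name per inode, so hard links are not handled.
    """

    def __init__(self, root_ino):
        self.root_ino = root_ino
        self.entries = {}

    def insert(self, ino, parent_ino, name):
        self.entries[ino] = (parent_ino, name)

    def path(self, ino):
        """Rebuild the full path by walking parent links up to the root."""
        parts = []
        while ino != self.root_ino:
            ino, name = self.entries[ino]
            parts.append(name)
        return "/" + "/".join(reversed(parts))


def may_create(cache, parent_ino, name, allowed_prefixes):
    """Path-based decision for a create hook that only knows the parent inode."""
    full = cache.path(parent_ino).rstrip("/") + "/" + name
    return any(
        full == prefix or full.startswith(prefix.rstrip("/") + "/")
        for prefix in allowed_prefixes
    )
```

The hook itself still receives only inode-level information; the path comes from the side table, at the cost of one lookup per path component.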
And I will not point to the project which already uses such an approach in a different area
though :)
It is interesting to implement your ideas not by breaking something (although sometimes
that is needed, but that's likely an exception, or when you are hacking a deeply internal kernel
part), but instead by hacking around existing limitations.
/devel/fs :: Link / Comments (4)
Trumpet kung-fu.
I think I found a way to make progress in my trumpet playing exercises
(read: ear-scratching screaming; it sounds much worse than a wrong note on a piano).

It is of course practice, but even without the whole tube and using only the mouthpiece I can train
my breathing. Musicians exercise several hours per day; kernel hackers, about half an hour each morning
in parallel with listening to AC/DC and Metallica as an alarm clock. The mouthpiece is rather quiet (noticeably
louder than usual talk though), but produces about the same resistance to air flow, so I think
it is good training. When my embouchure is stable enough I will attach the trumpet, since currently
the sound frequently drops and jumps. Nevertheless I have made big progress (I think so at least) since starting
such training recently.
My home guardian Socket, although he does not have ears, looks like he does not like it. Although he only likes
to eat.
/life :: Link / Comments (0)
Pros are talking.
- If you haven't noticed, I don't take "no" for an answer,
- And now please tell us step 2 in your secret plan to win friends and influence.
- WTF are you getting at?
Fun thread :)
There is actually a serious problem in the kernel community when some new idea is being implemented
and it goes against something which sits in the mind of one or another big kernel hacker out there.
When such a person replies that this is a bad idea (sometimes without technical arguments), people
just stop looking at the replies and do not follow the author's arguments, because they frequently do
not know the area in question well enough to make a decision and thus rely on others.
This only works when 'others', i.e. core kernel maintainers, are good and do not base their decisions
on personal feelings but take only the technical side into account. Unfortunately that is not always the case,
and political methods are used. Sometimes even only political methods are used...
/devel/other :: Link / Comments (0)
As promised, let's see shadowed miserable POHMELFS results.
Usually you will not see bad benchmark results for a technology under
development, but any such result is actually a _very_ good result
for a work-in-progress and not yet completed system. It allows one
to see how new proof-of-concept code compares
with an already completed, tuned and optimized system.
Conclusions from such tests result in really superior decisions.
Let's compare iozone read/reread, write/rewrite and random
read and write for POHMELFS and NFS with 8 GB test files and
different record sizes (from 8 KB to 1 MB) on XFS over the GigE link.
I described the hardware and local iozone benchmark results in detail
previously.
Now it's time for network tests.
Async NFS in-kernel server results.
                                                random  random
     KB reclen   write rewrite    read  reread    read   write
8388608      8   60969   57743   39705   97031  464898    5160
8388608     16   59925   57402   39045   98269  641388    8827
8388608     32   58094   55263   39075   94654  775064   14389
8388608     64   58168   57156   40306   98639  868796   22360
8388608    128   58908   56573   40392  100018  941509   33211
8388608    256   59444   56446   40842  102503 1030451   41576
8388608    512   60280   57686   39835   97879 1042570   49858
8388608   1024   60817   57886   40886   96646  851175   47993
And now POHMELFS results.
                                                random  random
     KB reclen   write rewrite    read  reread    read   write
8388608      8   70073   64232   12518   14817   40334    5079
8388608     16   63984   67948   31976   19106   41462    8702
8388608     32   67250   63440   47506   38657   75908   14357
8388608     64   69970   66198   41899   29566  136294   21385
8388608    128   69838   68523   76232   33971  222909   30946
8388608    256   70012   66439   69125   58223  330886   40685
8388608    512   70946   68291   76460   58738  428881   51001
8388608   1024   70985   64958   76317   59561  421973   48531
Sequential writing is 10-15% faster for POHMELFS (and limited by the underlying
fs speed), while random writing
is essentially the same and is limited by disk speed. But sequential reading
is _much_ worse for small requests. The reason is simple: POHMELFS does not support readahead,
since it does not have a ->readpages() callback, so any
sequential access ends up as a series of ->readpage() callbacks,
each of which waits for its completion, which is slow; readahead
is currently not invoked from the reading path.
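A toy cost model shows why one request per page hurts so much. This is only a back-of-the-envelope sketch: the 4 KB page size, the fixed per-request round-trip cost and the example numbers are assumptions and were not measured on my setup.

```python
import math

PAGE_SIZE = 4096  # bytes, assumed page size


def read_time_seconds(total_bytes, pages_per_request, round_trip_s,
                      bandwidth_bytes_per_s):
    """Estimated time to read total_bytes sequentially.

    Each request pays one synchronous round trip; the data itself
    moves at the given bandwidth. pages_per_request=1 models a series
    of ->readpage() calls, larger values model batched readahead.
    """
    pages = math.ceil(total_bytes / PAGE_SIZE)
    requests = math.ceil(pages / pages_per_request)
    return requests * round_trip_s + total_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    size = 128 * 1024 * 1024        # 128 MB of sequential data
    rtt = 0.0002                    # hypothetical 200 us per request
    bandwidth = 100 * 1000 * 1000   # hypothetical 100 MB/s link
    for batch in (1, 32):
        t = read_time_seconds(size, batch, rtt, bandwidth)
        print(f"{batch:2d} page(s) per request: {t:.2f} s")
```

In this model the per-request latency dominates when every page is fetched separately, and batching pages amortizes it.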
I could not resist highlighting that big
requests are 1.5-2 times faster for POHMELFS than for NFS :)
and are also limited by the underlying filesystem.
One can note that
NFS random reading results are actually better than the local filesystem's,
and very noticeably so. Why does the local filesystem behave worse than
when mounted via NFS in random reading?
I believe that's because in the network case we actually have double buffering:
on the client, where the most active pages are in RAM, and on the server, where
readahead populated pages which are not active (active pages are being
read from the client's cache, so they will be evicted from the server's page cache,
since the client will not try to read them from the server). Those server pages
which are not currently active will soon be accessed by the client, when it reads
the next portion of the random data, and that will be a very fast access to RAM.
So we have a really good caching scheme, where the most actively used pages are
in client RAM and are flushed to disk on the server, and the server instead populates
other, less active pages via readahead.
This reading behaviour is just a result of the not yet complete VFS callback implementation
in POHMELFS. With ->readpages() in place it will be faster than
NFS even in this bench. Also, POHMELFS has multiple-server parallel read balancing and
simultaneous writing to them, but there are no results yet.
I have already created a mental model of the optimized read and write transactions (based
on memory pools for maximum OOM-robustness and small memory usage overhead), so
in a day or two it will be implemented in code.
Stay tuned, now it's time for excellent POHMELFS results!
/devel/fs :: Link / Comments (0)
Sun, 01 Jun 2008
We won! DPQE:DGAP 81:67
Screaming, drinking, cheering...
Although I was not there today, since
some friends became ill and others moved to their tasks,
it still was really cool (yesterday).
My congratulations to the team and department itself :)
/life :: Link / Comments (0)