Zbr's days.

Sat, 05 Jul 2008

Midnight creatiff. Casted by LHC start.

- Shit! There are no more M8 screw-nuts.
- What? Use M12, the boson should pass through.
- We all will be fucked this Monday!

Building LHC

Good night. Actually, as a former physicist I can say that at least two out of the four killing theories are really stupid, but nevertheless it's interesting!

/other :: Link / Comments (2)


Fri, 04 Jul 2008

In case we will die this Monday...

I've started a countdown...

Countdown has been started

Large Hadron Collider will be started in 3 days...

/other :: Link / Comments (0)


Thu, 03 Jul 2008

POHMELFS crypto support has been completed.

kernel$ git commit -a
Created commit b07e3ed: Added crypto support.
 9 files changed, 1534 insertions(+), 221 deletions(-)
 create mode 100644 fs/pohmelfs/crypto.c

fserver$ git commit -a -m "Aded crypto support."
Created commit f916b2f: Aded crypto support.
 3 files changed, 788 insertions(+), 94 deletions(-)

I implemented a pool of crypto processing threads (their number is a mount option), each of which has its own pool of pages to encrypt data into. A crypto thread is not released until the server acknowledges that the data was successfully written, so one should tune the number of threads and the page pool according to the desired behaviour (the number of pages in each thread is the maximum number of pages per transaction, and this limit has its own mount option too).
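
A rough userspace model of that scheme, in Python rather than kernel C, may make the tuning trade-off easier to see. It is only a sketch: the class name, the `ack()` method and the 4096-byte page size are assumptions made for illustration, not the POHMELFS code.

```python
import queue

PAGE_SIZE = 4096  # assumed page size


class CryptoThreadPool:
    """Hands out per-thread page pools; a slot stays busy until the server ACKs."""

    def __init__(self, num_threads, pages_per_thread):
        self._idle = queue.Queue()
        for thread_id in range(num_threads):
            # Pages are allocated once, up front, like at mount time.
            pages = [bytearray(PAGE_SIZE) for _ in range(pages_per_thread)]
            self._idle.put((thread_id, pages))

    def acquire(self, timeout=None):
        # Blocks (or raises queue.Empty after `timeout`) while every thread
        # is still waiting for an acknowledgement from the server.
        return self._idle.get(timeout=timeout)

    def ack(self, slot):
        # The server confirmed the write: the thread and its pages are free again.
        self._idle.put(slot)
```

With too few threads the writer stalls in `acquire()` while acknowledgements are in flight; with too many, memory is pinned in idle page pools.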

Testing shows that write performance dropped noticeably with this approach: with 4 encryption threads and 4 receiving threads in the server, performance fell by around 30%, from 65+ MB/s down to 46+ MB/s, but I think it can be improved with a larger number of encryption threads. During the iozone write/rewrite test each of the 4 crypto threads ate about 20-30% of CPU, while the server ate about 130% (4 threads in total). In all previous iozone tests, the more userspace threads were used, the worse the results were (this is somewhat expected, since iozone is a single-threaded benchmark, so a larger number of threads only leads to performance degradation), so I will test different setups (namely a larger number of crypto threads and a smaller number of server threads).

But this behaviour is not a problem, and I expect it can be tuned; the real problem is read performance. Right now there is only a single thread, which reads from one socket: it was done intentionally, since reading data from the socket takes longer than searching for a page in the radix tree or any other operation performed by that thread, so there was no way to saturate it. That holds until we start encryption, which is slow: subsequent reads from the socket cannot be done in parallel with crypto processing, and overall read performance drops to the floor.

This problem has to be fixed, so I plan to use the same crypto processing threads to decrypt and/or perform the hash check on received data and push it up the VFS stack.

/devel/fs :: Link / Comments (0)


Wed, 02 Jul 2008

POHMELFS crypto: feel incredibly stupid.

First, POHMELFS does need to have encryption, because I plan to use a distributed hash table approach in the server (well, consider the POHMELFS kernel client as a kind of BitTorrent filesystem client), and, as in any non-centralized system, content transferred via uncontrolled data channels has to be encrypted.

But... I'm incredibly stupid: I implemented encryption and decryption in place, i.e. a VFS page is encrypted before being written to the servers, so a subsequent read leads to... Yes, it reads encrypted content.
To fix this issue I plan to encrypt data into different pages and send those, leaving the VFS ones as is. There are two approaches I consider:

  • allocate and send pages at writeback time - if we want to send 5 pages, allocate 5 pages, encrypt data into them and broadcast them to all needed servers.
  • allocate a (potentially large) pool of pages per crypto thread at mount time and encrypt data into them. This will have about zero run-time overhead for VFS, except that write completion is slightly delayed because of encryption.
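
The difference between the in-place bug and the fix can be shown with a toy model. This is a sketch only: `toy_encrypt` is a XOR stand-in for a real cipher, and the page and bounce buffers are plain Python `bytearray`s, not kernel pages.

```python
def toy_encrypt(data: bytes, key: int = 0x5A) -> bytes:
    # Stand-in for a real cipher: XOR with one byte (it is its own inverse).
    return bytes(b ^ key for b in data)


def writeback_in_place(page: bytearray) -> bytes:
    # The buggy variant: the cached page itself is overwritten with ciphertext.
    page[:] = toy_encrypt(bytes(page))
    return bytes(page)


def writeback_with_bounce(page: bytearray, bounce: bytearray) -> bytes:
    # The fixed variant: ciphertext goes into a separate preallocated page,
    # the cached page keeps its plaintext.
    bounce[:len(page)] = toy_encrypt(bytes(page))
    return bytes(bounce[:len(page)])
```

Both variants put the same bytes on the wire; only the second leaves the cached page readable for a later local read.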

/devel/fs :: Link / Comments (7)


Louis Maggio trumpet school: never smile.

/life :: Link / Comments (0)


Holy shit: kernel summit.

We would like to invite you to the 2008 Kernel summit, and we hope that you will be able to join us...
I'm trying to recall the previous kernel summit:



That was fun, but no one wanted to play football instead of talking about whatever we talked about.

That year I had only committed a HIFN driver into the tree, and there was no kevent :)

This time it is in the US, thinking...

/devel/other :: Link / Comments (5)


Tue, 01 Jul 2008

Why is blocking sending considered harmful?

I frequently hear that whatever server you implement, it has to be non-blocking: with parallel sending this lets you send multiple requests to fast servers while not sending data to a slow server, since its non-blocking socket will return EAGAIN.

This is only a half-right solution: when we have to put the given data on all servers, and cannot free it until all servers have replied with an acknowledgement, non-blocking mode can do more harm than good.

Mainly because it allows all the memory to be eaten by requests which are still in the queue to be sent to the slow server, but were already sent to the fast ones. In this case the higher-level application (consider a simple application which generates some data and writes it into a file in a distributed filesystem, which writes the file to several servers) will never block, since the transfer to the fast servers completes quickly, and will provide more and more data, which will consume all RAM.

It is possible to deadlock the system in this case, since to send some data to a remote server we always have to allocate at least some memory to put network headers into. With the non-blocking solution the system will consume all memory and kick itself into a coma.
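
A tiny deterministic model shows the effect. It is a sketch under simplifying assumptions (one buffer written per tick, fast servers acknowledge instantly, the slow server acknowledges one buffer every `slow_acks_every` ticks); the function and parameter names are made up for illustration.

```python
def peak_buffers_held(num_writes, slow_acks_every, max_in_flight=None):
    """Return the peak number of buffers pinned while waiting for the slow server."""
    in_flight = 0
    peak = 0
    for tick in range(num_writes):
        if max_in_flight is not None and in_flight >= max_in_flight:
            # Blocking behaviour: the writer sleeps until the slow server
            # acknowledges one buffer, so memory use stays bounded.
            in_flight -= 1
        in_flight += 1  # a new buffer must be kept until every server ACKs it
        if tick % slow_acks_every == slow_acks_every - 1:
            in_flight = max(0, in_flight - 1)  # the slow server ACKs one buffer
        peak = max(peak, in_flight)
    return peak
```

In this model `peak_buffers_held(1000, 10)` grows with the number of writes, while `peak_buffers_held(1000, 10, max_in_flight=8)` never exceeds the bound: the writer is throttled to the speed of the slowest server instead of eating memory.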

/devel/networking :: Link / Comments (2)


Passive OS fingerprinting.

I've updated the OSF modules to xtables, so you have to enable xtables support in the kernel config and get a recent iptables (I tested with 1.4.1.1, which is the latest release to date).

OSF allows you to match incoming packets against different sets of SYN-packet characteristics and determine which operating system is on the remote end, so you can make decisions based on OS type and even, to some degree, version.
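
The general idea of such matching can be sketched in a few lines of Python. Everything here is illustrative: the field set, the `ExampleOS-*` names and the signature values are invented for the example and are not taken from the OSF module or its fingerprint database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynFeatures:
    window: int      # TCP window size from the SYN packet
    ttl: int         # IP time-to-live
    df: bool         # IP "don't fragment" bit
    options: tuple   # order of TCP options, e.g. ("mss", "sackOK", "ts")


# Hypothetical signatures, for illustration only.
SIGNATURES = {
    "ExampleOS-A": SynFeatures(5840, 64, True, ("mss", "sackOK", "ts", "nop", "ws")),
    "ExampleOS-B": SynFeatures(65535, 128, True, ("mss", "nop", "nop", "sackOK")),
}


def initial_ttl(observed_ttl):
    # The TTL drops by one per hop, so round up to a common initial value.
    for guess in (32, 64, 128, 255):
        if observed_ttl <= guess:
            return guess
    return 255


def match_os(pkt):
    """Return the name of the first signature matching the SYN packet, or None."""
    seen = SynFeatures(pkt.window, initial_ttl(pkt.ttl), pkt.df, pkt.options)
    for name, signature in SIGNATURES.items():
        if signature == seen:
            return name
    return None
```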

Installation instructions, an example and the source code can be found on the homepage.

I've also sent it to the netfilter-devel@ and netdev@ mailing lists, since my previous mails never appeared there, likely because of spam filters.

/devel/networking :: Link / Comments (0)


Mon, 30 Jun 2008

Filesystem development rumors.

Rumor number one. SWsoft aka Parallels actively searches for Linux kernel hackers in the leading Moscow universities, namely MSU and MIPT. I saw their posters, where among other (wanted) requirements there is distributed filesystem knowledge.

Rumor number two. Alexey Kuznetsov (if you do not know, he is the guy who wrote the major part of the Linux network stack, namely the TCP/UDP/IP and socket implementations, and although there have been lots of changes in the stack since then, I think it is not an exaggeration to call him its author), who also worked on Virtuozzo and OpenVZ (and its interesting VFS parts, which AFAICS are not in the kernel, maybe yet), works on some filesystem too. The last time we 'confronted' each other was a couple of years ago, when I first implemented netchannels and tried to convince the network community (namely Alexey Kuznetsov and David Miller) that the netchannel idea was worth further investigation and implementation. IIRC I did not succeed, although the results were very impressive.
Let's see what will happen with filesystems :)

Rumor number three. SWsoft recently started to actively search for a kernel hacker for a 'new interesting open source project'. They have always searched for kernel programmers, but never said anything about the projects; now something has changed.

Rumor number four. OpenVZ and Virtuozzo have serious problems with NFS (especially when the server dies), probably because of the very ugly NFS protocol (yes it is), so it's hard to properly virtualize it (or not?). There are no alternatives to NFS right now in major production setups, but you all know about POHMELFS, which right now can be used as a really good replacement.

Rumor number five. SWsoft has a long history of PhD defences (at least in MIPT) based on a theoretical FS called TorFS (namely Tormasov FileSystem). A year ago it was still not a very alive project in practice, but I heard that it was very impressive in theory. This rumor has been around for really many years.

So, I have quite a clear picture: SWsoft has started development of a new distributed filesystem, which is aimed first of all at replacing NFS in virtualized environments. I can also imagine very interesting distributed parallel facilities needed for virtualized systems. And they are trying to attract lots of people to the project, as well as really heavy artillery like Alexey Kuznetsov.

Which basically means that sooner or later my development will meet strong competition from this company, which has lots of really good professionals.
And that's very interesting and cool :)

P.S. Or it may be complete bullshit and the delirium of my fevered consciousness.

And one fact about POHMELFS: today I finished client support for padded crypto processing of all requests and started to work out the server bits. I expect to finish them in a day or so, so a new release is very close.

/devel/fs :: Link / Comments (3)


Sat, 28 Jun 2008

Listened to how my trumpet can sound.

It was really interesting. Although it is a very simple student model, a friend produced very good sounds. He has not practiced for many years, but nevertheless it was not that bad.

My everyday half-hour to hour exercises usually produce a worse sound, although sometimes I do find really cool notes. Unfortunately I still do not know the magic bit about how to catch that sound: it is born and disappears on its own, but I'm sure I will find it, and I think I'm close to where it hides :)

/other :: Link / Comments (0)


Need to rethink POHMELFS crypto a bit.

1. Encryption alignment problem - data to be encrypted has to be block-size aligned, so some information about padding has to be added into the network command, as well as the crypto data size.

2. IV generation. I decided to extend the network command and put a 64-bit IV for the given packet there. Using a simple sequence number is enough to protect against repeated-message (replay) attacks.

3. Encryption/hashing of data. I decided not to encrypt/hash network headers, and only do it for transmitted data. If a transaction contains several commands, data for all commands will be encrypted/hashed; in case of hashing, a single digest/HMAC will be generated and placed into the transaction header.

4. It is possible that I will add a strong header checksum, which will be generated only for the header and placed into a special field. It will be calculated assuming the checksum field is zero. This step is optional so far, but the network header has 32 reserved bits, which can be used for it.
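
Points 1, 2 and 4 can be put together in a small sketch. The header layout below (command, padded size, padding length, 64-bit IV, 32-bit checksum) is hypothetical and is not the POHMELFS wire format; CRC32 merely stands in for a strong checksum, a 16-byte cipher block is assumed, and the encryption step itself is left out.

```python
import struct
import zlib

BLOCK_SIZE = 16                      # assumed cipher block size
HEADER = struct.Struct("!IIHQI")     # cmd, padded size, padding, IV, checksum


def build_command(cmd, payload, seq):
    # Pad the data up to a whole number of cipher blocks and record how much
    # padding was added, so the receiver can strip it again.
    padding = -len(payload) % BLOCK_SIZE
    padded = payload + bytes(padding)
    # The checksum is computed over the header with its checksum field zeroed.
    zeroed = HEADER.pack(cmd, len(padded), padding, seq, 0)
    csum = zlib.crc32(zeroed)
    # The sequence number doubles as the 64-bit IV for this packet.
    return HEADER.pack(cmd, len(padded), padding, seq, csum), padded


def parse_command(header, padded):
    cmd, size, padding, iv, csum = HEADER.unpack(header)
    if zlib.crc32(HEADER.pack(cmd, size, padding, iv, 0)) != csum:
        raise ValueError("header checksum mismatch")
    if size != len(padded) or padding >= BLOCK_SIZE or padding > size:
        raise ValueError("inconsistent size or padding")
    return cmd, iv, padded[:size - padding]
```

A receiver that remembers the last sequence number it accepted can reject any packet whose IV does not increase, which is what makes a plain counter sufficient against repeated messages.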

Right now hashing and encryption work, but the results are not checked on the server (although they are generated). Because of the crypto alignment ugliness I decided to rethink the approach a bit.
Evolution process in action...

/devel/fs :: Link / Comments (0)


Fri, 27 Jun 2008

0:3

That really sucked - yes, we played badly. Just like before. It is not all that surprising.
But what was that fucking abnormal game a week ago against Holland? That was new, was cool, was bloody great, but not today. Tired or whatever... What's the difference right now, we lost.

Yes, Spain played really well, my congratulations.
But our team showed that it is possible.
That there is nothing impossible.
We can, when we want. You can, when you want.

Thanks a lot for the games!

/other :: Link / Comments (0)


Thu, 26 Jun 2008

POHMELFS server got initial crypto processing capabilities.

The POHMELFS server is able to handshake hash/cipher names and operation modes, to initialize the appropriate algorithms and to perform basic operations (like a more generic hash_update() instead of different functions with different arguments used to hash data depending on the operation mode, either simple digest or HMAC: EVP_DigestUpdate()/HMAC_Update()). I'm working on the right way of doing crypto processing without serious changes in the code, since how it is done right now is a bit hairy.
I already hate the OpenSSL API: EVP_get_cipherbyname(), EVP_MD_CTX, EVP_DigestFinal_ex(). It looks like the above functions were written by three different people who never actually talked to each other about how to make them look similar... But it is a minor issue of course.
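
The shape of such a generic wrapper is easy to show with Python's standard `hashlib` and `hmac` modules instead of the OpenSSL C API. This is only an illustration of the idea; the `hash_init()`/`hash_update()`/`hash_final()` names are assumptions, not the server's functions.

```python
import hashlib
import hmac


def hash_init(name, key=None):
    # One entry point for both operation modes: plain digest when there is
    # no key, HMAC when a key was negotiated.
    if key is None:
        return hashlib.new(name)
    return hmac.new(key, digestmod=name)


def hash_update(ctx, data):
    # Both context types expose update(), so callers never care about the mode.
    ctx.update(data)


def hash_final(ctx):
    return ctx.digest()
```

The mode is chosen once, at initialization time; the code that walks over the data only ever calls `hash_update()`.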

So, when things settle down, I will make a new release; it will likely see the light this week.

/devel/fs :: Link / Comments (0)


Hacking your ISP for fun and profit.

My ISP has again blocked my account and cannot unblock it, although there is money on the deposit. There are serious problems in its billing system which require manual intervention by an operator. Unfortunately it is a real challenge to reach them by phone: it already took more than half an hour yesterday, without success.
So, I decided to implement an interesting idea on how to bypass the blocking.

It is based on a security 'hole' in its DNS configuration (and I think the vast majority of ISPs do the same), which allows requesting any DNS record even if the account is blocked. The record will be fetched from a remote DNS server if it is not in the ISP's cache.
Thus the attack vector becomes visible: implement an IP-over-DNS tunnel network device and set up local routing to use it by default. One has to control at least one remote machine which hosts the DNS records for a given domain name, since it is required to parse incoming DNS requests and process them accordingly.

There are at least two known IP-over-DNS tunnel solutions: NSTX (howto) and OzymanDNS (howto). Both solutions require that you own some server to run the IP-over-DNS tunnel server on. Unfortunately I have only a single machine with a static IP address which is not protected by lots of firewalls and allows incoming connections.

The simplest solution to this problem is to create an iptables input target for the server, which will parse incoming DNS requests, pass usual queries up the network stack to the userspace server, and handle 'poisoned' queries as tunnel traffic.
The client can be TUN/TAP based, but can also be a tunnel network device.
I believe the weirder it looks, the more interesting it is, so I will likely think more about kernel-based tunnels.

DNS queries are limited enough not to allow binary data (IIRC, the most interesting are DNS TXT records), but data can be appropriately encoded and enciphered. So, I will put it on the todo list.
I even think that it is not that bad an idea to have such modules in the kernel :)
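
As a sketch of the encoding step only (no encryption and no actual DNS traffic), binary data can be carried in the query name itself: base32 survives case-insensitive DNS names, each label may be at most 63 characters and the whole name at most 253. The domain `t.example.com` and the function names are placeholders.

```python
import base64

MAX_LABEL = 63    # DNS limit for a single label
MAX_NAME = 253    # DNS limit for a full name in text form


def encode_query(payload: bytes, domain: str) -> str:
    """Pack binary payload into a DNS name under the given tunnel domain."""
    text = base64.b32encode(payload).decode("ascii").rstrip("=").lower()
    labels = [text[i:i + MAX_LABEL] for i in range(0, len(text), MAX_LABEL)]
    name = ".".join(labels + [domain])
    if len(name) > MAX_NAME:
        raise ValueError("payload too large for a single DNS name")
    return name


def decode_query(name: str, domain: str) -> bytes:
    """Recover the payload on the server that is authoritative for `domain`."""
    suffix = "." + domain
    if not name.endswith(suffix):
        raise ValueError("query is not for the tunnel domain")
    text = name[:-len(suffix)].replace(".", "").upper()
    text += "=" * (-len(text) % 8)   # restore the stripped base32 padding
    return base64.b32decode(text)
```

With these limits a single query carries on the order of a hundred-odd bytes, so an IP packet has to be split over several queries.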

/devel/other :: Link / Comments (3)


Wed, 25 Jun 2008

POHMELFS input crypto processing engine is ready for testing.

But testing cannot be done without appropriate server support, which is now the main task. POHMELFS uses a lazy crypto engine: each network state (it represents a connection between the client and one server) contains a number of fields used exclusively for semi-lockless input data processing (it locks the state when it performs actual reading, but does not hold that lock when processing incoming messages, since it is the only path which receives data). Now it also has crypto information about how to handle reply messages (they include the read page reply, for example), so it does not queue work for the crypto threads, but does it itself instead. It may or may not be the bottleneck of the input path; tests will provide facts. So far I do not have plans to change it, but of course it can be done if performance sucks.

After I finish crypto processing in both the client (it has been written, but requires lots of testing with the server) and the server (I have just started to recall how to work with OpenSSL. Well, I've read how HMAC works in OpenSSL, found it to be simple enough and then started to read how to parse binary data in LISP :) But anything which is interesting for me now ends up in good results for all other projects), I will switch to something different for a while.
Some voices in my brain ask me to spread out in lots of interesting directions :)

/devel/fs :: Link / Comments (0)


POHMELFS crypto performance.

I ran read/reread and write/rewrite tests as described in the previous run, now with HMAC(SHA1) of all outgoing transactions (note that read response data is not yet encrypted and does not contain a digital signature, and the server does not support either operation yet). Essentially only writing should be affected by this, but I also ran the reading tests for completeness.

Results show zero performance overhead from full data SHA1 hashing, but note that quite fast machines were used (two 3 GHz Xeons (2 physical and 2 logical CPUs, HT enabled) with 1 GB of RAM). All the time only two crypto threads were actively hashing data, since there are only two pdflush threads on this machine.

[Graphs: read and reread; write and rewrite]

Writing is even faster with hashing, but results drifted around, so essentially performance is the same.

/devel/fs :: Link / Comments (0)


Tue, 24 Jun 2008

VM gotcha: forbidden double kmapping.

I've just learned that it is impossible to map the same page twice: for example the first time using kmap()/kunmap() and the second time via kmap_atomic()/kunmap_atomic().
Although the mechanisms behind the two mappings are a bit different, this is forbidden and the system will panic like this:

IP: [] kmap_atomic_prot+0x1b/0xc5
*pdpt = 0000000031c79001 *pde = 0000000000000000 
Oops: 0000 [#1] SMP 

Pid: 6478, comm: pohmelfs-crypto Not tainted (2.6.25 #27)
EIP: 0060:[] EFLAGS: 00010202 CPU: 2
EIP is at kmap_atomic_prot+0x1b/0xc5
EAX: ebc7c000 EBX: 00000003 ECX: 00000000 EDX: 00000003
ESI: 00000fdc EDI: 00000163 EBP: 80000000 ESP: ebc7dee4
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process pohmelfs-crypto (pid: 6478, ti=ebc7c000 task=f25040b0 task.ti=ebc7c000)
Stack: 00000000 00000003 00000fdc f7cf4078 00000fdc c0114144 00000163 80000000 
       c01991b1 ebc7df44 f70e3580 00000000 ebc7dfa8 ebc7df40 f70e3580 00000003 
       00000000 f7cf4000 f70e3580 f70ff8b0 f70ff880 f7096c00 c019a771 f70e3580 
Call Trace:
 [] kmap_atomic+0x11/0x14
 [] update2+0x7c/0x13f
 [] hmac_update+0x49/0x50
 [] pohmelfs_crypto_thread_func+0x304/0x3e8 [pohmelfs]
 [] hrtick_set+0x7a/0xd7
 [] autoremove_wake_function+0x0/0x2b
 [] pohmelfs_crypto_thread_func+0x0/0x3e8 [pohmelfs]
 [] kthread+0x38/0x5f
 [] kthread+0x0/0x5f
 [] kernel_thread_helper+0x7/0x10

This happened in exactly the above case, when a page was first mapped via kmap() in POHMELFS and then via kmap_atomic() in the HMAC crypto processing code.
I wonder what will happen if we ever try to send kmapped pages over an IPsec tunnel. Likely it will oops too...
This can happen for example when pages are mapped in tcp_sendpage() when calling sendfile() over an interface which does not support hardware checksumming and scatter-gather: mapped pages are pushed down the network stack, where they will eventually be encrypted/hashed in IPsec, which will in turn call kmap_atomic().

So, if you find an obscure oops in kmap_atomic() and friends, first check that the calling stack did not map the page earlier.

/devel/other :: Link / Comments (0)


Mon, 23 Jun 2008

POHMELFS client got initial part of multithreaded crypto/checksum processing.

So far it only includes encryption and hash calculation for outgoing transactions. The system has a number of threads per superblock (a mount option), which are responsible for encryption/hashing and subsequent data sending. Each thread has its own crypto structure, so there are no additional allocations in the fast path (although I think they would not harm performance, since they should be a small enough fraction on top of the crypto processing overhead). The original caller (like the writeback/readahead code) will not block if there are ready threads; otherwise it will wait until some thread finishes its current crypto work.

I decided to implement a kind of continuation for such transactions: the network sending code (which is supposed to start after crypto processing) will be invoked from the threads which performed the crypto operations, without returning back to the original caller's context. For massively multiqueue NICs that should be a benefit, but so far I have not tested its performance.
The next step is crypto support on the receiving side and userspace changes.
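
The continuation idea can be modelled in userspace with a thread pool: the worker that computed the HMAC also performs the send, and the caller only submits the work. This is a Python sketch with made-up names, where appending to a list stands in for the network send.

```python
import hashlib
import hmac
import threading
from concurrent.futures import ThreadPoolExecutor


def crypto_then_send(data, key, wire):
    # Step 1: crypto processing in the worker thread.
    digest = hmac.new(key, data, hashlib.sha1).digest()
    # Step 2 (the continuation): send from the same thread instead of
    # handing the result back to the original caller.
    wire.append((data, digest))
    return threading.get_ident()


def submit_transaction(pool, data, key, wire):
    # The caller (think writeback code) returns as soon as the work is queued.
    return pool.submit(crypto_then_send, data, key, wire)
```

Usage would look like `with ThreadPoolExecutor(max_workers=4) as pool: submit_transaction(pool, page, key, wire)`, with `max_workers` playing the role of the mount option.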

/devel/fs :: Link / Comments (0)


Crypto processing in POHMELFS. OpenSSL vs GNU TLS.

If I did not miss something, GNU TLS (I never worked with it) supports a very limited number of ciphers and hashes, so it is not appropriate for a filesystem data protection layer.
According to its documentation, GNU TLS only supports the AES, RC4 and 3DES ciphers and the SHA1 and MD5 hashes. There is also only the CBC chaining mode and several hash/cipher schemes.

So, the POHMELFS server will use OpenSSL for data protection. Sooner or later OpenSSL will get hardware crypto support on Linux too (well, the Linux crypto stack should first implement a userspace API, which does not exist yet, although there is work by Loc Ho from AMCC to add such support).

So far I have decided to implement the following protection scheme: the checksum or encryption will cover the full transaction data, but will be applied in chunks:

  • Transaction 'first-level' data, i.e. the header and the data placed immediately after the transaction header. For all commands except page writing this is all there is.
  • For the write pages command, each header is generated dynamically and does not exist until data is really being sent, so the crypto code will run over all pages and update the checksum, processing headers and data pages separately. The checksum update should be simple enough, since there are crypto helpers to update and finalize a checksum, but encryption is more complex: it requires all chunks to be set up in advance in a single scatterlist chain, and with dynamic header generation that is too big an overhead (it requires not only scatterlist allocation, but also header allocation just for encryption). So encryption will be done separately for headers and pages, and I will have to create some IV propagation scheme (like the last bytes of the previous unencrypted chunk becoming the IV for the next chunk, or something like that). I understand that it may not be a very secure approach though.
  • Reading data back from the server is simpler, since there are no transactions, and data will be encrypted/checksummed like in the first step above. It is possible that it will force me to enlarge the network header structure a bit (32 or 16 bits to store the size of the attached checksum).
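
The checksum half of this scheme relies on one property: feeding the chunks to the hash one by one gives the same result as hashing them concatenated, so headers can be generated on the fly. A minimal Python illustration, with a hypothetical per-page header layout (page index and length) and HMAC-SHA1 from the standard library; the encryption and IV propagation part is not covered.

```python
import hashlib
import hmac
import struct


def write_chunks(pages, start_index=0):
    """Yield (header, page) pairs; each header is built only when it is needed."""
    for offset, page in enumerate(pages):
        # Hypothetical per-page header: 64-bit page index, 32-bit length.
        header = struct.pack("!QI", start_index + offset, len(page))
        yield header, page


def transaction_hmac(key, first_level, pages):
    ctx = hmac.new(key, digestmod=hashlib.sha1)
    ctx.update(first_level)            # transaction header and inline data
    for header, page in write_chunks(pages):
        ctx.update(header)             # headers and pages are fed separately,
        ctx.update(page)               # no contiguous buffer is ever built
    return ctx.digest()                # single digest for the whole transaction
```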

/devel/fs :: Link / Comments (2)


Sun, 22 Jun 2008

3:1. It is fucking unbelievable, but we wooooon!

We won against Holland, one of the strongest football teams!
There was blood on the field, fouls and other crap, but we made it!

3:1 against Holland. "Russia has risen from its knees", as they screamed. We are in the semifinal. You have been warned, Russians are coming! :)

Pivo and vodka, we will not sleep today!

/other :: Link / Comments (3)


We do not believe in penalties!

It is fucking unbelievable, but Russia is playing Holland and the score is 1:1. Not only is it level, we really do play cool football!
And Holland equalized in the 87th minute. We were so close, but it is not over yet. We can win. We will win!
I do not understand how the hell our team started to play that well, but we can. We fucking can, when we want. We play not for the goal, not for the money, not for fucking anything, we play just for the game. And the game wins!

The first half of extra time has ended. Russia vs Holland 1:1.
We can. Just because we can.

/other :: Link / Comments (0)


Thu, 19 Jun 2008

POHMELFS and HMAC/crypto operations.

As I found with the distributed storage project, any communication channel which involves huge amounts of data transfer has to have an additional strong checksum embedded in the protocol, since the TCP one is not enough in some cases. There are some options, like TCP MD5 signatures or IPsec transformations, but they are not always available.

POHMELFS will include the ability to encrypt the whole data channel and/or only digitally sign all messages. This will be implemented at the transaction level, so no higher layer code (like the functions reading/writing data) will ever be affected.
POHMELFS will also have mount-time self-configuration, i.e. the client will send the server information about the capabilities requested by the administrator, and if the server does not support some of them (for example it can only do HMAC and not encryption, while both operations were requested at mount time), they will be dropped (or, optionally, the mount will fail). In the future it will be possible to extend it with additional flags if needed.
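
The negotiation logic boils down to a bitmask intersection. A small sketch, assuming capability flags are exchanged as bits; the `CAP_*` names and the `strict` switch are illustrative and not taken from the POHMELFS protocol.

```python
CAP_HMAC = 1 << 0      # hypothetical flag: sign messages
CAP_ENCRYPT = 1 << 1   # hypothetical flag: encrypt the data channel


def negotiate(requested, server_supports, strict=False):
    """Return the capabilities both sides agree on.

    With strict=True an unsupported request fails the mount instead of
    being silently dropped.
    """
    agreed = requested & server_supports
    dropped = requested & ~server_supports
    if strict and dropped:
        raise RuntimeError("server lacks requested capabilities: %#x" % dropped)
    return agreed
```

New capabilities only need a new bit, which matches the remark about extending the scheme with additional flags later.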

mount is not a very convenient command for transferring crypto information (like binary keys) to the kernel, so I use the same infrastructure as for the initial server group initialization (i.e. the existing POHMELFS configuration utility).

Support for HMAC and encryption will force the server to depend on OpenSSL, but I do not think it is a problem. At some point in the future I can write autoconfiguration, which will allow compiling the server without crypto support (which will thus not accept encrypted clients and not check signatures) if there is no OpenSSL.

After crypto operations are implemented (I expect to finish them this week), I will release, as promised, a new netchannel version (and will remove unneeded functionality like NAT), and add some interesting bits (like async processing) to the distributed storage, so expect its new release soon too.

Stay tuned!

/devel/fs :: Link / Comments (2)


CLISP socket streams.

Excellent documentation with examples. I expect that it is implementation (i.e. CLISP) specific and will not work with SBCL or Allegro for example, but nevertheless I want to learn and make some use of it.
If it turns out to be good for my use cases, what will my next userspace server be written in? :)

/devel/other :: Link / Comments (0)


POHMELFS, NFS, Ext4 and XFS in iozone benchmark. Graphs.

Hardware used in testing: a 4-way Intel E7520 system (two logical and two physical CPUs) with 3 GHz 32-bit Xeons and 1 GB of RAM, an Adaptec AIC7902 Ultra320 SCSI adapter and a SEAGATE ST3300007LC 10k rpm 300 GB testing disk. Its linear reading speed is about 90 MB/s.

Software used in testing: 2.6.25 kernels (on server and client), in-kernel async NFS server, userspace POHMELFS server.

Tests were performed with 8 GB files (the amount of RAM was reduced to 1 GB to eliminate caching influence) with different record sizes (from 8 to 1024 KB). I ran write/rewrite, read/reread and random read and write tests.

[Graphs: read and reread; write and rewrite; random read and random write]

/devel/fs :: Link / Comments (0)


CRFS got metadata cache coherency support.

Zach Brown has committed cache coherency support into the CRFS repository.
The cache coherency protocol works by broadcasting special messages from the server, and each client invalidates the appropriate inodes (and dentries if needed) before sending back a reply.
POHMELFS uses a slightly different mechanism: the client does not send acks back to the server, so all such messages are kind of advisory-only. But I have not yet completed the locking design (well, I did not even think about this problem this week), so it can change.

The main problem with sync cache coherency support is its absolute non-scalability. While a number of usage cases might require such behaviour, I expect that, if not the majority, then a noticeable part of users do not want performance degradation as the price for POSIX-like coherency expectations. This approach is worse than a write-through cache, since there is a whole round-trip of the cache coherency request instead of just sending the data during its writing. Single-direction sending is faster than sending+waiting, so for me it is still a questionable approach.

I will think a lot about this problem later this week(end), so that the solution would satisfy both the high-performance and safety camps (although only to some degree, I think).

/devel/fs :: Link / Comments (0)


Wed, 18 Jun 2008

LISP macros rox!

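;; The macro intentionally exposes OPERATION (the current iozone test name)
;; to the body. POS and DIR are unquoted everywhere, so the expansion no
;; longer depends on the caller using exactly these variable names.
(defmacro with-output-dir ((out pos dir flags) &body form)
  `(let ((,pos 2))
     (dolist (operation (nthcdr 2 *iozone-tests*))
       (let* ((dir-path (pathname-as-directory ,dir))
              (output-file (make-pathname
                            :directory (pathname-directory dir-path)
                            :name operation
                            :type "gnuplot")))
         (with-open-file (,out output-file :direction :output :if-exists ,flags)
           ,@form))
       (incf ,pos))))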

(defun write-gnuplot-headers (dir)
  (with-output-dir (out pos dir :supersede)
		    (format out "set title \"Iozone performance: ~a, KB/s\"~%" operation)
	            (format out "set terminal png small size 450 350~%")
	            (format out "set logscale x~%")
	            (format out "set xlabel \"Record size in KBytes\"~%")
	            (format out "set ylabel \"Kbytes/sec\"~%")
	            (format out "set output \"~a.png\"~%" (elt *iozone-tests* pos))
		    (format out "plot ")))

(defun update-gnuplot-headers (dir file)
  (with-output-dir (out pos dir :append)
		   (unless *first-file-p*
		     (format out ", "))
		   (let* ((fstype (pathname-name file))
			  (name (make-output-name file)))
		     (format out "\"~a\" using 1:~d title \"~a\" with lines" name (1+ pos) fstype))))

Macros are really the coolest feature of LISP. Now I believe I have started to understand LISP kung-fu.
The iozone parser is essentially ready. I was a bit pessimistic yesterday: it took only half of the day and several hours today. The code itself is rather ugly (and frequently really ugly, likely far from the LISP way), but it works: it runs over the given dir, searches there for files with the given extensions, parses them (removes unneeded iozone information) and writes the result to the specified directory. It also runs over the iozone test strings and generates gnuplot scripts for them, which will build graphs based on the filesystem info gathered while traversing the tree above, so the results look like this:
$ ./parser.lisp
Processing: /tmp/iozone/tmpfs/nfs.out ... done
Processing: /tmp/iozone/tmpfs/pohmelfs.out ... done
$ cat /tmp/iozone/tmpfs/out/read.gnuplot 
set title "Iozone performance: read, KB/s"
set terminal png small size 450 350
set logscale x
set xlabel "Record size in KBytes"
set ylabel "Kbytes/sec"
set output "read.png"
plot "/tmp/iozone/tmpfs/nfs.out.data" using 1:5 title "nfs" with lines,
	"/tmp/iozone/tmpfs/pohmelfs.out.data" using 1:5 title "pohmelfs" with lines

/devel/other :: Link / Comments (2)


Tue, 17 Jun 2008

LISP development zen.

(defun string_to_list (str)
  (let ((num 0) (ret '()) (string_len (length str)))
    (dotimes (i string_len)
      (let ((sym (elt str i)))
        (cond
	  ((not (char-number-p sym))
            (unless (eql num 0)
	      ;(format t ": ~d~%" num)
	      (push num ret)
              (setf num 0)))
          (t (setf num (+ (* num 10) (to_number sym)))
	     (when  (eql i (- string_len 1))
	       (push num ret))))))
  (nreverse ret)))

This is a part of my LISP parser for iozone output files. So far it is able to convert iozone's output numbers (performance in KB/sec) into LISP lists (one list per record), so a single line of iozone output becomes a single list of numbers (ugh, I was forced to write a string-to-number conversion function).
It is likely not that serious an achievement, and it took the whole day, but nevertheless I like it, although I would write the same in C much faster :)

The main problem with Lisp for me is its functional style of writing conditions. Converted to C it looks like:
if (a) {
  if (b) {
    if (c) {
      do_stuff();
    }
  }
}
While I would write:
if (!a)
  return;
if (!b)
  return;
if (!c)
  return;
do_stuff();
So far I have not used macros at all, and all the time looked into the Practical Common Lisp book (and frankly took the directory processing functions from there, although I modified them a bit), but what would you expect from a first project. Tomorrow I will extend it to write a gnuplot-compatible file and finally generate some graphs (I do not know how to call external programs from LISP though).
Frankly, I'm not yet excited about how cool LISP is, but I like it, since it is different. Just like I like my neverending apartment development process.
Ugh, and with proper automatic vim highlighting I am not afraid of parentheses.

Interested readers can grab my sources and comment on the ugliness.

Also found an 'interesting' article at IEEE about LISP: Migration of Common Lisp Programs to the Java Platform - The Linj Approach :)

/devel/other :: Link / Comments (2)


Sun, 15 Jun 2008

Meanwhile on the apartment development side.

Decided to work in a completely different area than usual today, so: neverending apartment development.

Today I painted the whole ceiling in the kitchen and I want to believe that it is the last time. It was not that quick, but it took a noticeably smaller part of the day.

The main task was the floor in the hall. I finally covered it with ceramic granite.
It was supposed to be a seamless granite installation, but... the tiles have such precise dimensions that the difference between them was never more than half a centimeter on each side, so I was forced to make small seams and move the tiles around for quite a while before they formed somewhat straight lines, although there are lots of non-straight crossings.
Nevertheless it looks cool, I'm glad I finished this part.

Hall ceramic granite

/devel/flat :: Link / Comments (0)


Sat, 14 Jun 2008

Passive OS fingerprinting.

Ever dreamt of blocking all Linux users in your network from accessing the internet and giving full bandwidth to Windows worms? We have to care about our smaller brothers, so this iptables extension module allows you to do so. OSF, which stands for OS Fingerprint, lets you make the usual iptables decisions on incoming TCP packets; the initial handshake packet with the SYN bit set is enough to understand what the remote OS is. The original idea belongs to Michal Zalewski.
This iptables module was implemented almost 5 years ago and lived in the patch-o-matic iptables tree (the userspace library is still there). Now I've updated it to Xtables and sent it for review.

Installation steps are described on the homepage, but they are trivial: the usual make/make lib build, then loading the rules into the module via a procfs file.

# insmod ./ipt_osf.ko
# ./load ./pf.os /proc/sys/net/ipv4/osf
# iptables -I INPUT -j ACCEPT -p tcp -m osf --genre Linux --log 0 --ttl 2 --connector
You will find something like this in syslog:
ipt_osf: Windows [2000:SP3:Windows XP Pro SP1, 2000 SP3]: 11.22.33.55:4024 -> 11.22.33.44:139

/devel/networking :: Link / Comments (0)


New userspace network stack release.

Fixed a bug found by Salvatore Del Popolo (delpopolo_dit.unitn.it) in the TCP implementation: the system checked the sending window, determined that the packet was not allowed to be sent, and nevertheless tried to send it in some cases.
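
The kind of check in question can be sketched as follows. This is not the unetstack code, just the textbook rule from the TCP specification: a segment may go out only if it ends inside the window advertised by the peer. The names snd_una, snd_nxt and snd_wnd follow the RFC 793 send sequence variables; the functions themselves are hypothetical, and sequence number wraparound is ignored for brevity.

```python
def may_send(snd_una, snd_nxt, snd_wnd, seg_len):
    """Return True if a segment of seg_len bytes fits into the send window.

    snd_una: oldest unacknowledged sequence number
    snd_nxt: next sequence number to be sent
    snd_wnd: window size advertised by the peer
    """
    window_end = snd_una + snd_wnd
    return snd_nxt + seg_len <= window_end


def try_send(snd_una, snd_nxt, snd_wnd, payload):
    """Send only when the window check passes; otherwise keep the data queued."""
    if not may_send(snd_una, snd_nxt, snd_wnd, len(payload)):
        return snd_nxt, False      # nothing sent, sequence number unchanged
    return snd_nxt + len(payload), True


print(try_send(1000, 1000, 1460, b"x" * 1460))  # fits exactly into the window
print(try_send(1000, 2000, 1460, b"x" * 1460))  # would overrun the window
```

The point of the fix is the early return: once the check says no, the send path must not be entered at all.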

Userspace network stack is a very fast (when working on top of netchannels; packet socket is also supported) and very small network stack (TCP/UDP/IP/ethernet) implemented entirely in userspace. Because it lives at the very end of the peer (i.e. very close to or even embedded into the application), it allows much faster processing of some workloads, namely small packet sending and receiving, where it outperforms the vanilla Linux TCP/IP stack 3 times in performance and 4 times in CPU usage (sending and receiving vary).

ATCP gigabit test

Compare netchannels+unetstack versus Linux sockets (numbers from 2006).

It is not about problems in the Linux stack, but about the overhead of syscalls, which in turn results from data sending and reply processing being too separated in the existing model.

/devel/networking/unetstack :: Link / Comments (0)


CARP: Common Address Redundancy Protocol for Linux kernel.

I've finally made a new release of CARP for the Linux kernel.

CARP is an improved version of the Virtual Router Redundancy Protocol (VRRP) standard. The latest protocol to help provide high availability and network redundancy, it was developed because router giant Cisco Systems believes that its Hot Standby Router Protocol (HSRP) patent covers some of the same technical areas as VRRP.

This project allows you to build highly available clusters of multiple machines with balanced master selection between them. Installation and setup are pretty trivial:

$ tar -zxf carp_latest.tar.gz
$ cd carp
$ make

# insmod ip_carp.ko
# modprobe cn
# insmod carp_conn.ko
# ifconfig carp0 up
# carp_conn_daemon -m master.sh -b backup.sh
And the same on all other machines.
Each script, as you can guess from its name, is executed when the node becomes a master or a backup one; you can put there firewall rule changes, traffic shaping setup, network daemon start/stop scripts and whatever you like.

Its main advantage over other existing open master/backup solutions like Heartbeat or userspace CARP (well, it also behaves much more robustly than Cisco VRRP) is the ability to set up a multicast address (via the usual /sbin/ifconfig command) and thus not confuse some crappy Cisco hardware, which would not understand that the node changed.

One can get the latest sources from the CARP homepage.
Enjoy!

/devel/networking :: Link / Comments (0)


Fri, 13 Jun 2008

The latest iozone benchmark of POHMELFS, NFS, XFS and Ext4.

1Gb of RAM, 8Gb files. SEAGATE ST3300007LC 10k rpm 300 Gb on Adaptec AIC7902 Ultra320 SCSI adapter.

Performance in KB/s.

NFS:

                                                   random  random
     KB  reclen   write rewrite    read    reread    read   write
8388608       8   53210   57769    24304    24448    1360    4775
8388608      16   54577   57481    23871    24080    2592    7937
8388608      32   54736   56203    24015    24114    4738   12637
8388608      64   52075   54051    23653    23555    7610   18475
8388608     128   52307   54636    23305    23375   13017   26584
8388608     256   52189   53030    23585    23531   15615   34390
8388608     512   52938   54063    23709    23882   17524   42781
8388608    1024   57458   57006    24187    24292   29701   43892
POHMELFS:
                                                   random  random
     KB  reclen   write rewrite    read    reread    read   write
8388608       8   66473   63721    74232    74288    1103    4953
8388608      16   52604   62339    73423    74259    2001    8438
8388608      32   53278   62283    73497    74115    3360   13849
8388608      64   56931   61370    73135    74077    5076   21063
8388608     128   59419   62743    72736    74122    8068   30279
8388608     256   60861   63094    73284    74554   10848   38869
8388608     512   59438   62081    73329    74441   17290   48722
8388608    1024   62790   62130    73322    74100   27741   46470
POHMELFS write speed is about 10% faster and read speed is 3-3.5 times faster (essentially the disk/local fs IO limit, see below). POHMELFS random read speed is lower, and fixing that is the task with the highest priority now, especially compared to the local FS results. POHMELFS random write is slightly faster than NFS.
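
The ratios quoted above can be rechecked from the tables. A small sketch that averages the sequential write and read columns over all record sizes; averaging over record sizes is a choice made for this illustration, not necessarily how the figures in the text were derived.

```python
# Sequential write and read columns (KB/s), copied from the two tables above.
nfs_write = [53210, 54577, 54736, 52075, 52307, 52189, 52938, 57458]
nfs_read = [24304, 23871, 24015, 23653, 23305, 23585, 23709, 24187]
pohmelfs_write = [66473, 52604, 53278, 56931, 59419, 60861, 59438, 62790]
pohmelfs_read = [74232, 73423, 73497, 73135, 72736, 73284, 73329, 73322]


def mean(values):
    return sum(values) / len(values)


write_ratio = mean(pohmelfs_write) / mean(nfs_write)
read_ratio = mean(pohmelfs_read) / mean(nfs_read)

print(f"write: POHMELFS/NFS = {write_ratio:.2f}")
print(f"read:  POHMELFS/NFS = {read_ratio:.2f}")
```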

For comparison, the local filesystem used for the tests, XFS, created and mounted as follows:
mkfs.xfs -d agcount=75 -l size=64m /dev/sdc1;
mount -o logbufs=8,nobarrier,noatime,nodiratime,osyncisdsync /dev/sdc1 /mnt/
                                                   random  random
     KB  reclen   write rewrite    read    reread    read   write
8388608       8   75124   60560    77672    77797    1860    5059
8388608      16   75044   60036    77754    77775    3601    8772
8388608      32   75958   62038    77593    77765    6821   14781
8388608      64   74728   59384    77688    77782   12475   23228
8388608     128   74889   59676    77731    77736   21734   32241
8388608     256   75022   59285    77676    77718   28833   40324
8388608     512   74885   59187    77653    77713   40013   48057
8388608    1024   74838   64217    77796    77765   55100   46104
And Ext4 joins the group (mount options: rw,noatime,data=writeback,extents):
                                                   random  random
     KB  reclen   write rewrite    read    reread    read   write
8388608       8   72107   73017    77276    77335    1849    5015
8388608      16   72276   73849    77304    77287    3577    8666
8388608      32   72680   73647    77284    77326    6755   14394
8388608      64   71965   74287    77327    77288   12366   22513
8388608     128   72660   73864    77207    77343   21617   31160
8388608     256   72813   74058    77296    77338   28652   42003
8388608     512   72985   73317    77284    77343   40572   50619
8388608    1024   72184   74131    77264    77250   55649   50365
Nice graphs will be done when I write a Lisp (no less :) parser for it.
Stay tuned!

/devel/fs :: Link / Comments (3)


New POHMELFS release: doing it wrong fast is at least better than doing it wrong slowly.

Via Ashleigh Brilliant and bits of Tullamore Dew.

Here we go, short changelog for this release:

  • Balancing of read requests (data reads, directory listings, lookups) between multiple servers.
  • Write requests are sent to multiple servers and completed only when all of them have sent an ack.
  • Ability to add and/or remove servers from the working set at run-time from userspace (via netlink; the same command could be processed from the real network too, but since the server does not support it yet, I dropped the network part).
  • Documentation (overall view and protocol commands)!
  • Rename command (oops, forgot it in previous releases :)
  • Several new mount options to control client behaviour instead of hardcoded numbers.
  • Bug fixes.
I will complete the documentation in a few moments and send this release to the mailing lists.
Very likely it is the last non-bug-fixing release of the kernel client side. The next release will incorporate features needed for distributed parallel data processing (like the ability to add new servers via a network command from other servers), so most of the work will be devoted to the server code.
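
The first two changelog items can be illustrated with a toy model. This is not POHMELFS code: the class and method names are hypothetical, and round-robin is just one possible balancing policy, assumed here for the illustration.

```python
from itertools import cycle


class WorkingSet:
    """Toy model of a client-side set of servers (names are hypothetical)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._next_reader = cycle(self.servers)

    def pick_reader(self):
        """Balance reads: hand out servers in round-robin order."""
        return next(self._next_reader)

    def start_write(self):
        """A write is sent to every server and waits for all of them."""
        return WriteTransaction(self.servers)


class WriteTransaction:
    def __init__(self, servers):
        self.pending = set(servers)   # servers that have not acked yet

    def ack(self, server):
        self.pending.discard(server)

    @property
    def completed(self):
        return not self.pending       # done only when every server acked


ws = WorkingSet(["server-a", "server-b"])
print([ws.pick_reader() for _ in range(4)])

txn = ws.start_write()
txn.ack("server-a")
print(txn.completed)   # still waiting for server-b
txn.ack("server-b")
print(txn.completed)
```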

/devel/fs :: Link / Comments (0)


Wed, 11 Jun 2008

Preparing for the next (last non-bug-fixing?) release.

Essentially that's it. I believe that most of the features I wanted from a network distributed parallel filesystem, at least those which should live in the client, are already implemented in POHMELFS.

The client has the following features (if I did not forget something; only those interesting from the parallel point of view are listed):

  • Automatic failover reconnect to the same server.
  • Run-time addition/removal of servers from the working set (only via a userspace command, since the server does not support that yet, but adding it is trivial).
  • Coherent data and metadata cache
  • Transactions support. Full failover for all operations. Resending transactions to different servers on timeout or error.
  • Load balancing of reading (directory reading and lookups inclusive) requests and simultaneous writing to all servers in current working set.
It is damn fast (but remember that random reading is not yet optimal enough, and in the last tests it was slower than NFS).

Meanwhile, the userspace server does not support lots of features it has to support to be called a complete parallel distributed solution, and the main work should now be concentrated on it.
The main missing (and the most complex) features are:
  • Distributed data coherency protocol like PAXOS for server data, stored on multiple machines.
  • Ability to mirror data itself on multiple machines.
So, the release will likely see the light tomorrow or on Friday.

/devel/fs :: Link / Comments (0)


Tue, 10 Jun 2008

Sun and water. Sasha and Masha.



Thank you, that was great!

/life :: Link / Comments (2)


Fri, 06 Jun 2008

Contributors we are losing and kernel summit talk about it.

By 'we' I mean the kernel community, although I do not think I personally win or lose if someone decides not to hack on the Linux kernel.

I even found myself in a 'contributors we are losing' list :)

And yes, very likely the Linux kernel community lost me (and I do believe no one cares, me included). But not the Linux kernel, it is definitely a place I like.

People who want to hack on the Linux kernel will do that without all those empty talks and brilliant ideas, all of which aim in a single direction: do what we ask you to do for us. Be fair and admit that you do not want new ideas implemented, you only want old bugs (introduced by someone else) fixed, so that the kernel gets more respect without possible additional work for you.

This is not how interested people work; instead they just decide themselves what to do and how. That's why the kernel janitor project did not succeed: it is not interesting for anyone. The same applies to its refocus on bugfixes.
And I do know what kernel janitorial work is. I started with it not so long ago: I fixed trivial error checks like the request_region()/check_region() code and other minor things like PCI remap errors.
That was a hell of a lot of crap. Frequently I fixed lots of drivers (like 20 or more) in one go and submitted a patch, only to be asked to split it into separate patches, add each driver maintainer to the copy, wait for their ACKs, resubmit and so on. And it frequently happened (especially when a new feature was introduced and lots of small pieces of code had to be changed a little) that while I did that, some other known kernel hacker did the same, and his patch was immediately applied.

Janitorial work and all the hypocrisy about 'we want more developers' just suck.

My advice for those who really want to hack on the kernel: just do what you like, try yourself in whatever subsystem you want, implement your ideas, be creative and do whatever you like with the kernel, not what all those kernel heads tell you to do.
The only way to succeed is to move forward!

Argh, and do not listen to any advice of this kind at all :)

/devel/other :: Link / Comments (3)


POHMELFS development status.

POHMELFS got the ability to add/remove servers at run-time, although so far only via the netlink interface and not via a network command (I do not know how to test that yet). The same message can be passed over the network though, so it will be simple to extend.
Also, POHMELFS got readahead support via the ->readpages() callback. I removed AIO reading from POHMELFS in favour of readahead and got an excellent result in sequential reading: 3-3.5 times faster than NFS and essentially reaching disk IO bandwidth (a bit less though), but random reading dropped to miserable numbers.
Also, the rewritten reading method should balance load between multiple servers better, but it will not show any benefit in the single-threaded iozone benchmark, since iozone reads data via a single call to read(), which gets sequential data access, which in turn is faster than network bandwidth. So a multithreaded load should greatly benefit from read balancing, but I have not tested that yet.

I ran sequential read/reread, write/rewrite and random read/write tests for XFS, Ext4, NFS (over XFS) and POHMELFS (over XFS) with 1Gb of RAM and 8Gb of test files (to eliminate VFS caching influence) with 8Kb to 1Mb record size.
Results exist in text files in standard iozone output format, but since I'm learning LISP I decided to write a graph generator (via gnuplot) using my very basic knowledge of this language, so nice graph results can take a while...

Also, tomorrow morning I will fly away to my friend's wedding and will only return on Monday the 9th. I will not have internet access there, only lots of fun.

/devel/fs :: Link / Comments (0)


Thu, 05 Jun 2008

Travelling to Uganda.

A friend is inviting me to go to Uganda this September. Promises beautiful nature and a very interesting trip in itself.

Thinking...

/life :: Link / Comments (2)


Wed, 04 Jun 2008

Optimized POHMELFS transactions.

Now they eat less memory, and a single writing transaction can accumulate up to 1024 pages. This can be tuned further, especially for small requests mixed with sync. Currently a writing transaction is allocated for its maximum size, and then page pointers are written into the allocated area, so if the number of dirty pages requiring writeback is small, quite a lot of space will be wasted.
That is a task for the next optimization. Nevertheless, sequential writing is currently limited only by disk throughput, or by network bandwidth in the case of multiple servers: the link is shared between the machines, so the effective bandwidth becomes GigE divided by the number of servers, or about 60 MB/s in my environment with two servers and a single client.
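
As a back-of-the-envelope check of that number (a sketch: it uses the raw 1 Gbit/s line rate and ignores Ethernet, IP and TCP overhead, so the real figure ends up a bit lower):

```python
GIGE_BITS_PER_SEC = 1_000_000_000   # raw Gigabit Ethernet line rate


def per_server_bandwidth_mb(num_servers):
    """Client link bandwidth available per server, in MB/s (10**6 bytes)."""
    link_mb_per_sec = GIGE_BITS_PER_SEC / 8 / 1_000_000
    return link_mb_per_sec / num_servers


print(per_server_bandwidth_mb(1))  # 125.0
print(per_server_bandwidth_mb(2))  # 62.5
```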

Also, the reading path was not changed at all (only transaction internals): there is still no readahead, and a new transaction is allocated for each page to be read. Nevertheless, see how reading improved: POHMELFS not only outperformed NFS again, but already reached the disk bandwidth limit for 16Kb requests (almost two times faster than NFS). The table shows IO throughput in KB/s.

                                                    random  random
      KB  reclen   write rewrite    read    reread    read   write
 8388608       8   74058   68392    40130    79509   43588    4818
 8388608      16   62332   66978    73714   122074   42160    8434
 8388608      32   64775   67073   109357   171139  145416   14183
 8388608      64   66962   66602   147350   217323  227962   22257
 8388608     128   67724   67133   185574   266855  321060   32681
 8388608     256   68233   67922   201591   283567  474657   40944
 8388608     512   68339   66514   213513   295995  646897   50303
 8388608    1024   67744   67384   220858   297748  676582   48796
I will create nice graphs out of these tables and will also include optimized reading tests (likely tomorrow) and results for two data servers.

What should also be done is testing with either bigger files or a smaller amount of RAM, and thus a smaller VFS cache. As you saw in all the tests, when lots of reads start to hit the cache, the picture becomes completely uninformative about filesystem behaviour. So I want to limit all three testing machines to 1Gb of RAM (booting with the mem=1G parameter) and perform the same iozone bench for an 8Gb file. The results should be more realistic.

In parallel I will implement the userspace run-time server addition/removal command, which will also be used as-is for the network message from one server or another connected earlier. Together with optimized reading transactions it will be a good ground for the next POHMELFS release. So I plan to schedule it for Thursday or the middle of next week, since I will be on a small vacation June 6-9.

/devel/fs :: Link / Comments (0)


Mon, 02 Jun 2008

AppArmor and path-based security approaches vs object bound policies.

- So again, can you offer an alternative?
- Just give up on this dumb idea completely.
It is not about AppArmor in general (although maybe about it too), but about security hooks which provide path information to inode callbacks. There are pros and cons to this decision, but it looks like path-based security hooks will not be accepted.

There is a really trivial way to fix it. No kidding, it is simple: create your own name cache and do not bind it to dentries, but instead index it by inode number. This allows you to have whatever callbacks and information you want within strictly bound VFS operations. Need path info in ->inode_create()? Put it into your own tree indexed by the inode number of the parent inode, look that data up in the security hook and make a decision. Yes, it is slower, but active security was never a fast solution. It is still against the rules others created for security-based systems, but formally it stays within all the boundaries of the existing (maybe ugly for someone) interfaces.
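
A toy sketch of that idea in Python (the real thing would of course live in the kernel and be written in C). Everything here is hypothetical: the class, the hook name inode_create_hook and the policy rule are invented for the illustration. The only point is that the path is looked up by the parent's inode number instead of being passed into the hook.

```python
class NameCache:
    """Private name cache: parent links and names indexed by inode number."""

    def __init__(self, root_ino):
        self.root_ino = root_ino
        self.entries = {}                      # ino -> (parent ino, name)

    def add(self, ino, parent_ino, name):
        self.entries[ino] = (parent_ino, name)

    def path_of(self, ino):
        """Rebuild the full path by walking parent links up to the root."""
        parts = []
        while ino != self.root_ino:
            ino, name = self.entries[ino]
            parts.append(name)
        return "/" + "/".join(reversed(parts))


def inode_create_hook(cache, parent_ino, new_name, denied_prefixes):
    """Hypothetical security hook: it only gets inode numbers and a name."""
    parent_path = cache.path_of(parent_ino).rstrip("/")
    path = parent_path + "/" + new_name
    return not any(path.startswith(prefix) for prefix in denied_prefixes)


cache = NameCache(root_ino=2)
cache.add(10, 2, "etc")
cache.add(11, 10, "ssh")
print(cache.path_of(11))
print(inode_create_hook(cache, 11, "sshd_config", ["/etc/ssh/"]))
print(inode_create_hook(cache, 2, "tmpfile", ["/etc/ssh/"]))
```

The extra lookup on every hook call is exactly the slowdown mentioned above.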

And I will not point to the project which already uses such an approach in a different area :)
It is interesting to implement your ideas not by breaking something (although sometimes that is needed, but that's likely an exception, or when you are hacking a deeply internal kernel part), but instead by hacking around existing limitations.

/devel/fs :: Link / Comments (4)


Trumpet kung-fu.

I think I found a way to make progress in my trumpet playing exercises (read: ear-scratching screaming, it sounds much worse than a wrong note on the piano).

Ear cracking device

It is practice of course, but even without the whole tube, using only the mouthpiece, I can train the breathing path. Musicians exercise several hours per day; kernel hackers about half an hour each morning, in parallel with listening to AC/DC and Metallica as an alarm clock. The mouthpiece is rather quiet (noticeably louder than usual talk though), but produces about the same resistance to air flow, so I think it is good training. When my embouchure is stable enough I will attach the trumpet, since currently the sound frequently drops and jumps. Nevertheless I have made big progress (I think so at least) since starting such training recently.

My home guardian Socket, although he does not have ears, looks like he does not like it. Although he only likes to eat.

Socket - a home guardian

/life :: Link / Comments (0)


Pros are talking.

- If you haven't noticed, I don't take "no" for an answer,
- And now please tell us step 2 in your secret plan to win friends and influence.
- WTF are you getting at?
Fun thread :)

There is actually a serious problem in the kernel community when some new idea is being implemented and it goes against something which sits in the mind of one or another big kernel hacker out there. When such a person replies that this is a bad idea (sometimes without technical arguments), people just stop looking at the replies and do not follow the author's arguments, simply because they frequently do not know the area in question well enough to make a decision and thus rely on others.
This only works when the 'others', i.e. core kernel maintainers, are good, do not base their decisions on personal feelings and take only the technical side into account. Unfortunately this is not always the case, and political methods are used. Sometimes even only political methods are used...

/devel/other :: Link / Comments (0)


As promised, let's see shadowed miserable POHMELFS results.

Usually you will not see bad benchmark results for a technology under development, but any such result is actually a _very_ good result for a work-in-progress, not yet completed system. It allows one to see how new proof-of-concept code compares with an already completed, tuned and optimized system.
Conclusions drawn from such test results lead to really superior decisions.

Let's compare iozone read/reread, write/rewrite and random read and write for POHMELFS and NFS with 8Gb test files and different record sizes (from 8Kb to 1Mb) on XFS over a GigE link.
I described the hardware and local iozone benchmark results in detail previously.

Now it's time for network tests.
Async NFS in-kernel server results.

                                                    random  random
      KB  reclen   write rewrite    read    reread    read   write
 8388608       8   60969   57743    39705    97031  464898    5160
 8388608      16   59925   57402    39045    98269  641388    8827
 8388608      32   58094   55263    39075    94654  775064   14389
 8388608      64   58168   57156    40306    98639  868796   22360
 8388608     128   58908   56573    40392   100018  941509   33211
 8388608     256   59444   56446    40842   102503 1030451   41576
 8388608     512   60280   57686    39835    97879 1042570   49858
 8388608    1024   60817   57886    40886    96646  851175   47993
And now POHMELFS results.
                                                    random  random
      KB  reclen   write rewrite    read    reread    read   write
 8388608       8   70073   64232    12518    14817   40334    5079
 8388608      16   63984   67948    31976    19106   41462    8702
 8388608      32   67250   63440    47506    38657   75908   14357
 8388608      64   69970   66198    41899    29566  136294   21385
 8388608     128   69838   68523    76232    33971  222909   30946
 8388608     256   70012   66439    69125    58223  330886   40685
 8388608     512   70946   68291    76460    58738  428881   51001
 8388608    1024   70985   64958    76317    59561  421973   48531
Sequential writing is 10-15% faster for POHMELFS (and limited by the underlying fs speed), while random writing is essentially the same and is limited by disk speed. But sequential reading is _much_ worse for small requests. The reason is simple: POHMELFS does not support readahead, since it does not have a ->readpages() callback, so any sequential access ends up as a series of ->readpage() calls, each of which waits for its completion, which is slow; so currently readahead is not invoked from the reading path.
I could not resist highlighting that big requests are 1.5-2 times faster for POHMELFS than for NFS :) and are also limited by the underlying filesystem.

One can note that NFS random reading results are actually better than those of the local filesystem, and very noticeably so. Why does the local filesystem behave worse than the same filesystem mounted via NFS in random reading?
I believe that's because in the network case we actually have double buffering: on the client, where the most active pages are in RAM, and on the server, where readahead has populated pages which are not active. Active pages are being read from the client's cache, so the client will not try to read them from the server and they will be evicted from the server's page cache. But those server pages which are not active now will be accessed soon by the client, when it reads the next portion of random data, and that will be a very fast access to RAM.
So we have a really good caching scheme, where the most actively used pages are in the client's RAM and are flushed to disk on the server, and instead the server has populated other, less active pages via readahead.

This reading behaviour is just a result of the not yet completed VFS callback implementation in POHMELFS. With ->readpages() in place it will be faster than NFS even in this bench. Also, POHMELFS has multiple-server parallel read balancing and simultaneous writing to them, but there are no results yet.
I have already created a mental model of the optimized read and write transactions (based on memory pools for maximum OOM-robustness and small memory overhead), so in a day or two it will be implemented in code.

Stay tuned, now it's time for excellent POHMELFS results!

/devel/fs :: Link / Comments (0)


Sun, 01 Jun 2008

We won! DPQE:DGAP 81:67

Screaming, drinking, cheering...
Although I was not there today, since some friends became ill and others went off to their own tasks, it still was really cool (yesterday).

My congratulations to the team and department itself :)

/life :: Link / Comments (0)


Sat, 31 May 2008

Ole-ole-ole-ole, kvanti chempion!

Match of the century - 24 hours of football at my Alma Mater.

Today the Department of Physical and Quantum Electronics (which I graduated from I do not even remember when, but I started studying at MIPT 10 years ago) plays against the axes, or by their other name, the Department of General and Applied Physics.
After about half of the match we were winning by 18 goals (31:13).

This happens once per year, and usually I try to get to MIPT and watch part of the game, like this year. Tomorrow I will go there too of course, to meet old friends and celebrate the win!

/life :: Link / Comments (0)


Fri, 30 May 2008

Local filesystem random read/write performance. POHMELFS parallel testing.

I promised to publish POHMELFS parallel processing results yesterday, even if they are miserable. Unfortunately there are no interesting results at all. The released version of POHMELFS is 32-bit only: it does not have a special ->open() callback which forces files to be opened with the O_LARGEFILE flag to support sizes of more than 4Gb (actually only 2Gb, since the kernel uses a signed size there, which leaves only 31 bits), and the superblock maximum size is set to 32 bits. So all the 32-bit results are not very interesting: reporting a 2Gb/s random read speed is really pointless, since all the reading happened from the cache.
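
The 2Gb figure is just the largest value a signed 32-bit integer can hold. A tiny sketch of the arithmetic; the constant and function names are chosen for the illustration and are not taken from the POHMELFS sources:

```python
SIGNED_32_MAX = 2**31 - 1        # largest value of a signed 32-bit integer


def needs_largefile(file_size):
    """True if a file of this size (in bytes) does not fit into 31 bits."""
    return file_size > SIGNED_32_MAX


print(SIGNED_32_MAX)                      # 2147483647
print(needs_largefile(1 * 2**30))         # 1Gb file: False
print(needs_largefile(8 * 2**30))         # 8Gb iozone test file: True
```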

While results with more than 2Gb are... Let me first show you how XFS and Ext3 behave in case of random writes.

A short preface.
Hardware used in testing: 4-way Intel E7520 system (two logical and two physical CPUs) with 3GHz 32-bit Xeons and 8Gb of RAM, an Adaptec AIC7902 Ultra320 SCSI adapter and a SEAGATE ST3300007LC 10k rpm 300 Gb testing disk.
Its linear reading speed is about 90 MB/s. Dmesg:

scsi0 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 3.0
        <Adaptec AIC7902 Ultra320 SCSI adapter>
        aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 67-100Mhz, 512 SCBs
scsi 0:0:2:0: Direct-Access     SEAGATE  ST3300007LC      0003 PQ: 0 ANSI: 3
 target0:0:2: asynchronous
scsi0:A:2:0: Tagged Queuing enabled.  Depth 32
 target0:0:2: Beginning Domain Validation
 target0:0:2: wide asynchronous
 target0:0:2: FAST-160 WIDE SCSI 320.0 MB/s DT IU QAS RDSTRM RTI WRFLOW PCOMP (6.25 ns, offset 63)
 target0:0:2: Ending Domain Validation
Kernel version is 2.6.25 (and 2.6.24 for the first ext3 test).

I used two such machines as servers for iozone read/reread, write/rewrite and random read/write testing. File size is limited to 8Gb only, since it is the only interesting fair case, record size varies from 8Kb to 1Mb.

Before I started 8Gb POHMELFS testing, I decided to check how local filesystems behave in such a scenario.
XFS was tuned this way: (mkfs.xfs -d agcount=75 -l size=64m /dev/sdc1; mount -o logbufs=8,nobarrier,noatime,nodiratime,osyncisdsync /dev/sdc1 /mnt/)
Ext3 was created and mounted with default options, on a machine with only 4Gb of RAM though.

So, testing.
Here is the results table from iozone (before I interrupted it) with read/reread, write/rewrite and random read/write tests for XFS (either default or tuned as shown above).
                                                   random  random
     KB  reclen   write rewrite    read    reread    read   write
8388608       8   73671   64052    77565    80107   35281    5085
8388608      16   74437   66095    77611    80065   66854    8808
8388608      32   74683   66780    77564    80202  121442   14576
8388608      64   74936   66908    77537    80372  215377   22583
8388608     128   74928   68598    77542    80247  339304   32280
8388608     256   73609   69615    77534    80143  365081   40571
8388608     512   73763   69830    77547    80317  420704   48501
8388608    1024   73940   69474    77602    80065  406266   47295
I.e. 5 MB/s random write speed for 8Kb records!

Do you really want to know the ext3 speed? Pregnant kids and women should skip the next paragraph.
I interrupted the test after almost 2 (!) hours of random writing of an 8Gb file with 8Kb records on default ext3. The test was not completed and I do not really know its performance (note that this machine has only 4Gb of RAM, other hardware details were described above), but it will be less than 1 MB/s.
Ext4 behaves much better in this respect (mount options: rw,noatime,data=writeback,extents):
                                                   random  random
     KB  reclen   write rewrite    read    reread    read   write
8388608       8   69593   74200    77324    81340   35538    5088
8388608      16   66745   70038    73676    77271   65715    8704
8388608      32   68253   70320    73652    77258  121690   14469
8388608      64   68421   71291    73653    77042  209629   22005
8388608     128   68438   71340    73658    76988  332021   30381
8388608     256   68921   71254    73651    76912  435586   40683
8388608     512   69079   71728    73551    76815  549136   49298
8388608    1024   66611   71217    73683    76581  552459   49220
POHMELFS results are coming...

/devel/fs :: Link / Comments (0)


Wed, 28 May 2008

POHMELFS got read balancing between multiple server and simultaneous write to them.

I hate laziness, but sometimes drop into that hole... So the last couple of days I just stupidly wasted my time (well, I read about Lisp and failed to find a GTK binding for CLISP, wrote some code and a kernel bug fix, but that does not count). Today laziness started to get really boring, so I made some small progress in POHMELFS parallel processing.

It got the ability to send transactions to multiple servers by default and to balance reading between them (so far it always reads from the first server and switches to the second in case of error, but that is trivial to change). This was implemented via special routes for each transaction, which are stored per network state, so if one of the servers does not answer, we do not resend data to the others. It also makes the trees smaller, which should allow faster reading when there are lots of pending writing transactions.
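
The current read behaviour (always the first server, switching to the next one on error) can be sketched like this. It is a toy model, not the POHMELFS code: the function names and the use of ConnectionError to signal a dead server are assumptions made for the illustration.

```python
def read_with_failover(servers, request):
    """Try servers in order; return (server index, data) from the first success."""
    last_error = None
    for index, send in enumerate(servers):
        try:
            return index, send(request)
        except ConnectionError as err:
            last_error = err          # this server failed, try the next one
    raise ConnectionError("no server answered") from last_error


def dead_server(request):
    raise ConnectionError("timeout")


def live_server(request):
    return f"data for {request}"


print(read_with_failover([live_server, dead_server], "page 0"))
print(read_with_failover([dead_server, live_server], "page 0"))
```

Real balancing would replace the fixed order with some policy for picking the first server to try.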
Code is in testing stage currently, I will complete read balancing tomorrow and test it against multiple servers on different machines, when data is placed on disk, so that random access would be slow. Having two servers I exect to get linear speed increase. If test will be disk IO bound, it is possible to add multiple servers on the same machine, so that each server would run on its own disk (I have two resonable fast SCSI disks on each testing machine).
Results will be published here of course (well, even if they are miserable :).
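
The current read policy (always read from the first server, switch to the next one on error) can be modelled in a few lines. This is only an illustrative Python sketch, not POHMELFS code: `read_with_failover` and the two fake servers are hypothetical names, and a server is represented simply as a callable that either returns data or raises an error.

```python
from typing import Callable, Optional, Sequence


def read_with_failover(servers: Sequence[Callable[[str], bytes]], path: str) -> bytes:
    """Try servers in order and return the first successful reply."""
    last_error: Optional[OSError] = None
    for server in servers:
        try:
            return server(path)
        except OSError as err:  # ConnectionError is a subclass of OSError
            last_error = err    # remember the failure and try the next server
    raise OSError(f"no server could serve {path!r}") from last_error


# Hypothetical servers used only for demonstration.
def dead_server(path: str) -> bytes:
    raise ConnectionError("server is down")


def live_server(path: str) -> bytes:
    return b"data for " + path.encode()
```

Real balancing would pick the starting server per request (round-robin, least loaded and so on) instead of always starting from the first one; only the loop order would change.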

/devel/fs :: Link / Comments (0)


Sun, 25 May 2008

Every lisper did that.

#!/usr/bin/clisp
(defun f (m)
  (do ((k 0 (1+ k))
       (c 0 n)
       (n 1 (+ c n)))
    ((eql k m)
     (format t "~r" c))))
(f 317)

Guess the result: seven hundred and ninety-three vigintillion, five hundred and ninety-one novemdecillion, four hundred and seven octodecillion, eight hundred and four septendecillion, one hundred and fifty-one sexdecillion, nine hundred and twenty-six quindecillion, five hundred and ninety-three quattuordecillion, seven hundred and ninety-three tredecillion, forty-two duodecillion, one hundred and twenty-six undecillion, eight hundred and ninety-one decillion, one hundred and twenty-eight nonillion, eight hundred and nineteen octillion, six hundred and ten septillion, seven hundred and ten sextillion, one hundred and forty quintillion, one hundred and forty-five quadrillion, thirty-seven trillion, nine hundred and fifty-eight billion, two hundred and seventy-three million, seven hundred and seventy-seven thousand, three hundred and ninety-seven

/devel/other :: Link / Comments (4)


New POHMELFS release. Full transaction support. Data and metadata cache coherency.

Irish Tullamore Dew helped this POHMELFS release to see the light.

Short changelog:

  • Full transaction support for all operations (object creation/removal, data reading and writing). Data reading transactions are not optimal yet (although fast) and will be improved in the next release.
  • Data and metadata cache coherency support. More details on how this is implemented can be found in the appropriate section.
  • Transaction timeout based resending. If given transaction did not receive reply after specified timeout, transaction will be resent (possibly to different server).
  • Switched writepage path to ->sendpage() which improved performance and robustness of the writing.
  • Preliminary support for parallel data processing. Code to write data to multiple servers in parallel and balance reading between them was imported, but is not used right now.
  • Fair number of bugfixes.
Next release is scheduled for the beginning of the next month, and will likely include following features:
  • Improved reading transactions.
  • Server redundancy extensions (ability to store data in multiple locations according to regexp rules, like '*.txt' in /root1 and '*.jpg' in /root1 and /root2).
  • Client parallel extensions: ability to write to multiple servers and balance reading between them. Code was imported to the current version, but not enabled yet.
  • Client dynamic server reconfiguration: ability to add/remove servers from working set by server command and from userspace.
  • Start generic server distribution development.
As usual one can grab the latest source from archive or GIT tree.

/devel/fs :: Link / Comments (0)


Sat, 24 May 2008

This was supposed to be POHMELFS release day.

But no, it is scheduled for tomorrow because of the very interesting way I decided to implement reading transactions. The way it works right now is quite miserable, so I want to clean things up and make a really good patch.

Page reading code will create a single transaction for a bunch of pages and, if the pages are not yet received, will schedule the next one instead of waiting for the transaction to be completed, and only wait at the very end (if needed). With the addition of async copy from the receiving kernel thread into reading userspace via copy_to_user() (in todo), this will become the fastest possible way of reading over the net, I think.
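
The idea of scheduling all read transactions up front and waiting only at the very end can be shown with a small userspace model. This is a Python sketch under loud assumptions: `fetch_pages` stands in for one read transaction, the batch size of 4 pages is arbitrary, and a thread pool replaces the kernel's async receiving thread.

```python
from concurrent.futures import ThreadPoolExecutor

PAGES_PER_TRANSACTION = 4  # hypothetical batch size, chosen only for the demo


def fetch_pages(first: int, count: int) -> list:
    """Stand-in for one read transaction covering `count` pages."""
    return [f"page-{i}" for i in range(first, first + count)]


def read_pipelined(total_pages: int, batch: int = PAGES_PER_TRANSACTION) -> list:
    """Submit every transaction without waiting, then collect replies in order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [
            pool.submit(fetch_pages, start, min(batch, total_pages - start))
            for start in range(0, total_pages, batch)
        ]
        pages = []
        for future in futures:  # the only place where we block
            pages.extend(future.result())
    return pages
```

Calling `read_pipelined(10)` issues three transactions (4, 4 and 2 pages) before blocking on the first reply.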

So far changelog contains following items:

  • Full transaction support for all operations (object creation/removal, data reading and writing). Data reading transactions are not optimal yet and will be improved in the next release.
  • Data and metadata cache coherency support. More details on how this is implemented can be found in the devel section.
  • Transaction timeout based resending. If given transaction did not receive reply after specified timeout, transaction will be resent (possibly to different server).
  • Switched writepage path to ->sendpage() which improved performance and robustness of the writing.
  • Fair number of bugfixes.

/devel/fs :: Link / Comments (0)


Wed, 21 May 2008

iput() locking in POHMELFS.

iput() is a very tricky call in Linux VFS: besides the fact that it drops the inode when its reference counter reaches zero, it also waits until all associated pages are flushed to storage.
POHMELFS uses a single thread per network state (network connection structure), which only reads async replies from the server, so it is possible that a reply which requires iput() (for example a create command reply) will happen in parallel with object removal, so the inode will be deleted, but not yet freed. When the reply is received and iput() is called, it will try to free the inode and wait until all pages associated with its mapping are synced. But page sync happens on a reply to another command (consider for example several writeback transactions), which can not be processed, since the thread is waiting for them to be completed. This problem can not be fixed by introducing multiple threads, since each one can be in exactly the same situation simultaneously.

Therefore we must not grab an inode and free it in the receiving path. This is fine for writeback transactions, since the inode can not be freed until its pages are synced, so just by holding the pages we avoid the lock. But object creation for empty files or directories has no pages attached, so they have to be synced with a special transaction. There can still be a problem with an empty file: some pages can be attached and it can be removed while the system waits for the creation transaction to complete. Actually we do not need to know about that: we should not grab the inode at all, since the transaction already contains all the needed info, namely the inode number, so we can look up the inode (if it still exists) and mark it as created without the need for a lock-prone grab/put.

This bit took me the last three days, during which POHMELFS moved to non-blocking receiving and timeout-based sending (and back again). It got a scanning 'watchdog' which resends transactions if they were not acked after some time and eventually drops them if they still do not get a reply. POHMELFS also got a couple of new operations supported and likely something else on top of the set of features implemented to date (full transaction support for all operations and the data and metadata coherency protocol were added for the next release).
New release is scheduled for the end of the week, and there is no readpage transaction support yet...
So, stay tuned!

/devel/fs :: Link / Comments (3)


Things getting worse...

$ clisp 
  i i i i i i i       ooooo    o        ooooooo   ooooo   ooooo
  I I I I I I I      8     8   8           8     8     o  8    8
  I  \ `+' /  I      8         8           8     8        8    8
   \  `-+-'  /       8         8           8      ooooo   8oooo
    `-__|__-'        8         8           8           8  8
        |            8     o   8           8     o     8  8
  ------+------       ooooo    8oooooo  ooo8ooo   ooooo   8

Welcome to GNU CLISP 2.42 (2007-10-16) 

Copyright (c) Bruno Haible, Michael Stoll 1992, 1993
Copyright (c) Bruno Haible, Marcus Daniels 1994-1997
Copyright (c) Bruno Haible, Pierpaolo Bernardi, Sam Steingold 1998
Copyright (c) Bruno Haible, Sam Steingold 1999-2000
Copyright (c) Sam Steingold, Bruno Haible 2001-2007

Type :h and hit Enter for context help.

[1]> (defun test-func () (format t "It's a test func"))
TEST-FUNC
[2]> (test-func) 
It's a test func
NIL
[3]> (exit)
Bye.
This one has, imho, the least ugly command line... And I'm against SLIME and Emacs. Also tried SBCL, GNU CL and something else, but likely CLISP will stay.

Instead of sleeping (it will soon be time to wake up in the Moscow slums) or at least catching POHMELFS bugs (the last several days were solely devoted to this task, and a fair number of them were fixed, as well as some interesting features (and probably new bugs) introduced, so a new release will likely see the light later this week), I'm drinking some beer and making my first steps into this. So far it looks quite new and probably interesting, but every introductory article I read about it said that if you are older than 25, it is likely impossible to change something in your perception. I'm older, but I think that it will be fun and will probably become a really good tool for me.

The more I think about it, the more interesting tasks I find (in addition to those I'm already thinking about, like CAPTCHA)...

/devel/other :: Link / Comments (5)


Mon, 19 May 2008

Russia - Canada 5:4

Yesterday Russia became hockey world champion, for the first time in 15 years!

Final goal!

/other :: Link / Comments (2)


Sat, 17 May 2008

POHMELFS got full data and metadata cache coherency support. Transaction support for majority of the commands.

linux-2.6.pohmelfs$ git-diff-tree -r --stat 21549d0a101 master
 fs/pohmelfs/dir.c   |  108 ++++++--------------
 fs/pohmelfs/inode.c |  279 ++++++++++++++++++++++++++++++++++++--------------
 fs/pohmelfs/net.c   |  216 ++++++++++++++++++++++++++++++---------
 fs/pohmelfs/netfs.h |   43 +++++++-
 fs/pohmelfs/trans.c |   55 +++++++++-
 5 files changed, 484 insertions(+), 217 deletions(-)
It was a rather simple task thanks to async event processing support.
Each time a client creates, reads or writes an object on the server, information about its interest is stored on the server. When any other client updates the same object (like changing attributes or writing data), all interested clients get notifications with the new data (new attributes, or in case of writing possibly the new size and a flag telling which page has to be fetched from the server, since it is not valid anymore). Writing happens during writeback as before, so a command like "echo Some_message > /mnt/file" immediately syncs the size of the file to zero and writes the actual data there some time later, when the system decides to start writeback.
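
The interest tracking described above boils down to a map from object to the set of clients that touched it. Below is a minimal Python sketch of that bookkeeping; `CoherencyServer`, `touch`, `update` and the `inbox` queues are hypothetical names for illustration and do not mirror the real fserver structures or wire messages.

```python
from collections import defaultdict


class CoherencyServer:
    """Toy model: remember who touched an object, notify them on updates."""

    def __init__(self):
        self.interested = defaultdict(set)  # path -> ids of interested clients
        self.inbox = defaultdict(list)      # client id -> pending notifications

    def touch(self, client: str, path: str) -> None:
        """A client created, read or wrote `path`: record its interest."""
        self.interested[path].add(client)

    def update(self, client: str, path: str, **attrs) -> None:
        """A client changed `path`: notify every other interested client."""
        self.touch(client, path)
        for other in sorted(self.interested[path] - {client}):
            self.inbox[other].append((path, attrs))
```

The updating client is skipped on purpose, since it already holds the new data.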

Also ported all but one command to the transaction mechanism, which means they will all be resent if the currently active network connection goes down. Although most of the commands are not synchronous, and thus will not be resent after timeout, this can be trivially changed if there is major demand for that.

Only reading has not yet been ported to the transaction model, which is the next task to complete. These transactions have to be synchronous, since we do want to read the data, while we do not actually care about the full directory content.

These changes have to be seriously tested and all problematic places resolved. For example, they slow metadata operations noticeably, since the system now sends a message each time a new object is created: kernel archive untarring now takes about 5 seconds against the previous 2-3 (including sync) on a 4-way machine with 8 GB of RAM. Although that is still not comparable to 30+ seconds for async NFS, it has to be investigated further.

After full move to transaction model and cache coherency testing (that model may be not complete for some usage, since locks are not yet supported), POHMELFS will make its first steps into distributed area...

Stay tuned!

/devel/fs :: Link / Comments (0)


Fri, 16 May 2008

Metadata cache coherency support in POHMELFS.

Client:

$ ls -lai /mnt/test
3 -rw-r--r--  1 root root 94208 2008-05-16 22:27 test
$ sudo chown zbr.zbr /mnt/test 
$ ls -lain /mnt/test
3 -rw-r--r-- 1 2319 1002 94208 2008-05-16 22:27 /mnt/test
Server:
fserver_get_client_data: thread: 3085847440, cmd: 8, id: 0, start: 2, size: 94, ext: 0.
fserver_transaction: thread: 3085847440, trans: 0, size: 94, sub: cmd: 10, id: 3, start: 0, size: 70, ext: 6.
fserver_inode_info: path: '/test', size: 94208, mode: 100644, uid: 2319, gid: 1002.
So, the server now contains all metadata information about the object updated on the client. pohmelfs_setattr() is synchronous for remotely read inodes and for already synced inodes that were originally created locally. It does nothing if the object is not yet synced to the server, since syncing will provide that info itself.

The only missing thing is to asynchronously broadcast that data to other clients, which requires creating a cache of the objects a given client is interested in. Each client will be automatically added to the group of interests when it looks up an object, so when an attribute of a given object is set, the update will be sent to the interested parties. A client will be dropped from the group of interests when it drops the appropriate inode locally (which will force sending a special message).

/devel/fs :: Link / Comments (0)


Thu, 15 May 2008

Meanwhile at apartment development side.

I installed the water system for the shower and thought of installing the whole cabin, but found (as usual) that I do not have drills for ceramic tiles. So that will be postponed for a while.
Also I expect glue for the ceramic tiles to be delivered today (as well as the brick tiles), so that I can start covering the hall with granite. Although I'm a bit tired after the water system installation, which took the major part of the day.
It is actually a simple task, but only when you have simple access to all parts. Now imagine a 10 cm thick wall, where you managed to drill two holes, each about 2 cm in diameter (less than two fingers thick). A meter below and to the left there is a bigger hole for the sanitary system (about 15x15 cm). The water system hatch is located 2.5 meters to the right of it.
The task is to run thin water tubes from the water hatch to the two small holes, with the splitter installed near the bigger sanitary hole. Without direct access to any tube (you can only feel it, not see it) you have to connect them via different connectors using spanners (I also need to mention that it is quite hard to put both hands into the bigger hole for the sanitary system).

I've completed the task, although I'm not sure if it is really safe. That was challenging and power sucking, so probably I will just slack this evening and hack some bits of captcha. I will also cover my table with the last colour layer (yes, yes, it is still not done) and/or apply a second varnish layer to the x-shelves (they look really cool after mordant and varnish)...

/devel/flat :: Link / Comments (0)


POHMELFS distributed plans.

After the healthy discussion that followed my announcement of the second POHMELFS release, it's time to highlight the main ideas settled in the thread.

First, POHMELFS will move towards parallel distributed filesystems, while still being very good as a network filesystem. In particular, that will include the ability to read data from any of the connected servers (not necessarily from the currently active one, as it is done right now), while writing will happen to all connected servers simultaneously (and a transaction will be committed after all servers have returned a completion acknowledge).
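
The commit rule for parallel writes (a transaction is committed only after every connected server has acknowledged it) can be modelled as a set of servers that still owe a reply. This is a hedged Python sketch: `WriteTransaction`, `ack` and `committed` are hypothetical names, not the planned POHMELFS interface.

```python
class WriteTransaction:
    """Toy model of a write sent to several servers at once."""

    def __init__(self, trans_id: int, servers):
        self.trans_id = trans_id
        self.pending = set(servers)  # servers that have not acknowledged yet

    @property
    def committed(self) -> bool:
        """True once every server has acknowledged the transaction."""
        return not self.pending

    def ack(self, server: str) -> bool:
        """Record one server's acknowledge and report the commit state."""
        self.pending.discard(server)
        return self.committed
```

A duplicate acknowledge from the same server is harmless here, because `discard` ignores servers that are no longer pending.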

The protocol will be extended to support dynamic addition and removal of servers to/from the currently connected group. Probably there will be some kind of status messages for servers (i.e. going offline, do not send me data, or I'm becoming slow, do not read from me and so on). This will be done in addition to cache coherency messages (which I'm yet to implement; because of other tasks this was a bit postponed, probably to the weekend), which will include two types of requests: page invalidation and inode update (that will also mean that POHMELFS will start supporting attributes (maybe even extended ones), right now it doesn't :). Such a cache coherency protocol should scale better than classical MOSI (and its derivatives) and particularly better than what the pNFS spec provides (leases to operations for some servers), since it is still possible to work in parallel with the same file, without any overhead when data processing does not cross different client boundaries, but it has to be tested in practice.

The POHMELFS server will be extended to support distributed facilities. Very likely it will be some kind of PAXOS algorithm, although probably in a very limited mode for the beginning. So far it will be really simple, so that I can touch all its corner cases and find the optimal development strategy.

All client extensions are not that complex, although not always trivial, so they should not take too much time, and you will probably get something interesting soon.
Server extensions will be a bit slower, since I will start essentially from the distributed system ground floor and gradually move upstairs.

/devel/fs :: Link / Comments (0)


Tue, 13 May 2008

New POHMELFS release. Transactions, performance, failover.

Irish John Jameson (6 years of experience, really good stuff) brings us this new POHMELFS release.

Main features include:

  • Fast transactions. System will wrap all writings into transactions, which will be resent to different (or the same) server in case of failure.
  • Failover. It is now possible to provide a number of servers to be used in round-robin fashion when one of them dies. System will automatically reconnect to others and send transactions to them.
  • Performance. Super fast (close to wire limit) metadata operations over the network. By courtesy of writeback cache and transactions the whole kernel archive can be untarred in 2-3 seconds (including sync) over GigE link (wire limit! Not comparable to NFS).
The nearest roadmap includes:
  • Full transaction support for all operations (only writeback is guarded by transactions currently, default network state just reconnects to the same server).
  • Data and metadata coherency extensions (in addition to existing commented object creation/removal messages).
  • Server redundancy.
One can check out POHMELFS homepage for more details. You can download latest release (against 2.6.25 kernel tree) from archive or GIT tree.

/devel/fs :: Link / Comments (0)


Mon, 12 May 2008

Meanwhile at apartment development side.

I went to the development shop and got zillions of things there, including various colours for the kitchen ceiling and the room's ceiling plinth, ordered brick-like tiles for the kitchen (about one third of the walls there will be covered with bricks), got some instruments (like a rubber hammer for the tiles), ordered glue for the ceramic granite for the hall, and also got a shower (yummy, my shower cabin was delivered today too) and related stuff for the water system installation.

By the original plan I wanted to install the shower cabin today, but taking the current time into account, it is too late for loud work, so I will proceed with my table instead. It will be completed today, or call me a ... whatever you like (out of curiosity, is there an English dictionary of indecent words? I know a Russian one exists).
If things move fast, I will also cover my X-shelves with varnish, and probably take some photos...

/devel/flat :: Link / Comments (2)


Fast POHMELFS transactions.

With new transactions and the new waiting mechanism (see below) the system now untars the whole kernel tree in less than 3 seconds over the GigE link (including the subsequent sync, which always takes less than a second), while async NFS (the remote side is tmpfs in both cases) does that in a bit more than 30 seconds. In addition, POHMELFS write speed is 125 MB/s (wire limit) vs. less than 90 MB/s for NFS (dd from /dev/zero with 1 MB block size and 1000 blocks).
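
For reference, the 125 MB/s "wire limit" is simply the raw 1 Gbit/s line rate divided by 8 bits per byte. The small Python calculation below spells that out and estimates how long the dd run takes at each speed. It assumes decimal megabytes and ignores Ethernet/IP/TCP framing overhead, so the numbers are rough; `transfer_seconds` is a helper name made up for this example.

```python
GIGE_BITS_PER_SECOND = 1_000_000_000  # nominal Gigabit Ethernet line rate

# Raw line rate in decimal megabytes per second: 10^9 / 8 / 10^6
wire_limit_mb_s = GIGE_BITS_PER_SECOND / 8 / 1_000_000


def transfer_seconds(megabytes: float, mb_per_second: float) -> float:
    """Time needed to move `megabytes` at a steady `mb_per_second`."""
    return megabytes / mb_per_second


# 1000 blocks of 1 MB each, treated as 1000 decimal MB for simplicity.
pohmelfs_time = transfer_seconds(1000, wire_limit_mb_s)
nfs_time = transfer_seconds(1000, 90)  # NFS figure is "less than 90 MB/s"
```

Under these assumptions the same dd run takes 8 seconds at wire speed and more than 11 seconds at 90 MB/s.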

That's what I call a good result.

The transaction mechanism invoked in the writeback path is now completely async too, i.e. it does not wait until the remote side confirms that the transaction was received and processed. But writeback does not drop a transaction after the sending function has returned; instead it stores it in the in-flight storage and proceeds with the next one. A transaction can accumulate up to 90 pages in a single frame.
When a reply is received, the async thread searches for the given transaction and completes it (unlocks the page, although that can be done in writeback, since the page is being copied, cleans up the writeback bits, drops it from the appropriate radix tree and drops the reference counter). If a transaction was not sent due to some error, it will be sent to a different server; if an error was returned from the server, it will be resent to a different one as well. Since the original writeback path does not know about in-flight transactions anymore, any timeout has to be checked by a dedicated thread (or workqueue), which will detect too old transactions (by simply checking them from the beginning, since each new transaction has an increased id) and resend them to remote servers.
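
The timeout scan works because transaction ids only grow, so the in-flight storage is naturally ordered from oldest to newest and the scan can stop at the first transaction that is still fresh. Here is a minimal Python model of that idea; `InFlight` and its methods are hypothetical names, an ordered dictionary replaces the kernel radix tree, and time is passed in explicitly to keep the example deterministic.

```python
import itertools
from collections import OrderedDict


class InFlight:
    """Toy in-flight storage keyed by monotonically increasing ids."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self._ids = itertools.count(1)
        self.trans = OrderedDict()  # id -> send time, oldest entry first

    def send(self, now: float) -> int:
        """Register a newly sent transaction and return its id."""
        trans_id = next(self._ids)
        self.trans[trans_id] = now
        return trans_id

    def complete(self, trans_id: int) -> None:
        """A reply arrived: drop the transaction from the storage."""
        self.trans.pop(trans_id, None)

    def expired(self, now: float) -> list:
        """Scan from the oldest transaction; stop at the first fresh one."""
        stale = []
        for trans_id, sent_at in self.trans.items():
            if now - sent_at < self.timeout:
                break
            stale.append(trans_id)
        return stale
```

The early `break` is valid only while send times grow together with ids; a real implementation that updates the timestamp on resend would need to re-queue the transaction at the tail.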

There is a small problem though: if the object size is more than a single transaction can accumulate (90 pages), it will be split into several transactions, where the first one will contain the object creation command and some data to be written, while the others will contain only data. If the server runs multiple threads per client (the default is one though), it is possible that a transaction other than the first will be processed first, so the server will write some data into a non-existent file and the transaction will fail. There are two ways to fix this issue: either wait in writeback on the client until the creation transaction is completed and then send all the others as described above, or add the creation command to every subsequent transaction until the object is created on the server (a special bit is set on the local inode in that case). Likely the latter is the better option.
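
The second option (repeat the creation command in every transaction until the server has confirmed the object) can be sketched like this. It is an illustrative Python model only: the dictionary layout and the `created_on_server` flag (standing in for the special bit on the local inode) are assumptions, while the 90-page limit comes from the text above.

```python
MAX_PAGES_PER_TRANSACTION = 90  # limit quoted in the post


def build_transactions(total_pages: int, created_on_server: bool = False) -> list:
    """Split an object into transactions of at most 90 pages each.

    Until the object is known to exist on the server, every transaction
    carries the creation command, so processing order no longer matters.
    """
    transactions = []
    for first in range(0, total_pages, MAX_PAGES_PER_TRANSACTION):
        transactions.append({
            "create": not created_on_server,
            "first_page": first,
            "pages": min(MAX_PAGES_PER_TRANSACTION, total_pages - first),
        })
    return transactions
```

For a 200-page object this yields three transactions of 90, 90 and 20 pages, each carrying the creation command, so the server has to treat a repeated create for an existing object as a no-op.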

/devel/fs :: Link / Comments (0)


Wed, 07 May 2008

Fast transactions in POHMELFS.

POHMELFS just switched to faster transactions allocated one-by-one with even smaller overhead (although it does not use kernel_sendpage() for page sending yet, it copies data).
The system does not serialize after all transactions are completed (it waits after each one), but with the new transaction allocation it is 1.5 times faster: 98MB/s vs. 64MB/s. Note that without waiting for transaction completion it gets the full wire speed of 125MB/s with 1500 byte MTU. And this is with highmem pages and thus a slow kmap() of each one, and unmap after completion. I do not use ->sendpage() since it would force splitting a proper set of iovecs into mixed calls of kernel_sendmsg() and kernel_sendpage(), which I want to avoid so far. Now it is (again) faster than NFS, but I want to move further.
So, the solution is rather trivial: wait until several transactions are completed. The whole infrastructure is already there: in-flight transaction storage, per-transaction completion and destruction callbacks, proper reference counting and async completion.
Still only writing transactions are used (i.e. reading/lookup and others will not be redirected to different servers).
There are some bugs of course, but that's the first development version after all.

/devel/fs :: Link / Comments (0)


Tue, 06 May 2008

New captcha solving problem.

Just in case you notice some delay in filesystem or network development, the reason is simple. I decided to devote some time to a new captcha cracking problem, namely these ones:

Captcha problem

The reason is simple: I want to test my captcha breaking ideas on something real. And also I was frustrated by their abuse team, which was not able to fix the spam filter based on the messages I sent them (bounce and original, just as requested).
It is pretty unlikely though that something will appear anytime soon, but I do want to test some ideas...

/devel/captcha :: Link / Comments (0)


Mon, 05 May 2008

POHMELFS transaction support. Failover (re)connection to different servers.

POHMELFS just got full transaction support. So far it is only used in the ->writepages() callback, which is invoked by the writeback mechanism. POHMELFS uses lazy transaction support, namely it waits after each transaction, which includes a header and data to be written for at most 14 pages. 14 is a magic number of pages, which corresponds to struct