- Shit! There are no more M8 screw-nuts.
- What? Use M12, the boson should pass through.
- We all will be fucked this Monday!
Good night. Actually, as a former physicist I can say that at least two
out of the four killing theories are really stupid, but nevertheless
it's interesting!
I implemented a pool of crypto processing threads (their number
is a mount option), each of which has a pool of pages to
encrypt data into. A crypto thread is not released until the server
acknowledges that the data was successfully written, so one
should tune the number of threads and the page pool (the number of pages
in each thread is the maximum number of pages per transaction,
and this limit has its own mount option too) according to the desired behaviour.
Testing shows that write performance was noticeably reduced with this approach:
with 4 encryption threads and 4 receiving threads in the server,
performance dropped by around 30%, from 65+ MB/s down to 46+ MB/s,
but I think it can be improved with a larger number of encryption threads.
During the iozone write/rewrite test each of the 4 crypto threads ate about 20-30%
of CPU, while the server ate about 130% (4 threads in total). In all previous iozone tests,
the larger the number of userspace threads used, the worse the results were
(this is somewhat expected, since iozone is a single-threaded benchmark,
so a larger number of threads leads only to performance degradation),
so I will test different setups (namely a larger number of crypto threads
and a smaller number of server threads).
But this behaviour is not a problem, and I expect it to be tuned; the real
problem is read performance. Right now there is only a single thread,
which reads from one socket. This was done intentionally, since reading
data from a socket takes longer than searching for a page in the radix tree
or any other operation performed by that thread, so there was no way
to saturate its capabilities. That is, until we start encryption, which is slow:
subsequent data cannot be read from the socket in parallel
with crypto processing, and overall read performance drops to the ground.
This problem has to be fixed, so I plan to use the same crypto
processing threads to decrypt and/or perform the hash check for received data
and push it up to the VFS stack.
First, POHMELFS does need encryption, because I plan to use a
distributed hash table approach in the server (well, consider the POHMELFS
kernel client as a kind of bittorrent filesystem client), and as in any
non-centralized system, content transferred via uncontrolled data channels
has to be encrypted.
But... I'm incredibly stupid: I implemented encryption and decryption in place,
i.e. a VFS page is encrypted before being written to the servers, so a
subsequent read leads to... Yes, it reads encrypted content.
To fix this issue I plan to encrypt data into different pages and send those,
leaving the VFS ones as is. There are two approaches I consider:
1. Allocate and send pages at writeback time - we want to send 5 pages, so allocate
5 pages, encrypt data into them and broadcast them to all needed servers.
2. Allocate a (potentially large) pool of pages at mount time per crypto thread
and encrypt data into them. This will have about zero run-time overhead for VFS,
except that write completion is slightly delayed because of encryption.
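To make the in-place problem concrete, here is a minimal userspace sketch of the first approach: encrypt into a separately allocated bounce buffer and leave the cached page untouched. It is not the POHMELFS code; `toy_encrypt()` is a deliberately insecure XOR stand-in for a real cipher, and `encrypt_to_bounce()` is a name made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for a real cipher: XOR with a repeating key. NOT secure.
 * key_len must be non-zero. */
static void toy_encrypt(uint8_t *dst, const uint8_t *src, size_t len,
                        const uint8_t *key, size_t key_len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i] ^ key[i % key_len];
}

/*
 * Encrypt `len` bytes of a cached page into a freshly allocated bounce
 * buffer. The cached page is only read, so later reads from the cache
 * still see plaintext. Returns NULL on allocation failure; the caller
 * frees the buffer once every server has acknowledged the write.
 */
static uint8_t *encrypt_to_bounce(const uint8_t *page, size_t len,
                                  const uint8_t *key, size_t key_len)
{
    uint8_t *bounce = malloc(len);

    if (!bounce)
        return NULL;
    toy_encrypt(bounce, page, len, key, key_len);
    return bounce;
}
```

The second approach would replace the `malloc()` call with a buffer taken from a per-thread pool allocated up front, trading memory for an allocation-free write path.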
I frequently hear that whatever server you implement, it has to
be non-blocking, since with parallel sending it allows you to
send multiple requests to fast servers while not sending data to a
slow server, because the non-blocking socket will return EAGAIN.
This is only a half-right solution: when we have to put given data on
all servers, and cannot free it until all servers have replied with an acknowledgement,
non-blocking mode can bring more damage than gain.
This is mainly because it
allows all the memory to be eaten by requests which are still in the queue
to be sent to a slow server, but which were already sent to the fast ones.
In this case a higher-level application (consider a simple application which generates
some data and writes it into a file in a distributed filesystem, which in turn writes
the file to several servers) will never block, since the transfer
to the fast servers completes quickly, and it will provide more and more data,
which will consume all RAM.
It is possible to deadlock the system in this case,
since to send some data to a remote server we always have to allocate at least some
memory to put network headers into. With a non-blocking solution the system will consume
all memory and kick itself into a coma.
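One way to picture the missing piece is an explicit cap on data that has been queued but not yet acknowledged by every server. The sketch below is only an illustration of that accounting, not something taken from POHMELFS; the `inflight` structure, its `limit` field and both helper names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Accounting for data that is pinned until all servers acknowledge it. */
struct inflight {
    size_t bytes;   /* currently pinned; invariant: bytes <= limit */
    size_t limit;   /* hypothetical cap on pinned memory */
};

/*
 * Try to account for a new request of `len` bytes.
 * Returns false when the cap would be exceeded, in which case the
 * writer has to wait (block) instead of queueing more data.
 */
static bool inflight_reserve(struct inflight *f, size_t len)
{
    if (len > f->limit - f->bytes)
        return false;
    f->bytes += len;
    return true;
}

/* Called once the slowest server has acknowledged the request. */
static void inflight_release(struct inflight *f, size_t len)
{
    f->bytes -= len;
}
```

With such a cap the application is throttled to the speed of the slowest server, which is exactly the blocking behaviour the pure non-blocking design tries to avoid, but memory usage stays bounded.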
I've updated the OSF modules to xtables, so you have to enable its support in the kernel config and get
a recent iptables (I tested with 1.4.1.1, which is the latest release to date).
OSF allows you to match incoming packets against different sets of SYN-packet signatures and determine
which OS is on the remote end, so you can make decisions based on OS type
and, to some degree, even version.
Installation instructions, an example and the source code can be found on the
homepage.
I've also sent it to the netfilter-devel@ and netdev@ mailing lists, since my previous mails never appeared
there, likely because of spam filters.
Rumor number one. SWsoft, aka Parallels, actively searches for Linux kernel hackers in
leading Moscow universities, namely MSU and MIPT. I saw their
posters, where among other (wanted) requirements there is
distributed filesystem knowledge.
Rumor number two. Alexey Kuznetsov (if you do not know,
he's the guy who wrote the major part of the Linux network stack,
namely the TCP/UDP/IP and socket implementations, and although
there have been lots of changes in the stack since then, I think it will not
be an exaggeration to call him the author), who also worked
on Virtuozzo and OpenVZ (and its interesting VFS parts, which
AFAICS are not in the kernel, maybe yet), works on some
filesystem too. The last time we 'confronted' each other was a couple
of years ago, when I first implemented netchannels
and tried to convince the network community (namely Alexey Kuznetsov
and David Miller)
that the netchannel idea was worth further investigation and implementation.
IIRC I did not succeed, although the results were very
impressive.
Let's see what will happen with filesystems :)
Rumor number three. SWsoft recently started to actively search
for a kernel hacker for a 'new interesting open source project'. They
have always searched for kernel programmers, but never told anything
about the projects; now something has changed.
Rumor number four. OpenVZ and Virtuozzo have serious problems with NFS
(especially when the server dies), probably because of the very ugly NFS protocol
(yes it is), so it's hard to properly virtualize it (or not?). There are
no alternatives to NFS right now in major production setups, but you all know about
POHMELFS, which right now can be used as a really good replacement.
Rumor number five. SWsoft has a long history of PhD defences (at least in MIPT) based on a
theoretical FS called TorFS (namely Tormasov FileSystem). A year ago it was still
not a very alive project in practice,
but I heard that it was very impressive in theory. This rumor has existed
for really many years.
So, I have a quite clear picture that SWsoft started development of a new
distributed filesystem, which is aimed first at replacing NFS in virtualized
environments. I can also imagine very interesting distributed parallel facilities
needed for virtualized systems. And they try to attract lots of people to the
project, as well as really heavy artillery like Alexey Kuznetsov.
Which basically means that sooner or later my development will meet strong
competition from this company, which has lots of really good professionals.
And that's very interesting and cool :)
P.S. Or it may be complete bullshit and the delirium of my fevered consciousness.
And one fact about POHMELFS:
today I finished client support for padded crypto processing of all requests
and started to work out the server bits. I expect to finish them in a day or so,
so a new release is very close.
It was really interesting. Although it is a very simple student
model, a friend produced very good sounds. He has not practiced
for many years already, but nevertheless it was not that bad.
My everyday half-hour to hour exercises usually produce worse sound, although
sometimes I do find really cool notes. Unfortunately I still do not
know the magic bit about how to catch that sound; it is born and
disappears on its own, but I'm sure I will find it, and I think I'm close
to where it hides :)
1. Because of the encryption problem - data to be encrypted has to be
blocksize aligned - some information about padding has to
be added into the network command, as well as the crypto data size.
2. IV generation. I decided to extend the network command and put a
64-bit IV for the given packet there. Using a simple sequence number is enough
to protect against replay attacks.
3. Encryption/hashing of data. I decided not to encrypt/hash network headers,
and only do it for the transmitted data. If a transaction contains several
commands, data for all commands will be encrypted/hashed; in the case of hashing,
a single digest/HMAC will be generated and placed into the transaction header.
4. It is possible that I will add a strong header checksum, which will be generated
only for the header and placed into a special field. It will be calculated
assuming the checksum field is zero. This step is optional so far, but the network header
has 32 reserved bits, which can be used for it.
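The first two points can be sketched as a small helper that fills the crypto-related fields of a command. The structure below is a guess made for illustration only: the field names, their widths (apart from the 64-bit IV mentioned above) and the function name are assumptions, not the actual POHMELFS network command layout.

```c
#include <stdint.h>

/* Hypothetical crypto fields of a network command (illustrative layout). */
struct cmd_crypto {
    uint64_t iv;        /* per-packet IV taken from a sequence counter */
    uint32_t size;      /* size of the data after padding */
    uint32_t padding;   /* number of padding bytes appended */
};

/*
 * Round data_size up to a multiple of blocksize (which must be non-zero),
 * record how much padding was added and assign the next sequence number
 * as the IV, advancing the counter.
 */
static void cmd_crypto_fill(struct cmd_crypto *c, uint32_t data_size,
                            uint32_t blocksize, uint64_t *seq)
{
    uint32_t rem = data_size % blocksize;

    c->padding = rem ? blocksize - rem : 0;
    c->size = data_size + c->padding;
    c->iv = (*seq)++;
}
```

For example, 100 bytes of data with a 16-byte cipher block give 12 bytes of padding and 112 bytes on the wire; the receiver strips the padding using the recorded count.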
Right now hashing and encryption work, but the results are not checked on the server (although generated):
because of the crypto alignment ugliness I decided to rethink the approach a bit.
Evolution process in action...
That really sucked - yes, we played badly. Just like it was before.
It is not really surprising.
But what was that fucking abnormal thing a week ago against Holland? That
was new, was cool, was bloody great, but not today. Tired or whatever...
What's the difference right now, we lost.
Yes, Spain played really well, my congratulations.
But our team showed that it is possible.
That there is nothing impossible.
We can, when we want. You can, when you want.
The POHMELFS server is able to handshake hash/cipher names and operation
modes, to initialize the appropriate algorithms and perform basic operations
(like a more generic hash_update() instead of different
functions with different arguments used to hash data depending on the operation mode,
either simple digest or HMAC: EVP_DigestUpdate()/HMAC_Update()).
I'm working on the right way of doing crypto processing, i.e. without serious changes in the code,
since how it is done right now is a bit hairy.
I already hate the OpenSSL API: EVP_get_cipherbyname(), EVP_MD_CTX, EVP_DigestFinal_ex().
It looks like the above functions were written by three different people who
never actually talked to each other about how to make them look similar... But it is
a minor issue of course.
So, when things settle down, I will make a new release; it will likely see the light this week.
My ISP again blocked my account and cannot unblock it although there
is money on the deposit. There are serious problems in its billing
system which require manual intervention by an operator. Unfortunately
it is a real challenge to call them: it already took more than half an hour
yesterday, without success.
So, I decided to implement an interesting idea on how to bypass the blocking.
It is based on a security 'hole' in its DNS configuration (and I think the vast majority
of ISPs do the same), which allows
requesting any DNS record even if the account is blocked. The record will be fetched from a
remote DNS server if it is not in the ISP's cache.
Thus the attack vector becomes visible: implement an IP-over-DNS tunnel network device
and set up local routing to use it by default. One has to control at least one
remote machine which hosts the DNS records for a given domain name, since it is required
to parse incoming DNS requests and process them accordingly.
There are at least two known IP-over-DNS tunnel solutions:
NSTX (howto) and OzymanDNS (howto). Both solutions require that you own
a server to run the IP-over-DNS tunnel server on.
Unfortunately I have only a single machine with a static IP address which is not protected
by lots of firewalls and allows incoming connections.
The simplest solution to this problem is to create an iptables input target rule
on the server, which will parse incoming DNS requests, pass usual queries up
the network stack to the userspace server, and handle 'poisoned' queries as tunnel traffic.
The client can be TUN/TAP based, but can also be a tunnel network device.
I believe the weirder it looks, the more interesting it is, so I will likely think
more about kernel-based tunnels.
DNS queries are limited enough not to allow binary data (IIRC,
the most interesting are DNS TXT records), but it can be appropriately
encoded and enciphered. So, I will put it into the
todo list.
I even think that it is not that bad an idea to have such modules in the kernel :)
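As a rough idea of what "appropriately encoded" could mean for the query direction, here is a sketch that hex-encodes a binary payload into DNS labels in front of a tunnel domain. It is an assumption-laden illustration, not code from NSTX or OzymanDNS: `dns_encode()` and the example domain are made up, hex is chosen only for simplicity (real tunnels tend to use denser encodings), and the overall name length limit is not enforced. The only DNS fact relied upon is that a single label may not exceed 63 characters.

```c
#include <stddef.h>
#include <stdint.h>

#define DNS_LABEL_MAX 63   /* maximum length of one DNS label */

/*
 * Hex-encode `len` bytes of `data` into dot-separated labels and append
 * ".suffix" (just "suffix" when len is 0). Each label holds at most 62
 * hex characters. Returns the length of the resulting string (without
 * the terminating NUL) or -1 if `out` is too small.
 */
static int dns_encode(const uint8_t *data, size_t len, const char *suffix,
                      char *out, size_t out_size)
{
    static const char hex[] = "0123456789abcdef";
    size_t pos = 0, label = 0;

    if (out_size == 0)
        return -1;

    for (size_t i = 0; i < len; i++) {
        if (label + 2 > DNS_LABEL_MAX) {    /* current label is full */
            if (pos + 1 >= out_size)
                return -1;
            out[pos++] = '.';
            label = 0;
        }
        if (pos + 2 >= out_size)
            return -1;
        out[pos++] = hex[data[i] >> 4];
        out[pos++] = hex[data[i] & 0x0f];
        label += 2;
    }

    if (len > 0) {                          /* separator before the domain */
        if (pos + 1 >= out_size)
            return -1;
        out[pos++] = '.';
    }
    for (const char *s = suffix; *s; s++) {
        if (pos + 1 >= out_size)
            return -1;
        out[pos++] = *s;
    }
    out[pos] = '\0';
    return (int)pos;
}
```

Encoding the two bytes 0xde 0xad with the suffix "t.example.com" yields the query name "dead.t.example.com"; the server side would strip the suffix and decode the labels back into the tunnelled packet.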
But testing cannot be done without appropriate server support, which
is now the main task. POHMELFS uses a lazy crypto engine: each network state
(it represents a connection between the client and one server) contains a
number of fields used exclusively for semi-lockless input data processing
(it locks the state when it performs the actual reading, but does not
hold that lock when processing incoming messages, since it is the only
path which receives data). Now it also has crypto information about
how to handle reply messages (they include the read page reply, for example),
so it does not queue work to be done by the crypto threads, but does it itself
instead. It may or may not be the bottleneck of the input path; tests will
provide the facts. So far I do not have plans to change it, but it can be done
of course if performance sucks.
After I finish crypto processing in both the client (it has been written, but requires lots
of testing with the server) and the server (I have just started to recall how to work with
OpenSSL), I will switch to something different for a while.
Well, I've read how HMAC works in OpenSSL, found it to be simple enough
and then started to read how to parse binary data in LISP :)
But anything which is interesting for me now ends up in good results for all my other
projects.
Some voices in my head ask me to spread out in lots of interesting directions :)
I've run read/reread and write/rewrite tests as described
in the previous run,
now with HMAC(SHA1) of all outgoing transactions (note that read response data is not yet
encrypted and does not contain a digital signature, and the server does not support either operation).
Essentially only writing should be affected by this, but I also ran the reading tests for completeness.
Results show zero performance overhead for full data SHA1 hashing, but note that quite fast
machines were used (two 3 GHz Xeons (2 physical and 2 logical CPUs, HT enabled) with 1 GB of RAM). All the time only
two crypto threads were actively hashing data, since there are only two pdflush threads on this machine.
Writing is even faster with hashing, but the results drifted around, so essentially performance is the same.
I've just learned that it is impossible to map the same page
twice: for example the first time using kmap()/kunmap()
and the second time via kmap_atomic()/kunmap_atomic().
Although the mechanisms are a bit different in the two mappings, it is
forbidden, and the system will panic like this:
This happened in exactly the above case, when a page was first mapped via
kmap() in POHMELFS and then via
kmap_atomic() in the HMAC crypto processing code.
I wonder what will happen if we ever try to send kmapped pages
over an IPsec tunnel. Likely it will oops too...
This can happen for example when pages are mapped in
tcp_sendpage() when calling sendfile()
over an interface which does not support hardware checksumming
and scatter-gather: mapped pages are pushed down the network stack,
where they will eventually be encrypted/hashed in IPsec, which
will in turn call kmap_atomic().
So, if you find an obscure oops in kmap_atomic()
and friends, first check that the calling stack did not map the page
earlier.
So far it only includes encryption and hash calculation for outgoing
transactions. The system has a number of threads per superblock (a mount option),
which are responsible for encryption/hashing (each thread has its own crypto structure,
so there are no additional allocations in the fast path, although I think
they would not harm performance, since they should be a small enough
fraction on top of the crypto processing overhead) and subsequent data sending,
so the original caller (like the writeback/readahead code) will not block if there
are ready threads; otherwise it will wait until some thread finishes its current crypto work.
I decided to implement a kind of continuation for such transactions: the network sending
code (which is supposed to be started after crypto processing) will be invoked from the threads
which performed the crypto operations, without returning to the original caller's context.
For massively multiqueue NICs that should be a benefit, but so far I have not tested its performance.
The next step is receive-side crypto support and userspace changes.
If I did not miss something,
GNU TLS (I have never worked with it)
supports a very limited number of ciphers and hashes, so it is not appropriate for a
filesystem data protection layer.
According to its documentation,
GNU TLS only supports AES, RC4 and 3DES ciphers and SHA1 and MD5 hashes. There is also only the CBC
chaining mode and several hash/cipher schemes.
So, the POHMELFS server will use OpenSSL for data protection. Sooner or later OpenSSL
will get hardware crypto support on Linux too (well, the Linux crypto stack should first
implement a userspace API, which does not exist yet, although there is
work by Loc Ho from AMCC to add such support).
So far I decided to implement the following protection scheme: checksum or encryption
will cover the full transaction data, but will be applied in chunks:
1. Transaction 'first-level' data, i.e. the header and the data placed immediately after the transaction
header. For all commands except page writing this is all there is.
2. For the write pages command, each header is generated dynamically and does not exist
until data is really being sent, so the crypto code will run over all pages and update the checksum,
processing headers and data pages separately. The checksum update should be simple enough, since
there are crypto helpers to update and finalize a checksum, but encryption is more complex:
it requires all chunks to be set up in advance in a single scatterlist chain, and with dynamic header
generation that is too big an overhead (it requires not only scatterlist allocation, but also
header allocation just for encryption). So encryption will be done separately for headers and pages,
and I will have to create some IV propagation scheme (like the last bytes of the previous unencrypted chunk
becoming the IV for the next chunk, or something like that). I understand that this may not be a very
secure approach though.
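The IV propagation idea mentioned above could look roughly like this. It is only a sketch of the "last bytes of the previous unencrypted chunk" variant: the 16-byte IV size, the zero-padding rule for short chunks and the `next_iv()` name are assumptions, and, as noted, deriving IVs from plaintext may not be a very secure scheme.

```c
#include <stdint.h>
#include <string.h>

#define IV_SIZE 16   /* assumed IV length, e.g. one 128-bit cipher block */

/*
 * Derive the IV for the next chunk from the tail of the previous
 * plaintext chunk. Chunks shorter than IV_SIZE are left-padded with
 * zeroes so the IV always has a fixed length.
 */
static void next_iv(uint8_t iv[IV_SIZE], const uint8_t *prev, size_t prev_len)
{
    memset(iv, 0, IV_SIZE);
    if (prev_len >= IV_SIZE)
        memcpy(iv, prev + prev_len - IV_SIZE, IV_SIZE);
    else
        memcpy(iv + IV_SIZE - prev_len, prev, prev_len);
}
```

Both sides can compute the same chain without sending extra data, but the receiver has to decrypt chunks strictly in order, since each IV depends on the previously decrypted chunk.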
Reading data back from the server is simpler, since there are no transactions,
and data will be encrypted/checksummed like in the first step above. It is possible that this will
force me to increase the network header structure a bit (32 or 16 bits to store the size of the attached checksum).
It is fucking unbelievable, but Russia plays with Holland
and the score is 1:1. Not only is it equal, we do play cool football!
And Holland equalized in the 87th minute, we were so close, but
it is not over yet. We can win. We will win!
I do not understand how in the hell our team started to play that
well, but we can. We fucking can, when we want. We play not for the goal, not
for the money, not for fucking anything, we play just for the game.
And the game wins!
The first half of extra time has ended. Russia vs Holland 1:1.
We can. Just because we can.
As I found with the distributed storage
project, any communication channel which involves huge amounts of data transfer
has to have an additional strong checksum embedded in the protocol, since the TCP one is not
enough in some cases. There are some options, like TCP MD5 signatures or IPsec transformations,
but they are not always available.
POHMELFS
will include the ability to encrypt the whole data channel and/or digitally
sign all messages. This will be implemented at the transaction level, so no higher layer code
(like the data reading/writing functions) will ever be affected.
POHMELFS will also have mount-time self-configuration, i.e. the client will send the server
information about the capabilities requested by the administrator, and if the server does not
support some of them (for example it can only do HMAC and not encryption, and both operations were
requested at mount time), they will be dropped (and optionally the mount will fail).
In the future it will be possible to extend it with additional flags if needed.
mount is not a very convenient command for transferring crypto information (like binary keys)
to the kernel, so I use the same infrastructure as for the initial server group initialization (i.e. the
existing POHMELFS configuration utility).
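The mount-time capability handshake described above boils down to intersecting two flag sets. A minimal sketch, assuming capabilities are exchanged as a bitmask; the flag names and values and the `negotiate_caps()` helper are hypothetical and not taken from the POHMELFS protocol.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability flags. */
#define CAP_HMAC    (1u << 0)
#define CAP_ENCRYPT (1u << 1)

/*
 * Compute the capabilities both sides support. Unsupported requests are
 * dropped; with `strict` set, dropping anything fails the mount instead.
 * Returns true if the mount may proceed.
 */
static bool negotiate_caps(uint32_t requested, uint32_t server,
                           bool strict, uint32_t *agreed)
{
    *agreed = requested & server;
    if (strict && *agreed != requested)
        return false;
    return true;
}
```

With a client asking for `CAP_HMAC | CAP_ENCRYPT` and a server that only does HMAC, the relaxed mode mounts with HMAC only, while the strict mode refuses to mount. New flags can be added later without changing the logic.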
Support for HMAC and encryption will force the server to depend on OpenSSL,
but I do not think it is a problem. At some point in the future I can write autoconfiguration which will
allow compiling the server without crypto support (so that it does not accept encrypted clients and
does not check signatures) if there is no OpenSSL.
After crypto operations are implemented (I expect to finish this week), I will release, as promised, a
new netchannel
version (and will remove unneeded functionality like NAT), and add some interesting bits (like async
processing) to distributed storage,
so expect its new release soon too.
Excellent documentation with examples.
I expect that it is implementation-specific (i.e. CLISP only) and will not work with SBCL or Allegro,
for example, but nevertheless I want to learn it and make some use of it.
If it turns out to be good for my use cases, what will my next userspace server be written in? :)
Hardware used in testing: 4-way Intel E7520 system (two logical and two physical CPUs),
3 GHz 32-bit Xeons with 1 GB of RAM, Adaptec AIC7902 Ultra320 SCSI adapter with a SEAGATE
ST3300007LC 10k rpm 300 GB test disk. Its linear read speed is about 90 MB/s.
Software used in testing: 2.6.25 kernels (on server and client), in-kernel async NFS server,
userspace POHMELFS server.
Tests were performed with 8 GB files (the amount of RAM was reduced to 1 GB to eliminate caching
influence) with different record sizes (from 8 to 1024 KB). I ran write/rewrite, read/reread and
random read and write tests.
Zach Brown has committed
cache coherency support into the CRFS repository.
The cache coherency protocol works by broadcasting special messages from the
server, and each client invalidates the appropriate inodes (and dentries if needed)
before sending back a reply. POHMELFS
uses a slightly different mechanism: the client does not send acks back to the server,
so all such messages are kind of advisory-only, but I have not yet completed (well,
I did not even think about this problem this week) the locking design, so it can change.
The main problem with sync cache coherency support is its absolute non-scalability.
While a number of use cases might require such behaviour, I expect that a noticeable part
of users, if not the majority, do not want performance degradation as the price for
POSIX-like coherency expectations. This approach is worse than a write-through cache,
since there is a whole round-trip of the cache coherency request instead of just
sending data during its writing. Single-direction sending is faster than sending plus waiting,
so for me it is still a questionable approach.
I will think a lot about this problem later this week(end), so that the solution will
satisfy both the high-performance and safety camps (although only to some degree, I think).
(defmacro with-output-dir ((out pos dir flags) &body form)
  ;; Runs FORM once per iozone test (skipping the first two entries of
  ;; *iozone-tests*), with OUT bound to a stream on that test's gnuplot
  ;; file and POS bound to the test's index.
  ;; OPERATION (the current test name) is deliberately visible in FORM.
  `(let ((,pos 2))
     (dolist (operation (nthcdr 2 *iozone-tests*))
       (let ((output-file (make-pathname
                           :directory (pathname-directory
                                       (pathname-as-directory ,dir))
                           :name operation
                           :type "gnuplot")))
         (with-open-file (,out output-file :direction :output :if-exists ,flags)
           ,@form))
       (incf ,pos))))

(defun write-gnuplot-headers (dir)
  (with-output-dir (out pos dir :supersede)
    (format out "set title \"Iozone performance: ~a, KB/s\"~%" operation)
    (format out "set terminal png small size 450 350~%")
    (format out "set logscale x~%")
    (format out "set xlabel \"Record size in KBytes\"~%")
    (format out "set ylabel \"Kbytes/sec\"~%")
    (format out "set output \"~a.png\"~%" (elt *iozone-tests* pos))
    (format out "plot ")))

(defun update-gnuplot-headers (dir file)
  (with-output-dir (out pos dir :append)
    (unless *first-file-p*
      (format out ", "))
    (let* ((fstype (pathname-name file))
           (name (make-output-name file)))
      (format out "\"~a\" using 1:~d title \"~a\" with lines" name (1+ pos) fstype))))
Macros are really the coolest feature of LISP. Now I believe I have started to understand LISP kung-fu.
The iozone parser is essentially ready. I was a bit pessimistic yesterday: it took only half a day and several
hours today, and the code itself is rather ugly (and frequently really ugly, likely far from the LISP way), but it works:
it runs over the given dir, searches there for files with the given extensions, parses them (removes unneeded iozone information)
and writes the result to the specified directory. It also runs over the iozone test strings and generates gnuplot scripts for them, which
will build a graph based on the filesystem info gathered while traversing the tree above, so the results look like this:
$ ./parser.lisp
Processing: /tmp/iozone/tmpfs/nfs.out ... done
Processing: /tmp/iozone/tmpfs/pohmelfs.out ... done
$ cat /tmp/iozone/tmpfs/out/read.gnuplot
set title "Iozone performance: read, KB/s"
set terminal png small size 450 350
set logscale x
set xlabel "Record size in KBytes"
set ylabel "Kbytes/sec"
set output "read.png"
plot "/tmp/iozone/tmpfs/nfs.out.data" using 1:5 title "nfs" with lines,
"/tmp/iozone/tmpfs/pohmelfs.out.data" using 1:5 title "pohmelfs" with lines
(defun string_to_list (str)
  (let ((num 0) (ret '()) (string_len (length str)))
    (dotimes (i string_len)
      (let ((sym (elt str i)))
        (cond
          ((not (char-number-p sym))
           (unless (eql num 0)
             ;(format t ": ~d~%" num)
             (push num ret)
             (setf num 0)))
          (t (setf num (+ (* num 10) (to_number sym)))
             (when (eql i (- string_len 1))
               (push num ret))))))
    (nreverse ret)))
This is a part of my LISP parser for iozone output files. So far it is able to convert the output numbers (performance in KB/sec)
into LISP lists (one list per record), so a single line of iozone output becomes a single list of numbers
(ugh, I was forced to write a string-to-number conversion function).
It is likely not that serious an achievement, and it took the whole day, but nevertheless I like it,
although I would have written the same in C much faster :)
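For comparison, a C version of the same conversion might look like the sketch below. It is an illustration written for this comparison, not part of the parser; it borrows the function name from the LISP code, and unlike the LISP version it keeps zero values instead of silently dropping them.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Extract every run of decimal digits from `str` into `out`.
 * Stores at most `max` numbers and returns how many were stored.
 * Overflow of very long digit runs is not checked.
 */
static size_t string_to_list(const char *str, unsigned long *out, size_t max)
{
    size_t n = 0;

    while (*str && n < max) {
        if (!isdigit((unsigned char)*str)) {
            str++;              /* skip separators */
            continue;
        }
        unsigned long num = 0;
        while (isdigit((unsigned char)*str))
            num = num * 10 + (unsigned long)(*str++ - '0');
        out[n++] = num;
    }
    return n;
}
```

A line such as "  64  4  1024 0" becomes the four numbers 64, 4, 1024 and 0.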
The main problem with Lisp for me is its functional style of conditionals. Converted to C it looks like:
if (a) {
    if (b) {
        if (c) {
            do_stuff();
        }
    }
}
While I would write:
if (!a)
    return;
if (!b)
    return;
if (!c)
    return;
do_stuff();
So far I have not used macros at all, and all the time looked into the
Practical Common Lisp book
(and frankly took the directory processing functions from there, although
I modified them a bit), but what would you expect from a first project. Tomorrow I will extend it to
write a gnuplot-compatible file and finally generate some graphs (I do not know
how to call external programs from LISP though).
Frankly, I'm not yet excited about how cool LISP is, but I like it, since it is different.
Just like I like my never-ending apartment development process.
Ugh, and with proper automatic vim highlighting I am not afraid of parentheses.
The interested reader can grab my sources
and comment on their ugliness.
Decided to work in an area completely different from the usual one
today, namely the never-ending apartment development.
Today I painted the whole ceiling in the kitchen and I want to believe
that it is the last time. It was not that quick, but took a noticeably smaller
part of the day.
The main task was the floor in the hall. I finally covered it with ceramic granite.
It was supposed to be a seamless granite installation, but... the tiles have such precise
dimensions that the difference between them was never more than half a centimeter
on each side, so I was forced to make small seams and move tiles around for quite
a while before they formed somewhat straight lines, although there are
lots of non-straight crosses.
Nevertheless it looks cool, and I'm glad I finished this part.
Ever dreamt of blocking all Linux users in your network from accessing the
internet and allowing full bandwidth to Windows worms? We have to care about
our smaller brothers, so this iptables extension module allows you to do
so.
OSF stands for OS Fingerprint. It allows you to build usual iptables
decisions on incoming TCP packets: the initial handshake packet containing the SYN
bit alone is enough to understand what the remote OS is. The original idea belongs to
Michal Zalewski.
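To give a feeling for how a SYN packet alone can identify an OS, here is a heavily simplified matching sketch. It is not the OSF module's code or its fingerprint format: the structures, the choice of fields (window size, TTL, DF bit) and the function name are assumptions made for illustration, and any fingerprint values fed into it would have to come from a real signature database.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fields observed in an incoming SYN packet (illustrative subset). */
struct syn_info {
    uint16_t window;
    uint8_t  ttl;
    bool     df;
};

/* One entry of a hypothetical fingerprint table. */
struct os_fingerprint {
    const char *os;
    uint16_t    window;
    uint8_t     initial_ttl;
    bool        df;
};

/*
 * Return the name of the first fingerprint matching the SYN, or NULL.
 * The TTL only decreases in transit, so the observed value must not
 * exceed the OS's initial TTL.
 */
static const char *osf_match(const struct os_fingerprint *db, size_t n,
                             const struct syn_info *syn)
{
    for (size_t i = 0; i < n; i++) {
        const struct os_fingerprint *fp = &db[i];

        if (syn->window != fp->window || syn->df != fp->df)
            continue;
        if (syn->ttl > fp->initial_ttl)
            continue;
        return fp->os;
    }
    return NULL;
}
```

A real matcher also looks at the TCP options and their order, which is where most of the distinguishing power comes from.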
This iptables module was
implemented almost 5 years ago and lived in the patch-o-matic iptables tree (the userspace
library is still there). Now I've updated it to Xtables
and sent it for review.
Installation steps are described on the
homepage,
but they are trivial and include the usual make/make lib build and loading rules into the module
via a procfs file.
Fixed a bug found by Salvatore Del Popolo (delpopolo_dit.unitn.it)
in the TCP implementation: the system checked the sending window, determined
that a packet was not allowed to be sent, and nevertheless tried to send it in some
cases.
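The kind of check involved can be sketched as follows. This is a generic illustration of a TCP send-window test, not the actual unetstack code or the fix itself; the function and parameter names are made up, with `snd_una` being the oldest unacknowledged sequence number and `snd_wnd` the window advertised by the peer.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if a segment of `len` bytes starting at sequence number
 * `seq` fits into the peer's window, which covers snd_wnd bytes starting
 * at snd_una. Unsigned arithmetic keeps the check correct across
 * sequence number wraparound.
 */
static bool tcp_can_send(uint32_t snd_una, uint32_t snd_wnd,
                         uint32_t seq, uint32_t len)
{
    uint32_t in_flight = seq - snd_una;

    if (in_flight > snd_wnd)
        return false;
    return len <= snd_wnd - in_flight;
}
```

The important part is that a negative answer must actually stop the transmission: the segment stays queued until an ACK opens the window.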
Userspace network stack
is a very fast (when working on top of
netchannels;
packet sockets are also supported) and very small network stack (TCP/UDP/IP/ethernet) implemented
entirely in userspace. Because it lives at the very end of the peer (i.e. very close to
or even embedded into the application), it allows much faster processing of some workloads, namely
small packet sending and receiving, where
it outperforms
the vanilla Linux TCP/IP stack 3 times in performance and 4 times in CPU usage (sending and receiving vary).
Compare netchannels+unetstack versus Linux sockets (2006 numbers).
It is not about problems in the Linux stack, but about the overhead of syscalls, which is in turn
a result of data sending and reply processing being too separate in the existing model.
I've finally made a new release of CARP for the Linux kernel.
CARP is an improved version of the Virtual Router Redundancy Protocol (VRRP) standard.
The latest protocol to help provide high availability and network redundancy, it was
developed because router giant Cisco Systems believes that its Hot Standby Router
Protocol (HSRP) patent covers some of the same technical areas as VRRP.
This project allows you to build highly available clusters of multiple machines with
balanced master selection between them. Installation and setup are pretty trivial:
$ tar -zxf carp_latest.tar.gz
$ cd carp
$ make
# insmod ip_carp.ko
# modprobe cn
# insmod carp_conn.ko
# ifconfig carp0 up
# carp_conn_daemon -m master.sh -b backup.sh
And the same on all other machines.
Each script, as you can guess from its name, is executed when the node becomes master or backup;
you can put there firewall rule changes, traffic shaping setup, network daemon start/stop
scripts and whatever you like.
Its main advantage over any other existing open master/backup solution (like Heartbeat or userspace CARP;
well, it also behaves much more robustly than Cisco VRRP) is the ability to set up a multicast address (via the usual
/sbin/ifconfig command) and thus not confuse some crappy Cisco
hardware, which would not understand that the node changed.
One can get the latest sources from the CARP homepage.
Enjoy!
POHMELFS write speed is about 10% faster, read speed is 3-3.5 times faster
(essentially the disk/local fs IO limit, see below).
POHMELFS random read speed is lower, and fixing that is the task with the highest priority now,
especially compared to the local FS results. POHMELFS random write is slightly faster than NFS.
For comparison, local filesystem, used for tests. mkfs.xfs -d agcount=75 -l size=64m /dev/sdc1;
mount -o logbufs=8,nobarrier,noatime,nodiratime,osyncisdsync /dev/sdc1 /mnt/:
Write requests are sent to multiple servers and completed only when all of them sent an ack.
Ability to add and/or remove servers from the working set at run-time from userspace (via netlink;
the same command could be processed from the real network too, but since the server does not support it
yet, I dropped the network part).
Documentation (overall view and protocol commands)!
Rename command (oops, forgot it in previous releases :)
Several new mount options to control client behaviour instead of hardcoded numbers.
Bug fixes.
I will complete the documentation in a few moments and send this release to the mailing lists.
Very likely it is the last non-bug-fixing release of the kernel client side; the next release will incorporate
features needed for distributed parallel data processing (like the ability to add new servers via a network
command from other servers), so most of the work will be devoted to server code.
Essentially that's it, I believe most of the features I wanted
from a network distributed parallel filesystem, which should live
in the client, are already implemented in POHMELFS.
The client has the following features (if I did not forget something interesting;
only those interesting from the parallel point of view are listed):
Automatic failover reconnect to the same server.
Run-time addition/removal of the servers from the working set
(only via userspace command, since server does not support that yet,
but addition is trivial).
Transactions support. Full failover for all operations. Resending transactions to different servers on timeout or error.
Load balancing of reading (directory reading and lookups inclusive) requests and
simultaneous writing to all servers in current working set.
It is damn fast (but remember that random reading
is not yet optimal, and in
the last tests it was slower than NFS).
Meanwhile the userspace server does not support lots of features it has to support
to be called a complete parallel distributed solution, and the main work should now
be concentrated on it.
Main missing (and the most complex) features are:
Distributed data coherency protocol like PAXOS for server data, stored on multiple machines.
Ability to mirror data itself on multiple machines.
So, likely release will see the light tomorrow or Friday.
And yes, very likely the Linux kernel community lost me (and I do believe
nobody cares, me included).
But not Linux kernel, it is definitely the place I like.
People, who want to hack on Linux kernel will do that without all
that empty talks and brilliant ideas, all of which are only aimed in
a single direction: do what we will ask you to do for us. Be fair and
admit that you do not want new ideas implemented, you want old bugs (introduced
by someone else) fixed only, so that kernel got more respect without
possible additional work for you.
It is not how interested people work, instead they just decide themselves
how and what to do. That's why the kernel janitor project did not succeed:
it is not interesting for anyone. The same applies to its refocus on bugfixes.
And I do know what kernel janitorial work is: I started with that not long ago, fixing
trivial error checks like request_region()/check_region() code
and other minor things like PCI remap errors.
That was a hell of crap. Frequently there was a situation
when I fixed lots of drivers (like 20 or more) in one go and submitted a patch;
instead of applying it I was asked to split it into separate patches, to add each driver maintainer
to the copy, wait for their ACK, resubmit and so on. And it frequently happened
(especially when a new feature was introduced and lots of small pieces of code had to be changed
a little) that while I did that, some other known kernel hacker did the same, and his
patch was immediately applied.
Janitorial and all hypocrisy about 'we want more developers' just suck.
My advice for those who really want to hack on kernel: just do what you like,
try yourself in whatever subsystem you want, implement your ideas, be creative and do
whatever you like with kernel and not what all those kernel heads tell you to do.
The only way to succeed is to move forward!
Argh, and do not listen to any advice of this kind at all :)
POHMELFS
got ability to add/remove servers in run-time (although not via network command,
since I do not know, how to test it yet), but via netlink interface. The same
message can be passed via network though, so it will be simple to extend.
Also, POHMELFS got readahead support via ->readpages()
callback. I removed AIO reading from POHMELFS in favour of readahead
and got excellent result in sequential reading: 3-3.5 times faster than NFS
and essentially reaching disk IO bandwidth (a bit less though),
but random reading dropped to miserable numbers.
Also, the rewritten reading method should balance load better between multiple servers,
but it will not show any benefit in the single-threaded
iozone benchmark, since it reads data via a single call to read(),
which gets sequential data access, which in turn is faster than network bandwidth.
So a multithreaded load should greatly benefit from read balancing, but I have not
tested that yet.
I ran sequential read/reread, write/rewrite and random read/write tests for
XFS, Ext4, NFS (over XFS) and POHMELFS (over XFS) with 1Gb of RAM and 8Gb
of test files (to eliminate VFS caching influence) with 8Kb to 1Mb record size.
Results exist in text files in standard iozone output format, but since I'm learning
LISP I decided to write a graph generator (via gnuplot) using my very basic
knowledge of this language, so nice graph results can take a while...
Also, tomorrow morning I will fly away to my friend's wedding and will only
return on Monday the 9th. I will not have internet access there, only lots of fun.
Now they eat less memory, and a single writing transaction can accumulate
up to 1024 pages. This can be further tuned, especially for small requests
mixed with sync. Currently a writing transaction is allocated for its maximum
size, and then page pointers are written to the allocated area, so
if the number of dirty pages requiring writeback is small, quite a lot of
space will be wasted.
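To get a feel for that waste, here is a small back-of-the-envelope sketch. The 1024-page limit comes from the text; the pointer size and the function name are assumptions made only for illustration, not taken from the POHMELFS code.

```python
MAX_PAGES = 1024   # maximum pages per writing transaction (from the text)
POINTER_SIZE = 4   # bytes per page pointer; assumption: a 32-bit machine

def wasted_bytes(dirty_pages, max_pages=MAX_PAGES, pointer_size=POINTER_SIZE):
    """Pointer area allocated for max_pages but filled only for dirty_pages."""
    if not 0 <= dirty_pages <= max_pages:
        raise ValueError("dirty_pages must be between 0 and max_pages")
    return (max_pages - dirty_pages) * pointer_size

for dirty in (1, 16, 1024):
    print(dirty, "dirty page(s):", wasted_bytes(dirty), "bytes unused")
```

Under these assumptions the loss per transaction is a few kilobytes at most, so it only matters when many small transactions are in flight.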
It is a task for the next optimization, nevertheless currently sequential
writing is only limited by disk throughput or network bandwidth in case of
multiple servers, since link
is shared between machines, so effective bandwidth becomes equal
to GigE/number of servers, or about 60 MB/s in my environment with two servers
and single client.
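The arithmetic behind that estimate can be sketched as follows. This is only an illustration: the 125 MB/s figure is the theoretical GigE ceiling (an assumption, real links deliver a bit less), and the function name is made up here.

```python
# Rough per-server write bandwidth when one client sends the same data
# to every server over a single shared link.
GIGE_MB_PER_S = 1000 / 8  # theoretical GigE ceiling in MB/s (assumed, ignores protocol overhead)

def per_server_bandwidth(link_mb_per_s, servers):
    """Each byte crosses the client link once per server, so the link is split evenly."""
    if servers < 1:
        raise ValueError("need at least one server")
    return link_mb_per_s / servers

for n in (1, 2, 4):
    print(n, "server(s):", per_server_bandwidth(GIGE_MB_PER_S, n), "MB/s")
```

With two servers the ceiling is 62.5 MB/s, which is in line with the roughly 60 MB/s quoted above.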
Also, the reading path was not changed at all (only transaction
internals): there is still no readahead
and a new transaction is allocated for each page to be read. Nevertheless,
see how reading was improved: POHMELFS not only outperformed NFS again,
but reached the disk bandwidth limit already for 16Kb requests (almost two
times faster than NFS). The table shows IO throughput in KB/s.
I will create nice graphs out of these tables and will also include
optimized reading tests (likely tomorrow) and results for two data servers.
What also should be done, is testing with either bigger files or smaller
amount of ram and thus smaller VFS cache size. As you saw in all tests, when
lots of reads start to hit the cache, picture becomes completely non-informative
for filesystem behaviour. So I want to limit all three testing machines
to 1Gb of RAM (booting with mem=1G parameter) and perform the same iozone
bench for 8Gb file. Results should be more realistic.
In parallel I will implement the userspace run-time server addition/removal
command, which will also be used as-is for a network message from one
or another server connected before. With optimized reading transactions
it will be a good ground for the next POHMELFS release. So I plan to schedule
it for Thursday or the middle of next week, since I will be on a small vacation
June 6-9.
- So again, can you offer an alternative?
- Just give up on this dumb idea completely.
It is not about AppArmor in general (although maybe about it too), but about security hooks which provide
path information into inode callbacks. There are pros and cons for this decision,
but things look like path based security hooks will not be accepted.
There is a really trivial way to fix it. No kidding, it is simple: create your own
name cache and do not bind it to dentries, but instead index it by inode number.
This allows you to have whatever callbacks and information you want in strictly
bound VFS operations. Need to have path info in ->inode_create()?
Put it into your own tree indexed by the inode number of the parent inode, look that data up in
the security hook and make a decision. Yes, it is slower, but active security was never
a fast solution. It is still against the rules others created for security based
systems, but formally it stays within all boundaries of the created (maybe ugly
for someone) interfaces.
And I will not point to project, which already uses such approach in different area
though :)
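A userspace toy model of that idea might look like the sketch below. Everything in it is hypothetical (the class, the hook signature, the prefix-based policy); it only shows the shape of "private cache keyed by inode number, consulted from a hook that receives no path", not any real LSM interface.

```python
class NameCache:
    """Private name cache: inode number -> path, independent of dentries."""

    def __init__(self):
        self._paths = {}

    def remember(self, ino, path):
        self._paths[ino] = path

    def forget(self, ino):
        self._paths.pop(ino, None)

    def lookup(self, ino):
        return self._paths.get(ino)


def inode_create_hook(cache, parent_ino, name, denied_prefixes):
    """Decide whether creating `name` under the parent inode is allowed."""
    parent_path = cache.lookup(parent_ino)
    if parent_path is None:
        # Nothing is known about the parent; allowing here is an assumed policy.
        return True
    full_path = parent_path.rstrip("/") + "/" + name
    return not any(full_path.startswith(p) for p in denied_prefixes)
```

The extra lookup on every hook invocation is the slowdown mentioned above.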
It is interesting to implement your ideas not by breaking something (although sometimes
it is needed, but that's likely an exception or when you are hacking a deeply internal kernel
part), but instead by hacking around existing limitations.
I think I found a way to make progress in my trumpet playing exercises
(read: ear scratching screaming, it sounds much worse than a wrong note on a piano).
It is of course practice, but even without the whole tube and using only the mouthpiece I can train
the breathing path. Musicians have several hours of exercises per day, kernel hackers about half an hour each morning
in parallel with listening to AC/DC and Metallica as an alarm clock. The mouthpiece is rather quiet (noticeably
louder than usual talk though), but produces about the same resistance to air flow, so I think
it is good training. When the embouchure is stable enough I will attach the trumpet, since currently
the sound frequently drops and jumps. Nevertheless I made big progress (I think so at least) after starting
such training recently.
My home guardian Socket, although he does not have ears, looks like he does not like it. Although he only likes
to eat.
- If you haven't noticed, I don't take "no" for an answer,
- And now please tell us step 2 in your secret plan to win friends and influence.
- WTF are you getting at?
Fun thread :)
There is actually a serious problem in the kernel community when some new idea is being implemented
and it goes against something which sits in the mind of one or another big kernel hacker out there.
When such a person replies that this is a bad idea (sometimes without technical arguments), people
just stop looking at replies and do not follow the arguments of the author, just because they frequently do
not know the area in question well enough to make a decision and thus rely on others.
This only works when 'others', i.e. core kernel maintainers, are good, do not base their decisions
on personal feelings and take only the technical side into account. Unfortunately that is not always the case,
and political methods are used. Sometimes even only political methods are used...
Usually you will not see bad benchmark results for a developing
technology, but any such result is actually a _very_ good result
for a work-in-progress and not yet completed system. It allows one
to see how new proof-of-concept code compares
with an already completed, tuned and optimized system.
Conclusions drawn from such tests result in really superior decisions.
Let's compare iozone read/reread, write/rewrite and random
read and write for POHMELFS and NFS with 8Gb test files and
different record sizes (from 8Kb to 1Mb) on XFS over the GigE link.
I described the hardware and local iozone benchmark results in detail
previously.
Now it's time for network tests.
Async NFS in-kernel server results.
Sequential writing is 10-15% faster for POHMELFS (and limited by the underlying
fs speed), while random writing
is essentially the same and is limited by disk speed. But sequential reading
is _much_ worse for small requests. The reason is simple: POHMELFS does not support readahead,
since it does not have the ->readpages() callback, so any
sequential access ends up as a set of ->readpage() callbacks,
each of which waits for its completion, which is slow; readahead
is currently not invoked from the reading path.
I could not resist highlighting that big
sized requests are 1.5-2 times faster for POHMELFS than for NFS :)
and are also limited by the underlying filesystem.
One can note that
NFS random reading results are actually better than the local filesystem behaviour,
and very noticeably so. Why does a local filesystem behave worse than
the same filesystem mounted via NFS in random reading?
I believe that's because in the network case we actually have double buffering:
on the client, where the most active pages are in RAM, and on the server, where
readahead populated pages which are not active (active pages are being
read from the client's cache, so they will be evicted from the server's page cache,
since the client will not try to read them from the server). Those server pages
which are not active currently will be accessed soon by the client, when it reads
the next portion of the random data, and that will be a very fast access to RAM.
So we have a really good caching scheme, where the most actively used pages are
in client RAM and are flushed to disk on the server, and instead the server populates
other less active pages via readahead.
This reading behaviour is just a result of yet not completed VFS callback implementation
of the POHMELFS. With ->readpages() in place it will be faster than
NFS even in this bench. Also POHMELFS has multiple-server parallel read balancing and
simultaneous writing to them, but there are no results yet.
I already created a mental model of the optimized read and write transactions (based
on memory pools for maximum OOM-robustness and small memory usage overhead), so
in a day or two it will be implemented in code.
Stay tuned, now it's time for excellent POHMELFS results!
Screaming, drinking, cheering...
Although I was not there today, since
some friends became ill and others moved to their tasks,
it still was really cool (yesterday).
My congratulations to the team and department itself :)
Match of the century - 24 hours of football at my
Alma Mater.
Today the Department of Quantum and Physical Electronics (which I finished
I do not even remember when, but I started studying at MIPT 10 years ago) plays against
the axes, or by their other name: the Department of General and Applied Physics.
After about half of the match we were ahead by 18 goals (31:13).
This happens once per year and usually I try to get to MIPT and
watch part of the game, like this year. Tomorrow I will go there too of course
to meet old friends and celebrate the win!
I promised to publish POHMELFS parallel processing results yesterday,
even if they are miserable. Unfortunately there are no interesting results
at all. In the released version POHMELFS is 32bit only, since it does
not have a special ->open() callback which forces files to be opened
with the O_LARGEFILE flag to support sizes of more than 4Gb (actually only 2Gb,
since the kernel uses a signed size there, which is only 31 bits large), and
the superblock maximum size is set to 32 bits.
So all 32 bit results are not very interesting: claiming a 2Gb/s random
read speed is really stupid, since all reading happened from the cache.
While results with more than 2Gb are... Let me first show you how XFS and Ext3 behave
in case of random writes.
A short preface.
Hardware used in testing: 4-way Intel E7520 system (two logical and two physical CPUs)
3Ghz 32 bit Xeons with 8gb of ram, Adaptec AIC7902 Ultra320 SCSI adapter with
SEAGATE ST3300007LC 10k rpm 300 Gb testing disk. Its linear reading speed
is about 90 MB/s. Dmesg:
Kernel version is 2.6.25 (and 2.6.24 for the first ext3 test).
I used two such machines as servers for iozone
read/reread, write/rewrite and random read/write testing. File size is limited to 8Gb only,
since it is the only interesting fair case, record size varies from 8Kb to 1Mb.
Before I started 8Gb POHMELFS testing, I decided to check how local filesystems behave in such a scenario.
XFS was tuned this way: (mkfs.xfs -d agcount=75 -l size=64m /dev/sdc1;
mount -o logbufs=8,nobarrier,noatime,nodiratime,osyncisdsync /dev/sdc1 /mnt/)
Ext3 was created and mounted with default options on machine with only 4Gb of RAM though.
So, testing.
Here is a results table from iozone (before I interrupted it) with read/reread, write/rewrite
and random read/write tests for XFS (either default, or tuned like on link above).
Do you really want to know ext3 speed? Pregnant kids and women should skip the next paragraph.
I interrupted the test after almost 2 (!) hours of random writing
of an 8Gb file with 8Kb records on default ext3. The test was not completed and I do not really
know its performance (note that this machine has only 4Gb of RAM, other hardware details were
described above), but it will be less than 1 MB/s.
Ext4 behaves much better in this aspect (mount options: rw,noatime,data=writeback,extents):
I hate laziness, but sometimes drop into that hole... So the last couple of days
I just stupidly wasted my time (well, I read about Lisp and failed to find a GTK binding for CLISP,
made some code and a kernel bug fix, but that does not count).
Today laziness started to be really boring, so I made some small progress in
POHMELFS
parallel processing.
It got the ability to send transactions to multiple servers by default and balance reading
between them (so far it always reads from the first server and in case of error switches
to the second, but it is trivial to change). This was implemented via special routes for each
transaction, which are stored per network state, so if one of the servers did not answer,
we would not resend data to others. It also makes trees smaller, which should allow faster
reading in case of lots of pending writing transactions.
The code is in the testing stage currently. I will complete read balancing tomorrow and test it against
multiple servers on different machines, with data placed on disk, so that random access
would be slow. Having two servers I expect to get a linear speed increase. If the test turns out to be disk
IO bound, it is possible to add multiple servers on the same machine, so that each server would
run on its own disk (I have two reasonably fast SCSI disks on each testing machine).
Results will be published here of course (well, even if they are miserable :).
#!/usr/bin/clisp
(defun f (m)
  (do ((k 0 (1+ k))
       (c 0 n)
       (n 1 (+ c n)))
      ((eql k m)
       (format t "~r" c))))
(f 317)
Guess the result: seven hundred and ninety-three vigintillion, five hundred and ninety-one novemdecillion, four hundred and seven octodecillion, eight hundred and four septendecillion, one hundred and fifty-one sexdecillion, nine hundred and twenty-six quindecillion, five hundred and ninety-three quattuordecillion, seven hundred and ninety-three tredecillion, forty-two duodecillion, one hundred and twenty-six undecillion, eight hundred and ninety-one decillion, one hundred and twenty-eight nonillion, eight hundred and nineteen octillion, six hundred and ten septillion, seven hundred and ten sextillion, one hundred and forty quintillion, one hundred and forty-five quadrillion, thirty-seven trillion, nine hundred and fifty-eight billion, two hundred and seventy-three million, seven hundred and seventy-seven thousand, three hundred and ninety-seven
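For readers who do not speak Lisp, the same loop can be sketched in Python. It is meant as a direct translation of the do loop above (c and n are updated in parallel on each step), so it computes the m-th Fibonacci number; spelling the number out in words, as the ~r directive does, is left out.

```python
def f(m):
    """Return the m-th Fibonacci number, mirroring the Lisp do loop."""
    c, n = 0, 1
    for _ in range(m):
        # Parallel update, like the step forms in (do ...): both right-hand
        # sides are evaluated with the old values of c and n.
        c, n = n, c + n
    return c

print(f(10))    # small sanity check
print(f(317))   # the number spelled out above, in digits
```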
Irish Tullamore Dew helped this
POHMELFS
release to see the light.
Short changelog:
Full transaction support for all operations (object creation/removal, data reading and writing).
Data reading transactions are not optimal yet and will be improved in the next release (although fast).
Data and metadata cache coherency support. More details on how this is implemented
one can find in appropriate
section.
Transaction timeout based resending. If given transaction did not receive reply after specified
timeout, transaction will be resent (possibly to different server).
Switched writepage path to ->sendpage() which improved performance and robustness
of the writing.
Preliminary support for parallel data processing. Code to write data to multiple servers in parallel
and balance reading between them was imported, but is not used right now.
Fair number of bugfixes.
Next release is scheduled for the beginning of the next month, and will likely include following features:
Improved reading transactions.
Server redundancy extensions (ability to store data in multiple locations according to regexp rules,
like '*.txt' in /root1 and '*.jpg' in /root1 and /root2).
Client parallel extensions: ability to write to multiple servers and balance reading between them.
Code was imported to the current version, but not enabled yet.
Client dynamical server reconfiguration: ability to add/remove servers from working set by server command
and from userspace.
Start generic server distribution development.
As usual one can grab the latest source from
archive or
GIT tree.
But no, it is scheduled for tomorrow because of the very interesting way I decided
to implement reading transactions. The way it works right now is quite miserable,
so I want to clean things up and make a really good patch.
Page reading code will create a single transaction for a bunch of pages and will schedule
the next one if pages are not yet received instead of waiting for the transaction to be completed,
and only wait at the very end (if needed). With the addition of
async copy
from the receiving kernel thread into reading userspace via copy_to_user() (in todo),
this will become the fastest possible way of reading over the net, I think.
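The batching part of that plan can be pictured with a tiny sketch. It only groups page indexes into transactions of a bounded size; the function name and the limit are illustrative assumptions, and the asynchronous scheduling and waiting are not modelled.

```python
def batch_pages(page_indexes, max_pages):
    """Group page indexes into read transactions of at most max_pages each."""
    if max_pages < 1:
        raise ValueError("max_pages must be positive")
    pages = list(page_indexes)
    return [pages[i:i + max_pages] for i in range(0, len(pages), max_pages)]

# One transaction per page (the current, slow way) versus batched reads.
per_page = batch_pages(range(64), 1)
batched = batch_pages(range(64), 16)
print(len(per_page), "transactions versus", len(batched))
```

Fewer transactions means fewer request/reply round trips for the same amount of data.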
So far changelog contains following items:
Full transaction support for all operations (object creation/removal, data reading and writing).
Data reading transactions are not optimal yet and will be improved in the next release.
Data and metadata cache coherency support. More details on how this is implemented
one can find in devel
section.
Transaction timeout based resending. If given transaction did not receive reply after specified
timeout, transaction will be resent (possibly to different server).
Switched writepage path to ->sendpage() which improved performance and robustness
of the writing.
iput() is a very tricky call in Linux VFS:
besides the fact that it drops the inode when its reference counter
reaches zero, it also waits until all associated pages are
flushed to storage.
POHMELFS uses a single thread per network state (network connection structure),
which only reads async replies from the server, so it is possible
that a reply which requires iput() (for example a create command
reply) will happen in parallel with object removal, so the inode will be deleted,
but not yet freed. When the reply is received and iput() is called,
it will try to free the inode and wait until all pages associated with its mapping
are synced. But page sync happens on reply to another command (consider for
example several writeback transactions), which can not be processed, since the thread
is waiting for them to be completed. This problem can not be fixed by introducing
multiple threads, since each one can be in exactly the same situation simultaneously.
In turn we should not allow grabbing the inode and freeing it in the receiving path.
This is ok for writeback transactions, since the inode can not be freed until pages are synced,
so just by holding pages we are able to avoid the lockup, but object creation for empty files
or directories does not have pages attached, so they have to be synced with a special
transaction. There still can be a problem with an empty file though: some pages can be
attached and it can be removed while the system waits for the creation transaction to complete,
but actually we do not need to know about that. We should not grab the inode at all,
since the transaction already contains all needed info, namely the inode number, so we can look up
the inode (if it still exists) and mark it as created without the need for lock-prone grab/put.
This bit took me the last three days, during which POHMELFS moved to non-blocking receiving and
timeout-based sending (and returned back), it got a scanning 'watchdog' which resends transactions
if they were not acked after some time and eventually drops them if they still do not get
a reply, POHMELFS got a couple of new operations supported and likely something else in addition to the existing set
of features implemented to date (full transaction support for all operations
and the data and metadata coherency protocol were added for the next release).
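The scanning watchdog can be modelled in a few lines of userspace code. This is a sketch under stated assumptions: the class names, the timeout and attempt limits and the send callback are all invented for illustration and do not mirror the kernel implementation.

```python
import time

class Transaction:
    def __init__(self, tid, payload, now):
        self.tid = tid
        self.payload = payload
        self.sent_at = now
        self.attempts = 1

class Watchdog:
    """Scans pending transactions, resends stale ones, drops hopeless ones."""

    def __init__(self, send, timeout=5.0, max_attempts=3):
        self.send = send              # callable(transaction); hypothetical transport hook
        self.timeout = timeout        # seconds without an ack before a resend
        self.max_attempts = max_attempts
        self.pending = {}             # tid -> Transaction

    def submit(self, tid, payload, now=None):
        now = time.monotonic() if now is None else now
        trans = Transaction(tid, payload, now)
        self.pending[tid] = trans
        self.send(trans)

    def ack(self, tid):
        self.pending.pop(tid, None)

    def scan(self, now=None):
        """Resend timed-out transactions; return ids dropped for good."""
        now = time.monotonic() if now is None else now
        dropped = []
        for tid, trans in list(self.pending.items()):
            if now - trans.sent_at < self.timeout:
                continue
            if trans.attempts >= self.max_attempts:
                dropped.append(tid)
                del self.pending[tid]
                continue
            trans.attempts += 1
            trans.sent_at = now
            self.send(trans)
        return dropped
```

A real client would call scan() periodically and could pick a different server inside the send callback.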
New release is scheduled for the end of the week, and there is no readpage transaction support yet...
So, stay tuned!
$ clisp
  i i i i i i i       ooooo    o        ooooooo   ooooo   ooooo
  I I I I I I I      8     8   8           8     8     o  8    8
  I  \ `+' /  I      8         8           8     8        8    8
   \  `-+-'  /       8         8           8      ooooo   8oooo
    `-__|__-'        8         8           8           8  8
        |            8     o   8           8     o     8  8
  ------+------       ooooo    8oooooo  ooo8ooo   ooooo   8
Welcome to GNU CLISP 2.42 (2007-10-16)
Copyright (c) Bruno Haible, Michael Stoll 1992, 1993
Copyright (c) Bruno Haible, Marcus Daniels 1994-1997
Copyright (c) Bruno Haible, Pierpaolo Bernardi, Sam Steingold 1998
Copyright (c) Bruno Haible, Sam Steingold 1999-2000
Copyright (c) Sam Steingold, Bruno Haible 2001-2007
Type :h and hit Enter for context help.
[1]> (defun test-func () (format t "It's a test func"))
TEST-FUNC
[2]> (test-func)
It's a test func
NIL
[3]> (exit)
Bye.
This one has, imho, the least ugly command line... And I'm against SLIME
and Emacs. Also tried SBCL, GNU CL and something else, but likely CLISP will
stay.
Instead of sleeping (it will be time to wake up soon in the Moscow slums) or at least
catching
POHMELFS
bugs (the last several days were solely devoted to this task and a fair
number of them were fixed, as well as some interesting features introduced (probably new),
so likely a new release will see the light later this week),
I'm drinking some beer and making first steps into this. So far it looks quite new and probably
interesting, but every introductory article about it I read said that if you are over 25 years old,
it is likely impossible to change something in your perception. I am over, but I think that
it will be fun and will probably become a really good tool for me.
The more I think about it, the more interesting tasks (as well as those I'm already thinking
about, like
CAPTCHA) I find...
It was a rather simple task due to async event processing support.
Each time a client creates, reads or writes an object on the server, information about
its interest is stored on the server. When any other client updates the same
object (like changing attributes or writing data), all interested clients
get notifications with new data (new attributes, or in case of writing
possibly a new size and a flag showing which page has to be fetched from the server,
since it is not valid anymore). Writing happens during writeback as before,
so a command like "echo Some_message > /mnt/file" immediately
syncs the size of the file to zero and after some time writes the actual data there,
when the system decides to start writeback.
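The interest tracking described here can be sketched as a small in-memory model. The class and method names are hypothetical and the "network" is just a per-client list of queued notifications; it shows the bookkeeping only, not the POHMELFS wire protocol.

```python
from collections import defaultdict

class CoherencyServer:
    def __init__(self):
        self.interested = defaultdict(set)   # object id -> ids of interested clients
        self.outbox = defaultdict(list)      # client id -> queued notifications

    def touch(self, client, obj):
        """Called on create/read/write: remember that client cares about obj."""
        self.interested[obj].add(client)

    def drop(self, client, obj):
        """Called when a client drops the inode locally."""
        self.interested[obj].discard(client)

    def update(self, client, obj, change):
        """Record an update from client and notify every other interested client."""
        self.touch(client, obj)
        for other in self.interested[obj]:
            if other != client:
                self.outbox[other].append((obj, change))

server = CoherencyServer()
server.touch("client-a", "/mnt/file")
server.touch("client-b", "/mnt/file")
server.update("client-a", "/mnt/file", {"size": 0})
print(server.outbox["client-b"])
```

The writer itself is skipped, since it already has the new state.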
Also ported all but one of the commands to the transaction mechanism, which means
they all will be resent if the currently active network connection goes down.
Although most of the commands are not synchronous, and thus will not be resent after
timeout, this can be trivially changed if there is major demand for that.
Only reading has not yet been ported to the transaction model, which is the next task
to complete. These transactions have to be synchronous, since we do want to read
data, while we do not actually care about full directory content.
These changes have to be seriously tested and all problematic places resolved.
For example they slow metadata operations noticeably, since now the system
sends a message each time a new object is created: kernel archive
untarring now takes about 5 seconds against the previous 2-3 including sync
on a 4-way machine with 8Gb of RAM. Although it is still not comparable to the 30+ seconds
for async NFS, it has to be investigated further.
After full move to transaction model and cache coherency testing (that model
may be not complete for some usage, since locks are not yet supported),
POHMELFS
will make its first steps into distributed area...
So, the server now contains all metadata information about an object updated on the client,
pohmelfs_setattr() is synchronous for remotely read inodes
and for already synced inodes, created originally locally. It does nothing
if the object is not yet synced to the server, since syncing will provide that info
itself.
The only missing thing is to asynchronously broadcast that data to other clients, which requires
creating a cache of objects a given client is interested in. Each client will be automatically
added into the group of interest when it looks up an object, so when an attribute for the given object is being
set, the update will be sent to interested parties. A client will be dropped from the group of interest when
it drops the appropriate inode locally (which will force sending a special message).
I installed the water system for the shower and thought to install
the whole cabin, but found (as usual) that I do not have drills
for the ceramic tiles. So, that will be postponed for a while.
Also I expect glue for the ceramic tiles to be delivered today (as well
as brick tiles), so that I can start covering the hall with granite. Although
I'm a bit tired after the water system installation, which took the major
part of the day.
It is actually a simple task, but only when you have simple access to
all parts. Now imagine a 10 cm thick wall, where you managed to drill
two holes, each one about 2 cm in diameter (less than two fingers thick).
A meter below and to the left there is a bigger hole for the plumbing (about
15x15 cm). The water system hatch is located 2.5 meters to the right of this.
The task is to put thin water tubes from the water hatch to the two small holes,
with the splitter installed near the bigger plumbing hole. Without
direct access to any tube (you can only feel it, can not see it) you have to
connect them (I also need to mention that it is quite hard to put
both hands into the bigger plumbing hole) via different connectors
using spanners.
I've completed the task, although not sure if it is really safe. That was
challenging, and power sucking, so probably I will just slack this evening
and hack some bits of captcha.
Will also cover my table with the last colour level (yes, yes, it is still
not done) and/or fill second varnish layer for x-shelves (they look really cool
after mordant and varnish)...
After a healthy discussion
started after my announcement of the second POHMELFS release,
it's time to highlight the main ideas settled in the thread.
First, POHMELFS will move towards parallel distributed filesystems, while still
being very good as a network filesystem. In particular, that will include the ability
to read data from any of the connected servers (not only from the currently active one,
as it is done right now); writing will happen to all connected servers simultaneously
(and the transaction will be committed after all servers have returned a completion acknowledge).
The protocol will be extended to support dynamic addition and removal of servers to/from
the currently connected group. Probably there will be some kind of status messages for servers
(i.e. going offline, do not send me data, or I'm becoming slow, do not read from me
and so on). It will be done in addition to cache coherency messages (which I have yet to implement;
because of other tasks this was a bit postponed, probably to the weekend), which
will include two types of requests: page invalidation and inode update (that will
also mean that POHMELFS will start supporting attributes (maybe even extended),
right now it doesn't :). Such a cache coherency protocol should scale better
than classical MOSI (and its derivatives) and particularly better than what the pNFS spec
provides (leases to operations for some servers), since it is still possible to work in
parallel with the same file, especially without any overhead when data processing
does not cross different client boundaries, but it has to be tested in practice.
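The "write to everyone, commit after all acks, read from anyone" rule from the first point can be sketched like this. The names are hypothetical and the round-robin read policy is only one possible choice, assumed here for illustration.

```python
class WriteTransaction:
    """A write sent to every connected server; committed once all have acked."""

    def __init__(self, tid, servers):
        self.tid = tid
        self.waiting = set(servers)   # servers that have not acknowledged yet

    @property
    def committed(self):
        return not self.waiting

    def ack(self, server):
        """Record a completion acknowledge; return True once fully committed."""
        self.waiting.discard(server)
        return self.committed


def pick_reader(servers, request_no):
    """Pick a server for a read request (round-robin is an assumed policy)."""
    return servers[request_no % len(servers)]


servers = ["server-1", "server-2"]
trans = WriteTransaction(1, servers)
trans.ack("server-1")
print(trans.committed)   # still waiting for server-2
trans.ack("server-2")
print(trans.committed)
```

Because a write completes only after the slowest server answers, write latency follows the slowest member of the group, while reads can be spread over all of them.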
The POHMELFS server will be extended to support distributed facilities. Very likely it will
be some kind of PAXOS algorithm, although probably in a very limited mode for the beginning.
So far it will be really simple, so that I could touch all its corner cases and find
the optimal development strategy.
All client extensions are rather not that complex, although not always trivial,
so that should not take too much time, so probably you will get something interesting
soon.
Server extensions will be a bit slower, since I will start essentially from the distributed
system ground and gradually move upstairs.
Irish John Jameson (6 years of experience, really good stuff)
brings us this new POHMELFS release.
Main features include:
Fast transactions. The system wraps all writes into transactions, which will
be resent to a different (or the same) server in case of failure.
Failover. It is now possible to provide a number of servers to be used in round-robin
fashion when one of them dies. The system will automatically reconnect to the others and send
transactions to them.
Performance. Super fast (close to wire limit) metadata operations over the network.
Courtesy of the writeback cache and transactions, the whole kernel archive can be untarred in 2-3 seconds
(including sync) over a GigE link (wire limit! Not comparable to NFS).
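The round-robin failover can be illustrated with a small Python sketch. FailoverClient and the deliver callback are hypothetical names made up for this example; the real client works on sockets inside the kernel.

```python
class FailoverClient:
    """Hypothetical client that walks a server list round-robin on failure."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.current = 0  # index of the currently active server

    def send(self, transaction, deliver):
        # deliver(server, transaction) is an assumed callback that returns
        # True on success and False when the server is unreachable.
        for _ in range(len(self.servers)):
            server = self.servers[self.current]
            if deliver(server, transaction):
                return server
            # The current server died: move to the next one and resend.
            self.current = (self.current + 1) % len(self.servers)
        raise ConnectionError("no server accepted the transaction")
```

The client stays on whichever server accepted the last transaction, so a dead server is only retried after the list has wrapped around.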
The near-term roadmap includes:
Full transaction support for all operations (only writeback is guarded by transactions currently,
default network state just reconnects to the same server).
Data and metadata coherency extensions (in addition to existing commented object creation/removal messages).
Server redundancy.
One can check out the POHMELFS homepage
for more details. You can download the latest release (against the 2.6.25 kernel tree) from
the archive or
the GIT tree.
I went to the development shop and got zillions of stuff there,
including various colours for the kitchen ceiling and the room's ceiling
plinth, ordered brick-like tiles for the kitchen (about one third of
the walls there will be covered with bricks), got some tools
(like a rubber hammer for the tiles), ordered glue for the
ceramic granite for the hall, and also got a shower (yummy, my shower
cabin was delivered today too) and related stuff for the water
system installation.
By the original plan I wanted to install the shower cabin today, but taking
into account the current time, it is too late for loud work,
so I will proceed with my table instead. It will be completed
today, or call me a ... whatever you like (out of curiosity,
is there an English dictionary of indecent words? I know a Russian
one exists).
If things move fast, I will also cover my X-shelves with varnish,
and probably will take some photos...
With new transactions and the new waiting mechanism (see below)
the system now untars the whole kernel tree in less than 3 seconds
over the GigE link (including the subsequent sync, which
always takes less than a second), while async NFS (the remote side is tmpfs in both cases)
does that in a bit more than 30 seconds.
In addition POHMELFS write speed is 125 MB/s (wire limit) vs. less
than 90 MB/s in NFS (dd from /dev/zero
with 1 MB block size and 1000 blocks).
That's what I call a good result.
The transaction mechanism invoked in the writeback path is now completely
async too, i.e. it does not wait until the remote side confirms that the
transaction was received and processed. Writeback does not drop a
transaction after the sending function returns though; instead it stores it
in the in-flight storage and proceeds with the next one.
A transaction can accumulate up to 90 pages in a single frame.
When a reply is received, the async thread searches for the given transaction and
completes it (unlocks the page, although that could be done in writeback,
since the page is being copied, clears the writeback bits, drops the transaction from the
appropriate radix tree and drops its reference counter). If a transaction
was not sent due to some error, or if the server returned an error for it, it will be resent
to different servers. Since the original writeback path does not know about
in-flight transactions anymore, any timeout has to be checked by a
dedicated thread (or workqueue), which will detect transactions that are too old
(by simply checking them from the beginning, since each new transaction has
an increased id) and resend them to the remote servers.
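A rough model of the in-flight storage and the timeout scan, in Python instead of kernel C. The InFlight class and its method names are invented for illustration, and an ordered dictionary takes the place of the radix tree; the point is only that increasing ids let the scan stop at the first transaction that is still fresh.

```python
from collections import OrderedDict


class InFlight:
    """Hypothetical in-flight transaction store keyed by increasing id."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.next_id = 0
        self.pending = OrderedDict()  # id -> (send_time, payload), oldest first

    def add(self, payload, now):
        # Writeback path: store the transaction and move on without waiting.
        tid = self.next_id
        self.next_id += 1
        self.pending[tid] = (now, payload)
        return tid

    def complete(self, tid):
        # Receiving thread: called when the server's reply arrives.
        return self.pending.pop(tid, None) is not None

    def expired(self, now):
        # Timeout thread: ids grow with time, so stop at the first fresh entry.
        old = []
        for tid, (sent, _payload) in self.pending.items():
            if now - sent < self.timeout:
                break
            old.append(tid)
        return old
```

Resending is left out of the sketch: a resent transaction would need a fresh send time (or a new id) to keep the oldest-first ordering valid.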
There is a small problem though: if an object is larger than a single
transaction can accumulate (90 pages), it will be split into several
transactions, where the first one will contain the object creation command
and some data to be written, while the others will contain only data.
If the server runs multiple threads per client (the default is one though),
it is possible that a transaction other than the first will be processed first,
so the server will try to write data into a non-existent file and that transaction
will fail. There are two ways to fix this issue: either wait in writeback
on the client until the creation transaction is completed and then send all the others
as described above, or add the creation command to every subsequent transaction
until the object is created on the server (a special bit is set on the local inode
in that case). The latter is likely the better option.
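The second option (creation command in every transaction until the object exists on the server) could look like this in a simplified Python model. split_object() and the dictionary layout are assumptions made for the example; only the 90-page limit comes from the text above.

```python
MAX_PAGES = 90  # per-transaction page limit quoted in the text


def split_object(pages, created_on_server=False):
    """Split a list of pages into transactions of at most MAX_PAGES pages.

    Every transaction carries a 'create' flag as long as the (assumed)
    local inode bit says the object is not yet created on the server.
    """
    transactions = []
    for start in range(0, len(pages), MAX_PAGES):
        transactions.append({
            "create": not created_on_server,
            "pages": pages[start:start + MAX_PAGES],
        })
    return transactions
```

With this layout it does not matter which transaction a server thread picks up first, at the price of the server having to treat a creation command for an already existing object as harmless.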
POHMELFS
just switched to faster transactions allocated one-by-one with
even smaller overhead (although it does not use kernel_sendpage()
for page sending yet, it copies data).
The system does not serialize once after all transactions are completed
(it waits after each one), but with the
new transaction allocation it is 1.5 times faster: 98MB/s vs. 64MB/s.
Note that without waiting for transaction completion it gets the full wire speed of 125MB/s
with 1500 byte MTU. And that is with highmem pages and thus a slow kmap()
of each one, and unmap after completion. I do not use ->sendpage()
since it would force me to split a proper set of iovecs into mixed
calls of kernel_sendmsg() and kernel_sendpage(),
which I want to avoid so far. Now it is (again) faster than NFS, but I want to move further.
So, the solution is rather trivial: wait until several transactions
are completed. The whole infrastructure is already there: in-flight transaction
storage, per-transaction completion and destruction callbacks, proper reference counting
and async completion.
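To show why waiting for several transactions at once helps, here is a toy Python counter. The Window class and its limit parameter are hypothetical; it only counts how often the sender has to stop and wait, it does not do any I/O.

```python
class Window:
    """Hypothetical sender that waits only when `limit` transactions are in flight."""

    def __init__(self, limit):
        self.limit = limit
        self.in_flight = 0
        self.waits = 0  # how many times the sender had to block

    def send(self):
        if self.in_flight == self.limit:
            self.wait_all()
        self.in_flight += 1

    def wait_all(self):
        # Stand-in for sleeping until all in-flight transactions are completed.
        self.waits += 1
        self.in_flight = 0
```

With a limit of 1 this degenerates into waiting after each transaction; a larger limit trades a few more in-flight transactions for far fewer stalls.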
Still, only writing transactions are used (i.e. reading, lookup and others will not be
redirected to different servers).
There are some bugs of course, but that's the first development version after all.
Just in case you notice some delay in filesystem or network development,
the reason is simple. I decided to devote some time to a new captcha cracking problem, namely these
ones:
The reason is simple: I want to test my captcha breaking
ideas on something which is real.
And also I was frustrated by their abuse team, which was not able to
fix the spam filter based on the messages I sent them (bounce and original, just as requested).
It is pretty unlikely though that something will appear anytime soon, but I do want to test some ideas...
POHMELFS
just got full transaction support. So far it is only used in the ->writepages()
callback, which is invoked by the writeback mechanism. POHMELFS uses lazy transaction support,
namely it waits after each transaction, which includes a header and the data to be written for at most
14 pages, 14 is a magic number of pages, which corresponds to struct