Zbr's days.

Wed, 30 Apr 2008

B.B. King bar.



A bit more in gallery.

Cool place with interesting people... It was a bit loud and the band was hard to see, but it was fun nevertheless.

/life :: Link / Comments (0)


Tue, 29 Apr 2008

New and old toys.

Real enlargement...



The only thing missing is photo skills...
But I work on it.

After spending quite a lot of money I suddenly decided that it is a really good feeling to have what you want, no matter what the price is. I cannot afford some things, but looking really closely I've decided that having lots of smaller really cool stuff is better (for now) than saving for a (really) long time to get something really big. I already did that, now it's time for smaller every-day fun :)

So, no bike for now. I was torn between Honda CBR 400-600, BMW K1200 or around, or classical chopper models, no Harley of course, but... Anyway I'm not able to register it and get bike numbers, and I do not have a bike driving license.
The same applies to cars (what I already had I really do not want to get again, but what I want requires some). So, my simple stuff.

/other :: Link / Comments (2)


POHMELFS transactions and ACID.

POHMELFS just got initial transaction support and the ability to connect to multiple master servers. Master servers are the ones that tell the client where data is placed. Essentially they are the same servers that may provide that data, but main server addresses are provided at pre-mount configuration time, while data server addresses will be provided by the main servers at run time (if the main ones do not want to return the data themselves).
Main servers can also be used to request data in parallel, or to switch between them when the currently active one has failed.

So far this is theory, and practice is rather miserable: the POHMELFS client connects to multiple servers, but works with only one. Errors are detected and a switch to the next server could happen, but it is not done, since there is a serious problem with this approach: neither server nor client supports ACID for data being written.

Here we come to the introduction of transactions: a transaction is multiple commands wrapped into a single atomic operation. In case of an error during a transaction write, the whole transaction will be resent to a different server (or to the same one after reconnect). This is rather simple (although transactions are not supported by the server yet, and the client does not wrap any command into one yet), but it still does not solve the ACID problem.

Since POHMELFS has a writeback cache, its writes do not reach the server directly; instead, writeback is scheduled by the system, which then starts writing pages to the server. The current POHMELFS implementation uses only the ->writepage() method, which is invoked for each page. It does not require the server to return an explicit acknowledgement that the page was written; instead it relies on the underlying transport protocol (like TCP) to handle guaranteed delivery. Data can therefore be queued somewhere when the connection is dropped, and the POHMELFS client does not know whether it was really written or not. A per-page acknowledgement can fix the ACID problem really trivially, but that may (or may not) end up with severe performance degradation. As a better solution I am considering my own ->writepages() implementation, where each transaction contains multiple pages to be written, so fewer explicit acks have to be received from the server and the performance degradation is smaller. In case of failure the whole transaction has to be resent to a different server, of course.
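The resend-on-failure idea can be sketched in a few lines of userspace Python. This is only an illustration of the scheme described above, not POHMELFS code: Transaction, FakeServer and send_transaction() are made-up names, and a raised ConnectionError stands in for a dropped connection.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    """A batch of page writes that is acknowledged as a whole."""
    pages: dict = field(default_factory=dict)  # page index -> bytes

    def add_page(self, index, data):
        self.pages[index] = data


class FakeServer:
    """Hypothetical stand-in for a server; alive=False simulates a dead link."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.stored = {}

    def commit(self, trans):
        if not self.alive:
            raise ConnectionError(self.name)
        # Apply all pages, then return a single ack for the whole batch.
        self.stored.update(trans.pages)
        return True


def send_transaction(trans, servers):
    """Try each server in turn and return the one that acknowledged."""
    for server in servers:
        try:
            if server.commit(trans):
                return server
        except ConnectionError:
            continue  # resend the whole transaction to the next server
    raise RuntimeError("no server acknowledged the transaction")
```

One ack per transaction instead of one per page is the whole point: the number of round trips drops with the number of pages batched together.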

The server does not support data mirroring to multiple root directories yet, so actually not too much of the above description is implemented, but transactions and multiple server connections exist, and soon the client will get support for reconnection and proper transaction processing.

/devel/fs :: Link / Comments (0)


Sun, 27 Apr 2008

Detailed POHMELFS roadmap.

Transaction support will be added to the kernel client. It is possible that it will be exported to userspace (as synchronous write-through operations).
The kernel client will also get locking support (fcntl() locks first, then more fine-grained ones). This is different from byte-range read/write locking, which will be done on the server. It is possible to export that to the client too (it will actually be part of the POHMELFS locking API, which will be used for fcntl() as well).
The simplest case is data invalidation in the client's cache (i.e. if one client issued a writeback for a given page, it has to be marked as not up-to-date on other clients). Likely it will be done at the beginning of next week. So far it will be the last cache coherency item. The task is really simple because of the asynchronous processing of all data in the kernel client. The server will have to store not only the index of directories to watch for object changes, but also a per-object set of pages read by each client, so that the appropriate users can be notified that a page is no longer up-to-date and has to be refreshed.
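A minimal sketch of the per-object page bookkeeping, in Python for brevity. PageTracker and its methods are hypothetical names used only to show the data structure; the real server may organize this quite differently.

```python
from collections import defaultdict


class PageTracker:
    """Server-side record of which clients read which pages of an object."""

    def __init__(self):
        # (object id, page index) -> set of client ids that cached the page
        self.readers = defaultdict(set)

    def note_read(self, client, obj, page):
        self.readers[(obj, page)].add(client)

    def note_writeback(self, writer, obj, page):
        """Return the clients that must mark the page as not up-to-date."""
        stale = self.readers[(obj, page)] - {writer}
        # After the writeback only the writer holds a current copy.
        self.readers[(obj, page)] = {writer}
        return stale
```

A client that re-reads the page after the invalidation is simply registered again through note_read().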

The userspace server will get parallel and distributed facilities. Parallel processing will be done first by allowing lookup and readdir callbacks to return information about objects that contains the address of the server where the object is actually located, so that reading, writing or status checks can be done there. So far the whole file will be stored on one server, i.e. in the first implementation it will not be possible to store half of a file on one server and the other half on a different one. That can be added later.
The server will get the ability to store data in different root directories (so that the client is not able to see shadow copies). There will be simple regexp policies for data placement, for example '*.jpg' has to be stored in root1 and root2, '*.txt' only in root1, and so on. Each root directory can be local or remotely mounted; userspace does not care about this.
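Such a placement policy could look roughly like the sketch below. It assumes shell-style patterns like the '*.jpg' example (matched with Python's fnmatch rather than real regular expressions), a first-match-wins rule and a default root; all three are assumptions made for illustration, the roadmap does not specify them.

```python
from fnmatch import fnmatchcase

# Hypothetical policy table; the first matching pattern wins.
POLICIES = [
    ("*.jpg", ["root1", "root2"]),
    ("*.txt", ["root1"]),
]
DEFAULT_ROOTS = ["root1"]  # assumed fallback, not taken from the roadmap


def roots_for(name):
    """Return the list of root directories an object should be stored in."""
    for pattern, roots in POLICIES:
        if fnmatchcase(name, pattern):
            return roots
    return DEFAULT_ROOTS
```

With this table roots_for("photo.jpg") selects both roots, so the file is mirrored, while a text file lands only in root1.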

The main part is already completed: I have a vision of what the system has to provide and what it will look like, so with good design of the low-level mechanisms it becomes a doable task in a predictable timeframe.

Stay tuned!

/devel/fs :: Link / Comments (0)


Fri, 25 Apr 2008

POHMELFS release.

Vodka and beer together are glad to provide a new POHMELFS release for you.

POHMELFS stands for Parallel Optimized Host Message Exchange Layered File System.

This is a high performance network filesystem with local coherent cache of data and metadata.
Its main goal is distributed parallel processing of data. Network filesystem is a client transport.
In my benchmarks the POHMELFS protocol was superior to NFS in lots of operations (if not in all of them yet, then that is in the roadmap).

Basic POHMELFS features:

  • Local coherent (notes 1 and 2) cache for data and metadata.
  • Completely async processing of all events (hard and symlinks are the only exceptions) including object creation and data reading.
  • Flexible object architecture optimized for network processing. Ability to create long paths to an object and to remove arbitrarily huge directories in a single network command.
  • High performance is one of the main design goals.
  • Very fast and scalable multithreaded userspace server. Being in userspace it works with any underlying filesystem and is still much faster than the async in-kernel NFS one.
Roadmap includes:
  • Server extension to allow storing data on multiple devices (like creating mirroring), first by saving data in several local directories (think about server, which mounted remote dirs over POHMELFS or NFS, and local dirs).
  • Client/server extension to report lookup and readdir requests not only for local destination, but also to different addresses, so that reading/writing could be done from different nodes in parallel.
  • Strong authentication and possible data encryption in the network channel.
  • Extend client to be able to switch between different servers (if one goes down, client automatically reconnects to second and so on).
  • Async writing of the data from receiving kernel thread into userspace pages via copy_to_user() (check development tracking blog for results).
One can grab sources from archive or check a homepage.

Enjoy!

P.S. Off to listen to blues and drink a beer.

/devel/fs :: Link / Comments (0)


Solaris vs 'Have you ever kissed a girl?'

As started by Ted Tso.

We forgot the answer:

No, but I can kiss the sky
He was 22 in those days? :)

From my developer's point of view, Solaris sucks first of all because of its contributor agreement. There is no way I can devote my time to an organization which will get my work for free and do whatever it wants with it without my opinion as author (actually the same applies to BSD-style licenses to some degree. Yes, that can be trivial greediness).

It is not _that_ bad an OS, but modern medicine knows no practice of awakening the dead.
Solaris has its niche, but that's it. Linux can be tuned to be faster in those areas (or if it has some bugs, they can be fixed), but that does not matter: people who make decisions already know what they want.

The pseudo-openness of Solaris is just marketing noise. Those who want to hear it will hear just that, no matter how things are in real life.

/devel/other :: Link / Comments (1)


Thu, 24 Apr 2008

Second POHMELFS release.

It is scheduled for tomorrow; today I have to prepare myself for it. The whole idea and implementation started during fun new year vacations, so I have to repeat the process at least to some degree...

This release will not include direct writing to userspace from an async thread, since this approach happened to be really non-trivial. What I described for page fault handling works only for the first fault: when the page is populated into the table, it can be referenced and written into, and things just work. The problem happens when the same page is used for a second read (i.e. a new try from userspace; for example, if the size of the written data is increased to more than two pages, 'cat' will use the same two pages to read data). With the second write from the kernel there will be a page fault again, although the page exists in the table, and the fault cannot be handled (at least its cause will not be removed, since it will happen again and again), since the page table entry looks really good to the system, but not to the CPU.
I checked two cases: the usual copy_to_user() from the kernel on behalf of a userspace thread that invoked a read syscall, and the same code, but with the copy performed from a different thread. The page table entry (pte) looks very similar in both cases (with regard to all flags at least), but the fault always happens for the second write into the same page when the thread's mm context was changed to point to the original userspace one.
This does not change whether or not the userspace thread was scheduled away from its CPU.
The difference from get_user_pages() here is mainly that the resulting page is locked in the kernel (by increasing its reference counter at least), but I still want to produce the same behaviour as a usual page fault during a copy on behalf of a userspace thread.
So, I am stuck with this problem, but since it is very interesting I will find a solution.

Meanwhile, this release will include the following things:

  • POHMELFS client. Full client side caching. Async operations for all major events (not including the copy_to_user() hack described previously, just async notifications and copy on behalf of the original userspace thread). Support for usual files and directories only; special files like device files or pipes are not interesting at this point. They are quite simple to implement, but so far there is no need for that. Client has support for object creation/removal cache coherency messages.
  • POHMELFS userspace server. Object creation/removal cache coherency message broadcasting will be commented out, no locking.
Stay tuned!

/devel/fs :: Link / Comments (0)


Tue, 22 Apr 2008

Debunked copy_to_user() from kernel thread problem.

It happened to be really trivial. Not even any VM hacking :(

First, some background on how copy_to_user() works on x86.
Its asm looks pretty simple (and it is very small, check arch/x86/lib/usercopy_32.c:__copy_user()), so I always wondered how it can handle a missing-page exception when the userspace page was swapped out.

Things live in a small part of the function: .section __ex_table. This table contains two values: the place where the exception happened and the fixup address (both are just instruction positions). The linker puts this table into a special section, accessible by the page fault handler do_page_fault(). In some cases the page fault path is never executed; the code just searches for the page and locks it, even if it is already in the table (that is why get_user_pages() is at best as fast as copy_to_user()). This happens when the WP bit is not set and does not work (a speculation only though, derived from __copy_to_user_ll() and the Intel F00F bug errata).

When the WP bit works, we have the usual copy_to_user(), which will fault if there is no destination page, and do_page_fault() will eventually be called. After a number of checks the system determines that it is an exception in kernel mode, and if there is an exception table as described above (which is true for copy_to_user()), it tries to fix things up.

Here we come to essentially the same code that is called in get_user_pages(): we locate the VMA for the failed address and insert a new page into the page table. This involves allocation of all those strange 3-letter abbreviations: pgd, pud, pmd and pte ('and' is not a VMM abbreviation yet). I know what two or three of them mean, but completely forgot pud; with a 4-level page table it is hard to recall which two are the same, since iirc x86 has only 3 levels.
If the page was swapped out, it will be brought back, and eventually the fault handler will try to fix things up via fixup_exception(), which will replace EIP with the appropriate value from the table described above, so that the CPU will return to the __copy_user() code and continue its execution (or not, depending on whether the page exists).
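The table lookup itself is easy to model outside the kernel. The sketch below is a Python illustration with made-up addresses; it assumes the table is sorted by faulting instruction and searched with a binary search, and it says nothing about when exactly the kernel decides to consult the table.

```python
from bisect import bisect_left

# Made-up addresses: (faulting instruction, fixup address), sorted by the first.
EX_TABLE = [
    (0x1000, 0x9000),
    (0x1010, 0x9008),
    (0x2040, 0x9010),
]
INSNS = [insn for insn, _ in EX_TABLE]


def search_fixup(fault_ip):
    """Return the fixup address for fault_ip, or None if it has no entry."""
    i = bisect_left(INSNS, fault_ip)
    if i < len(INSNS) and INSNS[i] == fault_ip:
        return EX_TABLE[i][1]
    return None


def handle_kernel_fault(fault_ip):
    """Resume at the fixup if one exists; otherwise the fault is fatal."""
    fixup = search_fixup(fault_ip)
    if fixup is None:
        raise RuntimeError("oops: no fixup for %#x" % fault_ip)
    return fixup  # the new instruction pointer
```

Only instructions that are expected to touch userspace get an entry, so a fault anywhere else in kernel code ends up in the error branch.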

So, how to hook into the above mechanism and allow a completely different process to write data into userspace? Quite trivially: the above fixup (VMA searching and 3-letter abbreviation allocations) happens for a particular mm_struct, which contains the VMA list, page table lock and other (likely very) essential information for memory management. This structure is obtained from the current thread executing on the CPU, so by replacing the mm_struct in our kernel thread with the userspace thread's one, we can safely copy data to and from userspace. There is a race of course, when the userspace thread wants to access its own mm_struct (copied to the kernel thread), for example by calling mmap() or copy_*_user() from the kernel, so we have to be careful and properly guard against that.

Example code which does copy to userspace from kernel thread can be found in archive. Just replace kernel path in Makefile to your own, call make and insert module.
Each reading from /dev/tcopy file will end up with copy of data from kernel to userspace in dedicated kernel thread.

/devel/other :: Link / Comments (2)


Cache coherency in POHMELFS. Continue.

While moving home I thought a lot about cache coherency issues. While we believe that NFS has a coherent cache, since it is somewhat write-through, its cache is actually not synchronous: a really long time can pass between object creation and the moment when other clients see the new object, for example when the client which creates the object has a slow link... So, object creation and removal should not be synced to other clients during writeback on one of them. Instead, clients which are interested in an object perform a lookup, which may or may not return the object. This is not a race or cache non-coherency; this is a usual multithreaded environment without synchronization between clients.

What we really care about is data consistency on the server. When we have a multipage write which overlaps with another write from a different client, we should not read data back from the middle of a transaction. Locking the whole file is not a solution; instead proper byte-range (page-range actually) locking has to be implemented. I already have a prototype, but have to check it in real life.
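To make the page-range idea concrete, here is a deliberately naive Python sketch with exclusive locks only. It is not the prototype mentioned above; PageRangeLocks and its methods are hypothetical, and a real implementation would want read/write modes and something better than a linear scan.

```python
class PageRangeLocks:
    """Tracks locked [start, end) page ranges of one file."""

    def __init__(self):
        self.held = []  # list of (start, end, owner)

    def try_lock(self, start, end, owner):
        """Lock pages start..end-1 unless another owner holds an overlapping range."""
        for s, e, o in self.held:
            if o != owner and start < e and s < end:
                return False
        self.held.append((start, end, owner))
        return True

    def unlock(self, start, end, owner):
        self.held.remove((start, end, owner))
```

Two clients writing to disjoint page ranges of the same file never block each other here, which is exactly what a whole-file lock cannot offer.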

So, other competing projects may or may not follow my way and, based on my analysis, drop creation/removal/stat coherency from the TODO list (afaics, no one has implemented that yet :) and concentrate on server read/write locking.

And I will start some bits of VM hacking: the plan is to implement a generic enough (well, working on x86 for a start :) mechanism to copy data to userspace via copy_to_user() from a different thread (i.e. not the one which started the syscall), while the original one sleeps in the syscall. Likely it will be somewhat similar to what I did for the zero-copy userspace sniffer and to how get_user_pages() works.
The result has to be as fast as the usual copy_to_user(), otherwise it is not an interesting solution; it will be used in the POHMELFS client for its async reading.

/devel/fs :: Link / Comments (6)


Mon, 21 Apr 2008

Cache coherency in POHMELFS.

Example:

Client 1			Client 2
# ls -a /mnt/
. ..
				ls -a /mnt
				. ..
				echo qwe > /mnt/asdasd
				sync
ls -a /mnt/
. .. asdasd
rm -f /mnt/asdasd
sync
ls -a /mnt/
. ..
				dmesg | tail -n1
				pohmelfs_remove_response: parent: 2, path: '//asdasd'.
				ls -a /mnt
				. .. asdasd
As you might have noticed, when one client creates an object and it is written back to the server (during writeback), it is broadcast to all clients which read the same directory before. This information is stored on the server in a binary tree, so it takes (M-1)*O(log(N)) time, where M is the total number of clients and N is the number of directories they read. This can be further optimized though.
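The cost estimate is easier to see in a small model. In the Python sketch below a sorted list with binary search stands in for the server's binary tree; DirIndex and clients_to_notify() are made-up names, and keeping one index per client is an assumption about how the bookkeeping is laid out.

```python
from bisect import bisect_left


class DirIndex:
    """Per-client sorted index of the directories that client has read."""

    def __init__(self):
        self.dirs = []

    def add(self, path):
        i = bisect_left(self.dirs, path)
        if i == len(self.dirs) or self.dirs[i] != path:
            self.dirs.insert(i, path)

    def contains(self, path):
        # Binary search: O(log N) comparisons.
        i = bisect_left(self.dirs, path)
        return i < len(self.dirs) and self.dirs[i] == path


def clients_to_notify(indexes, writer, parent_dir):
    """Check every client except the writer: (M-1) lookups in total."""
    return [c for c, idx in indexes.items()
            if c != writer and idx.contains(parent_dir)]
```

One lookup per other client, each logarithmic in the number of directories read, gives the (M-1)*O(log(N)) figure quoted above.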

Objects are not removed from clients when one of them removes it (and this is synced to the server via writeback), since so far I cannot call sys_unlink() directly from a module, and I have not yet written the code to deal with the dentry cache (that will be simple). Instead you can see in dmesg that the other clients received the command and just need to drop the inode and dentry.

Also inode information is not broadcast yet (for example when file size increases or access rights are changed), so new files always have zero size. This information should be broadcast during writing, and since the server is heavily multithreaded, this should not hurt performance.

There is a different opinion though: we do not need cache coherency at all, since the last writer will overwrite data anyway, and when we open a new object, we first look it up on the server, so if it was created there, it will be opened, but if it exists only in the cache of some other client, we do not know about it anyway. We can broadcast the above messages during object creation on clients, but this will effectively be a write-through cache, since we could create the object on the server at that time.

Anyway, I will proceed with either remove/stat messages, or with ability to copy data to userspace from different thread. The latter looks like very interesting hack.

/devel/fs :: Link / Comments (2)


Sun, 20 Apr 2008

Meanwhile on the apartment development side.

Went to the 'Leroy Merlin' development shop to get lots of stuff and found such a huge crowd of people that I decided to run away as quickly as possible.
While walking there I found a couple of interesting things:

  • wood plates from small up to 2000x600, which are perfectly polished, have an acceptable thickness for table/shelf/cupboard development, and have too small a price to resist buying. When I started my table development there were no such things in broad usage at all, but I will not stop just because I found materials which allow building it much faster and simpler. For usual shelves, though, I will definitely get them there and will not implement things myself from real wood plates (those are made by gluing together much smaller plates; I used a similar technique to implement the thick enough part of my 'L'-style table).
  • bath cabins are incredibly ugly and unacceptably badly made. I knew that before though.
  • found a bar installation which I want to set up at home. Likely it will be my only table in the kitchen; I like it very much.
  • my kitchen (actually right now it is heavily used as a joinery only) has only 3/4 of its walls covered with wallpaper. Today I learned that my wallpaper model will not be sold anymore, so either I will have to re-paper most of the kitchen, or I will create some interesting installation on the remaining part of the walls. I think I already know what to put there: just like my blue wall in the room, I will put some brick-like elements in the kitchen. As a ceiling light I will install a huge wood beam hung on chains with small lights attached. Or maybe not, who knows...
At home I attached the '_' part of my 'L' table (it is not exactly 'L', but very much rounded), and started the last layer of paint. I also attached holders to the walls where the table will be located (my huge 2000x1500 or so table has only a single leg close to the end of the longer part of the letter 'L'; the other parts are attached to the walls). Maybe I will even install it tonight if the paint is ready.

/devel/flat :: Link / Comments (0)


Real Jedi does not use kernel.

He writes a new one or extends an existing one, but that is from a different series.

This one will tell you how one will be able to build a distributed and then parallel filesystem using POHMELFS.

The headline says it all: the POHMELFS server will not be placed into the kernel so far. It is already very fast (compared to the in-kernel async NFS server), and userspace programming is a bit easier, mostly because there is no need to wait about 10 minutes while the servers come up after an IPMI reboot. They are located somewhere I do not know, and there is no possibility to quickly reboot them by hand, so the servers have lots of things to bring themselves up even if something was really screwed up: network boot, plus SCSI probing, a possible fsck, the initial BIOS memtest (8GB)...

So, planned POHMELFS server updates:

  • PMCC - poor man cache coherency protocol. Scheduled for the first half of the next week, btw.
  • server extension to allow storing data on multiple devices (like creating mirroring), first by saving data in several local directories (think about server, which mounted remote dirs over POHMELFS or NFS, and local dirs).
  • client/server extension to report lookup and readdir requests not only for local destination, but also to different addresses, so that reading/writing could be done from different nodes in parallel.
Somewhere at the beginning there is also a task to extend client to be able to switch between different servers (if one goes down, client automatically reconnects to second and so on).

And the most complex task is server parallelization, i.e. ability to have multiple servers, which handle the same metadata, to work in parallel and be coherent. AFAIK, there are no such (at least open) solutions, neither Lustre, nor PVFS2, nor Ceph, nor glusterfs, nor whatever. There are solutions to have master-slave setup (IIRC, Lustre works that way), Ceph has ability to spread metadata between multiple servers, but they do not handle the same sets of objects, so there is no metadata server redundancy.
So far I consider this the most complex part, and I have not yet come to a solution.

/devel/fs :: Link / Comments (0)


Sat, 19 Apr 2008

hbukittbd: Andrew Morton proposes new userspace/kernelspace interface.

Rusty Russell is the author of vringfd() (the name says it all), a new interface for event ring buffer management.
Quotation from Andrew Morton:

This is may be our third high-bandwidth user/kernel interface to transport bulk data ("hbukittbd") which was implemented because its predecessors weren't quite right. In a year or two's time someone else will need a hbukittbd and will find that the existing three aren't quite right and will give us another one. One day we need to stop doing this ;)
...
So I think it would be good to plonk the proposed interface on the table and have a poke at it. Is it compat-safe? Is it extensible in a backward-compatible fashion? Are there future-safe changes we should make to it? Can Michael Kerrisk understand, review and document it? etc.

You know what I'm saying ;) What is the proposed interface?
Just for reference, I've filed it under the kevent tag :)

/devel/kevent :: Link / Comments (0)


Fri, 18 Apr 2008

Poor man's cache coherency protocol design for POHMELFS.

As you might know, POHMELFS is a network filesystem with a client-side cache of data and metadata. Anything with a cache has to provide a cache-coherency algorithm to sync data with other users.

There are two common cases when caches become non-coherent:

  • a client created/removed/modified an object which is not shared with other clients (i.e. this object does not exist in their caches and no object with the same name was created on different clients)
  • object being handled by one client exists in other caches
The poor man's solution to the above problems is quite easy: a client flushes its changes to whatever objects it wants during local writeback, and these changes are then propagated to all other clients which worked with the parent object (this information is stored on the server each time a client reads a directory or performs a lookup). In the first non-coherent case above, a client will just receive a new object from the server, which will be easily imported into the existing tree (because of the async nature of POHMELFS this is a trivial task, which right now works out of the box, although only on the client). In the latter case there might be a problem if the local object was modified: we can either replace its content with the new data, or (better) rename the local object to something different (like the old name plus sync time), so that the user can merge the data manually.

So far there will be no locks; they will be implemented next.
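The two cases can be written down as a tiny Python sketch. The cache layout (a dict of name to (data, dirty) pairs) and the "name.synctime" naming are assumptions made for the example; only the rename-instead-of-overwrite rule comes from the design above.

```python
def import_remote_object(cache, name, remote_data, sync_time):
    """Merge one object announced by the server into a client's cache.

    cache maps name -> (data, dirty); dirty means the object was modified
    locally and has not been written back yet.
    """
    local = cache.get(name)
    if local is not None and local[1]:
        # Keep the local edits under a new name so the user can merge by hand.
        cache["%s.%d" % (name, sync_time)] = local
    # In every case the name now refers to the server's version.
    cache[name] = (remote_data, False)
    return cache
```

An object that is new to the client, or cached but unmodified, is simply replaced; only a locally dirty object gets the extra renamed copy.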

/devel/fs :: Link / Comments (0)


POHMELFS AIO reading benchmark vs async NFS.

After I spent two days implementing real AIO for POHMELFS, the following things happened:

  • Implemented 3 different AIO schemes, two of which could be zero-copy. Here is a brief description of them.
    First, the POHMELFS ->aio_read() callback schedules a number of pages to be read from the server (if a page is already up-to-date, it is copied to userspace, otherwise a network request is sent), then it waits...
    • when async data is received from the remote side, the appropriate inode and pages are found, then the (physical) userspace page is locked in memory and data is either received into that page, or received into a VFS cache page and then copied into the userspace one. Then the userspace page is unlocked.
    • when async data is received (note that it is received completely asynchronously, in a different thread) into a VFS cache page, the receiving thread copies the data into userspace via copy_to_user(). Since the receiver thread has a completely different virtual memory layout, it cannot simply copy data to the provided userspace address; first it has to set up page tables equal to the userspace thread's layout. In theory setting the CR3 register on x86 should be enough, but that's only theory: I was not able to fully complete this method, since eventually the thread crashed (obviously: the userspace thread could still be active on a different CPU, so installing the same CR3 value on different CPUs, pointing to the same page tables, leads to crappy things). This interesting hack can be finished though.
    • when async data is received, pages are marked as ready and placed into a list, so the userspace thread can copy them back via copy_to_user(). The simplest method. And it works great (graphs below).
  • found a bug in 2.6.25-rc7 shmem when removing 1gb file from it:
    Bad page state in process 'rm'
    page:c49948c0 flags:0xf7d4a600 mapping:00000000 mapcount:0 count:0
    Trying to fix it up, but a reboot is needed
    Backtrace:
    Pid: 9454, comm: rm Not tainted 2.6.25-rc7 #11
    [] bad_page+0x52/0x7a
    [] free_hot_cold_page+0x5e/0x15a
    [] __pagevec_free+0x18/0x22
    [] release_pages+0xfb/0x142
    [] __pagevec_release+0x15/0x1d
    [] truncate_inode_pages_range+0xea/0x29f
    [] __link_path_walk+0xa7e/0xb28
    [] truncate_inode_pages+0x9/0xc
    [] shmem_delete_inode+0x26/0xac
    [] shmem_delete_inode+0x0/0xac
    [] generic_delete_inode+0x88/0xec
    [] iput+0x60/0x62
    [] do_unlinkat+0xb7/0xf9
    [] do_page_fault+0x2b6/0x6c2
    [] do_page_fault+0x31e/0x6c2
    [] sys_ioctl+0x2c/0x43
    [] sysenter_past_esp+0x5f/0x85
    [] pci_scan_single_device+0x377/0x446
    Did not try to investigate (this is my testing server, not tainted with POHMELFS code).
  • Ran multiple tests...
Test details for the second round of POHMELFS vs NFS fight.
Hardware and software were already described in the first round; I need to note that the server (2.6.25-rc7) has all debugging options turned off.

Tests performed: kernel tree reading (find linux-2.6.24.4 -type f | xargs cat > /dev/null) from disk over the net (XFS filesystem, cold server and client caches) and big file reading from the tmpfs (to eliminate server disk latencies). Graph was added to the previous round results.

[Graph: POHMELFS vs NFS]

Note that async NFS and POHMELFS behave very similarly in operations which involve reading from the disk. That is because of disk latencies (although the 10krpm SCSI disk used allows about 80 MB/s sequential read, XFS behaves quite badly with lots of small files); the tmpfs comparison shows the advantages of the POHMELFS network protocol.

Reading from a huge remote tmpfs file is about 2 times faster for POHMELFS because of its AIO implementation, although that is not the main reason: the server was almost always capable of handling requests from the POHMELFS client one-by-one using one thread, which saturated about 70% of the bandwidth (add here all debug options turned on on the client). One of the main factors, I think, is readahead being turned off. Sync readahead has zero advantage in an asynchronous network filesystem: the ->readpage() method used in readahead waits until the page is transferred, and only then does the readahead code schedule a new request, although it could schedule new requests while it waits. One can implement ->readpages() though.
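A back-of-the-envelope model shows why waiting for each page hurts. The Python sketch below is a simplification that ignores server load, TCP behaviour and everything else; the numbers in the usage note are made up for illustration and are not measurements from these tests.

```python
def sync_read_time(pages, rtt, xfer):
    """Each request is sent only after the previous page has arrived."""
    return pages * (rtt + xfer)


def pipelined_read_time(pages, rtt, xfer):
    """All requests are sent up front; only the first waits a full round trip."""
    return rtt + pages * xfer
```

With a hypothetical 200 us round trip and 40 us of wire time per page, 1000 pages take 240000 us one-by-one and 40200 us when pipelined, so in this model the round-trip latency dominates the synchronous case.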

A kernel tree reading micro-benchmark was also performed: POHMELFS has a 2-times win because of its network protocol, which batches server replies (via TCP_CORK only though; I think I need to implement a better directory reading command).

Another solution is to correctly implement the transactional model, which is the next task now.

/devel/fs :: Link / Comments (0)


Wed, 16 Apr 2008

Massively multithreaded POHMELFS server.

Because of the completely asynchronous nature of POHMELFS it is possible to implement a multithreaded server, where not only are requests from different clients processed in parallel, but async requests from the same user are also handled simultaneously by a pool of threads.
Such multithreading requires introducing a transactional model of communication. Take, for example, object creation followed by writing data: right now this race is handled by sending a reply after creation, so the whole writeback sleeps waiting for it, which drops performance (to NFS level). A transaction, on the contrary, will contain both operations, which will be processed by the same thread without a race. It can also handle other problematic places with multiple server threads.

So far the userspace server can run one or several processing threads per client, but no transactions are implemented. I just started the AIO reading implementation, which should provide a great speedup for any reading workload.

Stay tuned!

/devel/fs :: Link / Comments (0)


Meanwhile on the apartment development side.

Nothing happened, but I made a couple of photos. I will set up a full gallery somewhat later, when I finish things; so far a couple of room photos and x-shelves.





/devel/flat :: Link / Comments (0)


Mon, 14 Apr 2008

Initial network filesystem benchmark. POHMELFS vs NFS. Round 1.

Hardware (both client and server have the same hardware).
4-way (2 logical (HT) + 2 physical cpus) 3.00 Xeon (32 bits with PAE :), 8 GB of RAM, Intel 82541GI gbit adapters, Seagate ST3300007LC 10k rpm scsi disk on Adaptec AIC7902 PCI-X Ultra320 SCSI adapter.

Software.
Server: 2.6.25-rc7 kernel, in-kernel NFS server, userspace POHMELFS server.
Client: 2.6.25-rc8 kernel, in-kernel clients.
Both have all kernel debugging turned on.

Round 1. Huge directory (linux-2.6.24.4.tar archive) untarring over the network.
Picture shows it all.

POHMELFS vs NFS

Notice that there is no test for POHMELFS reading (that is why it is only the first round), since it is miserable. And I know the reason: I'm lazy, so I use the generic reading function (generic_file_aio_read()), but Linux does not actually have AIO reading from regular files, so it is very synchronous and has to read data page by page, which gives a pretty broken system with regard to network performance.

Since reading is not async, I will reimplement generic_file_aio_read() as pohmelfs_aio_read(), which will be a real AIO reading function. That will be the second round, where POHMELFS will win.

But it can not win the game, because things are changing. Today I learned that if a filesystem has only 20 users in the world, then it should not be merged, since the burden of changing something generic in VFS (and thus propagating it to filesystems) is too high.

What has happened? Have Linux kernel maintainers started to be afraid of changes? Afraid of more code? Afraid of something new they do not want?..

Eh, and they say they want more developers... They want monkeys who will do only what they were asked to do.

POHMELFS will be sent for review of course, but it is highly unlikely I will push it upstream.

/devel/fs :: Link / Comments (6)


A hypocrisy.

When a user files a bug, the developer is supposed to fix it. That is obvious and of course true.

But interesting things start showing in details.
If a user pisses a developer off, it is ok. If the developer throws something back - it is bad.
If a user does not answer, it is ok. If the developer keeps silent - he is a bastard.
If a user files a bug, it is ok. If the developer asks the user for some help - the developer is a fucking monster.

Yes, there are real jerks in the development community as well as among users, and taking simple numbers: the user community is much bigger than the development one, so the number of crappy people scales as well. Nevertheless, people like to blame developers and pray to users. This comes down to the absurd, when a developer asks for help and is then blamed for not devoting time to solving the problem.

People like to look at others. I like to look at others too of course. And we frequently like to forget that we behave exactly like those who we blame to be jerks. Exactly like them. We just forgot that, or do not pay attention, or do not want to think about, since when things come to us, this becomes a hypocrisy.

/devel/other :: Link / Comments (1)


Fri, 11 Apr 2008

Unhashed inodes can not be synced during writeback. Debunked.

The problem turned out to be quite simple: writeback happens for inodes in the sb->s_io superblock list. They are moved there from the sb->s_dirty list, which contains dirty inodes. Dirty inodes are placed into that list via mark_inode_dirty(), which checks whether the inode is hashed; if it is not, it will not be placed into the dirty list.
Hashed has a synonym in comments: valid...

There is the sb->s_op->dirty_inode() superblock operation callback, which is invoked first, so one can still implement one's own inode cache, not use the inode hash tables, not hash inodes, and still put inodes into the dirty list and thus be able to run writeback on them.
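
The order of checks can be modelled in userspace roughly like this. It is a simplified sketch and not the kernel code: struct toy_inode, struct toy_super, toy_mark_inode_dirty() and toy_fs_dirty_inode() are stand-ins invented for illustration, and locking and all special cases are left out.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode;

struct toy_super {
	/* Stand-in for sb->s_op->dirty_inode(); may be NULL. */
	void (*dirty_inode)(struct toy_inode *inode);
	int s_dirty_count;	/* inodes sitting on the dirty list */
};

struct toy_inode {
	struct toy_super *sb;
	int hashed;		/* is the inode in the hash table? */
	int on_dirty_list;
};

/* Model of the generic path: callback first, hash check second. */
void toy_mark_inode_dirty(struct toy_inode *inode)
{
	struct toy_super *sb = inode->sb;

	assert(sb != NULL);

	/* The filesystem callback runs before any hash check. */
	if (sb->dirty_inode)
		sb->dirty_inode(inode);

	if (inode->on_dirty_list)
		return;		/* already queued, nothing to do */
	if (!inode->hashed)
		return;		/* generic code skips unhashed inodes */

	inode->on_dirty_list = 1;
	sb->s_dirty_count++;
}

/* Hypothetical filesystem hook: queue unhashed inodes on its own. */
void toy_fs_dirty_inode(struct toy_inode *inode)
{
	if (!inode->hashed && !inode->on_dirty_list) {
		inode->on_dirty_list = 1;
		inode->sb->s_dirty_count++;
	}
}
```

Without the hook an unhashed inode never reaches the dirty list; with the hook installed it does, because the callback gets its chance before the generic code bails out.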

/devel/fs :: Link / Comments (0)


Thu, 10 Apr 2008

Busy inodes after unmount.

VFS: Busy inodes after unmount of pohmel. Self-destruct in 5 seconds.  Have a nice day...
After removing the private cache of inodes I found that objects which were sent by the server and never attached to a directory entry (dentry) will never be freed.
So, essentially this does not work with Linux VFS:
iget()/iget_locked()
...
umount
Inodes, created by iget()/iget_locked() will be placed into at least three different lists:
  • inode_in_use - global list of ever created inodes, which have i_count and i_nlink more than 0
  • s_inodes - per superblock list, which contains every inode, created for this superblock
  • inode_hashtable - hash table indexed by inode number. If you want to work with writeback, your inodes have to be there. Did not yet investigate why.
So, essentially all inodes which you created are accessible by VFS and will be checked during umount via generic_shutdown_super()->invalidate_inodes(), where the system will notice that an inode in the s_inodes list has a non-zero reference counter (of course, otherwise it would already have been freed by the filesystem) and thus can not be freed. So we have a leak.

The above lists can only be accessed under the global inode lock, so it is not a good idea to destroy inodes by traversing them in, for example, the ->put_super() callback or in any other filesystem callback, so I had to add a list of all inodes into the POHMELFS superblock. Ugly.
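
The check done at umount time can be pictured with a toy model. This is only a sketch of the idea, not the real invalidate_inodes(): struct toy_inode and toy_invalidate_inodes() are invented names, and the real code also frees the unused inodes and takes locks.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode {
	int i_count;			/* reference counter */
	struct toy_inode *next;		/* link in the per-superblock list */
};

/*
 * Walk the per-superblock list and count inodes that still hold
 * references. A non-zero result corresponds to the "Busy inodes
 * after unmount" message.
 */
int toy_invalidate_inodes(const struct toy_inode *s_inodes)
{
	const struct toy_inode *inode;
	int busy = 0;

	for (inode = s_inodes; inode != NULL; inode = inode->next) {
		assert(inode->i_count >= 0);
		if (inode->i_count > 0)
			busy++;
	}
	return busy;
}
```

An inode obtained through iget() and never dropped keeps i_count above zero, so it is counted as busy here exactly as described above.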

/devel/fs :: Link / Comments (0)


get_user_pages() scalability.

Just found an article at LWN about get_user_pages(). The main problem turned out to be locking between multiple threads...

Out of curiosity: was this scalability problem fixed? (For the busy reader: this is my more than 2-year-old test of get_user_pages() performance with a single thread, run to find bottlenecks in kevent AIO.)

Here is a graph (performance vs. number of pages):
get_user_page() scalability

/devel/other :: Link / Comments (0)


POHMELFS development status.

It has developed very rapidly over the last couple of days, so essentially I rewrote it. I think it is ready for the next release, which I will announce in a day or so.
Right now all the first-milestone features I planned, except cache coherency (see below), are completed (although sometimes maybe not in the most optimal way).
Because of the name cache it is now possible to create huge paths with multiple directories via a single command. The same applies to directory removal, although that is for a different design reason.
It would be possible to rewrite the generic read/write helpers and pass a set of pages into the POHMELFS network stack (which is page based for data now), but I decided that it is not needed for the first step.
POHMELFS now has fully async processing of all operations except link creation (I just decided that it is a bit simpler to make them write-through; it was done out of laziness and not because of some fundamental arch problems). This was achieved by serious (read: from scratch) changes in the arch, which had its own problematic places, namely error reporting. Because of this move it becomes really simple to implement any kind of protocol, as long as it obeys the async rules: sending a message never requires a sync reply, and where a reply is needed, it comes as an independent incoming message, which is processed asynchronously from the waiter via a common state machine.
Such an arch allows a simple cache coherency algorithm, where the server just sends missed entries or commands to remove some objects, and the client's core handles that just fine, since its receiving code does not depend on the sending one. This is not a 100% correct way to handle collisions (collisions thus become new objects in the filesystem tree, like the old name plus some suffix) and it is not real cache coherency, but it is what lots of users need.
Writeback cache does not play very well with cache coherency, since every metadata change (like object creation or removal) has to be checked against the server state, because different clients can do the same with the same object. The level of paranoia has to be thought of in advance.
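
The async rule described above (a send never blocks waiting for a reply; replies and server-initiated commands arrive as independent messages and go through one common state machine) can be sketched like this. The message names and the struct client_core type are hypothetical and are not the real POHMELFS commands:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message types, invented for this sketch. */
enum msg_cmd {
	MSG_READDIR_REPLY,
	MSG_PAGE_DATA,
	MSG_REMOVE_OBJECT,
	MSG_MAX
};

struct msg {
	int cmd;
	int object_id;
};

/* Toy client core: it only counts what it has processed. */
struct client_core {
	int handled[MSG_MAX];
};

typedef void (*msg_handler)(struct client_core *core, const struct msg *m);

static void count_msg(struct client_core *core, const struct msg *m)
{
	core->handled[m->cmd]++;
}

/* One common table drives every incoming message. */
static const msg_handler handlers[MSG_MAX] = {
	[MSG_READDIR_REPLY]	= count_msg,
	[MSG_PAGE_DATA]		= count_msg,
	[MSG_REMOVE_OBJECT]	= count_msg,
};

/*
 * Receive path: it never looks at what the sending side is doing, so a
 * reply to our own request and an unsolicited server command are
 * handled the same way. Returns 0 on success, -1 for unknown commands.
 */
int client_receive(struct client_core *core, const struct msg *m)
{
	assert(core != NULL && m != NULL);

	if (m->cmd < 0 || m->cmd >= MSG_MAX || handlers[m->cmd] == NULL)
		return -1;

	handlers[m->cmd](core, m);
	return 0;
}
```

Since the receiver is a plain dispatch loop, adding a cache coherency message is just one more table entry; nothing on the sending side has to know about it.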

The first cache-coherency step is the implementation of a trivial scheme, where every object is synced during its writeback time and changes are broadcast by the server to other clients. If another client has the same object being processed, it can either be renamed as a collision or just overwritten. Having locks and thus real states is the next step.

Also, POHMELFS does not have authentication and strong checksums right now, and although this is a simple task to implement, its priority is questionable. There is also the possibility to implement cryptographically strong encryption of the communication channels.

So, lots of ideas, but main part is ready - async data processing design was definitely a right choice to implement, so all other features become very simple to complete.
New release will be announced very soon, stay tuned!

/devel/fs :: Link / Comments (0)


Tue, 08 Apr 2008

Social hacking.

How to survive in the office (in Russian) - a report about office life as seen by a reporter who worked as a secretary for several weeks.

A real social hacking imho.

/other :: Link / Comments (0)


Mon, 07 Apr 2008

A Theory. Antisocial theory.

For quite a while I have had a quite interesting but very antisocial theory in my mind. A hacker's behavioural theory.
Let's talk about males here (frankly, I never saw a female hacker with similar behavioural aspects).

The key point of the theory is that when a person has something really interesting for himself, he does not want to spend his (limited) time with others. He is so selfish (in the good meaning of this word) that he just does not need anyone near to spend time with, because he can create or find a real problem for himself and devote all his time to it. There can be lots of other people around, but eventually (if they are not really good friends, who understand that the immediate timeframe does not matter) they all understand that their time does not come back.
Such people do not really think about others, they think about the problem which lives in their mind right now. This does not always mean that they do not like other people (which actually can be true), only that they are somewhere in another place with other thoughts.

Maybe they will return to usual life and devote their time to other people and not to some problems, or maybe not...

Such a theory... Created by looking around.

/other :: Link / Comments (0)


Sun, 06 Apr 2008

There is only one way: asynchronous.

This is a new motto for POHMELFS. It is a completely new filesystem now.

POHMELFS got new page processing code (sending side: commands and data) and a new lookup, which is based on the Linux VFS inode cache without reinventing the wheel (a comment says it is very SMP-friendly, although I do not quite understand how that is possible with the global inode_lock). It also got a completely new object creation and referencing path. It is possible to create a huge path (up to 4k, but this can easily be extended if there is such demand) with multiple objects in it with only a single network command.
But the main feature of the new POHMELFS is its name cache. I did not find out how to hook into the VFS dentry cache, so I invented my own. It is fast to traverse from a child to the highest-level parent, which is actively used in the POHMELFS writeback path. Although it is not the best possible storage, just a simple RB-tree (which thus requires an SMP-unfriendly mutex), the whole idea already shows its gains. Eventually it will be replaced with a faster and more scalable approach protected by RCU (even a properly sized hash table will show better scalability, although dynamic resizing of hash tables prevents RCU usage), but I started from the simplest ground.
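
The child-to-parent walk is what makes it cheap to turn any cached object into a full path for a single network command. A minimal sketch, assuming a hypothetical struct name_entry with a parent pointer (the real cache lives in an RB-tree and looks different):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name cache entry; parent is NULL for a top-level object. */
struct name_entry {
	const char *name;
	const struct name_entry *parent;
};

/*
 * Build the full path of an entry by walking child -> parent and
 * filling the buffer from its end, then moving the result to the front.
 * Returns the path length, or -1 if the buffer is too small.
 */
int name_entry_path(const struct name_entry *e, char *buf, size_t size)
{
	size_t pos = size;
	size_t len;

	assert(buf != NULL);
	if (size == 0)
		return -1;
	buf[--pos] = '\0';

	for (; e != NULL; e = e->parent) {
		len = strlen(e->name);
		if (len + 1 > pos)
			return -1;	/* no room for "/name" */
		pos -= len;
		memcpy(buf + pos, e->name, len);
		buf[--pos] = '/';
	}

	len = size - pos - 1;
	memmove(buf, buf + pos, len + 1);
	return (int)len;
}
```

For entries a, b (child of a) and c (child of b), calling name_entry_path() on c produces "/a/b/c" without ever looking a name up from the root.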

POHMELFS already outperforms async NFS during untarring and completely saturates my testing Xen domains (both network and disk speed), while NFS is almost two times slower. The testing machines have 256 MB of RAM and a maximum of 3 MB/s interconnect speed (something is likely broken in the Xen setup, since it is supposed to be 100 Mbit/s and there is no high load), which is very unfriendly for POHMELFS (read: in such a scenario POHMELFS will show its worst results), but nevertheless it is fast.

It became not only much faster, but also simpler. Its userspace server has half as many lines of code (816 vs. 1613), the kernel side is smaller and simpler too: mainly there are no zillions of different trees indexed by any possible keys, so far only a per-inode tree of child names for readdir and a per-superblock path entry cache.

There are drawbacks of course: there is no receiving code (at all). It will be a dedicated thread, which will asynchronously process all incoming packets (mostly readdir async return, read page content and cache-coherency messages). First two are really simple. The last one will be implemented as a full MOSI/MSI library for inode content. Likely it will be possible to use in my other projects.

P.S. I frequently think that I'm a very good vapourware seller :)
Stay tuned!

/devel/fs :: Link / Comments (0)


Thu, 03 Apr 2008

Coding style stupid talks.

Yet another one...

Blah-blah-blah, I like spaces, blah-blah-blah, I do not like spaces...

Here are just two examples (one from the thread), decide for yourself which is easier to read:

Becauseitmoreeasilyallowsyoureyestoseethedifferentoperators.
B e c a u s e i t m o r e e a s i l y a l l o w s y o u r e y e s t o s e e t h e d i f f e r e n t o p e r a t o r s .
The same applies to more common:
for (i=0; i<10; ++i) vs 
for (i = 0; i < 10; ++i)
The latter just wastes lots of space and forces eyes to move out of orbits.
That is my own opinion, obviously the more people involved, more opinions strike.

So, never kick someone when he is on the edge by forcing him to change simple stuff in coding style; he can return and kick you back when you are on your own edge...

Ugh, and forgot likely the favourite one:
for (i=0; i<10; ++i) vs 
for (i=0; i<10; i++)
Update: Oh holy crap: I recall people comparing their uptimes to show whose dick is longer (who is more cool), but comparing the number of whitespaces-instead-of-tabs errors per subsystem is a real winner of the modern cruel reality! Hope you have a sense of humor, let's convert the number of errors per 1000 lines of code into length (100*kloc/errors):
kernel/ maintainer has this big: ===========D
arch/alpha maintainer has this big: =D
arch/arm maintainer has this big: ==D
arch/avr32 maintainer has this big: ============D
arch/blackfin maintainer has this big: ===================================D
arch/cris maintainer has this big: =D
arch/frv maintainer has this big: ====D
arch/h8300 maintainer has this big: =D
arch/ia64 maintainer has this big: ==D
arch/m32r maintainer has this big: ====D
arch/m68k maintainer has this big: ==D
arch/m68knommu maintainer has this big: =====D
arch/mips maintainer has this big: ====D
arch/parisc maintainer has this big: D
arch/powerpc maintainer has this big: ==D
arch/ppc maintainer has this big: =D
arch/s390 maintainer has this big: =D
arch/sh maintainer has this big: ====D
arch/sparc maintainer has this big: ==D
arch/sparc64 maintainer has this big: ===D
arch/um maintainer has this big: ==D
arch/v850 maintainer has this big: ===D
arch/x86 maintainer has this big: =D
arch/xtensa maintainer has this big: ==D
And couple of my projects:
fs/pohmelfs maintainer has this big: =======D
drivers/block/dst/ maintainer has this big: ============D
drivers/connector maintainer has this big: ===D
drivers/w1 maintainer has this big: =======D
Not bad, will put it near the mirror...

/devel/other :: Link / Comments (8)


Wed, 02 Apr 2008

Climbing evening.

That was a quite short although quite hard training. After a number of warm-up traverses I started jumping again - this time I set a 'route' myself on the huge horizontal overhang, so a few times I fell on my back from about a meter, which was even fun. Eventually I managed to complete my own jumping holds, which resulted in badly rubbed fingers and toes, so essentially the rest of the training was bound to be something trivial. Nevertheless I managed to try some old complex start a couple of times, fell of course, but it was worth it.
The usual finish of the training - sauna - was exceptionally dry and hot today, about 99 degrees Centigrade, and it was even hard to breathe, since the air was so dry.
Anyway, excellent time!

/life :: Link / Comments (0)


Unhashed inodes can not be synced during writeback.

So essentially there is no way to implement one's own inode cache tied to the system's writeback mechanism, which is bad news. POHMELFS in its current reincarnation does not use the system's inode cache and all its inodes are unhashed, which results in the fact that they are never synced, since the writeback mechanism just does not see them.
So I will fall back to hashed inodes, which will be used just for that, and writeback for a single inode will end up creating the directory structure for all the upper-layer objects.

Another idea is to implement my own writeback, which would be scheduled from the main one or after memory notifications. This approach actually has lots of advantages, but let's first complete the simpler part with hashed inodes.

This is called learning curve - I'm essentially where I was before, but with extended baggage of knowledge.

/devel/fs :: Link / Comments (0)


Tue, 01 Apr 2008

Fix for the fundamental network/block layer race in sendfile().

Summary of the previous series with this pompous header: when sendfile() returns, the pages which it sent can still be queued in the TCP stack or hardware, so a subsequent write into them will end up corrupting data which will eventually be sent. This concerns all ->sendpage() users, namely sendfile() and splice().

We can safely reuse those pages only when an ack is received from the remote side, which forces the network stack to release the pages. My simple extension allows hooking into the data release path and performing any actions we want. This is achieved by replacing skb->destructor with a callback registered by the interested user, for example the splice/sendfile code. Splice (the pipe info structure) in turn is extended to hold an atomic counter of the pages in flight (without a structure size change, because of the alignment issues it has right now), so the splice code will sleep once a full pipe info (->nrbufs pages) has been sent; it will wait until the number of pages in flight, which is decremented in the private splice callback, hits zero.
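
The accounting can be pictured with a single-threaded toy model. It is a sketch only: struct toy_pipe and the toy_* functions are invented for illustration, the real counter is atomic, and the real sender sleeps instead of returning an error.

```c
#include <assert.h>

struct toy_pipe {
	int nrbufs;		/* pages the pipe info can hold */
	int pages_in_flight;	/* sent, but not yet released by the stack */
};

/*
 * Queue one page for sending. Returns 0 on success, or -1 when the pipe
 * is full, which is where the real code would go to sleep.
 */
int toy_send_page(struct toy_pipe *p)
{
	if (p->pages_in_flight >= p->nrbufs)
		return -1;
	p->pages_in_flight++;
	return 0;
}

/*
 * Hypothetical stand-in for the private destructor callback: called
 * when the network stack releases one page after the remote ack.
 */
void toy_page_released(struct toy_pipe *p)
{
	assert(p->pages_in_flight > 0);
	p->pages_in_flight--;
}

/* Non-zero once the caller may safely modify the pages again. */
int toy_pages_reusable(const struct toy_pipe *p)
{
	return p->pages_in_flight == 0;
}
```

The point of the model is the last function: returning from the send call says nothing about reusability, only the counter dropping back to zero does.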

Patch was tested with simple send and recv applications, which can be found in archive.

One has to run them on different machines, since loopback uses a slightly different scheme (namely the page is _never_ copied, so when it is received by the 'remote' side, it still exists on the 'local' side, so modifications will end up in data corruption).

devfs1# ./recv -a 0.0.0.0 -p 1025 -c 1024
devfs2# ./send -a devfs1 -p 1025 -f /tmp/test -c 1024
In case of failure you will get this:
Connected to devfs1:1025.
/tmp/test/1024 -> devfs1:1025
Data was corrupted: ab.
after a short period of time, where the 'ab' above is a hex byte written into the mapped file, which has been sent, immediately after sendfile() returns to userspace. Data is supposed to be always zero, and the applications should run forever.
The -c parameter specifies the number of bytes to be sent in each run of sendfile(). It has to be the same on both machines.

This idea was first conceived as soft barriers in distributed storage.

/devel/networking :: Link / Comments (0)


I believe Firefox as is can pass Turing test.

It is real artificial life on my desktop:

gettimeofday({1207056215, 592745}, NULL) = 0
gettimeofday({1207056215, 592792}, NULL) = 0
gettimeofday({1207056215, 592858}, NULL) = 0
gettimeofday({1207056215, 592909}, NULL) = 0
gettimeofday({1207056215, 592957}, NULL) = 0
gettimeofday({1207056215, 593005}, NULL) = 0
gettimeofday({1207056215, 593064}, NULL) = 0
gettimeofday({1207056215, 593139}, NULL) = 0
gettimeofday({1207056215, 593237}, NULL) = 0
gettimeofday({1207056215, 593292}, NULL) = 0
gettimeofday({1207056215, 593346}, NULL) = 0
gettimeofday({1207056215, 593382}, NULL) = 0
gettimeofday({1207056215, 593431}, NULL) = 0
gettimeofday({1207056215, 593491}, NULL) = 0
gettimeofday({1207056215, 593541}, NULL) = 0
gettimeofday({1207056215, 593589}, NULL) = 0
gettimeofday({1207056215, 593638}, NULL) = 0
gettimeofday({1207056215, 593696}, NULL) = 0
gettimeofday({1207056215, 593762}, NULL) = 0
gettimeofday({1207056215, 593843}, NULL) = 0
gettimeofday({1207056215, 593897}, NULL) = 0
gettimeofday({1207056215, 593951}, NULL) = 0
gettimeofday({1207056215, 593987}, NULL) = 0
gettimeofday({1207056215, 594034}, NULL) = 0
gettimeofday({1207056215, 594093}, NULL) = 0
Suddenly it started to eat my CPU by getting the time roughly every 50 µs... I can not say why it is needed, except as some sign of AI calibrating its ion cannon. Fortunately it was killed before any damage (except a screaming cooler on the processor) was done.
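
The second field printed by strace for gettimeofday() is tv_usec, so the gaps in the trace are in microseconds. A tiny helper to check the average gap, assuming all samples fall within the same second (mean_interval_usec() is a made-up name for this sketch):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Mean gap in microseconds between consecutive tv_usec samples that
 * were all taken within the same second. Integer division is used.
 */
long mean_interval_usec(const long *usec, size_t n)
{
	assert(usec != NULL && n >= 2);
	return (usec[n - 1] - usec[0]) / (long)(n - 1);
}
```

For the first five samples of the trace (592745, 592792, 592858, 592909, 592957) this gives 212 / 4 = 53 microseconds between calls.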

/devel/other :: Link / Comments (4)