Zbr's days.

Tue, 22 Apr 2008

Debunked copy_to_user() from kernel thread problem.

It happened to be really trivial. Not even any VM hacking :(

First, some background on how copy_to_user() works on x86.
Its asm looks pretty simple (and it is very small, check arch/x86/lib/usercopy_32.c:__copy_user()), so I always wondered how it can handle a missing-page exception when the userspace page was swapped out.

The magic lives in a small part of the function: .section __ex_table. Each entry of this table contains two values: the place where the exception can happen and the fixup address (both are just instruction positions). The linker puts this table into a special section, accessible by the page fault handler do_page_fault(). In some cases the page fault path is never executed: the code just searches for the page and locks it, even if it is already in the page table (that is why get_user_pages() is at best as fast as copy_to_user()). This happens when the WP bit is not set and does not work (a speculation only though, derived from __copy_to_user_ll() and the Intel F00F bug errata).
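To make the table lookup concrete, here is a minimal userspace sketch of the idea, not the kernel's actual code. The names struct ex_entry and find_fixup() are made up for illustration, and the sketch assumes the table is sorted by the faulting instruction address so that a binary search works.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of one exception table entry (hypothetical names). */
struct ex_entry {
    uintptr_t insn;   /* address of the instruction that may fault */
    uintptr_t fixup;  /* address to resume at if it does fault */
};

/*
 * Binary search over a table sorted by insn.
 * Returns the fixup address, or 0 when the faulting address has no entry.
 */
uintptr_t find_fixup(const struct ex_entry *tbl, size_t n, uintptr_t fault_ip)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tbl[mid].insn == fault_ip)
            return tbl[mid].fixup;
        if (tbl[mid].insn < fault_ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

A fault at an address without an entry returns 0 here; in the real kernel that case means a genuine bug rather than an expected userspace access failure.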

When the WP bit works, we have the usual copy_to_user(), which will fault if there is no destination page, and do_page_fault() will eventually be called. After a number of checks the system determines that it is an exception in kernel mode, and if the faulting instruction has an entry in the exception table described above (which is true for copy_to_user()), it tries to fix things up.

Here we come to essentially the same code that is called in get_user_pages(): we locate the VMA for the faulting address and insert a new page into the page table. This involves allocation of all those strange 3-letter abbreviations: pgd, pud, pmd and pte ('and' is not a VMM abbreviation yet). I know what two or three of them mean, but completely forgot pud; with a 4-level page table it is hard to recall which levels are folded together, since iirc 32-bit x86 has at most 3 levels (and that only with PAE).
If the page was swapped out, it will be brought back and the faulting instruction is simply restarted. If the fault cannot be resolved (for example, there is no valid mapping for the address), the fault handler falls back to fixup_exception(), which replaces EIP with the fixup address from the exception table described above, so that the CPU returns to the __copy_user() code at the fixup location and the copy reports the bytes it could not write.
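For the record, the four abbreviations stand for page global directory (pgd), page upper directory (pud), page middle directory (pmd) and page table entry (pte). The sketch below shows how a virtual address splits into one index per level. It assumes x86-64 4-level paging with 4 KiB pages (9 index bits per level, 12 offset bits); on 32-bit x86 some levels are folded and the bit widths differ. struct pt_index and split_vaddr() are illustrative names, not kernel API.

```c
#include <stdint.h>

/* Assumed layout: x86-64, 4-level paging, 4 KiB pages. */
#define TOY_PAGE_SHIFT 12
#define TOY_LEVEL_BITS 9
#define TOY_LEVEL_MASK ((1u << TOY_LEVEL_BITS) - 1)

struct pt_index {
    unsigned pgd, pud, pmd, pte;  /* index into each table level */
    unsigned offset;              /* byte offset inside the page */
};

/* Split a virtual address into per-level table indices. */
struct pt_index split_vaddr(uint64_t va)
{
    struct pt_index ix;

    ix.offset = (unsigned)(va & ((1u << TOY_PAGE_SHIFT) - 1));
    ix.pte = (unsigned)((va >> TOY_PAGE_SHIFT) & TOY_LEVEL_MASK);
    ix.pmd = (unsigned)((va >> (TOY_PAGE_SHIFT + TOY_LEVEL_BITS)) & TOY_LEVEL_MASK);
    ix.pud = (unsigned)((va >> (TOY_PAGE_SHIFT + 2 * TOY_LEVEL_BITS)) & TOY_LEVEL_MASK);
    ix.pgd = (unsigned)((va >> (TOY_PAGE_SHIFT + 3 * TOY_LEVEL_BITS)) & TOY_LEVEL_MASK);
    return ix;
}
```

A missing entry at any of these levels is what the fault path has to allocate before the new page can be inserted.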

So, how to hook into the above mechanism and allow a completely different process to write data into userspace? Quite trivially: the above fixup (VMA searching and allocation of the 3-letter abbreviations) happens for a particular mm_struct, which contains the VMA list, the page table lock and other (likely very) essential information needed to handle memory management. This structure is obtained from the current thread executing on the CPU, so by replacing the mm_struct in our kernel thread with the userspace thread's one, we can safely copy data to and from userspace. There is a race of course: the userspace thread may want to access its own mm_struct (copied to the kernel thread), for example by calling mmap() or copy_*_user() from the kernel, so we have to be careful and properly guard against that.
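The trick can be modelled in plain userspace C. This is only a toy, not the code from the archive: every name below (toy_mm, toy_task, toy_current, toy_copy_to_user, toy_use_mm) is hypothetical, the "address space" is just a buffer, and the locking needed to guard against the race is left out entirely. The point it shows is that the copy routine never takes the target address space as an argument; it looks it up through whatever task is current.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for mm_struct: describes one "userspace" buffer. */
struct toy_mm {
    char *user_base;   /* start of the buffer this mm maps */
    size_t user_len;   /* its length in bytes */
};

/* Toy stand-in for a task; kernel threads have no mm of their own. */
struct toy_task {
    struct toy_mm *mm;
};

/* The task "running on the CPU"; the copy routine consults it implicitly. */
struct toy_task *toy_current;

/*
 * Returns the number of bytes NOT copied, 0 on success.
 * Simplification: this toy never copies partially, it is all or nothing.
 */
size_t toy_copy_to_user(size_t uoff, const void *src, size_t n)
{
    struct toy_mm *mm = toy_current ? toy_current->mm : NULL;

    if (!mm || uoff > mm->user_len || n > mm->user_len - uoff)
        return n;  /* no address space or range out of bounds */
    memcpy(mm->user_base + uoff, src, n);
    return 0;
}

/* Borrow another task's mm; returns the previous one so it can be restored. */
struct toy_mm *toy_use_mm(struct toy_task *t, struct toy_mm *mm)
{
    struct toy_mm *old = t->mm;

    t->mm = mm;
    return old;
}
```

In this model a kernel thread with mm == NULL fails every copy until it borrows the userspace task's mm with toy_use_mm(), and fails again once the old value is restored.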

Example code which copies to userspace from a kernel thread can be found in the archive. Just replace the kernel path in the Makefile with your own, run make and insert the module.
Each read from the /dev/tcopy file will end up with a copy of data from kernel to userspace, done in a dedicated kernel thread.
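A userspace test for the module could look roughly like the helper below. read_once() is a made-up name, and what the module actually puts into the buffer is not specified here, so the sketch only returns the raw read() result.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open path, issue a single read() and return its result.
 * Returns -1 if the file cannot be opened or the read fails.
 */
ssize_t read_once(const char *path, void *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = read(fd, buf, len);
    close(fd);
    return n;
}
```

With the module loaded, a call such as read_once("/dev/tcopy", buf, sizeof(buf)) would trigger one copy from the kernel thread.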

/devel/other :: Link / Comments (2)


Cache coherency in POHMELFS. Continue.

While moving home I thought a lot about cache coherency issues. While we believe that NFS has a coherent cache, since it is somewhat write-through, its cache is actually not synchronous: between object creation and the moment when other clients see the new object a lot of time can pass, for example when the client which creates the object has a slow link... So object creation and removal should not be synced to other clients during writeback on one of them. Instead, clients which are interested in an object perform a lookup, which may or may not return the object. This is not a race or cache non-coherency, this is a usual multithreaded environment without synchronization between clients.

What we really care about is data consistency on the server. When we have a multipage write which overlaps with another write from a different client, we should not read data back from the middle of the transaction. Locking the whole file is not a solution; instead proper byte-range (page-range actually) locking has to be implemented. I already have a prototype, but have to check it in real life.
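To illustrate what page-range granularity means, here is a small sketch (not the POHMELFS prototype) that converts a byte range into the pages it touches and checks whether two writes conflict. The 4096-byte page size and the names page_range, bytes_to_pages() and ranges_conflict() are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u  /* assumed page size */

struct page_range {
    uint64_t first;  /* index of the first page touched */
    uint64_t last;   /* index of the last page touched, inclusive */
};

/* Convert the byte range [offset, offset + len), len > 0, into pages. */
struct page_range bytes_to_pages(uint64_t offset, uint64_t len)
{
    struct page_range r;

    r.first = offset / TOY_PAGE_SIZE;
    r.last = (offset + len - 1) / TOY_PAGE_SIZE;
    return r;
}

/* Two writes conflict when their page ranges share at least one page. */
bool ranges_conflict(struct page_range a, struct page_range b)
{
    return a.first <= b.last && b.first <= a.last;
}
```

Note that two writes to disjoint bytes of the same page still conflict at this granularity, which is the price of locking pages rather than bytes.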

So, other competing projects may or may not follow my way and, based on my analysis, drop creation/removal/stat coherency from the TODO list (afaics, no one has implemented that yet :) and concentrate on server read/write locking.

And I will start some bits of VM hacking: the plan is to implement a generic enough (well, working on x86 for a start :) mechanism to copy data to userspace via copy_to_user() from a different thread (i.e. not the one which started the syscall), while the original one sleeps in the syscall. Likely it will be somewhat similar to what I did for the zero-copy userspace sniffer and to how get_user_pages() works.
The result, which has to be as fast as the usual copy_to_user() (otherwise it is not an interesting solution), will be used in the POHMELFS client and its async reading.

/devel/fs :: Link / Comments (6)