Tue, 22 Apr 2008
Debunked copy_to_user() from kernel thread problem.
It happened to be really trivial. Not even any VM hacking :(
First, some background on how copy_to_user() works on x86.
Its asm looks pretty simple (and it is very small, check
arch/x86/lib/usercopy_32.c:__copy_user()),
so I always wondered how it can handle a missing-page exception
when the userspace page has been swapped out.
The magic lives in a small part of the function: .section __ex_table.
Each entry in this table contains two values: the place where the exception happened and the fixup address
(both are just instruction positions). The linker puts this table into a special section,
accessible by the page fault handler do_page_fault(). In some
cases the page fault path is never executed: the code just searches for the page and locks it,
even if it is already in the page table (that is why get_user_pages()
is at best as fast as copy_to_user()). This happens when
the WP bit is not set and does not work
(only a speculation though, derived from __copy_to_user_ll()
and the Intel F00F bug errata).
When the WP bit works, we have the usual copy_to_user(), which will
fault if there is no destination page, and do_page_fault() eventually
will be called. After a number of checks the system determines that the exception
happened in kernel mode, and if there is an exception table entry as described above (which is true for
copy_to_user()), it tries to fix things up.
Here we come to essentially the same code that is called in get_user_pages():
we locate the VMA for the faulting address and insert a new page into the page table. This involves allocation
of all those strange 3-letter abbreviations: pgd, pud, pmd and pte ('and' is not a VMM abbreviation yet).
I know what two or three of them mean, but completely forgot pud; with a 4-level page table
it is hard to recall which two are the same, since iirc x86 has only 3 levels.
If the page was swapped out, it will be brought back, and eventually the fault handler will
try to fix things up via fixup_exception(), which will
replace EIP with the appropriate value from the table described above, so that
the CPU will return to the __copy_user() code and continue (or not, depending
on whether the page exists) its execution.
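The lookup-and-redirect step described above can be modelled in plain userspace C. This is only a toy sketch, not kernel code: the names toy_search_extable(), toy_fixup_exception(), struct ex_entry and struct toy_regs are made up for illustration, and the table is assumed to be sorted by faulting address so it can be binary searched.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of one __ex_table entry: the address of an instruction that
 * may fault, and the address to resume at if it does. */
struct ex_entry {
    uintptr_t insn;   /* position of the possibly faulting instruction */
    uintptr_t fixup;  /* position to continue from after the fault */
};

/* Hypothetical register snapshot; only the instruction pointer matters. */
struct toy_regs {
    uintptr_t ip;
};

/* Binary search over a table sorted by insn; NULL if there is no entry. */
static const struct ex_entry *toy_search_extable(const struct ex_entry *tbl,
                                                 size_t n, uintptr_t addr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tbl[mid].insn == addr)
            return &tbl[mid];
        if (tbl[mid].insn < addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

/* Returns 1 and redirects regs->ip if a fixup exists for the faulting
 * instruction, 0 (leaving regs untouched) otherwise. */
static int toy_fixup_exception(const struct ex_entry *tbl, size_t n,
                               struct toy_regs *regs)
{
    const struct ex_entry *e = toy_search_extable(tbl, n, regs->ip);

    if (!e)
        return 0;
    regs->ip = e->fixup;
    return 1;
}
```

A fault at an address that has no entry in the table returns 0 here; in the real kernel that is the case where a kernel-mode fault cannot be recovered.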
So, how to hook into the above mechanism and allow a completely different process to write data
into userspace? Quite trivially: the above fixup (VMA searching and 3-letter abbreviation allocations)
happens for a particular mm_struct, which contains the VMA list, the page table lock
and other (likely very) essential information needed for memory management. This structure is obtained
from the current thread executing on the CPU, so by replacing the mm_struct in our kernel thread with
the userspace thread's one, we can safely copy data to and from userspace. There is a race of course,
when the userspace thread wants to access its own mm_struct (copied to the kernel thread), for example
by calling mmap() or copy_*_user() from the kernel, so we have to be careful and
properly guard against that.
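The idea of borrowing the mm can be shown with a deliberately tiny userspace model. Everything here is hypothetical (struct toy_mm, struct toy_task, toy_borrow_mm() and friends are not kernel names), and it ignores what real code must also do: switch the hardware page tables and guard against the race mentioned above.

```c
#include <stddef.h>

/* Stand-in for mm_struct: in the kernel it holds the VMA list, locks, etc. */
struct toy_mm {
    int id;
};

/* Stand-in for task_struct: a kernel thread normally has no mm of its own. */
struct toy_task {
    struct toy_mm *mm;
};

/* The fault path resolves addresses through the mm of the task that is
 * currently running, so this is what a fault in that task would use. */
static struct toy_mm *toy_fault_mm(const struct toy_task *current_task)
{
    return current_task->mm;
}

/* Point the kernel thread at the user thread's mm; returns the old value
 * so that it can be restored afterwards. */
static struct toy_mm *toy_borrow_mm(struct toy_task *kthread,
                                    struct toy_mm *user_mm)
{
    struct toy_mm *old = kthread->mm;

    kthread->mm = user_mm;
    return old;
}

/* Undo toy_borrow_mm(). */
static void toy_restore_mm(struct toy_task *kthread, struct toy_mm *old)
{
    kthread->mm = old;
}
```

Between toy_borrow_mm() and toy_restore_mm(), a fault taken by the kernel thread would be resolved against the user thread's address space, which is the whole trick. Later mainline kernels ship helpers for this pattern (use_mm(), renamed kthread_use_mm() in 5.8), which also take care of the page table switch.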
Example code which copies to userspace from a kernel thread can be found in the
archive. Just
replace the kernel path in the Makefile with your own, run make and insert the module.
Each read from the /dev/tcopy file will end up with a copy of data from kernel
to userspace performed in a dedicated kernel thread.
/devel/other
Cache coherency in POHMELFS. Continued.
While moving home I thought a lot about cache coherency issues.
While we believe that NFS has a coherent cache, since it is somewhat
write-through, its cache actually is not synchronous, since a lot of time
can pass between object creation and the moment when other clients see the new
object, for example when the client which creates the object has a
slow link... So, object creation and removal should not be synced to other
clients during writeback on one of them; instead, clients which are interested
in an object perform a lookup, which may or may not return the object. This is not a
race or cache non-coherency, this is a usual multithreaded environment without
synchronization between clients.
What we really care about is data consistency on the server. When we have a
multipage write which overlaps with another write from a different client,
we should not read data back from the middle of the transaction. Locking the
whole file is not the answer; instead, proper byte-range (page-range actually)
locking has to be implemented. I already have a
prototype,
but have to check it in real life.
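To make the page-range idea concrete, here is a minimal userspace sketch. It is not the prototype mentioned above: the names (toy_range_trylock() and so on), the fixed-size table and the 4 KiB page size are all assumptions, and the code is not thread-safe on its own (a real server would protect the table with its own lock).

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12   /* assumption: 4 KiB pages */
#define TOY_MAX_LOCKS  16   /* assumption: small fixed-size lock table */

struct toy_range {
    uint64_t first_page;
    uint64_t last_page;     /* inclusive */
    int used;
};

struct toy_lock_table {
    struct toy_range slots[TOY_MAX_LOCKS];
};

/* Round a byte range out to the pages it touches; len must be > 0. */
static void toy_bytes_to_pages(uint64_t off, uint64_t len,
                               uint64_t *first, uint64_t *last)
{
    *first = off >> TOY_PAGE_SHIFT;
    *last = (off + len - 1) >> TOY_PAGE_SHIFT;
}

/* Try to lock the pages covering [off, off + len). Returns a slot index
 * on success, or -1 if the range is empty, overlaps a held lock, or the
 * table is full. */
static int toy_range_trylock(struct toy_lock_table *t,
                             uint64_t off, uint64_t len)
{
    uint64_t first, last;
    int free_slot = -1;

    if (len == 0)
        return -1;
    toy_bytes_to_pages(off, len, &first, &last);

    for (int i = 0; i < TOY_MAX_LOCKS; i++) {
        const struct toy_range *r = &t->slots[i];

        if (!r->used) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        /* Two inclusive ranges overlap if each starts before the other ends. */
        if (first <= r->last_page && r->first_page <= last)
            return -1;
    }
    if (free_slot < 0)
        return -1;

    t->slots[free_slot].first_page = first;
    t->slots[free_slot].last_page = last;
    t->slots[free_slot].used = 1;
    return free_slot;
}

/* Release a lock previously returned by toy_range_trylock(). */
static void toy_range_unlock(struct toy_lock_table *t, int slot)
{
    if (slot >= 0 && slot < TOY_MAX_LOCKS)
        t->slots[slot].used = 0;
}
```

With this, two clients writing to different pages of the same file do not block each other, while a read that touches a page in the middle of somebody's multipage write fails to take the lock until that write is finished.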
So, other competing projects may or may not follow my way and drop
creation/removal/stat coherency from the TODO list (afaics, no one has implemented
that yet :) based on my analysis, and concentrate on server read/write locking.
And I will start some bits of VM hacking: the plan is to implement a generic enough
(well, working on x86 for a start :)
mechanism to copy data to userspace from a different thread (i.e. not the one which
started the syscall), while the original one sleeps in the syscall,
via copy_to_user(). Likely it will be somewhat similar to what
I did for the zero-copy userspace sniffer
and to how get_user_pages() works.
The result, which has to be as fast as the usual copy_to_user() (otherwise it is not
an interesting solution), will be used in the POHMELFS client and its async reading.
/devel/fs