Zbr's days.
April
Sun Mon Tue Wed Thu Fri Sat
   
22
     
2008
Months
Apr
Nov Dec

About :: TODO :: Blog :: RSS :: Old blog :: Projects :: GIT :: Gallery :: Notes

Tue, 22 Apr 2008

Debunked copy_to_user() from kernel thread problem.

It happend to be really trivial. Even no VM hacking :(

First, some background on how copy_to_user() works on x86.
Its asm looks pretty simple (and it is very small, check arch/x86/lib/usercopy_32.c:__copy_user()), so I always wondered how it can handle missing-page-exception, when userspace page was swapped out.

Things live in small part of the function: .section __ex_table, this table contains two values: place where exception happend, and fixup address (it is just instruction positions). Linker puts this table into special section, accessible by page fault handler do_page_fault(). In some cases page fault path is never executed, code just searches for page and locks it, even if it is already in the table (that is why get_user_pages() is at best as fast as copy_to_user()). This happens when WP bit is not set and does not work (a speculation only though, derived from __copy_to_user_ll() and Intel F00F bug errata).

When WP bit works, we have usual copy_to_user(), which will fault if there is no destination page, and do_page_fault() eventually will be called. After number of checks system determines that it is exception in kernel mode and if there is above exception table (which is true for copy_to_user()), it tries to fix things up.

Here we come to essentially the same code, what is called in get_user_pages(): we locate VMA for failed address and insert new page into page table, this involves allocation of all those strange 3-letters abbreviations: pgd, pud, pmd and pte ('and' is not VMM abbreviation yet), I know what two or three of them mean, but completely forgot pud, on 4 level page table it is hard to recall which two are the same, since iirc x86 has only 3 levels.
If page was swapped out, it will be brought back and eventually fault handler will try to fix things up via fixup_exception(), which will replace EIP with appropriate value from the section table described above, so that CPU will return back to __copy_user() code and continue (or not, depending on fact that page exists or not) its execution.

So, how to hook into above mechanism and allow completely different process to write data into userspace? Quite trivially: above fixup (VMA searching and 3-letters abbreviation allocations) happens for particular mm_struct, which contains VMA list, page table lock and other (likely very) essential information to handle memory management. This structure is obtained from the curent thread executed on the CPU, so by replacing mm_struct in our kernel thread with userspace thread's one, we can safely copy data to and from userspace. There is a race of course, when userspace thread will want to access its own mm_struct (copied to kernel thread) for example calling mmap() or copy_*_user() from kernel, so we have to be careful and properly guard against that.

Example code which does copy to userspace from kernel thread can be found in archive. Just replace kernel path in Makefile to your own, call make and insert module.
Each reading from /dev/tcopy file will end up with copy of data from kernel to userspace in dedicated kernel thread.

/devel/other :: Link / Comments (2)

Zbr wrote at 2008-04-22 21:14:

Main problem with this approach will shot you when system will schedule away your thread in the middle of the operation. With broken mm context it will explode. We could extend do_page_fault() to get mm context not from current thread, which is used for scheduling, but from special value in task_struct.

Marnix wrote at 2008-04-23 11:56:

If memory serves me right, pud stands for 'page upper directory'. It was introduced for Sparc with its 4 level lookup mechanism. For Intel and others it does nothing special.

Please solve this captcha to be allowed to post (need to reload in a minute): 6 - 31

Comments are closed for this story.