On Dec 20, 2020, at 1:54 AM, Yu Zhao yuzhao@google.com wrote:
On Sun, Dec 20, 2020 at 12:06:38AM -0800, Nadav Amit wrote:
On Dec 19, 2020, at 10:05 PM, Yu Zhao yuzhao@google.com wrote:
On Sat, Dec 19, 2020 at 01:34:29PM -0800, Nadav Amit wrote:
[ cc’ing some more people who have experience with similar problems ]
On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli aarcange@redhat.com wrote:
Hello,
On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
Analyzing this problem indicates that there is a real bug since mmap_lock is only taken for read in mwriteprotect_range(). This might
Never having to take the mmap_sem for writing, and in turn never blocking, in order to modify the pagetables is quite an important feature in uffd that justifies uffd instead of mprotect. It's not the most important reason to use uffd, but it'd be nice if that guarantee would remain also for the UFFDIO_WRITEPROTECT API, not only for the other pgtable manipulations.
Consider the following scenario with 3 CPUs (cpu2 is not shown):
cpu0 cpu1
userfaultfd_writeprotect() [ write-protecting ] mwriteprotect_range() mmap_read_lock() change_protection() change_protection_range() ... change_pte_range() [ defer TLB flushes] userfaultfd_writeprotect() mmap_read_lock() change_protection() [ write-unprotect ] ... [ unprotect PTE logically ] ... [ page-fault] ... wp_page_copy() [ set new writable page in PTE]
I don't see any problem in this example -- wp_page_copy() calls ptep_clear_flush_notify(), which should take care of the stale entry left by cpu0.
That being said, I suspect the memory corruption you observed is related this example, with cpu1 running something else that flushes conditionally depending on pte_write().
Do you know which type of pages were corrupted? file, anon, etc.
First, Yu, you are correct. My analysis is incorrect, but let me have another try (below). To answer your (and Andrea’s) question - this happens with upstream without any changes, excluding a small fix to the selftest, since it failed (got stuck) due to missing wake events. [1]
We are talking about anon memory.
So to correct myself, I think that what I really encountered was actually during MM_CP_UFFD_WP_RESOLVE (i.e., when the protection is removed). The problem was that in this case the “write”-bit was removed during unprotect.
Thanks. You are right about when the problem happens: UFD write- UNprotecting. But it's not UFD write-UNprotecting that removes the writable bit -- the bit can only be removed during COW or UFD write-protecting. So your original example was almost correct, except the last line describing cpu1.
The scenario is a bit confusing, so stay with me. The idea behind uffd unprotect is indeed only to mark the PTE logically as uffd-unprotected, and not to *set* the writable bit, allowing the #PF handler to do COW or whatever correctly upon #PF.
However, the problem that we have is that if a page is already writable, write-unprotect *clears* the writable bit, making it write-protected (at least for anonymous pages). This is not good from performance point-of-view, but also a correctness issue, as I pointed out.
In some more detail: mwriteprotect_range() uses vm_get_page_prot() to compute the new protection. For anonymous private memory, at least on x86, this means the write-bit in the protection is clear. So later, change_pte_range() *clears* the write-bit during *unprotection*.
That’s the reason the second part of my patch - the change to preserve_write - fixes the problem.
The problem is how do_wp_page() handles non-COW pages. (For COW pages, do_wp_page() works correctly by either reusing an existing page or make a new copy out of it.) In UFD case, the existing page may not have been properly write-protected. As you pointed out, the tlb flush may not be done yet. Making a copy can potentially race with the writer on cpu2.
Just to clarify the difference - You regard a scenario of UFFD write-protect, while I am pretty sure the problem I encountered is during write-unprotect.
I am not sure we are on the same page (but we may be). The problem I have is with cow_user_page() that is called by do_wp_page() before any TLB flush took place (either by change_protection_range() or by do_wp_page() which does flush, but after the copy).
Let me know if you regard a different scenario.
Should we fix the problem by ensuring integrity of the copy? IMO, no, because do_wp_page() shouldn't copy at all in this case. It seems it was recently broken by
be068f29034f mm: fix misplaced unlock_page in do_wp_page() 09854ba94c6a mm: do_wp_page() simplification
I haven't study them carefully. But if you could just revert them and run the test again, we'd know where exactly to look at next.
These patches regard the wp_page_reuse() case, which makes me think we are not on the same page. I do not see a problem with wp_page_reuse() since it does not make a copy of the page. If you can explain what I am missing, it would be great.