On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote:
> On Dec 19, 2020, at 1:34 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>> [ cc'ing some more people who have experience with similar problems ]
>>
>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>> Hello,
>>>
>>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
>>>> Analyzing this problem indicates that there is a real bug since
>>>> mmap_lock is only taken for read in mwriteprotect_range(). This might
>>> Never having to take the mmap_sem for writing, and in turn never
>>> blocking, in order to modify the pagetables is quite an important
>>> feature in uffd that justifies uffd instead of mprotect. It's not
>>> the most important reason to use uffd, but it'd be nice if that
>>> guarantee remained also for the UFFDIO_WRITEPROTECT API, not only
>>> for the other pgtable manipulations.
>>>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
>>>>
>>>> cpu0				cpu1
>>>> ----				----
>>>> userfaultfd_writeprotect()
>>>> [ write-protecting ]
>>>> mwriteprotect_range()
>>>> mmap_read_lock()
>>>> change_protection()
>>>> change_protection_range()
>>>> ...
>>>> change_pte_range()
>>>> [ defer TLB flushes ]
>>>>				userfaultfd_writeprotect()
>>>>				mmap_read_lock()
>>>>				change_protection()
>>>>				[ write-unprotect ]
>>>>				...
>>>>				[ unprotect PTE logically ]
>>> Is the uffd selftest failing with upstream or after your kernel
>>> modification that removes the tlb flush from unprotect?
>>>	} else if (uffd_wp_resolve) {
>>>		/*
>>>		 * Leave the write bit to be handled
>>>		 * by PF interrupt handler, then
>>>		 * things like COW could be properly
>>>		 * handled.
>>>		 */
>>>		ptent = pte_clear_uffd_wp(ptent);
>>>	}
>>> Upstream this will still do pages++, there's a tlb flush before
>>> change_protection can return here, so I'm confused.
>>>>				...
>>>>				[ page-fault ]
>>>>				...
>>>>				wp_page_copy()
>>>>				[ set new writable page in PTE ]
>>> Can't we check mm_tlb_flush_pending(vma->vm_mm) if
>>> MM_CP_UFFD_WP_ALL is set and do an explicit (potentially spurious)
>>> tlb flush before write-unprotect?
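To make the above concrete: the check I suggested would be something
along these lines in change_pte_range() (untested sketch; the
MM_CP_UFFD_WP flag check is implied by uffd_wp_resolve here):

	} else if (uffd_wp_resolve) {
		/*
		 * Sketch: if somebody else still has a deferred tlb
		 * flush pending on this mm, flush (potentially
		 * spuriously) before the PTE is logically
		 * write-unprotected, so no stale writable TLB entry
		 * can survive the unprotect.
		 */
		if (mm_tlb_flush_pending(vma->vm_mm))
			flush_tlb_page(vma, addr);
		ptent = pte_clear_uffd_wp(ptent);
	}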
>> There is a concrete scenario that I actually encountered and then
>> there is a general problem.
>>
>> In general, the kernel code assumes that PTEs that are read from the
>> page-tables are coherent across all the TLBs, excluding permission
>> promotion (i.e., the PTE may have higher permissions in the
>> page-tables than those that are cached in the TLBs).
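For reference, the tolerated direction is visible at the end of
handle_pte_fault() in mm/memory.c, where a write fault on a PTE that
is already writable in the page-tables is treated as spurious and
fixed up with a local flush, roughly:

	/* mm/memory.c, handle_pte_fault(), simplified */
	if (vmf->flags & FAULT_FLAG_WRITE)
		flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);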
>> We therefore need to both: (a) protect change_protection_range()
>> from the changes of others who might defer TLB flushes without
>> taking mmap_sem for write (e.g., try_to_unmap_one()); and (b)
>> protect others (e.g., page-fault handlers) from the concurrent
>> changes of change_protection().
>>
>> We have already encountered several similar bugs, and debugging such
>> issues is time consuming and their impact is substantial (memory
>> corruption, security). So I think we should stick only to general
>> solutions.
>> So perhaps the approach of your proposed solution is feasible, but
>> it would have to be applied all over the place: we would need to add
>> a check for mm_tlb_flush_pending() and conditionally flush the TLB
>> in every case in which PTEs are read and there might be an
>> assumption that the access-permissions reflect what the TLBs hold.
>> This includes page-fault handlers, but also the NUMA migration code
>> in change_protection(), the softdirty cleanup in clear_refs_write(),
>> and maybe others.
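Spelled out, that would mean a helper along these lines (name invented
for illustration, nothing like it is in the tree) called by every such
PTE reader:

	/*
	 * Hypothetical helper: flush (potentially spuriously) before
	 * trusting the access permissions just read from a PTE, in
	 * case someone else still has a deferred flush pending on
	 * this mm.
	 */
	static inline void ptep_sync_pending_flush(struct vm_area_struct *vma,
						   unsigned long addr)
	{
		if (mm_tlb_flush_pending(vma->vm_mm))
			flush_tlb_page(vma, addr);
	}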
>> [ I have in mind another solution, such as keeping in each
>> page-table a "table-generation" which is the mm-generation at the
>> time of the change, and only flushing if "table-generation" ==
>> "mm-generation", but it requires some thought on how to avoid adding
>> new memory barriers. ]
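As a rough illustration of that idea (all names hypothetical, and the
memory-barrier placement is exactly the part being glossed over):

	/*
	 * Each page-table page would record the mm-wide generation at
	 * the time of its last deferred-flush change; a reader only
	 * needs the spurious flush while that table's last change may
	 * still be unflushed.
	 */
	static inline bool table_flush_needed(u64 table_gen, u64 mm_gen)
	{
		return table_gen == mm_gen;
	}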
>> IOW: I think the change that you suggest is insufficient, and a
>> proper solution is too intrusive for "stable".
>>
>> As for performance, I can add another patch later to remove the TLB
>> flush that is unnecessarily performed during
>> change_protection_range() when it does permission promotion. I know
>> that your concern is about the "protect" case,
You may want to check flush_tlb_fix_spurious_fault for other archs before proceeding with your patch later, assuming it wasn't already applied.
>> but I cannot think of a good immediate solution that avoids taking
>> mmap_lock for write.
>>
>> Thoughts?
> On a second thought (i.e., I don't know what I was thinking), doing
> so (checking mm_tlb_flush_pending() on every PTE read which is
> potentially
Note the part "if MM_CP_UFFD_WP_ALL is set" (and probably it should be
just MM_CP_UFFD_WP).
> dangerous, and flushing if needed) can lead to a huge number of TLB
> flushes and shootdowns, as the counter might be elevated for a
> considerable amount of time.
>
> So this solution seems to me a no-go.
I don't share your concern. What matters is the PT lock, so it
wouldn't be one flush per pte but at least one per order-9 range;
still, let's assume one flush per pte.

It's either a huge mapping, and then it's likely running without other
tlb flushing in the background (postcopy snapshotting), or it's a
granular protect with distributed shared memory, in which case the
number of changed ptes or huge_pmds tends to be always 1 anyway. So it
doesn't matter if it's deferred.

I agree it may require a larger tlb flush review, not just mprotect,
but it didn't sound particularly complex. Note that
UFFDIO_WRITEPROTECT is still relatively recent, so backports won't
risk rejects so heavy as to require a band-aid.
My second thought is: I don't see exactly where the bug is, and it's
not clear if this reproduces upstream, but assuming it does, and even
ignoring everything else happening in the tlb flush code, this sounds
like something introduced purely by userfaultfd_writeprotect() racing
against userfaultfd_writeprotect() (it's the only place changing
protection with mmap_sem for reading; note that we already unmap and
flush the tlb with mmap_sem for reading in MADV_DONTNEED, MADV_FREE
clears the dirty bit, etc.). Flushing tlbs with mmap_sem for reading
is nothing new; the only new thing is the flush after wrprotect.
So instead of altering any tlb flush code, would it be possible to just stick to mmap_lock for reading and then serialize userfaultfd_writeprotect() against itself with an additional mm->mmap_wprotect_lock mutex? That'd be a very local change to userfaultfd too.
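Something like this in mwriteprotect_range(), just to show how local
it would stay (sketch only: mmap_wprotect_lock doesn't exist anywhere
yet, error handling and the vma lookup are omitted):

	mmap_read_lock(dst_mm);
	/* new: serialize uffd wrprotect only against uffd wrprotect */
	mutex_lock(&dst_mm->mmap_wprotect_lock);
	newprot = enable_wp ?
		vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE)) :
		vm_get_page_prot(dst_vma->vm_flags);
	change_protection(dst_vma, start, start + len, newprot,
			  enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
	mutex_unlock(&dst_mm->mmap_wprotect_lock);
	mmap_read_unlock(dst_mm);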
Can you check whether the rule "mmap_sem for reading plus the new
mm->mmap_wprotect_lock mutex, or mmap_sem for writing, whenever
wrprotecting ptes" is enough to comply with the current tlb flushing
code, so as not to require any change non-local to uffd (modulo the
additional mutex)?
Thanks,
Andrea