From: Nadav Amit <namit@vmware.com>
Userfaultfd self-tests fail occasionally, indicating memory corruption.
Analyzing this problem indicates that there is a real bug since mmap_lock is only taken for read in mwriteprotect_range(). This might cause the TLBs to be in an inconsistent state due to the deferred batched TLB flushes.
Consider the following scenario with 3 CPUs (cpu2 is not shown):
	cpu0				cpu1
	----				----
	userfaultfd_writeprotect()
	[ write-protecting ]
	mwriteprotect_range()
	mmap_read_lock()
	change_protection()
	change_protection_range()
	...
	change_pte_range()
	[ defer TLB flushes]
					userfaultfd_writeprotect()
					 mmap_read_lock()
					 change_protection()
					 [ write-unprotect ]
					 ...
					 [ unprotect PTE logically ]
					...
					[ page-fault]
					...
					wp_page_copy()
					[ set new writable page in PTE]
At this point no TLB flush has taken place. cpu2 (not shown) might hold a stale writable PTE, which was cached in its TLB before cpu0 called userfaultfd_writeprotect(), and this stale PTE points to a different page than the one wp_page_copy() installed.
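For reference, the two racing operations above are plain UFFDIO_WRITEPROTECT ioctls. A minimal userspace sketch of the pattern (assuming uffd is a userfaultfd file descriptor whose range was registered with UFFDIO_REGISTER_MODE_WP, and addr/len are illustrative):

	#include <linux/userfaultfd.h>
	#include <sys/ioctl.h>

	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_WRITEPROTECT_MODE_WP,	/* write-protect */
	};
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);		/* cpu0 above */

	wp.mode = 0;					/* write-unprotect */
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);		/* cpu1 above */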
Therefore, write-protecting memory, even through userfaultfd, which does not change the vma, requires preventing any concurrent reader (i.e., the #PF handler) from reading PTEs from the page-tables while any CPU might still hold in its TLB a PTE with higher permissions for the same address. To do so, mmap_lock needs to be taken for write.
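This is sufficient because the #PF handler itself acquires mmap_lock for read before walking the page-tables. Roughly (simplified, architecture details elided):

	/* fault path, simplified */
	mmap_read_lock(mm);
	handle_mm_fault(vma, address, flags, regs);
	mmap_read_unlock(mm);

so taking mmap_lock for write in mwriteprotect_range() keeps all concurrent faults out until change_protection() has performed its deferred TLB flush.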
Surprisingly, memory unprotection using userfaultfd also poses a problem. Although unprotection is logically a promotion of PTE permissions, and should therefore not require a TLB flush, the current code can actually demote the permissions, and therefore does require a flush.
During unprotection of a userfaultfd-managed memory region, the PTE is not actually made writable; it is only marked "logically" writable, and actually making it writable is left for the #PF handler to do later. While this by itself is fine, the code currently also *removes* the write permission from the PTE, and it is this demotion that makes a TLB flush necessary.
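Concretely, the demotion happens because pte_modify() rebuilds the PTE from the (write-protected) page protection. A simplified sketch of the relevant flow in change_pte_range() with this fix applied (not a verbatim excerpt):

	oldpte = *pte;
	preserve_write = (prot_numa || uffd_wp_resolve) && pte_write(oldpte);

	ptent = pte_modify(oldpte, newprot);	/* may drop the write-bit */
	if (preserve_write)
		ptent = pte_mk_savedwrite(ptent); /* put it back: no demotion,
						   * so no TLB flush needed */
	if (uffd_wp_resolve)
		ptent = pte_clear_uffd_wp(ptent); /* clear the logical marker */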
To resolve these problems, acquire mmap_lock for write when write-protecting a userfaultfd region via the ioctl. Keep taking mmap_lock for read when unprotecting memory, but keep the write-bit set when resolving userfaultfd write-protection.
Cc: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/mprotect.c    |  3 ++-
 mm/userfaultfd.c | 15 +++++++++++++--
 2 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ab709023e9aa..c08c4055b051 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
-			bool preserve_write = prot_numa && pte_write(oldpte);
+			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
+					      pte_write(oldpte);
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9a3d451402d7..7423808640ef 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -652,7 +652,15 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	/* Does the address range wrap, or is the span zero-sized? */
 	BUG_ON(start + len <= start);
 
-	mmap_read_lock(dst_mm);
+	/*
+	 * Although we do not change the VMA, we have to ensure deferred TLB
+	 * flushes are performed before page-faults can be handled. Otherwise
+	 * we can get inconsistent TLB state.
+	 */
+	if (enable_wp)
+		mmap_write_lock(dst_mm);
+	else
+		mmap_read_lock(dst_mm);
 
 	/*
 	 * If memory mappings are changing because of non-cooperative
@@ -686,6 +694,9 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 
 	err = 0;
 out_unlock:
-	mmap_read_unlock(dst_mm);
+	if (enable_wp)
+		mmap_write_unlock(dst_mm);
+	else
+		mmap_read_unlock(dst_mm);
 	return err;
 }