Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

20 Dec 2020


      On Sat, Dec 19, 2020 at 1:34 PM Nadav Amit nadav.amit@gmail.com wrote:
...
[ cc’ing some more people who have experience with similar problems ]
...
On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli aarcange@redhat.com wrote:
Hello,
On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
...
Analyzing this problem indicates that there is a real bug since
mmap_lock is only taken for read in mwriteprotect_range(). This might
Never having to take the mmap_sem for writing, and in turn never
blocking, in order to modify the pagetables is quite an important
feature in uffd that justifies uffd instead of mprotect. It's not the
most important reason to use uffd, but it'd be nice if that guarantee
would remain also for the UFFDIO_WRITEPROTECT API, not only for the
other pgtable manipulations.
...
Consider the following scenario with 3 CPUs (cpu2 is not shown):
cpu0                         cpu1

userfaultfd_writeprotect()
[ write-protecting ]
mwriteprotect_range()
mmap_read_lock()
change_protection()
 change_protection_range()
  ...
  change_pte_range()
  [ defer TLB flushes]
                             userfaultfd_writeprotect()
                              mmap_read_lock()
                              change_protection()
                              [ write-unprotect ]
                              ...
                               [ unprotect PTE logically ]
                             ...
                             [ page-fault]
                             ...
                             wp_page_copy()
                             [ set new writable page in PTE]
Can't we check mm_tlb_flush_pending(vma->vm_mm) if MM_CP_UFFD_WP_ALL
is set and do an explicit (potentially spurious) tlb flush before
write-unprotect?
There is a concrete scenario that I actually encountered and then there is a
general problem.
In general, the kernel code assumes that PTEs that are read from the
page-tables are coherent across all the TLBs, excluding permission promotion
(i.e., the PTE may have higher permissions in the page-tables than those
that are cached in the TLBs).
We therefore need to both: (a) protect change_protection_range() from the
changes of others who might defer TLB flushes without taking mmap_sem for
write (e.g., try_to_unmap_one()); and (b) to protect others (e.g.,
page-fault handlers) from concurrent changes of change_protection().
We have already encountered several similar bugs, and debugging such issues
s time consuming and these bugs impact is substantial (memory corruption,
security). So I think we should only stick to general solutions.
So perhaps your the approach of your proposed solution is feasible, but it
would have to be applied all over the place: we will need to add a check for
mm_tlb_flush_pending() and conditionally flush the TLB in every case in
which PTEs are read and there might be an assumption that the
access-permission reflect what the TLBs hold. This includes page-fault
handlers, but also NUMA migration code in change_protection(), softdirty
cleanup in clear_refs_write() and maybe others.
I missed the beginning of this thread, but it looks to me like
userfaultfd changes PTEs with not locking except mmap_read_lock().  It
also calls inc_tlb_flush_pending(), which is very explicitly
documented as requiring the pagetable lock.  Those docs must be wrong,
because mprotect() uses the mmap_sem write lock, which is just fine,
but ISTM some kind of mutual exclusion with proper acquire/release
ordering is indeed needed.  So the userfaultfd code seems bogus.
I think userfaultfd either needs to take a real lock (probably doesn't
matter which) or the core rules about PTEs need to be rewritten.

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect