Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

21 Dec 2020

      ...
On Dec 20, 2020, at 9:25 PM, Nadav Amit nadav.amit@gmail.com wrote:
...
On Dec 20, 2020, at 9:12 PM, Yu Zhao yuzhao@google.com wrote:
On Sun, Dec 20, 2020 at 08:36:15PM -0800, Nadav Amit wrote:
...
...
On Dec 19, 2020, at 6:20 PM, Andrea Arcangeli aarcange@redhat.com wrote:
On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote:
...
...
On Dec 19, 2020, at 1:34 PM, Nadav Amit nadav.amit@gmail.com wrote:
[ cc’ing some more people who have experience with similar problems ]
> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli aarcange@redhat.com wrote:
> 
> Hello,
> 
> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
>> Analyzing this problem indicates that there is a real bug since
>> mmap_lock is only taken for read in mwriteprotect_range(). This might
> 
> Never having to take the mmap_sem for writing, and in turn never
> blocking, in order to modify the pagetables is quite an important
> feature in uffd that justifies uffd instead of mprotect. It's not the
> most important reason to use uffd, but it'd be nice if that guarantee
> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
> other pgtable manipulations.
> 
>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
>> 
>> cpu0				cpu1
>> ----				----
>> userfaultfd_writeprotect()
>> [ write-protecting ]
>> mwriteprotect_range()
>> mmap_read_lock()
>> change_protection()
>> change_protection_range()
>> ...
>> change_pte_range()
>> [ defer TLB flushes]
>> 				userfaultfd_writeprotect()
>> 				 mmap_read_lock()
>> 				 change_protection()
>> 				 [ write-unprotect ]
>> 				 ...
>> 				  [ unprotect PTE logically ]
Is the uffd selftest failing with upstream or after your kernel
modification that removes the tlb flush from unprotect?
Please see my reply to Yu. I was wrong in this analysis, and I sent a
correction to my analysis. The problem actually happens when
userfaultfd_writeprotect() unprotects the memory.
...
} else if (uffd_wp_resolve) {
   			/*
   			 * Leave the write bit to be handled
   			 * by PF interrupt handler, then
   			 * things like COW could be properly
   			 * handled.
   			 */
   			ptent = pte_clear_uffd_wp(ptent);
   		}
Upstraem this will still do pages++, there's a tlb flush before
change_protection can return here, so I'm confused.
You are correct. The problem I encountered with userfaultfd_writeprotect()
is during unprotecting path.
Having said that, I think that there are additional scenarios that are
problematic. Consider for instance madvise_dontneed_free() that is racing
with userfaultfd_writeprotect(). If madvise_dontneed_free() completed
removing the PTEs, but still did not flush, change_pte_range() will see
non-present PTEs, say a flush is not needed, and then
change_protection_range() will not do a flush, and return while
the memory is still not protected.
...
I don't share your concern. What matters is the PT lock, so it
wouldn't be one per pte, but a least an order 9 higher, but let's
assume one flush per pte.
It's either huge mapping and then it's likely running without other
tlb flushing in background (postcopy snapshotting), or it's a granular
protect with distributed shared memory in which case the number of
changd ptes or huge_pmds tends to be always 1 anyway. So it doesn't
matter if it's deferred.
I agree it may require a larger tlb flush review not just mprotect
though, but it didn't sound particularly complex. Note the
UFFDIO_WRITEPROTECT is still relatively recent so backports won't
risk to reject so heavy as to require a band-aid.
My second thought is, I don't see exactly the bug and it's not clear
if it's upstream reproducing this, but assuming this happens on
upstream, even ignoring everything else happening in the tlb flush
code, this sounds like purely introduced by userfaultfd_writeprotect()
vs userfaultfd_writeprotect() (since it's the only place changing
protection with mmap_sem for reading and note we already unmap and
flush tlb with mmap_sem for reading in MADV_DONTNEED/MADV_FREE clears
the dirty bit etc..). Flushing tlbs with mmap_sem for reading is
nothing new, the only new thing is the flush after wrprotect.
So instead of altering any tlb flush code, would it be possible to
just stick to mmap_lock for reading and then serialize
userfaultfd_writeprotect() against itself with an additional
mm->mmap_wprotect_lock mutex? That'd be a very local change to
userfaultfd too.
Can you look if the rule mmap_sem for reading plus a new
mm->mmap_wprotect_lock mutex or the mmap_sem for writing, whenever
wrprotecting ptes, is enough to comply with the current tlb flushing
code, so not to require any change non local to uffd (modulo the
additional mutex).
So I did not fully understand your solution, but I took your point and
looked again on similar cases. To be fair, despite my experience with these
deferred TLB flushes as well as Peter Zijlstra’s great documentation, I keep
getting confused (e.g., can’t we somehow combine tlb_flush_batched and
tlb_flush_pending ?)
As I said before, my initial scenario was wrong, and the problem is not
userfaultfd_writeprotect() racing against itself. This one seems actually
benign to me.
Nevertheless, I do think there is a problem in change_protection_range().
Specifically, see the aforementioned scenario of a race between
madvise_dontneed_free() and userfaultfd_writeprotect().
So an immediate solution for such a case can be resolve without holding
mmap_lock for write, by just adding a test for mm_tlb_flush_nested() in
change_protection_range():
  /*
* Only flush the TLB if we actually modified any entries
* or if there are pending TLB flushes.
*/
  if (pages || mm_tlb_flush_nested(mm))
          flush_tlb_range(vma, start, end);

To be fair, I am not confident I did not miss other problematic cases.
But for now, this change, with the preserve_write change should address the
immediate issues. Let me know if you agree.
Let me know whether you agree.
The problem starts in UFD, and is related to tlb flush. But its focal
point is in do_wp_page(). I'd suggest you look at function and see
what it does before and after the commits I listed, with the following
conditions
PageAnon(), !PageKsm(), !PageSwapCache(), !pte_write(),
page_mapcount() = 1, page_count() > 1 or PageLocked()
when it runs against the two UFD examples you listed.
Thanks for your quick response. I wanted to write a lengthy response, but I
do want to sleep on it. I presume page_count() > 1, since I have multiple
concurrent page-faults on the same address in my test, but I will check.
Anyhow, before I give a further response, I was just wondering - since you
recently dealt with soft-dirty issue as I remember - isn't this problematic
COW for non-COW page scenario, in which the copy races with writes to a page
which is protected in the PTE but not in all TLB, also problematic for
soft-dirty clearing?
Stupid me. You hold mmap_lock for write, so no, it cannot happen when clear
soft-dirty.

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect