Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

21 Dec 2020


      ...
On Dec 19, 2020, at 6:20 PM, Andrea Arcangeli aarcange@redhat.com wrote:
On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote:
...
...
On Dec 19, 2020, at 1:34 PM, Nadav Amit nadav.amit@gmail.com wrote:
[ cc’ing some more people who have experience with similar problems ]
...
On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli aarcange@redhat.com wrote:
Hello,
On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
...
Analyzing this problem indicates that there is a real bug since
mmap_lock is only taken for read in mwriteprotect_range(). This might
Never having to take the mmap_sem for writing, and in turn never
blocking, in order to modify the pagetables is quite an important
feature in uffd that justifies uffd instead of mprotect. It's not the
most important reason to use uffd, but it'd be nice if that guarantee
would remain also for the UFFDIO_WRITEPROTECT API, not only for the
other pgtable manipulations.
...
Consider the following scenario with 3 CPUs (cpu2 is not shown):
cpu0				cpu1

userfaultfd_writeprotect()
[ write-protecting ]
mwriteprotect_range()
mmap_read_lock()
change_protection()
change_protection_range()
...
change_pte_range()
[ defer TLB flushes]
   			userfaultfd_writeprotect()
   			 mmap_read_lock()
   			 change_protection()
   			 [ write-unprotect ]
   			 ...
   			  [ unprotect PTE logically ]
Is the uffd selftest failing with upstream or after your kernel
modification that removes the tlb flush from unprotect?
Please see my reply to Yu. I was wrong in this analysis, and I sent a
correction to my analysis. The problem actually happens when
userfaultfd_writeprotect() unprotects the memory.
...
	} else if (uffd_wp_resolve) {
		/*
		 * Leave the write bit to be handled
		 * by PF interrupt handler, then
		 * things like COW could be properly
		 * handled.
		 */
		ptent = pte_clear_uffd_wp(ptent);
	}


Upstraem this will still do pages++, there's a tlb flush before
change_protection can return here, so I'm confused.
You are correct. The problem I encountered with userfaultfd_writeprotect()
is during unprotecting path.
Having said that, I think that there are additional scenarios that are
problematic. Consider for instance madvise_dontneed_free() that is racing
with userfaultfd_writeprotect(). If madvise_dontneed_free() completed
removing the PTEs, but still did not flush, change_pte_range() will see
non-present PTEs, say a flush is not needed, and then
change_protection_range() will not do a flush, and return while
the memory is still not protected.
...
I don't share your concern. What matters is the PT lock, so it
wouldn't be one per pte, but a least an order 9 higher, but let's
assume one flush per pte.
It's either huge mapping and then it's likely running without other
tlb flushing in background (postcopy snapshotting), or it's a granular
protect with distributed shared memory in which case the number of
changd ptes or huge_pmds tends to be always 1 anyway. So it doesn't
matter if it's deferred.
I agree it may require a larger tlb flush review not just mprotect
though, but it didn't sound particularly complex. Note the
UFFDIO_WRITEPROTECT is still relatively recent so backports won't
risk to reject so heavy as to require a band-aid.
My second thought is, I don't see exactly the bug and it's not clear
if it's upstream reproducing this, but assuming this happens on
upstream, even ignoring everything else happening in the tlb flush
code, this sounds like purely introduced by userfaultfd_writeprotect()
vs userfaultfd_writeprotect() (since it's the only place changing
protection with mmap_sem for reading and note we already unmap and
flush tlb with mmap_sem for reading in MADV_DONTNEED/MADV_FREE clears
the dirty bit etc..). Flushing tlbs with mmap_sem for reading is
nothing new, the only new thing is the flush after wrprotect.
So instead of altering any tlb flush code, would it be possible to
just stick to mmap_lock for reading and then serialize
userfaultfd_writeprotect() against itself with an additional
mm->mmap_wprotect_lock mutex? That'd be a very local change to
userfaultfd too.
Can you look if the rule mmap_sem for reading plus a new
mm->mmap_wprotect_lock mutex or the mmap_sem for writing, whenever
wrprotecting ptes, is enough to comply with the current tlb flushing
code, so not to require any change non local to uffd (modulo the
additional mutex).
So I did not fully understand your solution, but I took your point and
looked again on similar cases. To be fair, despite my experience with these
deferred TLB flushes as well as Peter Zijlstra’s great documentation, I keep
getting confused (e.g., can’t we somehow combine tlb_flush_batched and
tlb_flush_pending ?)
As I said before, my initial scenario was wrong, and the problem is not
userfaultfd_writeprotect() racing against itself. This one seems actually
benign to me.
Nevertheless, I do think there is a problem in change_protection_range().
Specifically, see the aforementioned scenario of a race between
madvise_dontneed_free() and userfaultfd_writeprotect().
So an immediate solution for such a case can be resolve without holding
mmap_lock for write, by just adding a test for mm_tlb_flush_nested() in
change_protection_range():
/*
     * Only flush the TLB if we actually modified any entries
     * or if there are pending TLB flushes.
     */
        if (pages || mm_tlb_flush_nested(mm))
                flush_tlb_range(vma, start, end);
To be fair, I am not confident I did not miss other problematic cases.
But for now, this change, with the preserve_write change should address the
immediate issues. Let me know if you agree.
Let me know whether you agree.

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect