On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
On Jan 12, 2021, at 11:02 AM, Laurent Dufour <ldufour@linux.vnet.ibm.com> wrote:
On 12/01/2021 at 17:57, Peter Zijlstra wrote:
On Tue, Jan 12, 2021 at 04:47:17PM +0100, Laurent Dufour wrote:
On 12/01/2021 at 12:43, Vinayak Menon wrote:
Possibility of race against other PTE modifiers
1) Fork - We have seen a case of SPF racing with fork marking PTEs RO; that is described and fixed here: https://lore.kernel.org/patchwork/patch/1062672/
Right, that's exactly the kind of thing I was worried about.
2) mprotect - change_protection() in mprotect, which does the deferred flush, is marked under vm_write_begin()/vm_write_end(), so SPF bails out on faults on those VMAs (see the sketch after this list).
Sure, mprotect also changes vm_flags, so it really needs that anyway.
3) userfaultfd - mwriteprotect_range() is not protected, unlike (2) above. But SPF does not take UFFD faults.
4) hugetlb - hugetlb_change_protection() is called from mprotect and covered by (2) above.
5) Concurrent faults - SPF does not handle all faults, only anonymous page faults.
What happened to shared/file-backed stuff? ISTR I had that working.
File-backed mappings are not processed in a speculative way; there were options to handle some of them depending on the underlying file system, but that's still not done.
Shared anonymous mappings are also not yet handled in a speculative way (their vm_ops is not NULL).
Of those, do_anonymous_page() and do_swap_page() are NONE/NON-PRESENT->PRESENT transitions without a TLB flush. And I hope do_wp_page() with RO->RW is fine as well.
The tricky one is demotion, specifically write to non-write.
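To make the demotion hazard concrete, a hypothetical interleaving (mine, not from the thread):

/*
 *   CPU 0 (mprotect, deferred flush)   CPU 1 (write fault)
 *   --------------------------------   -------------------
 *   clears pte_write() on the PTE
 *                                      sees the read-only PTE,
 *                                      takes do_wp_page() and
 *                                      COWs the page
 *   flush_tlb_range()
 *
 * Between the PTE update and the flush, another CPU holding a stale
 * writable TLB entry can keep writing to the old page, and those
 * writes are lost once the fault on CPU 1 switches to the COW copy.
 * NONE->PRESENT and RO->RW transitions have no such window, since
 * they only add permissions.
 */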
I could not see a case where the speculative path cannot see a PTE update done via a fault on another CPU.
One you didn't mention is the NUMA balancing scanning crud; although I think that's fine, losing a PTE update there is harmless. But I've not thought overly hard on it.
That's a good point, I need to double check on that side.
You explained it fine. Indeed, SPF handles deferred TLB invalidation by marking the VMA through vm_write_begin()/vm_write_end(), as in the fork case you mentioned. Once the PTL is held and the VMA's seqcount has been checked, the PTE values read are valid.
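A fault-side sketch of that check; names loosely follow the SPF patchset (spf_handle_fault() itself is a made-up helper, and the real code differs in detail):

static vm_fault_t spf_handle_fault(struct vm_area_struct *vma,
				   pte_t *pte, spinlock_t *ptl)
{
	unsigned int seq;

	/* Odd count: a writer (fork, mprotect, ...) is inside its
	 * vm_write_begin()/vm_write_end() section, possibly with a
	 * TLB flush still deferred -- fall back to the normal path. */
	seq = raw_read_seqcount(&vma->vm_sequence);
	if (seq & 1)
		return VM_FAULT_RETRY;

	spin_lock(ptl);

	/* Recheck under the PTL: a writer that started after the
	 * sample above would have bumped vm_sequence; if it has not
	 * changed, the PTE values read under ptl are valid and no
	 * deferred flush is pending for this VMA. */
	if (read_seqcount_retry(&vma->vm_sequence, seq)) {
		spin_unlock(ptl);
		return VM_FAULT_RETRY;
	}

	/* ... handle the fault using *pte while holding ptl ... */
	spin_unlock(ptl);
	return 0;
}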
That should indeed work, but are we really sure we covered them all? Should we invest in better TLBI APIs to make sure we can't get this wrong?
That may be a good option to identify deferred TLB invalidations, but I've no clue what such an API would look like.
I will send an RFC soon for per-table deferred TLB flush tracking. The basic idea is to save a generation number in the page struct that tracks when a deferred PTE change took place, and to record whenever a TLB flush completes. In addition, other users - such as mprotect - would use the tlb_gather interface.
Unfortunately, due to limited space in the page struct, this would only be possible for 64-bit (and my implementation is only for x86-64).
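Roughly -- and only as a guess at the shape of that RFC; mm_struct_ext, tlb_flush_gen, and the helpers below are all hypothetical -- something like:

struct mm_struct_ext {			/* hypothetical mm_struct extension */
	atomic64_t tlb_flush_gen;	/* bumped when a TLB flush completes */
};

/* Record, in the page-table page's struct page, the generation a
 * deferred PTE change belongs to (64-bit only, for space reasons). */
static inline void note_deferred_pte_change(struct mm_struct_ext *mm,
					    u64 *pt_flush_gen)
{
	*pt_flush_gen = atomic64_read(&mm->tlb_flush_gen) + 1;
}

/* A fault handler can then tell whether this page table still has a
 * deferred flush outstanding. */
static inline bool pt_flush_pending(struct mm_struct_ext *mm,
				    u64 pt_flush_gen)
{
	return pt_flush_gen > atomic64_read(&mm->tlb_flush_gen);
}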
I don't want to discourage you, but I don't think this would end well. PPC doesn't necessarily follow the one-page-struct-per-table rule, and I've run into problems with this before while trying to do something similar.
I'd recommend per-VMA and per-category (unmapping, clearing the writable bit, clearing the dirty bit) tracking, which relies only on arch-independent data structures, i.e., vm_area_struct and mm_struct (see the sketch below).
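For illustration only -- the enum and struct below are hypothetical, sketching what per-category tracking in arch-independent structures might look like:

enum tlb_flush_category {
	TLB_FLUSH_UNMAP,	/* present -> not present */
	TLB_FLUSH_WRPROTECT,	/* writable -> read-only */
	TLB_FLUSH_CLEAN,	/* dirty -> clean */
	NR_TLB_FLUSH_CATEGORIES,
};

/* Kept in mm_struct (or vm_area_struct), generalizing the existing
 * mm->tlb_flush_pending counter to one counter per category. */
struct tlb_flush_pending_cats {
	atomic_t pending[NR_TLB_FLUSH_CATEGORIES];
};

static inline bool tlb_flush_pending_cat(struct tlb_flush_pending_cats *t,
					 enum tlb_flush_category cat)
{
	return atomic_read(&t->pending[cat]) > 0;
}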
It would still require doing the copying while holding the PTL, though.
IMO, this is unacceptable. Most archs don't support a per-table PTL, and even x86_64 can be configured to use a per-mm PTL. What if we want to support a larger page size in the future?
It seems to me the only way to solve the problem with self-explanatory code and without performance impact is to check mm_tlb_flush_pending() and the writable bit (and the two other cases I mentioned above) at the same time. Of course, this requires a lot of effort to audit the existing uses, clean them up, properly wrap them in new primitives, BUG_ON() all invalid cases, and document the exact workflow to prevent misuse.
I've mentioned the following before -- it only demonstrates the rough idea.
diff --git a/mm/memory.c b/mm/memory.c
index 5e9ca612d7d7..af38c5ee327e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 			goto unlock;
 	}
 	if (vmf->flags & FAULT_FLAG_WRITE) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
+			if (mm_tlb_flush_pending(vmf->vma->vm_mm))
+				flush_tlb_page(vmf->vma, vmf->address);
 			return do_wp_page(vmf);
+		}
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);