Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

21 Dec 2020


      ...
On Dec 21, 2020, at 2:30 PM, Peter Xu peterx@redhat.com wrote:
On Mon, Dec 21, 2020 at 01:49:55PM -0800, Nadav Amit wrote:
...
BTW: In general, I think that you are right, and that changing of PTEs
should not require taking mmap_lock for write. However, I am not sure
cow_user_page() is not the only one that poses a problem and whether a more
systematic solution is needed. If cow_user_pages() is the only problem, do
you think it is possible to do the copying while holding the PTL? It works
for normal-pages, but I am not sure whether special-pages pose special
problems.
Anyhow, this is an enhancement that we can try later.
AFAIU mprotect() is the only one who modifies the pte using the mmap write
lock.  NUMA balancing is also using read mmap lock when changing pte
protections, while my understanding is mprotect() used write lock only because
it manipulates the address space itself (aka. vma layout) rather than modifying
the ptes, so it needs to.
You are correct about NUMA balancing in general. Yet in practice it is not
an issue in our “use-case” since NUMA balancing preserves the write-bit.
...
At the pte level, it seems always to be the pgtable lock that serializes things.
So it's perfectly legal to me for e.g. a driver to modify ptes with the read
lock of mmap_sem, unless I'm severely mistaken.. as long as the pgtable lock is
taken when doing so.
If there's a driver that manipulated the ptes, changed the content of the page,
recover the ptes to origin, and all these happen right after wp_page_copy()
unlocked the pgtable lock but before wp_page_copy() retakes the same lock
again, we may face the same issue finding that the page got copied contains
corrupted data at last.  While I don't know what to blame on the driver either
because it seems to be exactly following the rules.
The driver would have to do so without flushing the TLB. Having said that,
the driver could have used inc_tlb_flush_pending() and batch flushes.
...
I believe changing into write lock would solve the race here because tlb
flushing would be guaranteed along the way, but I'm just a bit worried it's not
the best way to go..
It might be too big of a hammer. But the question that comes to my mind is,
if it is ok to change the PTEs without mmap_lock held for write, why
wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so
mprotect change of PTE would not be done with the write-lock)? If you did
so, you would have the same problem as the one we encountered (concurrent
protect-unprotect allow concurrent cow-#PF copying the wrong data).
So as an alternative solution, I can do copying under the PTL after
flushing, which seems to solve the problem. First copying (without a lock)
and then comparing seems to me as suboptimal - I do not think the benefit
(if there is one) of shortening the time in which the lock is taken - worth
the additional compare (and the complexity with special pages).
There are 2 problems in doing so:
1. I think that copy_user_highpage() and __copy_from_user_inatomic() can be
called while holding the PTL, but I am not sure.
2. For special pages we would need 2 TLB flushes: one to ensure the
write-bit is cleared, and a second one after we clear the PTE. We
can limit ourselves to soft-dirty/UFFD VMAs, but if we have your
hypothetical driver, this would not be good enough.

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect