On Dec 21, 2020, at 2:30 PM, Peter Xu peterx@redhat.com wrote:
On Mon, Dec 21, 2020 at 01:49:55PM -0800, Nadav Amit wrote:
BTW: In general, I think that you are right, and that changing of PTEs should not require taking mmap_lock for write. However, I am not sure cow_user_page() is not the only one that poses a problem and whether a more systematic solution is needed. If cow_user_pages() is the only problem, do you think it is possible to do the copying while holding the PTL? It works for normal-pages, but I am not sure whether special-pages pose special problems.
Anyhow, this is an enhancement that we can try later.
AFAIU mprotect() is the only one who modifies the pte using the mmap write lock. NUMA balancing is also using read mmap lock when changing pte protections, while my understanding is mprotect() used write lock only because it manipulates the address space itself (aka. vma layout) rather than modifying the ptes, so it needs to.
You are correct about NUMA balancing in general. Yet in practice it is not an issue in our “use-case” since NUMA balancing preserves the write-bit.
At the pte level, it seems always to be the pgtable lock that serializes things.
So it's perfectly legal to me for e.g. a driver to modify ptes with the read lock of mmap_sem, unless I'm severely mistaken.. as long as the pgtable lock is taken when doing so.
If there's a driver that manipulated the ptes, changed the content of the page, recover the ptes to origin, and all these happen right after wp_page_copy() unlocked the pgtable lock but before wp_page_copy() retakes the same lock again, we may face the same issue finding that the page got copied contains corrupted data at last. While I don't know what to blame on the driver either because it seems to be exactly following the rules.
The driver would have to do so without flushing the TLB. Having said that, the driver could have used inc_tlb_flush_pending() and batch flushes.
I believe changing into write lock would solve the race here because tlb flushing would be guaranteed along the way, but I'm just a bit worried it's not the best way to go..
It might be too big of a hammer. But the question that comes to my mind is, if it is ok to change the PTEs without mmap_lock held for write, why wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so mprotect change of PTE would not be done with the write-lock)? If you did so, you would have the same problem as the one we encountered (concurrent protect-unprotect allow concurrent cow-#PF copying the wrong data).
So as an alternative solution, I can do copying under the PTL after flushing, which seems to solve the problem. First copying (without a lock) and then comparing seems to me as suboptimal - I do not think the benefit (if there is one) of shortening the time in which the lock is taken - worth the additional compare (and the complexity with special pages).
There are 2 problems in doing so:
1. I think that copy_user_highpage() and __copy_from_user_inatomic() can be called while holding the PTL, but I am not sure.
2. For special pages we would need 2 TLB flushes: one to ensure the write-bit is cleared, and a second one after we clear the PTE. We can limit ourselves to soft-dirty/UFFD VMAs, but if we have your hypothetical driver, this would not be good enough.