On Mon, Dec 21, 2020 at 3:22 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Mon, Dec 21, 2020 at 2:30 PM Peter Xu peterx@redhat.com wrote:
AFAIU mprotect() is the only one who modifies the pte using the mmap write lock. NUMA balancing is also using read mmap lock when changing pte protections, while my understanding is mprotect() used write lock only because it manipulates the address space itself (aka. vma layout) rather than modifying the ptes, so it needs to.
So it's ok to change the pte holding only the PTE lock, if it's a *one*way* conversion.
That doesn't break the "re-check the PTE contents" model (which predates _all_ of the rest: NUMA, userfaultfd, everything - it's pretty much the original model for our page table operations, and goes back to the dark ages even before SMP and the existence of a page table lock).
So for example, a COW will always create a different pte (not just because the page number itself changes - you could imagine a page getting re-used and changing back - but because it's always a RO->RW transition).
So two COW operations cannot "undo" each other and fool us into thinking nothing changed.
Anything that changes RW->RO - like fork(), for example - needs to take the mmap_lock.
Ugh, this is unpleasantly complicated. I will admit that any API that takes an address and more-or-less-blindly marks it RO makes me quite nervous even assuming all the relevant locks are held. At least userfaultfd refuses to operate on VM_SHARED VMAs, but we have another instance of this (with mmap_sem held for write) in x86: mark_screen_rdonly(). Dare I ask how broken this is? We could likely get away with deleting it entirely.
--Andy