On Mon, Dec 21, 2020 at 2:30 PM Peter Xu <peterx@redhat.com> wrote:
> AFAIU mprotect() is the only one who modifies the pte using the mmap write lock. NUMA balancing is also using read mmap lock when changing pte protections, while my understanding is mprotect() used write lock only because it manipulates the address space itself (aka. vma layout) rather than modifying the ptes, so it needs to.
So it's ok to change the pte holding only the PTE lock, if it's a *one*way* conversion.
That doesn't break the "re-check the PTE contents" model (which predates _all_ of the rest: NUMA, userfaultfd, everything - it's pretty much the original model for our page table operations, and goes back to the dark ages even before SMP and the existence of a page table lock).
So for example, a COW will always create a different pte (not just because the page number itself changes - you could imagine a page getting re-used and changing back - but because it's always a RO->RW transition).
So two COW operations cannot "undo" each other and fool us into thinking nothing changed.
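To make the one-way argument concrete, here is a minimal user-space sketch, not kernel code. The PTE layout (bit 0 as the write bit, page frame number above bit 12) and all the `cow_*` names are assumptions made up for illustration. It shows that a COW result can never compare equal to the read-only snapshot it started from, even if the same page frame number were reused.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PTE layout, for illustration only:
 * bit 0 = writable, bits 12 and up = page frame number. */
typedef uint64_t cow_pte_t;
#define COW_PTE_WRITE     ((cow_pte_t)1)
#define COW_PTE_PFN_SHIFT 12

static cow_pte_t cow_make_pte(uint64_t pfn, bool writable)
{
    return ((cow_pte_t)pfn << COW_PTE_PFN_SHIFT) |
           (writable ? COW_PTE_WRITE : 0);
}

static bool cow_pte_writable(cow_pte_t pte)
{
    return (pte & COW_PTE_WRITE) != 0;
}

/* The "re-check the PTE contents" step: compare the value loaded
 * earlier against the value seen after taking the PTE lock. */
static bool cow_pte_unchanged(cow_pte_t snapshot, cow_pte_t current)
{
    return snapshot == current;
}

/* COW only applies to a read-only PTE and always yields a writable
 * one, so it is a one-way RO->RW transition. */
static cow_pte_t cow_break(cow_pte_t pte, uint64_t new_pfn)
{
    assert(!cow_pte_writable(pte));
    return cow_make_pte(new_pfn, true);
}
```

Because `cow_break()` refuses a writable PTE, a second COW cannot run on the result of the first one, and so cannot restore the original value behind the re-check's back.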
Anything that changes RW->RO - like fork(), for example - needs to take the mmap_lock.
NUMA balancing should be ok wrt COW, because it doesn't do that RW->RO thing, it uses the present bit.
I think that you are right that NUMA balancing itself might cause other issues, because it can cause that "pte changed and then came back" (a NUMA protection followed by a NUMA fault), all with just the mmap lock held for reading.
However, even that shouldn't matter for COW, because the write protect bit is the one that protects the *contents* of the page, so even if NUMA balancing caused that "load original PTE, then re-check later" to succeed (despite the PTE actually changing in the middle), the _contents_ of the page cannot have changed, so COW is ok. NUMA balancing won't be making a read-only page temporarily writable.
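The same point as a small user-space sketch, again not kernel code. It assumes a simplified PTE with one "present" bit and one "write" bit, and the `numa_*` names are hypothetical. A NUMA protection followed by a NUMA fault brings the PTE back to its original value, so the re-check passes, but the page is never writable at any step in between.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical PTE bits for illustration only. */
typedef uint64_t numa_pte_t;
#define NUMA_PTE_PRESENT ((numa_pte_t)1 << 0)
#define NUMA_PTE_WRITE   ((numa_pte_t)1 << 1)

/* NUMA hinting: make the PTE fault on the next access. */
static numa_pte_t numa_protect(numa_pte_t pte)
{
    return pte & ~NUMA_PTE_PRESENT;
}

/* NUMA fault: restore access. The write bit is left untouched. */
static numa_pte_t numa_fault(numa_pte_t pte)
{
    return pte | NUMA_PTE_PRESENT;
}

/* Writable here means both present and write-enabled. */
static bool numa_pte_writable(numa_pte_t pte)
{
    const numa_pte_t both = NUMA_PTE_PRESENT | NUMA_PTE_WRITE;
    return (pte & both) == both;
}

/* True if the protect/fault round trip goes unnoticed by the
 * re-check and the page was never writable along the way. */
static bool numa_round_trip_is_harmless(numa_pte_t snapshot)
{
    assert(snapshot & NUMA_PTE_PRESENT);

    numa_pte_t hinted = numa_protect(snapshot);
    numa_pte_t restored = numa_fault(hinted);

    bool recheck_passes = (restored == snapshot);
    bool ever_writable = numa_pte_writable(hinted) ||
                         numa_pte_writable(restored);
    return recheck_passes && !ever_writable;
}
```

For a read-only, present snapshot the function returns true: the PTE did change and come back, but since nothing could write to the page in the meantime, a COW that relied on the snapshot still copies the right contents.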
Linus