On Dec 21, 2020, at 7:19 PM, Andy Lutomirski luto@kernel.org wrote:
On Mon, Dec 21, 2020 at 3:22 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Mon, Dec 21, 2020 at 2:30 PM Peter Xu peterx@redhat.com wrote:
AFAIU mprotect() is the only one who modifies the pte using the mmap write lock. NUMA balancing is also using read mmap lock when changing pte protections, while my understanding is mprotect() used write lock only because it manipulates the address space itself (aka. vma layout) rather than modifying the ptes, so it needs to.
So it's ok to change the pte holding only the PTE lock, if it's a *one*way* conversion.
That doesn't break the "re-check the PTE contents" model (which predates _all_ of the rest: NUMA, userfaultfd, everything - it's pretty much the original model for our page table operations, and goes back to the dark ages even before SMP and the existence of a page table lock).
So for example, a COW will always create a different pte (not just because the page number itself changes - you could imagine a page getting re-used and changing back - but because it's always a RO->RW transition).
So two COW operations cannot "undo" each other and fool us into thinking nothing changed.
Anything that changes RW->RO - like fork(), for example - needs to take the mmap_lock.
Ugh, this is unpleasantly complicated. I will admit that any API that takes an address and more-or-less-blindly marks it RO makes me quite nervous even assuming all the relevant locks are held. At least userfaultfd refuses to operate on VM_SHARED VMAs, but we have another instance of this (with mmap_sem held for write) in x86: mark_screen_rdonly(). Dare I ask how broken this is? We could likely get away with deleting it entirely.
If you only look at the function in isolation, it seems broken. It should have flushed the TLB before releasing the mmap_lock. After the mmap_write_unlock() and before the actual flush, a #PF on another thread can happen, and a similar scenario to the one that is mentioned in this thread (copying while a stale PTE in the TLBs is not-writeprotected) might happen.
Having said that, I do not know this code and the context in which this function is called, so I do not know whether there are other mitigating factors.
Funny, I had a deja-vu and indeed you have already raised (other) TLB issues with mark_screen_rdonly() 3 years ago. At the time you said "I'd like to delete it.” [1]