On Tue, Dec 22, 2020 at 3:50 PM Linus Torvalds torvalds@linux-foundation.org wrote:
The rule is that the TLB flush has to be done before the page table lock is released.
I take that back. I guess it's ok as long as the mmap_sem is held for writing. Then the TLB flush can be delayed until just before releasing the mmap_sem. I think.
The stale TLB entries still mean that somebody else can write through them in another thread, but as long as anybody who actually unmaps the page (and frees it - think rmap etc) is being careful, mprotect() itself can probably afford to be a bit laissez-faire.
So mprotect() itself should be ok, I think, because it takes things for writing.
Even with the mmap_sem held for writing, truncate and friends can see the read-only page table entries (because they can look things up using the file i_mmap thing instead), but then they rely on the page table lock and they'll also be careful if they then change that PTE and will force their own TLB flushes.
So I think a pending TLB flush outside the page table lock is fine - but once again only if you hold the mmap_sem for writing. Not for reading, because then the page tables need to be synchronized with the TLB so that other readers don't see the not-yet-synchronized state.
It once again looks like it's just userfaultfd that would trigger this due to the read-lock on the mmap_sem. And mprotect() itself is fine.
Am I missing something?
But apparently Nadav sees problems even with that lock changed to a write lock. Navad?
Linus