On Tue, Dec 22, 2020 at 12:19:49PM -0800, Nadav Amit wrote:
Perhaps any change to PTE in a page-table should increase a page-table generation that we would save in the page-table page-struct (next to the
The current rule is that by the time in the page fault we find a pte/hugepmd in certain state under the "PT lock", the TLB cannot be left stale with a more permissive state at that point.
So if under "PT lock" the page fault sees !pte_write, it means no other CPU can possibly modify the page anymore.
That must be enforced at all times, no sequence number is required if you enforce that and the whole point of the fix is to enforce that.
This is how things always worked in the page fault and it's perfectly fine.
For the record I never suggested to change anything in the page fault code.
So the only way we can leave stale TLB after dropping PT lock with the mmap_read_lock and concurrent page faults is with a marker:
1) take PT lock 2) leave in the pte/hugepmd a unique identifiable marker 3) change the pte to downgrade permissions 4) drop PT lock and leave stale TLB entries with the upgraded permissions
In point 3) the pte is built in a way that is guaranteed to trigger a deterministic path in the page fault. And in the way of that deterministic path you put the marker check for 2) to send the fault to a dead-end where the stale TLB is actually irrelevant and harmless.
No change to the page fault is needed if you only enforce the above.
Even if we'd take the mmap_write_lock in userfaultfd_writeprotect, you will still have to deal with the above technique because of change_prot_numa().
Would you suggest to add a sequence counter and to change all pte_same in order to make change_prot_numa safer or is it safe enough already using the above technique that checks the marker in the page fault?
To be safe NUMA balancing is using the same mechanism above of sending the page fault into a dead end, in order to call the very same function (change_permissions) with the very same lock (mmap_read_lock) as userfaultfd_writeprotect.
What happens then in do_numa_page then is different than what happens in handle_userfault, but both are a dead ends as far as the page fault code is concerned and they will never come back to the page fault code. That's how they avoid breaking the page fault.
The final rule for the above logic to work safe for uffd, is that the marker cannot be cleared until after the deferred TLB flush is executed (that's where the window for the race existed, and the fix closed such window).
do_numa_page differs in clearing the marker while the TLB flush still pending, but it can do that because it puts the same exact pte value (with the upgraded permissions) that was there before it put the marker in the pte. In turn do_numa_page makes the pending TLB flush becomes a noop and it doesn't need to wait for it before removing the marker. Does that last difference in how the marker is cleared, warrant to consider what NUMA balancing radically different so to forbid userfaultfd_writeprotect to use the same logic in the very same function with the very same lock in order to decrease the VM complexity? In my view no.
Thanks, Andrea