On Wed, Dec 23, 2020 at 10:52:35AM -0500, Peter Xu wrote:
On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
In your patch, do we need to take wrprotect_rwsem in handle_userfault() as well? Otherwise, it seems userspace would have to synchronize between its wrprotect ioctl and fault handler? i.e., the fault handler needs to be aware that the content of write-protected pages can actually change before the ioctl returns.
The handle_userfault() thread should be sleeping until another uffd_wp_resolve fixes the page fault for it. However, when the uffd_wp_resolve ioctl comes, the rwsem (either the group rwsem lock as Andrea proposed, or the mmap_sem, or any new rwsem lock we'd like to introduce, maybe per-uffd rather than per-mm) should have guaranteed that the previous wr-protect ioctls are finished and the tlb has been flushed before this thread continues.
And I don't know why it matters even if the data changed - IMHO what uffd-wp
The data will change indeed and it's fine.
wants to do is simply to make sure that after the wr-protect ioctl returns to userspace, no change to the page can happen anymore. So "whether data changed" seems to matter more to the ioctl thread than to the handle_userfault() thread. IOW, I think data changing before the tlb flush but after the pte wr-protect is always fine - but it's not fine anymore once the syscall returns.
Agreed.
From the userland point of view, all that matters is that writes
through stale TLB entries stop in both of these cases:
1) before returning from the UFFDIO_WRITEPROTECT(mode_wp = true) ioctl syscall
2) before a parallel UFFDIO_WRITEPROTECT(mode_wp = false) can clear the _PAGE_UFFD_WP marker in the pte/hugepmd under the PT lock, assuming the syscall at point 1) is still in flight
Both points are guaranteed at all times by the group lock now, so userland cannot even measure or perceive the existence of any stale TLB at any given time in the whole uffd-wp workload.
So it's perfectly safe and identical to NUMA balancing, and it requires zero extra locking in handle_userfault().
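To make the two points concrete, here is a single-threaded toy model of the ordering, not kernel code. Every name in it (toy_mm, toy_wp_begin, toy_wp_end, toy_wp_resolve) is hypothetical, and it assumes the group lock can be reduced to a counter of in-flight wrprotect ioctls. The real code would block the resolver instead of returning false.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state for one page; all fields are illustrative assumptions. */
struct toy_mm {
    int  wp_in_flight; /* wrprotect ioctls that have not returned yet */
    bool pte_wp;       /* stands in for _PAGE_UFFD_WP in the pte */
    bool tlb_stale;    /* some CPU may still hold a writable TLB entry */
};

/* UFFDIO_WRITEPROTECT(mode_wp = true) starts: the marker is set, the flush is pending. */
static void toy_wp_begin(struct toy_mm *mm)
{
    mm->wp_in_flight++;
    mm->pte_wp = true;
    mm->tlb_stale = true;
}

/* The same ioctl is about to return: flush first, then leave the group. */
static void toy_wp_end(struct toy_mm *mm)
{
    mm->tlb_stale = false;
    mm->wp_in_flight--;
}

/*
 * UFFDIO_WRITEPROTECT(mode_wp = false): it may clear the marker only when
 * no wrprotect is in flight. Returns false where the real code would wait.
 */
static bool toy_wp_resolve(struct toy_mm *mm)
{
    if (mm->wp_in_flight > 0)
        return false;
    mm->pte_wp = false;
    return true;
}
```

In this model the marker can never be cleared while tlb_stale is still true, which is point 2), and toy_wp_end() never returns with tlb_stale set, which is point 1).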
handle_userfault() is a dead end that simply waits, and when the time is right it restarts the page fault. It can have occasional false positives after f9bf352224d7d4612b55b8d0cd0eaa981a3246cf (false positive as in restarting too soon), but even then it's perfectly safe since it's equivalent to one more CPU hitting the page fault path. As long as the marker is there, any spurious userfault will re-enter handle_userfault().
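The retry behaviour can be sketched with a toy loop. This is an illustration only, not the kernel's fault path: toy_pte and toy_write_fault are hypothetical names, and a wakeup is modelled as a boolean that says whether it came from a real resolve (which clears the marker) or was spurious.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy pte with only the write-protect marker; illustrative, not kernel code. */
struct toy_pte {
    bool uffd_wp; /* stands in for _PAGE_UFFD_WP */
};

/*
 * Hypothetical model of a write fault on a uffd-wp page. wakeups[i] is true
 * when wakeup i came from a resolve that cleared the marker, and false when
 * it was spurious. Returns how many times the fault entered the wait path.
 * If the wakeups run out while the marker is still set, the last entry
 * counted is the one that would keep sleeping.
 */
static int toy_write_fault(struct toy_pte *pte, const bool *wakeups, int n)
{
    int entries = 0;
    int i = 0;

    while (pte->uffd_wp) {
        entries++;          /* marker still set: wait again */
        if (i >= n)
            break;          /* no wakeup left: still sleeping */
        if (wakeups[i++])
            pte->uffd_wp = false; /* real resolve cleared the marker */
    }
    return entries;
}
```

Two spurious wakeups followed by a real one cost three passes through the wait path and nothing else: the fault only completes once the marker is gone.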
handle_userfault() doesn't care about the data, so it couldn't care less about any stale TLB either. Userland cares, but userland cannot make any assumption about writes being fully stopped until the ioctl has returned anyway. By that time the pending flush will be done, and in fact by the time userland can make any assumption the mmap_write_lock will also have been released with the first proposed patch.
Thanks, Andrea