 
            On Tue, Dec 22, 2020 at 05:02:37PM -0500, Andrea Arcangeli wrote:
On Tue, Dec 22, 2020 at 02:14:41PM -0700, Yu Zhao wrote:
This works but I don't prefer this option because 1) this is new way of making pte_wrprotect safe and 2) it's specific to ufd and can't be applied to clear_soft_dirty() which has no "catcher". No
I didn't look into clear_soft_dirty issue, I can look into that.
To avoid having to deal with a 3rd solution it will have to use one of the two:
- avoid deferring tlb flush and enforce a sync flush before dropping
the PT lock even if mm_mm_tlb_flush_pending is true -> write_protect_page in KSM
- add its own new catcher for its own specific marker (for UFFD-WP the marker is _PAGE_UFFD_WP, for change_prot_numa is PROT_NONE on a vma->vm_pgprot not PROT_NONE, for soft dirty it could be _PAGE_SOFT_DIRTY) to send the page fault to a dead end before the pte value is interpreted.
matter how good the documentation about this new way is now, it will be out of date, speaking from my personal experience.
A common catcher for all 3 is not possible because each catcher depends on whatever marker and whatever pte value they set that may lead to a different deterministic path where to put the catcher or multiple paths even. do_numa_page requires a catcher in a different place already.
Or well, a common catcher for all 3 is technically possible but it'd perform slower requiring to check things twice.
But perhaps the soft_dirty can use the same catcher of uffd-wp given the similarity?
I'd go with what Nadav has -- the memory corruption problem has been there for a while and nobody has complained except Nadav. This makes me think people is less likely to notice any performance issues from holding mmap_lock for write in his patch either.
uffd-wp is a fairly new feature, the current users are irrelevant, keeping it optimal is just for the future potential.
But I can't say I have zero concern with the potential performance impact given that I have been expecting the fix to go to stable, which I care most. So the next option on my list is to have a
Actually stable would be very fine to go with Nadav patch and use the mmap_lock_write unconditionally. The optimal fix is only relevant for the future potential, so it's only relevant for Linus's tree.
However the feature is recent enough that it won't require a deep backport so the optimal fix is likely fine for stable as well, generally stable prefers the same fix as in the upstream when there's no major backport rejection issue.
The alternative solution for uffd is to do the deferred flush under mmap_lock_write if len is > HPAGE_PMD_SIZE, or to tell change_protection not to defer the flush and to take the mmap_lock_read for <= HPAGE_PMD_SIZE. That will avoid depending on the catcher and then userfaultfd_writeprotect(mode_wp=true) userfaultfd_writeprotect(mode_wp=false) can even run in parallel at all times. The cons is large userfaultfd_writeprotect will block for I/O and those would happen at least in the postcopy live snapshotting use case.
The main cons is that it'd require modification to change_protection so it actually looks more intrusive, not less.
Overall anything that allows to wrprotect 1 pte with only the mmap_lock_read exactly like KSM write_protect_page, would be enough for uffd-wp.
What isn't ok in terms of future potential is unconditional mmap_lock_write as in the original suggested patch in my view. It doesn't mean we can take mmap_lock_write when the operation is large and there is actually more benefit from deferring the flush.
common "catcher" in do_wp_page() which singles out pages that have page_mapcount equal to one and reuse them instead of trying to
I don't think the page_mapcount matters here. If the wp page reuse was more accurate (as it was before) we wouldn't notice this issue, but it still would be a bug that there were stale TLB entries. It worked by luck.
Can you please correct my understanding below? Thank you.
Generally speaking, we have four possible combinations relevant to this discussion: 1) anon, COW 2) anon, non-COW 3) file, COW 4) file, non-COW
Assume we pte_wrprotect while holding mmap_lock for read, all four combinations can be routed to do_wp_page(). The difference is that 1) and 3) are guaranteed to be RO when they become COW, and what we do on top of their existing state doesn't make any difference.
2) is the false positive because of what we do, and it's causing the memory corruption because do_wp_page() tries to make copies of pages that seem to be RO but may have stale RW tlb entries pending flush.
If we grab mmap_lock for write, we block the fault that tries to copy before we flush. Once it's done, we'll have two faulting CPUs, one that had the stale entry and would otherwise do the damage and the other that tries to copy. Then there is the silly part: we make a copy and swap it with the only user who already has it...
We are talking about non-COW anon pages here -- they can't be mapped more than once. So why not just identify them by checking page_mapcount == 1 and then unconditionally reuse them? (This is probably where I've missed things.)
4) may also be false positive (there are other legit cases relying on it), and it's correctly identified by !PageAnon() + VM_WRITE|VM_SHARED. For this category, we always reuse and we hitch a free ride.