On Mon, Dec 21, 2020 at 02:55:12PM -0800, Nadav Amit wrote:
> wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so
I assume you mean "in" mprotect_fixup, after change_protection.
If you downgraded the mmap_lock to read there, it would severely slow down the non-contention case whenever more than one vma needs change_protection.
You'd need to throw away the prev->vm_next info, and you'd need to do a new find_vma after dropping the mmap_lock for reading and re-taking the mmap_lock for writing at every iteration of the loop.
To do less harm to the non-contention case you could perhaps walk vma->vm_next, check if it's outside the mprotect range, and only downgrade in that case. So let's assume we intend to use mmap_write_downgrade() only for the last vma.
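As a rough illustration of the "only downgrade on the last vma" check, here is a minimal user-space sketch. It is not kernel code: struct toy_vma and is_last_vma_in_range() are hypothetical names made up for this example, and the struct only mimics the vm_start/vm_end/vm_next fields of the real vm_area_struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy stand-in for vm_area_struct (assumption, not kernel code). */
struct toy_vma {
	unsigned long vm_start;		/* first address of the vma */
	unsigned long vm_end;		/* first address past the vma */
	struct toy_vma *vm_next;	/* next vma in address order, or NULL */
};

/*
 * Return true if no vma after 'vma' overlaps a range ending at 'end'
 * (exclusive), i.e. 'vma' is the last one the loop would have to touch.
 * Only in that case would a downgrade to read be considered at all.
 */
static bool is_last_vma_in_range(const struct toy_vma *vma, unsigned long end)
{
	assert(vma != NULL);
	return !vma->vm_next || vma->vm_next->vm_start >= end;
}
```

With such a check the earlier vmas would still be processed under the write lock, so prev->vm_next stays usable and no extra find_vma is needed for them.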
The problem is that once you have had to take the mmap_lock for writing, you have already stalled for I/O, waited for all concurrent page faults, and blocked them as well for the vma allocations in split_vma, so the extra boost in SMP scalability you get is lost in the noise there at best.
And the risk is that, at worst, the extra locked op of mmap_write_downgrade() will hurt SMP scalability: it would increase the locked ops of mprotect on the hottest false-shared cacheline by 50% (three locked ops instead of two), and that may outweigh the benefit of unblocking the page faults half a usec sooner on large systems.
But the ultimate reason why mprotect cannot do mmap_write_downgrade(), while userfaultfd_writeprotect can take mmap_read_lock and avoid the mmap_write_lock altogether, is that mprotect leaves no mark in the pte/hugepmd that would allow detecting when the TLB is stale. Such a mark is what lets the page fault be redirected into a dead end (handle_userfault() or do_numa_page) until after the TLB has been flushed, as happens in the 4 cases below:

	/*
	 * STALE_TLB_WARNING: while the uffd_wp bit is set, the TLB
	 * can be stale. We cannot allow do_wp_page to proceed or
	 * it'll wrongly assume that nobody can still be writing to
	 * the page if !pte_write.
	 */
	if (userfaultfd_pte_wp(vma, *vmf->pte)) {

	/*
	 * STALE_TLB_WARNING: while the uffd_wp bit is set,
	 * the TLB can be stale. We cannot allow wp_huge_pmd()
	 * to proceed or it'll wrongly assume that nobody can
	 * still be writing to the page if !pmd_write.
	 */
	if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))

	/*
	 * STALE_TLB_WARNING: if the pte is NUMA protnone the TLB can
	 * be stale.
	 */
	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))

	/*
	 * STALE_TLB_WARNING: if the pmd is NUMA
	 * protnone the TLB can be stale.
	 */
	if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))

Thanks,
Andrea