On Wed, Dec 23, 2020 at 11:24:16AM -0500, Peter Xu wrote:
I think this is not against Linus's example - where cpu2 does not have tlb cached so it sees RO while cpu3 does have tlb cached so cpu3 can still modify it. So IMHO there's no problem here.
But I do think in step 2 here we overlooked _PAGE_UFFD_WP bit. Note that if it's uffd-wp wr-protection it's always applied along with removing of the write bit in change_pte_range():
if (uffd_wp) { ptent = pte_wrprotect(ptent); ptent = pte_mkuffd_wp(ptent); }
So instead of being handled as COW page do_wp_page() will always trap userfaultfd-wp first, hence no chance to race with COW.
COW could only trigger after another uffd-wp-resolve ioctl which could remove the _PAGE_UFFD_WP bit, but with Andrea's theory unprotect will only happen after all wr-protect completes, which guarantees that when reaching the COW path the tlb must has been flushed anyways. Then no one should be modifying the page anymore even without pgtable lock in COW path.
So IIUC what Linus proposed on "flushing tlb within pgtable lock" seems to work, but it just may cause more tlb flush than Andrea's proposal especially when the protection range is large (it's common to have a huge protection range for e.g. VM live snapshotting, where we'll protect all guest rw ram).
My understanding of current issue is that either we can take Andrea's proposal (although I think the group rwsem may not be extremely better than a per-mm rwsem, which I don't know... at least not worst than that?), or we can also go the other way (also as Andrea mentioned) so that when wr-protect:
for <=2M range (pmd or less), we take read rwsem, but flush tlb within pgtable lock
for >2M range, we take write rwsem directly but flush tlb once
I fully agree indeed.
I still have to fully understand how the clear_refs_write was fixed.
clear_refs_write has not even a "catcher" in the page fault but it clears the pte_write under mmap_read_lock and it doesn't flush before releasing the PT lock. It appears way more broken than userfaultfd_writeprotect ever was.
static inline void clear_soft_dirty(struct vm_area_struct *vma, unsigned long addr, pte_t *pte) { /* * The soft-dirty tracker uses #PF-s to catch writes * to pages, so write-protect the pte as well. See the * Documentation/admin-guide/mm/soft-dirty.rst for full description * of how soft-dirty works. */
https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-w...
"Although this is fine when simply aging the ptes, in the case of clearing the "soft-dirty" state we can end up with entries where pte_write() is false, yet a writable mapping remains in the TLB.
Fix this by avoiding the mmu_gather API altogether: managing both the 'tlb_flush_pending' flag on the 'mm_struct' and explicit TLB invalidation for the sort-dirty path, much like mprotect() does already"
NOTE: about the above comment, that mprotect takes mmap_read_lock. Your above code change in the commit above, still has mmap_read_lock, there's still no similarity with mprotect whatsoever.
Did you test what happens when clear_refs_write leaves writable stale TLB and you run do_wp_page copies the page while the other CPU still is writing to the page? Has clear_refs_write a selftest as aggressive and racy as the uffd selftest to reproduce that workload?
The race fixed with the group lock internally to uffd, is possible thanks to marker+catcher in NUMA balancing style? But that's not there for clear_refs_write even after the above commit.
It doesn't appear either that this patch is adding a synchronous tlb flush inside the PT lock either.
Overall, it would be unfair if clear_refs_write is allowed to do the same thing that userfaultfd_writeprotect has to do, but with the mmap_read_lock, if userfaultfd_writeprotect will be forced to take mmap_write_lock.
In my view there are 3 official ways to deal with this:
1) set a marker in the pte/hugepmd inside the PT lock, and add a catcher for the marker in the page fault to send the page fault to a dead end while there are stale TLB entries
cases: userfaultfd_writeprotect() and NUMA balancing
2) take mmap_write_lock
case: mprotect
3) flush the TLB before the PT lock release
case: KSM write_protect_page
Where does the below patch land in the 3 cases? It doesn't fit any of them and what it does looks still not sufficient to fix the userfault memory corruption.
https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-w...
It'd be backwards if the complexity of the kernel was increased for clear_refs_write, but forbidden for the O(1) API that delivers the exact same information (clear_refs_write API delivers that info in O(N)).
We went the last mile to guarantee uffd can be always used fully securely unprivileged and by default in current Linus's tree and in future Android out of the box. It's simply impossible with the current defaults, to use uffd to enlarge the any kernel user-after-free race window either, so even that concern is gone. From on those research testcases will stick to fuse instead I guess.
Thanks, Andrea