On Dec 22, 2020, at 11:31 AM, Andrea Arcangeli aarcange@redhat.com wrote:
From 4ace4d1b53f5cb3b22a5c2dc33facc4150b112d6 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli aarcange@redhat.com Date: Tue, 22 Dec 2020 14:30:16 -0500 Subject: [PATCH 1/1] mm: userfaultfd: avoid leaving stale TLB after userfaultfd_writeprotect()
change_protection() is called by userfaultfd_writeprotect() with the mmap_lock_read like in change_prot_numa().
The page fault code in wp_copy_page() rightfully assumes if the CPU issued a write fault and the write bit in the pagetable is not set, no CPU can write to the page. That's wrong assumption after userfaultfd_writeprotect(). That's also wrong assumption after change_prot_numa() where the regular page fault code also would assume that if the present bit is not set and the page fault is running, there should be no stale TLB entry, but there is still.
So to stay safe, the page fault code must be prevented to run as long as long as the TLB flush remains pending. That is already achieved by the do_numa_page() path for change_prot_numa() and by the userfaultfd_pte_wp() path for userfaultfd_writeprotect().
The problem that needs fixing is that an un-wrprotect (i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP not set) could run in between the original wrprotect (i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP set) and wp_copy_page, while the TLB flush remains pending.
I may need to read your patch more carefully, but in general I do not like the approach. You are much more experienced than I am, but IMHO the TLB flushing logic needs to be further simplified and generalized and not the other way around.
The complexity is already too high. We have tlb_flush_batched and tlb_flush_pending, which I think should be (somehow) combined. tlb_gather_mmu() is designed for zapping, but can’t it be modified to suit protection changes to avoid buggy use-cases (as the wrong use in clear_refs_write() ) ?
So adding new userfaultfd specific code, which potentially does not address all the interactions (now or the future), is concerning.
In this regard, a similar problem to the one in userfaultfd (mmap_read_lock() while write-protecting) already exists with soft-dirty clearing, so any solution should also address the soft-dirty issue.