On Dec 19, 2020, at 6:20 PM, Andrea Arcangeli aarcange@redhat.com wrote:
On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote:
On Dec 19, 2020, at 1:34 PM, Nadav Amit nadav.amit@gmail.com wrote:
[ cc’ing some more people who have experience with similar problems ]
On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli aarcange@redhat.com wrote:
Hello,
On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
Analyzing this problem indicates that there is a real bug since mmap_lock is only taken for read in mwriteprotect_range(). This might
Never having to take the mmap_sem for writing, and in turn never blocking, in order to modify the pagetables is quite an important feature in uffd that justifies uffd instead of mprotect. It's not the most important reason to use uffd, but it'd be nice if that guarantee would remain also for the UFFDIO_WRITEPROTECT API, not only for the other pgtable manipulations.
Consider the following scenario with 3 CPUs (cpu2 is not shown):
cpu0 cpu1
userfaultfd_writeprotect() [ write-protecting ] mwriteprotect_range() mmap_read_lock() change_protection() change_protection_range() ... change_pte_range() [ defer TLB flushes] userfaultfd_writeprotect() mmap_read_lock() change_protection() [ write-unprotect ] ... [ unprotect PTE logically ]
Is the uffd selftest failing with upstream or after your kernel modification that removes the tlb flush from unprotect?
Please see my reply to Yu. I was wrong in this analysis, and I sent a correction to my analysis. The problem actually happens when userfaultfd_writeprotect() unprotects the memory.
} else if (uffd_wp_resolve) { /* * Leave the write bit to be handled * by PF interrupt handler, then * things like COW could be properly * handled. */ ptent = pte_clear_uffd_wp(ptent); }
Upstraem this will still do pages++, there's a tlb flush before change_protection can return here, so I'm confused.
You are correct. The problem I encountered with userfaultfd_writeprotect() is during unprotecting path.
Having said that, I think that there are additional scenarios that are problematic. Consider for instance madvise_dontneed_free() that is racing with userfaultfd_writeprotect(). If madvise_dontneed_free() completed removing the PTEs, but still did not flush, change_pte_range() will see non-present PTEs, say a flush is not needed, and then change_protection_range() will not do a flush, and return while the memory is still not protected.
I don't share your concern. What matters is the PT lock, so it wouldn't be one per pte, but a least an order 9 higher, but let's assume one flush per pte.
It's either huge mapping and then it's likely running without other tlb flushing in background (postcopy snapshotting), or it's a granular protect with distributed shared memory in which case the number of changd ptes or huge_pmds tends to be always 1 anyway. So it doesn't matter if it's deferred.
I agree it may require a larger tlb flush review not just mprotect though, but it didn't sound particularly complex. Note the UFFDIO_WRITEPROTECT is still relatively recent so backports won't risk to reject so heavy as to require a band-aid.
My second thought is, I don't see exactly the bug and it's not clear if it's upstream reproducing this, but assuming this happens on upstream, even ignoring everything else happening in the tlb flush code, this sounds like purely introduced by userfaultfd_writeprotect() vs userfaultfd_writeprotect() (since it's the only place changing protection with mmap_sem for reading and note we already unmap and flush tlb with mmap_sem for reading in MADV_DONTNEED/MADV_FREE clears the dirty bit etc..). Flushing tlbs with mmap_sem for reading is nothing new, the only new thing is the flush after wrprotect.
So instead of altering any tlb flush code, would it be possible to just stick to mmap_lock for reading and then serialize userfaultfd_writeprotect() against itself with an additional mm->mmap_wprotect_lock mutex? That'd be a very local change to userfaultfd too.
Can you look if the rule mmap_sem for reading plus a new mm->mmap_wprotect_lock mutex or the mmap_sem for writing, whenever wrprotecting ptes, is enough to comply with the current tlb flushing code, so not to require any change non local to uffd (modulo the additional mutex).
So I did not fully understand your solution, but I took your point and looked again on similar cases. To be fair, despite my experience with these deferred TLB flushes as well as Peter Zijlstra’s great documentation, I keep getting confused (e.g., can’t we somehow combine tlb_flush_batched and tlb_flush_pending ?)
As I said before, my initial scenario was wrong, and the problem is not userfaultfd_writeprotect() racing against itself. This one seems actually benign to me.
Nevertheless, I do think there is a problem in change_protection_range(). Specifically, see the aforementioned scenario of a race between madvise_dontneed_free() and userfaultfd_writeprotect().
So an immediate solution for such a case can be resolve without holding mmap_lock for write, by just adding a test for mm_tlb_flush_nested() in change_protection_range():
/* * Only flush the TLB if we actually modified any entries * or if there are pending TLB flushes. */ if (pages || mm_tlb_flush_nested(mm)) flush_tlb_range(vma, start, end);
To be fair, I am not confident I did not miss other problematic cases.
But for now, this change, with the preserve_write change should address the immediate issues. Let me know if you agree.
Let me know whether you agree.