On Dec 23, 2020, at 2:05 PM, Andrea Arcangeli aarcange@redhat.com wrote:
On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote:
On Dec 21, 2020, at 1:24 PM, Yu Zhao yuzhao@google.com wrote:
On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit nadav.amit@gmail.com wrote:
Using mmap_write_lock() was my initial fix and there was a strong pushback on this approach due to its potential impact on performance.
From whom?
Somebody who doesn't understand that correctness is more important than performance? And that userfaultfd is not the most important part of the system?
The fact is, userfaultfd is CLEARLY BUGGY.
Linus
Fair enough.
Nadav, for your patch (you might want to update the commit message).
Reviewed-by: Yu Zhao yuzhao@google.com
While we are all here, there is also clear_soft_dirty() that could use a similar fix…
Just an update as for why I have still not sent v2: I fixed clear_soft_dirty(), created a reproducer, and the reproducer kept failing.
So after some debugging, it appears that clear_refs_write() does not flush the TLB. It indeed calls tlb_finish_mmu() but since 0758cd830494 ("asm-generic/tlb: avoid potential double flush”), tlb_finish_mmu() does not flush the TLB since there is clear_refs_write() does not call to __tlb_adjust_range() (unless there are nested TLBs are pending).
So I have a patch for this issue too: arguably the tlb_gather interface is not the right one for clear_refs_write() that does not clear PTEs but changes them.
Yet, sadly, my reproducer keeps falling (less frequently, but still). So I will keep debugging to see what goes wrong. I will send v2 once I figure out what the heck is wrong in the code or my reproducer.
If you put the page_mapcount check back in do_wp_page instead of page_count, it'll stop reproducing but the bug is still very much there...
I know. I designed the (re)producer just to be able to hit the bug without changing the kernel (and test my fix), but I am fully aware that it would not work on older kernels although the bug is still there.
And then we'll have to enforce uffd-wp cannot be registered if VM_SOFTDIRTY is set or the other way around so that VM_UFFD* is mutually exclusive with VM_SOFTDIRTY. So then we can also unify the bit so they all use the same software bit in the pgtable (that's something I considered anyway originally since it doesn't make whole lot of sense to use the two features on the same vma at the same time).
I think it may be reasonable.
Just a proposal: At some point we can also ask ourselves whether the “artificial" limitation of the number of software bits per PTE should really limit us, or do we want to hold some additional metadata per-PTE by either putting it in an adjacent page (holding 64-bits of additional software-bits per PTE) or by finding some place in the page-struct to link to this metadata (and have the liberty of number of bits per PTE). One of the PTE software-bits can be repurposed to say whether there is “extra-metadata” associated with the PTE.
I am fully aware that there will be some overhead associated, but it can be limited to less-common use-cases.