On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
On Jan 12, 2021, at 11:02 AM, Laurent Dufour <ldufour@linux.vnet.ibm.com> wrote:
On 12/01/2021 at 17:57, Peter Zijlstra wrote:
On Tue, Jan 12, 2021 at 04:47:17PM +0100, Laurent Dufour wrote:
On 12/01/2021 at 12:43, Vinayak Menon wrote:
Possibility of race against other PTE modifiers
1) Fork - We have seen a case of SPF racing with fork marking PTEs RO; that is described and fixed here: https://lore.kernel.org/patchwork/patch/1062672/
Right, that's exactly the kind of thing I was worried about.
2) mprotect - change_protection() in mprotect, which does the deferred flush, is marked under vm_write_begin()/vm_write_end(), so SPF bails out on faults on those VMAs (see the sketch after this list).
Sure, mprotect also changes vm_flags, so it really needs that anyway.
3) userfaultfd - mwriteprotect_range() is not protected, unlike (2) above. But SPF does not take UFFD faults.
4) hugetlb - hugetlb_change_protection() is called from mprotect and covered by (2) above.
5) Concurrent faults - SPF does not handle all faults, only anonymous page faults.
What happened to shared/file-backed stuff? ISTR I had that working.
File-backed mappings are not processed in a speculative way; there were options to handle some of them depending on the underlying file system, but that's still not done.
Shared anonymous mappings are also not yet handled in a speculative way (their vm_ops is not NULL).
Of those, do_anonymous_page() and do_swap_page() are NONE/NON-PRESENT->PRESENT transitions without a TLB flush. And I hope do_wp_page() with RO->RW is fine as well.
The tricky one is demotion, specifically write to non-write.
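To make the demotion hazard concrete, a hypothetical interleaving (mine, not from the thread):

/*
 *   CPU 0 (mprotect, deferred flush)   CPU 1 (write fault)
 *   --------------------------------   -------------------
 *   clears pte_write() on the PTE
 *                                      sees the read-only PTE,
 *                                      takes do_wp_page() and
 *                                      COWs the page
 *   flush_tlb_range()
 *
 * Between the PTE update and the flush, another CPU holding a stale
 * writable TLB entry can keep writing to the old page, and those
 * writes are lost once the fault on CPU 1 switches to the COW copy.
 * NONE->PRESENT and RO->RW transitions have no such window, since
 * they only add permissions.
 */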
I could not see a case where the speculative path cannot see a PTE update done via a fault on another CPU.
One you didn't mention is the NUMA balancing scanning crud; although I think that's fine, losing a PTE update there is harmless. But I've not thought overly hard on it.
That's a good point, I need to double check on that side.
You explained it fine. Indeed, SPF handles deferred TLB invalidation by marking the VMA through vm_write_begin()/vm_write_end(), as in the fork case you mentioned. Once the PTL is held and the VMA's seqcount has been checked, the PTE values read are valid.
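A fault-side sketch of that check; names loosely follow the SPF patchset (spf_handle_fault() itself is a made-up helper, and the real code differs in detail):

static vm_fault_t spf_handle_fault(struct vm_area_struct *vma,
				   pte_t *pte, spinlock_t *ptl)
{
	unsigned int seq;

	/* Odd count: a writer (fork, mprotect, ...) is inside its
	 * vm_write_begin()/vm_write_end() section, possibly with a
	 * TLB flush still deferred -- fall back to the normal path. */
	seq = raw_read_seqcount(&vma->vm_sequence);
	if (seq & 1)
		return VM_FAULT_RETRY;

	spin_lock(ptl);

	/* Recheck under the PTL: a writer that started after the
	 * sample above would have bumped vm_sequence; if it has not
	 * changed, the PTE values read under ptl are valid and no
	 * deferred flush is pending for this VMA. */
	if (read_seqcount_retry(&vma->vm_sequence, seq)) {
		spin_unlock(ptl);
		return VM_FAULT_RETRY;
	}

	/* ... handle the fault using *pte while holding ptl ... */
	spin_unlock(ptl);
	return 0;
}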
That should indeed work, but are we really sure we covered them all? Should we invest in better TLBI APIs to make sure we can't get this wrong?
That may be a good option to identify deferred TLB invalidations, but I've no clue what such an API would look like.
I will send an RFC soon for per-table deferred TLB flush tracking. The basic idea is to save a generation number in the page struct that tracks when a deferred PTE change took place, and to record whenever a TLB flush completes. In addition, other users - such as mprotect - would use the tlb_gather interface.
Unfortunately, due to limited space in the page struct, this would only be possible for 64-bit (and my implementation is only for x86-64).
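Roughly -- and only as a guess at the shape of that RFC; mm_struct_ext, tlb_flush_gen, and the helpers below are all hypothetical -- something like:

struct mm_struct_ext {			/* hypothetical mm_struct extension */
	atomic64_t tlb_flush_gen;	/* bumped when a TLB flush completes */
};

/* Record, in the page-table page's struct page, the generation a
 * deferred PTE change belongs to (64-bit only, for space reasons). */
static inline void note_deferred_pte_change(struct mm_struct_ext *mm,
					    u64 *pt_flush_gen)
{
	*pt_flush_gen = atomic64_read(&mm->tlb_flush_gen) + 1;
}

/* A fault handler can then tell whether this page table still has a
 * deferred flush outstanding. */
static inline bool pt_flush_pending(struct mm_struct_ext *mm,
				    u64 pt_flush_gen)
{
	return pt_flush_gen > atomic64_read(&mm->tlb_flush_gen);
}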
I don't want to discourage you, but I don't think this would end well. PPC doesn't necessarily follow the one-page-struct-per-table rule, and I've run into problems with this before while trying to do something similar.
I'd recommend per-VMA and per-category (unmapping, clearing the writable bit, clearing the dirty bit) tracking, which relies only on arch-independent data structures, i.e., vm_area_struct and mm_struct (see the sketch below).
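For illustration only -- the enum and struct below are hypothetical, sketching what per-category tracking in arch-independent structures might look like:

enum tlb_flush_category {
	TLB_FLUSH_UNMAP,	/* present -> not present */
	TLB_FLUSH_WRPROTECT,	/* writable -> read-only */
	TLB_FLUSH_CLEAN,	/* dirty -> clean */
	NR_TLB_FLUSH_CATEGORIES,
};

/* Kept in mm_struct (or vm_area_struct), generalizing the existing
 * mm->tlb_flush_pending counter to one counter per category. */
struct tlb_flush_pending_cats {
	atomic_t pending[NR_TLB_FLUSH_CATEGORIES];
};

static inline bool tlb_flush_pending_cat(struct tlb_flush_pending_cats *t,
					 enum tlb_flush_category cat)
{
	return atomic_read(&t->pending[cat]) > 0;
}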
It would still require doing the copying while holding the PTL, though.
IMO, this is unacceptable. Most archs don't support a per-table PTL, and even x86_64 can be configured to use a per-mm PTL. What if we want to support a larger page size in the future?
It seems to me the only way to solve the problem with self-explanatory code and without performance impact is to check mm_tlb_flush_pending() and the writable bit (and the two other cases I mentioned above) at the same time. Of course, this requires a lot of effort to audit the existing uses, clean them up, properly wrap them in new primitives, BUG_ON() all invalid cases, and document the exact workflow to prevent misuse.
I've mentioned the following before -- it only demonstrates the rough idea.
diff --git a/mm/memory.c b/mm/memory.c
index 5e9ca612d7d7..af38c5ee327e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 			goto unlock;
 	}
 	if (vmf->flags & FAULT_FLAG_WRITE) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
+			if (mm_tlb_flush_pending(vmf->vma->vm_mm))
+				flush_tlb_page(vmf->vma, vmf->address);
 			return do_wp_page(vmf);
+		}
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);