On Dec 23, 2020, at 2:29 PM, Yu Zhao yuzhao@google.com wrote:
I was hesitant to suggest the following because it isn't that straightforward. But since you seem to be less concerned with the complexity, I'll just put it on the table -- it would take care of both uffd and clear_refs_write, wouldn't it?
diff --git a/mm/memory.c b/mm/memory.c
index 5e9ca612d7d7..af38c5ee327e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		goto unlock;
 	}
 	if (vmf->flags & FAULT_FLAG_WRITE) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
+			if (mm_tlb_flush_pending(vmf->vma->vm_mm))
+				flush_tlb_page(vmf->vma, vmf->address);
 			return do_wp_page(vmf);
+		}
 		entry = pte_mkdirty(entry);
 	}
I don’t love this as a long-term fix. AFAICT we can have mm_tlb_flush_pending set for quite a while — mprotect seems like it can wait on I/O while splitting a huge page, for example. That gives us a window in which every write fault turns into a TLB flush.
I’m not immediately sure how to do all that much better, though. We could potentially keep a record of pending ranges that need flushing, per mm or per PTL, protected by the PTL, and arrange to do the flush if we notice that flushes are pending when we want to do_wp_page(). That would limit us to at most one extra flush, at least until the concurrent mprotect (or whatever) makes further progress. The bookkeeping might be nasty, though.
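To make that bookkeeping concrete, here is a minimal userspace model of the idea. Everything in it is made up for illustration: mm_pending stands in for state that would live in mm_struct under the page-table lock, and clearing the flag stands in for the actual flush_tlb_range() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-mm record of a deferred TLB flush, nominally
 * protected by the page-table lock. */
struct mm_pending {
	bool pending;		/* is any flush outstanding? */
	uint64_t start, end;	/* merged range [start, end) awaiting flush */
};

/* mprotect-style path: widen the pending range instead of flushing now.
 * Merging disjoint ranges over-approximates, which is safe but can cause
 * spurious flushes. */
void note_deferred_flush(struct mm_pending *p, uint64_t start, uint64_t end)
{
	if (!p->pending) {
		*p = (struct mm_pending){ true, start, end };
		return;
	}
	if (start < p->start)
		p->start = start;
	if (end > p->end)
		p->end = end;
}

/* wp-fault path: flush only if the faulting address might have a stale
 * TLB entry.  Returns true if a flush was performed. */
bool flush_if_pending(struct mm_pending *p, uint64_t address)
{
	if (!p->pending || address < p->start || address >= p->end)
		return false;
	p->pending = false;	/* stands in for flush_tlb_range() */
	return true;
}
```

With this shape, the fault path pays at most one extra flush until the concurrent modifier queues more work, at the cost of a coarse merged range.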
But x86 already sort of does some of this bookkeeping, and arguably x86’s code could be improved by tracking TLB ranges to flush per mm instead of per flush request — Nadav already got us halfway there by making a little cache of flush_tlb_info structs. IMO it wouldn’t be totally crazy to integrate this better with tlb_gather_mmu to make the pending-flush data visible to other CPUs even before actually kicking off the flush.

In the limit, this starts to look a bit like a fully async flush mechanism. We would have a function to request a flush; it would return a generation count but not actually flush anything. The case of flushing a range adjacent to a still-pending range would be explicitly optimized. Another function would then actually initiate and wait for the flush to complete. And we could, while holding PTL, scan the list of pending flushes, if any, to see whether the PTE we’re looking at has a flush pending.

This is sort of easy in the one-PTL-per-mm case but potentially rather complicated in the split-PTL case. And I’m genuinely unsure where the “batch” TLB flush interface fits in, because it can create a batch that spans more than one mm. x86 can deal with this moderately efficiently, since we limit the number of live mms per CPU and our flushes are (for now?) per-CPU, not per-mm.
So a batched operation (zap, mprotect, etc.) would do something like:

	u64 gen = 0;
	for (...)
		gen = queue_flush(mm, start, end, freed_tables);
	flush_to_gen(mm, gen);
and the wp fault path does:
	wait_for_pending_flushes(mm, address);
Other than the possibility of devolving to one flush per operation if one thread is page-faulting at the same rate that another thread is modifying PTEs, this should be decently performant.
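As a sanity check of the generation-count scheme, here is a minimal userspace model of the hypothetical queue_flush / flush_to_gen / wait_for_pending_flushes API. The mm_model struct, the fixed-size pending table, and the names are all invented; locking is elided, and "flushing" just retires generations rather than touching any TLB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PENDING 16

struct flush_range {
	uint64_t start, end;	/* [start, end) */
	uint64_t gen;		/* generation this range was last queued at */
	bool freed_tables;
};

struct mm_model {
	struct flush_range pending[MAX_PENDING];
	int npending;
	uint64_t queued_gen;	/* last generation handed out */
	uint64_t flushed_gen;	/* last generation known flushed */
};

/* Actually perform (here: just retire) every flush queued at or
 * before @gen. */
void flush_to_gen(struct mm_model *mm, uint64_t gen)
{
	int keep = 0;

	if (gen <= mm->flushed_gen)
		return;		/* someone already flushed past us */
	for (int i = 0; i < mm->npending; i++)
		if (mm->pending[i].gen > gen)
			mm->pending[keep++] = mm->pending[i];
	mm->npending = keep;
	mm->flushed_gen = gen;
}

/* Record that [start, end) needs a TLB flush; return a generation count
 * without flushing anything.  A range adjacent to (or overlapping) a
 * still-pending range is merged into it, per the optimization above. */
uint64_t queue_flush(struct mm_model *mm, uint64_t start, uint64_t end,
		     bool freed_tables)
{
	uint64_t gen = ++mm->queued_gen;
	struct flush_range *r;

	for (int i = 0; i < mm->npending; i++) {
		r = &mm->pending[i];
		if (start <= r->end && end >= r->start) {
			if (start < r->start)
				r->start = start;
			if (end > r->end)
				r->end = end;
			r->freed_tables |= freed_tables;
			r->gen = gen;
			return gen;
		}
	}
	if (mm->npending == MAX_PENDING)
		flush_to_gen(mm, gen - 1);	/* table full: flush it all */
	r = &mm->pending[mm->npending++];
	*r = (struct flush_range){ start, end, gen, freed_tables };
	return gen;
}

/* The wp-fault path: under PTL, see whether @address has a flush pending
 * and, if so, flush up to that range's generation.  Returns true if a
 * flush was actually needed. */
bool wait_for_pending_flushes(struct mm_model *mm, uint64_t address)
{
	for (int i = 0; i < mm->npending; i++) {
		struct flush_range *r = &mm->pending[i];
		if (address >= r->start && address < r->end) {
			flush_to_gen(mm, r->gen);
			return true;
		}
	}
	return false;
}
```

In this model a fault that hits a pending range retires everything up to that range's generation, and a fault that misses every pending range pays nothing, which is the behavior the sketch above is after.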