On Mon, Mar 23, 2020 at 07:40:31AM -0700, Sean Christopherson wrote:
On Sun, Mar 22, 2020 at 07:54:32PM -0700, Mike Kravetz wrote:
On 3/22/20 7:03 PM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
On 2020/3/22 7:38, Mike Kravetz wrote:
On 2/21/20 7:33 PM, Longpeng(Mike) wrote:
From: Longpeng <longpeng2@huawei.com>
I have not looked closely at the generated code for lookup_address_in_pgd. It appears that it would dereference p4d, pud and pmd multiple times. Sean seemed to think there was something about the calling context that would make issues like those seen with huge_pte_offset less likely to happen. I do not know if this is accurate or not.
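To illustrate the concern, the pmd step of lookup_address_in_pgd() looks roughly like the following (a simplified sketch, not the exact upstream code): each level is tested by dereferencing the same pointer more than once, so a concurrent split/collapse could make the second read observe a different entry than the first.

	pmd = pmd_offset(pud, address);
	if (pmd_none(*pmd))				// 1st read of *pmd
		return NULL;
	if (pmd_large(*pmd) || !pmd_present(*pmd)) {	// 2nd read may see a different value
		*level = PG_LEVEL_2M;
		return (pte_t *)pmd;
	}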
Only for KVM's calls to lookup_address_in_mm(); I can't speak to other calls that funnel into lookup_address_in_pgd().
KVM uses a combination of tracking and blocking mmu_notifier calls to ensure PTE changes/invalidations between gup() and lookup_address_in_pgd() cause a restart of the faulting instruction, and that pending changes/invalidations are blocked until installation of the pfn in KVM's secondary MMU completes.
kvm_mmu_page_fault():

	mmu_seq = kvm->mmu_notifier_seq;
	smp_rmb();

	pfn = gup(hva);

	spin_lock(&kvm->mmu_lock);

	smp_rmb();
	if (kvm->mmu_notifier_seq != mmu_seq)
		goto out_unlock;	// Restart guest, i.e. retry the fault

	lookup_address_in_mm(hva, ...);
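For reference, the notifier side this pairs with looks roughly like the following (a simplified sketch of the kvm_mmu_notifier_invalidate_range_start()/end() handlers in virt/kvm/kvm_main.c; the actual spte zapping and TLB flushing is elided):

kvm_mmu_notifier_invalidate_range_start():

	spin_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_count++;	// Blocks pfn installation until ..._end()
	// zap/unmap the sptes covering the invalidated range
	spin_unlock(&kvm->mmu_lock);

kvm_mmu_notifier_invalidate_range_end():

	spin_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_seq++;	// Invalidates any previously sampled mmu_seq
	kvm->mmu_notifier_count--;
	spin_unlock(&kvm->mmu_lock);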
It works because the mmu_lock spinlock is taken before and after any change to the page tables, via the invalidate_range_start()/end() callbacks.
So if you are holding the spinlock and mmu_notifier_count == 0, then nobody can be writing to the page tables.
It is effectively a full page table lock, so any page table read under that lock does not need to worry about data races.
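Concretely, the retry check done with mmu_lock held looks at both pieces of state, along the lines of mmu_notifier_retry() in include/linux/kvm_host.h (abridged sketch, comments mine):

	if (unlikely(kvm->mmu_notifier_count))
		return 1;	// An invalidation is in flight, retry the fault
	smp_rmb();		// Order the count read before the seq read
	if (kvm->mmu_notifier_seq != mmu_seq)
		return 1;	// Something changed since gup(), retry
	return 0;		// Safe to install the pfn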
Jason