commit a8b48a4dccea77e29462e59f1dbf0d5aa1ff167c upstream.
This fixes a bug where the trap number that is returned by __kvmppc_vcore_entry gets corrupted. The effect of the corruption is that IPIs get ignored on POWER9 systems when the IPI is sent via a doorbell interrupt to a CPU which is executing in a KVM guest. The effect of the IPI being ignored is often that another CPU locks up inside smp_call_function_many() (and if that CPU is holding a spinlock, other CPUs then lock up inside raw_spin_lock()).
The trap number is currently held in register r12 for most of the assembly-language part of the guest exit path. In that path, we call kvmppc_subcore_exit_guest(), which is a C function, without restoring r12 afterwards. Depending on the kernel config and the compiler, it may modify r12 or it may not, so some config/compiler combinations see the bug and others don't.
To fix this, we arrange for the trap number to be stored on the stack from the point where kvmhv_commence_exit is called until the end of the function, then the trap number is loaded and returned in r12 as before.
Cc: stable@vger.kernel.org # v4.8+ Fixes: fd7bacbca47a ("KVM: PPC: Book3S HV: Fix TB corruption in guest exit path on HMI interrupt") Signed-off-by: Paul Mackerras paulus@ozlabs.org --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index 2b3194b9608f..663a398449b7 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -308,7 +308,6 @@ kvm_novcpu_exit: stw r12, STACK_SLOT_TRAP(r1) bl kvmhv_commence_exit nop - lwz r12, STACK_SLOT_TRAP(r1) b kvmhv_switch_to_host
/* @@ -1136,6 +1135,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
secondary_too_late: li r12, 0 + stw r12, STACK_SLOT_TRAP(r1) cmpdi r4, 0 beq 11f stw r12, VCPU_TRAP(r4) @@ -1445,12 +1445,12 @@ mc_cont: 1: #endif /* CONFIG_KVM_XICS */
+ stw r12, STACK_SLOT_TRAP(r1) mr r3, r12 /* Increment exit count, poke other threads to exit */ bl kvmhv_commence_exit nop ld r9, HSTATE_KVM_VCPU(r13) - lwz r12, VCPU_TRAP(r9)
/* Stop others sending VCPU interrupts to this physical CPU */ li r0, -1 @@ -1816,6 +1816,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_POWER9_DD1) * POWER7/POWER8 guest -> host partition switch code. * We don't have to lock against tlbies but we do * have to coordinate the hardware threads. + * Here STACK_SLOT_TRAP(r1) contains the trap number. */ kvmhv_switch_to_host: /* Secondary threads wait for primary to do partition switch */ @@ -1868,11 +1869,11 @@ BEGIN_FTR_SECTION END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
/* If HMI, call kvmppc_realmode_hmi_handler() */ + lwz r12, STACK_SLOT_TRAP(r1) cmpwi r12, BOOK3S_INTERRUPT_HMI bne 27f bl kvmppc_realmode_hmi_handler nop - li r12, BOOK3S_INTERRUPT_HMI /* * At this point kvmppc_realmode_hmi_handler would have resync-ed * the TB. Hence it is not required to subtract guest timebase @@ -1950,6 +1951,7 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX) li r0, KVM_GUEST_MODE_NONE stb r0, HSTATE_IN_GUEST(r13)
+ lwz r12, STACK_SLOT_TRAP(r1) /* return trap # in r12 */ ld r0, SFS+PPC_LR_STKOFF(r1) addi r1, r1, SFS mtlr r0
commit 27f29dbceb3c979d00833a90aa27ff0756ecc1e0 upstream.
This fixes several bugs in the radix page fault handler relating to the way large pages in the memory backing the guest were handled. First, the check for large pages only checked for explicit huge pages and missed transparent huge pages. Then the check that the addresses (host virtual vs. guest physical) had appropriate alignment was wrong, meaning that the code never put a large page in the partition scoped radix tree; it was always demoted to a small page.
Fixing this exposed bugs in kvmppc_create_pte(). We were never invalidating a 2MB PTE, which meant that if a page was initially faulted in without write permission and the guest then attempted to store to it, we would never update the PTE to have write permission. If we find a valid 2MB PTE in the PMD, we need to clear it and do a TLB invalidation before installing either the new 2MB PTE or a pointer to a page table page.
This also corrects an assumption that get_user_pages_fast would set the _PAGE_DIRTY bit if we are writing, which is not true. Instead we mark the page dirty explicitly with set_page_dirty_lock(). This also means we don't need the dirty bit set on the host PTE when providing write access on a read fault.
[paulus@ozlabs.org - use mark_pages_dirty instead of kvmppc_update_dirty_map]
Signed-off-by: Paul Mackerras paulus@ozlabs.org --- arch/powerpc/kvm/book3s_64_mmu_radix.c | 72 ++++++++++++++++++++++------------ 1 file changed, 46 insertions(+), 26 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c index c5d7435455f1..27a41695fcfd 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c @@ -19,6 +19,9 @@ #include <asm/pgalloc.h> #include <asm/pte-walk.h>
+static void mark_pages_dirty(struct kvm *kvm, struct kvm_memory_slot *memslot, + unsigned long gfn, unsigned int order); + /* * Supported radix tree geometry. * Like p9, we support either 5 or 9 bits at the first (lowest) level, @@ -195,6 +198,12 @@ static void kvmppc_pte_free(pte_t *ptep) kmem_cache_free(kvm_pte_cache, ptep); }
+/* Like pmd_huge() and pmd_large(), but works regardless of config options */ +static inline int pmd_is_leaf(pmd_t pmd) +{ + return !!(pmd_val(pmd) & _PAGE_PTE); +} + static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa, unsigned int level, unsigned long mmu_seq) { @@ -219,7 +228,7 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa, else new_pmd = pmd_alloc_one(kvm->mm, gpa);
- if (level == 0 && !(pmd && pmd_present(*pmd))) + if (level == 0 && !(pmd && pmd_present(*pmd) && !pmd_is_leaf(*pmd))) new_ptep = kvmppc_pte_alloc();
/* Check if we might have been invalidated; let the guest retry if so */ @@ -244,12 +253,30 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa, new_pmd = NULL; } pmd = pmd_offset(pud, gpa); - if (pmd_large(*pmd)) { - /* Someone else has instantiated a large page here; retry */ - ret = -EAGAIN; - goto out_unlock; - } - if (level == 1 && !pmd_none(*pmd)) { + if (pmd_is_leaf(*pmd)) { + unsigned long lgpa = gpa & PMD_MASK; + + /* + * If we raced with another CPU which has just put + * a 2MB pte in after we saw a pte page, try again. + */ + if (level == 0 && !new_ptep) { + ret = -EAGAIN; + goto out_unlock; + } + /* Valid 2MB page here already, remove it */ + old = kvmppc_radix_update_pte(kvm, pmdp_ptep(pmd), + ~0UL, 0, lgpa, PMD_SHIFT); + kvmppc_radix_tlbie_page(kvm, lgpa, PMD_SHIFT); + if (old & _PAGE_DIRTY) { + unsigned long gfn = lgpa >> PAGE_SHIFT; + struct kvm_memory_slot *memslot; + memslot = gfn_to_memslot(kvm, gfn); + if (memslot) + mark_pages_dirty(kvm, memslot, gfn, + PMD_SHIFT - PAGE_SHIFT); + } + } else if (level == 1 && !pmd_none(*pmd)) { /* * There's a page table page here, but we wanted * to install a large page. Tell the caller and let @@ -412,28 +439,24 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, } else { page = pages[0]; pfn = page_to_pfn(page); - if (PageHuge(page)) { - page = compound_head(page); - pte_size <<= compound_order(page); + if (PageCompound(page)) { + pte_size <<= compound_order(compound_head(page)); /* See if we can insert a 2MB large-page PTE here */ if (pte_size >= PMD_SIZE && - (gpa & PMD_MASK & PAGE_MASK) == - (hva & PMD_MASK & PAGE_MASK)) { + (gpa & (PMD_SIZE - PAGE_SIZE)) == + (hva & (PMD_SIZE - PAGE_SIZE))) { level = 1; pfn &= ~((PMD_SIZE >> PAGE_SHIFT) - 1); } } /* See if we can provide write access */ if (writing) { - /* - * We assume gup_fast has set dirty on the host PTE. - */ pgflags |= _PAGE_WRITE; } else { local_irq_save(flags); ptep = find_current_mm_pte(current->mm->pgd, hva, NULL, NULL); - if (ptep && pte_write(*ptep) && pte_dirty(*ptep)) + if (ptep && pte_write(*ptep)) pgflags |= _PAGE_WRITE; local_irq_restore(flags); } @@ -459,18 +482,15 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, pte = pfn_pte(pfn, __pgprot(pgflags)); ret = kvmppc_create_pte(kvm, pte, gpa, level, mmu_seq); } - if (ret == 0 || ret == -EAGAIN) - ret = RESUME_GUEST;
if (page) { - /* - * We drop pages[0] here, not page because page might - * have been set to the head page of a compound, but - * we have to drop the reference on the correct tail - * page to match the get inside gup() - */ - put_page(pages[0]); + if (!ret && (pgflags & _PAGE_WRITE)) + set_page_dirty_lock(page); + put_page(page); } + + if (ret == 0 || ret == -EAGAIN) + ret = RESUME_GUEST; return ret; }
@@ -676,7 +696,7 @@ void kvmppc_free_radix(struct kvm *kvm) continue; pmd = pmd_offset(pud, 0); for (im = 0; im < PTRS_PER_PMD; ++im, ++pmd) { - if (pmd_huge(*pmd)) { + if (pmd_is_leaf(*pmd)) { pmd_clear(pmd); continue; }
On Fri, May 11, 2018 at 04:20:21PM +1000, Paul Mackerras wrote:
commit 27f29dbceb3c979d00833a90aa27ff0756ecc1e0 upstream.
There no such commit in Linus's tree. This commit id is in linux-stable, but that is a totally different commit, dealing with the tracing subsystem :(
Something went really wrong here, please fix up and resend.
thanks,
greg k-h
On Fri, May 11, 2018 at 08:58:39AM +0200, Greg KH wrote:
On Fri, May 11, 2018 at 04:20:21PM +1000, Paul Mackerras wrote:
commit 27f29dbceb3c979d00833a90aa27ff0756ecc1e0 upstream.
There no such commit in Linus's tree. This commit id is in linux-stable, but that is a totally different commit, dealing with the tracing subsystem :(
Something went really wrong here, please fix up and resend.
Sorry about that, it was my error. The patch is fine, I tested it after manually rebasing onto 4.14.40, but I stuffed up in assembling the commit message. I'll repost.
Paul.
commit debd574f4195e205ba505b25e19b2b797f4bcd94 upstream.
The current code for initializing the VRMA (virtual real memory area) for HPT guests requires the page size of the backing memory to be one of 4kB, 64kB or 16MB. With a radix host we have the possibility that the backing memory page size can be 2MB or 1GB. In these cases, if the guest switches to HPT mode, KVM will not initialize the VRMA and the guest will fail to run.
In fact it is not necessary that the VRMA page size is the same as the backing memory page size; any VRMA page size less than or equal to the backing memory page size is acceptable. Therefore we now choose the largest page size out of the set {4k, 64k, 16M} which is not larger than the backing memory page size.
Signed-off-by: Paul Mackerras paulus@ozlabs.org --- arch/powerpc/kvm/book3s_hv.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index e094dc90ff1b..fde21233743a 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -3619,15 +3619,17 @@ static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu) goto up_out;
psize = vma_kernel_pagesize(vma); - porder = __ilog2(psize);
up_read(¤t->mm->mmap_sem);
/* We can handle 4k, 64k or 16M pages in the VRMA */ - err = -EINVAL; - if (!(psize == 0x1000 || psize == 0x10000 || - psize == 0x1000000)) - goto out_srcu; + if (psize >= 0x1000000) + psize = 0x1000000; + else if (psize >= 0x10000) + psize = 0x10000; + else + psize = 0x1000; + porder = __ilog2(psize);
senc = slb_pgsize_encoding(psize); kvm->arch.vrma_slb_v = senc | SLB_VSID_B_1T |
On Fri, May 11, 2018 at 04:20:58PM +1000, Paul Mackerras wrote:
commit debd574f4195e205ba505b25e19b2b797f4bcd94 upstream.
The current code for initializing the VRMA (virtual real memory area) for HPT guests requires the page size of the backing memory to be one of 4kB, 64kB or 16MB. With a radix host we have the possibility that the backing memory page size can be 2MB or 1GB. In these cases, if the guest switches to HPT mode, KVM will not initialize the VRMA and the guest will fail to run.
In fact it is not necessary that the VRMA page size is the same as the backing memory page size; any VRMA page size less than or equal to the backing memory page size is acceptable. Therefore we now choose the largest page size out of the set {4k, 64k, 16M} which is not larger than the backing memory page size.
Signed-off-by: Paul Mackerras paulus@ozlabs.org
arch/powerpc/kvm/book3s_hv.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-)
Applied, thanks.
greg k-h
commit 61bd0f66ff92d5ce765ff9850fd3cbfec773c560 upstream.
Since commit 8b24e69fc47e ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry"), if CONFIG_VIRT_CPU_ACCOUNTING_GEN is set, the guest time is not accounted to guest time and user time, but instead to system time.
This is because guest_enter()/guest_exit() are called while interrupts are disabled and the tick counter cannot be updated between them.
To fix that, move guest_exit() after local_irq_enable(), and as guest_enter() is called with IRQ disabled, call guest_enter_irqoff() instead.
Fixes: 8b24e69fc47e ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry") Signed-off-by: Laurent Vivier lvivier@redhat.com Reviewed-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Paul Mackerras paulus@ozlabs.org --- arch/powerpc/kvm/book3s_hv.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index fde21233743a..377d1420bd02 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -2847,7 +2847,7 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc) */ trace_hardirqs_on();
- guest_enter(); + guest_enter_irqoff();
srcu_idx = srcu_read_lock(&vc->kvm->srcu);
@@ -2855,8 +2855,6 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
srcu_read_unlock(&vc->kvm->srcu, srcu_idx);
- guest_exit(); - trace_hardirqs_off(); set_irq_happened(trap);
@@ -2890,6 +2888,7 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc) kvmppc_set_host_core(pcpu);
local_irq_enable(); + guest_exit();
/* Let secondaries go back to the offline loop */ for (i = 0; i < controlled_threads; ++i) {
On Fri, May 11, 2018 at 04:21:41PM +1000, Paul Mackerras wrote:
commit 61bd0f66ff92d5ce765ff9850fd3cbfec773c560 upstream.
Since commit 8b24e69fc47e ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry"), if CONFIG_VIRT_CPU_ACCOUNTING_GEN is set, the guest time is not accounted to guest time and user time, but instead to system time.
This is because guest_enter()/guest_exit() are called while interrupts are disabled and the tick counter cannot be updated between them.
To fix that, move guest_exit() after local_irq_enable(), and as guest_enter() is called with IRQ disabled, call guest_enter_irqoff() instead.
Fixes: 8b24e69fc47e ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry") Signed-off-by: Laurent Vivier lvivier@redhat.com Reviewed-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Paul Mackerras paulus@ozlabs.org
arch/powerpc/kvm/book3s_hv.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)
Applied, thanks.
greg k-h
On Fri, May 11, 2018 at 04:19:06PM +1000, Paul Mackerras wrote:
commit a8b48a4dccea77e29462e59f1dbf0d5aa1ff167c upstream.
This fixes a bug where the trap number that is returned by __kvmppc_vcore_entry gets corrupted. The effect of the corruption is that IPIs get ignored on POWER9 systems when the IPI is sent via a doorbell interrupt to a CPU which is executing in a KVM guest. The effect of the IPI being ignored is often that another CPU locks up inside smp_call_function_many() (and if that CPU is holding a spinlock, other CPUs then lock up inside raw_spin_lock()).
The trap number is currently held in register r12 for most of the assembly-language part of the guest exit path. In that path, we call kvmppc_subcore_exit_guest(), which is a C function, without restoring r12 afterwards. Depending on the kernel config and the compiler, it may modify r12 or it may not, so some config/compiler combinations see the bug and others don't.
To fix this, we arrange for the trap number to be stored on the stack from the point where kvmhv_commence_exit is called until the end of the function, then the trap number is loaded and returned in r12 as before.
Cc: stable@vger.kernel.org # v4.8+ Fixes: fd7bacbca47a ("KVM: PPC: Book3S HV: Fix TB corruption in guest exit path on HMI interrupt") Signed-off-by: Paul Mackerras paulus@ozlabs.org
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
Now applied, thanks.
greg k-h
linux-stable-mirror@lists.linaro.org