This is a collection of backports for patches that were Cc'd to stable, but failed to apply, along with their dependencies.
Note, Sasha already posted[1][2] these (and I acked them):
KVM: VMX: Allow guest to set DEBUGCTL.RTM_DEBUG if RTM is supported
KVM: x86/pmu: Gate all "unimplemented MSR" prints on report_ignored_msrs
KVM: VMX: Extract checking of guest's DEBUGCTL into helper
KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter
KVM: VMX: Wrap all accesses to IA32_DEBUGCTL with getter/setter APIs
I'm including them here to hopefully make life easier for y'all, and because the order they are presented here is the preferred ordering, i.e. should be the same ordering as the original upstream patches.
But, if you end up grabbing Sasha's patches first, it's not a big deal, as the only true dependency is that the DEBUGCTL.RTM_DEBUG patch needs to land before "Check vmcs12->guest_ia32_debugctl on nested VM-Enter".
Many of the patches needed to get to the last patch (the DEBUGCTLMSR_FREEZE_IN_SMM fix) are dependencies that arguably shouldn't be backported to LTS kernels. I opted to do the backports because none of the patches are scary (if it were 1-3 dependency patches instead of 8, I wouldn't hesitate), and there's a decent chance they'll be dependencies for future fixes.
[1] https://lore.kernel.org/all/20250813184918.2071296-1-sashal@kernel.org
[2] https://lore.kernel.org/all/20250814132434.2096873-1-sashal@kernel.org
Chao Gao (1):
  KVM: nVMX: Defer SVI update to vmcs01 on EOI when L2 is active w/o VID
Maxim Levitsky (3):
  KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter
  KVM: VMX: Wrap all accesses to IA32_DEBUGCTL with getter/setter APIs
  KVM: VMX: Preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while running the guest
Sean Christopherson (17):
  KVM: SVM: Set RFLAGS.IF=1 in C code, to get VMRUN out of the STI shadow
  KVM: x86: Re-split x2APIC ICR into ICR+ICR2 for AMD (x2AVIC)
  KVM: x86: Plumb in the vCPU to kvm_x86_ops.hwapic_isr_update()
  KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producer
  KVM: x86: Snapshot the host's DEBUGCTL in common x86
  KVM: x86: Snapshot the host's DEBUGCTL after disabling IRQs
  KVM: x86/pmu: Gate all "unimplemented MSR" prints on report_ignored_msrs
  KVM: x86: Plumb "force_immediate_exit" into kvm_entry() tracepoint
  KVM: VMX: Re-enter guest in fastpath for "spurious" preemption timer exits
  KVM: VMX: Handle forced exit due to preemption timer in fastpath
  KVM: x86: Move handling of is_guest_mode() into fastpath exit handlers
  KVM: VMX: Handle KVM-induced preemption timer exits in fastpath for L2
  KVM: x86: Fully defer to vendor code to decide how to force immediate exit
  KVM: x86: Convert vcpu_run()'s immediate exit param into a generic bitmap
  KVM: x86: Drop kvm_x86_ops.set_dr6() in favor of a new KVM_RUN flag
  KVM: VMX: Allow guest to set DEBUGCTL.RTM_DEBUG if RTM is supported
  KVM: VMX: Extract checking of guest's DEBUGCTL into helper
 arch/x86/include/asm/kvm-x86-ops.h |   2 -
 arch/x86/include/asm/kvm_host.h    |  24 +++--
 arch/x86/include/asm/msr-index.h   |   1 +
 arch/x86/kvm/hyperv.c              |  10 +-
 arch/x86/kvm/lapic.c               |  61 ++++++++---
 arch/x86/kvm/lapic.h               |   1 +
 arch/x86/kvm/svm/svm.c             |  49 ++++++---
 arch/x86/kvm/svm/vmenter.S         |   9 +-
 arch/x86/kvm/trace.h               |   9 +-
 arch/x86/kvm/vmx/nested.c          |  26 ++++-
 arch/x86/kvm/vmx/pmu_intel.c       |   8 +-
 arch/x86/kvm/vmx/vmx.c             | 168 ++++++++++++++++++-----------
 arch/x86/kvm/vmx/vmx.h             |  31 +++++-
 arch/x86/kvm/x86.c                 |  65 ++++++-----
 arch/x86/kvm/x86.h                 |  12 +++
 15 files changed, 322 insertions(+), 154 deletions(-)
base-commit: 3594f306da129190de25938b823f353ef7f9e322
[ Upstream commit be45bc4eff33d9a7dae84a2150f242a91a617402 ]
Enable/disable local IRQs, i.e. set/clear RFLAGS.IF, in the common svm_vcpu_enter_exit() just after/before guest_state_{enter,exit}_irqoff() so that VMRUN is not executed in an STI shadow. AMD CPUs have a quirk (some would say "bug"), where the STI shadow bleeds into the guest's int_state field if a #VMEXIT occurs during injection of an event, i.e. if the VMRUN doesn't complete before the subsequent #VMEXIT.
The spurious "interrupts masked" state is relatively benign, as it only occurs during event injection and is transient. Because KVM is already injecting an event, the guest can't be in HLT, and if KVM is querying IRQ blocking for injection, then KVM would need to force an immediate exit anyways since injecting multiple events is impossible.
However, because KVM copies int_state verbatim from vmcb02 to vmcb12, the spurious STI shadow is visible to L1 when running a nested VM, which can trip sanity checks, e.g. in VMware's VMM.
Hoist the STI+CLI all the way to C code, as the aforementioned calls to guest_state_{enter,exit}_irqoff() already inform lockdep that IRQs are enabled/disabled, and taking a fault on VMRUN with RFLAGS.IF=1 is already possible. I.e. if there's kernel code that is confused by running with RFLAGS.IF=1, then it's already a problem. In practice, since GIF=0 also blocks NMIs, the only change in exposure to non-KVM code (relative to surrounding VMRUN with STI+CLI) is exception handling code, and except for the kvm_rebooting=1 case, all exceptions in the core VM-Enter/VM-Exit path are fatal.
Use the "raw" variants to enable/disable IRQs to avoid tracing in the "no instrumentation" code; the guest state helpers also take care of tracing IRQ state.
Opportunistically document why KVM needs to do STI in the first place.
Reported-by: Doug Covelli <doug.covelli@broadcom.com>
Closes: https://lore.kernel.org/all/CADH9ctBs1YPmE4aCfGPNBwA10cA8RuAk2gO7542DjMZgs4u...
Fixes: f14eec0a3203 ("KVM: SVM: move more vmentry code to assembly")
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250224165442.2338294-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
[sean: resolve minor syntactic conflict in __svm_sev_es_vcpu_run()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/svm.c     | 14 ++++++++++++++
 arch/x86/kvm/svm/vmenter.S |  9 +--------
 2 files changed, 15 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b6bbd0dc4e65..c95a84afc35f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3982,6 +3982,18 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu, bool spec_ctrl_in
 
 	guest_state_enter_irqoff();
 
+	/*
+	 * Set RFLAGS.IF prior to VMRUN, as the host's RFLAGS.IF at the time of
+	 * VMRUN controls whether or not physical IRQs are masked (KVM always
+	 * runs with V_INTR_MASKING_MASK).  Toggle RFLAGS.IF here to avoid the
+	 * temptation to do STI+VMRUN+CLI, as AMD CPUs bleed the STI shadow
+	 * into guest state if delivery of an event during VMRUN triggers a
+	 * #VMEXIT, and the guest_state transitions already tell lockdep that
+	 * IRQs are being enabled/disabled.  Note!  GIF=0 for the entirety of
+	 * this path, so IRQs aren't actually unmasked while running host code.
+	 */
+	raw_local_irq_enable();
+
 	amd_clear_divider();
 
 	if (sev_es_guest(vcpu->kvm))
@@ -3989,6 +4001,8 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu, bool spec_ctrl_in
 	else
 		__svm_vcpu_run(svm, spec_ctrl_intercepted);
 
+	raw_local_irq_disable();
+
 	guest_state_exit_irqoff();
 }
 
diff --git a/arch/x86/kvm/svm/vmenter.S b/arch/x86/kvm/svm/vmenter.S
index 42824f9b06a2..48b72625cc45 100644
--- a/arch/x86/kvm/svm/vmenter.S
+++ b/arch/x86/kvm/svm/vmenter.S
@@ -170,12 +170,8 @@ SYM_FUNC_START(__svm_vcpu_run)
 	VM_CLEAR_CPU_BUFFERS
 
 	/* Enter guest mode */
-	sti
-
 3:	vmrun %_ASM_AX
 4:
-	cli
-
 	/* Pop @svm to RAX while it's the only available register. */
 	pop %_ASM_AX
 
@@ -343,11 +339,8 @@ SYM_FUNC_START(__svm_sev_es_vcpu_run)
 	VM_CLEAR_CPU_BUFFERS
 
 	/* Enter guest mode */
-	sti
-
 1:	vmrun %_ASM_AX
-
-2:	cli
+2:
 
 	/* Pop @svm to RDI, guest registers have been saved already. */
 	pop %_ASM_DI
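To make the quirk concrete, here's the pattern being removed, reduced to its core (an illustrative sketch of the old vmenter.S flow, not a verbatim quote):

	sti			/* VMRUN executes inside the STI shadow...   */
	vmrun %rax		/* ...a #VMEXIT during event injection bleeds */
	cli			/* the shadow into the vmcb's int_state      */

With the STI hoisted into C code well ahead of VMRUN, the one-instruction shadow has expired by the time VMRUN executes, and the CLI after VMRUN becomes a plain raw_local_irq_disable().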
[ Upstream commit 73b42dc69be8564d4951a14d00f827929fe5ef79 ]
Re-introduce the "split" x2APIC ICR storage that KVM used prior to Intel's IPI virtualization support, but only for AMD. While not stated anywhere in the APM, which describes the ICR as a single 64-bit register, AMD CPUs store the 64-bit ICR as two separate 32-bit values in ICR and ICR2. When IPI virtualization (IPIv on Intel, all AVIC flavors on AMD) is enabled, KVM needs to match CPU behavior as some ICR writes will be handled by the CPU, not by KVM.
Add a kvm_x86_ops knob to control the underlying format used by the CPU to store the x2APIC ICR, and tune it to AMD vs. Intel regardless of whether or not x2AVIC is enabled. If KVM is handling all ICR writes, the storage format for x2APIC mode doesn't matter, and having the behavior follow AMD versus Intel will provide better test coverage and ease debugging.
Fixes: 4d1d7942e36a ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20240719235107.3023592-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
[sean: resolve minor syntactic conflicts]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/lapic.c            | 42 +++++++++++++++++++++++----------
 arch/x86/kvm/svm/svm.c          |  2 ++
 arch/x86/kvm/vmx/vmx.c          |  2 ++
 4 files changed, 36 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index eb06c2f68314..17b4e61a52b9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1547,6 +1547,8 @@ struct kvm_x86_ops {
 	void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
 	void (*enable_irq_window)(struct kvm_vcpu *vcpu);
 	void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
+
+	const bool x2apic_icr_is_split;
 	bool (*check_apicv_inhibit_reasons)(enum kvm_apicv_inhibit reason);
 	void (*refresh_apicv_exec_ctrl)(struct kvm_vcpu *vcpu);
 	void (*hwapic_irr_update)(struct kvm_vcpu *vcpu, int max_irr);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 7f57dce5c828..42eec987ac3d 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2315,11 +2315,25 @@ int kvm_x2apic_icr_write(struct kvm_lapic *apic, u64 data)
 	data &= ~APIC_ICR_BUSY;
 
 	kvm_apic_send_ipi(apic, (u32)data, (u32)(data >> 32));
-	kvm_lapic_set_reg64(apic, APIC_ICR, data);
+	if (kvm_x86_ops.x2apic_icr_is_split) {
+		kvm_lapic_set_reg(apic, APIC_ICR, data);
+		kvm_lapic_set_reg(apic, APIC_ICR2, data >> 32);
+	} else {
+		kvm_lapic_set_reg64(apic, APIC_ICR, data);
+	}
 	trace_kvm_apic_write(APIC_ICR, data);
 	return 0;
 }
 
+static u64 kvm_x2apic_icr_read(struct kvm_lapic *apic)
+{
+	if (kvm_x86_ops.x2apic_icr_is_split)
+		return (u64)kvm_lapic_get_reg(apic, APIC_ICR) |
+		       (u64)kvm_lapic_get_reg(apic, APIC_ICR2) << 32;
+
+	return kvm_lapic_get_reg64(apic, APIC_ICR);
+}
+
 /* emulate APIC access in a trap manner */
 void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
 {
@@ -2337,7 +2351,7 @@ void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
 	 * maybe-unecessary write, and both are in the noise anyways.
 	 */
 	if (apic_x2apic_mode(apic) && offset == APIC_ICR)
-		WARN_ON_ONCE(kvm_x2apic_icr_write(apic, kvm_lapic_get_reg64(apic, APIC_ICR)));
+		WARN_ON_ONCE(kvm_x2apic_icr_write(apic, kvm_x2apic_icr_read(apic)));
 	else
 		kvm_lapic_reg_write(apic, offset, kvm_lapic_get_reg(apic, offset));
 }
@@ -2760,18 +2774,22 @@ static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
 
 		/*
 		 * In x2APIC mode, the LDR is fixed and based on the id.  And
-		 * ICR is internally a single 64-bit register, but needs to be
-		 * split to ICR+ICR2 in userspace for backwards compatibility.
+		 * if the ICR is _not_ split, ICR is internally a single 64-bit
+		 * register, but needs to be split to ICR+ICR2 in userspace for
+		 * backwards compatibility.
 		 */
-		if (set) {
+		if (set)
 			*ldr = kvm_apic_calc_x2apic_ldr(*id);
 
-			icr = __kvm_lapic_get_reg(s->regs, APIC_ICR) |
-			      (u64)__kvm_lapic_get_reg(s->regs, APIC_ICR2) << 32;
-			__kvm_lapic_set_reg64(s->regs, APIC_ICR, icr);
-		} else {
-			icr = __kvm_lapic_get_reg64(s->regs, APIC_ICR);
-			__kvm_lapic_set_reg(s->regs, APIC_ICR2, icr >> 32);
+		if (!kvm_x86_ops.x2apic_icr_is_split) {
+			if (set) {
+				icr = __kvm_lapic_get_reg(s->regs, APIC_ICR) |
+				      (u64)__kvm_lapic_get_reg(s->regs, APIC_ICR2) << 32;
+				__kvm_lapic_set_reg64(s->regs, APIC_ICR, icr);
+			} else {
+				icr = __kvm_lapic_get_reg64(s->regs, APIC_ICR);
+				__kvm_lapic_set_reg(s->regs, APIC_ICR2, icr >> 32);
+			}
 		}
 	}
 
@@ -2971,7 +2989,7 @@ static int kvm_lapic_msr_read(struct kvm_lapic *apic, u32 reg, u64 *data)
 	u32 low;
 
 	if (reg == APIC_ICR) {
-		*data = kvm_lapic_get_reg64(apic, APIC_ICR);
+		*data = kvm_x2apic_icr_read(apic);
 		return 0;
 	}
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index c95a84afc35f..b922f31d1415 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4851,6 +4851,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.enable_nmi_window = svm_enable_nmi_window,
 	.enable_irq_window = svm_enable_irq_window,
 	.update_cr8_intercept = svm_update_cr8_intercept,
+
+	.x2apic_icr_is_split = true,
 	.set_virtual_apic_mode = avic_refresh_virtual_apic_mode,
 	.refresh_apicv_exec_ctrl = avic_refresh_apicv_exec_ctrl,
 	.check_apicv_inhibit_reasons = avic_check_apicv_inhibit_reasons,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fbe26b88f731..9a5cb896229f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8202,6 +8202,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.enable_nmi_window = vmx_enable_nmi_window,
 	.enable_irq_window = vmx_enable_irq_window,
 	.update_cr8_intercept = vmx_update_cr8_intercept,
+
+	.x2apic_icr_is_split = false,
 	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
 	.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
 	.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
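To illustrate the two storage formats, this is how a 64-bit write to the x2APIC ICR lands in the vAPIC page (a condensed restatement of the kvm_x2apic_icr_write() hunk above; APIC_ICR is offset 0x300, APIC_ICR2 is 0x310):

	if (kvm_x86_ops.x2apic_icr_is_split) {
		/* AMD: two separate 32-bit halves in ICR and ICR2 */
		kvm_lapic_set_reg(apic, APIC_ICR, (u32)data);
		kvm_lapic_set_reg(apic, APIC_ICR2, (u32)(data >> 32));
	} else {
		/* Intel: one 64-bit value stored at ICR */
		kvm_lapic_set_reg64(apic, APIC_ICR, data);
	}

kvm_x2apic_icr_read() is the exact inverse, so an ICR read reassembles the same 64-bit value regardless of which format the CPU (and thus KVM) uses.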
[ Upstream commit 76bce9f10162cd4b36ac0b7889649b22baf70ebd ]
Pass the target vCPU to the hwapic_isr_update() vendor hook so that VMX can defer the update until after nested VM-Exit if an EOI for L1's vAPIC occurs while L2 is active.
Note, commit d39850f57d21 ("KVM: x86: Drop @vcpu parameter from kvm_x86_ops.hwapic_isr_update()") removed the parameter with the justification that doing so "allows for a decent amount of (future) cleanup in the APIC code", but it's not at all clear what cleanup was intended, or if it was ever realized.
No functional change intended.
Cc: stable@vger.kernel.org
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20241128000010.4051275-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
[sean: account for lack of kvm_x86_call(), drop vmx/x86_ops.h change]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 arch/x86/kvm/lapic.c            | 8 ++++----
 arch/x86/kvm/vmx/vmx.c          | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 17b4e61a52b9..6db42ee82032 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1552,7 +1552,7 @@ struct kvm_x86_ops {
 	bool (*check_apicv_inhibit_reasons)(enum kvm_apicv_inhibit reason);
 	void (*refresh_apicv_exec_ctrl)(struct kvm_vcpu *vcpu);
 	void (*hwapic_irr_update)(struct kvm_vcpu *vcpu, int max_irr);
-	void (*hwapic_isr_update)(int isr);
+	void (*hwapic_isr_update)(struct kvm_vcpu *vcpu, int isr);
 	bool (*guest_apic_has_interrupt)(struct kvm_vcpu *vcpu);
 	void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
 	void (*set_virtual_apic_mode)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 42eec987ac3d..3d65d6a023c9 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -587,7 +587,7 @@ static inline void apic_set_isr(int vec, struct kvm_lapic *apic)
 	 * just set SVI.
 	 */
 	if (unlikely(apic->apicv_active))
-		static_call_cond(kvm_x86_hwapic_isr_update)(vec);
+		static_call_cond(kvm_x86_hwapic_isr_update)(apic->vcpu, vec);
 	else {
 		++apic->isr_count;
 		BUG_ON(apic->isr_count > MAX_APIC_VECTOR);
@@ -632,7 +632,7 @@ static inline void apic_clear_isr(int vec, struct kvm_lapic *apic)
 	 * and must be left alone.
 	 */
 	if (unlikely(apic->apicv_active))
-		static_call_cond(kvm_x86_hwapic_isr_update)(apic_find_highest_isr(apic));
+		static_call_cond(kvm_x86_hwapic_isr_update)(apic->vcpu, apic_find_highest_isr(apic));
 	else {
 		--apic->isr_count;
 		BUG_ON(apic->isr_count < 0);
@@ -2554,7 +2554,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
 	if (apic->apicv_active) {
 		static_call_cond(kvm_x86_apicv_post_state_restore)(vcpu);
 		static_call_cond(kvm_x86_hwapic_irr_update)(vcpu, -1);
-		static_call_cond(kvm_x86_hwapic_isr_update)(-1);
+		static_call_cond(kvm_x86_hwapic_isr_update)(vcpu, -1);
 	}
 
 	vcpu->arch.apic_arb_prio = 0;
@@ -2847,7 +2847,7 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
 	if (apic->apicv_active) {
 		static_call_cond(kvm_x86_apicv_post_state_restore)(vcpu);
 		static_call_cond(kvm_x86_hwapic_irr_update)(vcpu, apic_find_highest_irr(apic));
-		static_call_cond(kvm_x86_hwapic_isr_update)(apic_find_highest_isr(apic));
+		static_call_cond(kvm_x86_hwapic_isr_update)(vcpu, apic_find_highest_isr(apic));
 	}
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 	if (ioapic_in_kernel(vcpu->kvm))
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9a5cb896229f..721ba6ddb121 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6708,7 +6708,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 	put_page(page);
 }
 
-static void vmx_hwapic_isr_update(int max_isr)
+static void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
 {
 	u16 status;
 	u8 old;
From: Chao Gao <chao.gao@intel.com>
[ Upstream commit 04bc93cf49d16d01753b95ddb5d4f230b809a991 ]
If KVM emulates an EOI for L1's virtual APIC while L2 is active, defer updating GUEST_INTR_STATUS.SVI, i.e. the VMCS's cache of the highest in-service IRQ, until L1 is active, as vmcs01, not vmcs02, needs to track vISR. The missed SVI update for vmcs01 can result in L1 interrupts being incorrectly blocked, e.g. if there is a pending interrupt with lower priority than the interrupt that was EOI'd.
This bug only affects use cases where L1's vAPIC is effectively passed through to L2, e.g. in a pKVM scenario where L2 is L1's deprivileged host, as KVM will only emulate an EOI for L1's vAPIC if Virtual Interrupt Delivery (VID) is disabled in vmcs12, and L1 isn't intercepting L2 accesses to its (virtual) APIC page (or if x2APIC is enabled, the EOI MSR).
WARN() if KVM updates L1's ISR while L2 is active with VID enabled, as an EOI from L2 is supposed to affect L2's vAPIC, but still defer the update, to try to keep L1 alive. Specifically, KVM forwards all APICv-related VM-Exits to L1 via nested_vmx_l1_wants_exit():
	case EXIT_REASON_APIC_ACCESS:
	case EXIT_REASON_APIC_WRITE:
	case EXIT_REASON_EOI_INDUCED:
		/*
		 * The controls for "virtualize APIC accesses," "APIC-
		 * register virtualization," and "virtual-interrupt
		 * delivery" only come from vmcs12.
		 */
		return true;
Fixes: c7c9c56ca26f ("x86, apicv: add virtual interrupt delivery support")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/kvm/20230312180048.1778187-1-jason.cj.chen@intel.com
Reported-by: Markku Ahvenjärvi <mankku@gmail.com>
Closes: https://lore.kernel.org/all/20240920080012.74405-1-mankku@gmail.com
Cc: Janne Karhunen <janne.karhunen@gmail.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: drop request, handle in VMX, write changelog]
Tested-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20241128000010.4051275-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
[sean: resolve minor syntactic conflict in lapic.h, account for lack of kvm_x86_call(), drop sanity check due to lack of wants_to_run]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/lapic.c      | 11 +++++++++++
 arch/x86/kvm/lapic.h      |  1 +
 arch/x86/kvm/vmx/nested.c |  5 +++++
 arch/x86/kvm/vmx/vmx.c    | 16 ++++++++++++++++
 arch/x86/kvm/vmx/vmx.h    |  1 +
 5 files changed, 34 insertions(+)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 3d65d6a023c9..9aae76b74417 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -640,6 +640,17 @@ static inline void apic_clear_isr(int vec, struct kvm_lapic *apic)
 	}
 }
 
+void kvm_apic_update_hwapic_isr(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	if (WARN_ON_ONCE(!lapic_in_kernel(vcpu)) || !apic->apicv_active)
+		return;
+
+	static_call(kvm_x86_hwapic_isr_update)(vcpu, apic_find_highest_isr(apic));
+}
+EXPORT_SYMBOL_GPL(kvm_apic_update_hwapic_isr);
+
 int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu)
 {
 	/* This may race with setting of irr in __apic_accept_irq() and
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index a5ac4a5a5179..e5d2dc58fcf8 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -122,6 +122,7 @@ int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
 int kvm_apic_get_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s);
 int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s);
 enum lapic_mode kvm_get_apic_mode(struct kvm_vcpu *vcpu);
+void kvm_apic_update_hwapic_isr(struct kvm_vcpu *vcpu);
 int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
 
 u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 8052f8b7d8e1..d55f7edc0860 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4839,6 +4839,11 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 		kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
 	}
 
+	if (vmx->nested.update_vmcs01_hwapic_isr) {
+		vmx->nested.update_vmcs01_hwapic_isr = false;
+		kvm_apic_update_hwapic_isr(vcpu);
+	}
+
 	if ((vm_exit_reason != -1) &&
 	    (enable_shadow_vmcs || evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)))
 		vmx->nested.need_vmcs12_to_shadow_sync = true;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 721ba6ddb121..7b87fbc69b21 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6713,6 +6713,22 @@ static void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
 	u16 status;
 	u8 old;
 
+	/*
+	 * If L2 is active, defer the SVI update until vmcs01 is loaded, as SVI
+	 * is only relevant if and only if Virtual Interrupt Delivery is
+	 * enabled in vmcs12, and if VID is enabled then L2 EOIs affect L2's
+	 * vAPIC, not L1's vAPIC.  KVM must update vmcs01 on the next nested
+	 * VM-Exit, otherwise L1 will run with a stale SVI.
+	 */
+	if (is_guest_mode(vcpu)) {
+		/*
+		 * KVM is supposed to forward intercepted L2 EOIs to L1 if VID
+		 * is enabled in vmcs12; as above, the EOIs affect L2's vAPIC.
+		 */
+		to_vmx(vcpu)->nested.update_vmcs01_hwapic_isr = true;
+		return;
+	}
+
 	if (max_isr == -1)
 		max_isr = 0;
 
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 9e0bb98b116d..8b4b149bd9c1 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -189,6 +189,7 @@ struct nested_vmx {
 	bool reload_vmcs01_apic_access_page;
 	bool update_vmcs01_cpu_dirty_logging;
 	bool update_vmcs01_apicv_status;
+	bool update_vmcs01_hwapic_isr;
 
 	/*
 	 * Enlightened VMCS has been enabled. It does not mean that L1 has to
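Condensing the deferral protocol from the hunks above into one flow (illustrative, not code from the patch):

	/* EOI for L1's vAPIC emulated while L2 is active (VID off in vmcs12) */
	vmx_hwapic_isr_update(vcpu, max_isr)
	  -> is_guest_mode(vcpu) is true, vmcs02 is loaded
	  -> nested.update_vmcs01_hwapic_isr = true;	/* defer */

	/* later: nested VM-Exit, vmcs01 is loaded */
	nested_vmx_vmexit(...)
	  -> sees the flag, clears it
	  -> kvm_apic_update_hwapic_isr(vcpu)
	     -> hwapic_isr_update(vcpu, apic_find_highest_isr(apic))	/* SVI lands in vmcs01 */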
[ Upstream commit f1fb088d9cecde5c3066d8ff8846789667519b7d ]
Take irqfds.lock when adding/deleting an IRQ bypass producer to ensure irqfd->producer isn't modified while kvm_irq_routing_update() is running. The only lock held when a producer is added/removed is irqbypass's mutex.
Fixes: 872768800652 ("KVM: x86: select IRQ_BYPASS_MANAGER")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[sean: account for lack of kvm_x86_call()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a6dc8f662fa4..08c4ad276ccb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13385,16 +13385,22 @@ int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
 {
 	struct kvm_kernel_irqfd *irqfd =
 		container_of(cons, struct kvm_kernel_irqfd, consumer);
+	struct kvm *kvm = irqfd->kvm;
 	int ret;
 
-	irqfd->producer = prod;
 	kvm_arch_start_assignment(irqfd->kvm);
+
+	spin_lock_irq(&kvm->irqfds.lock);
+	irqfd->producer = prod;
+
 	ret = static_call(kvm_x86_pi_update_irte)(irqfd->kvm,
 						  prod->irq, irqfd->gsi, 1);
-
 	if (ret)
 		kvm_arch_end_assignment(irqfd->kvm);
 
+	spin_unlock_irq(&kvm->irqfds.lock);
+
 	return ret;
 }
 
@@ -13404,9 +13410,9 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 	int ret;
 	struct kvm_kernel_irqfd *irqfd =
 		container_of(cons, struct kvm_kernel_irqfd, consumer);
+	struct kvm *kvm = irqfd->kvm;
 
 	WARN_ON(irqfd->producer != prod);
-	irqfd->producer = NULL;
 
 	/*
 	 * When producer of consumer is unregistered, we change back to
@@ -13414,11 +13420,18 @@ void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
 	 * when the irq is masked/disabled or the consumer side (KVM
 	 * int this case doesn't want to receive the interrupts.
 	 */
+	spin_lock_irq(&kvm->irqfds.lock);
+	irqfd->producer = NULL;
+
 	ret = static_call(kvm_x86_pi_update_irte)(irqfd->kvm,
 						  prod->irq, irqfd->gsi, 0);
 	if (ret)
 		printk(KERN_INFO "irq bypass consumer (token %p) unregistration"
 		       " fails: %d\n", irqfd->consumer.token, ret);
 
+	spin_unlock_irq(&kvm->irqfds.lock);
+
 	kvm_arch_end_assignment(irqfd->kvm);
 }
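Roughly, the race being closed looks like this (an illustrative interleaving reconstructed from the changelog, not from the patch):

	irq_bypass add_producer			kvm_irq_routing_update()
	-----------------------			------------------------
	irqfd->producer = prod;
						takes kvm->irqfds.lock, walks the
						irqfds and updates the IRTE based
						on a half-registered producer
	pi_update_irte(..., 1);

Taking kvm->irqfds.lock around both the irqfd->producer assignment and the IRTE update makes the pair atomic with respect to concurrent routing updates.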
[ Upstream commit fb71c795935652fa20eaf9517ca9547f5af99a76 ]
Move KVM's snapshot of DEBUGCTL to kvm_vcpu_arch and take the snapshot in common x86, so that SVM can also use the snapshot.
Opportunistically change the field to a u64. While bits 63:32 are reserved on AMD, not mentioned at all in Intel's SDM, and managed as an "unsigned long" by the kernel, DEBUGCTL is an MSR and therefore a 64-bit value.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
[sean: resolve minor syntactic conflict in vmx_vcpu_load()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/vmx/vmx.c          | 8 ++------
 arch/x86/kvm/vmx/vmx.h          | 2 --
 arch/x86/kvm/x86.c              | 1 +
 4 files changed, 4 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6db42ee82032..555c7bf35e28 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -677,6 +677,7 @@ struct kvm_vcpu_arch {
 	u32 pkru;
 	u32 hflags;
 	u64 efer;
+	u64 host_debugctl;
 	u64 apic_base;
 	struct kvm_lapic *apic;    /* kernel irqchip context */
 	bool load_eoi_exitmap_pending;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7b87fbc69b21..c24da2cff208 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1418,13 +1418,9 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
  */
 static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
-
 	vmx_vcpu_load_vmcs(vcpu, cpu, NULL);
 
 	vmx_vcpu_pi_load(vcpu, cpu);
-
-	vmx->host_debugctlmsr = get_debugctlmsr();
 }
 
 static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
@@ -7275,8 +7271,8 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	}
 
 	/* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */
-	if (vmx->host_debugctlmsr)
-		update_debugctlmsr(vmx->host_debugctlmsr);
+	if (vcpu->arch.host_debugctl)
+		update_debugctlmsr(vcpu->arch.host_debugctl);
 
 #ifndef CONFIG_X86_64
 	/*
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 8b4b149bd9c1..357819872d80 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -352,8 +352,6 @@ struct vcpu_vmx {
 	/* apic deadline value in host tsc */
 	u64 hv_deadline_tsc;
 
-	unsigned long host_debugctlmsr;
-
 	/*
 	 * Only bits masked by msr_ia32_feature_control_valid_bits can be set in
 	 * msr_ia32_feature_control. FEAT_CTL_LOCKED is always included
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 08c4ad276ccb..2178f6bb8e90 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4742,6 +4742,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 	/* Save host pkru register if supported */
 	vcpu->arch.host_pkru = read_pkru();
+	vcpu->arch.host_debugctl = get_debugctlmsr();
 
 	/* Apply any externally detected TSC adjustments (due to suspend) */
 	if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
[ Upstream commit 189ecdb3e112da703ac0699f4ec76aa78122f911 ]
Snapshot the host's DEBUGCTL after disabling IRQs, as perf can toggle debugctl bits from IRQ context, e.g. when enabling/disabling events via smp_call_function_single(). Taking the snapshot (long) before IRQs are disabled could result in KVM effectively clobbering DEBUGCTL due to using a stale snapshot.
Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2178f6bb8e90..0c3908544205 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4742,7 +4742,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 	/* Save host pkru register if supported */
 	vcpu->arch.host_pkru = read_pkru();
-	vcpu->arch.host_debugctl = get_debugctlmsr();
 
 	/* Apply any externally detected TSC adjustments (due to suspend) */
 	if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
@@ -10851,6 +10850,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		set_debugreg(0, 7);
 	}
 
+	vcpu->arch.host_debugctl = get_debugctlmsr();
+
 	guest_timing_enter_irqoff();
 
 	for (;;) {
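The window being closed, sketched (illustrative; the perf-side write is hypothetical but representative, e.g. perf enabling an event via smp_call_function_single()):

	kvm_arch_vcpu_load()			perf, in IRQ context
	--------------------			--------------------
	host_debugctl = get_debugctlmsr();
						update_debugctlmsr(new);  /* toggles a DEBUGCTL bit */
	...
	vcpu_enter_guest()
	  local_irq_disable();
	  VM-Enter ... VM-Exit zeroes DEBUGCTL
	  update_debugctlmsr(host_debugctl);	/* writes the stale snapshot */

Moving the get_debugctlmsr() read inside the IRQs-disabled region leaves no window in which perf can change DEBUGCTL between the snapshot and VM-Enter.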
[ Upstream commit e76ae52747a82a548742107b4100e90da41a624d ]
Add helpers to print unimplemented MSR accesses and condition all such prints on report_ignored_msrs, i.e. honor userspace's request to not print unimplemented MSRs. Even though vcpu_unimpl() is ratelimited, printing can still be problematic, e.g. if a print gets stalled when host userspace is writing MSRs during live migration, an effective stall can result in very noticeable disruption in the guest.
E.g. the profile below was taken while calling KVM_SET_MSRS on the PMU counters while the PMU was disabled in KVM.
 -   99.75%     0.00%  [.] __ioctl
    - __ioctl
       - 99.74% entry_SYSCALL_64_after_hwframe
            do_syscall_64
            sys_ioctl
          - do_vfs_ioctl
             - 92.48% kvm_vcpu_ioctl
                - kvm_arch_vcpu_ioctl
                   - 85.12% kvm_set_msr_ignored_check
                        svm_set_msr
                        kvm_set_msr_common
                        printk
                        vprintk_func
                        vprintk_default
                        vprintk_emit
                        console_unlock
                        call_console_drivers
                        univ8250_console_write
                        serial8250_console_write
                        uart_console_write
Reported-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230124234905.3774678-3-seanjc@google.com
Stable-dep-of: 7d0cce6cbe71 ("KVM: VMX: Wrap all accesses to IA32_DEBUGCTL with getter/setter APIs")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/hyperv.c  | 10 ++++------
 arch/x86/kvm/svm/svm.c |  5 ++---
 arch/x86/kvm/vmx/vmx.c |  4 +---
 arch/x86/kvm/x86.c     | 18 +++++------------
 arch/x86/kvm/x86.h     | 12 ++++++++++++
 5 files changed, 24 insertions(+), 25 deletions(-)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 28555bbd52e8..cb0a531e13c5 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1406,8 +1406,7 @@ static int kvm_hv_set_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data,
 	case HV_X64_MSR_SYNDBG_CONTROL ... HV_X64_MSR_SYNDBG_PENDING_BUFFER:
 		return syndbg_set_msr(vcpu, msr, data, host);
 	default:
-		vcpu_unimpl(vcpu, "Hyper-V unhandled wrmsr: 0x%x data 0x%llx\n",
-			    msr, data);
+		kvm_pr_unimpl_wrmsr(vcpu, msr, data);
 		return 1;
 	}
 	return 0;
@@ -1528,8 +1527,7 @@ static int kvm_hv_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool host)
 			return 1;
 		break;
 	default:
-		vcpu_unimpl(vcpu, "Hyper-V unhandled wrmsr: 0x%x data 0x%llx\n",
-			    msr, data);
+		kvm_pr_unimpl_wrmsr(vcpu, msr, data);
 		return 1;
 	}
 
@@ -1581,7 +1579,7 @@ static int kvm_hv_get_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata,
 	case HV_X64_MSR_SYNDBG_CONTROL ... HV_X64_MSR_SYNDBG_PENDING_BUFFER:
 		return syndbg_get_msr(vcpu, msr, pdata, host);
 	default:
-		vcpu_unimpl(vcpu, "Hyper-V unhandled rdmsr: 0x%x\n", msr);
+		kvm_pr_unimpl_rdmsr(vcpu, msr);
 		return 1;
 	}
 
@@ -1646,7 +1644,7 @@ static int kvm_hv_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata,
 		data = APIC_BUS_FREQUENCY;
 		break;
 	default:
-		vcpu_unimpl(vcpu, "Hyper-V unhandled rdmsr: 0x%x\n", msr);
+		kvm_pr_unimpl_rdmsr(vcpu, msr);
 		return 1;
 	}
 	*pdata = data;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b922f31d1415..2c0f9c7d1242 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3035,8 +3035,7 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 		break;
 	case MSR_IA32_DEBUGCTLMSR:
 		if (!lbrv) {
-			vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTL 0x%llx, nop\n",
-				    __func__, data);
+			kvm_pr_unimpl_wrmsr(vcpu, ecx, data);
 			break;
 		}
 
@@ -3077,7 +3076,7 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	case MSR_VM_CR:
 		return svm_set_vm_cr(vcpu, data);
 	case MSR_VM_IGNNE:
-		vcpu_unimpl(vcpu, "unimplemented wrmsr: 0x%x data 0x%llx\n", ecx, data);
+		kvm_pr_unimpl_wrmsr(vcpu, ecx, data);
 		break;
 	case MSR_AMD64_DE_CFG: {
 		struct kvm_msr_entry msr_entry;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c24da2cff208..390af16d9a67 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2140,9 +2140,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 
 		invalid = data & ~vmx_get_supported_debugctl(vcpu, msr_info->host_initiated);
 		if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) {
-			if (report_ignored_msrs)
-				vcpu_unimpl(vcpu, "%s: BTF|LBR in IA32_DEBUGCTLMSR 0x%llx, nop\n",
-					    __func__, data);
+			kvm_pr_unimpl_wrmsr(vcpu, msr_index, data);
 			data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
 			invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
 		}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c3908544205..e7c73360890d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3573,7 +3573,6 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
-	bool pr = false;
 	u32 msr = msr_info->index;
 	u64 data = msr_info->data;
 
@@ -3625,15 +3624,13 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (data == BIT_ULL(18)) {
 			vcpu->arch.msr_hwcr = data;
 		} else if (data != 0) {
-			vcpu_unimpl(vcpu, "unimplemented HWCR wrmsr: 0x%llx\n",
-				    data);
+			kvm_pr_unimpl_wrmsr(vcpu, msr, data);
 			return 1;
 		}
 		break;
 	case MSR_FAM10H_MMIO_CONF_BASE:
 		if (data != 0) {
-			vcpu_unimpl(vcpu, "unimplemented MMIO_CONF_BASE wrmsr: "
-				    "0x%llx\n", data);
+			kvm_pr_unimpl_wrmsr(vcpu, msr, data);
 			return 1;
 		}
 		break;
@@ -3813,16 +3810,13 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 
 	case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR3:
 	case MSR_P6_PERFCTR0 ... MSR_P6_PERFCTR1:
-		pr = true;
-		fallthrough;
 	case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL3:
 	case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL1:
 		if (kvm_pmu_is_valid_msr(vcpu, msr))
 			return kvm_pmu_set_msr(vcpu, msr_info);
 
-		if (pr || data != 0)
-			vcpu_unimpl(vcpu, "disabled perfctr wrmsr: "
-				    "0x%x data 0x%llx\n", msr, data);
+		if (data)
+			kvm_pr_unimpl_wrmsr(vcpu, msr, data);
 		break;
 	case MSR_K7_CLK_CTL:
 		/*
@@ -3849,9 +3843,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		/* Drop writes to this legacy MSR -- see rdmsr
 		 * counterpart for further detail.
 		 */
-		if (report_ignored_msrs)
-			vcpu_unimpl(vcpu, "ignored wrmsr: 0x%x data 0x%llx\n",
-				    msr, data);
+		kvm_pr_unimpl_wrmsr(vcpu, msr, data);
 		break;
 	case MSR_AMD64_OSVW_ID_LENGTH:
 		if (!guest_cpuid_has(vcpu, X86_FEATURE_OSVW))
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 9de72586f406..f3554bf05201 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -331,6 +331,18 @@ extern bool report_ignored_msrs;
 
 extern bool eager_page_split;
 
+static inline void kvm_pr_unimpl_wrmsr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
+{
+	if (report_ignored_msrs)
+		vcpu_unimpl(vcpu, "Unhandled WRMSR(0x%x) = 0x%llx\n", msr, data);
+}
+
+static inline void kvm_pr_unimpl_rdmsr(struct kvm_vcpu *vcpu, u32 msr)
+{
+	if (report_ignored_msrs)
+		vcpu_unimpl(vcpu, "Unhandled RDMSR(0x%x)\n", msr);
+}
+
 static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
 {
 	return pvclock_scale_delta(nsec, vcpu->arch.virtual_tsc_mult,
[ Upstream commit 9c9025ea003a03f967affd690f39b4ef3452c0f5 ]
Annotate the kvm_entry() tracepoint with "immediate exit" when KVM is forcing a VM-Exit immediately after VM-Enter, e.g. when KVM wants to inject an event but needs to first complete some other operation. Knowing that KVM is (or isn't) forcing an exit is useful information when debugging issues related to event injection.
Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240110012705.506918-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h | 3 ++-
 arch/x86/kvm/svm/svm.c          | 5 +++--
 arch/x86/kvm/trace.h            | 9 ++++++---
 arch/x86/kvm/vmx/vmx.c          | 4 ++--
 arch/x86/kvm/x86.c              | 2 +-
 5 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 555c7bf35e28..93f523762854 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1528,7 +1528,8 @@ struct kvm_x86_ops {
 	void (*flush_tlb_guest)(struct kvm_vcpu *vcpu);
 
 	int (*vcpu_pre_run)(struct kvm_vcpu *vcpu);
-	enum exit_fastpath_completion (*vcpu_run)(struct kvm_vcpu *vcpu);
+	enum exit_fastpath_completion (*vcpu_run)(struct kvm_vcpu *vcpu,
+						  bool force_immediate_exit);
 	int (*handle_exit)(struct kvm_vcpu *vcpu,
 		enum exit_fastpath_completion exit_fastpath);
 	int (*skip_emulated_instruction)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 2c0f9c7d1242..b4283c2358a6 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4005,12 +4005,13 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu, bool spec_ctrl_in
 	guest_state_exit_irqoff();
 }
 
-static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
+static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
+					  bool force_immediate_exit)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 	bool spec_ctrl_intercepted = msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL);
 
-	trace_kvm_entry(vcpu);
+	trace_kvm_entry(vcpu, force_immediate_exit);
 
 	svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX];
 	svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 6c1dcf44c4fa..ab407bc00d84 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -15,20 +15,23 @@
  * Tracepoint for guest mode entry.
  */
 TRACE_EVENT(kvm_entry,
-	TP_PROTO(struct kvm_vcpu *vcpu),
-	TP_ARGS(vcpu),
+	TP_PROTO(struct kvm_vcpu *vcpu, bool force_immediate_exit),
+	TP_ARGS(vcpu, force_immediate_exit),
 
 	TP_STRUCT__entry(
 		__field(	unsigned int,	vcpu_id		)
 		__field(	unsigned long,	rip		)
+		__field(	bool,		immediate_exit	)
 	),
 
 	TP_fast_assign(
 		__entry->vcpu_id	= vcpu->vcpu_id;
 		__entry->rip		= kvm_rip_read(vcpu);
+		__entry->immediate_exit	= force_immediate_exit;
 	),
 
-	TP_printk("vcpu %u, rip 0x%lx", __entry->vcpu_id, __entry->rip)
+	TP_printk("vcpu %u, rip 0x%lx%s", __entry->vcpu_id, __entry->rip,
+		  __entry->immediate_exit ? "[immediate exit]" : "")
 );
 
 /*
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 390af16d9a67..0b495979a02b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7171,7 +7171,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 	guest_state_exit_irqoff();
 }
 
-static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
+static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	unsigned long cr3, cr4;
@@ -7198,7 +7198,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 		return EXIT_FASTPATH_NONE;
 	}
 
-	trace_kvm_entry(vcpu);
+	trace_kvm_entry(vcpu, force_immediate_exit);
 
 	if (vmx->ple_window_dirty) {
 		vmx->ple_window_dirty = false;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e7c73360890d..652f90ad7107 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10856,7 +10856,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	WARN_ON_ONCE((kvm_vcpu_apicv_activated(vcpu) != kvm_vcpu_apicv_active(vcpu)) &&
 		     (kvm_get_apic_mode(vcpu) != LAPIC_MODE_DISABLED));
 
-	exit_fastpath = static_call(kvm_x86_vcpu_run)(vcpu);
+	exit_fastpath = static_call(kvm_x86_vcpu_run)(vcpu, req_immediate_exit);
 	if (likely(exit_fastpath != EXIT_FASTPATH_REENTER_GUEST))
 		break;
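With the new field, a forced exit renders along these lines (an example constructed from the TP_printk format above, not captured trace output):

	kvm_entry: vcpu 0, rip 0xfff0[immediate exit]

versus the usual "kvm_entry: vcpu 0, rip 0xfff0" when force_immediate_exit is false.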
[ Upstream commit e6b5d16bbd2d4c8259ad76aa33de80d561aba5f9 ]
Re-enter the guest in the fast path if the VMX preemption timer VM-Exit was "spurious", i.e. if KVM "soft disabled" the timer by writing -1u and by some miracle the timer expired before any other VM-Exit occurred. This is just an intermediate step to cleaning up the preemption timer handling; optimizing these types of spurious VM-Exits is not interesting, as they are extremely rare/infrequent.
Link: https://lore.kernel.org/r/20240110012705.506918-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0b495979a02b..96bbccd9477c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5933,8 +5933,15 @@ static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
-	if (!vmx->req_immediate_exit &&
-	    !unlikely(vmx->loaded_vmcs->hv_timer_soft_disabled)) {
+	/*
+	 * In the *extremely* unlikely scenario that this is a spurious VM-Exit
+	 * due to the timer expiring while it was "soft" disabled, just eat the
+	 * exit and re-enter the guest.
+	 */
+	if (unlikely(vmx->loaded_vmcs->hv_timer_soft_disabled))
+		return EXIT_FASTPATH_REENTER_GUEST;
+
+	if (!vmx->req_immediate_exit) {
 		kvm_lapic_expired_hv_timer(vcpu);
 		return EXIT_FASTPATH_REENTER_GUEST;
 	}
[ Upstream commit 11776aa0cfa7d007ad1799b1553bdcbd830e5010 ]
Handle VMX preemption timer VM-Exits due to KVM forcing an exit in the exit fastpath, i.e. avoid calling back into handle_preemption_timer() for the same exit. There is no work to be done for forced exits, as the name suggests the goal is purely to get control back in KVM.
In addition to shaving a few cycles, this will allow cleanly separating handle_fastpath_preemption_timer() from handle_preemption_timer(), e.g. it's not immediately obvious why _apparently_ calling handle_fastpath_preemption_timer() twice on a "slow" exit is necessary: the "slow" call is necessary to handle exits from L2, which are excluded from the fastpath by vmx_vcpu_run().
Link: https://lore.kernel.org/r/20240110012705.506918-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 96bbccd9477c..c804ad001a79 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5941,12 +5941,15 @@ static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu)
 	if (unlikely(vmx->loaded_vmcs->hv_timer_soft_disabled))
 		return EXIT_FASTPATH_REENTER_GUEST;
 
-	if (!vmx->req_immediate_exit) {
-		kvm_lapic_expired_hv_timer(vcpu);
-		return EXIT_FASTPATH_REENTER_GUEST;
-	}
+	/*
+	 * If the timer expired because KVM used it to force an immediate exit,
+	 * then mission accomplished.
+	 */
+	if (vmx->req_immediate_exit)
+		return EXIT_FASTPATH_EXIT_HANDLED;
 
-	return EXIT_FASTPATH_NONE;
+	kvm_lapic_expired_hv_timer(vcpu);
+	return EXIT_FASTPATH_REENTER_GUEST;
 }
 
 static int handle_preemption_timer(struct kvm_vcpu *vcpu)
[ Upstream commit bf1a49436ea37b98dd2f37c57608951d0e28eecc ]
Let the fastpath code decide which exits can/can't be handled in the fastpath when L2 is active, e.g. when KVM generates a VMX preemption timer exit to forcefully regain control, there is no "work" to be done and so such exits can be handled in the fastpath regardless of whether L1 or L2 is active.
Moving the is_guest_mode() check into the fastpath code also makes it easier to see that L2 isn't allowed to use the fastpath in most cases, e.g. it's not immediately obvious why handle_fastpath_preemption_timer() is called from the fastpath and the normal path.
Link: https://lore.kernel.org/r/20240110012705.506918-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
[sean: resolve syntactic conflict in svm_exit_handlers_fastpath()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/svm.c | 6 +++---
 arch/x86/kvm/vmx/vmx.c | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b4283c2358a6..337a304d211b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3964,6 +3964,9 @@ static fastpath_t svm_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
 {
 	struct vmcb_control_area *control = &to_svm(vcpu)->vmcb->control;
 
+	if (is_guest_mode(vcpu))
+		return EXIT_FASTPATH_NONE;
+
 	/*
 	 * Note, the next RIP must be provided as SRCU isn't held, i.e. KVM
 	 * can't read guest memory (dereference memslots) to decode the WRMSR.
@@ -4127,9 +4130,6 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
 
 	svm_complete_interrupts(vcpu);
 
-	if (is_guest_mode(vcpu))
-		return EXIT_FASTPATH_NONE;
-
 	return svm_exit_handlers_fastpath(vcpu);
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c804ad001a79..18ceed9046a9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7138,6 +7138,9 @@ void noinstr vmx_spec_ctrl_restore_host(struct vcpu_vmx *vmx,
 
 static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
 {
+	if (is_guest_mode(vcpu))
+		return EXIT_FASTPATH_NONE;
+
 	switch (to_vmx(vcpu)->exit_reason.basic) {
 	case EXIT_REASON_MSR_WRITE:
 		return handle_fastpath_set_msr_irqoff(vcpu);
@@ -7337,9 +7340,6 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 	vmx_recover_nmi_blocking(vmx);
 	vmx_complete_interrupts(vmx);
 
-	if (is_guest_mode(vcpu))
-		return EXIT_FASTPATH_NONE;
-
 	return vmx_exit_handlers_fastpath(vcpu);
 }
[ Upstream commit 7b3d1bbf8d68d76fb21210932a5e8ed8ea80dbcc ]
Eat VMX preemption timer exits in the fastpath regardless of whether L1 or L2 is active. The VM-Exit is 100% KVM-induced, i.e. there is nothing directly related to the exit that KVM needs to do on behalf of the guest, thus there is no reason to wait until the slow path to do nothing.
Opportunistically add comments explaining why preemption timer exits for emulating the guest's APIC timer need to go down the slow path.
Link: https://lore.kernel.org/r/20240110012705.506918-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 18ceed9046a9..4db9d41d988c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5948,13 +5948,26 @@ static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu)
 	if (vmx->req_immediate_exit)
 		return EXIT_FASTPATH_EXIT_HANDLED;
 
+	/*
+	 * If L2 is active, go down the slow path as emulating the guest timer
+	 * expiration likely requires synthesizing a nested VM-Exit.
+	 */
+	if (is_guest_mode(vcpu))
+		return EXIT_FASTPATH_NONE;
+
 	kvm_lapic_expired_hv_timer(vcpu);
 	return EXIT_FASTPATH_REENTER_GUEST;
 }
 
 static int handle_preemption_timer(struct kvm_vcpu *vcpu)
 {
-	handle_fastpath_preemption_timer(vcpu);
+	/*
+	 * This non-fastpath handler is reached if and only if the preemption
+	 * timer was being used to emulate a guest timer while L2 is active.
+	 * All other scenarios are supposed to be handled in the fastpath.
+	 */
+	WARN_ON_ONCE(!is_guest_mode(vcpu));
+	kvm_lapic_expired_hv_timer(vcpu);
 	return 1;
 }
 
@@ -7138,7 +7151,12 @@ void noinstr vmx_spec_ctrl_restore_host(struct vcpu_vmx *vmx,
 
 static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
 {
-	if (is_guest_mode(vcpu))
+	/*
+	 * If L2 is active, some VMX preemption timer exits can be handled in
+	 * the fastpath even, all other exits must use the slow path.
+	 */
+	if (is_guest_mode(vcpu) &&
+	    to_vmx(vcpu)->exit_reason.basic != EXIT_REASON_PREEMPTION_TIMER)
 		return EXIT_FASTPATH_NONE;
 
 	switch (to_vmx(vcpu)->exit_reason.basic) {
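Taken together with the two preceding patches, the VMX fastpath gate now behaves like this condensed sketch (matching the hunks above, before the force_immediate_exit plumbing in the next patch):

	static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
	{
		/* L2 may only take the preemption timer fastpath. */
		if (is_guest_mode(vcpu) &&
		    to_vmx(vcpu)->exit_reason.basic != EXIT_REASON_PREEMPTION_TIMER)
			return EXIT_FASTPATH_NONE;

		switch (to_vmx(vcpu)->exit_reason.basic) {
		case EXIT_REASON_MSR_WRITE:
			return handle_fastpath_set_msr_irqoff(vcpu);
		case EXIT_REASON_PREEMPTION_TIMER:
			return handle_fastpath_preemption_timer(vcpu);
		default:
			return EXIT_FASTPATH_NONE;
		}
	}

i.e. a KVM-induced timer exit is eaten in the fastpath even for L2, while a timer that was emulating a guest timer while L2 is active still drops to the slow path inside handle_fastpath_preemption_timer().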
[ Upstream commit 0ec3d6d1f169baa7fc512ae4b78d17e7c94b7763 ]
Now that vmx->req_immediate_exit is used only in the scope of vmx_vcpu_run(), use force_immediate_exit to detect that KVM should usurp the VMX preemption timer to force a VM-Exit, and let vendor code fully handle forcing a VM-Exit.
Opportunistically drop __kvm_request_immediate_exit() and just have vendor code call smp_send_reschedule() directly. SVM already does this when injecting an event while also trying to single-step an IRET, i.e. it's not exactly secret knowledge that KVM uses a reschedule IPI to force an exit.
Link: https://lore.kernel.org/r/20240110012705.506918-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
[sean: resolve absurd conflict due to funky kvm_x86_ops.sched_in prototype]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 -
 arch/x86/include/asm/kvm_host.h    |  3 ---
 arch/x86/kvm/svm/svm.c             |  7 ++++---
 arch/x86/kvm/vmx/vmx.c             | 32 +++++++++++++-----------------
 arch/x86/kvm/vmx/vmx.h             |  2 --
 arch/x86/kvm/x86.c                 | 10 +---------
 6 files changed, 19 insertions(+), 36 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 29bef25ac77c..0e5ae3b0c867 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -100,7 +100,6 @@ KVM_X86_OP(write_tsc_multiplier)
 KVM_X86_OP(get_exit_info)
 KVM_X86_OP(check_intercept)
 KVM_X86_OP(handle_exit_irqoff)
-KVM_X86_OP(request_immediate_exit)
 KVM_X86_OP(sched_in)
 KVM_X86_OP_OPTIONAL(update_cpu_dirty_logging)
 KVM_X86_OP_OPTIONAL(vcpu_blocking)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 93f523762854..86f3bd6601e7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1590,8 +1590,6 @@ struct kvm_x86_ops {
 			       struct x86_exception *exception);
 	void (*handle_exit_irqoff)(struct kvm_vcpu *vcpu);
 
-	void (*request_immediate_exit)(struct kvm_vcpu *vcpu);
-
 	void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
 
 	/*
@@ -2059,7 +2057,6 @@ extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 
 int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu);
 int kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err);
-void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu);
 
 void __user *__x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 				     u32 size);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 337a304d211b..12de50db401f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4033,8 +4033,11 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu,
 		 * is enough to force an immediate vmexit.
 		 */
 		disable_nmi_singlestep(svm);
+		force_immediate_exit = true;
+	}
+
+	if (force_immediate_exit)
 		smp_send_reschedule(vcpu->cpu);
-	}
 
 	pre_svm_run(vcpu);
 
@@ -4874,8 +4877,6 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.check_intercept = svm_check_intercept,
 	.handle_exit_irqoff = svm_handle_exit_irqoff,
 
-	.request_immediate_exit = __kvm_request_immediate_exit,
-
 	.sched_in = svm_sched_in,
 
 	.nested_ops = &svm_nested_ops,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4db9d41d988c..179747d04edc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -49,6 +49,8 @@
 #include <asm/virtext.h>
 #include <asm/vmx.h>
 
+#include <trace/events/ipi.h>
+
 #include "capabilities.h"
 #include "cpuid.h"
 #include "evmcs.h"
@@ -1223,8 +1225,6 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 	u16 fs_sel, gs_sel;
 	int i;
 
-	vmx->req_immediate_exit = false;
-
 	/*
 	 * Note that guest MSRs to be saved/restored can also be changed
 	 * when guest state is loaded. This happens when guest transitions
@@ -5929,7 +5929,8 @@ static int handle_pml_full(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
-static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu)
+static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu,
+						   bool force_immediate_exit)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -5945,7 +5946,7 @@ static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu)
 	 * If the timer expired because KVM used it to force an immediate exit,
 	 * then mission accomplished.
 	 */
-	if (vmx->req_immediate_exit)
+	if (force_immediate_exit)
 		return EXIT_FASTPATH_EXIT_HANDLED;
 
 	/*
@@ -7090,13 +7091,13 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
 					msrs[i].host, false);
 }
 
-static void vmx_update_hv_timer(struct kvm_vcpu *vcpu)
+static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u64 tscl;
 	u32 delta_tsc;
 
-	if (vmx->req_immediate_exit) {
+	if (force_immediate_exit) {
 		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);
 		vmx->loaded_vmcs->hv_timer_soft_disabled = false;
 	} else if (vmx->hv_deadline_tsc != -1) {
@@ -7149,7 +7150,8 @@ void noinstr vmx_spec_ctrl_restore_host(struct vcpu_vmx *vmx,
 	barrier_nospec();
 }
 
-static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
+static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu,
+					     bool force_immediate_exit)
 {
 	/*
 	 * If L2 is active, some VMX preemption timer exits can be handled in
@@ -7163,7 +7165,7 @@ static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
 	case EXIT_REASON_MSR_WRITE:
 		return handle_fastpath_set_msr_irqoff(vcpu);
 	case EXIT_REASON_PREEMPTION_TIMER:
-		return handle_fastpath_preemption_timer(vcpu);
+		return handle_fastpath_preemption_timer(vcpu, force_immediate_exit);
 	default:
 		return EXIT_FASTPATH_NONE;
 	}
@@ -7284,7 +7286,9 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 	vmx_passthrough_lbr_msrs(vcpu);
 
 	if (enable_preemption_timer)
-		vmx_update_hv_timer(vcpu);
+		vmx_update_hv_timer(vcpu, force_immediate_exit);
+	else if (force_immediate_exit)
+		smp_send_reschedule(vcpu->cpu);
 
 	kvm_wait_lapic_expire(vcpu);
 
@@ -7358,7 +7362,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
 	vmx_recover_nmi_blocking(vmx);
 	vmx_complete_interrupts(vmx);
 
-	return vmx_exit_handlers_fastpath(vcpu);
+	return vmx_exit_handlers_fastpath(vcpu, force_immediate_exit);
 }
 
 static void vmx_vcpu_free(struct kvm_vcpu *vcpu)
@@ -7865,11 +7869,6 @@ static __init void vmx_set_cpu_caps(void)
 	kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
 }
 
-static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
-{
-	to_vmx(vcpu)->req_immediate_exit = true;
-}
-
 static int vmx_check_intercept_io(struct kvm_vcpu *vcpu,
 				  struct x86_instruction_info *info)
 {
@@ -8275,8 +8274,6 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.check_intercept = vmx_check_intercept,
 	.handle_exit_irqoff = vmx_handle_exit_irqoff,
 
-	.request_immediate_exit = vmx_request_immediate_exit,
-
 	.sched_in = vmx_sched_in,
 
 	.cpu_dirty_log_size = PML_ENTITY_NUM,
@@ -8533,7 +8530,6 @@ static __init int hardware_setup(void)
 	if (!enable_preemption_timer) {
 		vmx_x86_ops.set_hv_timer = NULL;
 		vmx_x86_ops.cancel_hv_timer = NULL;
-		vmx_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
 	}
 
 	kvm_caps.supported_mce_cap |= MCG_LMCE_P;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 357819872d80..ddbe73958d7f 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -343,8 +343,6 @@ struct vcpu_vmx {
 	unsigned int ple_window;
 	bool ple_window_dirty;
 
-	bool req_immediate_exit;
-
 	/* Support for PML */
 #define PML_ENTITY_NUM 512
 	struct page *pml_pg;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 652f90ad7107..bc586a6df6ab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10578,12 +10578,6 @@ static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 	static_call_cond(kvm_x86_set_apic_access_page_addr)(vcpu);
 }
 
-void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
-{
-	smp_send_reschedule(vcpu->cpu);
-}
-EXPORT_SYMBOL_GPL(__kvm_request_immediate_exit);
-
 /*
  * Called within kvm->srcu read side.
  * Returns 1 to let vcpu_run() continue the guest execution loop without
@@ -10817,10 +10811,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		goto cancel_injection;
 	}
 
-	if (req_immediate_exit) {
+	if (req_immediate_exit)
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
-		static_call(kvm_x86_request_immediate_exit)(vcpu);
-	}
 
 	fpregs_assert_state_consistent();
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
[ Upstream commit 2478b1b220c49d25cb1c3f061ec4f9b351d9a131 ]
Convert kvm_x86_ops.vcpu_run()'s "force_immediate_exit" boolean parameter into a generic bitmap so that similar "take action" information can be passed to vendor code without creating a pile of boolean parameters.
This will allow dropping kvm_x86_ops.set_dr6() in favor of a new flag, and will also allow for adding similar functionality for re-loading debugctl in the active VMCS.
Opportunistically massage the TDX WARN and comment to prepare for adding more run_flags, all of which are expected to be mutually exclusive with TDX, i.e. should be WARNed on.
No functional change intended.
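The conversion itself reduces to a plain bit-flag pattern. A minimal, standalone C sketch of the idea (userspace stand-ins, not the kernel's actual types or names):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Stand-in for the kvm_x86_run_flags enum added below. */
  enum {
          RUN_FORCE_IMMEDIATE_EXIT = 1u << 0,
  };

  /* Vendor code decomposes the bitmap back into booleans as needed. */
  static void vcpu_run(uint64_t run_flags)
  {
          bool force_immediate_exit = run_flags & RUN_FORCE_IMMEDIATE_EXIT;

          printf("force_immediate_exit = %d\n", force_immediate_exit);
  }

  int main(void)
  {
          uint64_t run_flags = 0;

          /* Common code accumulates "take action" bits... */
          run_flags |= RUN_FORCE_IMMEDIATE_EXIT;

          /* ...and passes the whole bitmap in a single parameter. */
          vcpu_run(run_flags);
          return 0;
  }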
Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250610232010.162191-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> [sean: drop TDX crud, account for lack of kvm_x86_call()] Signed-off-by: Sean Christopherson <seanjc@google.com> --- arch/x86/include/asm/kvm_host.h | 6 +++++- arch/x86/kvm/svm/svm.c | 4 ++-- arch/x86/kvm/vmx/vmx.c | 3 ++- arch/x86/kvm/x86.c | 10 ++++++++-- 4 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 86f3bd6601e7..1383f5e5238a 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1456,6 +1456,10 @@ static inline u16 kvm_lapic_irq_dest_mode(bool dest_mode_logical) return dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL; }
+enum kvm_x86_run_flags { + KVM_RUN_FORCE_IMMEDIATE_EXIT = BIT(0), +}; + struct kvm_x86_ops { const char *name;
@@ -1529,7 +1533,7 @@ struct kvm_x86_ops {
int (*vcpu_pre_run)(struct kvm_vcpu *vcpu); enum exit_fastpath_completion (*vcpu_run)(struct kvm_vcpu *vcpu, - bool force_immediate_exit); + u64 run_flags); int (*handle_exit)(struct kvm_vcpu *vcpu, enum exit_fastpath_completion exit_fastpath); int (*skip_emulated_instruction)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 12de50db401f..dc8a1b72d8ec 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4008,9 +4008,9 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu, bool spec_ctrl_in guest_state_exit_irqoff(); }
-static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, - bool force_immediate_exit) +static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) { + bool force_immediate_exit = run_flags & KVM_RUN_FORCE_IMMEDIATE_EXIT; struct vcpu_svm *svm = to_svm(vcpu); bool spec_ctrl_intercepted = msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 179747d04edc..382f42200688 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7204,8 +7204,9 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu, guest_state_exit_irqoff(); }
-static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit) +static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) { + bool force_immediate_exit = run_flags & KVM_RUN_FORCE_IMMEDIATE_EXIT; struct vcpu_vmx *vmx = to_vmx(vcpu); unsigned long cr3, cr4;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index bc586a6df6ab..b9cc3c05590e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10591,6 +10591,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) dm_request_for_irq_injection(vcpu) && kvm_cpu_accept_dm_intr(vcpu); fastpath_t exit_fastpath; + u64 run_flags;
bool req_immediate_exit = false;
@@ -10811,8 +10812,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) goto cancel_injection; }
- if (req_immediate_exit) + run_flags = 0; + if (req_immediate_exit) { + run_flags |= KVM_RUN_FORCE_IMMEDIATE_EXIT; kvm_make_request(KVM_REQ_EVENT, vcpu); + }
fpregs_assert_state_consistent(); if (test_thread_flag(TIF_NEED_FPU_LOAD)) @@ -10848,7 +10852,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) WARN_ON_ONCE((kvm_vcpu_apicv_activated(vcpu) != kvm_vcpu_apicv_active(vcpu)) && (kvm_get_apic_mode(vcpu) != LAPIC_MODE_DISABLED));
- exit_fastpath = static_call(kvm_x86_vcpu_run)(vcpu, req_immediate_exit); + exit_fastpath = static_call(kvm_x86_vcpu_run)(vcpu, run_flags); if (likely(exit_fastpath != EXIT_FASTPATH_REENTER_GUEST)) break;
@@ -10860,6 +10864,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) break; }
+ run_flags = 0; + /* Note, VM-Exits that go down the "slow" path are accounted below. */ ++vcpu->stat.exits; }
[ Upstream commit 80c64c7afea1da6a93ebe88d3d29d8a60377ef80 ]
Instruct vendor code to load the guest's DR6 into hardware via a new KVM_RUN flag, and remove kvm_x86_ops.set_dr6(), whose sole purpose was to load vcpu->arch.dr6 into hardware when DR6 can be read/written directly by the guest.
Note, TDX already WARNs on any run_flag being set, i.e. will yell if KVM thinks DR6 needs to be reloaded. TDX vCPUs force KVM_DEBUGREG_AUTO_SWITCH and never clear the flag, i.e. should never observe KVM_RUN_LOAD_GUEST_DR6.
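As a rough illustration of the resulting DR6 policy, here is a simplified, standalone sketch that deliberately elides the KVM_DEBUGREG_WONT_EXIT case (the flag name is a stand-in; DR6_ACTIVE_LOW is the architectural "no #DB bits set" value):

  #include <stdint.h>
  #include <stdio.h>

  #define RUN_LOAD_GUEST_DR6      (1u << 1)       /* stand-in for KVM_RUN_LOAD_GUEST_DR6 */
  #define DR6_ACTIVE_LOW          0xffff0ff0ULL

  /* An explicit request to load the guest's DR6 wins; otherwise run with
   * the all-zero (active-low) value so the exact cause of a #DB can be
   * read back from hardware. */
  static uint64_t pick_dr6(uint64_t run_flags, uint64_t guest_dr6)
  {
          if (run_flags & RUN_LOAD_GUEST_DR6)
                  return guest_dr6;

          return DR6_ACTIVE_LOW;
  }

  int main(void)
  {
          printf("%#llx\n", (unsigned long long)pick_dr6(0, 0xffff4ff0ULL));
          printf("%#llx\n", (unsigned long long)pick_dr6(RUN_LOAD_GUEST_DR6, 0xffff4ff0ULL));
          return 0;
  }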
Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250610232010.162191-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> [sean: account for lack of vmx/main.c] Signed-off-by: Sean Christopherson <seanjc@google.com> --- arch/x86/include/asm/kvm-x86-ops.h | 1 - arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/svm/svm.c | 10 ++++++---- arch/x86/kvm/vmx/vmx.c | 10 +++------- arch/x86/kvm/x86.c | 2 +- 5 files changed, 11 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h index 0e5ae3b0c867..c068565fe954 100644 --- a/arch/x86/include/asm/kvm-x86-ops.h +++ b/arch/x86/include/asm/kvm-x86-ops.h @@ -47,7 +47,6 @@ KVM_X86_OP(set_idt) KVM_X86_OP(get_gdt) KVM_X86_OP(set_gdt) KVM_X86_OP(sync_dirty_debug_regs) -KVM_X86_OP(set_dr6) KVM_X86_OP(set_dr7) KVM_X86_OP(cache_reg) KVM_X86_OP(get_rflags) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 1383f5e5238a..c8fc4f2acf69 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1458,6 +1458,7 @@ static inline u16 kvm_lapic_irq_dest_mode(bool dest_mode_logical)
enum kvm_x86_run_flags { KVM_RUN_FORCE_IMMEDIATE_EXIT = BIT(0), + KVM_RUN_LOAD_GUEST_DR6 = BIT(1), };
struct kvm_x86_ops { @@ -1504,7 +1505,6 @@ struct kvm_x86_ops { void (*get_gdt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt); void (*set_gdt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt); void (*sync_dirty_debug_regs)(struct kvm_vcpu *vcpu); - void (*set_dr6)(struct kvm_vcpu *vcpu, unsigned long value); void (*set_dr7)(struct kvm_vcpu *vcpu, unsigned long value); void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg); unsigned long (*get_rflags)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index dc8a1b72d8ec..5a6bd9d5cceb 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4052,10 +4052,13 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) svm_hv_update_vp_id(svm->vmcb, vcpu);
/* - * Run with all-zero DR6 unless needed, so that we can get the exact cause - * of a #DB. + * Run with all-zero DR6 unless the guest can write DR6 freely, so that + * KVM can get the exact cause of a #DB. Note, loading guest DR6 from + * KVM's snapshot is only necessary when DR accesses won't exit. */ - if (likely(!(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT))) + if (unlikely(run_flags & KVM_RUN_LOAD_GUEST_DR6)) + svm_set_dr6(vcpu, vcpu->arch.dr6); + else if (likely(!(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT))) svm_set_dr6(vcpu, DR6_ACTIVE_LOW);
clgi(); @@ -4822,7 +4825,6 @@ static struct kvm_x86_ops svm_x86_ops __initdata = { .set_idt = svm_set_idt, .get_gdt = svm_get_gdt, .set_gdt = svm_set_gdt, - .set_dr6 = svm_set_dr6, .set_dr7 = svm_set_dr7, .sync_dirty_debug_regs = svm_sync_dirty_debug_regs, .cache_reg = svm_cache_reg, diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 382f42200688..60d1ff3fca45 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5530,12 +5530,6 @@ static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu) set_debugreg(DR6_RESERVED, 6); }
-static void vmx_set_dr6(struct kvm_vcpu *vcpu, unsigned long val) -{ - lockdep_assert_irqs_disabled(); - set_debugreg(vcpu->arch.dr6, 6); -} - static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val) { vmcs_writel(GUEST_DR7, val); @@ -7251,6 +7245,9 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]); vcpu->arch.regs_dirty = 0;
+ if (run_flags & KVM_RUN_LOAD_GUEST_DR6) + set_debugreg(vcpu->arch.dr6, 6); + /* * Refresh vmcs.HOST_CR3 if necessary. This must be done immediately * prior to VM-Enter, as the kernel may load a new ASID (PCID) any time @@ -8208,7 +8205,6 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = { .set_idt = vmx_set_idt, .get_gdt = vmx_get_gdt, .set_gdt = vmx_set_gdt, - .set_dr6 = vmx_set_dr6, .set_dr7 = vmx_set_dr7, .sync_dirty_debug_regs = vmx_sync_dirty_debug_regs, .cache_reg = vmx_cache_reg, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b9cc3c05590e..27e7253972ea 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10833,7 +10833,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) set_debugreg(vcpu->arch.eff_db[3], 3); /* When KVM_DEBUGREG_WONT_EXIT, dr6 is accessible in guest. */ if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) - static_call(kvm_x86_set_dr6)(vcpu, vcpu->arch.dr6); + run_flags |= KVM_RUN_LOAD_GUEST_DR6; } else if (unlikely(hw_breakpoint_active())) { set_debugreg(0, 7); }
[ Upstream commit 17ec2f965344ee3fd6620bef7ef68792f4ac3af0 ]
Let the guest set DEBUGCTL.RTM_DEBUG if RTM is supported according to the guest CPUID model, as debug support is supposed to be available if RTM is supported, and there are no known downsides to letting the guest debug RTM aborts.
Note, there are no known bug reports related to RTM_DEBUG, the primary motivation is to reduce the probability of breaking existing guests when a future change adds a missing consistency check on vmcs12.GUEST_DEBUGCTL (KVM currently lets L2 run with whatever hardware supports; whoops).
Note #2, KVM already emulates DR6.RTM, and doesn't restrict access to DR7.RTM.
Fixes: 83c529151ab0 ("KVM: x86: expose Intel cpu new features (HLE, RTM) to guest") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250610232010.162191-5-seanjc@google.com Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Sean Christopherson <seanjc@google.com> --- arch/x86/include/asm/msr-index.h | 1 + arch/x86/kvm/vmx/vmx.c | 4 ++++ 2 files changed, 5 insertions(+)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 727947ed5e5e..afd65c815043 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -379,6 +379,7 @@ #define DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI (1UL << 12) #define DEBUGCTLMSR_FREEZE_IN_SMM_BIT 14 #define DEBUGCTLMSR_FREEZE_IN_SMM (1UL << DEBUGCTLMSR_FREEZE_IN_SMM_BIT) +#define DEBUGCTLMSR_RTM_DEBUG BIT(15)
#define MSR_PEBS_FRONTEND 0x000003f7
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 60d1ff3fca45..9445def2b3d2 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2064,6 +2064,10 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated (host_initiated || intel_pmu_lbr_is_enabled(vcpu))) debugctl |= DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI;
+ if (boot_cpu_has(X86_FEATURE_RTM) && + (host_initiated || guest_cpuid_has(vcpu, X86_FEATURE_RTM))) + debugctl |= DEBUGCTLMSR_RTM_DEBUG; + return debugctl; }
[ Upstream commit 8a4351ac302cd8c19729ba2636acfd0467c22ae8 ]
Move VMX's logic to check DEBUGCTL values into a standalone helper so that the code can be used by nested VM-Enter to apply the same logic to the value being loaded from vmcs12.
KVM needs to explicitly check vmcs12->guest_ia32_debugctl on nested VM-Enter, as hardware may support features that KVM does not, i.e. relying on hardware to detect invalid guest state will result in false negatives. Unfortunately, that means applying KVM's funky suppression of BTF and LBR to vmcs12 so as not to break existing guests.
No functional change intended.
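The quirk being preserved is that unsupported BTF/LBR bits are silently dropped rather than failing the write. A standalone sketch of the check's semantics (hardcoded bit positions and a stubbed supported mask, not the kernel's helpers):

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  #define LBR       (1ULL << 0)
  #define BTF       (1ULL << 1)
  #define RTM_DEBUG (1ULL << 15)

  /* Assume a CPU/guest combo that supports none of the optional bits. */
  static uint64_t supported_debugctl(void)
  {
          return 0;
  }

  static bool is_valid_debugctl(uint64_t data)
  {
          uint64_t invalid = data & ~supported_debugctl();

          /* BTF and LBR are exempt: they get masked off later, not rejected. */
          invalid &= ~(BTF | LBR);
          return !invalid;
  }

  int main(void)
  {
          assert(is_valid_debugctl(BTF | LBR));   /* accepted, then suppressed */
          assert(!is_valid_debugctl(RTM_DEBUG));  /* any other unsupported bit => reject */
          return 0;
  }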
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610232010.162191-6-seanjc@google.com Stable-dep-of: 7d0cce6cbe71 ("KVM: VMX: Wrap all accesses to IA32_DEBUGCTL with getter/setter APIs") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Sean Christopherson <seanjc@google.com> --- arch/x86/kvm/vmx/vmx.c | 29 +++++++++++++++++------------ 1 file changed, 17 insertions(+), 12 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 9445def2b3d2..6517b9d929bf 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2071,6 +2071,19 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated return debugctl; }
+static bool vmx_is_valid_debugctl(struct kvm_vcpu *vcpu, u64 data, + bool host_initiated) +{ + u64 invalid; + + invalid = data & ~vmx_get_supported_debugctl(vcpu, host_initiated); + if (invalid & (DEBUGCTLMSR_BTF | DEBUGCTLMSR_LBR)) { + kvm_pr_unimpl_wrmsr(vcpu, MSR_IA32_DEBUGCTLMSR, data); + invalid &= ~(DEBUGCTLMSR_BTF | DEBUGCTLMSR_LBR); + } + return !invalid; +} + /* * Writes msr value into the appropriate "register". * Returns 0 on success, non-0 otherwise. @@ -2139,19 +2152,12 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) } vmcs_writel(GUEST_SYSENTER_ESP, data); break; - case MSR_IA32_DEBUGCTLMSR: { - u64 invalid; - - invalid = data & ~vmx_get_supported_debugctl(vcpu, msr_info->host_initiated); - if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) { - kvm_pr_unimpl_wrmsr(vcpu, msr_index, data); - data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR); - invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR); - } - - if (invalid) + case MSR_IA32_DEBUGCTLMSR: + if (!vmx_is_valid_debugctl(vcpu, data, msr_info->host_initiated)) return 1;
+ data &= vmx_get_supported_debugctl(vcpu, msr_info->host_initiated); + if (is_guest_mode(vcpu) && get_vmcs12(vcpu)->vm_exit_controls & VM_EXIT_SAVE_DEBUG_CONTROLS) get_vmcs12(vcpu)->guest_ia32_debugctl = data; @@ -2161,7 +2167,6 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) (data & DEBUGCTLMSR_LBR)) intel_pmu_create_guest_lbr_event(vcpu); return 0; - } case MSR_IA32_BNDCFGS: if (!kvm_mpx_supported() || (!msr_info->host_initiated &&
From: Maxim Levitsky <mlevitsk@redhat.com>
[ Upstream commit 095686e6fcb4150f0a55b1a25987fad3d8af58d6 ]
Add a consistency check for L2's guest_ia32_debugctl, as KVM only supports a subset of hardware functionality, i.e. KVM can't rely on hardware to detect illegal/unsupported values. Failure to check the vmcs12 value would allow the guest to load any hardware-supported value while running L2.
Take care to exempt BTF and LBR from the validity check in order to match KVM's behavior for writes via WRMSR, but without clobbering vmcs12. Even if VM_EXIT_SAVE_DEBUG_CONTROLS is set in vmcs12, L1 can reasonably expect that vmcs12->guest_ia32_debugctl will not be modified if writes to the MSR are being intercepted.
Arguably, KVM _should_ update vmcs12 if VM_EXIT_SAVE_DEBUG_CONTROLS is set *and* writes to MSR_IA32_DEBUGCTLMSR are not being intercepted by L1, but that would incur non-trivial complexity and wouldn't change the fact that KVM's handling of DEBUGCTL is blatantly broken. I.e. the extra complexity is not worth carrying.
Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250610232010.162191-7-seanjc@google.com Stable-dep-of: 7d0cce6cbe71 ("KVM: VMX: Wrap all accesses to IA32_DEBUGCTL with getter/setter APIs") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Sean Christopherson <seanjc@google.com> --- arch/x86/kvm/vmx/nested.c | 12 ++++++++++-- arch/x86/kvm/vmx/vmx.c | 5 ++--- arch/x86/kvm/vmx/vmx.h | 3 +++ 3 files changed, 15 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index d55f7edc0860..da129e12cff9 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -2532,7 +2532,8 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, if (vmx->nested.nested_run_pending && (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) { kvm_set_dr(vcpu, 7, vmcs12->guest_dr7); - vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl); + vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl & + vmx_get_supported_debugctl(vcpu, false)); } else { kvm_set_dr(vcpu, 7, vcpu->arch.dr7); vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.pre_vmenter_debugctl); @@ -3022,7 +3023,8 @@ static int nested_vmx_check_guest_state(struct kvm_vcpu *vcpu, return -EINVAL;
if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) && - CC(!kvm_dr7_valid(vmcs12->guest_dr7))) + (CC(!kvm_dr7_valid(vmcs12->guest_dr7)) || + CC(!vmx_is_valid_debugctl(vcpu, vmcs12->guest_ia32_debugctl, false)))) return -EINVAL;
if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT) && @@ -4374,6 +4376,12 @@ static void sync_vmcs02_to_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) (vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) | (vm_entry_controls_get(to_vmx(vcpu)) & VM_ENTRY_IA32E_MODE);
+ /* + * Note! Save DR7, but intentionally don't grab DEBUGCTL from vmcs02. + * Writes to DEBUGCTL that aren't intercepted by L1 are immediately + * propagated to vmcs12 (see vmx_set_msr()), as the value loaded into + * vmcs02 doesn't strictly track vmcs12. + */ if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_DEBUG_CONTROLS) kvm_get_dr(vcpu, 7, (unsigned long *)&vmcs12->guest_dr7);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 6517b9d929bf..0b37e21d55b1 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2052,7 +2052,7 @@ static u64 nested_vmx_truncate_sysenter_addr(struct kvm_vcpu *vcpu, return (unsigned long)data; }
-static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated) +u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated) { u64 debugctl = 0;
@@ -2071,8 +2071,7 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated return debugctl; }
-static bool vmx_is_valid_debugctl(struct kvm_vcpu *vcpu, u64 data, - bool host_initiated) +bool vmx_is_valid_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated) { u64 invalid;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index ddbe73958d7f..99e3f46de2ec 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -442,6 +442,9 @@ static inline void vmx_set_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr,
void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu);
+u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated); +bool vmx_is_valid_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated); + /* * Note, early Intel manuals have the write-low and read-high bitmap offsets * the wrong way round. The bitmaps control MSRs 0x00000000-0x00001fff and
From: Maxim Levitsky <mlevitsk@redhat.com>
[ Upstream commit 7d0cce6cbe71af6e9c1831bff101a2b9c249c4a2 ]
Introduce vmx_guest_debugctl_{read,write}() to handle all accesses to vmcs.GUEST_IA32_DEBUGCTL. This will allow stuffing FREEZE_IN_SMM into GUEST_IA32_DEBUGCTL based on the host setting without bleeding the state into the guest, and without needing to copy+paste the FREEZE_IN_SMM logic into every patch that accesses GUEST_IA32_DEBUGCTL.
No functional change intended.
Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> [sean: massage changelog, make inline, use in all prepare_vmcs02() cases] Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610232010.162191-8-seanjc@google.com Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Sean Christopherson <seanjc@google.com> --- arch/x86/kvm/vmx/nested.c | 10 +++++----- arch/x86/kvm/vmx/pmu_intel.c | 8 ++++---- arch/x86/kvm/vmx/vmx.c | 8 +++++--- arch/x86/kvm/vmx/vmx.h | 10 ++++++++++ 4 files changed, 24 insertions(+), 12 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index da129e12cff9..a220770644e1 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -2532,11 +2532,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, if (vmx->nested.nested_run_pending && (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) { kvm_set_dr(vcpu, 7, vmcs12->guest_dr7); - vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl & - vmx_get_supported_debugctl(vcpu, false)); + vmx_guest_debugctl_write(vcpu, vmcs12->guest_ia32_debugctl & + vmx_get_supported_debugctl(vcpu, false)); } else { kvm_set_dr(vcpu, 7, vcpu->arch.dr7); - vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.pre_vmenter_debugctl); + vmx_guest_debugctl_write(vcpu, vmx->nested.pre_vmenter_debugctl); } if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending || !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))) @@ -3404,7 +3404,7 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
if (!vmx->nested.nested_run_pending || !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) - vmx->nested.pre_vmenter_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL); + vmx->nested.pre_vmenter_debugctl = vmx_guest_debugctl_read(); if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending || !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))) @@ -4572,7 +4572,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu, __vmx_set_segment(vcpu, &seg, VCPU_SREG_LDTR);
kvm_set_dr(vcpu, 7, 0x400); - vmcs_write64(GUEST_IA32_DEBUGCTL, 0); + vmx_guest_debugctl_write(vcpu, 0);
if (nested_vmx_load_msr(vcpu, vmcs12->vm_exit_msr_load_addr, vmcs12->vm_exit_msr_load_count)) diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c index 220cdbe1e286..76d3ed8abf6a 100644 --- a/arch/x86/kvm/vmx/pmu_intel.c +++ b/arch/x86/kvm/vmx/pmu_intel.c @@ -672,11 +672,11 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu) */ static void intel_pmu_legacy_freezing_lbrs_on_pmi(struct kvm_vcpu *vcpu) { - u64 data = vmcs_read64(GUEST_IA32_DEBUGCTL); + u64 data = vmx_guest_debugctl_read();
if (data & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI) { data &= ~DEBUGCTLMSR_LBR; - vmcs_write64(GUEST_IA32_DEBUGCTL, data); + vmx_guest_debugctl_write(vcpu, data); } }
@@ -746,7 +746,7 @@ void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu)
if (!lbr_desc->event) { vmx_disable_lbr_msrs_passthrough(vcpu); - if (vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR) + if (vmx_guest_debugctl_read() & DEBUGCTLMSR_LBR) goto warn; if (test_bit(INTEL_PMC_IDX_FIXED_VLBR, pmu->pmc_in_use)) goto warn; @@ -769,7 +769,7 @@ void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu)
static void intel_pmu_cleanup(struct kvm_vcpu *vcpu) { - if (!(vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR)) + if (!(vmx_guest_debugctl_read() & DEBUGCTLMSR_LBR)) intel_pmu_release_guest_lbr_event(vcpu); }
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 0b37e21d55b1..e470a294b22d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -2027,7 +2027,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) msr_info->data = vmx->pt_desc.guest.addr_a[index / 2]; break; case MSR_IA32_DEBUGCTLMSR: - msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL); + msr_info->data = vmx_guest_debugctl_read(); break; default: find_uret_msr: @@ -2161,7 +2161,8 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) VM_EXIT_SAVE_DEBUG_CONTROLS) get_vmcs12(vcpu)->guest_ia32_debugctl = data;
- vmcs_write64(GUEST_IA32_DEBUGCTL, data); + vmx_guest_debugctl_write(vcpu, data); + if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event && (data & DEBUGCTLMSR_LBR)) intel_pmu_create_guest_lbr_event(vcpu); @@ -4751,7 +4752,8 @@ static void init_vmcs(struct vcpu_vmx *vmx) vmcs_write32(GUEST_SYSENTER_CS, 0); vmcs_writel(GUEST_SYSENTER_ESP, 0); vmcs_writel(GUEST_SYSENTER_EIP, 0); - vmcs_write64(GUEST_IA32_DEBUGCTL, 0); + + vmx_guest_debugctl_write(&vmx->vcpu, 0);
if (cpu_has_vmx_tpr_shadow()) { vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, 0); diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index 99e3f46de2ec..b7ae263cde7b 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -445,6 +445,16 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu); u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated); bool vmx_is_valid_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated);
+static inline void vmx_guest_debugctl_write(struct kvm_vcpu *vcpu, u64 val) +{ + vmcs_write64(GUEST_IA32_DEBUGCTL, val); +} + +static inline u64 vmx_guest_debugctl_read(void) +{ + return vmcs_read64(GUEST_IA32_DEBUGCTL); +} + /* * Note, early Intel manuals have the write-low and read-high bitmap offsets * the wrong way round. The bitmaps control MSRs 0x00000000-0x00001fff and
From: Maxim Levitsky <mlevitsk@redhat.com>
[ Upstream commit 6b1dd26544d045f6a79e8c73572c0c0db3ef3c1a ]
Set/clear DEBUGCTLMSR_FREEZE_IN_SMM in GUEST_IA32_DEBUGCTL based on the host's pre-VM-Enter value, i.e. preserve the host's FREEZE_IN_SMM setting while running the guest. When running with the "default treatment of SMIs" in effect (the only mode KVM supports), an SMI does not generate a VM-Exit that is visible to host (non-SMM) software; instead, the CPU transitions directly from VMX non-root operation to SMM. And critically, DEBUGCTL isn't context switched by hardware on SMI or RSM, i.e. SMM will run with whatever value was resident in hardware at the time of the SMI.
Failure to preserve FREEZE_IN_SMM results in the PMU unexpectedly counting events while the CPU is executing in SMM, which can pollute profiling and potentially leak information into the guest.
Check for changes in FREEZE_IN_SMM prior to every entry into KVM's inner run loop, as the bit can be toggled in IRQ context via IPI callback (SMP function call), by way of /sys/devices/cpu/freeze_on_smi.
Add a field in kvm_x86_ops to communicate which DEBUGCTL bits need to be preserved, as FREEZE_IN_SMM is only supported and defined for Intel CPUs, i.e. explicitly checking FREEZE_IN_SMM in common x86 is at best weird, and at worst could lead to undesirable behavior in the future if AMD CPUs ever happened to pick up a collision with the bit.
Exempt TDX vCPUs, i.e. protected guests, from the check, as the TDX Module owns and controls GUEST_IA32_DEBUGCTL.
WARN in SVM if KVM_RUN_LOAD_DEBUGCTL is set, mostly to document that the lack of handling isn't a KVM bug (TDX already WARNs on any run_flag).
Lastly, explicitly reload GUEST_IA32_DEBUGCTL on a VM-Fail that is missed by KVM but detected by hardware, i.e. in nested_vmx_restore_host_state(). Doing so avoids the need to track host_debugctl on a per-VMCS basis, as GUEST_IA32_DEBUGCTL is unconditionally written by prepare_vmcs02() and load_vmcs12_host_state(). For the VM-Fail case, even though KVM won't have actually entered the guest, vcpu_enter_guest() will have run with vmcs02 active and thus could result in vmcs01 being run with a stale value.
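The resulting contract is easiest to see in isolation: the host-owned bit is ORed into the hardware value on every write and stripped on every read, so the guest never observes it. A standalone sketch (the globals stand in for vmcs.GUEST_IA32_DEBUGCTL and vcpu->arch.host_debugctl; not the kernel's actual accessors):

  #include <assert.h>
  #include <stdint.h>

  #define FREEZE_IN_SMM (1ULL << 14)

  static uint64_t hw_debugctl;    /* stands in for vmcs.GUEST_IA32_DEBUGCTL */
  static uint64_t host_debugctl;  /* stands in for vcpu->arch.host_debugctl */

  static void guest_debugctl_write(uint64_t val)
  {
          assert(!(val & FREEZE_IN_SMM));  /* callers pass guest-owned bits only */
          hw_debugctl = val | (host_debugctl & FREEZE_IN_SMM);
  }

  static uint64_t guest_debugctl_read(void)
  {
          return hw_debugctl & ~FREEZE_IN_SMM;  /* hide the host-owned bit */
  }

  int main(void)
  {
          host_debugctl = FREEZE_IN_SMM;
          guest_debugctl_write(0x2);  /* e.g. guest sets DEBUGCTL.BTF */

          assert(hw_debugctl == (0x2 | FREEZE_IN_SMM));  /* hardware sees both */
          assert(guest_debugctl_read() == 0x2);          /* guest sees only its own bit */
          return 0;
  }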
Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250610232010.162191-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> [sean: move vmx/main.c change to vmx/vmx.c] Signed-off-by: Sean Christopherson <seanjc@google.com> --- arch/x86/include/asm/kvm_host.h | 7 +++++++ arch/x86/kvm/vmx/nested.c | 3 +++ arch/x86/kvm/vmx/vmx.c | 5 +++++ arch/x86/kvm/vmx/vmx.h | 15 ++++++++++++++- arch/x86/kvm/x86.c | 14 ++++++++++++-- 5 files changed, 41 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c8fc4f2acf69..d0229323ca63 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1459,6 +1459,7 @@ static inline u16 kvm_lapic_irq_dest_mode(bool dest_mode_logical) enum kvm_x86_run_flags { KVM_RUN_FORCE_IMMEDIATE_EXIT = BIT(0), KVM_RUN_LOAD_GUEST_DR6 = BIT(1), + KVM_RUN_LOAD_DEBUGCTL = BIT(2), };
struct kvm_x86_ops { @@ -1484,6 +1485,12 @@ struct kvm_x86_ops { void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu); void (*vcpu_put)(struct kvm_vcpu *vcpu);
+ /* + * Mask of DEBUGCTL bits that are owned by the host, i.e. that need to + * match the host's value even while the guest is active. + */ + const u64 HOST_OWNED_DEBUGCTL; + void (*update_exception_bitmap)(struct kvm_vcpu *vcpu); int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr); int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr); diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index a220770644e1..2c3cf4351c4c 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -4627,6 +4627,9 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu) WARN_ON(kvm_set_dr(vcpu, 7, vmcs_readl(GUEST_DR7))); }
+ /* Reload DEBUGCTL to ensure vmcs01 has a fresh FREEZE_IN_SMM value. */ + vmx_reload_guest_debugctl(vcpu); + /* * Note that calling vmx_set_{efer,cr0,cr4} is important as they * handle a variety of side effects to KVM's software model. diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index e470a294b22d..3fef4e14abc6 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7258,6 +7258,9 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags) if (run_flags & KVM_RUN_LOAD_GUEST_DR6) set_debugreg(vcpu->arch.dr6, 6);
+ if (run_flags & KVM_RUN_LOAD_DEBUGCTL) + vmx_reload_guest_debugctl(vcpu); + /* * Refresh vmcs.HOST_CR3 if necessary. This must be done immediately * prior to VM-Enter, as the kernel may load a new ASID (PCID) any time @@ -8197,6 +8200,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = { .vcpu_load = vmx_vcpu_load, .vcpu_put = vmx_vcpu_put,
+ .HOST_OWNED_DEBUGCTL = DEBUGCTLMSR_FREEZE_IN_SMM, + .update_exception_bitmap = vmx_update_exception_bitmap, .get_msr_feature = vmx_get_msr_feature, .get_msr = vmx_get_msr, diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index b7ae263cde7b..dc6f06326648 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -447,12 +447,25 @@ bool vmx_is_valid_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
static inline void vmx_guest_debugctl_write(struct kvm_vcpu *vcpu, u64 val) { + WARN_ON_ONCE(val & DEBUGCTLMSR_FREEZE_IN_SMM); + + val |= vcpu->arch.host_debugctl & DEBUGCTLMSR_FREEZE_IN_SMM; vmcs_write64(GUEST_IA32_DEBUGCTL, val); }
static inline u64 vmx_guest_debugctl_read(void) { - return vmcs_read64(GUEST_IA32_DEBUGCTL); + return vmcs_read64(GUEST_IA32_DEBUGCTL) & ~DEBUGCTLMSR_FREEZE_IN_SMM; +} + +static inline void vmx_reload_guest_debugctl(struct kvm_vcpu *vcpu) +{ + u64 val = vmcs_read64(GUEST_IA32_DEBUGCTL); + + if (!((val ^ vcpu->arch.host_debugctl) & DEBUGCTLMSR_FREEZE_IN_SMM)) + return; + + vmx_guest_debugctl_write(vcpu, val & ~DEBUGCTLMSR_FREEZE_IN_SMM); }
/* diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 27e7253972ea..f542ab5e8698 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10591,7 +10591,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) dm_request_for_irq_injection(vcpu) && kvm_cpu_accept_dm_intr(vcpu); fastpath_t exit_fastpath; - u64 run_flags; + u64 run_flags, debug_ctl;
bool req_immediate_exit = false;
@@ -10838,7 +10838,17 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) set_debugreg(0, 7); }
- vcpu->arch.host_debugctl = get_debugctlmsr(); + /* + * Refresh the host DEBUGCTL snapshot after disabling IRQs, as DEBUGCTL + * can be modified in IRQ context, e.g. via SMP function calls. Inform + * vendor code if any host-owned bits were changed, e.g. so that the + * value loaded into hardware while running the guest can be updated. + */ + debug_ctl = get_debugctlmsr(); + if ((debug_ctl ^ vcpu->arch.host_debugctl) & kvm_x86_ops.HOST_OWNED_DEBUGCTL && + !vcpu->arch.guest_state_protected) + run_flags |= KVM_RUN_LOAD_DEBUGCTL; + vcpu->arch.host_debugctl = debug_ctl;
guest_timing_enter_irqoff();