From: Paolo Bonzini pbonzini@redhat.com
[ Upstream commit 6cd88243c7e03845a450795e134b488fc2afb736 ]
If a vCPU is outside guest mode and is scheduled out, it might be in the process of making a memory access. A problem occurs if another vCPU uses the PV TLB flush feature during the period when the vCPU is scheduled out, and a virtual address has already been translated but has not yet been accessed, because this is equivalent to using a stale TLB entry.
To avoid this, only report a vCPU as preempted if sure that the guest is at an instruction boundary. A rescheduling request will be delivered to the host physical CPU as an external interrupt, so for simplicity consider any vmexit *not* instruction boundary except for external interrupts.
It would in principle be okay to report the vCPU as preempted also if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the vmentry/vmexit overhead unnecessarily, and optimistic spinning is also unlikely to succeed. However, leave it for later because right now kvm_vcpu_check_block() is doing memory accesses. Even though the TLB flush issue only applies to virtual memory address, it's very much preferrable to be conservative.
Reported-by: Jann Horn jannh@google.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/x86/include/asm/kvm_host.h | 3 +++ arch/x86/kvm/svm/svm.c | 2 ++ arch/x86/kvm/vmx/vmx.c | 1 + arch/x86/kvm/x86.c | 22 ++++++++++++++++++++++ 4 files changed, 28 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 4ff36610af6a..9fdaa847d4b6 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -651,6 +651,7 @@ struct kvm_vcpu_arch { u64 ia32_misc_enable_msr; u64 smbase; u64 smi_count; + bool at_instruction_boundary; bool tpr_access_reporting; bool xsaves_enabled; bool xfd_no_write_intercept; @@ -1289,6 +1290,8 @@ struct kvm_vcpu_stat { u64 nested_run; u64 directed_yield_attempted; u64 directed_yield_successful; + u64 preemption_reported; + u64 preemption_other; u64 guest_mode; };
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 7e45d03cd018..5842abf1eac4 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4165,6 +4165,8 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,
static void svm_handle_exit_irqoff(struct kvm_vcpu *vcpu) { + if (to_svm(vcpu)->vmcb->control.exit_code == SVM_EXIT_INTR) + vcpu->arch.at_instruction_boundary = true; }
static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 982df9c000d3..c44f8e1d30c8 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6549,6 +6549,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu) return;
handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc)); + vcpu->arch.at_instruction_boundary = true; }
static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 39c571224ac2..36453517e847 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -291,6 +291,8 @@ const struct _kvm_stats_desc kvm_vcpu_stats_desc[] = { STATS_DESC_COUNTER(VCPU, nested_run), STATS_DESC_COUNTER(VCPU, directed_yield_attempted), STATS_DESC_COUNTER(VCPU, directed_yield_successful), + STATS_DESC_COUNTER(VCPU, preemption_reported), + STATS_DESC_COUNTER(VCPU, preemption_other), STATS_DESC_ICOUNTER(VCPU, guest_mode) };
@@ -4604,6 +4606,19 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu) struct kvm_memslots *slots; static const u8 preempted = KVM_VCPU_PREEMPTED;
+ /* + * The vCPU can be marked preempted if and only if the VM-Exit was on + * an instruction boundary and will not trigger guest emulation of any + * kind (see vcpu_run). Vendor specific code controls (conservatively) + * when this is true, for example allowing the vCPU to be marked + * preempted if and only if the VM-Exit was due to a host interrupt. + */ + if (!vcpu->arch.at_instruction_boundary) { + vcpu->stat.preemption_other++; + return; + } + + vcpu->stat.preemption_reported++; if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) return;
@@ -10358,6 +10373,13 @@ static int vcpu_run(struct kvm_vcpu *vcpu) vcpu->arch.l1tf_flush_l1d = true;
for (;;) { + /* + * If another guest vCPU requests a PV TLB flush in the middle + * of instruction emulation, the rest of the emulation could + * use a stale page translation. Assume that any code after + * this point can start executing an instruction. + */ + vcpu->arch.at_instruction_boundary = false; if (kvm_vcpu_running(vcpu)) { r = vcpu_enter_guest(vcpu); } else {
From: Paolo Bonzini pbonzini@redhat.com
[ Upstream commit 54aa83c90198e68eee8b0850c749bc70efb548da ]
Similar to the Xen path, only change the vCPU's reported state if the vCPU was actually preempted. The reason for KVM's behavior is that for example optimistic spinning might not be a good idea if the guest is doing repeated exits to userspace; however, it is confusing and unlikely to make a difference, because well-tuned guests will hardly ever exit KVM_RUN in the first place.
Suggested-by: Sean Christopherson seanjc@google.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/x86/kvm/x86.c | 26 ++++++++++++++------------ arch/x86/kvm/xen.h | 6 ++++-- 2 files changed, 18 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 36453517e847..165e2912995f 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4648,19 +4648,21 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) { int idx;
- if (vcpu->preempted && !vcpu->arch.guest_state_protected) - vcpu->arch.preempted_in_kernel = !static_call(kvm_x86_get_cpl)(vcpu); + if (vcpu->preempted) { + if (!vcpu->arch.guest_state_protected) + vcpu->arch.preempted_in_kernel = !static_call(kvm_x86_get_cpl)(vcpu);
- /* - * Take the srcu lock as memslots will be accessed to check the gfn - * cache generation against the memslots generation. - */ - idx = srcu_read_lock(&vcpu->kvm->srcu); - if (kvm_xen_msr_enabled(vcpu->kvm)) - kvm_xen_runstate_set_preempted(vcpu); - else - kvm_steal_time_set_preempted(vcpu); - srcu_read_unlock(&vcpu->kvm->srcu, idx); + /* + * Take the srcu lock as memslots will be accessed to check the gfn + * cache generation against the memslots generation. + */ + idx = srcu_read_lock(&vcpu->kvm->srcu); + if (kvm_xen_msr_enabled(vcpu->kvm)) + kvm_xen_runstate_set_preempted(vcpu); + else + kvm_steal_time_set_preempted(vcpu); + srcu_read_unlock(&vcpu->kvm->srcu, idx); + }
static_call(kvm_x86_vcpu_put)(vcpu); vcpu->arch.last_host_tsc = rdtsc(); diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h index adbcc9ed59db..fda1413f8af9 100644 --- a/arch/x86/kvm/xen.h +++ b/arch/x86/kvm/xen.h @@ -103,8 +103,10 @@ static inline void kvm_xen_runstate_set_preempted(struct kvm_vcpu *vcpu) * behalf of the vCPU. Only if the VMM does actually block * does it need to enter RUNSTATE_blocked. */ - if (vcpu->preempted) - kvm_xen_update_runstate_guest(vcpu, RUNSTATE_runnable); + if (WARN_ON_ONCE(!vcpu->preempted)) + return; + + kvm_xen_update_runstate_guest(vcpu, RUNSTATE_runnable); }
/* 32-bit compatibility definitions, also used natively in 32-bit build */
From: Vitaly Kuznetsov vkuznets@redhat.com
[ Upstream commit eae260be3a0111a28fe95923e117a55dddec0384 ]
hyperv_clock doesn't always give a stable test result, especially with AMD CPUs. The test compares Hyper-V MSR clocksource (acquired either with rdmsr() from within the guest or KVM_GET_MSRS from the host) against rdtsc(). To increase the accuracy, increase the measured delay (done with nop loop) by two orders of magnitude and take the mean rdtsc() value before and after rdmsr()/KVM_GET_MSRS.
Reported-by: Maxim Levitsky mlevitsk@redhat.com Signed-off-by: Vitaly Kuznetsov vkuznets@redhat.com Reviewed-by: Maxim Levitsky mlevitsk@redhat.com Tested-by: Maxim Levitsky mlevitsk@redhat.com Message-Id: 20220601144322.1968742-1-vkuznets@redhat.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- tools/testing/selftests/kvm/x86_64/hyperv_clock.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86_64/hyperv_clock.c b/tools/testing/selftests/kvm/x86_64/hyperv_clock.c index e0b2bb1339b1..3330fb183c68 100644 --- a/tools/testing/selftests/kvm/x86_64/hyperv_clock.c +++ b/tools/testing/selftests/kvm/x86_64/hyperv_clock.c @@ -44,7 +44,7 @@ static inline void nop_loop(void) { int i;
- for (i = 0; i < 1000000; i++) + for (i = 0; i < 100000000; i++) asm volatile("nop"); }
@@ -56,12 +56,14 @@ static inline void check_tsc_msr_rdtsc(void) tsc_freq = rdmsr(HV_X64_MSR_TSC_FREQUENCY); GUEST_ASSERT(tsc_freq > 0);
- /* First, check MSR-based clocksource */ + /* For increased accuracy, take mean rdtsc() before and afrer rdmsr() */ r1 = rdtsc(); t1 = rdmsr(HV_X64_MSR_TIME_REF_COUNT); + r1 = (r1 + rdtsc()) / 2; nop_loop(); r2 = rdtsc(); t2 = rdmsr(HV_X64_MSR_TIME_REF_COUNT); + r2 = (r2 + rdtsc()) / 2;
GUEST_ASSERT(r2 > r1 && t2 > t1);
@@ -181,12 +183,14 @@ static void host_check_tsc_msr_rdtsc(struct kvm_vm *vm) tsc_freq = vcpu_get_msr(vm, VCPU_ID, HV_X64_MSR_TSC_FREQUENCY); TEST_ASSERT(tsc_freq > 0, "TSC frequency must be nonzero");
- /* First, check MSR-based clocksource */ + /* For increased accuracy, take mean rdtsc() before and afrer ioctl */ r1 = rdtsc(); t1 = vcpu_get_msr(vm, VCPU_ID, HV_X64_MSR_TIME_REF_COUNT); + r1 = (r1 + rdtsc()) / 2; nop_loop(); r2 = rdtsc(); t2 = vcpu_get_msr(vm, VCPU_ID, HV_X64_MSR_TIME_REF_COUNT); + r2 = (r2 + rdtsc()) / 2;
TEST_ASSERT(t2 > t1, "Time reference MSR is not monotonic (%ld <= %ld)", t1, t2);
From: Ben Gardon bgardon@google.com
[ Upstream commit 5ba7c4c6d1c7af47a916f728bb5940669684a087 ]
Currently disabling dirty logging with the TDP MMU is extremely slow. On a 96 vCPU / 96G VM backed with gigabyte pages, it takes ~200 seconds to disable dirty logging with the TDP MMU, as opposed to ~4 seconds with the shadow MMU.
When disabling dirty logging, zap non-leaf parent entries to allow replacement with huge pages instead of recursing and zapping all of the child, leaf entries. This reduces the number of TLB flushes required. and reduces the disable dirty log time with the TDP MMU to ~3 seconds.
Opportunistically add a WARN() to catch GFNs that are mapped at a higher level than their max level.
Signed-off-by: Ben Gardon bgardon@google.com Message-Id: 20220525230904.1584480-1-bgardon@google.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/x86/kvm/mmu/tdp_iter.c | 9 +++++++++ arch/x86/kvm/mmu/tdp_iter.h | 1 + arch/x86/kvm/mmu/tdp_mmu.c | 38 +++++++++++++++++++++++++++++++------ 3 files changed, 42 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c index 6d3b3e5a5533..ee4802d7b36c 100644 --- a/arch/x86/kvm/mmu/tdp_iter.c +++ b/arch/x86/kvm/mmu/tdp_iter.c @@ -145,6 +145,15 @@ static bool try_step_up(struct tdp_iter *iter) return true; }
+/* + * Step the iterator back up a level in the paging structure. Should only be + * used when the iterator is below the root level. + */ +void tdp_iter_step_up(struct tdp_iter *iter) +{ + WARN_ON(!try_step_up(iter)); +} + /* * Step to the next SPTE in a pre-order traversal of the paging structure. * To get to the next SPTE, the iterator either steps down towards the goal diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h index f0af385c56e0..adfca0cf94d3 100644 --- a/arch/x86/kvm/mmu/tdp_iter.h +++ b/arch/x86/kvm/mmu/tdp_iter.h @@ -114,5 +114,6 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root, int min_level, gfn_t next_last_level_gfn); void tdp_iter_next(struct tdp_iter *iter); void tdp_iter_restart(struct tdp_iter *iter); +void tdp_iter_step_up(struct tdp_iter *iter);
#endif /* __KVM_X86_MMU_TDP_ITER_H */ diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 922b06bf4b94..b61a11d462cc 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1748,12 +1748,12 @@ static void zap_collapsible_spte_range(struct kvm *kvm, gfn_t start = slot->base_gfn; gfn_t end = start + slot->npages; struct tdp_iter iter; + int max_mapping_level; kvm_pfn_t pfn;
rcu_read_lock();
tdp_root_for_each_pte(iter, root, start, end) { -retry: if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true)) continue;
@@ -1761,15 +1761,41 @@ static void zap_collapsible_spte_range(struct kvm *kvm, !is_last_spte(iter.old_spte, iter.level)) continue;
+ /* + * This is a leaf SPTE. Check if the PFN it maps can + * be mapped at a higher level. + */ pfn = spte_to_pfn(iter.old_spte); - if (kvm_is_reserved_pfn(pfn) || - iter.level >= kvm_mmu_max_mapping_level(kvm, slot, iter.gfn, - pfn, PG_LEVEL_NUM)) + + if (kvm_is_reserved_pfn(pfn)) continue;
+ max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, + iter.gfn, pfn, PG_LEVEL_NUM); + + WARN_ON(max_mapping_level < iter.level); + + /* + * If this page is already mapped at the highest + * viable level, there's nothing more to do. + */ + if (max_mapping_level == iter.level) + continue; + + /* + * The page can be remapped at a higher level, so step + * up to zap the parent SPTE. + */ + while (max_mapping_level > iter.level) + tdp_iter_step_up(&iter); + /* Note, a successful atomic zap also does a remote TLB flush. */ - if (tdp_mmu_zap_spte_atomic(kvm, &iter)) - goto retry; + tdp_mmu_zap_spte_atomic(kvm, &iter); + + /* + * If the atomic zap fails, the iter will recurse back into + * the same subtree to retry. + */ }
rcu_read_unlock();
From: Seth Forshee sforshee@digitalocean.com
[ Upstream commit 3e684903a8574ffc9475fdf13c4780a7adb506ad ]
A livepatch transition may stall indefinitely when a kvm vCPU is heavily loaded. To the host, the vCPU task is a user thread which is spending a very long time in the ioctl(KVM_RUN) syscall. During livepatch transition, set_notify_signal() will be called on such tasks to interrupt the syscall so that the task can be transitioned. This interrupts guest execution, but when xfer_to_guest_mode_work() sees that TIF_NOTIFY_SIGNAL is set but not TIF_SIGPENDING it concludes that an exit to user mode is unnecessary, and guest execution is resumed without transitioning the task for the livepatch.
This handling of TIF_NOTIFY_SIGNAL is incorrect, as set_notify_signal() is expected to break tasks out of interruptible kernel loops and cause them to return to userspace. Change xfer_to_guest_mode_work() to handle TIF_NOTIFY_SIGNAL the same as TIF_SIGPENDING, signaling to the vCPU run loop that an exit to userpsace is needed. Any pending task_work will be run when get_signal() is called from exit_to_user_mode_loop(), so there is no longer any need to run task work from xfer_to_guest_mode_work().
Suggested-by: "Eric W. Biederman" ebiederm@xmission.com Cc: Petr Mladek pmladek@suse.com Signed-off-by: Seth Forshee sforshee@digitalocean.com Message-Id: 20220504180840.2907296-1-sforshee@digitalocean.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- kernel/entry/kvm.c | 6 ------ 1 file changed, 6 deletions(-)
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c index 9d09f489b60e..2e0f75bcb7fd 100644 --- a/kernel/entry/kvm.c +++ b/kernel/entry/kvm.c @@ -9,12 +9,6 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work) int ret;
if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) { - clear_notify_signal(); - if (task_work_pending(current)) - task_work_run(); - } - - if (ti_work & _TIF_SIGPENDING) { kvm_handle_signal_exit(vcpu); return -EINTR; }
Starting with v5.18 TIF_NOTIFY_SIGNAL and task_work_run are separated, so backporting this to v5.18 looks safe.
For anyone considering backporting this any farther please be aware that TIF_NOTIFY_SIGNAL and task_work were all mixed together and to backport this also likely requires backporting my cleanups that separated them.
Eric
Sasha Levin sashal@kernel.org writes:
From: Seth Forshee sforshee@digitalocean.com
[ Upstream commit 3e684903a8574ffc9475fdf13c4780a7adb506ad ]
A livepatch transition may stall indefinitely when a kvm vCPU is heavily loaded. To the host, the vCPU task is a user thread which is spending a very long time in the ioctl(KVM_RUN) syscall. During livepatch transition, set_notify_signal() will be called on such tasks to interrupt the syscall so that the task can be transitioned. This interrupts guest execution, but when xfer_to_guest_mode_work() sees that TIF_NOTIFY_SIGNAL is set but not TIF_SIGPENDING it concludes that an exit to user mode is unnecessary, and guest execution is resumed without transitioning the task for the livepatch.
This handling of TIF_NOTIFY_SIGNAL is incorrect, as set_notify_signal() is expected to break tasks out of interruptible kernel loops and cause them to return to userspace. Change xfer_to_guest_mode_work() to handle TIF_NOTIFY_SIGNAL the same as TIF_SIGPENDING, signaling to the vCPU run loop that an exit to userpsace is needed. Any pending task_work will be run when get_signal() is called from exit_to_user_mode_loop(), so there is no longer any need to run task work from xfer_to_guest_mode_work().
Suggested-by: "Eric W. Biederman" ebiederm@xmission.com Cc: Petr Mladek pmladek@suse.com Signed-off-by: Seth Forshee sforshee@digitalocean.com Message-Id: 20220504180840.2907296-1-sforshee@digitalocean.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org
kernel/entry/kvm.c | 6 ------ 1 file changed, 6 deletions(-)
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c index 9d09f489b60e..2e0f75bcb7fd 100644 --- a/kernel/entry/kvm.c +++ b/kernel/entry/kvm.c @@ -9,12 +9,6 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work) int ret; if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) {
clear_notify_signal();
if (task_work_pending(current))
task_work_run();
}
}if (ti_work & _TIF_SIGPENDING) { kvm_handle_signal_exit(vcpu); return -EINTR;
From: Alexey Kardashevskiy aik@ozlabs.ru
[ Upstream commit e8bc2427018826e02add7b0ed0fc625a60390ae5 ]
A KVM device cleanup happens in either of two callbacks: 1) destroy() which is called when the VM is being destroyed; 2) release() which is called when a device fd is closed.
Most KVM devices use 1) but Book3s's interrupt controller KVM devices (XICS, XIVE, XIVE-native) use 2) as they need to close and reopen during the machine execution. The error handling in kvm_ioctl_create_device() assumes destroy() is always defined which leads to NULL dereference as discovered by Syzkaller.
This adds a checks for destroy!=NULL and adds a missing release().
This is not changing kvm_destroy_devices() as devices with defined release() should have been removed from the KVM devices list by then.
Suggested-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Alexey Kardashevskiy aik@ozlabs.ru Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- virt/kvm/kvm_main.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 5ab12214e18d..24cb37d19c63 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -4299,8 +4299,11 @@ static int kvm_ioctl_create_device(struct kvm *kvm, kvm_put_kvm_no_destroy(kvm); mutex_lock(&kvm->lock); list_del(&dev->vm_node); + if (ops->release) + ops->release(dev); mutex_unlock(&kvm->lock); - ops->destroy(dev); + if (ops->destroy) + ops->destroy(dev); return ret; }
Paolo, ping?
On Mon, Jun 13, 2022 at 10:11:10PM -0400, Sasha Levin wrote:
From: Paolo Bonzini pbonzini@redhat.com
[ Upstream commit 6cd88243c7e03845a450795e134b488fc2afb736 ]
If a vCPU is outside guest mode and is scheduled out, it might be in the process of making a memory access. A problem occurs if another vCPU uses the PV TLB flush feature during the period when the vCPU is scheduled out, and a virtual address has already been translated but has not yet been accessed, because this is equivalent to using a stale TLB entry.
To avoid this, only report a vCPU as preempted if sure that the guest is at an instruction boundary. A rescheduling request will be delivered to the host physical CPU as an external interrupt, so for simplicity consider any vmexit *not* instruction boundary except for external interrupts.
It would in principle be okay to report the vCPU as preempted also if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the vmentry/vmexit overhead unnecessarily, and optimistic spinning is also unlikely to succeed. However, leave it for later because right now kvm_vcpu_check_block() is doing memory accesses. Even though the TLB flush issue only applies to virtual memory address, it's very much preferrable to be conservative.
Reported-by: Jann Horn jannh@google.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org
arch/x86/include/asm/kvm_host.h | 3 +++ arch/x86/kvm/svm/svm.c | 2 ++ arch/x86/kvm/vmx/vmx.c | 1 + arch/x86/kvm/x86.c | 22 ++++++++++++++++++++++ 4 files changed, 28 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 4ff36610af6a..9fdaa847d4b6 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -651,6 +651,7 @@ struct kvm_vcpu_arch { u64 ia32_misc_enable_msr; u64 smbase; u64 smi_count;
- bool at_instruction_boundary; bool tpr_access_reporting; bool xsaves_enabled; bool xfd_no_write_intercept;
@@ -1289,6 +1290,8 @@ struct kvm_vcpu_stat { u64 nested_run; u64 directed_yield_attempted; u64 directed_yield_successful;
- u64 preemption_reported;
- u64 preemption_other; u64 guest_mode;
};
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 7e45d03cd018..5842abf1eac4 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4165,6 +4165,8 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,
static void svm_handle_exit_irqoff(struct kvm_vcpu *vcpu) {
- if (to_svm(vcpu)->vmcb->control.exit_code == SVM_EXIT_INTR)
vcpu->arch.at_instruction_boundary = true;
}
static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 982df9c000d3..c44f8e1d30c8 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6549,6 +6549,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu) return;
handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
- vcpu->arch.at_instruction_boundary = true;
}
static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 39c571224ac2..36453517e847 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -291,6 +291,8 @@ const struct _kvm_stats_desc kvm_vcpu_stats_desc[] = { STATS_DESC_COUNTER(VCPU, nested_run), STATS_DESC_COUNTER(VCPU, directed_yield_attempted), STATS_DESC_COUNTER(VCPU, directed_yield_successful),
- STATS_DESC_COUNTER(VCPU, preemption_reported),
- STATS_DESC_COUNTER(VCPU, preemption_other), STATS_DESC_ICOUNTER(VCPU, guest_mode)
};
@@ -4604,6 +4606,19 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu) struct kvm_memslots *slots; static const u8 preempted = KVM_VCPU_PREEMPTED;
- /*
* The vCPU can be marked preempted if and only if the VM-Exit was on
* an instruction boundary and will not trigger guest emulation of any
* kind (see vcpu_run). Vendor specific code controls (conservatively)
* when this is true, for example allowing the vCPU to be marked
* preempted if and only if the VM-Exit was due to a host interrupt.
*/
- if (!vcpu->arch.at_instruction_boundary) {
vcpu->stat.preemption_other++;
return;
- }
- vcpu->stat.preemption_reported++; if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) return;
@@ -10358,6 +10373,13 @@ static int vcpu_run(struct kvm_vcpu *vcpu) vcpu->arch.l1tf_flush_l1d = true;
for (;;) {
/*
* If another guest vCPU requests a PV TLB flush in the middle
* of instruction emulation, the rest of the emulation could
* use a stale page translation. Assume that any code after
* this point can start executing an instruction.
*/
if (kvm_vcpu_running(vcpu)) { r = vcpu_enter_guest(vcpu); } else {vcpu->arch.at_instruction_boundary = false;
-- 2.35.1
On Tue, Jun 21, 2022 at 11:23 PM Sasha Levin sashal@kernel.org wrote:
Paolo, ping?
What happened here? From what I can tell, even the backports that Sasha already wrote didn't get applied?
On 6/21/22 23:23, Sasha Levin wrote:
Paolo, ping?
Sorry I missed the whole bunch of MANUALSEL patches.
Acked-by: Paolo Bonzini pbonzini@redhat.com
for all six.
Paolo
On Wed, Aug 03, 2022 at 06:50:25PM +0200, Paolo Bonzini wrote:
On 6/21/22 23:23, Sasha Levin wrote:
Paolo, ping?
Sorry I missed the whole bunch of MANUALSEL patches.
Acked-by: Paolo Bonzini pbonzini@redhat.com
for all six.
Thanks Paolo! I'll queue up these and the missing ones as pointed out by Jann.
linux-stable-mirror@lists.linaro.org