This patch series backports a few VM preemption_status, steal_time and PV TLB flushing fixes to the 5.10 stable kernel.
Most of the changes backport cleanly, except that I had to work around a few because of missing support/APIs in the 5.10 kernel. I have captured those in the changelog as well as in the individual patches.
Changelog
- Use mark_page_dirty_in_slot api without kvm argument (KVM: x86: Fix
  recording of guest steal time / preempted status)
- Avoid checking for xen_msr and SEV-ES conditions (KVM: x86: do not
  set st->preempted when going back to user space)
- Use VCPU_STAT macro to expose preemption_reported and
  preemption_other fields (KVM: x86: do not report a vCPU as preempted
  outside instruction boundaries)

David Woodhouse (2):
  KVM: x86: Fix recording of guest steal time / preempted status
  KVM: Fix steal time asm constraints

Lai Jiangshan (1):
  KVM: x86: Ensure PV TLB flush tracepoint reflects KVM behavior

Paolo Bonzini (5):
  KVM: x86: do not set st->preempted when going back to user space
  KVM: x86: do not report a vCPU as preempted outside instruction
    boundaries
  KVM: x86: revalidate steal time cache if MSR value changes
  KVM: x86: do not report preemption if the steal time cache is stale
  KVM: x86: move guest_pv_has out of user_access section

Sean Christopherson (1):
  KVM: x86: Remove obsolete disabling of page faults in
    kvm_arch_vcpu_put()

 arch/x86/include/asm/kvm_host.h |   5 +-
 arch/x86/kvm/svm/svm.c          |   2 +
 arch/x86/kvm/vmx/vmx.c          |   1 +
 arch/x86/kvm/x86.c              | 164 ++++++++++++++++++++++----------
 4 files changed, 122 insertions(+), 50 deletions(-)
From: Lai Jiangshan <laijs@linux.alibaba.com>
commit af3511ff7fa2107d6410831f3d71030f5e8d2b25 upstream.
In record_steal_time(), st->preempted is read twice, and trace_kvm_pv_tlb_flush() might output an inconsistent result if kvm_vcpu_flush_tlb_guest() sees a different st->preempted later.
It is a very trivial problem that hardly does any actual harm, and it can be avoided by resetting and reading st->preempted atomically via xchg().
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210531174628.10265-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com>
---
 arch/x86/kvm/x86.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c5a08ec348e6..3640b298c42e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3032,9 +3032,11 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 	 * expensive IPIs.
 	 */
 	if (guest_pv_has(vcpu, KVM_FEATURE_PV_TLB_FLUSH)) {
+		u8 st_preempted = xchg(&st->preempted, 0);
+
 		trace_kvm_pv_tlb_flush(vcpu->vcpu_id,
-			st->preempted & KVM_VCPU_FLUSH_TLB);
-		if (xchg(&st->preempted, 0) & KVM_VCPU_FLUSH_TLB)
+				       st_preempted & KVM_VCPU_FLUSH_TLB);
+		if (st_preempted & KVM_VCPU_FLUSH_TLB)
 			kvm_vcpu_flush_tlb_guest(vcpu);
 	} else {
 		st->preempted = 0;
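
As a rough userspace illustration of the idea in this patch (not the kernel code itself), the sketch below takes a single atomic snapshot of a flags byte and derives both the traced value and the flush decision from it, so the two can never disagree. It uses C11 atomics rather than the kernel's xchg(); the names FLAG_PREEMPTED, FLAG_FLUSH_TLB, struct flush_decision and consume_preempted() are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FLAG_PREEMPTED 0x1u /* hypothetical stand-in for KVM_VCPU_PREEMPTED */
#define FLAG_FLUSH_TLB 0x2u /* hypothetical stand-in for KVM_VCPU_FLUSH_TLB */

struct flush_decision {
	bool traced;  /* value that would be handed to the tracepoint */
	bool flushed; /* whether a TLB flush would be performed */
};

/*
 * Fetch and clear the flags in one atomic exchange, then compute both
 * results from that single snapshot. A concurrent writer can no longer
 * change the byte between "trace" and "decide".
 */
static struct flush_decision consume_preempted(_Atomic uint8_t *preempted)
{
	assert(preempted != NULL);

	uint8_t snap = atomic_exchange(preempted, 0);
	struct flush_decision d = {
		.traced = (snap & FLAG_FLUSH_TLB) != 0,
		.flushed = (snap & FLAG_FLUSH_TLB) != 0,
	};

	return d;
}
```

Calling consume_preempted() on a byte holding both flags reports a flush once and leaves the byte cleared; a second call reports nothing.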
From: David Woodhouse <dwmw2@infradead.org>
commit 7e2175ebd695f17860c5bd4ad7616cce12ed4591 upstream.
In commit b043138246a4 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed") we switched to using a gfn_to_pfn_cache for accessing the guest steal time structure in order to allow for an atomic xchg of the preempted field. This has a couple of problems.
Firstly, kvm_map_gfn() doesn't work at all for IOMEM pages when the atomic flag is set, which it is in kvm_steal_time_set_preempted(). So a guest vCPU using an IOMEM page for its steal time would never have its preempted field set.
Secondly, the gfn_to_pfn_cache is not invalidated in all cases where it should have been. There are two stages to the GFN->PFN conversion; first the GFN is converted to a userspace HVA, and then that HVA is looked up in the process page tables to find the underlying host PFN. Correct invalidation of the latter would require being hooked up to the MMU notifiers, but that doesn't happen---so it just keeps mapping and unmapping the *wrong* PFN after the userspace page tables change.
In the !IOMEM case at least the stale page *is* pinned all the time it's cached, so it won't be freed and reused by anyone else while still receiving the steal time updates. The map/unmap dance only takes care of the KVM administrivia such as marking the page dirty.
Until the gfn_to_pfn cache handles the remapping automatically by integrating with the MMU notifiers, we might as well not get a kernel mapping of it, and use the perfectly serviceable userspace HVA that we already have. We just need to implement the atomic xchg on the userspace address with appropriate exception handling, which is fairly trivial.
Cc: stable@vger.kernel.org
Fixes: b043138246a4 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <3645b9b889dac6438394194bb5586a46b68d581f.camel@infradead.org>
[I didn't entirely agree with David's assessment of the
 usefulness of the gfn_to_pfn cache, and integrated the outcome
 of the discussion in the above commit message. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[risbhat@amazon.com: Use the older mark_page_dirty_in_slot api without
kvm argument]
Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com>
---
 arch/x86/include/asm/kvm_host.h |   2 +-
 arch/x86/kvm/x86.c              | 105 +++++++++++++++++++++++---------
 2 files changed, 76 insertions(+), 31 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 38c63a78aba6..2b35f8139f15 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -663,7 +663,7 @@ struct kvm_vcpu_arch {
 		u8 preempted;
 		u64 msr_val;
 		u64 last_steal;
-		struct gfn_to_pfn_cache cache;
+		struct gfn_to_hva_cache cache;
 	} st;
 
 	u64 l1_tsc_offset;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3640b298c42e..cd1e6710bc33 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3013,53 +3013,92 @@ static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
 
 static void record_steal_time(struct kvm_vcpu *vcpu)
 {
-	struct kvm_host_map map;
-	struct kvm_steal_time *st;
+	struct gfn_to_hva_cache *ghc = &vcpu->arch.st.cache;
+	struct kvm_steal_time __user *st;
+	struct kvm_memslots *slots;
+	u64 steal;
+	u32 version;
 
 	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
 		return;
 
-	/* -EAGAIN is returned in atomic context so we can just return. */
-	if (kvm_map_gfn(vcpu, vcpu->arch.st.msr_val >> PAGE_SHIFT,
-			&map, &vcpu->arch.st.cache, false))
+	if (WARN_ON_ONCE(current->mm != vcpu->kvm->mm))
 		return;
 
-	st = map.hva +
-		offset_in_page(vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS);
+	slots = kvm_memslots(vcpu->kvm);
+
+	if (unlikely(slots->generation != ghc->generation ||
+		     kvm_is_error_hva(ghc->hva) || !ghc->memslot)) {
+		gfn_t gfn = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
+
+		/* We rely on the fact that it fits in a single page. */
+		BUILD_BUG_ON((sizeof(*st) - 1) & KVM_STEAL_VALID_BITS);
+
+		if (kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, gfn, sizeof(*st)) ||
+		    kvm_is_error_hva(ghc->hva) || !ghc->memslot)
+			return;
+	}
+
+	st = (struct kvm_steal_time __user *)ghc->hva;
+	if (!user_access_begin(st, sizeof(*st)))
+		return;
 
 	/*
 	 * Doing a TLB flush here, on the guest's behalf, can avoid
 	 * expensive IPIs.
 	 */
 	if (guest_pv_has(vcpu, KVM_FEATURE_PV_TLB_FLUSH)) {
-		u8 st_preempted = xchg(&st->preempted, 0);
+		u8 st_preempted = 0;
+		int err = -EFAULT;
+
+		asm volatile("1: xchgb %0, %2\n"
+			     "xor %1, %1\n"
+			     "2:\n"
+			     _ASM_EXTABLE_UA(1b, 2b)
+			     : "+r" (st_preempted),
+			       "+&r" (err)
+			     : "m" (st->preempted));
+		if (err)
+			goto out;
+
+		user_access_end();
+
+		vcpu->arch.st.preempted = 0;
 
 		trace_kvm_pv_tlb_flush(vcpu->vcpu_id,
 				       st_preempted & KVM_VCPU_FLUSH_TLB);
 		if (st_preempted & KVM_VCPU_FLUSH_TLB)
 			kvm_vcpu_flush_tlb_guest(vcpu);
+
+		if (!user_access_begin(st, sizeof(*st)))
+			goto dirty;
 	} else {
-		st->preempted = 0;
+		unsafe_put_user(0, &st->preempted, out);
+		vcpu->arch.st.preempted = 0;
 	}
 
-	vcpu->arch.st.preempted = 0;
-
-	if (st->version & 1)
-		st->version += 1; /* first time write, random junk */
+	unsafe_get_user(version, &st->version, out);
+	if (version & 1)
+		version += 1; /* first time write, random junk */
 
-	st->version += 1;
+	version += 1;
+	unsafe_put_user(version, &st->version, out);
 
 	smp_wmb();
 
-	st->steal += current->sched_info.run_delay -
+	unsafe_get_user(steal, &st->steal, out);
+	steal += current->sched_info.run_delay -
 		vcpu->arch.st.last_steal;
 	vcpu->arch.st.last_steal = current->sched_info.run_delay;
+	unsafe_put_user(steal, &st->steal, out);
 
-	smp_wmb();
-
-	st->version += 1;
+	version += 1;
+	unsafe_put_user(version, &st->version, out);
 
-	kvm_unmap_gfn(vcpu, &map, &vcpu->arch.st.cache, true, false);
+ out:
+	user_access_end();
+ dirty:
+	mark_page_dirty_in_slot(ghc->memslot, gpa_to_gfn(ghc->gpa));
 }
 
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
@@ -4044,8 +4083,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 {
-	struct kvm_host_map map;
-	struct kvm_steal_time *st;
+	struct gfn_to_hva_cache *ghc = &vcpu->arch.st.cache;
+	struct kvm_steal_time __user *st;
+	struct kvm_memslots *slots;
+	static const u8 preempted = KVM_VCPU_PREEMPTED;
 
 	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
 		return;
@@ -4053,16 +4094,23 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.st.preempted)
 		return;
 
-	if (kvm_map_gfn(vcpu, vcpu->arch.st.msr_val >> PAGE_SHIFT, &map,
-			&vcpu->arch.st.cache, true))
+	/* This happens on process exit */
+	if (unlikely(current->mm != vcpu->kvm->mm))
 		return;
 
-	st = map.hva +
-		offset_in_page(vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS);
+	slots = kvm_memslots(vcpu->kvm);
+
+	if (unlikely(slots->generation != ghc->generation ||
+		     kvm_is_error_hva(ghc->hva) || !ghc->memslot))
+		return;
 
-	st->preempted = vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
+	st = (struct kvm_steal_time __user *)ghc->hva;
+	BUILD_BUG_ON(sizeof(st->preempted) != sizeof(preempted));
 
-	kvm_unmap_gfn(vcpu, &map, &vcpu->arch.st.cache, true, true);
+	if (!copy_to_user_nofault(&st->preempted, &preempted, sizeof(preempted)))
+		vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
+
+	mark_page_dirty_in_slot(ghc->memslot, gpa_to_gfn(ghc->gpa));
 }
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
@@ -10145,11 +10193,8 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
-	struct gfn_to_pfn_cache *cache = &vcpu->arch.st.cache;
 	int idx;
 
-	kvm_release_pfn(cache->pfn, cache->dirty, cache);
-
 	kvmclock_reset(vcpu);
 
 	kvm_x86_ops.vcpu_free(vcpu);
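
The version field updated in record_steal_time() follows an even/odd convention: odd while an update is in flight, even once it is complete. The sketch below is a generic userspace model of that convention using C11 atomics, not the kernel implementation; struct steal_area, steal_area_add() and steal_area_read() are hypothetical names, and the fences follow the usual seqlock pattern rather than the exact barriers in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct steal_area {
	_Atomic uint32_t version; /* odd while an update is in progress */
	_Atomic uint64_t steal;   /* accumulated steal time */
};

/* Writer side: make the version odd, update the payload, make it even. */
static void steal_area_add(struct steal_area *sa, uint64_t delta)
{
	assert(sa != NULL);

	uint32_t v = atomic_load_explicit(&sa->version, memory_order_relaxed);

	if (v & 1)
		v += 1; /* leftover odd value: round up to even first */

	atomic_store_explicit(&sa->version, v + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);

	uint64_t s = atomic_load_explicit(&sa->steal, memory_order_relaxed);
	atomic_store_explicit(&sa->steal, s + delta, memory_order_relaxed);

	atomic_store_explicit(&sa->version, v + 2, memory_order_release);
}

/* Reader side: retry while an update is in flight or the version moved. */
static uint64_t steal_area_read(struct steal_area *sa)
{
	assert(sa != NULL);

	uint32_t before, after;
	uint64_t s;

	do {
		before = atomic_load_explicit(&sa->version, memory_order_acquire);
		s = atomic_load_explicit(&sa->steal, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		after = atomic_load_explicit(&sa->version, memory_order_relaxed);
	} while ((before & 1) || before != after);

	return s;
}
```

The reader never needs a lock: it simply discards any sample taken while the version was odd or changed underneath it.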
From: David Woodhouse <dwmw@amazon.co.uk>
commit 964b7aa0b040bdc6ec1c543ee620cda3f8b4c68a upstream.
In 64-bit mode, x86 instruction encoding allows us to use the low 8 bits of any GPR as an 8-bit operand. In 32-bit mode, however, we can only use the [abcd] registers. For which, GCC has the "q" constraint instead of the less restrictive "r".
Also fix st->preempted, which is an input/output operand rather than an input.
Fixes: 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <89bf72db1b859990355f9c40713a34e0d2d86c98.camel@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com>
---
 arch/x86/kvm/x86.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cd1e6710bc33..3de3dcb27f7b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3055,9 +3055,9 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 			     "xor %1, %1\n"
 			     "2:\n"
 			     _ASM_EXTABLE_UA(1b, 2b)
-			     : "+r" (st_preempted),
-			       "+&r" (err)
-			     : "m" (st->preempted));
+			     : "+q" (st_preempted),
+			       "+&r" (err),
+			       "+m" (st->preempted));
 		if (err)
 			goto out;
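
To make the two constraint changes concrete, here is a minimal userspace sketch of a byte exchange written with GCC-style extended asm. It omits the exception-table handling of the real code, and xchg_u8() is a hypothetical helper name; on non-x86 targets it falls back to the compiler's __atomic_exchange_n builtin so that it still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap val with *ptr and return the previous contents of *ptr. */
static inline uint8_t xchg_u8(uint8_t *ptr, uint8_t val)
{
	assert(ptr != NULL);

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
	/*
	 * "+q": val must live in a register whose low byte is addressable
	 *       (a, b, c or d on 32-bit x86), and it is read and written.
	 * "+m": *ptr is both read and written by the instruction, so it
	 *       is an input/output operand, not a plain input.
	 */
	__asm__ __volatile__("xchgb %0, %1"
			     : "+q" (val), "+m" (*ptr)
			     :
			     : "memory");
	return val;
#else
	/* Portable fallback so the sketch also builds on other targets. */
	return __atomic_exchange_n(ptr, val, __ATOMIC_SEQ_CST);
#endif
}
```

With "r" instead of "q", a 32-bit build may pick a register such as %esi that has no 8-bit form, which is the kind of build failure the test robot reported.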
From: Sean Christopherson <seanjc@google.com>
commit 19979fba9bfaeab427a8e106d915f0627c952828 upstream.
Remove the disabling of page faults across kvm_steal_time_set_preempted() as KVM now accesses the steal time struct (shared with the guest) via a cached mapping (see commit b043138246a4, "x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed".) The cache lookup is flagged as atomic, thus it would be a bug if KVM tried to resolve a new pfn, i.e. we want the splat that would be reached via might_fault().
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210123000334.3123628-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com>
---
 arch/x86/kvm/x86.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3de3dcb27f7b..87c2283f12c4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4120,15 +4120,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	if (vcpu->preempted)
 		vcpu->arch.preempted_in_kernel = !kvm_x86_ops.get_cpl(vcpu);
 
-	/*
-	 * Disable page faults because we're in atomic context here.
-	 * kvm_write_guest_offset_cached() would call might_fault()
-	 * that relies on pagefault_disable() to tell if there's a
-	 * bug. NOTE: the write to guest memory may not go through if
-	 * during postcopy live migration or if there's heavy guest
-	 * paging.
-	 */
-	pagefault_disable();
 	/*
 	 * kvm_memslots() will be called by
 	 * kvm_write_guest_offset_cached() so take the srcu lock.
@@ -4136,7 +4127,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	idx = srcu_read_lock(&vcpu->kvm->srcu);
 	kvm_steal_time_set_preempted(vcpu);
 	srcu_read_unlock(&vcpu->kvm->srcu, idx);
-	pagefault_enable();
 	kvm_x86_ops.vcpu_put(vcpu);
 	vcpu->arch.last_host_tsc = rdtsc();
 	/*
From: Paolo Bonzini <pbonzini@redhat.com>
commit 54aa83c90198e68eee8b0850c749bc70efb548da upstream.
Similar to the Xen path, only change the vCPU's reported state if the vCPU was actually preempted. The reason for KVM's behavior is that for example optimistic spinning might not be a good idea if the guest is doing repeated exits to userspace; however, it is confusing and unlikely to make a difference, because well-tuned guests will hardly ever exit KVM_RUN in the first place.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[risbhat@amazon.com: Don't check for xen msr as support is not available
and skip the SEV-ES condition]
Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com>
---
 arch/x86/kvm/x86.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 87c2283f12c4..0df41be32314 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4117,16 +4117,18 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
 	int idx;
 
-	if (vcpu->preempted)
+	if (vcpu->preempted) {
 		vcpu->arch.preempted_in_kernel = !kvm_x86_ops.get_cpl(vcpu);
 
-	/*
-	 * kvm_memslots() will be called by
-	 * kvm_write_guest_offset_cached() so take the srcu lock.
-	 */
-	idx = srcu_read_lock(&vcpu->kvm->srcu);
-	kvm_steal_time_set_preempted(vcpu);
-	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+		/*
+		 * Take the srcu lock as memslots will be accessed to check the gfn
+		 * cache generation against the memslots generation.
+		 */
+		idx = srcu_read_lock(&vcpu->kvm->srcu);
+		kvm_steal_time_set_preempted(vcpu);
+		srcu_read_unlock(&vcpu->kvm->srcu, idx);
+	}
+
 	kvm_x86_ops.vcpu_put(vcpu);
 	vcpu->arch.last_host_tsc = rdtsc();
 	/*
From: Paolo Bonzini <pbonzini@redhat.com>
commit 6cd88243c7e03845a450795e134b488fc2afb736 upstream.
If a vCPU is outside guest mode and is scheduled out, it might be in the process of making a memory access. A problem occurs if another vCPU uses the PV TLB flush feature during the period when the vCPU is scheduled out, and a virtual address has already been translated but has not yet been accessed, because this is equivalent to using a stale TLB entry.
To avoid this, only report a vCPU as preempted if it is certain that the guest is at an instruction boundary. A rescheduling request will be delivered to the host physical CPU as an external interrupt, so for simplicity consider any vmexit *not* to be at an instruction boundary, except for external interrupts.
It would in principle be okay to report the vCPU as preempted also if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the vmentry/vmexit overhead unnecessarily, and optimistic spinning is also unlikely to succeed. However, leave it for later because right now kvm_vcpu_check_block() is doing memory accesses. Even though the TLB flush issue only applies to virtual memory addresses, it's very much preferable to be conservative.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[risbhat@amazon.com: Use VCPU_STAT to expose preemption_reported and
preemption_other debugfs entries.]
Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com>
---
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/svm/svm.c          |  2 ++
 arch/x86/kvm/vmx/vmx.c          |  1 +
 arch/x86/kvm/x86.c              | 22 ++++++++++++++++++++++
 4 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2b35f8139f15..25b720304640 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -553,6 +553,7 @@ struct kvm_vcpu_arch {
 	u64 ia32_misc_enable_msr;
 	u64 smbase;
 	u64 smi_count;
+	bool at_instruction_boundary;
 	bool tpr_access_reporting;
 	bool xsaves_enabled;
 	u64 ia32_xss;
@@ -1061,6 +1062,8 @@ struct kvm_vcpu_stat {
 	u64 req_event;
 	u64 halt_poll_success_ns;
 	u64 halt_poll_fail_ns;
+	u64 preemption_reported;
+	u64 preemption_other;
 };
 
 struct x86_instruction_info;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 442705517caf..f3b7a6a82b07 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3991,6 +3991,8 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,
 
 static void svm_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 {
+	if (to_svm(vcpu)->vmcb->control.exit_code == SVM_EXIT_INTR)
+		vcpu->arch.at_instruction_boundary = true;
 }
 
 static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b33d0f283d4f..b8a6ab210c4e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6496,6 +6496,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
 		return;
 
 	handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
+	vcpu->arch.at_instruction_boundary = true;
 }
 
 static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0df41be32314..75494b3c2d1e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -231,6 +231,8 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
 	VCPU_STAT("l1d_flush", l1d_flush),
 	VCPU_STAT("halt_poll_success_ns", halt_poll_success_ns),
 	VCPU_STAT("halt_poll_fail_ns", halt_poll_fail_ns),
+	VCPU_STAT("preemption_reported", preemption_reported),
+	VCPU_STAT("preemption_other", preemption_other),
 	VM_STAT("mmu_shadow_zapped", mmu_shadow_zapped),
 	VM_STAT("mmu_pte_write", mmu_pte_write),
 	VM_STAT("mmu_pde_zapped", mmu_pde_zapped),
@@ -4088,6 +4090,19 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 	struct kvm_memslots *slots;
 	static const u8 preempted = KVM_VCPU_PREEMPTED;
 
+	/*
+	 * The vCPU can be marked preempted if and only if the VM-Exit was on
+	 * an instruction boundary and will not trigger guest emulation of any
+	 * kind (see vcpu_run). Vendor specific code controls (conservatively)
+	 * when this is true, for example allowing the vCPU to be marked
+	 * preempted if and only if the VM-Exit was due to a host interrupt.
+	 */
+	if (!vcpu->arch.at_instruction_boundary) {
+		vcpu->stat.preemption_other++;
+		return;
+	}
+
+	vcpu->stat.preemption_reported++;
 	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
 		return;
 
@@ -9300,6 +9315,13 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 	vcpu->arch.l1tf_flush_l1d = true;
 
 	for (;;) {
+		/*
+		 * If another guest vCPU requests a PV TLB flush in the middle
+		 * of instruction emulation, the rest of the emulation could
+		 * use a stale page translation. Assume that any code after
+		 * this point can start executing an instruction.
+		 */
+		vcpu->arch.at_instruction_boundary = false;
 		if (kvm_vcpu_running(vcpu)) {
 			r = vcpu_enter_guest(vcpu);
 		} else {
From: Paolo Bonzini <pbonzini@redhat.com>
commit 901d3765fa804ce42812f1d5b1f3de2dfbb26723 upstream.
Commit 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status", 2021-11-11) open coded the previous call to kvm_map_gfn, but in doing so it dropped the comparison between the cached guest physical address and the one in the MSR. This causes an incorrect cache hit if the guest modifies the steal time address while the memslots remain the same. This can happen with kexec, in which case the steal time data is written at the address used by the old kernel instead of the new one.
While at it, rename the variable from gfn to gpa since it is a plain physical address and not a right-shifted one.
Reported-by: Dave Young <ruyang@redhat.com>
Reported-by: Xiaoying Yan <yiyan@redhat.com>
Analyzed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: stable@vger.kernel.org
Fixes: 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com>
---
 arch/x86/kvm/x86.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 75494b3c2d1e..111aa95f3de3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3018,6 +3018,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 	struct gfn_to_hva_cache *ghc = &vcpu->arch.st.cache;
 	struct kvm_steal_time __user *st;
 	struct kvm_memslots *slots;
+	gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
 	u64 steal;
 	u32 version;
 
@@ -3030,13 +3031,12 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 	slots = kvm_memslots(vcpu->kvm);
 
 	if (unlikely(slots->generation != ghc->generation ||
+		     gpa != ghc->gpa ||
 		     kvm_is_error_hva(ghc->hva) || !ghc->memslot)) {
-		gfn_t gfn = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
-
 		/* We rely on the fact that it fits in a single page. */
 		BUILD_BUG_ON((sizeof(*st) - 1) & KVM_STEAL_VALID_BITS);
 
-		if (kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, gfn, sizeof(*st)) ||
+		if (kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, gpa, sizeof(*st)) ||
 		    kvm_is_error_hva(ghc->hva) || !ghc->memslot)
 			return;
 	}
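
The kexec scenario described in the commit message can be modelled with a tiny cache-validity check: a hit requires both the memslot generation and the guest physical address to match what was cached. The sketch below is an illustrative userspace model only; struct hva_cache_model, cache_hit() and cache_refill() are hypothetical names and carry none of the real gfn_to_hva_cache state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct hva_cache_model {
	uint64_t generation; /* memslot generation seen when the cache was filled */
	uint64_t gpa;        /* guest physical address the cache was filled for */
	bool valid;
};

/*
 * A cached translation is usable only if nothing it depends on changed:
 * neither the memslot layout nor the address the guest programmed.
 */
static bool cache_hit(const struct hva_cache_model *c,
		      uint64_t cur_generation, uint64_t msr_gpa)
{
	assert(c != NULL);
	return c->valid &&
	       c->generation == cur_generation &&
	       c->gpa == msr_gpa;
}

/* Record what the cache was filled for, so later lookups can compare. */
static void cache_refill(struct hva_cache_model *c,
			 uint64_t cur_generation, uint64_t msr_gpa)
{
	assert(c != NULL);
	c->generation = cur_generation;
	c->gpa = msr_gpa;
	c->valid = true;
}
```

Without the gpa comparison, a new kernel that programs a different address while the memslots stay unchanged would keep hitting the old entry, which is the stale-write case this patch closes.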
From: Paolo Bonzini <pbonzini@redhat.com>
commit c3c28d24d910a746b02f496d190e0e8c6560224b upstream.
Commit 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status", 2021-11-11) open coded the previous call to kvm_map_gfn, but in doing so it dropped the comparison between the cached guest physical address and the one in the MSR. This causes an incorrect cache hit if the guest modifies the steal time address while the memslots remain the same. This can happen with kexec, in which case the preempted bit is written at the address used by the old kernel instead of the new one.
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: stable@vger.kernel.org
Fixes: 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com>
---
 arch/x86/kvm/x86.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 111aa95f3de3..9e9298c333c8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4089,6 +4089,7 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 	struct kvm_steal_time __user *st;
 	struct kvm_memslots *slots;
 	static const u8 preempted = KVM_VCPU_PREEMPTED;
+	gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
 
 	/*
 	 * The vCPU can be marked preempted if and only if the VM-Exit was on
@@ -4116,6 +4117,7 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 	slots = kvm_memslots(vcpu->kvm);
 
 	if (unlikely(slots->generation != ghc->generation ||
+		     gpa != ghc->gpa ||
 		     kvm_is_error_hva(ghc->hva) || !ghc->memslot))
 		return;
From: Paolo Bonzini <pbonzini@redhat.com>
commit 3e067fd8503d6205aa0c1c8f48f6b209c592d19c upstream.
When UBSAN is enabled, the code emitted for the call to guest_pv_has includes a call to __ubsan_handle_load_invalid_value. objtool complains that this call happens with UACCESS enabled; to avoid the warning, pull the calls to user_access_begin into both arms of the "if" statement, after the check for guest_pv_has.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com>
---
 arch/x86/kvm/x86.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9e9298c333c8..e3599a51c72d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3042,9 +3042,6 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 	}
 
 	st = (struct kvm_steal_time __user *)ghc->hva;
-	if (!user_access_begin(st, sizeof(*st)))
-		return;
-
 	/*
 	 * Doing a TLB flush here, on the guest's behalf, can avoid
 	 * expensive IPIs.
@@ -3053,6 +3050,9 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 		u8 st_preempted = 0;
 		int err = -EFAULT;
 
+		if (!user_access_begin(st, sizeof(*st)))
+			return;
+
 		asm volatile("1: xchgb %0, %2\n"
 			     "xor %1, %1\n"
 			     "2:\n"
@@ -3075,6 +3075,9 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 		if (!user_access_begin(st, sizeof(*st)))
 			goto dirty;
 	} else {
+		if (!user_access_begin(st, sizeof(*st)))
+			return;
+
 		unsafe_put_user(0, &st->preempted, out);
 		vcpu->arch.st.preempted = 0;
 	}
Gentle reminder to review this patch series.
On 9/9/22, 11:56 AM, "Rishabh Bhatnagar" <risbhat@amazon.com> wrote:
> This patch series backports a few VM preemption_status, steal_time and PV TLB flushing fixes to the 5.10 stable kernel.
> Most of the changes backport cleanly, except that I had to work around a few because of missing support/APIs in the 5.10 kernel. I have captured those in the changelog as well as in the individual patches.
>
> Changelog
> - Use mark_page_dirty_in_slot api without kvm argument (KVM: x86: Fix
>   recording of guest steal time / preempted status)
> - Avoid checking for xen_msr and SEV-ES conditions (KVM: x86: do not
>   set st->preempted when going back to user space)
> - Use VCPU_STAT macro to expose preemption_reported and
>   preemption_other fields (KVM: x86: do not report a vCPU as preempted
>   outside instruction boundaries)
>
> David Woodhouse (2):
>   KVM: x86: Fix recording of guest steal time / preempted status
>   KVM: Fix steal time asm constraints
>
> Lai Jiangshan (1):
>   KVM: x86: Ensure PV TLB flush tracepoint reflects KVM behavior
>
> Paolo Bonzini (5):
>   KVM: x86: do not set st->preempted when going back to user space
>   KVM: x86: do not report a vCPU as preempted outside instruction
>     boundaries
>   KVM: x86: revalidate steal time cache if MSR value changes
>   KVM: x86: do not report preemption if the steal time cache is stale
>   KVM: x86: move guest_pv_has out of user_access section
>
> Sean Christopherson (1):
>   KVM: x86: Remove obsolete disabling of page faults in
>     kvm_arch_vcpu_put()
>
>  arch/x86/include/asm/kvm_host.h |   5 +-
>  arch/x86/kvm/svm/svm.c          |   2 +
>  arch/x86/kvm/vmx/vmx.c          |   1 +
>  arch/x86/kvm/x86.c              | 164 ++++++++++++++++++++++----------
>  4 files changed, 122 insertions(+), 50 deletions(-)
>
> --
> 2.37.1
On Tue, Sep 20, 2022 at 03:34:04PM +0000, Bhatnagar, Rishabh wrote:
> Gentle reminder to review this patch series.
Gentle reminder to never top-post :)
Also, it's up to the KVM maintainers if they wish to review this or not. I can't make them care about old and obsolete kernels like 5.10.y. Why not just use 5.15.y or newer?
thanks,
greg k-h
On Tue, Sep 20, 2022 at 06:19:26PM +0200, gregkh@linuxfoundation.org wrote:
> On Tue, Sep 20, 2022 at 03:34:04PM +0000, Bhatnagar, Rishabh wrote:
> > Gentle reminder to review this patch series.
> Gentle reminder to never top-post :)
> Also, it's up to the KVM maintainers if they wish to review this or not. I can't make them care about old and obsolete kernels like 5.10.y. Why not just use 5.15.y or newer?
Given the lack of responses here from the KVM developers, I'll drop this from my mbox and wait for them to be properly reviewed and resend before considering them for a stable release.
thanks,
greg k-h
On Wed, 21 Sep 2022, gregkh@linuxfoundation.org wrote:
> On Tue, Sep 20, 2022 at 06:19:26PM +0200, gregkh@linuxfoundation.org wrote:
> > On Tue, Sep 20, 2022 at 03:34:04PM +0000, Bhatnagar, Rishabh wrote:
> > > Gentle reminder to review this patch series.
> > Gentle reminder to never top-post :)
> > Also, it's up to the KVM maintainers if they wish to review this or not. I can't make them care about old and obsolete kernels like 5.10.y. Why not just use 5.15.y or newer?
> Given the lack of responses here from the KVM developers, I'll drop this from my mbox and wait for them to be properly reviewed and resend before considering them for a stable release.
KVM maintainers,
Would someone be kind enough to take a look at this for Greg please?
Note that at least one of the patches in this set has been identified as a fix for a serious security issue regarding the compromise of guest kernels due to the mishandling of flush operations.
Please could someone confirm or otherwise that this is relevant for v5.10.y and older?
Thank you.
On Wed, Apr 19, 2023, Lee Jones wrote:
> On Wed, 21 Sep 2022, gregkh@linuxfoundation.org wrote:
> > On Tue, Sep 20, 2022 at 06:19:26PM +0200, gregkh@linuxfoundation.org wrote:
> > > On Tue, Sep 20, 2022 at 03:34:04PM +0000, Bhatnagar, Rishabh wrote:
> > > > Gentle reminder to review this patch series.
> > > Gentle reminder to never top-post :)
> > > Also, it's up to the KVM maintainers if they wish to review this or not. I can't make them care about old and obsolete kernels like 5.10.y. Why not just use 5.15.y or newer?
> > Given the lack of responses here from the KVM developers, I'll drop this from my mbox and wait for them to be properly reviewed and resend before considering them for a stable release.
> KVM maintainers,
> Would someone be kind enough to take a look at this for Greg please?
> Note that at least one of the patches in this set has been identified as a fix for a serious security issue regarding the compromise of guest kernels due to the mishandling of flush operations.
A minor note, the security issue is serious _if_ the bug can be exploited, which as is often the case for KVM, is a fairly big "if". Jann's PoC relied on collusion between host userspace and the guest kernel, and as Jann called out, triggering the bug on a !PREEMPT host kernel would be quite difficult in practice.
I don't want to downplay the seriousness of compromising guest security, but CVSS scores for KVM CVEs almost always fail to account for the multitude of factors in play. E.g. CVE-2023-30456 also had a score of 7.8, and that bug required disabling EPT, which pretty much no one does when running untrusted guest code.
In other words, take the purported severity with a grain of salt.
> Please could someone confirm or otherwise that this is relevant for v5.10.y and older?
Acked-by: Sean Christopherson seanjc@google.com
On Tue, 02 May 2023, Sean Christopherson wrote:
> On Wed, Apr 19, 2023, Lee Jones wrote:
> > On Wed, 21 Sep 2022, gregkh@linuxfoundation.org wrote:
> > > On Tue, Sep 20, 2022 at 06:19:26PM +0200, gregkh@linuxfoundation.org wrote:
> > > > On Tue, Sep 20, 2022 at 03:34:04PM +0000, Bhatnagar, Rishabh wrote:
> > > > > Gentle reminder to review this patch series.
> > > > Gentle reminder to never top-post :)
> > > > Also, it's up to the KVM maintainers if they wish to review this or not. I can't make them care about old and obsolete kernels like 5.10.y. Why not just use 5.15.y or newer?
> > > Given the lack of responses here from the KVM developers, I'll drop this from my mbox and wait for them to be properly reviewed and resend before considering them for a stable release.
> > KVM maintainers,
> > Would someone be kind enough to take a look at this for Greg please?
> > Note that at least one of the patches in this set has been identified as a fix for a serious security issue regarding the compromise of guest kernels due to the mishandling of flush operations.
> A minor note, the security issue is serious _if_ the bug can be exploited, which as is often the case for KVM, is a fairly big "if". Jann's PoC relied on collusion between host userspace and the guest kernel, and as Jann called out, triggering the bug on a !PREEMPT host kernel would be quite difficult in practice.
> I don't want to downplay the seriousness of compromising guest security, but CVSS scores for KVM CVEs almost always fail to account for the multitude of factors in play. E.g. CVE-2023-30456 also had a score of 7.8, and that bug required disabling EPT, which pretty much no one does when running untrusted guest code.
> In other words, take the purported severity with a grain of salt.
> > Please could someone confirm or otherwise that this is relevant for v5.10.y and older?
> Acked-by: Sean Christopherson <seanjc@google.com>
Thanks for taking the time to provide some background information and for the Ack Sean, much appreciated.
For anyone taking notice, I expect a little lag on this still whilst Greg is AFK. I'll follow up in a few days.
On Wed, May 03, 2023 at 08:34:33AM +0100, Lee Jones wrote:
Thanks for taking the time to provide some background information and for the Ack Sean, much appreciated.
For anyone taking notice, I expect a little lag on this still whilst Greg is AFK. I'll follow up in a few days.
What am I supposed to do here? The thread is long-gone from my stable review queue, is there some patch I'm supposed to apply? If so, can I get a resend with the proper acks added?
thanks,
greg k-h
On 5/3/23 6:10 PM, gregkh@linuxfoundation.org wrote:
What am I supposed to do here? The thread is long-gone from my stable review queue, is there some patch I'm supposed to apply? If so, can I get a resend with the proper acks added?
thanks,
greg k-h
Yeah, it's been half a year since I sent this series and I had mostly forgotten about it. Sure, I can resend a new version with the acks/tested-by added.
Thanks, Rishabh
On Thu, 04 May 2023, Bhatnagar, Rishabh wrote:
Yeah, it's been half a year since I sent this series and I had mostly forgotten about it. Sure, I can resend a new version with the acks/tested-by added.
Thank you Rishabh.
Please can you ensure that you Cc me on it.
Thanks Rishabh for the back-ports.
Tested-by: Allen Pais apais@linux.microsoft.com
Thanks.