From: Fred Griffoul fgriffo@amazon.co.uk
This patch series addresses both performance and correctness issues in nested VMX when handling guest memory.
During nested VMX operations, L0 (KVM) accesses specific L1 guest pages to manage L2 execution. These pages fall into two categories: pages accessed only by L0 (such as the L1 MSR bitmap page or the eVMCS page), and pages passed to the L2 guest via vmcs02 (such as APIC access, virtual APIC, and posted interrupt descriptor pages).
The current implementation uses kvm_vcpu_map/unmap, which causes two issues.
First, the current approach lacks invalidation handling in critical scenarios. Enlightened VMCS (eVMCS) pages can become stale when memslots are modified, as there is no mechanism to invalidate the cached mappings. Similarly, APIC access and virtual APIC pages can be migrated by the host, but without notification through mmu_notifier callbacks the mappings become invalid and can lead to incorrect behavior.
Second, for unmanaged guest memory (memory not directly mapped by the kernel, such as memory passed with the mem= parameter or guest_memfd for non-CoCo VMs), this workflow invokes expensive memremap/memunmap operations on every L2 VM entry/exit cycle. This creates significant overhead that impacts nested virtualization performance.
This series replaces kvm_host_map with gfn_to_pfn_cache in nested VMX. The pfncache infrastructure maintains a persistent mapping as long as the page GPA does not change, eliminating the memremap/memunmap overhead on every VM entry/exit cycle. Additionally, pfncache provides proper invalidation handling via mmu_notifier callbacks and memslot generation checks, ensuring that mappings are correctly updated across both memslot updates and page migration events.
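For reference, the access pattern the series relies on is the standard pfncache check/activate loop; the sketch below is a simplified version of what patch 1 wraps in nested_gpc_lock()/nested_gpc_unlock() (use_mapping() is an illustrative stand-in for the consumer, error handling trimmed):

	struct gfn_to_pfn_cache *gpc = &vmx->nested.msr_bitmap_cache;

	/* One-time setup, e.g. when entering VMX operation. */
	kvm_gpc_init(gpc, vcpu->kvm);

	/* Per-use fast path: validate under the read lock, (re)activate if stale. */
	for (;;) {
		read_lock(&gpc->lock);
		if (kvm_gpc_check(gpc, PAGE_SIZE) && gpc->gpa == gpa)
			break;
		read_unlock(&gpc->lock);
		if (kvm_gpc_activate(gpc, gpa, PAGE_SIZE))
			return -EFAULT;
	}

	/* gpc->khva and gpc->pfn are stable until the read lock is dropped. */
	use_mapping(gpc->khva);
	read_unlock(&gpc->lock);

	/* Teardown, e.g. when VMX operation ends. */
	kvm_gpc_deactivate(gpc);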
As an example, a microbenchmark using memslot_perf_test with 8192 memslots demonstrates huge improvements in nested VMX operations with unmanaged guest memory:
                Before    After     Improvement
map:            26.12s    1.54s     ~17x faster
unmap:          40.00s    0.017s    ~2353x faster
unmap chunked:  10.07s    0.005s    ~2014x faster
The series is organized as follows:
Patches 1-5 handle the L1 MSR bitmap page and system pages (APIC access, virtual APIC, and posted interrupt descriptor). Patch 1 converts the MSR bitmap to use gfn_to_pfn_cache. Patches 2-3 restore and complete "guest-uses-pfn" support in pfncache. Patch 4 converts the system pages to use gfn_to_pfn_cache. Patch 5 adds a selftest for cache invalidation and memslot updates.
Patches 6-7 add enlightened VMCS support. Patch 6 avoids accessing eVMCS fields after they are copied into the cached vmcs12 structure. Patch 7 converts eVMCS page mapping to use gfn_to_pfn_cache.
Patches 8-10 implement persistent nested context to handle L2 vCPU multiplexing and migration between L1 vCPUs. Patch 8 introduces the nested context management infrastructure. Patch 9 integrates pfncache with persistent nested context. Patch 10 adds a selftest for this L2 vCPU context switching.
v2:
- Extended series to support enlightened VMCS (eVMCS).
- Added persistent nested context for improved L2 vCPU handling.
- Added additional selftests.
Suggested-by: dwmw@amazon.co.uk
Fred Griffoul (10):
  KVM: nVMX: Implement cache for L1 MSR bitmap
  KVM: pfncache: Restore guest-uses-pfn support
  KVM: x86: Add nested state validation for pfncache support
  KVM: nVMX: Implement cache for L1 APIC pages
  KVM: selftests: Add nested VMX APIC cache invalidation test
  KVM: nVMX: Cache evmcs fields to ensure consistency during VM-entry
  KVM: nVMX: Replace evmcs kvm_host_map with pfncache
  KVM: x86: Add nested context management
  KVM: nVMX: Use nested context for pfncache persistence
  KVM: selftests: Add L2 vcpu context switch test
 arch/x86/include/asm/kvm_host.h               |  32 ++
 arch/x86/include/uapi/asm/kvm.h               |   2 +
 arch/x86/kvm/Makefile                         |   2 +-
 arch/x86/kvm/nested.c                         | 199 ++++++++
 arch/x86/kvm/vmx/hyperv.c                     |   5 +-
 arch/x86/kvm/vmx/hyperv.h                     |  33 +-
 arch/x86/kvm/vmx/nested.c                     | 463 ++++++++++++++----
 arch/x86/kvm/vmx/vmx.c                        |   8 +
 arch/x86/kvm/vmx/vmx.h                        |  16 +-
 arch/x86/kvm/x86.c                            |  19 +-
 include/linux/kvm_host.h                      |  34 +-
 include/linux/kvm_types.h                     |   1 +
 tools/testing/selftests/kvm/Makefile.kvm      |   2 +
 .../selftests/kvm/x86/vmx_apic_update_test.c  | 302 ++++++++++++
 .../selftests/kvm/x86/vmx_l2_switch_test.c    | 416 ++++++++++++++++
 virt/kvm/kvm_main.c                           |   3 +-
 virt/kvm/kvm_mm.h                             |   6 +-
 virt/kvm/pfncache.c                           |  43 +-
 18 files changed, 1467 insertions(+), 119 deletions(-)
 create mode 100644 arch/x86/kvm/nested.c
 create mode 100644 tools/testing/selftests/kvm/x86/vmx_apic_update_test.c
 create mode 100644 tools/testing/selftests/kvm/x86/vmx_l2_switch_test.c
--
2.43.0
From: Fred Griffoul fgriffo@amazon.co.uk
Introduce a gfn_to_pfn_cache to optimize L1 MSR bitmap access by replacing map/unmap operations. This optimization reduces overhead during L2 VM-entry where nested_vmx_prepare_msr_bitmap() merges L1's MSR intercepts with L0's requirements.
The current implementation, using kvm_vcpu_map_readonly() and kvm_vcpu_unmap(), incurs significant overhead, mostly with unmanaged guest memory.
The cache is initialized when entering VMX operation and deactivated when VMX operation ends.
Signed-off-by: Fred Griffoul fgriffo@amazon.co.uk
---
 arch/x86/kvm/vmx/nested.c | 42 +++++++++++++++++++++++++++++++++++----
 arch/x86/kvm/vmx/vmx.h    |  2 ++
 2 files changed, 40 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 8b131780e981..0de84b30c41d 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -315,6 +315,34 @@ static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs) vcpu->arch.regs_dirty = 0; }
+/* + * Maps a single guest page starting at @gpa and lock the cache for access. + */ +static int nested_gpc_lock(struct gfn_to_pfn_cache *gpc, gpa_t gpa) +{ + int err; + + if (!PAGE_ALIGNED(gpa)) + return -EINVAL; +retry: + read_lock(&gpc->lock); + if (!kvm_gpc_check(gpc, PAGE_SIZE) || (gpc->gpa != gpa)) { + read_unlock(&gpc->lock); + err = kvm_gpc_activate(gpc, gpa, PAGE_SIZE); + if (err) + return err; + + goto retry; + } + + return 0; +} + +static void nested_gpc_unlock(struct gfn_to_pfn_cache *gpc) +{ + read_unlock(&gpc->lock); +} + static void nested_put_vmcs12_pages(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); @@ -344,6 +372,9 @@ static void free_nested(struct kvm_vcpu *vcpu) vmx->nested.vmxon = false; vmx->nested.smm.vmxon = false; vmx->nested.vmxon_ptr = INVALID_GPA; + + kvm_gpc_deactivate(&vmx->nested.msr_bitmap_cache); + free_vpid(vmx->nested.vpid02); vmx->nested.posted_intr_nv = -1; vmx->nested.current_vmptr = INVALID_GPA; @@ -625,7 +656,7 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu, int msr; unsigned long *msr_bitmap_l1; unsigned long *msr_bitmap_l0 = vmx->nested.vmcs02.msr_bitmap; - struct kvm_host_map map; + struct gfn_to_pfn_cache *gpc;
/* Nothing to do if the MSR bitmap is not in use. */ if (!cpu_has_vmx_msr_bitmap() || @@ -648,10 +679,11 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu, return true; }
- if (kvm_vcpu_map_readonly(vcpu, gpa_to_gfn(vmcs12->msr_bitmap), &map)) + gpc = &vmx->nested.msr_bitmap_cache; + if (nested_gpc_lock(gpc, vmcs12->msr_bitmap)) return false;
- msr_bitmap_l1 = (unsigned long *)map.hva; + msr_bitmap_l1 = (unsigned long *)gpc->khva;
/* * To keep the control flow simple, pay eight 8-byte writes (sixteen @@ -739,7 +771,7 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu, nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0, MSR_IA32_PL3_SSP, MSR_TYPE_RW);
- kvm_vcpu_unmap(vcpu, &map); + nested_gpc_unlock(gpc);
vmx->nested.force_msr_bitmap_recalc = false;
@@ -5490,6 +5522,8 @@ static int enter_vmx_operation(struct kvm_vcpu *vcpu)
vmx->nested.vpid02 = allocate_vpid();
+ kvm_gpc_init(&vmx->nested.msr_bitmap_cache, vcpu->kvm); + vmx->nested.vmcs02_initialized = false; vmx->nested.vmxon = true;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index ea93121029f9..d76621403c28 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -152,6 +152,8 @@ struct nested_vmx {
struct loaded_vmcs vmcs02;
+ struct gfn_to_pfn_cache msr_bitmap_cache; + /* * Guest pages referred to in the vmcs02 with host-physical * pointers, so we must keep them pinned while L2 runs.
From: Fred Griffoul fgriffo@amazon.co.uk
Restore guest page access tracking in pfncache, so that cache invalidation through MMU notifier events automatically generates a vCPU request.
This feature is critical for nested VMX operations where both KVM and the L2 guest access guest-provided pages, such as APIC pages and posted interrupt descriptors.
This change:
- Reverts commit eefb85b3f031 ("KVM: Drop unused @may_block param from gfn_to_pfn_cache_invalidate_start()")
- Partially reverts commit a4bff3df5147 ("KVM: pfncache: remove KVM_GUEST_USES_PFN usage"). Adds kvm_gpc_init_for_vcpu() to initialize a pfncache for guest-mode access, instead of the usage-specific flag approach (illustrated below).
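With this in place, a cache whose PFN is handed directly to L2 (as later patches do for the APIC and posted-interrupt descriptor pages) only needs to be initialized with its owning vCPU; the sketch below is illustrative, not part of this patch:

	/* The cached PFN will be written into vmcs02 and consumed by L2. */
	kvm_gpc_init_for_vcpu(&vmx->nested.virtual_apic_cache, vcpu);

	/*
	 * When an mmu_notifier invalidation hits the cached range,
	 * gfn_to_pfn_cache_invalidate_start() marks the cache invalid and
	 * kicks the owning vCPU with KVM_REQ_OUTSIDE_GUEST_MODE, so a stale
	 * HPA is never used across the invalidation.
	 */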
Signed-off-by: Fred Griffoul fgriffo@amazon.co.uk
---
 include/linux/kvm_host.h  | 29 +++++++++++++++++++++++++-
 include/linux/kvm_types.h |  1 +
 virt/kvm/kvm_main.c       |  3 ++-
 virt/kvm/kvm_mm.h         |  6 ++++--
 virt/kvm/pfncache.c       | 43 ++++++++++++++++++++++++++++++++++++---
 5 files changed, 75 insertions(+), 7 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 19b8c4bebb9c..6253cf1c38c1 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1402,6 +1402,9 @@ int kvm_vcpu_write_guest(struct kvm_vcpu *vcpu, gpa_t gpa, const void *data, unsigned long len); void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn);
+void __kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm, + struct kvm_vcpu *vcpu); + /** * kvm_gpc_init - initialize gfn_to_pfn_cache. * @@ -1412,7 +1415,11 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn); * immutable attributes. Note, the cache must be zero-allocated (or zeroed by * the caller before init). */ -void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm); + +static inline void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm) +{ + __kvm_gpc_init(gpc, kvm, NULL); +}
/** * kvm_gpc_activate - prepare a cached kernel mapping and HPA for a given guest @@ -1494,6 +1501,26 @@ int kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, unsigned long len); */ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc);
+/** + * kvm_gpc_init_for_vcpu - initialize gfn_to_pfn_cache for pin/unpin usage + * + * @gpc: struct gfn_to_pfn_cache object. + * @vcpu: vCPU that will pin and directly access this cache. + * @req: request to send when cache is invalidated while pinned. + * + * This sets up a gfn_to_pfn_cache for use by a vCPU that will directly access + * the cached physical address. When the cache is invalidated while pinned, + * the specified request will be sent to the associated vCPU to force cache + * refresh. + * + * Note, the cache must be zero-allocated (or zeroed by the caller before init). + */ +static inline void kvm_gpc_init_for_vcpu(struct gfn_to_pfn_cache *gpc, + struct kvm_vcpu *vcpu) +{ + __kvm_gpc_init(gpc, vcpu->kvm, vcpu); +} + static inline bool kvm_gpc_is_gpa_active(struct gfn_to_pfn_cache *gpc) { return gpc->active && !kvm_is_error_gpa(gpc->gpa); diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h index 490464c205b4..445170ea23e4 100644 --- a/include/linux/kvm_types.h +++ b/include/linux/kvm_types.h @@ -74,6 +74,7 @@ struct gfn_to_pfn_cache { struct kvm_memory_slot *memslot; struct kvm *kvm; struct list_head list; + struct kvm_vcpu *vcpu; rwlock_t lock; struct mutex refresh_lock; void *khva; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 226faeaa8e56..88de1eac5baf 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -760,7 +760,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, * mn_active_invalidate_count (see above) instead of * mmu_invalidate_in_progress. */ - gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end); + gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end, + hva_range.may_block);
/* * If one or more memslots were found and thus zapped, notify arch code diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h index 31defb08ccba..f1ba02084bd9 100644 --- a/virt/kvm/kvm_mm.h +++ b/virt/kvm/kvm_mm.h @@ -58,11 +58,13 @@ kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *kfp); #ifdef CONFIG_HAVE_KVM_PFNCACHE void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start, - unsigned long end); + unsigned long end, + bool may_block); #else static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start, - unsigned long end) + unsigned long end, + bool may_block) { } #endif /* HAVE_KVM_PFNCACHE */ diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c index 728d2c1b488a..543466ff40a0 100644 --- a/virt/kvm/pfncache.c +++ b/virt/kvm/pfncache.c @@ -23,9 +23,11 @@ * MMU notifier 'invalidate_range_start' hook. */ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start, - unsigned long end) + unsigned long end, bool may_block) { + DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS); struct gfn_to_pfn_cache *gpc; + bool evict_vcpus = false;
spin_lock(&kvm->gpc_lock); list_for_each_entry(gpc, &kvm->gpc_list, list) { @@ -46,8 +48,21 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start,
write_lock_irq(&gpc->lock); if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) && - gpc->uhva >= start && gpc->uhva < end) + gpc->uhva >= start && gpc->uhva < end) { gpc->valid = false; + + /* + * If a guest vCPU could be using the physical address, + * it needs to be forced out of guest mode. + */ + if (gpc->vcpu) { + if (!evict_vcpus) { + evict_vcpus = true; + bitmap_zero(vcpu_bitmap, KVM_MAX_VCPUS); + } + __set_bit(gpc->vcpu->vcpu_idx, vcpu_bitmap); + } + } write_unlock_irq(&gpc->lock); continue; } @@ -55,6 +70,27 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start, read_unlock_irq(&gpc->lock); } spin_unlock(&kvm->gpc_lock); + + if (evict_vcpus) { + /* + * KVM needs to ensure the vCPU is fully out of guest context + * before allowing the invalidation to continue. + */ + unsigned int req = KVM_REQ_OUTSIDE_GUEST_MODE; + bool called; + + /* + * If the OOM reaper is active, then all vCPUs should have + * been stopped already, so perform the request without + * KVM_REQUEST_WAIT and be sad if any needed to be IPI'd. + */ + if (!may_block) + req &= ~KVM_REQUEST_WAIT; + + called = kvm_make_vcpus_request_mask(kvm, req, vcpu_bitmap); + + WARN_ON_ONCE(called && !may_block); + } }
static bool kvm_gpc_is_valid_len(gpa_t gpa, unsigned long uhva, @@ -382,7 +418,7 @@ int kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, unsigned long len) return __kvm_gpc_refresh(gpc, gpc->gpa, uhva); }
-void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm) +void __kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm, struct kvm_vcpu *vcpu) { rwlock_init(&gpc->lock); mutex_init(&gpc->refresh_lock); @@ -391,6 +427,7 @@ void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm) gpc->pfn = KVM_PFN_ERR_FAULT; gpc->gpa = INVALID_GPA; gpc->uhva = KVM_HVA_ERR_BAD; + gpc->vcpu = vcpu; gpc->active = gpc->valid = false; }
From: Fred Griffoul fgriffo@amazon.co.uk
Implement state validation for nested virtualization to enable pfncache support for L1 guest pages.
This adds a new nested_ops callback, is_nested_state_invalid(), that detects when KVM needs to reload nested virtualization state. When invalid state is detected, a KVM_REQ_GET_NESTED_STATE_PAGES request is made so that the affected pages are reloaded before L2 execution. The callback monitors L1 guest pages on the entry/exit path while the vCPU is in IN_GUEST_MODE.
For now, the VMX implementation returns false; full support is added in the next patch.
Signed-off-by: Fred Griffoul fgriffo@amazon.co.uk
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/vmx/nested.c       |  6 ++++++
 arch/x86/kvm/x86.c              | 14 +++++++++++++-
 3 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 48598d017d6f..4675e71b33a7 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1960,6 +1960,7 @@ struct kvm_x86_nested_ops { struct kvm_nested_state __user *user_kvm_nested_state, struct kvm_nested_state *kvm_state); bool (*get_nested_state_pages)(struct kvm_vcpu *vcpu); + bool (*is_nested_state_invalid)(struct kvm_vcpu *vcpu); int (*write_log_dirty)(struct kvm_vcpu *vcpu, gpa_t l2_gpa);
int (*enable_evmcs)(struct kvm_vcpu *vcpu, diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 0de84b30c41d..627a6c24625d 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -3588,6 +3588,11 @@ static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu) return true; }
+static bool vmx_is_nested_state_invalid(struct kvm_vcpu *vcpu) +{ + return false; +} + static int nested_vmx_write_pml_buffer(struct kvm_vcpu *vcpu, gpa_t gpa) { struct vmcs12 *vmcs12; @@ -7527,6 +7532,7 @@ struct kvm_x86_nested_ops vmx_nested_ops = { .get_state = vmx_get_nested_state, .set_state = vmx_set_nested_state, .get_nested_state_pages = vmx_get_nested_state_pages, + .is_nested_state_invalid = vmx_is_nested_state_invalid, .write_log_dirty = nested_vmx_write_pml_buffer, #ifdef CONFIG_KVM_HYPERV .enable_evmcs = nested_enable_evmcs, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4b8138bd4857..1a9c1171df49 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2262,12 +2262,24 @@ int kvm_emulate_monitor(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_emulate_monitor);
+static inline bool kvm_invalid_nested_state(struct kvm_vcpu *vcpu) +{ + if (is_guest_mode(vcpu) && + kvm_x86_ops.nested_ops->is_nested_state_invalid && + kvm_x86_ops.nested_ops->is_nested_state_invalid(vcpu)) { + kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu); + return true; + } + return false; +} + static inline bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu) { xfer_to_guest_mode_prepare();
return READ_ONCE(vcpu->mode) == EXITING_GUEST_MODE || - kvm_request_pending(vcpu) || xfer_to_guest_mode_work_pending(); + kvm_request_pending(vcpu) || xfer_to_guest_mode_work_pending() || + kvm_invalid_nested_state(vcpu); }
static fastpath_t __handle_fastpath_wrmsr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
From: Fred Griffoul fgriffo@amazon.co.uk
Replace kvm_host_map usage with gfn_to_pfn_cache for L1 APIC virtualization pages (APIC access, virtual APIC, and posted interrupt descriptor pages) to improve performance with unmanaged guest memory.
The conversion involves several changes:
- Page loading in nested_get_vmcs12_pages(): load vmcs02 fields with pfncache PFNs after each cache has been checked and, if necessary, activated or refreshed, while the vCPU is in OUTSIDE_GUEST_MODE.
- Invalidation window handling: since nested_get_vmcs12_pages() runs in OUTSIDE_GUEST_MODE, there is a window where caches can be invalidated by MMU notifications before the vCPU enters IN_GUEST_MODE. Implement the is_nested_state_invalid() callback to monitor cache validity across the OUTSIDE_GUEST_MODE to IN_GUEST_MODE transition; it triggers KVM_REQ_GET_NESTED_STATE_PAGES when needed.
- Cache access in event callbacks: the virtual APIC and posted interrupt descriptor pages are accessed by KVM in the has_events() and check_events() nested_ops callbacks. These use the kernel HVA, following the pfncache check/refresh pattern; both callbacks can sleep if a cache refresh is required.
This eliminates expensive memremap/memunmap cycles for each L2 VM entry/exit, providing substantial performance improvements when using unmanaged memory.
Signed-off-by: Fred Griffoul fgriffo@amazon.co.uk
---
 arch/x86/kvm/vmx/nested.c | 169 +++++++++++++++++++++++++++++---------
 arch/x86/kvm/vmx/vmx.h    |   8 +-
 include/linux/kvm_host.h  |   5 ++
 3 files changed, 139 insertions(+), 43 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 627a6c24625d..1f58b380585b 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -329,8 +329,18 @@ static int nested_gpc_lock(struct gfn_to_pfn_cache *gpc, gpa_t gpa) if (!kvm_gpc_check(gpc, PAGE_SIZE) || (gpc->gpa != gpa)) { read_unlock(&gpc->lock); err = kvm_gpc_activate(gpc, gpa, PAGE_SIZE); - if (err) + if (err) { + /* + * Deactivate nested state caches to prevent + * kvm_gpc_invalid() from returning true in subsequent + * is_nested_state_invalid() calls. This prevents an + * infinite loop while entering guest mode. + */ + if (gpc->vcpu) + kvm_gpc_deactivate(gpc); + return err; + }
goto retry; } @@ -343,14 +353,17 @@ static void nested_gpc_unlock(struct gfn_to_pfn_cache *gpc) read_unlock(&gpc->lock); }
-static void nested_put_vmcs12_pages(struct kvm_vcpu *vcpu) +static int nested_gpc_hpa(struct gfn_to_pfn_cache *gpc, gpa_t gpa, hpa_t *hpa) { - struct vcpu_vmx *vmx = to_vmx(vcpu); + int err; + + err = nested_gpc_lock(gpc, gpa); + if (err) + return err;
- kvm_vcpu_unmap(vcpu, &vmx->nested.apic_access_page_map); - kvm_vcpu_unmap(vcpu, &vmx->nested.virtual_apic_map); - kvm_vcpu_unmap(vcpu, &vmx->nested.pi_desc_map); - vmx->nested.pi_desc = NULL; + *hpa = pfn_to_hpa(gpc->pfn); + nested_gpc_unlock(gpc); + return 0; }
/* @@ -373,6 +386,9 @@ static void free_nested(struct kvm_vcpu *vcpu) vmx->nested.smm.vmxon = false; vmx->nested.vmxon_ptr = INVALID_GPA;
+ kvm_gpc_deactivate(&vmx->nested.pi_desc_cache); + kvm_gpc_deactivate(&vmx->nested.virtual_apic_cache); + kvm_gpc_deactivate(&vmx->nested.apic_access_page_cache); kvm_gpc_deactivate(&vmx->nested.msr_bitmap_cache);
free_vpid(vmx->nested.vpid02); @@ -389,8 +405,6 @@ static void free_nested(struct kvm_vcpu *vcpu) kfree(vmx->nested.cached_shadow_vmcs12); vmx->nested.cached_shadow_vmcs12 = NULL;
- nested_put_vmcs12_pages(vcpu); - kvm_mmu_free_roots(vcpu->kvm, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
nested_release_evmcs(vcpu); @@ -3477,7 +3491,8 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) { struct vmcs12 *vmcs12 = get_vmcs12(vcpu); struct vcpu_vmx *vmx = to_vmx(vcpu); - struct kvm_host_map *map; + struct gfn_to_pfn_cache *gpc; + hpa_t hpa;
if (!vcpu->arch.pdptrs_from_userspace && !nested_cpu_has_ept(vmcs12) && is_pae_paging(vcpu)) { @@ -3492,10 +3507,10 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) { - map = &vmx->nested.apic_access_page_map; + gpc = &vmx->nested.apic_access_page_cache;
- if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->apic_access_addr), map)) { - vmcs_write64(APIC_ACCESS_ADDR, pfn_to_hpa(map->pfn)); + if (!nested_gpc_hpa(gpc, vmcs12->apic_access_addr, &hpa)) { + vmcs_write64(APIC_ACCESS_ADDR, hpa); } else { pr_debug_ratelimited("%s: no backing for APIC-access address in vmcs12\n", __func__); @@ -3508,10 +3523,10 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) }
if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) { - map = &vmx->nested.virtual_apic_map; + gpc = &vmx->nested.virtual_apic_cache;
- if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->virtual_apic_page_addr), map)) { - vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, pfn_to_hpa(map->pfn)); + if (!nested_gpc_hpa(gpc, vmcs12->virtual_apic_page_addr, &hpa)) { + vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, hpa); } else if (nested_cpu_has(vmcs12, CPU_BASED_CR8_LOAD_EXITING) && nested_cpu_has(vmcs12, CPU_BASED_CR8_STORE_EXITING) && !nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) { @@ -3534,14 +3549,12 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) }
if (nested_cpu_has_posted_intr(vmcs12)) { - map = &vmx->nested.pi_desc_map; + gpc = &vmx->nested.pi_desc_cache;
- if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->posted_intr_desc_addr), map)) { - vmx->nested.pi_desc = - (struct pi_desc *)(((void *)map->hva) + - offset_in_page(vmcs12->posted_intr_desc_addr)); + if (!nested_gpc_hpa(gpc, vmcs12->posted_intr_desc_addr & PAGE_MASK, &hpa)) { + vmx->nested.pi_desc_offset = offset_in_page(vmcs12->posted_intr_desc_addr); vmcs_write64(POSTED_INTR_DESC_ADDR, - pfn_to_hpa(map->pfn) + offset_in_page(vmcs12->posted_intr_desc_addr)); + hpa + offset_in_page(vmcs12->posted_intr_desc_addr)); } else { /* * Defer the KVM_INTERNAL_EXIT until KVM tries to @@ -3549,7 +3562,6 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) * descriptor. (Note that KVM may do this when it * should not, per the architectural specification.) */ - vmx->nested.pi_desc = NULL; pin_controls_clearbit(vmx, PIN_BASED_POSTED_INTR); } } @@ -3590,7 +3602,16 @@ static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu)
static bool vmx_is_nested_state_invalid(struct kvm_vcpu *vcpu) { - return false; + struct vcpu_vmx *vmx = to_vmx(vcpu); + + /* + * @vcpu is in IN_GUEST_MODE, eliminating the need for individual gpc + * locks. Since kvm_gpc_invalid() doesn't verify gpc memslot + * generation, we can also skip acquiring the srcu lock. + */ + return kvm_gpc_invalid(&vmx->nested.apic_access_page_cache) || + kvm_gpc_invalid(&vmx->nested.virtual_apic_cache) || + kvm_gpc_invalid(&vmx->nested.pi_desc_cache); }
static int nested_vmx_write_pml_buffer(struct kvm_vcpu *vcpu, gpa_t gpa) @@ -4091,9 +4112,55 @@ void nested_mark_vmcs12_pages_dirty(struct kvm_vcpu *vcpu) } }
+static void *nested_gpc_lock_if_active(struct gfn_to_pfn_cache *gpc) +{ +retry: + read_lock(&gpc->lock); + if (!gpc->active) { + read_unlock(&gpc->lock); + return NULL; + } + + if (!kvm_gpc_check(gpc, PAGE_SIZE)) { + read_unlock(&gpc->lock); + if (kvm_gpc_refresh(gpc, PAGE_SIZE)) + return NULL; + goto retry; + } + + return gpc->khva; +} + +static struct pi_desc *nested_lock_pi_desc(struct vcpu_vmx *vmx) +{ + u8 *pi_desc_page; + + pi_desc_page = nested_gpc_lock_if_active(&vmx->nested.pi_desc_cache); + if (!pi_desc_page) + return NULL; + + return (struct pi_desc *)(pi_desc_page + vmx->nested.pi_desc_offset); +} + +static void nested_unlock_pi_desc(struct vcpu_vmx *vmx) +{ + nested_gpc_unlock(&vmx->nested.pi_desc_cache); +} + +static void *nested_lock_vapic(struct vcpu_vmx *vmx) +{ + return nested_gpc_lock_if_active(&vmx->nested.virtual_apic_cache); +} + +static void nested_unlock_vapic(struct vcpu_vmx *vmx) +{ + nested_gpc_unlock(&vmx->nested.virtual_apic_cache); +} + static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); + struct pi_desc *pi_desc; int max_irr; void *vapic_page; u16 status; @@ -4101,22 +4168,29 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) if (!vmx->nested.pi_pending) return 0;
- if (!vmx->nested.pi_desc) + pi_desc = nested_lock_pi_desc(vmx); + if (!pi_desc) goto mmio_needed;
vmx->nested.pi_pending = false;
- if (!pi_test_and_clear_on(vmx->nested.pi_desc)) + if (!pi_test_and_clear_on(pi_desc)) { + nested_unlock_pi_desc(vmx); return 0; + }
- max_irr = pi_find_highest_vector(vmx->nested.pi_desc); + max_irr = pi_find_highest_vector(pi_desc); if (max_irr > 0) { - vapic_page = vmx->nested.virtual_apic_map.hva; - if (!vapic_page) + vapic_page = nested_lock_vapic(vmx); + if (!vapic_page) { + nested_unlock_pi_desc(vmx); goto mmio_needed; + } + + __kvm_apic_update_irr(pi_desc->pir, vapic_page, &max_irr); + + nested_unlock_vapic(vmx);
- __kvm_apic_update_irr(vmx->nested.pi_desc->pir, - vapic_page, &max_irr); status = vmcs_read16(GUEST_INTR_STATUS); if ((u8)max_irr > ((u8)status & 0xff)) { status &= ~0xff; @@ -4125,6 +4199,7 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) } }
+ nested_unlock_pi_desc(vmx); nested_mark_vmcs12_pages_dirty(vcpu); return 0;
@@ -4244,8 +4319,10 @@ static bool nested_vmx_preemption_timer_pending(struct kvm_vcpu *vcpu) static bool vmx_has_nested_events(struct kvm_vcpu *vcpu, bool for_injection) { struct vcpu_vmx *vmx = to_vmx(vcpu); - void *vapic = vmx->nested.virtual_apic_map.hva; + struct pi_desc *pi_desc; int max_irr, vppr; + void *vapic; + bool res = false;
if (nested_vmx_preemption_timer_pending(vcpu) || vmx->nested.mtf_pending) @@ -4264,23 +4341,33 @@ static bool vmx_has_nested_events(struct kvm_vcpu *vcpu, bool for_injection) __vmx_interrupt_blocked(vcpu)) return false;
+ vapic = nested_lock_vapic(vmx); if (!vapic) return false;
vppr = *((u32 *)(vapic + APIC_PROCPRI));
+ nested_unlock_vapic(vmx); + max_irr = vmx_get_rvi(); if ((max_irr & 0xf0) > (vppr & 0xf0)) return true;
- if (vmx->nested.pi_pending && vmx->nested.pi_desc && - pi_test_on(vmx->nested.pi_desc)) { - max_irr = pi_find_highest_vector(vmx->nested.pi_desc); - if (max_irr > 0 && (max_irr & 0xf0) > (vppr & 0xf0)) - return true; + if (vmx->nested.pi_pending) { + pi_desc = nested_lock_pi_desc(vmx); + if (!pi_desc) + return false; + + if (pi_test_on(pi_desc)) { + max_irr = pi_find_highest_vector(pi_desc); + if (max_irr > 0 && (max_irr & 0xf0) > (vppr & 0xf0)) + res = true; + } + + nested_unlock_pi_desc(vmx); }
- return false; + return res; }
/* @@ -5244,7 +5331,7 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason, vmx_update_cpu_dirty_logging(vcpu); }
- nested_put_vmcs12_pages(vcpu); + nested_mark_vmcs12_pages_dirty(vcpu);
if (vmx->nested.reload_vmcs01_apic_access_page) { vmx->nested.reload_vmcs01_apic_access_page = false; @@ -5529,6 +5616,10 @@ static int enter_vmx_operation(struct kvm_vcpu *vcpu)
kvm_gpc_init(&vmx->nested.msr_bitmap_cache, vcpu->kvm);
+ kvm_gpc_init_for_vcpu(&vmx->nested.apic_access_page_cache, vcpu); + kvm_gpc_init_for_vcpu(&vmx->nested.virtual_apic_cache, vcpu); + kvm_gpc_init_for_vcpu(&vmx->nested.pi_desc_cache, vcpu); + vmx->nested.vmcs02_initialized = false; vmx->nested.vmxon = true;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index d76621403c28..9a285834ccda 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -158,11 +158,11 @@ struct nested_vmx { * Guest pages referred to in the vmcs02 with host-physical * pointers, so we must keep them pinned while L2 runs. */ - struct kvm_host_map apic_access_page_map; - struct kvm_host_map virtual_apic_map; - struct kvm_host_map pi_desc_map; + struct gfn_to_pfn_cache apic_access_page_cache; + struct gfn_to_pfn_cache virtual_apic_cache; + struct gfn_to_pfn_cache pi_desc_cache;
- struct pi_desc *pi_desc; + u64 pi_desc_offset; bool pi_pending; u16 posted_intr_nv;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 6253cf1c38c1..b05aace9e295 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1531,6 +1531,11 @@ static inline bool kvm_gpc_is_hva_active(struct gfn_to_pfn_cache *gpc) return gpc->active && kvm_is_error_gpa(gpc->gpa); }
+static inline bool kvm_gpc_invalid(struct gfn_to_pfn_cache *gpc) +{ + return gpc->active && !gpc->valid; +} + void kvm_sigset_activate(struct kvm_vcpu *vcpu); void kvm_sigset_deactivate(struct kvm_vcpu *vcpu);
From: Fred Griffoul fgriffo@amazon.co.uk
Introduce a selftest to verify the nested VMX APIC page cache invalidation and refresh mechanisms of the pfncache implementation.
The test exercises the nested VMX APIC cache invalidation path through:
- L2 guest setup: creates a nested environment where L2 accesses the APIC access page that is cached by KVM using pfncache.
- Cache invalidation triggers: a separate update thread periodically invalidates the cached pages using either:
  - madvise(MADV_DONTNEED) to trigger MMU notifications.
  - vm_mem_region_move() to trigger memslot changes.
The test validates that:
- L2 can successfully access the APIC page before and after invalidation.
- KVM properly handles cache refresh without guest-visible errors.
- Both MMU notification and memslot change invalidation paths work correctly.
Signed-off-by: Fred Griffoul fgriffo@amazon.co.uk
---
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../selftests/kvm/x86/vmx_apic_update_test.c  | 302 ++++++++++++++++++
 2 files changed, 303 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86/vmx_apic_update_test.c
diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index 148d427ff24b..3431568d837e 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -137,6 +137,7 @@ TEST_GEN_PROGS_x86 += x86/max_vcpuid_cap_test TEST_GEN_PROGS_x86 += x86/triple_fault_event_test TEST_GEN_PROGS_x86 += x86/recalc_apic_map_test TEST_GEN_PROGS_x86 += x86/aperfmperf_test +TEST_GEN_PROGS_x86 += x86/vmx_apic_update_test TEST_GEN_PROGS_x86 += access_tracking_perf_test TEST_GEN_PROGS_x86 += coalesced_io_test TEST_GEN_PROGS_x86 += dirty_log_perf_test diff --git a/tools/testing/selftests/kvm/x86/vmx_apic_update_test.c b/tools/testing/selftests/kvm/x86/vmx_apic_update_test.c new file mode 100644 index 000000000000..1b5b69627a01 --- /dev/null +++ b/tools/testing/selftests/kvm/x86/vmx_apic_update_test.c @@ -0,0 +1,302 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * vmx_apic_update_test + * + * Copyright (C) 2025, Amazon.com, Inc. or its affiliates. All Rights Reserved. + * + * Test L2 guest APIC access page writes with concurrent MMU + * notification and memslot move updates. + */ +#include <pthread.h> +#include "test_util.h" +#include "kvm_util.h" +#include "processor.h" +#include "vmx.h" + +#define VAPIC_GPA 0xc0000000 +#define VAPIC_SLOT 1 + +#define L2_GUEST_STACK_SIZE 64 + +#define L2_DELAY (100) + +static void l2_guest_code(void) +{ + uint32_t *vapic_addr = (uint32_t *) (VAPIC_GPA + 0x80); + + /* Unroll the loop to avoid any compiler side effect */ + + WRITE_ONCE(*vapic_addr, 1 << 0); + udelay(msecs_to_usecs(L2_DELAY)); + + WRITE_ONCE(*vapic_addr, 1 << 1); + udelay(msecs_to_usecs(L2_DELAY)); + + WRITE_ONCE(*vapic_addr, 1 << 2); + udelay(msecs_to_usecs(L2_DELAY)); + + WRITE_ONCE(*vapic_addr, 1 << 3); + udelay(msecs_to_usecs(L2_DELAY)); + + WRITE_ONCE(*vapic_addr, 1 << 4); + udelay(msecs_to_usecs(L2_DELAY)); + + WRITE_ONCE(*vapic_addr, 1 << 5); + udelay(msecs_to_usecs(L2_DELAY)); + + WRITE_ONCE(*vapic_addr, 1 << 6); + udelay(msecs_to_usecs(L2_DELAY)); + + WRITE_ONCE(*vapic_addr, 0); + udelay(msecs_to_usecs(L2_DELAY)); + + /* Exit to L1 */ + vmcall(); +} + +static void l1_guest_code(struct vmx_pages *vmx_pages) +{ + unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE]; + uint32_t control, exit_reason; + + GUEST_ASSERT(prepare_for_vmx_operation(vmx_pages)); + GUEST_ASSERT(load_vmcs(vmx_pages)); + prepare_vmcs(vmx_pages, l2_guest_code, + &l2_guest_stack[L2_GUEST_STACK_SIZE]); + + /* Enable APIC access */ + control = vmreadz(CPU_BASED_VM_EXEC_CONTROL); + control |= CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; + vmwrite(CPU_BASED_VM_EXEC_CONTROL, control); + control = vmreadz(SECONDARY_VM_EXEC_CONTROL); + control |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; + vmwrite(SECONDARY_VM_EXEC_CONTROL, control); + vmwrite(APIC_ACCESS_ADDR, VAPIC_GPA); + + GUEST_SYNC1(0); + GUEST_ASSERT(!vmlaunch()); +again: + exit_reason = vmreadz(VM_EXIT_REASON); + if (exit_reason == EXIT_REASON_APIC_ACCESS) { + uint64_t guest_rip = vmreadz(GUEST_RIP); + uint64_t instr_len = vmreadz(VM_EXIT_INSTRUCTION_LEN); + + vmwrite(GUEST_RIP, guest_rip + instr_len); + GUEST_ASSERT(!vmresume()); + goto again; + } + + GUEST_SYNC1(exit_reason); + GUEST_ASSERT(exit_reason == EXIT_REASON_VMCALL); + GUEST_DONE(); +} + +static const char *progname; +static int update_period_ms = L2_DELAY / 4; + +struct update_control { + pthread_mutex_t mutex; + pthread_cond_t start_cond; + struct kvm_vm *vm; + bool running; + bool started; + int updates; +}; + +static void wait_for_start_signal(struct 
update_control *ctrl) +{ + pthread_mutex_lock(&ctrl->mutex); + while (!ctrl->started) + pthread_cond_wait(&ctrl->start_cond, &ctrl->mutex); + + pthread_mutex_unlock(&ctrl->mutex); + printf("%s: starting update\n", progname); +} + +static bool is_running(struct update_control *ctrl) +{ + return READ_ONCE(ctrl->running); +} + +static void set_running(struct update_control *ctrl, bool running) +{ + WRITE_ONCE(ctrl->running, running); +} + +static void signal_thread_start(struct update_control *ctrl) +{ + pthread_mutex_lock(&ctrl->mutex); + if (!ctrl->started) { + ctrl->started = true; + pthread_cond_signal(&ctrl->start_cond); + } + pthread_mutex_unlock(&ctrl->mutex); +} + +static void *update_madvise(void *arg) +{ + struct update_control *ctrl = arg; + void *hva; + + wait_for_start_signal(ctrl); + + hva = addr_gpa2hva(ctrl->vm, VAPIC_GPA); + memset(hva, 0x45, ctrl->vm->page_size); + + while (is_running(ctrl)) { + usleep(update_period_ms * 1000); + madvise(hva, ctrl->vm->page_size, MADV_DONTNEED); + ctrl->updates++; + } + + return NULL; +} + +static void *update_move_memslot(void *arg) +{ + struct update_control *ctrl = arg; + uint64_t gpa = VAPIC_GPA; + + wait_for_start_signal(ctrl); + + while (is_running(ctrl)) { + usleep(update_period_ms * 1000); + gpa += 0x10000; + vm_mem_region_move(ctrl->vm, VAPIC_SLOT, gpa); + ctrl->updates++; + } + + return NULL; +} + +static void run(void * (*update)(void *), const char *name) +{ + struct kvm_vm *vm; + struct kvm_vcpu *vcpu; + struct vmx_pages *vmx; + struct update_control ctrl; + struct ucall uc; + vm_vaddr_t vmx_pages_gva; + pthread_t update_thread; + bool done = false; + + vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code); + + /* Allocate VMX pages */ + vmx = vcpu_alloc_vmx(vm, &vmx_pages_gva); + + /* Allocate memory and create VAPIC memslot */ + vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, VAPIC_GPA, + VAPIC_SLOT, 1, 0); + + /* Allocate guest page table */ + virt_map(vm, VAPIC_GPA, VAPIC_GPA, 1); + + /* Set up nested EPT */ + prepare_eptp(vmx, vm, 0); + nested_map_memslot(vmx, vm, 0); + nested_map_memslot(vmx, vm, VAPIC_SLOT); + nested_map(vmx, vm, VAPIC_GPA, VAPIC_GPA, vm->page_size); + + vcpu_args_set(vcpu, 1, vmx_pages_gva); + + pthread_mutex_init(&ctrl.mutex, NULL); + pthread_cond_init(&ctrl.start_cond, NULL); + ctrl.vm = vm; + ctrl.running = true; + ctrl.started = false; + ctrl.updates = 0; + + pthread_create(&update_thread, NULL, update, &ctrl); + + printf("%s: running %s (tsc_khz %lu)\n", progname, name, guest_tsc_khz); + + while (!done) { + vcpu_run(vcpu); + + switch (vcpu->run->exit_reason) { + case KVM_EXIT_IO: + switch (get_ucall(vcpu, &uc)) { + case UCALL_SYNC: + printf("%s: sync(%ld)\n", progname, uc.args[0]); + if (uc.args[0] == 0) + signal_thread_start(&ctrl); + break; + case UCALL_ABORT: + REPORT_GUEST_ASSERT(uc); + /* NOT REACHED */ + case UCALL_DONE: + done = true; + break; + default: + TEST_ASSERT(false, "Unknown ucall %lu", uc.cmd); + } + break; + case KVM_EXIT_MMIO: + /* Handle APIC MMIO access after memslot move */ + printf + ("%s: APIC MMIO access at 0x%llx (memslot move effect)\n", + progname, vcpu->run->mmio.phys_addr); + break; + default: + TEST_FAIL("%s: Unexpected exit reason: %d (flags 0x%x)", + progname, + vcpu->run->exit_reason, vcpu->run->flags); + } + } + + set_running(&ctrl, false); + if (!ctrl.started) + signal_thread_start(&ctrl); + pthread_join(update_thread, NULL); + printf("%s: completed with %d updates\n", progname, ctrl.updates); + + pthread_mutex_destroy(&ctrl.mutex); + 
pthread_cond_destroy(&ctrl.start_cond); + kvm_vm_free(vm); +} + +int main(int argc, char *argv[]) +{ + int opt_madvise = 0; + int opt_memslot_move = 0; + + TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_VMX)); + TEST_REQUIRE(kvm_cpu_has_ept()); + + if (argc == 1) { + opt_madvise = 1; + opt_memslot_move = 1; + } else { + int opt; + + while ((opt = getopt(argc, argv, "amp:")) != -1) { + switch (opt) { + case 'a': + opt_madvise = 1; + break; + case 'm': + opt_memslot_move = 1; + break; + case 'p': + update_period_ms = atoi(optarg); + break; + default: + exit(1); + } + } + } + + TEST_ASSERT(opt_madvise + || opt_memslot_move, "No update test configured"); + + progname = argv[0]; + + if (opt_madvise) + run(update_madvise, "madvise"); + + if (opt_memslot_move) + run(update_move_memslot, "move memslot"); + + return 0; +}
From: Fred Griffoul fgriffo@amazon.co.uk
Cache enlightened VMCS control fields to prevent TOCTOU races where the guest could modify hv_clean_fields or hv_enlightenments_control between multiple accesses during nested VM-entry.
The cached values ensure consistent behavior across:
- The evmcs-to-vmcs12 copy operations
- MSR bitmap validation
- Clean field checks in prepare_vmcs02_rare()
This eliminates potential guest-induced inconsistencies in nested virtualization state management.
Signed-off-by: Fred Griffoul fgriffo@amazon.co.uk
---
 arch/x86/kvm/vmx/hyperv.c |  5 ++--
 arch/x86/kvm/vmx/hyperv.h | 20 +++++++++++++
 arch/x86/kvm/vmx/nested.c | 62 ++++++++++++++++++++++++---------------
 arch/x86/kvm/vmx/vmx.h    |  5 +++-
 4 files changed, 65 insertions(+), 27 deletions(-)
diff --git a/arch/x86/kvm/vmx/hyperv.c b/arch/x86/kvm/vmx/hyperv.c index fa41d036acd4..961b91b9bd64 100644 --- a/arch/x86/kvm/vmx/hyperv.c +++ b/arch/x86/kvm/vmx/hyperv.c @@ -213,12 +213,11 @@ bool nested_evmcs_l2_tlb_flush_enabled(struct kvm_vcpu *vcpu) { struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu); struct vcpu_vmx *vmx = to_vmx(vcpu); - struct hv_enlightened_vmcs *evmcs = vmx->nested.hv_evmcs;
- if (!hv_vcpu || !evmcs) + if (!hv_vcpu || !nested_vmx_is_evmptr12_valid(vmx)) return false;
- if (!evmcs->hv_enlightenments_control.nested_flush_hypercall) + if (!vmx->nested.hv_flush_hypercall) return false;
return hv_vcpu->vp_assist_page.nested_control.features.directhypercall; diff --git a/arch/x86/kvm/vmx/hyperv.h b/arch/x86/kvm/vmx/hyperv.h index 11a339009781..3c7fea501ca5 100644 --- a/arch/x86/kvm/vmx/hyperv.h +++ b/arch/x86/kvm/vmx/hyperv.h @@ -52,6 +52,16 @@ static inline bool guest_cpu_cap_has_evmcs(struct kvm_vcpu *vcpu) to_vmx(vcpu)->nested.enlightened_vmcs_enabled; }
+static inline u32 nested_evmcs_clean_fields(struct vcpu_vmx *vmx) +{ + return vmx->nested.hv_clean_fields; +} + +static inline bool nested_evmcs_msr_bitmap(struct vcpu_vmx *vmx) +{ + return vmx->nested.hv_msr_bitmap; +} + u64 nested_get_evmptr(struct kvm_vcpu *vcpu); uint16_t nested_get_evmcs_version(struct kvm_vcpu *vcpu); int nested_enable_evmcs(struct kvm_vcpu *vcpu, @@ -85,6 +95,16 @@ static inline struct hv_enlightened_vmcs *nested_vmx_evmcs(struct vcpu_vmx *vmx) { return NULL; } + +static inline u32 nested_evmcs_clean_fields(struct vcpu_vmx *vmx) +{ + return 0; +} + +static inline bool nested_evmcs_msr_bitmap(struct vcpu_vmx *vmx) +{ + return false; +} #endif
#endif /* __KVM_X86_VMX_HYPERV_H */ diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index 1f58b380585b..aec150612818 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -235,6 +235,9 @@ static inline void nested_release_evmcs(struct kvm_vcpu *vcpu) kvm_vcpu_unmap(vcpu, &vmx->nested.hv_evmcs_map); vmx->nested.hv_evmcs = NULL; vmx->nested.hv_evmcs_vmptr = EVMPTR_INVALID; + vmx->nested.hv_clean_fields = 0; + vmx->nested.hv_msr_bitmap = false; + vmx->nested.hv_flush_hypercall = false;
if (hv_vcpu) { hv_vcpu->nested.pa_page_gpa = INVALID_GPA; @@ -686,10 +689,10 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu, * and tells KVM (L0) there were no changes in MSR bitmap for L2. */ if (!vmx->nested.force_msr_bitmap_recalc) { - struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx); - - if (evmcs && evmcs->hv_enlightenments_control.msr_bitmap && - evmcs->hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP) + if (nested_vmx_is_evmptr12_valid(vmx) && + nested_evmcs_msr_bitmap(vmx) && + (nested_evmcs_clean_fields(vmx) + & HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP)) return true; }
@@ -2163,10 +2166,11 @@ static void copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx) * instruction. */ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld( - struct kvm_vcpu *vcpu, bool from_launch) + struct kvm_vcpu *vcpu, bool from_launch, bool copy) { #ifdef CONFIG_KVM_HYPERV struct vcpu_vmx *vmx = to_vmx(vcpu); + struct hv_enlightened_vmcs *evmcs; bool evmcs_gpa_changed = false; u64 evmcs_gpa;
@@ -2246,6 +2250,22 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld( vmx->nested.force_msr_bitmap_recalc = true; }
+ /* Cache evmcs fields to avoid reading evmcs after copy to vmcs12 */ + evmcs = vmx->nested.hv_evmcs; + vmx->nested.hv_clean_fields = evmcs->hv_clean_fields; + vmx->nested.hv_flush_hypercall = evmcs->hv_enlightenments_control.nested_flush_hypercall; + vmx->nested.hv_msr_bitmap = evmcs->hv_enlightenments_control.msr_bitmap; + + if (copy) { + struct vmcs12 *vmcs12 = get_vmcs12(vcpu); + + if (likely(!vmcs12->hdr.shadow_vmcs)) { + copy_enlightened_to_vmcs12(vmx, vmx->nested.hv_clean_fields); + /* Enlightened VMCS doesn't have launch state */ + vmcs12->launch_state = !from_launch; + } + } + return EVMPTRLD_SUCCEEDED; #else return EVMPTRLD_DISABLED; @@ -2613,10 +2633,12 @@ static void vmcs_write_cet_state(struct kvm_vcpu *vcpu, u64 s_cet,
static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12) { - struct hv_enlightened_vmcs *hv_evmcs = nested_vmx_evmcs(vmx); + u32 hv_clean_fields = 0;
- if (!hv_evmcs || !(hv_evmcs->hv_clean_fields & - HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2)) { + if (nested_vmx_is_evmptr12_valid(vmx)) + hv_clean_fields = nested_evmcs_clean_fields(vmx); + + if (!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2)) {
vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector); vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector); @@ -2658,8 +2680,7 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12) vmx_segment_cache_clear(vmx); }
- if (!hv_evmcs || !(hv_evmcs->hv_clean_fields & - HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1)) { + if (!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1)) { vmcs_write32(GUEST_SYSENTER_CS, vmcs12->guest_sysenter_cs); vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, vmcs12->guest_pending_dbg_exceptions); @@ -2750,7 +2771,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, enum vm_entry_failure_code *entry_failure_code) { struct vcpu_vmx *vmx = to_vmx(vcpu); - struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx); + struct hv_enlightened_vmcs *evmcs; bool load_guest_pdptrs_vmcs12 = false;
if (vmx->nested.dirty_vmcs12 || nested_vmx_is_evmptr12_valid(vmx)) { @@ -2758,7 +2779,8 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, vmx->nested.dirty_vmcs12 = false;
load_guest_pdptrs_vmcs12 = !nested_vmx_is_evmptr12_valid(vmx) || - !(evmcs->hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1); + !(nested_evmcs_clean_fields(vmx) + & HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1); }
if (vmx->nested.nested_run_pending && @@ -2887,7 +2909,8 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, * bits when it changes a field in eVMCS. Mark all fields as clean * here. */ - if (nested_vmx_is_evmptr12_valid(vmx)) + evmcs = nested_vmx_evmcs(vmx); + if (evmcs) evmcs->hv_clean_fields |= HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
return 0; @@ -3470,7 +3493,7 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu) if (guest_cpu_cap_has_evmcs(vcpu) && vmx->nested.hv_evmcs_vmptr == EVMPTR_MAP_PENDING) { enum nested_evmptrld_status evmptrld_status = - nested_vmx_handle_enlightened_vmptrld(vcpu, false); + nested_vmx_handle_enlightened_vmptrld(vcpu, false, false);
if (evmptrld_status == EVMPTRLD_VMFAIL || evmptrld_status == EVMPTRLD_ERROR) @@ -3864,7 +3887,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch) if (!nested_vmx_check_permission(vcpu)) return 1;
- evmptrld_status = nested_vmx_handle_enlightened_vmptrld(vcpu, launch); + evmptrld_status = nested_vmx_handle_enlightened_vmptrld(vcpu, launch, true); if (evmptrld_status == EVMPTRLD_ERROR) { kvm_queue_exception(vcpu, UD_VECTOR); return 1; @@ -3890,15 +3913,8 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch) if (CC(vmcs12->hdr.shadow_vmcs)) return nested_vmx_failInvalid(vcpu);
- if (nested_vmx_is_evmptr12_valid(vmx)) { - struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx); - - copy_enlightened_to_vmcs12(vmx, evmcs->hv_clean_fields); - /* Enlightened VMCS doesn't have launch state */ - vmcs12->launch_state = !launch; - } else if (enable_shadow_vmcs) { + if (!nested_vmx_is_evmptr12_valid(vmx) && enable_shadow_vmcs) copy_shadow_to_vmcs12(vmx); - }
/* * The nested entry process starts with enforcing various prerequisites diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index 9a285834ccda..87708af502f3 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -205,8 +205,11 @@ struct nested_vmx {
#ifdef CONFIG_KVM_HYPERV gpa_t hv_evmcs_vmptr; - struct kvm_host_map hv_evmcs_map; + u32 hv_clean_fields; + bool hv_msr_bitmap; + bool hv_flush_hypercall; struct hv_enlightened_vmcs *hv_evmcs; + struct kvm_host_map hv_evmcs_map; #endif };
From: Fred Griffoul fgriffo@amazon.co.uk
Replace the eVMCS kvm_host_map with a gfn_to_pfn_cache to properly handle memslot changes and unify with other pfncaches in nVMX.
The change introduces proper locking/unlocking semantics for eVMCS access through nested_lock_evmcs() and nested_unlock_evmcs() helpers.
Signed-off-by: Fred Griffoul fgriffo@amazon.co.uk
---
 arch/x86/kvm/vmx/hyperv.h |  21 ++++----
 arch/x86/kvm/vmx/nested.c | 109 ++++++++++++++++++++++++++------------
 arch/x86/kvm/vmx/vmx.h    |   3 +-
 3 files changed, 88 insertions(+), 45 deletions(-)
diff --git a/arch/x86/kvm/vmx/hyperv.h b/arch/x86/kvm/vmx/hyperv.h index 3c7fea501ca5..3b6fcf8dff64 100644 --- a/arch/x86/kvm/vmx/hyperv.h +++ b/arch/x86/kvm/vmx/hyperv.h @@ -37,11 +37,6 @@ static inline bool nested_vmx_is_evmptr12_set(struct vcpu_vmx *vmx) return evmptr_is_set(vmx->nested.hv_evmcs_vmptr); }
-static inline struct hv_enlightened_vmcs *nested_vmx_evmcs(struct vcpu_vmx *vmx) -{ - return vmx->nested.hv_evmcs; -} - static inline bool guest_cpu_cap_has_evmcs(struct kvm_vcpu *vcpu) { /* @@ -70,6 +65,8 @@ void nested_evmcs_filter_control_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 * int nested_evmcs_check_controls(struct vmcs12 *vmcs12); bool nested_evmcs_l2_tlb_flush_enabled(struct kvm_vcpu *vcpu); void vmx_hv_inject_synthetic_vmexit_post_tlb_flush(struct kvm_vcpu *vcpu); +struct hv_enlightened_vmcs *nested_lock_evmcs(struct vcpu_vmx *vmx); +void nested_unlock_evmcs(struct vcpu_vmx *vmx); #else static inline bool evmptr_is_valid(u64 evmptr) { @@ -91,11 +88,6 @@ static inline bool nested_vmx_is_evmptr12_set(struct vcpu_vmx *vmx) return false; }
-static inline struct hv_enlightened_vmcs *nested_vmx_evmcs(struct vcpu_vmx *vmx) -{ - return NULL; -} - static inline u32 nested_evmcs_clean_fields(struct vcpu_vmx *vmx) { return 0; @@ -105,6 +97,15 @@ static inline bool nested_evmcs_msr_bitmap(struct vcpu_vmx *vmx) { return false; } + +static inline struct hv_enlightened_vmcs *nested_lock_evmcs(struct vcpu_vmx *vmx) +{ + return NULL; +} + +static inline void nested_unlock_evmcs(struct vcpu_vmx *vmx) +{ +} #endif
#endif /* __KVM_X86_VMX_HYPERV_H */ diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index aec150612818..d910508e3c22 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -232,8 +232,6 @@ static inline void nested_release_evmcs(struct kvm_vcpu *vcpu) struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu); struct vcpu_vmx *vmx = to_vmx(vcpu);
- kvm_vcpu_unmap(vcpu, &vmx->nested.hv_evmcs_map); - vmx->nested.hv_evmcs = NULL; vmx->nested.hv_evmcs_vmptr = EVMPTR_INVALID; vmx->nested.hv_clean_fields = 0; vmx->nested.hv_msr_bitmap = false; @@ -265,7 +263,7 @@ static bool nested_evmcs_handle_vmclear(struct kvm_vcpu *vcpu, gpa_t vmptr) !evmptr_is_valid(nested_get_evmptr(vcpu))) return false;
- if (nested_vmx_evmcs(vmx) && vmptr == vmx->nested.hv_evmcs_vmptr) + if (vmptr == vmx->nested.hv_evmcs_vmptr) nested_release_evmcs(vcpu);
return true; @@ -393,6 +391,9 @@ static void free_nested(struct kvm_vcpu *vcpu) kvm_gpc_deactivate(&vmx->nested.virtual_apic_cache); kvm_gpc_deactivate(&vmx->nested.apic_access_page_cache); kvm_gpc_deactivate(&vmx->nested.msr_bitmap_cache); +#ifdef CONFIG_KVM_HYPERV + kvm_gpc_deactivate(&vmx->nested.hv_evmcs_cache); +#endif
free_vpid(vmx->nested.vpid02); vmx->nested.posted_intr_nv = -1; @@ -1735,11 +1736,12 @@ static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx) vmcs_load(vmx->loaded_vmcs->vmcs); }
-static void copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, u32 hv_clean_fields) +static void copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, + struct hv_enlightened_vmcs *evmcs, + u32 hv_clean_fields) { #ifdef CONFIG_KVM_HYPERV struct vmcs12 *vmcs12 = vmx->nested.cached_vmcs12; - struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx); struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(&vmx->vcpu);
/* HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE */ @@ -1987,7 +1989,7 @@ static void copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx) { #ifdef CONFIG_KVM_HYPERV struct vmcs12 *vmcs12 = vmx->nested.cached_vmcs12; - struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx); + struct hv_enlightened_vmcs *evmcs = nested_lock_evmcs(vmx);
/* * Should not be changed by KVM: @@ -2155,6 +2157,7 @@ static void copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx)
evmcs->guest_bndcfgs = vmcs12->guest_bndcfgs;
+ nested_unlock_evmcs(vmx); return; #else /* CONFIG_KVM_HYPERV */ KVM_BUG_ON(1, vmx->vcpu.kvm); @@ -2171,6 +2174,8 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld( #ifdef CONFIG_KVM_HYPERV struct vcpu_vmx *vmx = to_vmx(vcpu); struct hv_enlightened_vmcs *evmcs; + struct gfn_to_pfn_cache *gpc; + enum nested_evmptrld_status status = EVMPTRLD_SUCCEEDED; bool evmcs_gpa_changed = false; u64 evmcs_gpa;
@@ -2183,17 +2188,19 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld( return EVMPTRLD_DISABLED; }
+ gpc = &vmx->nested.hv_evmcs_cache; + if (nested_gpc_lock(gpc, evmcs_gpa)) { + nested_release_evmcs(vcpu); + return EVMPTRLD_ERROR; + } + + evmcs = gpc->khva; + if (unlikely(evmcs_gpa != vmx->nested.hv_evmcs_vmptr)) { vmx->nested.current_vmptr = INVALID_GPA;
nested_release_evmcs(vcpu);
- if (kvm_vcpu_map(vcpu, gpa_to_gfn(evmcs_gpa), - &vmx->nested.hv_evmcs_map)) - return EVMPTRLD_ERROR; - - vmx->nested.hv_evmcs = vmx->nested.hv_evmcs_map.hva; - /* * Currently, KVM only supports eVMCS version 1 * (== KVM_EVMCS_VERSION) and thus we expect guest to set this @@ -2216,10 +2223,11 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld( * eVMCS version or VMCS12 revision_id as valid values for first * u32 field of eVMCS. */ - if ((vmx->nested.hv_evmcs->revision_id != KVM_EVMCS_VERSION) && - (vmx->nested.hv_evmcs->revision_id != VMCS12_REVISION)) { + if ((evmcs->revision_id != KVM_EVMCS_VERSION) && + (evmcs->revision_id != VMCS12_REVISION)) { nested_release_evmcs(vcpu); - return EVMPTRLD_VMFAIL; + status = EVMPTRLD_VMFAIL; + goto unlock; }
vmx->nested.hv_evmcs_vmptr = evmcs_gpa; @@ -2244,14 +2252,11 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld( * between different L2 guests as KVM keeps a single VMCS12 per L1. */ if (from_launch || evmcs_gpa_changed) { - vmx->nested.hv_evmcs->hv_clean_fields &= - ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL; - + evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL; vmx->nested.force_msr_bitmap_recalc = true; }
/* Cache evmcs fields to avoid reading evmcs after copy to vmcs12 */ - evmcs = vmx->nested.hv_evmcs; vmx->nested.hv_clean_fields = evmcs->hv_clean_fields; vmx->nested.hv_flush_hypercall = evmcs->hv_enlightenments_control.nested_flush_hypercall; vmx->nested.hv_msr_bitmap = evmcs->hv_enlightenments_control.msr_bitmap; @@ -2260,13 +2265,15 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld( struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
if (likely(!vmcs12->hdr.shadow_vmcs)) { - copy_enlightened_to_vmcs12(vmx, vmx->nested.hv_clean_fields); + copy_enlightened_to_vmcs12(vmx, evmcs, vmx->nested.hv_clean_fields); /* Enlightened VMCS doesn't have launch state */ vmcs12->launch_state = !from_launch; } }
- return EVMPTRLD_SUCCEEDED; +unlock: + nested_gpc_unlock(gpc); + return status; #else return EVMPTRLD_DISABLED; #endif @@ -2771,7 +2778,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, enum vm_entry_failure_code *entry_failure_code) { struct vcpu_vmx *vmx = to_vmx(vcpu); - struct hv_enlightened_vmcs *evmcs; bool load_guest_pdptrs_vmcs12 = false;
if (vmx->nested.dirty_vmcs12 || nested_vmx_is_evmptr12_valid(vmx)) { @@ -2909,9 +2915,13 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, * bits when it changes a field in eVMCS. Mark all fields as clean * here. */ - evmcs = nested_vmx_evmcs(vmx); - if (evmcs) + if (nested_vmx_is_evmptr12_valid(vmx)) { + struct hv_enlightened_vmcs *evmcs; + + evmcs = nested_lock_evmcs(vmx); evmcs->hv_clean_fields |= HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL; + nested_unlock_evmcs(vmx); + }
return 0; } @@ -4147,6 +4157,18 @@ static void *nested_gpc_lock_if_active(struct gfn_to_pfn_cache *gpc) return gpc->khva; }
+#ifdef CONFIG_KVM_HYPERV +struct hv_enlightened_vmcs *nested_lock_evmcs(struct vcpu_vmx *vmx) +{ + return nested_gpc_lock_if_active(&vmx->nested.hv_evmcs_cache); +} + +void nested_unlock_evmcs(struct vcpu_vmx *vmx) +{ + nested_gpc_unlock(&vmx->nested.hv_evmcs_cache); +} +#endif + static struct pi_desc *nested_lock_pi_desc(struct vcpu_vmx *vmx) { u8 *pi_desc_page; @@ -5636,6 +5658,9 @@ static int enter_vmx_operation(struct kvm_vcpu *vcpu) kvm_gpc_init_for_vcpu(&vmx->nested.virtual_apic_cache, vcpu); kvm_gpc_init_for_vcpu(&vmx->nested.pi_desc_cache, vcpu);
+#ifdef CONFIG_KVM_HYPERV + kvm_gpc_init(&vmx->nested.hv_evmcs_cache, vcpu->kvm); +#endif vmx->nested.vmcs02_initialized = false; vmx->nested.vmxon = true;
@@ -5887,6 +5912,8 @@ static int handle_vmread(struct kvm_vcpu *vcpu) /* Read the field, zero-extended to a u64 value */ value = vmcs12_read_any(vmcs12, field, offset); } else { + struct hv_enlightened_vmcs *evmcs; + /* * Hyper-V TLFS (as of 6.0b) explicitly states, that while an * enlightened VMCS is active VMREAD/VMWRITE instructions are @@ -5905,7 +5932,9 @@ static int handle_vmread(struct kvm_vcpu *vcpu) return nested_vmx_fail(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
/* Read the field, zero-extended to a u64 value */ - value = evmcs_read_any(nested_vmx_evmcs(vmx), field, offset); + evmcs = nested_lock_evmcs(vmx); + value = evmcs_read_any(evmcs, field, offset); + nested_unlock_evmcs(vmx); }
/* @@ -6935,6 +6964,27 @@ bool nested_vmx_reflect_vmexit(struct kvm_vcpu *vcpu) return true; }
+static void vmx_get_enlightened_to_vmcs12(struct vcpu_vmx *vmx) +{ +#ifdef CONFIG_KVM_HYPERV + struct hv_enlightened_vmcs *evmcs; + struct kvm_vcpu *vcpu = &vmx->vcpu; + + kvm_vcpu_srcu_read_lock(vcpu); + evmcs = nested_lock_evmcs(vmx); + /* + * L1 hypervisor is not obliged to keep eVMCS + * clean fields data always up-to-date while + * not in guest mode, 'hv_clean_fields' is only + * supposed to be actual upon vmentry so we need + * to ignore it here and do full copy. + */ + copy_enlightened_to_vmcs12(vmx, evmcs, 0); + nested_unlock_evmcs(vmx); + kvm_vcpu_srcu_read_unlock(vcpu); +#endif /* CONFIG_KVM_HYPERV */ +} + static int vmx_get_nested_state(struct kvm_vcpu *vcpu, struct kvm_nested_state __user *user_kvm_nested_state, u32 user_data_size) @@ -7025,14 +7075,7 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu, copy_vmcs02_to_vmcs12_rare(vcpu, get_vmcs12(vcpu)); if (!vmx->nested.need_vmcs12_to_shadow_sync) { if (nested_vmx_is_evmptr12_valid(vmx)) - /* - * L1 hypervisor is not obliged to keep eVMCS - * clean fields data always up-to-date while - * not in guest mode, 'hv_clean_fields' is only - * supposed to be actual upon vmentry so we need - * to ignore it here and do full copy. - */ - copy_enlightened_to_vmcs12(vmx, 0); + vmx_get_enlightened_to_vmcs12(vmx); else if (enable_shadow_vmcs) copy_shadow_to_vmcs12(vmx); } diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index 87708af502f3..4da5a42b0c60 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -208,8 +208,7 @@ struct nested_vmx { u32 hv_clean_fields; bool hv_msr_bitmap; bool hv_flush_hypercall; - struct hv_enlightened_vmcs *hv_evmcs; - struct kvm_host_map hv_evmcs_map; + struct gfn_to_pfn_cache hv_evmcs_cache; #endif };
From: Fred Griffoul fgriffo@amazon.co.uk
Add infrastructure to persist nested virtualization state when L2 vCPUs are switched on an L1 vCPU or migrated between L1 vCPUs.
The nested context table uses a hash table for fast lookup by nested control block GPA (VMPTR for VMX, VMCB for SVM) and maintains an LRU list of detached contexts so they can be recycled.
The kvm_nested_context_load() function looks up a context by the target GPA; if none is found, it allocates a new context, up to the configured maximum. At capacity, it recycles the least recently used context from the LRU list.
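Stripped of locking, statistics and the lock-dropping allocation retry, the lookup path reduces to the sketch below. The helpers find_by_gpa(), recycle_lru() and alloc_and_insert() are illustrative names, not part of the patch; the real logic is in kvm_nested_context_load() in the diff further down.

/* Simplified sketch of kvm_nested_context_load(); helper names are illustrative. */
static struct kvm_nested_context *load_sketch(struct kvm_nested_context_table *table,
					      struct kvm_vcpu *vcpu, gpa_t gpa)
{
	struct kvm_nested_context *ctx = find_by_gpa(table, gpa);	/* hash lookup */

	if (!ctx) {
		unsigned int max = KVM_NESTED_OVERSUB_RATIO *
				   atomic_read(&vcpu->kvm->online_vcpus);

		if (table->count >= max)
			ctx = recycle_lru(table);	/* evict and reset the LRU entry */
		else
			ctx = alloc_and_insert(table, vcpu, gpa);
	}

	if (ctx)
		ctx->vcpu = vcpu;	/* attached until kvm_nested_context_clear() */
	return ctx;
}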
The oversubscription ratio is hardcoded to 8, so the table holds up to 8 cached L2 contexts per online L1 vCPU.
The kvm_nested_context_clear() function detaches the context from its vCPU and moves it to the LRU list, while keeping it in the hash table for potential reuse.
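A vendor implementation is expected to pair the two calls around the lifecycle of its nested control block. A minimal sketch, modelled on the nVMX wiring added later in this series (the vendor_* helpers are illustrative, not part of the patch):

/* Illustrative pairing of the context API; not literal patch code. */
static void vendor_attach_context(struct kvm_vcpu *vcpu, gpa_t control_block_gpa)
{
	struct kvm_nested_context *ctx;

	ctx = kvm_nested_context_load(vcpu, control_block_gpa);
	if (!ctx)
		return;	/* allocation failed; keep using the current caches */

	/* Point the vcpu's pfn-cache pointers at the caches embedded in ctx. */
}

static void vendor_detach_context(struct kvm_vcpu *vcpu, gpa_t control_block_gpa)
{
	/* The context parks on the LRU list; its pfn caches stay populated. */
	kvm_nested_context_clear(vcpu, control_block_gpa);
}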
This allows nested hypervisors to multiplex multiple L2 vCPUs on L1 vCPUs without losing cached nested state, significantly improving performance for workloads with frequent L2 context switches.
This patch adds the basic infrastructure. Subsequent patches will add the nested VMX and SVM specific support to populate and utilize the cached nested state.
Signed-off-by: Fred Griffoul fgriffo@amazon.co.uk --- arch/x86/include/asm/kvm_host.h | 31 +++++ arch/x86/include/uapi/asm/kvm.h | 2 + arch/x86/kvm/Makefile | 2 +- arch/x86/kvm/nested.c | 199 ++++++++++++++++++++++++++++++++ arch/x86/kvm/x86.c | 5 +- 5 files changed, 237 insertions(+), 2 deletions(-) create mode 100644 arch/x86/kvm/nested.c
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 4675e71b33a7..75f3cd82a073 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1379,6 +1379,28 @@ enum kvm_mmu_type { KVM_NR_MMU_TYPES, };
+struct kvm_nested_context { + gpa_t gpa; + struct hlist_node hnode; + struct list_head lru_link; + struct kvm_vcpu *vcpu; +}; + +struct kvm_nested_context_table { + spinlock_t lock; + u32 count; + struct list_head lru_list; + DECLARE_HASHTABLE(hash, 8); +}; + +void kvm_nested_context_clear(struct kvm_vcpu *vcpu, gpa_t gpa); +struct kvm_nested_context *kvm_nested_context_load( + struct kvm_vcpu *vcpu, + gpa_t gpa); + +int kvm_nested_context_table_init(struct kvm *kvm); +void kvm_nested_context_table_destroy(struct kvm *kvm); + struct kvm_arch { unsigned long n_used_mmu_pages; unsigned long n_requested_mmu_pages; @@ -1618,6 +1640,9 @@ struct kvm_arch { * current VM. */ int cpu_dirty_log_size; + + /* Cache for nested contexts */ + struct kvm_nested_context_table *nested_context_table; };
struct kvm_vm_stat { @@ -1640,6 +1665,8 @@ struct kvm_vm_stat { u64 nx_lpage_splits; u64 max_mmu_page_hash_collisions; u64 max_mmu_rmap_size; + u64 nested_context_recycle; + u64 nested_context_reuse; };
struct kvm_vcpu_stat { @@ -1967,6 +1994,10 @@ struct kvm_x86_nested_ops { uint16_t *vmcs_version); uint16_t (*get_evmcs_version)(struct kvm_vcpu *vcpu); void (*hv_inject_synthetic_vmexit_post_tlb_flush)(struct kvm_vcpu *vcpu); + + struct kvm_nested_context *(*alloc_context)(struct kvm_vcpu *vcpu); + void (*free_context)(struct kvm_nested_context *ctx); + void (*reset_context)(struct kvm_nested_context *ctx); };
struct kvm_x86_init_ops { diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index d420c9c066d4..637ed9286f8e 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -1042,4 +1042,6 @@ struct kvm_tdx_init_mem_region { __u64 nr_pages; };
+#define KVM_NESTED_OVERSUB_RATIO 8 + #endif /* _ASM_X86_KVM_H */ diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index c4b8950c7abe..2a5289cb5bd1 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -6,7 +6,7 @@ ccflags-$(CONFIG_KVM_WERROR) += -Werror include $(srctree)/virt/kvm/Makefile.kvm
kvm-y += x86.o emulate.o irq.o lapic.o cpuid.o pmu.o mtrr.o \ - debugfs.o mmu/mmu.o mmu/page_track.o mmu/spte.o + debugfs.o nested.o mmu/mmu.o mmu/page_track.o mmu/spte.o
kvm-$(CONFIG_X86_64) += mmu/tdp_iter.o mmu/tdp_mmu.o kvm-$(CONFIG_KVM_IOAPIC) += i8259.o i8254.o ioapic.o diff --git a/arch/x86/kvm/nested.c b/arch/x86/kvm/nested.c new file mode 100644 index 000000000000..6e4e95567427 --- /dev/null +++ b/arch/x86/kvm/nested.c @@ -0,0 +1,199 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include <linux/kvm_host.h> + +static struct kvm_nested_context_table *kvm_nested_context_table_alloc(void) +{ + struct kvm_nested_context_table *table; + + table = kzalloc(sizeof(*table), GFP_KERNEL_ACCOUNT); + if (!table) + return NULL; + + spin_lock_init(&table->lock); + INIT_LIST_HEAD(&table->lru_list); + hash_init(table->hash); + return table; +} + +static void kvm_nested_context_table_free(struct kvm_nested_context_table + *table) +{ + kfree(table); +} + +int kvm_nested_context_table_init(struct kvm *kvm) +{ + struct kvm_nested_context_table *table; + + if (!kvm_x86_ops.nested_ops->alloc_context || + !kvm_x86_ops.nested_ops->free_context || + !kvm_x86_ops.nested_ops->reset_context) + return -EINVAL; + + table = kvm_nested_context_table_alloc(); + if (!table) + return -ENOMEM; + + kvm->arch.nested_context_table = table; + return 0; +} + +void kvm_nested_context_table_destroy(struct kvm *kvm) +{ + struct kvm_nested_context_table *table; + struct kvm_nested_context *ctx; + struct hlist_node *tmp; + int bkt; + + table = kvm->arch.nested_context_table; + if (!table) + return; + + hash_for_each_safe(table->hash, bkt, tmp, ctx, hnode) { + hash_del(&ctx->hnode); + kvm_x86_ops.nested_ops->free_context(ctx); + } + + kvm_nested_context_table_free(table); +} + +static unsigned int kvm_nested_context_max(struct kvm *kvm) +{ + return KVM_NESTED_OVERSUB_RATIO * atomic_read(&kvm->online_vcpus); +} + +static struct kvm_nested_context *__kvm_nested_context_find(struct kvm_nested_context_table + *table, gpa_t gpa) +{ + struct kvm_nested_context *ctx; + + hash_for_each_possible(table->hash, ctx, hnode, gpa) { + if (ctx->gpa == gpa) + return ctx; + } + + return NULL; +} + +static struct kvm_nested_context *kvm_nested_context_find(struct + kvm_nested_context_table + *table, + struct kvm_vcpu *vcpu, + gpa_t gpa) +{ + struct kvm_nested_context *ctx; + + ctx = __kvm_nested_context_find(table, gpa); + if (!ctx) + return NULL; + + WARN_ON_ONCE(ctx->vcpu && ctx->vcpu != vcpu); + + /* Remove from the LRU list if not attached to a vcpu */ + if (!ctx->vcpu) + list_del(&ctx->lru_link); + + return ctx; +} + +static struct kvm_nested_context *kvm_nested_context_recycle(struct + kvm_nested_context_table + *table) +{ + struct kvm_nested_context *ctx; + + if (list_empty(&table->lru_list)) + return NULL; + + ctx = + list_first_entry(&table->lru_list, struct kvm_nested_context, + lru_link); + list_del(&ctx->lru_link); + hash_del(&ctx->hnode); + return ctx; +} + +static void kvm_nested_context_insert(struct kvm_nested_context_table *table, + struct kvm_nested_context *ctx, gpa_t gpa) +{ + hash_add(table->hash, &ctx->hnode, gpa); + ctx->gpa = gpa; +} + +struct kvm_nested_context *kvm_nested_context_load(struct kvm_vcpu *vcpu, + gpa_t gpa) +{ + struct kvm_nested_context_table *table; + struct kvm_nested_context *ctx, *new_ctx = NULL; + struct kvm *vm = vcpu->kvm; + bool reset = false; + + table = vcpu->kvm->arch.nested_context_table; + if (WARN_ON_ONCE(!table)) + return false; +retry: + spin_lock(&table->lock); + ctx = kvm_nested_context_find(table, vcpu, gpa); + if (!ctx) { + /* At capacity? 
Recycle the LRU context */ + if (table->count >= kvm_nested_context_max(vcpu->kvm)) { + ctx = kvm_nested_context_recycle(table); + if (unlikely(!ctx)) + goto finish; + + kvm_nested_context_insert(table, ctx, gpa); + ++vm->stat.nested_context_recycle; + reset = true; + + } else if (new_ctx) { + ++table->count; + ctx = new_ctx; + kvm_nested_context_insert(table, ctx, gpa); + new_ctx = NULL; + + } else { + /* Allocate a new context without holding the lock */ + spin_unlock(&table->lock); + new_ctx = kvm_x86_ops.nested_ops->alloc_context(vcpu); + if (unlikely(!new_ctx)) + return NULL; + + goto retry; + } + } else + ++vm->stat.nested_context_reuse; + + ctx->vcpu = vcpu; +finish: + spin_unlock(&table->lock); + + if (new_ctx) + kvm_x86_ops.nested_ops->free_context(new_ctx); + + if (reset) + kvm_x86_ops.nested_ops->reset_context(ctx); + + return ctx; +} + +void kvm_nested_context_clear(struct kvm_vcpu *vcpu, gpa_t gpa) +{ + struct kvm_nested_context_table *table; + struct kvm_nested_context *ctx; + + table = vcpu->kvm->arch.nested_context_table; + if (WARN_ON_ONCE(!table)) + return; + + spin_lock(&table->lock); + ctx = __kvm_nested_context_find(table, gpa); + if (ctx && ctx->vcpu) { + /* + * Move to LRU list but keep it in the hash table for possible future + * reuse. + */ + list_add_tail(&ctx->lru_link, &table->lru_list); + ctx->vcpu = NULL; + } + spin_unlock(&table->lock); +} diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1a9c1171df49..db13b1921aff 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -255,7 +255,9 @@ const struct _kvm_stats_desc kvm_vm_stats_desc[] = { STATS_DESC_ICOUNTER(VM, pages_1g), STATS_DESC_ICOUNTER(VM, nx_lpage_splits), STATS_DESC_PCOUNTER(VM, max_mmu_rmap_size), - STATS_DESC_PCOUNTER(VM, max_mmu_page_hash_collisions) + STATS_DESC_PCOUNTER(VM, max_mmu_page_hash_collisions), + STATS_DESC_COUNTER(VM, nested_context_recycle), + STATS_DESC_COUNTER(VM, nested_context_reuse) };
const struct kvm_stats_header kvm_vm_stats_header = { @@ -13311,6 +13313,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm) kvm_page_track_cleanup(kvm); kvm_xen_destroy_vm(kvm); kvm_hv_destroy_vm(kvm); + kvm_nested_context_table_destroy(kvm); kvm_x86_call(vm_destroy)(kvm); }
From: Fred Griffoul fgriffo@amazon.co.uk
Extend the nested context infrastructure to preserve the nested VMX gfn_to_pfn_cache objects, using the kvm_nested_context_load() and kvm_nested_context_clear() functions.
The VMX nested context stores gfn_to_pfn_cache structs for: - MSR permission bitmaps - APIC access page - Virtual APIC page - Posted interrupt descriptor - Enlightened VMCS
For traditional nested VMX, these pfn caches are loaded when 'vmptrld' emulation runs and the context is cleared when 'vmclear' is emulated. This follows the normal L2 vCPU migration sequence of 'vmclear/vmptrld/vmlaunch'.
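From L1's point of view this is just the standard VMX sequence. A minimal sketch using the selftest helpers from patch 10 (illustrative; in practice the vmclear runs on the source L1 vCPU and the vmptrld/vmlaunch on the target, shown together here for brevity):

/* Each instruction below is emulated by L0, which is where the nested
 * context is detached and re-attached.
 */
static void l1_migrate_l2(struct vmx_pages *vmx)
{
	GUEST_ASSERT(!vmclear(vmx->vmcs_gpa));	/* source: context parked on the LRU list */
	GUEST_ASSERT(!vmptrld(vmx->vmcs_gpa));	/* target: cached context re-attached */
	GUEST_ASSERT(!vmlaunch());		/* pfn caches are already populated */
}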
For enlightened VMCS (eVMCS) support, both functions are called when a change of the eVMCS GPA is detected, ensuring proper context management for Hyper-V nested scenarios.
By preserving the gfn_to_pfn_cache objects across L2 context switches, we avoid costly cache refresh operations, significantly improving nested virtualization performance for workloads that frequently multiplex L2 vCPUs on an L1 vCPU or migrate L2 vCPUs between L1 vCPUs.
Signed-off-by: Fred Griffoul fgriffo@amazon.co.uk --- arch/x86/kvm/vmx/nested.c | 155 +++++++++++++++++++++++++++++--------- arch/x86/kvm/vmx/vmx.c | 8 ++ arch/x86/kvm/vmx/vmx.h | 10 +-- include/linux/kvm_host.h | 2 +- 4 files changed, 134 insertions(+), 41 deletions(-)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c index d910508e3c22..69c3bcb325f1 100644 --- a/arch/x86/kvm/vmx/nested.c +++ b/arch/x86/kvm/vmx/nested.c @@ -226,6 +226,93 @@ static void vmx_disable_shadow_vmcs(struct vcpu_vmx *vmx) vmx->nested.need_vmcs12_to_shadow_sync = false; }
+struct vmx_nested_context { + struct kvm_nested_context base; + struct gfn_to_pfn_cache msr_bitmap_cache; + struct gfn_to_pfn_cache apic_access_page_cache; + struct gfn_to_pfn_cache virtual_apic_cache; + struct gfn_to_pfn_cache pi_desc_cache; +#ifdef CONFIG_KVM_HYPERV + struct gfn_to_pfn_cache evmcs_cache; +#endif +}; + +static inline struct vmx_nested_context *to_vmx_nested_context( + struct kvm_nested_context *base) +{ + return base ? container_of(base, struct vmx_nested_context, base) : NULL; +} + +static struct kvm_nested_context *vmx_nested_context_alloc(struct kvm_vcpu *vcpu) +{ + struct vmx_nested_context *ctx; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL_ACCOUNT); + if (!ctx) + return NULL; + + kvm_gpc_init(&ctx->msr_bitmap_cache, vcpu->kvm); + kvm_gpc_init_for_vcpu(&ctx->apic_access_page_cache, vcpu); + kvm_gpc_init_for_vcpu(&ctx->virtual_apic_cache, vcpu); + kvm_gpc_init_for_vcpu(&ctx->pi_desc_cache, vcpu); +#ifdef CONFIG_KVM_HYPERV + kvm_gpc_init(&ctx->evmcs_cache, vcpu->kvm); +#endif + return &ctx->base; +} + +static void vmx_nested_context_reset(struct kvm_nested_context *base) +{ + /* + * Skip pfncache reinitialization: active ones will be refreshed on + * access. + */ +} + +static void vmx_nested_context_free(struct kvm_nested_context *base) +{ + struct vmx_nested_context *ctx = to_vmx_nested_context(base); + + kvm_gpc_deactivate(&ctx->pi_desc_cache); + kvm_gpc_deactivate(&ctx->virtual_apic_cache); + kvm_gpc_deactivate(&ctx->apic_access_page_cache); + kvm_gpc_deactivate(&ctx->msr_bitmap_cache); +#ifdef CONFIG_KVM_HYPERV + kvm_gpc_deactivate(&ctx->evmcs_cache); +#endif + kfree(ctx); +} + +static void vmx_nested_context_load(struct vcpu_vmx *vmx, gpa_t vmptr) +{ + struct vmx_nested_context *ctx; + + ctx = to_vmx_nested_context(kvm_nested_context_load(&vmx->vcpu, vmptr)); + if (!ctx) { + /* + * The cache could not be allocated. In the unlikely case of no + * available memory, an error will be returned to L1 when + * mapping the vmcs12 pages. More likely the current pfncaches + * will be reused (and refreshed since their GPAs do not + * match). + */ + return; + } + + vmx->nested.msr_bitmap_cache = &ctx->msr_bitmap_cache; + vmx->nested.apic_access_page_cache = &ctx->apic_access_page_cache; + vmx->nested.virtual_apic_cache = &ctx->virtual_apic_cache; + vmx->nested.pi_desc_cache = &ctx->pi_desc_cache; +#ifdef CONFIG_KVM_HYPERV + vmx->nested.hv_evmcs_cache = &ctx->evmcs_cache; +#endif +} + +static void vmx_nested_context_clear(struct vcpu_vmx *vmx, gpa_t vmptr) +{ + kvm_nested_context_clear(&vmx->vcpu, vmptr); +} + static inline void nested_release_evmcs(struct kvm_vcpu *vcpu) { #ifdef CONFIG_KVM_HYPERV @@ -325,6 +412,9 @@ static int nested_gpc_lock(struct gfn_to_pfn_cache *gpc, gpa_t gpa)
if (!PAGE_ALIGNED(gpa)) return -EINVAL; + + if (WARN_ON_ONCE(!gpc)) + return -ENOENT; retry: read_lock(&gpc->lock); if (!kvm_gpc_check(gpc, PAGE_SIZE) || (gpc->gpa != gpa)) { @@ -387,14 +477,6 @@ static void free_nested(struct kvm_vcpu *vcpu) vmx->nested.smm.vmxon = false; vmx->nested.vmxon_ptr = INVALID_GPA;
- kvm_gpc_deactivate(&vmx->nested.pi_desc_cache); - kvm_gpc_deactivate(&vmx->nested.virtual_apic_cache); - kvm_gpc_deactivate(&vmx->nested.apic_access_page_cache); - kvm_gpc_deactivate(&vmx->nested.msr_bitmap_cache); -#ifdef CONFIG_KVM_HYPERV - kvm_gpc_deactivate(&vmx->nested.hv_evmcs_cache); -#endif - free_vpid(vmx->nested.vpid02); vmx->nested.posted_intr_nv = -1; vmx->nested.current_vmptr = INVALID_GPA; @@ -697,7 +779,7 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu, return true; }
- gpc = &vmx->nested.msr_bitmap_cache; + gpc = vmx->nested.msr_bitmap_cache; if (nested_gpc_lock(gpc, vmcs12->msr_bitmap)) return false;
@@ -2188,7 +2270,13 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld( return EVMPTRLD_DISABLED; }
- gpc = &vmx->nested.hv_evmcs_cache; + if (evmcs_gpa != vmx->nested.hv_evmcs_vmptr) { + vmx_nested_context_clear(vmx, vmx->nested.hv_evmcs_vmptr); + vmx_nested_context_load(vmx, evmcs_gpa); + evmcs_gpa_changed = true; + } + + gpc = vmx->nested.hv_evmcs_cache; if (nested_gpc_lock(gpc, evmcs_gpa)) { nested_release_evmcs(vcpu); return EVMPTRLD_ERROR; @@ -2196,9 +2284,8 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
evmcs = gpc->khva;
- if (unlikely(evmcs_gpa != vmx->nested.hv_evmcs_vmptr)) { + if (evmcs_gpa_changed) { vmx->nested.current_vmptr = INVALID_GPA; - nested_release_evmcs(vcpu);
/* @@ -2232,7 +2319,6 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
vmx->nested.hv_evmcs_vmptr = evmcs_gpa;
- evmcs_gpa_changed = true; /* * Unlike normal vmcs12, enlightened vmcs12 is not fully * reloaded from guest's memory (read only fields, fields not @@ -3540,7 +3626,7 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) { - gpc = &vmx->nested.apic_access_page_cache; + gpc = vmx->nested.apic_access_page_cache;
if (!nested_gpc_hpa(gpc, vmcs12->apic_access_addr, &hpa)) { vmcs_write64(APIC_ACCESS_ADDR, hpa); @@ -3556,7 +3642,7 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) }
if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) { - gpc = &vmx->nested.virtual_apic_cache; + gpc = vmx->nested.virtual_apic_cache;
if (!nested_gpc_hpa(gpc, vmcs12->virtual_apic_page_addr, &hpa)) { vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, hpa); @@ -3582,7 +3668,7 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) }
if (nested_cpu_has_posted_intr(vmcs12)) { - gpc = &vmx->nested.pi_desc_cache; + gpc = vmx->nested.pi_desc_cache;
if (!nested_gpc_hpa(gpc, vmcs12->posted_intr_desc_addr & PAGE_MASK, &hpa)) { vmx->nested.pi_desc_offset = offset_in_page(vmcs12->posted_intr_desc_addr); @@ -3642,9 +3728,9 @@ static bool vmx_is_nested_state_invalid(struct kvm_vcpu *vcpu) * locks. Since kvm_gpc_invalid() doesn't verify gpc memslot * generation, we can also skip acquiring the srcu lock. */ - return kvm_gpc_invalid(&vmx->nested.apic_access_page_cache) || - kvm_gpc_invalid(&vmx->nested.virtual_apic_cache) || - kvm_gpc_invalid(&vmx->nested.pi_desc_cache); + return kvm_gpc_invalid(vmx->nested.apic_access_page_cache) || + kvm_gpc_invalid(vmx->nested.virtual_apic_cache) || + kvm_gpc_invalid(vmx->nested.pi_desc_cache); }
static int nested_vmx_write_pml_buffer(struct kvm_vcpu *vcpu, gpa_t gpa) @@ -4140,6 +4226,8 @@ void nested_mark_vmcs12_pages_dirty(struct kvm_vcpu *vcpu)
static void *nested_gpc_lock_if_active(struct gfn_to_pfn_cache *gpc) { + if (!gpc) + return NULL; retry: read_lock(&gpc->lock); if (!gpc->active) { @@ -4160,12 +4248,12 @@ static void *nested_gpc_lock_if_active(struct gfn_to_pfn_cache *gpc) #ifdef CONFIG_KVM_HYPERV struct hv_enlightened_vmcs *nested_lock_evmcs(struct vcpu_vmx *vmx) { - return nested_gpc_lock_if_active(&vmx->nested.hv_evmcs_cache); + return nested_gpc_lock_if_active(vmx->nested.hv_evmcs_cache); }
void nested_unlock_evmcs(struct vcpu_vmx *vmx) { - nested_gpc_unlock(&vmx->nested.hv_evmcs_cache); + nested_gpc_unlock(vmx->nested.hv_evmcs_cache); } #endif
@@ -4173,7 +4261,7 @@ static struct pi_desc *nested_lock_pi_desc(struct vcpu_vmx *vmx) { u8 *pi_desc_page;
- pi_desc_page = nested_gpc_lock_if_active(&vmx->nested.pi_desc_cache); + pi_desc_page = nested_gpc_lock_if_active(vmx->nested.pi_desc_cache); if (!pi_desc_page) return NULL;
@@ -4182,17 +4270,17 @@ static struct pi_desc *nested_lock_pi_desc(struct vcpu_vmx *vmx)
static void nested_unlock_pi_desc(struct vcpu_vmx *vmx) { - nested_gpc_unlock(&vmx->nested.pi_desc_cache); + nested_gpc_unlock(vmx->nested.pi_desc_cache); }
static void *nested_lock_vapic(struct vcpu_vmx *vmx) { - return nested_gpc_lock_if_active(&vmx->nested.virtual_apic_cache); + return nested_gpc_lock_if_active(vmx->nested.virtual_apic_cache); }
static void nested_unlock_vapic(struct vcpu_vmx *vmx) { - nested_gpc_unlock(&vmx->nested.virtual_apic_cache); + nested_gpc_unlock(vmx->nested.virtual_apic_cache); }
static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) @@ -5651,16 +5739,6 @@ static int enter_vmx_operation(struct kvm_vcpu *vcpu) HRTIMER_MODE_ABS_PINNED);
vmx->nested.vpid02 = allocate_vpid(); - - kvm_gpc_init(&vmx->nested.msr_bitmap_cache, vcpu->kvm); - - kvm_gpc_init_for_vcpu(&vmx->nested.apic_access_page_cache, vcpu); - kvm_gpc_init_for_vcpu(&vmx->nested.virtual_apic_cache, vcpu); - kvm_gpc_init_for_vcpu(&vmx->nested.pi_desc_cache, vcpu); - -#ifdef CONFIG_KVM_HYPERV - kvm_gpc_init(&vmx->nested.hv_evmcs_cache, vcpu->kvm); -#endif vmx->nested.vmcs02_initialized = false; vmx->nested.vmxon = true;
@@ -5856,6 +5934,8 @@ static int handle_vmclear(struct kvm_vcpu *vcpu) &zero, sizeof(zero)); }
+ vmx_nested_context_clear(vmx, vmptr); + return nested_vmx_succeed(vcpu); }
@@ -6100,6 +6180,8 @@ static void set_current_vmptr(struct vcpu_vmx *vmx, gpa_t vmptr) } vmx->nested.dirty_vmcs12 = true; vmx->nested.force_msr_bitmap_recalc = true; + + vmx_nested_context_load(vmx, vmptr); }
/* Emulate the VMPTRLD instruction */ @@ -7689,4 +7771,7 @@ struct kvm_x86_nested_ops vmx_nested_ops = { .get_evmcs_version = nested_get_evmcs_version, .hv_inject_synthetic_vmexit_post_tlb_flush = vmx_hv_inject_synthetic_vmexit_post_tlb_flush, #endif + .alloc_context = vmx_nested_context_alloc, + .free_context = vmx_nested_context_free, + .reset_context = vmx_nested_context_reset, }; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 546272a5d34d..30b13241ae45 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7666,6 +7666,14 @@ int vmx_vm_init(struct kvm *kvm)
if (enable_pml) kvm->arch.cpu_dirty_log_size = PML_LOG_NR_ENTRIES; + + if (nested) { + int err; + + err = kvm_nested_context_table_init(kvm); + if (err) + return err; + } return 0; }
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index 4da5a42b0c60..56b96e50290f 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -152,15 +152,15 @@ struct nested_vmx {
struct loaded_vmcs vmcs02;
- struct gfn_to_pfn_cache msr_bitmap_cache; + struct gfn_to_pfn_cache *msr_bitmap_cache;
/* * Guest pages referred to in the vmcs02 with host-physical * pointers, so we must keep them pinned while L2 runs. */ - struct gfn_to_pfn_cache apic_access_page_cache; - struct gfn_to_pfn_cache virtual_apic_cache; - struct gfn_to_pfn_cache pi_desc_cache; + struct gfn_to_pfn_cache *apic_access_page_cache; + struct gfn_to_pfn_cache *virtual_apic_cache; + struct gfn_to_pfn_cache *pi_desc_cache;
u64 pi_desc_offset; bool pi_pending; @@ -208,7 +208,7 @@ struct nested_vmx { u32 hv_clean_fields; bool hv_msr_bitmap; bool hv_flush_hypercall; - struct gfn_to_pfn_cache hv_evmcs_cache; + struct gfn_to_pfn_cache *hv_evmcs_cache; #endif };
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index b05aace9e295..97e0b949e412 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1533,7 +1533,7 @@ static inline bool kvm_gpc_is_hva_active(struct gfn_to_pfn_cache *gpc)
static inline bool kvm_gpc_invalid(struct gfn_to_pfn_cache *gpc) { - return gpc->active && !gpc->valid; + return gpc && gpc->active && !gpc->valid; }
void kvm_sigset_activate(struct kvm_vcpu *vcpu);
From: Fred Griffoul fgriffo@amazon.co.uk
Add a selftest to validate nested VMX context switching between multiple L2 vCPUs running on the same L1 vCPU. The test exercises both the direct VMX interface (vmptrld/vmclear operations) and the enlightened VMCS (eVMCS) interface for Hyper-V nested scenarios.
The test creates multiple VMCS structures and switches between them, then verifies that the nested_context KVM statistics (the reuse and recycle counters) match the expected values for the given number of L2 vCPUs and number of switches.
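For reference, the values that check_stats() asserts can be written as a small standalone helper. This mirrors the logic of the test below; OVERSUB stands in for KVM_NESTED_OVERSUB_RATIO (8), which is also the table capacity here since the test runs a single L1 vCPU.

#include <stdbool.h>
#include <stdint.h>

#define OVERSUB 8	/* KVM_NESTED_OVERSUB_RATIO, one L1 vCPU */

/* Expected nested_context counters, mirroring check_stats() below. */
static void expected_counters(unsigned int nr_l2, unsigned int nr_switches,
			      bool sched_only,
			      uint64_t *reuse, uint64_t *recycle)
{
	if (nr_l2 <= OVERSUB) {
		*reuse = (uint64_t)nr_l2 * (nr_switches - 1);
		*recycle = 0;
	} else if (sched_only) {
		/* Nothing is vmcleared, so only the cached contexts keep being reused. */
		*reuse = (uint64_t)OVERSUB * (nr_switches - 1);
		*recycle = 0;
	} else {
		/* Round-robin vmclear/vmptrld recycles every context in LRU order. */
		*reuse = 0;
		*recycle = (uint64_t)nr_l2 * nr_switches - OVERSUB;
	}
}

When run without arguments the test sweeps 2 to 16 L2 vCPUs with 4 switch rounds for each variant; -c and -s select a single configuration, -r switches the plain VMX variant from the scheduling-only flow to the migrate flow, and -v disables VPID use.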
Signed-off-by: Fred Griffoul fgriffo@amazon.co.uk --- tools/testing/selftests/kvm/Makefile.kvm | 1 + .../selftests/kvm/x86/vmx_l2_switch_test.c | 416 ++++++++++++++++++ 2 files changed, 417 insertions(+) create mode 100644 tools/testing/selftests/kvm/x86/vmx_l2_switch_test.c
diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index 3431568d837e..5d47afa5789b 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -138,6 +138,7 @@ TEST_GEN_PROGS_x86 += x86/triple_fault_event_test TEST_GEN_PROGS_x86 += x86/recalc_apic_map_test TEST_GEN_PROGS_x86 += x86/aperfmperf_test TEST_GEN_PROGS_x86 += x86/vmx_apic_update_test +TEST_GEN_PROGS_x86 += x86/vmx_l2_switch_test TEST_GEN_PROGS_x86 += access_tracking_perf_test TEST_GEN_PROGS_x86 += coalesced_io_test TEST_GEN_PROGS_x86 += dirty_log_perf_test diff --git a/tools/testing/selftests/kvm/x86/vmx_l2_switch_test.c b/tools/testing/selftests/kvm/x86/vmx_l2_switch_test.c new file mode 100644 index 000000000000..5ec0da2f8386 --- /dev/null +++ b/tools/testing/selftests/kvm/x86/vmx_l2_switch_test.c @@ -0,0 +1,416 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Test nested VMX context switching between multiple VMCS + */ + +#include "test_util.h" +#include "kvm_util.h" +#include "processor.h" +#include "vmx.h" + +#define L2_GUEST_STACK_SIZE 64 +#define L2_VCPU_MAX 16 + +struct l2_vcpu_config { + vm_vaddr_t hv_pages_gva; /* Guest VA for eVMCS */ + vm_vaddr_t vmx_pages_gva; /* Guest VA for VMX pages */ + unsigned long stack[L2_GUEST_STACK_SIZE]; + uint16_t vpid; +}; + +struct l1_test_config { + struct l2_vcpu_config l2_vcpus[L2_VCPU_MAX]; + uint64_t hypercall_gpa; + uint32_t nr_l2_vcpus; + uint32_t nr_switches; + bool enable_vpid; + bool use_evmcs; + bool sched_only; +}; + +static void l2_guest(void) +{ + while (1) + vmcall(); +} + +static void run_l2_guest_evmcs(struct hyperv_test_pages *hv_pages, + struct vmx_pages *vmx, + void *guest_rip, + void *guest_rsp, + uint16_t vpid) +{ + GUEST_ASSERT(load_evmcs(hv_pages)); + prepare_vmcs(vmx, guest_rip, guest_rsp); + current_evmcs->hv_enlightenments_control.msr_bitmap = 1; + vmwrite(VIRTUAL_PROCESSOR_ID, vpid); + + GUEST_ASSERT(!vmlaunch()); + GUEST_ASSERT_EQ(vmreadz(VM_EXIT_REASON), EXIT_REASON_VMCALL); + current_evmcs->guest_rip += 3; /* vmcall */ + + GUEST_ASSERT(!vmresume()); + GUEST_ASSERT_EQ(vmreadz(VM_EXIT_REASON), EXIT_REASON_VMCALL); +} + +static void run_l2_guest_vmx_migrate(struct vmx_pages *vmx, + void *guest_rip, + void *guest_rsp, + uint16_t vpid, + bool start) +{ + uint32_t control; + + /* + * Emulate L2 vCPU migration: vmptrld/vmlaunch/vmclear + */ + + if (start) + GUEST_ASSERT(load_vmcs(vmx)); + else + GUEST_ASSERT(!vmptrld(vmx->vmcs_gpa)); + + prepare_vmcs(vmx, guest_rip, guest_rsp); + + control = vmreadz(CPU_BASED_VM_EXEC_CONTROL); + control |= CPU_BASED_USE_MSR_BITMAPS; + vmwrite(CPU_BASED_VM_EXEC_CONTROL, control); + vmwrite(VIRTUAL_PROCESSOR_ID, vpid); + + GUEST_ASSERT(!vmlaunch()); + GUEST_ASSERT_EQ(vmreadz(VM_EXIT_REASON), EXIT_REASON_VMCALL); + + GUEST_ASSERT(vmptrstz() == vmx->vmcs_gpa); + GUEST_ASSERT(!vmclear(vmx->vmcs_gpa)); +} + +static void run_l2_guest_vmx_sched(struct vmx_pages *vmx, + void *guest_rip, + void *guest_rsp, + uint16_t vpid, + bool start) +{ + /* + * Emulate L2 vCPU multiplexing: vmptrld/vmresume + */ + + if (start) { + uint32_t control; + + GUEST_ASSERT(load_vmcs(vmx)); + prepare_vmcs(vmx, guest_rip, guest_rsp); + + control = vmreadz(CPU_BASED_VM_EXEC_CONTROL); + control |= CPU_BASED_USE_MSR_BITMAPS; + vmwrite(CPU_BASED_VM_EXEC_CONTROL, control); + vmwrite(VIRTUAL_PROCESSOR_ID, vpid); + + GUEST_ASSERT(!vmlaunch()); + } else { + GUEST_ASSERT(!vmptrld(vmx->vmcs_gpa)); + GUEST_ASSERT(!vmresume()); + } + + GUEST_ASSERT_EQ(vmreadz(VM_EXIT_REASON), 
EXIT_REASON_VMCALL); + + vmwrite(GUEST_RIP, + vmreadz(GUEST_RIP) + vmreadz(VM_EXIT_INSTRUCTION_LEN)); +} + +static void l1_guest_evmcs(struct l1_test_config *config) +{ + struct hyperv_test_pages *hv_pages; + struct vmx_pages *vmx_pages; + uint32_t i, j; + + /* Initialize Hyper-V MSRs */ + wrmsr(HV_X64_MSR_GUEST_OS_ID, HYPERV_LINUX_OS_ID); + wrmsr(HV_X64_MSR_HYPERCALL, config->hypercall_gpa); + + /* Enable VP assist page */ + hv_pages = (struct hyperv_test_pages *)config->l2_vcpus[0].hv_pages_gva; + enable_vp_assist(hv_pages->vp_assist_gpa, hv_pages->vp_assist); + + /* Enable evmcs */ + evmcs_enable(); + + vmx_pages = (struct vmx_pages *)config->l2_vcpus[0].vmx_pages_gva; + GUEST_ASSERT(prepare_for_vmx_operation(vmx_pages)); + + for (i = 0; i < config->nr_switches; i++) { + for (j = 0; j < config->nr_l2_vcpus; j++) { + struct l2_vcpu_config *l2 = &config->l2_vcpus[j]; + + hv_pages = (struct hyperv_test_pages *)l2->hv_pages_gva; + vmx_pages = (struct vmx_pages *)l2->vmx_pages_gva; + + run_l2_guest_evmcs(hv_pages, vmx_pages, l2_guest, + &l2->stack[L2_GUEST_STACK_SIZE], + l2->vpid); + } + } + + GUEST_DONE(); +} + +static void l1_guest_vmx(struct l1_test_config *config) +{ + struct vmx_pages *vmx_pages; + uint32_t i, j; + + vmx_pages = (struct vmx_pages *)config->l2_vcpus[0].vmx_pages_gva; + GUEST_ASSERT(prepare_for_vmx_operation(vmx_pages)); + + for (i = 0; i < config->nr_switches; i++) { + for (j = 0; j < config->nr_l2_vcpus; j++) { + struct l2_vcpu_config *l2 = &config->l2_vcpus[j]; + + vmx_pages = (struct vmx_pages *)l2->vmx_pages_gva; + + if (config->sched_only) + run_l2_guest_vmx_sched(vmx_pages, l2_guest, + &l2->stack[L2_GUEST_STACK_SIZE], + l2->vpid, i == 0); + else + run_l2_guest_vmx_migrate(vmx_pages, l2_guest, + &l2->stack[L2_GUEST_STACK_SIZE], + l2->vpid, i == 0); + } + } + + if (config->sched_only) { + for (j = 0; j < config->nr_l2_vcpus; j++) { + struct l2_vcpu_config *l2 = &config->l2_vcpus[j]; + + vmx_pages = (struct vmx_pages *)l2->vmx_pages_gva; + vmclear(vmx_pages->vmcs_gpa); + } + } + + GUEST_DONE(); +} + +static void vcpu_clone_hyperv_test_pages(struct kvm_vm *vm, + vm_vaddr_t src_gva, + vm_vaddr_t *dst_gva) +{ + struct hyperv_test_pages *src, *dst; + vm_vaddr_t evmcs_gva; + + *dst_gva = vm_vaddr_alloc_page(vm); + + src = addr_gva2hva(vm, src_gva); + dst = addr_gva2hva(vm, *dst_gva); + memcpy(dst, src, sizeof(*dst)); + + /* Allocate a new evmcs page */ + evmcs_gva = vm_vaddr_alloc_page(vm); + dst->enlightened_vmcs = (void *)evmcs_gva; + dst->enlightened_vmcs_hva = addr_gva2hva(vm, evmcs_gva); + dst->enlightened_vmcs_gpa = addr_gva2gpa(vm, evmcs_gva); +} + +static void prepare_vcpu(struct kvm_vm *vm, struct kvm_vcpu *vcpu, + uint32_t nr_l2_vcpus, uint32_t nr_switches, + bool enable_vpid, bool use_evmcs, + bool sched_only) +{ + vm_vaddr_t config_gva; + struct l1_test_config *config; + vm_vaddr_t hypercall_page_gva = 0; + uint32_t i; + + TEST_ASSERT(nr_l2_vcpus <= L2_VCPU_MAX, + "Too many L2 vCPUs: %u (max %u)", nr_l2_vcpus, L2_VCPU_MAX); + + /* Allocate config structure in guest memory */ + config_gva = vm_vaddr_alloc(vm, sizeof(*config), 0x1000); + config = addr_gva2hva(vm, config_gva); + memset(config, 0, sizeof(*config)); + + if (use_evmcs) { + /* Allocate hypercall page */ + hypercall_page_gva = vm_vaddr_alloc_page(vm); + memset(addr_gva2hva(vm, hypercall_page_gva), 0, getpagesize()); + config->hypercall_gpa = addr_gva2gpa(vm, hypercall_page_gva); + + /* Enable Hyper-V enlightenments */ + vcpu_set_hv_cpuid(vcpu); + vcpu_enable_evmcs(vcpu); + } + + /* Allocate resources 
for each L2 vCPU */ + for (i = 0; i < nr_l2_vcpus; i++) { + vm_vaddr_t vmx_pages_gva; + + /* Allocate VMX pages (needed for both VMX and eVMCS) */ + vcpu_alloc_vmx(vm, &vmx_pages_gva); + config->l2_vcpus[i].vmx_pages_gva = vmx_pages_gva; + + if (use_evmcs) { + vm_vaddr_t hv_pages_gva; + + /* Allocate or clone hyperv_test_pages */ + if (i == 0) { + vcpu_alloc_hyperv_test_pages(vm, &hv_pages_gva); + } else { + vm_vaddr_t first_hv_gva = + config->l2_vcpus[0].hv_pages_gva; + vcpu_clone_hyperv_test_pages(vm, first_hv_gva, + &hv_pages_gva); + } + config->l2_vcpus[i].hv_pages_gva = hv_pages_gva; + } + + /* Set VPID */ + config->l2_vcpus[i].vpid = enable_vpid ? (i + 3) : 0; + } + + config->nr_l2_vcpus = nr_l2_vcpus; + config->nr_switches = nr_switches; + config->enable_vpid = enable_vpid; + config->use_evmcs = use_evmcs; + config->sched_only = use_evmcs ? false : sched_only; + + /* Pass single pointer to config structure */ + vcpu_args_set(vcpu, 1, config_gva); + + if (use_evmcs) + vcpu_set_msr(vcpu, HV_X64_MSR_VP_INDEX, vcpu->id); +} + +static bool opt_enable_vpid = true; +static const char *progname; + +static void check_stats(struct kvm_vm *vm, + uint32_t nr_l2_vcpus, + uint32_t nr_switches, + bool use_evmcs, + bool sched_only) +{ + uint64_t reuse = 0; + uint64_t recycle = 0; + + reuse = vm_get_stat(vm, nested_context_reuse); + recycle = vm_get_stat(vm, nested_context_recycle); + + if (nr_l2_vcpus <= KVM_NESTED_OVERSUB_RATIO) { + GUEST_ASSERT_EQ(reuse, nr_l2_vcpus * (nr_switches - 1)); + GUEST_ASSERT_EQ(recycle, 0); + } else { + if (sched_only) { + /* + * When scheduling only no L2 vCPU vmcs is cleared so + * we reuse up to the max. number of contexts, but we + * cannot recycle any of them. + */ + GUEST_ASSERT_EQ(reuse, + KVM_NESTED_OVERSUB_RATIO * + (nr_switches - 1)); + GUEST_ASSERT_EQ(recycle, 0); + } else { + /* + * When migration we cycle in LRU order so no context + * can be reused they are all recycled. + */ + GUEST_ASSERT_EQ(reuse, 0); + GUEST_ASSERT_EQ(recycle, + (nr_l2_vcpus * nr_switches) - + KVM_NESTED_OVERSUB_RATIO); + } + } + + printf("%s %u switches with %u L2 vCPUS (%s) reuse %" PRIu64 + " recycle %" PRIu64 "\n", progname, nr_switches, nr_l2_vcpus, + use_evmcs ? "evmcs" : (sched_only ? "vmx sched" : "vmx migrate"), + reuse, recycle); +} + +static void run_test(uint32_t nr_l2_vcpus, uint32_t nr_switches, + bool use_evmcs, bool sched_only) +{ + struct kvm_vcpu *vcpu; + struct kvm_vm *vm; + struct ucall uc; + + vm = vm_create_with_one_vcpu(&vcpu, use_evmcs + ? 
l1_guest_evmcs : l1_guest_vmx); + + prepare_vcpu(vm, vcpu, nr_l2_vcpus, nr_switches, + opt_enable_vpid, use_evmcs, sched_only); + + for (;;) { + vcpu_run(vcpu); + TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO); + + switch (get_ucall(vcpu, &uc)) { + case UCALL_DONE: + goto done; + case UCALL_ABORT: + REPORT_GUEST_ASSERT(uc); + default: + TEST_FAIL("Unexpected ucall: %lu", uc.cmd); + } + } + +done: + check_stats(vm, nr_l2_vcpus, nr_switches, use_evmcs, sched_only); + kvm_vm_free(vm); +} + +int main(int argc, char *argv[]) +{ + uint32_t opt_nr_l2_vcpus = 0; + uint32_t opt_nr_switches = 0; + bool opt_sched_only = true; + int opt; + int i; + + TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_VMX)); + + progname = argv[0]; + + while ((opt = getopt(argc, argv, "c:rs:v")) != -1) { + switch (opt) { + case 'c': + opt_nr_l2_vcpus = atoi_paranoid(optarg); + break; + case 'r': + opt_sched_only = false; + break; + case 's': + opt_nr_switches = atoi_paranoid(optarg); + break; + case 'v': + opt_enable_vpid = false; + break; + default: + break; + } + } + + if (opt_nr_l2_vcpus && opt_nr_switches) { + run_test(opt_nr_l2_vcpus, opt_nr_switches, false, + opt_sched_only); + + if (kvm_has_cap(KVM_CAP_HYPERV_ENLIGHTENED_VMCS)) + run_test(opt_nr_l2_vcpus, opt_nr_switches, + true, false); + } else { + /* VMX vmlaunch */ + for (i = 2; i <= 16; i++) + run_test(i, 4, false, false); + + /* VMX vmresume */ + for (i = 2; i <= 16; i++) + run_test(i, 4, false, true); + + /* eVMCS */ + if (kvm_has_cap(KVM_CAP_HYPERV_ENLIGHTENED_VMCS)) { + for (i = 2; i <= 16; i++) + run_test(i, 4, true, false); + } + } + + return 0; +}