The upcoming Idle HLT Intercept feature allows HLT execution by a vCPU to be intercepted by the hypervisor only if there are no pending V_INTR or V_NMI events for the vCPU. When the vCPU is expected to service pending V_INTR or V_NMI events, the Idle HLT intercept does not trigger. The feature lets the hypervisor determine whether the vCPU is actually idle and reduces wasteful VMEXITs.
Presence of the Idle HLT Intercept feature is indicated via CPUID function Fn8000_000A_EDX[30].
Documentation for the Idle HLT intercept feature is available at [1].
[1]: AMD64 Architecture Programmer's Manual Pub. 24593, April 2024, Vol 2, 15.9 Instruction Intercepts (Table 15-7: IDLE_HLT). https://bugzilla.kernel.org/attachment.cgi?id=306250
Testing Done:
- Added a selftest to test the Idle HLT intercept functionality.
- Compile and functionality testing for the Idle HLT intercept selftest are done only for x86_64.
- Tested SEV and SEV-ES guests for the Idle HLT intercept functionality.
v2 -> v3:
- Incorporated Andrew's suggestion to structure vcpu_stat_types in a way that each architecture can share the generic types and also provide its own.
v1 -> v2:
- Reworked svm_idle_hlt_test based on the review comments from Sean.
- Added an enum-based approach to get binary stats in vcpu_get_stat(), which doesn't use a string to look up stat data, based on the comments from Sean.
- Added safe_halt() and cli() helpers based on the comments from Sean.
Manali Shukla (5):
  x86/cpufeatures: Add CPUID feature bit for Idle HLT intercept
  KVM: SVM: Add Idle HLT intercept support
  KVM: selftests: Add safe_halt() and cli() helpers to common code
  KVM: selftests: Add an interface to read the data of named vcpu stat
  KVM: selftests: KVM: SVM: Add Idle HLT intercept test
 arch/x86/include/asm/cpufeatures.h            |  1 +
 arch/x86/include/asm/svm.h                    |  1 +
 arch/x86/include/uapi/asm/svm.h               |  2 +
 arch/x86/kvm/svm/svm.c                        | 11 ++-
 tools/testing/selftests/kvm/Makefile          |  1 +
 .../testing/selftests/kvm/include/kvm_util.h  | 44 +++++++++
 .../kvm/include/x86_64/kvm_util_arch.h        | 40 +++++++++
 .../selftests/kvm/include/x86_64/processor.h  | 18 ++++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 32 +++++++
 .../selftests/kvm/x86_64/svm_idle_hlt_test.c  | 89 +++++++++++++++++++
 10 files changed, 236 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/svm_idle_hlt_test.c
base-commit: d91a9cc16417b8247213a0144a1f0fd61dc855dd
From: Manali Shukla <Manali.Shukla@amd.com>
The Idle HLT Intercept feature allows HLT execution by a vCPU to be intercepted by the hypervisor only if there are no pending events (V_INTR and V_NMI) for the vCPU. When the vCPU is expected to service the pending events (V_INTR and V_NMI), the Idle HLT intercept won't trigger. The feature allows the hypervisor to determine whether the vCPU is idle and reduces wasteful VMEXITs.
Presence of Idle HLT intercept feature for guests is indicated via CPUID function 0x8000000A_EDX[30].
Signed-off-by: Manali Shukla <Manali.Shukla@amd.com>
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 1 file changed, 1 insertion(+)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index a38f8f9ba657..a8c5dec042dc 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -381,6 +381,7 @@
 #define X86_FEATURE_V_SPEC_CTRL		(15*32+20) /* Virtual SPEC_CTRL */
 #define X86_FEATURE_VNMI		(15*32+25) /* Virtual NMI */
 #define X86_FEATURE_SVME_ADDR_CHK	(15*32+28) /* "" SVME addr check */
+#define X86_FEATURE_IDLE_HLT		(15*32+30) /* "" IDLE HLT intercept */

 /* Intel-defined CPU features, CPUID level 0x00000007:0 (ECX), word 16 */
 #define X86_FEATURE_AVX512VBMI		(16*32+ 1) /* AVX512 Vector Bit Manipulation instructions*/
base-commit: d91a9cc16417b8247213a0144a1f0fd61dc855dd
On Tue, May 28, 2024 at 04:19:22AM +0000, Manali Shukla wrote:
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
From: Manali Shukla <Manali.Shukla@amd.com>
Execution of the HLT instruction by a vCPU can be intercepted by the hypervisor by setting the HLT-Intercept Bit in the VMCB, resulting in a VMEXIT. It is possible that soon after the VMEXIT, the hypervisor observes pending V_INTR and V_NMI events for the vCPU and performs a VMRUN to service those events. In that case, the VMEXIT is wasteful.
The Idle HLT intercept feature allows HLT execution by a vCPU to be intercepted by the hypervisor only if there are no pending V_INTR and V_NMI events for the vCPU. The Idle HLT intercept is not triggered when the vCPU is expected to have pending events (V_INTR and V_NMI).
The feature allows the hypervisor to determine whether the vCPU is idle and reduces wasteful VMEXITs.
Details about the Idle HLT intercept feature can be found in AMD APM [1].
[1]: AMD64 Architecture Programmer's Manual Pub. 24593, April 2024, Vol 2, 15.9 Instruction Intercepts (Table 15-7: IDLE_HLT). https://bugzilla.kernel.org/attachment.cgi?id=306250
Signed-off-by: Manali Shukla <Manali.Shukla@amd.com>
---
 arch/x86/include/asm/svm.h      |  1 +
 arch/x86/include/uapi/asm/svm.h |  2 ++
 arch/x86/kvm/svm/svm.c          | 11 ++++++++---
 3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 728c98175b9c..3a91928a4060 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -116,6 +116,7 @@ enum {
 	INTERCEPT_INVPCID,
 	INTERCEPT_MCOMMIT,
 	INTERCEPT_TLBSYNC,
+	INTERCEPT_IDLE_HLT = 166,
 };
diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
index 80e1df482337..9910f86a2cef 100644
--- a/arch/x86/include/uapi/asm/svm.h
+++ b/arch/x86/include/uapi/asm/svm.h
@@ -95,6 +95,7 @@
 #define SVM_EXIT_CR14_WRITE_TRAP		0x09e
 #define SVM_EXIT_CR15_WRITE_TRAP		0x09f
 #define SVM_EXIT_INVPCID			0x0a2
+#define SVM_EXIT_IDLE_HLT			0x0a6
 #define SVM_EXIT_NPF				0x400
 #define SVM_EXIT_AVIC_INCOMPLETE_IPI		0x401
 #define SVM_EXIT_AVIC_UNACCELERATED_ACCESS	0x402
@@ -223,6 +224,7 @@
 	{ SVM_EXIT_CR4_WRITE_TRAP,	"write_cr4_trap" }, \
 	{ SVM_EXIT_CR8_WRITE_TRAP,	"write_cr8_trap" }, \
 	{ SVM_EXIT_INVPCID,		"invpcid" }, \
+	{ SVM_EXIT_IDLE_HLT,		"idle-halt" }, \
 	{ SVM_EXIT_NPF,			"npf" }, \
 	{ SVM_EXIT_AVIC_INCOMPLETE_IPI,		"avic_incomplete_ipi" }, \
 	{ SVM_EXIT_AVIC_UNACCELERATED_ACCESS,   "avic_unaccelerated_access" }, \
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 0f3b59da0d4a..223c670bf986 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1289,8 +1289,12 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
 		svm_set_intercept(svm, INTERCEPT_MWAIT);
 	}
-	if (!kvm_hlt_in_guest(vcpu->kvm))
-		svm_set_intercept(svm, INTERCEPT_HLT);
+	if (!kvm_hlt_in_guest(vcpu->kvm)) {
+		if (cpu_feature_enabled(X86_FEATURE_IDLE_HLT))
+			svm_set_intercept(svm, INTERCEPT_IDLE_HLT);
+		else
+			svm_set_intercept(svm, INTERCEPT_HLT);
+	}
 	control->iopm_base_pa = __sme_set(iopm_base);
 	control->msrpm_base_pa = __sme_set(__pa(svm->msrpm));
@@ -3291,6 +3295,7 @@ static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[SVM_EXIT_CR4_WRITE_TRAP]		= cr_trap,
 	[SVM_EXIT_CR8_WRITE_TRAP]		= cr_trap,
 	[SVM_EXIT_INVPCID]			= invpcid_interception,
+	[SVM_EXIT_IDLE_HLT]			= kvm_emulate_halt,
 	[SVM_EXIT_NPF]				= npf_interception,
 	[SVM_EXIT_RSM]				= rsm_interception,
 	[SVM_EXIT_AVIC_INCOMPLETE_IPI]		= avic_incomplete_ipi_interception,
@@ -3453,7 +3458,7 @@ int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 exit_code)
 		return interrupt_window_interception(vcpu);
 	else if (exit_code == SVM_EXIT_INTR)
 		return intr_interception(vcpu);
-	else if (exit_code == SVM_EXIT_HLT)
+	else if (exit_code == SVM_EXIT_HLT || exit_code == SVM_EXIT_IDLE_HLT)
 		return kvm_emulate_halt(vcpu);
 	else if (exit_code == SVM_EXIT_NPF)
 		return npf_interception(vcpu);
Add safe_halt() and cli() helpers to processor.h to make them broadly available in KVM selftests.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Manali Shukla <manali.shukla@amd.com>
---
 .../selftests/kvm/include/x86_64/processor.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 8eb57de0b587..f74f31df96d2 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -1305,6 +1305,23 @@ static inline void kvm_hypercall_map_gpa_range(uint64_t gpa, uint64_t size,
 	GUEST_ASSERT(!ret);
 }
+/*
+ * Execute HLT in an STI interrupt shadow to ensure that a pending IRQ that's
+ * intended to be a wake event arrives *after* HLT is executed. Modern CPUs,
+ * except for a few oddballs that KVM is unlikely to run on, block IRQs for one
+ * instruction after STI, *if* RFLAGS.IF=0 before STI. Note, Intel CPUs may
+ * block other events beyond regular IRQs, e.g. may block NMIs and SMIs too.
+ */
+static inline void safe_halt(void)
+{
+	asm volatile("sti; hlt");
+}
+
+static inline void cli(void)
+{
+	asm volatile ("cli");
+}
+
 void __vm_xsave_require_permission(uint64_t xfeature, const char *name);
#define vm_xsave_require_permission(xfeature) \
From: Manali Shukla <Manali.Shukla@amd.com>
The interface is used to read the data values of a specified vcpu stat from the currently available binary stats interface.
Signed-off-by: Manali Shukla <Manali.Shukla@amd.com>
---
 .../kvm/include/kvm_arch_vcpu_states.h        | 49 +++++++++++++++++++
 .../testing/selftests/kvm/include/kvm_util.h  | 34 +++++++++++++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 32 ++++++++++++
 3 files changed, 115 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/include/kvm_arch_vcpu_states.h
diff --git a/tools/testing/selftests/kvm/include/kvm_arch_vcpu_states.h b/tools/testing/selftests/kvm/include/kvm_arch_vcpu_states.h
new file mode 100644
index 000000000000..755ff7de53d9
--- /dev/null
+++ b/tools/testing/selftests/kvm/include/kvm_arch_vcpu_states.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Arch-specific stats are added to the kvm_arch_vcpu_states.h. Sequence
+ * of arch-specific vcpu_stat_type should be same as they are declared in
+ * arch-specific kvm_vcpu_stat.
+ */
+#ifdef __x86_64__
+#define KVM_X86_VCPU_STATE(x) KVM_VCPU_STATE(x)
+
+KVM_X86_VCPU_STATE(PF_TAKEN)
+KVM_X86_VCPU_STATE(PF_FIXED)
+KVM_X86_VCPU_STATE(PF_EMULATE)
+KVM_X86_VCPU_STATE(PF_SPURIOUS)
+KVM_X86_VCPU_STATE(PF_FAST)
+KVM_X86_VCPU_STATE(PF_MMIO_SPTE_CREATED)
+KVM_X86_VCPU_STATE(PF_GUEST)
+KVM_X86_VCPU_STATE(TLB_FLUSH)
+KVM_X86_VCPU_STATE(INVLPG)
+KVM_X86_VCPU_STATE(EXITS)
+KVM_X86_VCPU_STATE(IO_EXITS)
+KVM_X86_VCPU_STATE(MMIO_EXITS)
+KVM_X86_VCPU_STATE(SIGNAL_EXITS)
+KVM_X86_VCPU_STATE(IRQ_WINDOW_EXITS)
+KVM_X86_VCPU_STATE(NMI_WINDOW_EXITS)
+KVM_X86_VCPU_STATE(LD_FLUSH)
+KVM_X86_VCPU_STATE(HALT_EXITS)
+KVM_X86_VCPU_STATE(REQUEST_IRQ_EXITS)
+KVM_X86_VCPU_STATE(IRQ_EXITS)
+KVM_X86_VCPU_STATE(HOST_STATE_RELOAD)
+KVM_X86_VCPU_STATE(FPU_RELOAD)
+KVM_X86_VCPU_STATE(INSN_EMULATION)
+KVM_X86_VCPU_STATE(INSN_EMULATION_FAIL)
+KVM_X86_VCPU_STATE(HYPERCALLS)
+KVM_X86_VCPU_STATE(IRQ_INJECTIONS)
+KVM_X86_VCPU_STATE(NMI_INJECTIONS)
+KVM_X86_VCPU_STATE(REQ_EVENT)
+KVM_X86_VCPU_STATE(NESTED_RUN)
+KVM_X86_VCPU_STATE(DIRECTED_YIELD_ATTEMPTED)
+KVM_X86_VCPU_STATE(DIRECTED_YIELD_SUCCESSFUL)
+KVM_X86_VCPU_STATE(PREEMPTION_REPORTED)
+KVM_X86_VCPU_STATE(PREEMPTION_OTHER)
+KVM_X86_VCPU_STATE(GUEST_MODE)
+KVM_X86_VCPU_STATE(NOTIFY_WINDOW_EXITS)
+
+#undef KVM_X86_VCPU_STATE
+#undef KVM_VCPU_STATE
+#endif
+
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 63c2aaae51f3..fe66b1736060 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -518,6 +518,40 @@ static inline uint64_t vm_get_stat(struct kvm_vm *vm, const char *stat_name)
 	return data;
 }
+/*
+ * Ensure that the sequence of the enum vcpu_stat_types matches the order of
+ * kvm_vcpu_stats_desc[]. Otherwise, vcpu_get_stat() may return incorrect data
+ * because __vcpu_get_stat() uses the enum type as an index to get the
+ * descriptor for a given stat and then uses read_stat_data() to get the stats
+ * from the descriptor.
+ */
+enum vcpu_stat_types {
+	HALT_SUCCESSFUL_POLL,
+	HALT_ATTEMPTED_POLL,
+	HALT_POLL_INVALID,
+	HALT_WAKEUP,
+	HALT_POLL_SUCCESS_NS,
+	HALT_POLL_FAIL_NS,
+	HALT_WAIT_NS,
+	HALT_POLL_SUCCESS_HIST,
+	HALT_POLL_FAIL_HIST,
+	HALT_WAIT_HIST,
+	BLOCKING,
+#define KVM_VCPU_STATE(x) x,
+#include "kvm_arch_vcpu_states.h"
+};
+
+void __vcpu_get_stat(struct kvm_vcpu *vcpu, enum vcpu_stat_types type, uint64_t *data,
+		     size_t max_elements);
+
+static inline uint64_t vcpu_get_stat(struct kvm_vcpu *vcpu, enum vcpu_stat_types type)
+{
+	uint64_t data;
+
+	__vcpu_get_stat(vcpu, type, &data, 1);
+	return data;
+}
+
 void vm_create_irqchip(struct kvm_vm *vm);
 static inline int __vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index ad00e4761886..d4b6a352c1ae 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -2264,6 +2264,38 @@ void read_stat_data(int stats_fd, struct kvm_stats_header *header,
 		    desc->name, size, ret);
 }
+/*
+ * Read the data of the named vcpu stat
+ *
+ * Input Args:
+ *   vcpu - the vcpu for which the stat should be read
+ *   stat_name - the name of the stat to read
+ *   max_elements - the maximum number of 8-byte values to read into data
+ *
+ * Output Args:
+ *   data - the buffer into which stat data should be read
+ *
+ * Read the data values of a specified stat from the binary stats interface.
+ */
+void __vcpu_get_stat(struct kvm_vcpu *vcpu, enum vcpu_stat_types type, uint64_t *data,
+		     size_t max_elements)
+{
+	int vcpu_stats_fd;
+	struct kvm_stats_header header;
+	struct kvm_stats_desc *desc, *t_desc;
+	size_t size_desc;
+
+	vcpu_stats_fd = vcpu_get_stats_fd(vcpu);
+	read_stats_header(vcpu_stats_fd, &header);
+
+	desc = read_stats_descriptors(vcpu_stats_fd, &header);
+	size_desc = get_stats_descriptor_size(&header);
+
+	t_desc = (void *)desc + (type * size_desc);
+	read_stat_data(vcpu_stats_fd, &header, t_desc,
+		       data, max_elements);
+}
+
 /*
  * Read the data of the named stat
  *
On Tue, May 28, 2024, Manali Shukla wrote:
> diff --git a/tools/testing/selftests/kvm/include/kvm_arch_vcpu_states.h b/tools/testing/selftests/kvm/include/kvm_arch_vcpu_states.h
> new file mode 100644
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * Arch-specific stats are added to the kvm_arch_vcpu_states.h. Sequence
> + * of arch-specific vcpu_stat_type should be same as they are declared in
> + * arch-specific kvm_vcpu_stat.
> + */
> +#ifdef __x86_64__
This is backwards. If you want arch specific stats, put them in an arch specific header.
> +#define KVM_X86_VCPU_STATE(x) KVM_VCPU_STATE(x)
>
> +KVM_X86_VCPU_STATE(PF_TAKEN)
I'm pretty sure you want KVM_VCPU_STAT, KVM_X86_VCPU_STAT, kvm_arch_vcpu_states.h, etc.
> +KVM_X86_VCPU_STATE(PF_FIXED)

...

> +/*
> + * Ensure that the sequence of the enum vcpu_stat_types matches the order of
> + * kvm_vcpu_stats_desc[]. Otherwise, vcpu_get_stat() may return incorrect data
> + * because __vcpu_get_stat() uses the enum type as an index to get the
> + * descriptor for a given stat and then uses read_stat_data() to get the stats
> + * from the descriptor.
> + */
This isn't maintainable. Unless I'm missing something, the _order_ of KVM's stats isn't ABI, and blindly reading an entry and hoping it's the right one is doomed to fail.
I don't see any reason whatsoever to diverge from the core functionality of __vm_get_stat(). The only difference should be the origin of the stats file and header.
I do see a lot of room for improvement, but that can and should be done for both VM and vCPU stats. E.g. provide an API (and a container/struct?) to get a direct pointer to stat so that selftests don't have to walk all descriptors when they're reading the same stat over and over.
And to detect typos at compile time, {vcpu,vm}_get_stat() could either play macro games or use enums and array to detect usage of a stat that doesn't exist. E.g.
static inline uint64_t vm_get_stat(struct kvm_vm *vm, int stat)
{
	uint64_t data;

	__vm_get_stat(vm, kvm_vm_stats[stat], &data, 1);
	return data;
}
or
#define vm_get_stat(vm, stat)						\
({									\
	uint64_t __data;						\
									\
	<concatenation trickery to trigger compiler error if the stat doesn't exist> \
	__vm_get_stat(vm, #stat, &__data, 1);				\
	__data;								\
})
I'd probably vote for macro games, e.g. so that it's all but impossible to pass a per-VM stat into vcpu_get_stat(), and vice versa.
Hi Sean,
Thank you for reviewing my patches. Sorry for the delay in response.
On 8/13/2024 10:07 PM, Sean Christopherson wrote:
On 8/13/2024 10:07 PM, Sean Christopherson wrote (review comments trimmed):
All the review comments from this patch are taken care in [1].
[1] https://lore.kernel.org/kvm/20241021062226.108657-1-manali.shukla@amd.com/T/...
- Manali
From: Manali Shukla <Manali.Shukla@amd.com>
Execution of the HLT instruction results in a VMEXIT. It is possible that, just after the VMEXIT generated by HLT, the hypervisor observes pending V_INTR and V_NMI events for the vCPU and performs a VM entry to service them. In that case, the VMEXIT is wasteful. The Idle HLT intercept feature avoids the wasteful VMEXIT during halt if there are pending V_INTR and V_NMI events for the vCPU.
The selftest for Idle HLT intercept exercises the above-mentioned scenario.
Signed-off-by: Manali Shukla <Manali.Shukla@amd.com>
---
 tools/testing/selftests/kvm/Makefile          |  1 +
 .../selftests/kvm/include/x86_64/processor.h  |  1 +
 .../selftests/kvm/x86_64/svm_idle_hlt_test.c  | 89 +++++++++++++++++++
 3 files changed, 91 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/svm_idle_hlt_test.c
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 6de9994971c9..bd97586d7c04 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -93,6 +93,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
 TEST_GEN_PROGS_x86_64 += x86_64/smm_test
 TEST_GEN_PROGS_x86_64 += x86_64/state_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_preemption_timer_test
+TEST_GEN_PROGS_x86_64 += x86_64/svm_idle_hlt_test
 TEST_GEN_PROGS_x86_64 += x86_64/svm_vmcall_test
 TEST_GEN_PROGS_x86_64 += x86_64/svm_int_ctl_test
 TEST_GEN_PROGS_x86_64 += x86_64/svm_nested_shutdown_test
diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index f74f31df96d2..0036937b1be4 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -192,6 +192,7 @@ struct kvm_x86_cpu_feature {
 #define X86_FEATURE_PAUSEFILTER		KVM_X86_CPU_FEATURE(0x8000000A, 0, EDX, 10)
 #define X86_FEATURE_PFTHRESHOLD	KVM_X86_CPU_FEATURE(0x8000000A, 0, EDX, 12)
 #define X86_FEATURE_VGIF		KVM_X86_CPU_FEATURE(0x8000000A, 0, EDX, 16)
+#define X86_FEATURE_IDLE_HLT		KVM_X86_CPU_FEATURE(0x8000000A, 0, EDX, 30)
 #define X86_FEATURE_SEV			KVM_X86_CPU_FEATURE(0x8000001F, 0, EAX, 1)
 #define X86_FEATURE_SEV_ES		KVM_X86_CPU_FEATURE(0x8000001F, 0, EAX, 3)
diff --git a/tools/testing/selftests/kvm/x86_64/svm_idle_hlt_test.c b/tools/testing/selftests/kvm/x86_64/svm_idle_hlt_test.c
new file mode 100644
index 000000000000..594caac7194b
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/svm_idle_hlt_test.c
@@ -0,0 +1,89 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2024 Advanced Micro Devices, Inc.
+ *
+ */
+#include <kvm_util.h>
+#include <processor.h>
+#include <test_util.h>
+#include "svm_util.h"
+#include "apic.h"
+
+#define VINTR_VECTOR	0x30
+#define NUM_ITERATIONS	1000
+
+static bool irq_received;
+
+/*
+ * The guest code instruments the scenario where there is a V_INTR pending
+ * event available while hlt instruction is executed. The HLT VM Exit doesn't
+ * occur in above-mentioned scenario if Idle HLT intercept feature is enabled.
+ */
+static void guest_code(void)
+{
+	uint32_t icr_val;
+	int i;
+
+	xapic_enable();
+
+	icr_val = (APIC_DEST_SELF | APIC_INT_ASSERT | VINTR_VECTOR);
+
+	for (i = 0; i < NUM_ITERATIONS; i++) {
+		cli();
+		xapic_write_reg(APIC_ICR, icr_val);
+		safe_halt();
+		GUEST_ASSERT(READ_ONCE(irq_received));
+		WRITE_ONCE(irq_received, false);
+	}
+	GUEST_DONE();
+}
+
+static void guest_vintr_handler(struct ex_regs *regs)
+{
+	WRITE_ONCE(irq_received, true);
+	xapic_write_reg(APIC_EOI, 0x00);
+}
+
+int main(int argc, char *argv[])
+{
+	struct kvm_vm *vm;
+	struct kvm_vcpu *vcpu;
+	struct ucall uc;
+	uint64_t halt_exits, vintr_exits;
+
+	/* Check the extension for binary stats */
+	TEST_REQUIRE(this_cpu_has(X86_FEATURE_IDLE_HLT));
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_BINARY_STATS_FD));
+
+	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
+
+	vm_install_exception_handler(vm, VINTR_VECTOR, guest_vintr_handler);
+	virt_pg_map(vm, APIC_DEFAULT_GPA, APIC_DEFAULT_GPA);
+
+	vcpu_run(vcpu);
+	TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
+
+	halt_exits = vcpu_get_stat(vcpu, HALT_EXITS);
+	vintr_exits = vcpu_get_stat(vcpu, IRQ_WINDOW_EXITS);
+
+	switch (get_ucall(vcpu, &uc)) {
+	case UCALL_ABORT:
+		REPORT_GUEST_ASSERT(uc);
+		/* NOT REACHED */
+	case UCALL_DONE:
+		break;
+	default:
+		TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd);
+	}
+
+	TEST_ASSERT_EQ(halt_exits, 0);
+	pr_debug("Guest executed VINTR followed by halts: %d times.\n"
+		 "The guest exited due to halt: %ld times and number\n"
+		 "of vintr exits: %ld.\n",
+		 NUM_ITERATIONS, halt_exits, vintr_exits);
+
+	kvm_vm_free(vm);
+	return 0;
+}
> +static void guest_code(void)
> +{
> +	uint32_t icr_val;
> +	int i;
> +
> +	xapic_enable();
> +
> +	icr_val = (APIC_DEST_SELF | APIC_INT_ASSERT | VINTR_VECTOR);
> +
> +	for (i = 0; i < NUM_ITERATIONS; i++) {
> +		cli();
> +		xapic_write_reg(APIC_ICR, icr_val);
> +		safe_halt();
> +		GUEST_ASSERT(READ_ONCE(irq_received));
> +		WRITE_ONCE(irq_received, false);
any reason to use READ/WRITE_ONCE here?
> +	}
> +	GUEST_DONE();
> +}
>
> +static void guest_vintr_handler(struct ex_regs *regs)
> +{
> +	WRITE_ONCE(irq_received, true);
> +	xapic_write_reg(APIC_EOI, 0x00);
> +}
>
> +int main(int argc, char *argv[])
> +{
> +	struct kvm_vm *vm;
> +	struct kvm_vcpu *vcpu;
> +	struct ucall uc;
> +	uint64_t halt_exits, vintr_exits;
> +
> +	/* Check the extension for binary stats */
> +	TEST_REQUIRE(this_cpu_has(X86_FEATURE_IDLE_HLT));
IIUC, this test assumes that the IDLE_HLT feature is enabled for guests if it is supported by the CPU. But this isn't true in some cases:
1. an old KVM runs on new hardware
2. the feature bit is somehow cleared, e.g., by "clearcpuid" cmdline
Hi Chao,

Thank you for reviewing my patches.
On 5/28/2024 1:16 PM, Chao Gao wrote:
> > +	for (i = 0; i < NUM_ITERATIONS; i++) {
> > +		cli();
> > +		xapic_write_reg(APIC_ICR, icr_val);
> > +		safe_halt();
> > +		GUEST_ASSERT(READ_ONCE(irq_received));
> > +		WRITE_ONCE(irq_received, false);
>
> any reason to use READ/WRITE_ONCE here?
This is done to ensure that the IRQ has already been received at this point, as irq_received is set to true in guest_vintr_handler.
> > +	/* Check the extension for binary stats */
> > +	TEST_REQUIRE(this_cpu_has(X86_FEATURE_IDLE_HLT));
>
> IIUC, this test assumes that the IDLE_HLT feature is enabled for guests if it is supported by the CPU. But this isn't true in some cases:
I understand you are intending to create a capability for IDLE HLT intercept feature, but in my opinion, the IDLE Halt intercept feature doesn't require user space to do anything for the feature itself.
> 1. an old KVM runs on new hardware
> 2. the feature bit is somehow cleared, e.g., by "clearcpuid" cmdline
In both of these cases, the test will fail.
In general, if the feature bit for the Idle HLT feature is somehow cleared, or a new KVM runs on old hardware, the idle halt exits will be replaced with halt exits.
If an old KVM runs on new hardware, the Idle HLT feature is never enabled via KVM. So, the presence of a pending V_INTR event during execution of the halt instruction won't result in an idle-halt exit; rather, it will result in a halt exit followed by a VINTR exit.
-Manali
On Thu, May 30, 2024 at 06:49:56PM +0530, Manali Shukla wrote:
> > > +		GUEST_ASSERT(READ_ONCE(irq_received));
> > > +		WRITE_ONCE(irq_received, false);
> >
> > any reason to use READ/WRITE_ONCE here?
>
> This is done to ensure that irq is already received at this point, as irq_received is set to true in guest_vintr_handler.
OK. So, READ_ONCE() is to ensure that irq_received is always read directly from memory. Otherwise, the compiler might assume it remains false (in the 2nd and subsequent iterations) and apply some optimizations.
However, I don't understand why WRITE_ONCE() is necessary here. Is it to prevent the compiler from merging all writes to irq_received across iterations into a single write (e.g., simply drop writes in the 2nd and subsequent iterations)? I'm not sure.
I suggest adding one comment here because it isn't obvious to everyone.
+	}
+
+	GUEST_DONE();
+}
+static void guest_vintr_handler(struct ex_regs *regs)
+{
+	WRITE_ONCE(irq_received, true);
+	xapic_write_reg(APIC_EOI, 0x00);
+}
+int main(int argc, char *argv[])
+{
+	struct kvm_vm *vm;
+	struct kvm_vcpu *vcpu;
+	struct ucall uc;
+	uint64_t halt_exits, vintr_exits;
+
+	/* Check the extension for binary stats */
+	TEST_REQUIRE(this_cpu_has(X86_FEATURE_IDLE_HLT));
IIUC, this test assumes that the IDLE_HLT feature is enabled for guests if it is supported by the CPU. But this isn't true in some cases:
I understand you are intending to create a capability for IDLE HLT intercept feature, but in my opinion, the IDLE Halt intercept feature doesn't require user space to do anything for the feature itself.
Yes, I agree. Actually, I was thinking about:
1. make the feature bit visible from /proc/cpuinfo by removing the leading "" from the comment following the bit definition in patch 1
2. parse /proc/cpuinfo to determine if this IDLE_HLT feature is supported by the kernel
But I am not sure if it's worth it. I'll defer to maintainers.
Hi Chao,
Thanks for reviewing the patches.
On 5/31/2024 12:19 PM, Chao Gao wrote:
On Thu, May 30, 2024 at 06:49:56PM +0530, Manali Shukla wrote:
Hi Chao, Thank you for reviewing my patches.
On 5/28/2024 1:16 PM, Chao Gao wrote:
+static void guest_code(void)
+{
+	uint32_t icr_val;
+	int i;
+
+	xapic_enable();
+	icr_val = (APIC_DEST_SELF | APIC_INT_ASSERT | VINTR_VECTOR);
+
+	for (i = 0; i < NUM_ITERATIONS; i++) {
+		cli();
+		xapic_write_reg(APIC_ICR, icr_val);
+		safe_halt();
+		GUEST_ASSERT(READ_ONCE(irq_received));
+		WRITE_ONCE(irq_received, false);
any reason to use READ/WRITE_ONCE here?
This is done to ensure that the IRQ has already been received at this point, as irq_received is set to true in guest_vintr_handler().
OK. so, READ_ONCE() is to ensure that irq_received is always read directly from memory. Otherwise, the compiler might assume it remains false (in the 2nd and subsequent iterations) and apply some optimizations.
However, I don't understand why WRITE_ONCE() is necessary here. Is it to prevent the compiler from merging all writes to irq_received across iterations into a single write (e.g., simply drop writes in the 2nd and subsequent iterations)? I'm not sure.
The compiler optimizing this out is one case. Also, if WRITE_ONCE() to irq_received is not called, the test will not be able to figure out whether irq_received holds a stale "true" from the previous iteration (maybe the vintr interrupt handler did not get invoked) or a fresh "true" from the current iteration.
I suggest adding one comment here because it isn't obvious to everyone.
Sure I will add the comment in V4.
+	}
+
+	GUEST_DONE();
+}
+static void guest_vintr_handler(struct ex_regs *regs)
+{
+	WRITE_ONCE(irq_received, true);
+	xapic_write_reg(APIC_EOI, 0x00);
+}
+int main(int argc, char *argv[])
+{
+	struct kvm_vm *vm;
+	struct kvm_vcpu *vcpu;
+	struct ucall uc;
+	uint64_t halt_exits, vintr_exits;
+
+	/* Check the extension for binary stats */
+	TEST_REQUIRE(this_cpu_has(X86_FEATURE_IDLE_HLT));
IIUC, this test assumes that the IDLE_HLT feature is enabled for guests if it is supported by the CPU. But this isn't true in some cases:
I understand you are intending to create a capability for IDLE HLT intercept feature, but in my opinion, the IDLE Halt intercept feature doesn't require user space to do anything for the feature itself.
Yes, I agree. Actually, I was thinking about:
make the feature bit visible from /proc/cpuinfo by removing the leading "" from the comment following the bit definition in patch 1
parse /proc/cpuinfo to determine if this IDLE_HLT feature is supported by the kernel
But I am not sure if it's worth it. I'll defer to maintainers.
Ack.
-Manali
On Fri, May 31, 2024, Chao Gao wrote:
On Thu, May 30, 2024 at 06:49:56PM +0530, Manali Shukla wrote:
+	/* Check the extension for binary stats */
+	TEST_REQUIRE(this_cpu_has(X86_FEATURE_IDLE_HLT));
IIUC, this test assumes that the IDLE_HLT feature is enabled for guests if it is supported by the CPU. But this isn't true in some cases:
I understand you are intending to create a capability for IDLE HLT intercept feature, but in my opinion, the IDLE Halt intercept feature doesn't require user space to do anything for the feature itself.
Yes, I agree. Actually, I was thinking about:
make the feature bit visible from /proc/cpuinfo by removing the leading "" from the comment following the bit definition in patch 1
parse /proc/cpuinfo to determine if this IDLE_HLT feature is supported by the kernel
Neither of these is sufficient/correct. E.g. they'll get false positives if run on a kernel that recognizes IDLE_HLT, but that doesn't have KVM support for enabling the feature.
The canonical way to check for features in KVM selftests is kvm_cpu_has(), which looks at KVM_GET_SUPPORTED_CPUID (by default, selftests VMs enable all features, i.e. reflect the result of KVM_GET_SUPPORTED_CPUID into KVM_SET_CPUID2).
But I am not sure if it's worth it. I'll defer to maintainers.
On Tue, Aug 13, 2024, Sean Christopherson wrote:
On Fri, May 31, 2024, Chao Gao wrote:
On Thu, May 30, 2024 at 06:49:56PM +0530, Manali Shukla wrote:
+	/* Check the extension for binary stats */
+	TEST_REQUIRE(this_cpu_has(X86_FEATURE_IDLE_HLT));
IIUC, this test assumes that the IDLE_HLT feature is enabled for guests if it is supported by the CPU. But this isn't true in some cases:
I understand you are intending to create a capability for IDLE HLT intercept feature, but in my opinion, the IDLE Halt intercept feature doesn't require user space to do anything for the feature itself.
Yes, I agree. Actually, I was thinking about:
make the feature bit visible from /proc/cpuinfo by removing the leading "" from the comment following the bit definition in patch 1
parse /proc/cpuinfo to determine if this IDLE_HLT feature is supported by the kernel
Neither of these is sufficient/correct. E.g. they'll get false positives if run on a kernel that recognizes IDLE_HLT, but that doesn't have KVM support for enabling the feature.
The canonical way to check for features in KVM selftests is kvm_cpu_has(), which looks at KVM_GET_SUPPORTED_CPUID (by default, selftests VMs enable all features, i.e. reflect the result of KVM_GET_SUPPORTED_CPUID into KVM_SET_CPUID2).
Never mind, brain fart. That only works for features KVM is exposing to the guest, this is purely a KVM/host feature.
That said, doesn't this test also require that AVIC be enabled? Oh, or does this happen to work because KVM uses V_INTR to detect interrupt windows?
On Tue, May 28, 2024 at 6:19 AM Manali Shukla manali.shukla@amd.com wrote:
The upcoming new Idle HLT Intercept feature allows for the HLT instruction execution by a vCPU to be intercepted by the hypervisor only if there are no pending V_INTR and V_NMI events for the vCPU. When the vCPU is expected to service the pending V_INTR and V_NMI events, the Idle HLT intercept won’t trigger. The feature allows the hypervisor to determine if the vCPU is actually idle and reduces wasteful VMEXITs.
Does this have an effect on the number of vmexits for KVM, unless AVIC is enabled? Can you write a testcase for kvm-unit-tests' vmexit.flat that shows an improvement?
The reason I am wondering is because KVM does not really use V_INTR injection. The "idle HLT" intercept basically differs from the basic HLT trigger only in how it handles an STI;HLT sequence, as in that case the interrupt can be injected directly and the HLT vmexit is suppressed. But in that circumstance KVM would anyway use a V_INTR intercept to detect the opening of the interrupt injection window (and then the interrupt uses event injection rather than V_INTR). Again, this is only true if AVIC is disabled, but that is the default.
So unless I'm wrong in my analysis above, I'm not sure this series, albeit small, is really worth it. As things stand, it would be more interesting to enable this for nested VMs, especially Hyper-V which does use V_INTR and V_TPR; even better, _emulating_ it on older processors would reduce the L2->L0->L1->L0->L2 path to a less-expensive L2->L0->L2 vmexit.
Paolo
Presence of the Idle HLT Intercept feature is indicated via CPUID function Fn8000_000A_EDX[30].
Document for the Idle HLT intercept feature is available at [1].
[1]: AMD64 Architecture Programmer's Manual Pub. 24593, April 2024, Vol 2, 15.9 Instruction Intercepts (Table 15-7: IDLE_HLT). https://bugzilla.kernel.org/attachment.cgi?id=306250
Testing Done:
- Added a selftest to test the Idle HLT intercept functionality.
- Compile and functionality testing for the Idle HLT intercept selftest are only done for x86_64.
- Tested SEV and SEV-ES guest for the Idle HLT intercept functionality.
v2 -> v3
- Incorporated Andrew's suggestion to structure vcpu_stat_types in a way that each architecture can share the generic types and also provide its own.
v1 -> v2
- Done changes in svm_idle_hlt_test based on the review comments from Sean.
- Added an enum based approach to get binary stats in vcpu_get_stat() which doesn't use string to get stat data based on the comments from Sean.
- Added safe_halt() and cli() helpers based on the comments from Sean.
Manali Shukla (5):
  x86/cpufeatures: Add CPUID feature bit for Idle HLT intercept
  KVM: SVM: Add Idle HLT intercept support
  KVM: selftests: Add safe_halt() and cli() helpers to common code
  KVM: selftests: Add an interface to read the data of named vcpu stat
  KVM: selftests: KVM: SVM: Add Idle HLT intercept test
 arch/x86/include/asm/cpufeatures.h            |  1 +
 arch/x86/include/asm/svm.h                    |  1 +
 arch/x86/include/uapi/asm/svm.h               |  2 +
 arch/x86/kvm/svm/svm.c                        | 11 ++-
 tools/testing/selftests/kvm/Makefile          |  1 +
 .../testing/selftests/kvm/include/kvm_util.h  | 44 +++++++++
 .../kvm/include/x86_64/kvm_util_arch.h        | 40 +++++++++
 .../selftests/kvm/include/x86_64/processor.h  | 18 ++++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 32 +++++++
 .../selftests/kvm/x86_64/svm_idle_hlt_test.c  | 89 +++++++++++++++++++
 10 files changed, 236 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/svm_idle_hlt_test.c
base-commit: d91a9cc16417b8247213a0144a1f0fd61dc855dd
On Tue, May 28, 2024, Paolo Bonzini wrote:
On Tue, May 28, 2024 at 6:19 AM Manali Shukla manali.shukla@amd.com wrote:
The upcoming new Idle HLT Intercept feature allows for the HLT instruction execution by a vCPU to be intercepted by the hypervisor only if there are no pending V_INTR and V_NMI events for the vCPU. When the vCPU is expected to service the pending V_INTR and V_NMI events, the Idle HLT intercept won’t trigger. The feature allows the hypervisor to determine if the vCPU is actually idle and reduces wasteful VMEXITs.
Does this have an effect on the number of vmexits for KVM, unless AVIC is enabled? Can you write a testcase for kvm-unit-tests' vmexit.flat that shows an improvement?
The reason I am wondering is because KVM does not really use V_INTR injection. The "idle HLT" intercept basically differs from the basic HLT trigger only in how it handles an STI;HLT sequence, as in that case the interrupt can be injected directly and the HLT vmexit is suppressed. But in that circumstance KVM would anyway use a V_INTR intercept to detect the opening of the interrupt injection window (and then the interrupt uses event injection rather than V_INTR). Again, this is only true if AVIC is disabled, but that is the default.
So unless I'm wrong in my analysis above, I'm not sure this series, albeit small, is really worth it.
But aren't we hoping to enable x2AVIC by default sooner than later?
As things stand, it would be more interesting to enable this for nested VMs, especially Hyper-V which does use V_INTR and V_TPR; even better, _emulating_ it on older processors would reduce the L2->L0->L1->L0->L2 path to a less-expensive L2->L0->L2 vmexit.
On 6/4/2024 6:17 AM, Sean Christopherson wrote:
On Tue, May 28, 2024, Paolo Bonzini wrote:
On Tue, May 28, 2024 at 6:19 AM Manali Shukla manali.shukla@amd.com wrote:
The upcoming new Idle HLT Intercept feature allows for the HLT instruction execution by a vCPU to be intercepted by the hypervisor only if there are no pending V_INTR and V_NMI events for the vCPU. When the vCPU is expected to service the pending V_INTR and V_NMI events, the Idle HLT intercept won’t trigger. The feature allows the hypervisor to determine if the vCPU is actually idle and reduces wasteful VMEXITs.
Does this have an effect on the number of vmexits for KVM, unless AVIC is enabled? Can you write a testcase for kvm-unit-tests' vmexit.flat that shows an improvement?
The reason I am wondering is because KVM does not really use V_INTR injection. The "idle HLT" intercept basically differs from the basic HLT trigger only in how it handles an STI;HLT sequence, as in that case the interrupt can be injected directly and the HLT vmexit is suppressed. But in that circumstance KVM would anyway use a V_INTR intercept to detect the opening of the interrupt injection window (and then the interrupt uses event injection rather than V_INTR). Again, this is only true if AVIC is disabled, but that is the default.
So unless I'm wrong in my analysis above, I'm not sure this series, albeit small, is really worth it.
But aren't we hoping to enable x2AVIC by default sooner than later?
The idle halt intercept feature suppresses the HLT exit not only when a V_INTR event is pending at the time the halt instruction executes, but also when a V_NMI event is pending. This capability will be advantageous for IBS virtualization and PMC virtualization, as both rely on V_NMI to deliver virtualized interrupts from the IBS and PMC hardware.
As things stand, it would be more interesting to enable this for nested VMs, especially Hyper-V which does use V_INTR and V_TPR; even better, _emulating_ it on older processors would reduce the L2->L0->L1->L0->L2 path to a less-expensive L2->L0->L2 vmexit.
Hi Paolo,
Thank you for reviewing my patches.
On 5/28/2024 3:52 PM, Paolo Bonzini wrote:
On Tue, May 28, 2024 at 6:19 AM Manali Shukla manali.shukla@amd.com wrote:
The upcoming new Idle HLT Intercept feature allows for the HLT instruction execution by a vCPU to be intercepted by the hypervisor only if there are no pending V_INTR and V_NMI events for the vCPU. When the vCPU is expected to service the pending V_INTR and V_NMI events, the Idle HLT intercept won’t trigger. The feature allows the hypervisor to determine if the vCPU is actually idle and reduces wasteful VMEXITs.
Does this have an effect on the number of vmexits for KVM, unless AVIC is enabled? Can you write a testcase for kvm-unit-tests' vmexit.flat that shows an improvement?
I have measured the total number of vmexits (using perf kvm stat report) while running the test case I have written for the idle halt intercept in vmexit.flat.
Without idle halt
---------------------------------------------------------------
| Event name | Samples | Sample% | Time (ns)  | Time%  |
---------------------------------------------------------------
| msr        | 524213  | 49.00%  | 592573933  | 64.00% |
| hlt        | 262080  | 24.00%  | 154429476  | 16.00% |
| vintr      | 262080  | 24.00%  | 163724208  | 16.00% |
---------------------------------------------------------------
With idle halt
---------------------------------------------------------------
| Event name | Samples | Sample% | Time (ns)  | Time%  |
---------------------------------------------------------------
| msr        | 524213  | 66.00%  | 502011088  | 75.00% |
| vintr      | 262080  | 33.00%  | 147515295  | 22.00% |
| io         | 1879    | 0.00%   | 8784729    | 1.00%  |
---------------------------------------------------------------
Below is data for the average of 10 runs of the idle halt test case in vmexit.flat:
----------------------------------------------------------------------
| idle halt (on/off) | full test run | ipi test run | eoi test run |
----------------------------------------------------------------------
| on                 | 5048.4        | 1289.2       | 1140.6       |
| off                | 4806.1        | 1318.6       | 1165.8       |
----------------------------------------------------------------------
The "ipi test run" with idle halt enabled takes less time (~2.28%) to finish compared to the "ipi test run" with idle halt disabled.
The "eoi test run" with idle halt enabled takes less time (~2.20%) to finish compared to the "eoi test run" with idle halt disabled.
The "full test run" with idle halt enabled takes more time (~5.4%) compared to the "full test run" with idle halt disabled. (Seems a bit odd; I have not yet begun to investigate this behavior.)
Snippet of the test case:

+static void idle_hlt_test(void)
+{
+	x = 0;
+	cli();
+	apic_self_ipi(IPI_TEST_VECTOR);
+	safe_halt();
+	if (x != 1)
+		printf("%d", x);
+}
The reason I am wondering is because KVM does not really use V_INTR injection. The "idle HLT" intercept basically differs from the basic HLT trigger only in how it handles an STI;HLT sequence, as in that case the interrupt can be injected directly and the HLT vmexit is suppressed. But in that circumstance KVM would anyway use a V_INTR intercept to detect the opening of the interrupt injection window (and then the interrupt uses event injection rather than V_INTR). Again, this is only true if AVIC is disabled, but that is the default.
I have taken traces to analyze it further.
With idle halt enabled:

220.696238: kvm_apic_ipi: dst b0 vec 176 (Fixed|physical|assert|edge|self)
220.696238: kvm_apic_accept_irq: apicid 0 vec 176 (Fixed|edge)
220.696238: kvm_apic: apic_write APIC_ICR = 0xb0000440b0
220.696238: kvm_msr: msr_write 830 = 0xb0000440b0
220.696238: kvm_entry: vcpu 0, rip 0x406a89
220.696239: kvm_exit: vcpu 0 reason vintr rip 0x4004ae info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
220.696239: kvm_inj_virq: IRQ 0xb0
220.696240: kvm_entry: vcpu 0, rip 0x4004ae
220.696240: kvm_exit: vcpu 0 reason msr rip 0x406a74 info1 0x0000000000000001 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
220.696240: kvm_apic: apic_write APIC_EOI = 0x0
220.696240: kvm_eoi: apicid 0 vector 176
Without idle halt enabled:

6204.951631: kvm_apic_ipi: dst b0 vec 176 (Fixed|physical|assert|edge|self)
6204.951631: kvm_apic_accept_irq: apicid 0 vec 176 (Fixed|edge)
6204.951631: kvm_apic: apic_write APIC_ICR = 0xb0000440b0
6204.951631: kvm_msr: msr_write 830 = 0xb0000440b0
6204.951631: kvm_entry: vcpu 0, rip 0x406a89
6204.951632: kvm_exit: vcpu 0 reason hlt rip 0x4004ad info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
6204.951632: kvm_entry: vcpu 0, rip 0x4004ae
6204.951632: kvm_exit: vcpu 0 reason vintr rip 0x4004ae info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
6204.951632: kvm_inj_virq: IRQ 0xb0
6204.951632: kvm_entry: vcpu 0, rip 0x4004ae
6204.951633: kvm_exit: vcpu 0 reason msr rip 0x406a74 info1 0x0000000000000001 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
6204.951633: kvm_apic: apic_write APIC_EOI = 0x0
6204.951633: kvm_eoi: apicid 0 vector 176
From the traces, it can be seen that the hlt exit is avoided when the idle halt intercept feature is enabled.
So unless I'm wrong in my analysis above, I'm not sure this series, albeit small, is really worth it. As things stand, it would be more interesting to enable this for nested VMs, especially Hyper-V which does use V_INTR and V_TPR; even better, _emulating_ it on older processors would reduce the L2->L0->L1->L0->L2 path to a less-expensive L2->L0->L2 vmexit.
I have yet to test the idle halt intercept feature on a nested guest.
- Manali
Paolo
Presence of the Idle HLT Intercept feature is indicated via CPUID function Fn8000_000A_EDX[30].
Document for the Idle HLT intercept feature is available at [1].
[1]: AMD64 Architecture Programmer's Manual Pub. 24593, April 2024, Vol 2, 15.9 Instruction Intercepts (Table 15-7: IDLE_HLT). https://bugzilla.kernel.org/attachment.cgi?id=306250
Testing Done:
- Added a selftest to test the Idle HLT intercept functionality.
- Compile and functionality testing for the Idle HLT intercept selftest are only done for x86_64.
- Tested SEV and SEV-ES guest for the Idle HLT intercept functionality.
v2 -> v3
- Incorporated Andrew's suggestion to structure vcpu_stat_types in a way that each architecture can share the generic types and also provide its own.
v1 -> v2
- Done changes in svm_idle_hlt_test based on the review comments from Sean.
- Added an enum based approach to get binary stats in vcpu_get_stat() which doesn't use string to get stat data based on the comments from Sean.
- Added safe_halt() and cli() helpers based on the comments from Sean.
Manali Shukla (5):
  x86/cpufeatures: Add CPUID feature bit for Idle HLT intercept
  KVM: SVM: Add Idle HLT intercept support
  KVM: selftests: Add safe_halt() and cli() helpers to common code
  KVM: selftests: Add an interface to read the data of named vcpu stat
  KVM: selftests: KVM: SVM: Add Idle HLT intercept test
 arch/x86/include/asm/cpufeatures.h            |  1 +
 arch/x86/include/asm/svm.h                    |  1 +
 arch/x86/include/uapi/asm/svm.h               |  2 +
 arch/x86/kvm/svm/svm.c                        | 11 ++-
 tools/testing/selftests/kvm/Makefile          |  1 +
 .../testing/selftests/kvm/include/kvm_util.h  | 44 +++++++++
 .../kvm/include/x86_64/kvm_util_arch.h        | 40 +++++++++
 .../selftests/kvm/include/x86_64/processor.h  | 18 ++++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 32 +++++++
 .../selftests/kvm/x86_64/svm_idle_hlt_test.c  | 89 +++++++++++++++++++
 10 files changed, 236 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/svm_idle_hlt_test.c
base-commit: d91a9cc16417b8247213a0144a1f0fd61dc855dd
On Tue, Jun 04, 2024, Manali Shukla wrote:
On 5/28/2024 3:52 PM, Paolo Bonzini wrote:
Does this have an effect on the number of vmexits for KVM, unless AVIC is enabled?
Ah, I suspect it will (as Manali's trace shows), because KVM will pend a V_INTR (V_IRQ in KVM's world) in order to detect the interrupt window. And while KVM will still exit on the V_INTR, it'll avoid an exit on HLT.
Of course, we could (should?) address that in KVM by clearing the V_INTR (and its intercept) when there are no pending, injectable IRQs at the end of kvm_check_and_inject_events(). VMX would benefit from that change as well.
I think it's just this? Because enabling an IRQ window for userspace happens after this.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index af6c8cf6a37a..373c850cc325 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10556,9 +10556,11 @@ static int kvm_check_and_inject_events(struct kvm_vcpu *vcpu,
 				WARN_ON(kvm_x86_call(interrupt_allowed)(vcpu, true) < 0);
 			}
 		}
-		if (kvm_cpu_has_injectable_intr(vcpu))
-			kvm_x86_call(enable_irq_window)(vcpu);
 	}
+	if (kvm_cpu_has_injectable_intr(vcpu))
+		kvm_x86_call(enable_irq_window)(vcpu);
+	else
+		kvm_x86_call(disable_irq_window)(vcpu);
if (is_guest_mode(vcpu) && kvm_x86_ops.nested_ops->has_events &&
Snippet of the test case:

+static void idle_hlt_test(void)
+{
+	x = 0;
+	cli();
+	apic_self_ipi(IPI_TEST_VECTOR);
+	safe_halt();
+	if (x != 1)
+		printf("%d", x);
+}
This isn't very representative of real world behavior. In practice, the window for a wake event to arrive between CLI and STI;HLT is quite small, i.e. having a V_INTR (or V_NMI) pending when HLT is executed is fairly uncommon.
A more compelling benchmark would be something like a netperf latency test.
I honestly don't know how high of a bar we should set for this feature. On one hand, it's a tiny amount of enabling. On the other hand, it would be extremely unfortunate if this somehow caused latency/throughput regressions, which seems highly improbable, but never say never...
Hi Sean,
Thank you for reviewing my patches. Sorry for the delay in response.
On 8/13/2024 9:49 PM, Sean Christopherson wrote:
On Tue, Jun 04, 2024, Manali Shukla wrote:
On 5/28/2024 3:52 PM, Paolo Bonzini wrote:
Does this have an effect on the number of vmexits for KVM, unless AVIC is enabled?
Ah, I suspect it will (as Manali's trace shows), because KVM will pend a V_INTR (V_IRQ in KVM's world) in order to detect the interrupt window. And while KVM will still exit on the V_INTR, it'll avoid an exit on HLT.
Of course, we could (should?) address that in KVM by clearing the V_INTR (and its intercept) when there are no pending, injectable IRQs at the end of kvm_check_and_inject_events(). VMX would benefit from that change as well.
I think it's just this? Because enabling an IRQ window for userspace happens after this.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index af6c8cf6a37a..373c850cc325 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10556,9 +10556,11 @@ static int kvm_check_and_inject_events(struct kvm_vcpu *vcpu,
 				WARN_ON(kvm_x86_call(interrupt_allowed)(vcpu, true) < 0);
 			}
 		}
-		if (kvm_cpu_has_injectable_intr(vcpu))
-			kvm_x86_call(enable_irq_window)(vcpu);
 	}
+	if (kvm_cpu_has_injectable_intr(vcpu))
+		kvm_x86_call(enable_irq_window)(vcpu);
+	else
+		kvm_x86_call(disable_irq_window)(vcpu);
if (is_guest_mode(vcpu) && kvm_x86_ops.nested_ops->has_events &&
IIUC, this is already addressed in [2].
Snippet of the test case:

+static void idle_hlt_test(void)
+{
+	x = 0;
+	cli();
+	apic_self_ipi(IPI_TEST_VECTOR);
+	safe_halt();
+	if (x != 1)
+		printf("%d", x);
+}
This isn't very representative of real world behavior. In practice, the window for a wake event to arrive between CLI and STI;HLT is quite small, i.e. having a V_INTR (or V_NMI) pending when HLT is executed is fairly uncommon.
A more compelling benchmark would be something like a netperf latency test.
I honestly don't know how high of a bar we should set for this feature. On one hand, it's a tiny amount of enabling. On the other hand, it would be extremely unfortunate if this somehow caused latency/throughput regressions, which seems highly improbable, but never say never...
I have added netperf data for normal guest and nested guest in V4 [1].
[1]: https://lore.kernel.org/kvm/20241022054810.23369-1-manali.shukla@amd.com/T/#...
[2]: https://lore.kernel.org/all/20240802195120.325560-1-seanjc@google.com/
- Manali