Misbehaving guests can cause bus locks that degrade the performance of a system. Locked RMW (read-modify-write) instructions that are misaligned or that target non-WB (write-back) memory are referred to as "bus locks"; they require system-wide synchronization among all processors to guarantee atomicity, and can therefore impose notable performance penalties on every processor in the system.
Support for the Bus Lock Threshold is indicated by CPUID Fn8000_000A_EDX[29] BusLockThreshold=1. On processors that support it, the VMCB provides a Bus Lock Threshold enable bit and an unsigned 16-bit Bus Lock Threshold count.
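For reference, the bit can be probed from user space with a small CPUID check; a minimal sketch assuming GCC/Clang's <cpuid.h>:

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* CPUID Fn8000_000A reports SVM features; EDX[29] is BusLockThreshold. */
		if (!__get_cpuid(0x8000000A, &eax, &ebx, &ecx, &edx))
			return 1;

		printf("BusLockThreshold: %s\n", (edx & (1u << 29)) ? "yes" : "no");
		return 0;
	}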
VMCB intercept bit:

  VMCB Offset   Bits   Function
  14h           5      Intercept bus lock operations

Bus lock threshold count:

  VMCB Offset   Bits   Function
  120h          15:0   Bus lock counter
During VMRUN, the bus lock threshold count is fetched and stored in an internal count register. Before executing a bus lock within the guest, the processor checks the count in that register. If the count is greater than zero, the processor executes the bus lock and decrements the count. If the count is zero, the bus lock is not executed, and a Bus Lock Threshold #VMEXIT is instead triggered to transfer control to the Virtual Machine Monitor (VMM).
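Conceptually, the per-bus-lock check behaves like the following (an illustrative sketch of the hardware behavior described above, not actual hardware or kernel code):

	/* Internal count register, loaded from the VMCB at VMRUN. */
	if (bus_lock_count > 0) {
		bus_lock_count--;		/* execute the bus lock */
	} else {
		vmexit(VMEXIT_BUSLOCK);		/* fault-like: RIP still points at the instruction */
	}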
A Bus Lock Threshold #VMEXIT is reported to the VMM with VMEXIT code 0xA5, VMEXIT_BUSLOCK. EXITINFO1 and EXITINFO2 are set to 0 on a VMEXIT_BUSLOCK. On #VMEXIT, the processor writes the current value of the Bus Lock Threshold Counter back to the VMCB.
More details about the Bus Lock Threshold feature can be found in AMD APM [1].
v2 -> v3:
- Drop patch to add virt tag in /proc/cpuinfo.
- Incorporated Tom's review comments.
v1 -> v2:
- Incorporated misc review comments from Sean.
- Removed bus_lock_counter module parameter.
- Set the value of bus_lock_counter to zero by default and reload it to '1' in the bus lock exit handler.
- Added documentation for the behavioral difference for KVM_EXIT_BUS_LOCK.
- Improved the buslock selftest to work on both SVM and VMX.
- Rewrote the commit messages.
Patches are prepared on kvm-next/next (efbc6bd090f4).
Testing done:
- Added a selftest for the Bus Lock Threshold functionality.
- The bus lock threshold selftest has been tested on both Intel and AMD platforms.
- Tested the Bus Lock Threshold functionality on SEV, SEV-ES, and SEV-SNP guests.
- Tested the Bus Lock Threshold functionality on nested guests.
v1: https://lore.kernel.org/kvm/20240709175145.9986-4-manali.shukla@amd.com/T/
v2: https://lore.kernel.org/kvm/20241001063413.687787-4-manali.shukla@amd.com/T/
[1]: AMD64 Architecture Programmer's Manual Pub. 24593, April 2024, Vol 2, 15.14.5 Bus Lock Threshold. https://bugzilla.kernel.org/attachment.cgi?id=306250
Manali Shukla (2):
  x86/cpufeatures: Add CPUID feature bit for the Bus Lock Threshold
  KVM: X86: Add documentation about behavioral difference for KVM_EXIT_BUS_LOCK

Nikunj A Dadhania (2):
  KVM: SVM: Enable Bus lock threshold exit
  KVM: selftests: Add bus lock exit test
 Documentation/virt/kvm/api.rst              |   4 +
 arch/x86/include/asm/cpufeatures.h          |   1 +
 arch/x86/include/asm/svm.h                  |   5 +-
 arch/x86/include/uapi/asm/svm.h             |   2 +
 arch/x86/kvm/svm/nested.c                   |  10 ++
 arch/x86/kvm/svm/svm.c                      |  27 ++++
 tools/testing/selftests/kvm/Makefile        |   1 +
 .../selftests/kvm/x86_64/kvm_buslock_test.c | 130 ++++++++++++++++++
 8 files changed, 179 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_buslock_test.c
base-commit: efbc6bd090f48ccf64f7a8dd5daea775821d57ec
Misbehaving guests can cause bus locks to degrade the performance of the system. The Bus Lock Threshold feature can be used to address this issue by giving the hypervisor the ability to limit a guest's capacity to generate bus locks, thereby preventing system slowdown due to the resulting performance penalties.
When the Bus Lock Threshold feature is enabled, the processor checks the bus lock threshold count before executing a bus lock and decides whether to trigger a bus lock exit.

A bus lock threshold count of '0' generates a bus lock exit; if the count is greater than '0', the bus lock is executed successfully and the threshold count is decremented.
Presence of the Bus Lock Threshold feature is indicated via CPUID function 0x8000000A_EDX[29].
Signed-off-by: Manali Shukla manali.shukla@amd.com
---
 arch/x86/include/asm/cpufeatures.h | 1 +
 1 file changed, 1 insertion(+)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index dd4682857c12..77fa8e743ccc 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -382,6 +382,7 @@
 #define X86_FEATURE_V_SPEC_CTRL		(15*32+20) /* "v_spec_ctrl" Virtual SPEC_CTRL */
 #define X86_FEATURE_VNMI		(15*32+25) /* "vnmi" Virtual NMI */
 #define X86_FEATURE_SVME_ADDR_CHK	(15*32+28) /* SVME addr check */
+#define X86_FEATURE_BUS_LOCK_THRESHOLD	(15*32+29) /* "buslock" Bus lock threshold */

 /* Intel-defined CPU features, CPUID level 0x00000007:0 (ECX), word 16 */
 #define X86_FEATURE_AVX512VBMI		(16*32+ 1) /* "avx512vbmi" AVX512 Vector Bit Manipulation instructions */
base-commit: efbc6bd090f48ccf64f7a8dd5daea775821d57ec
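For reference, a bus lock can be provoked from guest code with a LOCK-prefixed RMW on an operand that straddles a cache-line (here, page) boundary, which is the approach the selftest later in this series takes. A self-contained sketch (buffer layout and names are illustrative only):

	/* Two pages; place a 4-byte counter across the first page boundary. */
	static unsigned char buf[8192] __attribute__((aligned(4096)));

	static void split_lock_add(void)
	{
		int *p = (int *)(buf + 4096 - 2);	/* misaligned on purpose */

		/* Misaligned LOCK-prefixed RMW -> bus lock. */
		asm volatile("lock addl %1, %0" : "+m" (*p) : "ir" (1) : "memory");
	}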
On October 4, 2024 7:33:38 AM GMT+02:00, Manali Shukla manali.shukla@amd.com wrote:
...

+#define X86_FEATURE_BUS_LOCK_THRESHOLD	(15*32+29) /* "buslock" Bus lock threshold */
Why does this feature flag need to be visible in /proc/cpuinfo?
On 10/6/2024 11:56 AM, Borislav Petkov wrote:
On October 4, 2024 7:33:38 AM GMT+02:00, Manali Shukla manali.shukla@amd.com wrote:
...

+#define X86_FEATURE_BUS_LOCK_THRESHOLD	(15*32+29) /* "buslock" Bus lock threshold */
Why does this feature flag need to be visible in /proc/cpuinfo?
This was already discussed in [v1] of the same series but never concluded. As suggested in the discussion [v1], I have added "buslock" to be enumerated in /proc/cpuinfo. I really don't have any strong opinion about adding or removing it from /proc/cpuinfo, so I would let the maintainers decide how to go about it.
v1: https://lore.kernel.org/kvm/76355a11-a0ba-4a28-bf51-454facfd59e5@amd.com/T/#...
-Manali
From: Nikunj A Dadhania nikunj@amd.com
Virtual machines can exploit bus locks to degrade the performance of the system. Bus locks are caused by locked RMW (read-modify-write) instructions that are misaligned or that target non-WB (write-back) memory, and they require system-wide synchronization among all processors, which can result in significant performance penalties.

To address this issue, the Bus Lock Threshold feature is introduced to give the hypervisor the ability to restrict a guest's capacity to initiate multiple bus locks, thereby preventing system slowdowns.
Support for the Bus Lock Threshold is indicated via CPUID function 0x8000000A_EDX[29].
On the processors that support the Bus Lock Threshold feature, the VMCB provides a Bus Lock Threshold enable bit and an unsigned 16-bit Bus Lock threshold count.
VMCB intercept bit:

  VMCB Offset   Bits   Function
  14h           5      Intercept bus lock operations

Bus lock threshold count:

  VMCB Offset   Bits   Function
  120h          15:0   Bus lock counter
When a VMRUN instruction is executed, the bus lock threshold count is loaded into an internal count register. Before the processor executes a bus lock in the guest, it checks the value of this register:
- If the value is greater than '0', the processor successfully executes the bus lock and decrements the count.
- If the value is '0', the bus lock is not executed, and a #VMEXIT to the VMM is taken.
The bus lock threshold #VMEXIT is reported to the VMM with the VMEXIT code A5h, SVM_EXIT_BUS_LOCK.
This implementation is set up to intercept every guest bus lock. The initial value of the Bus Lock Threshold counter is '0' in the VMCB, so the very first bus lock will exit, set the Bus Lock Threshold counter to '1' (so that the offending instruction can make forward progress), but then the value is at '0' again so the next bus lock will also exit.

The generated SVM_EXIT_BUS_LOCK in kvm will exit to user space by setting the KVM_RUN_X86_BUS_LOCK flag in vcpu->run->flags, indicating to the user space that a bus lock has been detected in the guest.
Use the already available KVM capability KVM_CAP_X86_BUS_LOCK_EXIT to enable the feature. The feature is disabled by default; it can be enabled explicitly from user space.
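For completeness, a minimal user space sketch of enabling the capability on a VM (vm_fd is assumed to be an open KVM VM file descriptor from KVM_CREATE_VM; error handling omitted):

	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	static int enable_bus_lock_exit(int vm_fd)
	{
		struct kvm_enable_cap cap = {
			.cap = KVM_CAP_X86_BUS_LOCK_EXIT,
			.args[0] = KVM_BUS_LOCK_DETECTION_EXIT,
		};

		return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
	}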
More details about the Bus Lock Threshold feature can be found in AMD APM [1].
[1]: AMD64 Architecture Programmer's Manual Pub. 24593, April 2024, Vol 2, 15.14.5 Bus Lock Threshold. https://bugzilla.kernel.org/attachment.cgi?id=306250
Signed-off-by: Nikunj A Dadhania nikunj@amd.com
Co-developed-by: Manali Shukla manali.shukla@amd.com
Signed-off-by: Manali Shukla manali.shukla@amd.com
---
 arch/x86/include/asm/svm.h      |  5 ++++-
 arch/x86/include/uapi/asm/svm.h |  2 ++
 arch/x86/kvm/svm/nested.c       | 10 ++++++++++
 arch/x86/kvm/svm/svm.c          | 27 +++++++++++++++++++++++++++
 4 files changed, 43 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 2b59b9951c90..d1819c564b1c 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -116,6 +116,7 @@ enum {
 	INTERCEPT_INVPCID,
 	INTERCEPT_MCOMMIT,
 	INTERCEPT_TLBSYNC,
+	INTERCEPT_BUSLOCK,
 };

@@ -158,7 +159,9 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 	u64 avic_physical_id;	/* Offset 0xf8 */
 	u8 reserved_7[8];
 	u64 vmsa_pa;		/* Used for an SEV-ES guest */
-	u8 reserved_8[720];
+	u8 reserved_8[16];
+	u16 bus_lock_counter;	/* Offset 0x120 */
+	u8 reserved_9[702];
 	/*
 	 * Offset 0x3e0, 32 bytes reserved
 	 * for use by hypervisor/software.
diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
index 1814b413fd57..abf6aed88cee 100644
--- a/arch/x86/include/uapi/asm/svm.h
+++ b/arch/x86/include/uapi/asm/svm.h
@@ -95,6 +95,7 @@
 #define SVM_EXIT_CR14_WRITE_TRAP	0x09e
 #define SVM_EXIT_CR15_WRITE_TRAP	0x09f
 #define SVM_EXIT_INVPCID		0x0a2
+#define SVM_EXIT_BUS_LOCK		0x0a5
 #define SVM_EXIT_NPF			0x400
 #define SVM_EXIT_AVIC_INCOMPLETE_IPI	0x401
 #define SVM_EXIT_AVIC_UNACCELERATED_ACCESS	0x402
@@ -224,6 +225,7 @@
 	{ SVM_EXIT_CR4_WRITE_TRAP,	"write_cr4_trap" }, \
 	{ SVM_EXIT_CR8_WRITE_TRAP,	"write_cr8_trap" }, \
 	{ SVM_EXIT_INVPCID,	"invpcid" }, \
+	{ SVM_EXIT_BUS_LOCK,	"buslock" }, \
 	{ SVM_EXIT_NPF,		"npf" }, \
 	{ SVM_EXIT_AVIC_INCOMPLETE_IPI,		"avic_incomplete_ipi" }, \
 	{ SVM_EXIT_AVIC_UNACCELERATED_ACCESS,	"avic_unaccelerated_access" }, \
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index d5314cb7dff4..ca1c42201894 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -669,6 +669,11 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
 	vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa;
 	vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa;

+	/*
+	 * The bus lock threshold count should keep running across nested
+	 * transitions. Copy the buslock threshold count from vmcb01 to vmcb02.
+	 */
+	vmcb02->control.bus_lock_counter = vmcb01->control.bus_lock_counter;
+
 	/* Done at vmrun: asid. */

 	/* Also overwritten later if necessary. */
@@ -1035,6 +1040,11 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 	}

+	/*
+	 * The bus lock threshold count should keep running across nested
+	 * transitions. Copy the buslock threshold count from vmcb02 to vmcb01.
+	 */
+	vmcb01->control.bus_lock_counter = vmcb02->control.bus_lock_counter;
+
 	nested_svm_copy_common_state(svm->nested.vmcb02.ptr, svm->vmcb01.ptr);

 	svm_switch_vmcb(svm, &svm->vmcb01);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9df3e1e5ae81..0d0407f1ee6a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1372,6 +1372,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
 		svm->vmcb->control.int_ctl |= V_GIF_ENABLE_MASK;
 	}

+	if (vcpu->kvm->arch.bus_lock_detection_enabled)
+		svm_set_intercept(svm, INTERCEPT_BUSLOCK);
+
 	if (sev_guest(vcpu->kvm))
 		sev_init_vmcb(svm);

@@ -3286,6 +3289,24 @@ static int invpcid_interception(struct kvm_vcpu *vcpu)
 	return kvm_handle_invpcid(vcpu, type, gva);
 }

+static int bus_lock_exit(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK;
+	vcpu->run->flags |= KVM_RUN_X86_BUS_LOCK;
+
+	/*
+	 * Reload the counter with value greater than '0'.
+	 * The bus lock exit on SVM happens with RIP pointing to the guilty
+	 * instruction. So, reloading the value of bus_lock_counter to '0'
+	 * results into generating continuous bus lock exits.
+	 */
+	svm->vmcb->control.bus_lock_counter = 1;
+
+	return 0;
+}
+
 static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[SVM_EXIT_READ_CR0]			= cr_interception,
 	[SVM_EXIT_READ_CR3]			= cr_interception,
@@ -3353,6 +3374,7 @@ static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[SVM_EXIT_CR4_WRITE_TRAP]		= cr_trap,
 	[SVM_EXIT_CR8_WRITE_TRAP]		= cr_trap,
 	[SVM_EXIT_INVPCID]			= invpcid_interception,
+	[SVM_EXIT_BUS_LOCK]			= bus_lock_exit,
 	[SVM_EXIT_NPF]				= npf_interception,
 	[SVM_EXIT_RSM]				= rsm_interception,
 	[SVM_EXIT_AVIC_INCOMPLETE_IPI]		= avic_incomplete_ipi_interception,
@@ -5227,6 +5249,11 @@ static __init void svm_set_cpu_caps(void)
 		kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
 	}
+	if (cpu_feature_enabled(X86_FEATURE_BUS_LOCK_THRESHOLD)) {
+		pr_info("Bus Lock Threshold supported\n");
+		kvm_caps.has_bus_lock_exit = true;
+	}
+
 	/* CPUID 0x80000008 */
 	if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) ||
 	    boot_cpu_has(X86_FEATURE_AMD_SSBD))
On Fri, Oct 04, 2024, Manali Shukla wrote:
When a VMRUN instruction is executed, the bus lock threshold count is loaded into an internal count register. Before the processor executes a bus lock in the guest, it checks the value of this register:
- If the value is greater than '0', the processor successfully executes the bus lock and decrements the count.

- If the value is '0', the bus lock is not executed, and a #VMEXIT to the VMM is taken.
The bus lock threshold #VMEXIT is reported to the VMM with the VMEXIT code A5h, SVM_EXIT_BUS_LOCK.
I vote to split this into two patches: one to add the architectural collateral, with the above as the changelog, and a second to actually implement support in KVM. Having the above background is useful, but it makes it quite hard to find information on the KVM design and implementation.
This implementation is set up to intercept every guest bus lock.
"This implementation" is a variation on "This patch". Drop it, and simply state what the patch is doing.
initial value of the Bus Lock Threshold counter is '0' in the VMCB, so the very first bus lock will exit, set the Bus Lock Threshold counter to '1' (so that the offending instruction can make forward progress) but then the value is at '0' again so the next bus lock will exit.
The generated SVM_EXIT_BUS_LOCK in kvm will exit to user space by
s/kvm/KVM
And again, the tone is wrong.
Something like this is what I would aim for:
Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs with Bus Lock Threshold, which is close enough to VMX's Bus Lock Detection VM-Exit to allow reusing KVM_CAP_X86_BUS_LOCK_EXIT.
The biggest difference between the two features is that Threshold is fault-like, whereas Detection is trap-like. To allow the guest to make forward progress, Threshold provides a per-VMCB counter which is decremented every time a bus lock occurs, and a VM-Exit is triggered if and only if the counter is '0'.
To provide Detection-like semantics, initialize the counter to '0', i.e. exit on every bus lock, and when re-executing the guilty instruction, set the counter to '1' to effectively step past the instruction.
Note, in the unlikely scenario that re-executing the instruction doesn't trigger a bus lock, e.g. because the guest has changed memory types or patched the guilty instruction, the bus lock counter will be left at '1', i.e. the guest will be able to do a bus lock on a different instruction. In a perfect world, KVM would ensure the counter is '0' if the guest has made forward progress, e.g. if RIP has changed. But trying to close that hole would incur non-trivial complexity, for marginal benefit; the intent of KVM_CAP_X86_BUS_LOCK_EXIT is to allow userspace to rate-limit bus locks, not to allow for precise detection of problematic guest code. And, it's simply not feasible to fully close the hole, e.g. if an interrupt arrives before the original instruction can re-execute, the guest could step past a different bus lock.
setting the KVM_RUN_X86_BUS_LOCK flag in vcpu->run->flags, indicating to the user space that a bus lock has been detected in the guest.
Use the already available KVM capability KVM_CAP_X86_BUS_LOCK_EXIT to enable the feature. This feature is disabled by default, it can be enabled explicitly from user space.
More details about the Bus Lock Threshold feature can be found in AMD APM [1].
...
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index d5314cb7dff4..ca1c42201894 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -669,6 +669,11 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
 	vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa;
 	vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa;

+	/*
+	 * The bus lock threshold count should keep running across nested
+	 * transitions. Copy the buslock threshold count from vmcb01 to vmcb02.
No, it should be reset to '0'. The bus lock can't have been on VMRUN, because KVM is emulating the VMRUN. That leaves two possibilities: the bus lock happened in L1 on an instruction before VMRUN, or the bus lock happened in _an_ L2, before a nested VM-Exit to L1 occurred.
In the first case, the bus lock firmly happened on an instruction in the past. Even if vmcb01's counter is still '1', e.g. because the guilty instruction got patched, the vCPU has clearly made forward progress and so KVM should reset vmcb02's counter to '0'.
In the second case, KVM has no idea if L2 has made forward progress. The only way to _try to_ detect if L2 has made forward progress would be to track the counter on a per-vmcb12 basis, but even that is flawed because KVM doesn't have visibility into L1's management of L2.

I do think we may need to stash vmcb02's counter though. E.g. if userspace rate-limits the vCPU, then it's entirely possible that L1's tick interrupt is pending by the time userspace re-runs the vCPU. If KVM unconditionally clears the counter on VMRUN, then when L1 re-enters L2, the same instruction will trigger a VM-Exit and the entire cycle starts over.
Anything we do is going to be imperfect, but the best idea I can come up with is also relatively simple, especially in conjunction with my suggestion below. If KVM tracks the RIP that triggered the bus lock, then on nested VM-Exit KVM can propagate that RIP into svm_nested_state as appropriate. E.g.
	if (vmcb02->control.bus_lock_counter && svm->bus_lock_rip == vmcb02->save.rip)
		svm->nested.bus_lock_rip = svm->bus_lock_rip;
	else
		svm->nested.bus_lock_rip = INVALID_GVA;	/* or '0', as much as that bugs me */
and then on nested VMRUN
	if (svm->nested.bus_lock_rip == vmcb02->save.rip) {
		vmcb02->control.bus_lock_counter = 1;
		svm->bus_lock_rip = svm->nested.bus_lock_rip;
	} else {
		vmcb02->control.bus_lock_counter = 0;
	}

	svm->nested.bus_lock_rip = INVALID_GVA;
Again, it's imperfect, e.g. if L1 happens to run a different vCPU at the same RIP, then KVM will allow a bus lock for the wrong vCPU. But the odds of that happening are absurdly low unless L1 is deliberately doing weird things, and so I don't think it's worth worrying about.
+	 */
+	vmcb02->control.bus_lock_counter = vmcb01->control.bus_lock_counter;
+
 	/* Done at vmrun: asid. */

 	/* Also overwritten later if necessary. */
@@ -1035,6 +1040,11 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 	}
+	/*
+	 * The bus lock threshold count should keep running across nested
+	 * transitions. Copy the buslock threshold count from vmcb02 to vmcb01.
+	 */
+	vmcb01->control.bus_lock_counter = vmcb02->control.bus_lock_counter;
KVM should definitely reset the counter to '0' on a nested VM-Exit, because L1 can't be interrupted by L2, i.e. there is zero chance that KVM needs to allow a bus lock in L1 to ensure L1 makes forward progress.
 	nested_svm_copy_common_state(svm->nested.vmcb02.ptr, svm->vmcb01.ptr);

 	svm_switch_vmcb(svm, &svm->vmcb01);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9df3e1e5ae81..0d0407f1ee6a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1372,6 +1372,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
 		svm->vmcb->control.int_ctl |= V_GIF_ENABLE_MASK;
 	}

+	if (vcpu->kvm->arch.bus_lock_detection_enabled)
+		svm_set_intercept(svm, INTERCEPT_BUSLOCK);
+
 	if (sev_guest(vcpu->kvm))
 		sev_init_vmcb(svm);

@@ -3286,6 +3289,24 @@ static int invpcid_interception(struct kvm_vcpu *vcpu)
 	return kvm_handle_invpcid(vcpu, type, gva);
 }

+static int bus_lock_exit(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK;
+	vcpu->run->flags |= KVM_RUN_X86_BUS_LOCK;
+
+	/*
+	 * Reload the counter with value greater than '0'.
The value quite obviously must be exactly '1', not simply greater than '0'. I also think this is the wrong place to set the counter. Rather than set the counter at the time of exit, KVM should implement a vcpu->arch.complete_userspace_io callback and set the counter to '1' if and only if RIP (or LIP, but I have no objection to keeping things simple) is unchanged. It's a bit of extra complexity, but it will make it super obvious why KVM is setting the counter to '1'. And, if userspace wants to stuff state and move past the instruction, e.g. by emulating the guilty instruction, then KVM won't unnecessarily allow a bus lock in the guest.
And then the comment can be:
	/*
	 * If userspace has NOT changed RIP, then KVM's ABI is to let the guest
	 * execute the bus-locking instruction. Set the bus lock counter to '1'
	 * to effectively step past the bus lock.
	 */
+	 * The bus lock exit on SVM happens with RIP pointing to the guilty
+	 * instruction. So, reloading the value of bus_lock_counter to '0'
+	 * results into generating continuous bus lock exits.
+	 */
+	svm->vmcb->control.bus_lock_counter = 1;
+
+	return 0;
+}
+
 static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[SVM_EXIT_READ_CR0]			= cr_interception,
 	[SVM_EXIT_READ_CR3]			= cr_interception,
@@ -3353,6 +3374,7 @@ static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[SVM_EXIT_CR4_WRITE_TRAP]		= cr_trap,
 	[SVM_EXIT_CR8_WRITE_TRAP]		= cr_trap,
 	[SVM_EXIT_INVPCID]			= invpcid_interception,
+	[SVM_EXIT_BUS_LOCK]			= bus_lock_exit,
 	[SVM_EXIT_NPF]				= npf_interception,
 	[SVM_EXIT_RSM]				= rsm_interception,
 	[SVM_EXIT_AVIC_INCOMPLETE_IPI]		= avic_incomplete_ipi_interception,
@@ -5227,6 +5249,11 @@ static __init void svm_set_cpu_caps(void)
 		kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
 	}

+	if (cpu_feature_enabled(X86_FEATURE_BUS_LOCK_THRESHOLD)) {
+		pr_info("Bus Lock Threshold supported\n");
+		kvm_caps.has_bus_lock_exit = true;
+	}
+
 	/* CPUID 0x80000008 */
 	if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) ||
 	    boot_cpu_has(X86_FEATURE_AMD_SSBD))
--
2.34.1
Hi Sean,
Thanks for reviewing my patches.
On 10/15/2024 11:19 PM, Sean Christopherson wrote:
On Fri, Oct 04, 2024, Manali Shukla wrote:
...
I vote to split this into two patches: one to add the architectural collateral, with the above as the changelog, and a second to actually implement support in KVM. Having the above background is useful, but it makes it quite hard to find information on the KVM design and implementation.
Sure. I will split it into 2 patches.
...
Thank you for highlighting these scenarios (for nested guest and normal guests). I had not thought about them. I’m currently going through the comments and trying to fully understand them. I’ll try out the suggested changes and get back to you.
- Manali
On 10/15/2024 11:19 PM, Sean Christopherson wrote:
On Fri, Oct 04, 2024, Manali Shukla wrote:
...
+static int bus_lock_exit(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK;
+	vcpu->run->flags |= KVM_RUN_X86_BUS_LOCK;
+
+	/*
+	 * Reload the counter with value greater than '0'.
The value quite obviously must be exactly '1', not simply greater than '0'. I also think this is the wrong place to set the counter. Rather than set the counter at the time of exit, KVM should implement a vcpu->arch.complete_userspace_io callback and set the counter to '1' if and only if RIP (or LIP, but I have no objection to keeping things simple) is unchanged. It's a bit of extra complexity, but it will make it super obvious why KVM is setting the counter to '1'. And, if userspace wants to stuff state and move past the instruction, e.g. by emulating the guilty instruction, then KVM won't unnecessarily allow a bus lock in the guest.
And then the comment can be:
	/*
	 * If userspace has NOT changed RIP, then KVM's ABI is to let the guest
	 * execute the bus-locking instruction. Set the bus lock counter to '1'
	 * to effectively step past the bus lock.
	 */
The Bus Lock Threshold intercept feature is available for SEV-ES and SEV-SNP guests too. The RIP where the bus lock exit occurred is not available in the bus_lock_exit handler for SEV-ES and SEV-SNP guests, so the above-mentioned solution won't work with them.

I would propose to add the above-mentioned solution only for normal and SEV guests, and to unconditionally reload bus_lock_counter to '1' in complete_userspace_io for SEV-ES and SEV-SNP guests.
Any thoughts?
-Manali
On Sun, Nov 03, 2024, Manali Shukla wrote:
On 10/15/2024 11:19 PM, Sean Christopherson wrote:
On Fri, Oct 04, 2024, Manali Shukla wrote:
...
The Bus Lock Threshold intercept feature is available for SEV-ES and SEV-SNP guests too. The RIP where the bus lock exit occurred is not available in the bus_lock_exit handler for SEV-ES and SEV-SNP guests, so the above-mentioned solution won't work with them.

I would propose to add the above-mentioned solution only for normal and SEV guests, and to unconditionally reload bus_lock_counter to '1' in complete_userspace_io for SEV-ES and SEV-SNP guests.
Yeah, that works. Though I would condition the check on guest_state_protected. Actually, and this is going to seem really stupid, but everything will Just Work if you use kvm_get_linear_rip() and kvm_is_linear_rip(), because kvm_get_linear_rip() returns '0' for vCPUs with protected state. I.e. KVM will do a rather superfluous cui() callback, but otherwise it's fine. Silly, but in many ways preferable to special casing ES and SNP guests.
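A minimal sketch of how the completion callback could look with that trick (function and field names are assumed here, per the suggestions above, and untested):

	static int complete_bus_lock_exit(struct kvm_vcpu *vcpu)
	{
		/*
		 * If userspace did not move RIP, let the guest re-execute the
		 * bus-locking instruction by allowing exactly one bus lock.
		 * kvm_get_linear_rip() returns '0' for vCPUs with protected
		 * state, so SEV-ES/SEV-SNP guests always take this path.
		 */
		if (kvm_is_linear_rip(vcpu, vcpu->arch.cui_linear_rip))
			to_svm(vcpu)->vmcb->control.bus_lock_counter = 1;

		return 1;
	}

	static int bus_lock_exit(struct kvm_vcpu *vcpu)
	{
		vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK;
		vcpu->run->flags |= KVM_RUN_X86_BUS_LOCK;

		vcpu->arch.cui_linear_rip = kvm_get_linear_rip(vcpu);
		vcpu->arch.complete_userspace_io = complete_bus_lock_exit;

		return 0;
	}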
On a related topic, can you add a refactoring prep patch to move linear_rip out of kvm_pio_request and place it next to complete_userspace_io? There's nothing port I/O specific about that field, it just so happens that port I/O is the only case where KVM's ABI is to let userspace stuff state (to emulate RESET) without first completing the I/O instruction.
I.e.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8e8ca6dab2b2..8617b15096a6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -406,7 +406,6 @@ struct kvm_rmap_head {
 };

 struct kvm_pio_request {
-	unsigned long linear_rip;
 	unsigned long count;
 	int in;
 	int port;
@@ -884,6 +883,7 @@ struct kvm_vcpu_arch {
 	bool emulate_regs_need_sync_to_vcpu;
 	bool emulate_regs_need_sync_from_vcpu;
 	int (*complete_userspace_io)(struct kvm_vcpu *vcpu);
+	unsigned long cui_linear_rip;

 	gpa_t time;
 	struct pvclock_vcpu_time_info hv_clock;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 425a301911a6..7704d3901481 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9308,7 +9308,7 @@ static int complete_fast_pio_out(struct kvm_vcpu *vcpu)
 {
 	vcpu->arch.pio.count = 0;

-	if (unlikely(!kvm_is_linear_rip(vcpu, vcpu->arch.pio.linear_rip)))
+	if (unlikely(!kvm_is_linear_rip(vcpu, vcpu->arch.cui_linear_rip)))
 		return 1;

 	return kvm_skip_emulated_instruction(vcpu);
@@ -9333,7 +9333,7 @@ static int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size,
 			complete_fast_pio_out_port_0x7e;
 		kvm_skip_emulated_instruction(vcpu);
 	} else {
-		vcpu->arch.pio.linear_rip = kvm_get_linear_rip(vcpu);
+		vcpu->arch.cui_linear_rip = kvm_get_linear_rip(vcpu);
 		vcpu->arch.complete_userspace_io = complete_fast_pio_out;
 	}
 	return 0;
@@ -9346,7 +9346,7 @@ static int complete_fast_pio_in(struct kvm_vcpu *vcpu)
 	/* We should only ever be called with arch.pio.count equal to 1 */
 	BUG_ON(vcpu->arch.pio.count != 1);

-	if (unlikely(!kvm_is_linear_rip(vcpu, vcpu->arch.pio.linear_rip))) {
+	if (unlikely(!kvm_is_linear_rip(vcpu, vcpu->arch.cui_linear_rip))) {
 		vcpu->arch.pio.count = 0;
 		return 1;
 	}
@@ -9375,7 +9375,7 @@ static int kvm_fast_pio_in(struct kvm_vcpu *vcpu, int size,
 		return ret;
 	}

-	vcpu->arch.pio.linear_rip = kvm_get_linear_rip(vcpu);
+	vcpu->arch.cui_linear_rip = kvm_get_linear_rip(vcpu);
 	vcpu->arch.complete_userspace_io = complete_fast_pio_in;

 	return 0;
On 11/5/2024 7:52 AM, Sean Christopherson wrote:
On Sun, Nov 03, 2024, Manali Shukla wrote:
On 10/15/2024 11:19 PM, Sean Christopherson wrote:
On Fri, Oct 04, 2024, Manali Shukla wrote:
...
Yeah, that works. Though I would condition the check on guest_state_protected. Actually, and this is going to seem really stupid, but everything will Just Work if you use kvm_get_linear_rip() and kvm_is_linear_rip(), because kvm_get_linear_rip() returns '0' for vCPUs with protected state. I.e. KVM will do a rather superfluous cui() callback, but otherwise it's fine. Silly, but in many ways preferable to special casing ES and SNP guests.
Ack.
On a related topic, can you add a refactoring prep patch to move linear_rip out of kvm_pio_request and place it next to complete_userspace_io? There's nothing port I/O specific about that field, it just so happens that port I/O is the only case where KVM's ABI is to let userspace stuff state (to emulate RESET) without first completing the I/O instruction.
Sure. I will add this refactoring prep patch with v4.
- Manali
Add a note about the behavioral difference for KVM_EXIT_X86_BUS_LOCK between AMD CPUs and Intel CPUs in the KVM_CAP_X86_BUS_LOCK_EXIT capability documentation.
Signed-off-by: Manali Shukla manali.shukla@amd.com
---
 Documentation/virt/kvm/api.rst | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e32471977d0a..49465323dc62 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7884,6 +7884,10 @@ Note! Detected bus locks may be coincident with other exits to userspace, i.e.
 KVM_RUN_X86_BUS_LOCK should be checked regardless of the primary exit reason if
 userspace wants to take action on all detected bus locks.

+Note! On AMD CPUs, the bus lock exit to user space occurs with RIP pointing at
+the offending instruction. In contrast, on Intel CPUs, the RIP points to the
+instruction right after the guilty one after the bus lock exit to user space.
+
 7.23 KVM_CAP_PPC_DAWR1
 ----------------------
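For reference, a hedged user space sketch of consuming the exit (vcpu_fd, run, and the logging are illustrative only):

	struct kvm_regs regs;

	if (run->exit_reason == KVM_EXIT_X86_BUS_LOCK ||
	    (run->flags & KVM_RUN_X86_BUS_LOCK)) {
		ioctl(vcpu_fd, KVM_GET_REGS, &regs);
		/*
		 * AMD: regs.rip points at the bus-locking instruction.
		 * Intel: regs.rip points at the instruction after it.
		 */
		fprintf(stderr, "bus lock near rip=0x%llx\n", regs.rip);
	}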
x86 (lowercase 'x')
On Fri, Oct 04, 2024, Manali Shukla wrote:
Add a note about the behavioral difference for KVM_EXIT_X86_BUS_LOCK between AMD CPUs and Intel CPUs in the KVM_CAP_X86_BUS_LOCK_EXIT capability documentation.
This belongs in the previous patch, especially if patch 2 is split into arch collateral and KVM implementation. The KVM changes are small enough that I don't see a reason to separate the documentation from the code. And there will likely be enough subtleties to the code that we'll want a bit more documentation.
From: Nikunj A Dadhania nikunj@amd.com
Add a test case to verify the Bus Lock exit feature.

The main thing that the selftest verifies is that, when a bus lock is generated in the guest context, a KVM_EXIT_X86_BUS_LOCK exit is triggered on SVM or VMX when the KVM capability KVM_CAP_X86_BUS_LOCK_EXIT is enabled.

This test case also verifies the Bus Lock exit in a nested scenario.
Signed-off-by: Nikunj A Dadhania nikunj@amd.com
Co-developed-by: Manali Shukla manali.shukla@amd.com
Signed-off-by: Manali Shukla manali.shukla@amd.com
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/x86_64/kvm_buslock_test.c   | 130 ++++++++++++++++++
 2 files changed, 131 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_buslock_test.c
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 960cf6a77198..b2de61e4a213 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -81,6 +81,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/hyperv_svm_test
 TEST_GEN_PROGS_x86_64 += x86_64/hyperv_tlb_flush
 TEST_GEN_PROGS_x86_64 += x86_64/kvm_clock_test
 TEST_GEN_PROGS_x86_64 += x86_64/kvm_pv_test
+TEST_GEN_PROGS_x86_64 += x86_64/kvm_buslock_test
 TEST_GEN_PROGS_x86_64 += x86_64/monitor_mwait_test
 TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
diff --git a/tools/testing/selftests/kvm/x86_64/kvm_buslock_test.c b/tools/testing/selftests/kvm/x86_64/kvm_buslock_test.c
new file mode 100644
index 000000000000..82693520485c
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/kvm_buslock_test.c
@@ -0,0 +1,130 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2024 Advanced Micro Devices, Inc.
+ */
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "svm_util.h"
+#include "vmx.h"
+
+#define NR_ITERATIONS 100
+#define L2_GUEST_STACK_SIZE 64
+
+struct buslock_test {
+	unsigned char pad[PAGE_SIZE - 2];
+	atomic_long_t val;
+} __packed;
+
+struct buslock_test test __aligned(PAGE_SIZE);
+
+static __always_inline void buslock_atomic_add(int i, atomic_long_t *v)
+{
+	asm volatile(LOCK_PREFIX "addl %1,%0"
+		     : "+m" (v->counter)
+		     : "ir" (i) : "memory");
+}
+
+static void buslock_add(void)
+{
+	/*
+	 * Increment a page unaligned variable atomically.
+	 * This should generate a bus lock exit.
+	 */
+	for (int i = 0; i < NR_ITERATIONS; i++)
+		buslock_atomic_add(2, &test.val);
+}
+
+static void l2_guest_code(void)
+{
+	buslock_add();
+	GUEST_DONE();
+}
+
+static void l1_svm_code(struct svm_test_data *svm)
+{
+	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+	struct vmcb *vmcb = svm->vmcb;
+
+	generic_svm_setup(svm, l2_guest_code,
+			  &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+	run_guest(vmcb, svm->vmcb_gpa);
+	GUEST_ASSERT(vmcb->control.exit_code == SVM_EXIT_VMMCALL);
+	GUEST_DONE();
+}
+
+static void l1_vmx_code(struct vmx_pages *vmx)
+{
+	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+
+	GUEST_ASSERT_EQ(prepare_for_vmx_operation(vmx), true);
+	GUEST_ASSERT_EQ(load_vmcs(vmx), true);
+
+	prepare_vmcs(vmx, NULL, &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+
+	GUEST_ASSERT(!vmwrite(GUEST_RIP, (u64)l2_guest_code));
+	GUEST_ASSERT(!vmlaunch());
+
+	GUEST_ASSERT_EQ(vmreadz(VM_EXIT_REASON), EXIT_REASON_VMCALL);
+	GUEST_DONE();
+}
+
+static void guest_code(void *test_data)
+{
+	buslock_add();
+
+	if (this_cpu_has(X86_FEATURE_SVM))
+		l1_svm_code(test_data);
+	else
+		l1_vmx_code(test_data);
+}
+
+int main(int argc, char *argv[])
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_run *run;
+	struct kvm_vm *vm;
+	vm_vaddr_t nested_test_data_gva;
+
+	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM) || kvm_cpu_has(X86_FEATURE_VMX));
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_X86_BUS_LOCK_EXIT));
+
+	vm = vm_create(1);
+	vm_enable_cap(vm, KVM_CAP_X86_BUS_LOCK_EXIT, KVM_BUS_LOCK_DETECTION_EXIT);
+	vcpu = vm_vcpu_add(vm, 0, guest_code);
+
+	if (kvm_cpu_has(X86_FEATURE_SVM))
+		vcpu_alloc_svm(vm, &nested_test_data_gva);
+	else
+		vcpu_alloc_vmx(vm, &nested_test_data_gva);
+
+	vcpu_args_set(vcpu, 1, nested_test_data_gva);
+
+	run = vcpu->run;
+
+	for (;;) {
+		struct ucall uc;
+
+		vcpu_run(vcpu);
+
+		if (run->exit_reason == KVM_EXIT_IO) {
+			switch (get_ucall(vcpu, &uc)) {
+			case UCALL_ABORT:
+				REPORT_GUEST_ASSERT(uc);
+				/* NOT REACHED */
+			case UCALL_SYNC:
+				continue;
+			case UCALL_DONE:
+				goto done;
+			default:
+				TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd);
+			}
+		}

+		TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_X86_BUS_LOCK);
+	}
+done:
+	kvm_vm_free(vm);
+	return 0;
+}
Hi Manali,
kernel test robot noticed the following build warnings:
[auto build test WARNING on efbc6bd090f48ccf64f7a8dd5daea775821d57ec]
url:    https://github.com/intel-lab-lkp/linux/commits/Manali-Shukla/x86-cpufeatures...
base:   efbc6bd090f48ccf64f7a8dd5daea775821d57ec
patch link:    https://lore.kernel.org/r/20241004053341.5726-5-manali.shukla%40amd.com
patch subject: [PATCH v3 4/4] KVM: selftests: Add bus lock exit test
:::::: branch date: 2 days ago
:::::: commit date: 2 days ago
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241006/202410060638.ZXRqIbIj-lkp@i...)
If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot lkp@intel.com
| Closes: https://lore.kernel.org/r/202410060638.ZXRqIbIj-lkp@intel.com/
All warnings (new ones prefixed by >>):
x86_64/kvm_buslock_test.c: In function 'buslock_add':
>> x86_64/kvm_buslock_test.c:36:39: warning: taking address of packed member of 'struct buslock_test' may result in an unaligned pointer value [-Waddress-of-packed-member]
      36 |                 buslock_atomic_add(2, &test.val);
         |                                       ^~~~~~~~~
   At top level:
   cc1: note: unrecognized command-line option '-Wno-gnu-variable-sized-type-not-at-end' may have been intended to silence earlier diagnostics
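One possible way to avoid the warning while keeping the operand misaligned is to build the pointer explicitly instead of taking the address of a packed member; an illustrative sketch reusing the selftest's existing helpers (not necessarily the fix that was applied):

	/* Two pages; carve a misaligned counter across the page boundary. */
	static unsigned char buslock_buf[2 * PAGE_SIZE] __aligned(PAGE_SIZE);

	static void buslock_add(void)
	{
		atomic_long_t *val = (atomic_long_t *)(buslock_buf + PAGE_SIZE - 2);

		for (int i = 0; i < NR_ITERATIONS; i++)
			buslock_atomic_add(2, val);
	}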