These are a few patches I was working on lately, all somewhat related to the two CVEs that I found recently.
The first 7 patches fix various minor bugs that relate to these CVEs.
The rest of the patches implement various optional SVM features, some of which the guest could enable anyway due to incorrect checking of virt_ext field.
The last patch is somewhat of an RFC; I would like to hear your opinion on it.
I also implemented nested TSC scaling while at it.
As for other optional SVM features, here is my summary of a few features I took a look at:
X86_FEATURE_DECODEASSISTS: this feature should make it easier for L1 to emulate an instruction on MMIO access, by not needing to read the guest memory but rather using the instruction bytes that the CPU already fetched.
The challenge of implementing this is that we sometimes inject #PF and #NPT synthetically, and in those cases we must be sure we set the correct instruction bytes (a rough sketch of the required plumbing follows this summary).
Also this feature adds assists for MOV CR/DR, INTn, and INVLPG, which aren't that interesting but must be supported as well to expose this feature to the nested guest.
X86_FEATURE_VGIF: might allow L2 to run L3 a bit faster, but given the already crazy complex logic we have around int_ctl and vgif, it is probably not worth it.
X86_FEATURE_VMCBCLEAN: should just be enabled, because otherwise L1 doesn't even attempt to set the clean bits. But we need to know if we can take advantage of these bits first.
X86_FEATURE_FLUSHBYASID X86_FEATURE_AVIC: these two features would be very good to enable, but that would require lots of work, and will be done eventually.
There are a few more nested SVM features that I haven't yet had a chance to take a look at.
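To make the DECODEASSISTS point a bit more concrete, here is the rough sketch mentioned above of the plumbing a reflected fault would need. This is not part of this series, just an illustration; it assumes the insn_len/insn_bytes layout of struct vmcb_control_area, and the helper name is made up:

/*
 * Illustration only: when reflecting an intercepted #NPF to L1 with
 * decode assists exposed, the instruction bytes that the CPU captured
 * (or that KVM fetched itself, for synthetically injected faults) have
 * to be copied into vmcb12, otherwise L1 will decode stale bytes.
 */
static void nested_copy_insn_bytes(struct vcpu_svm *svm, struct vmcb *vmcb12)
{
	struct vmcb_control_area *src = &svm->vmcb->control;

	vmcb12->control.insn_len = src->insn_len;
	memcpy(vmcb12->control.insn_bytes, src->insn_bytes,
	       sizeof(vmcb12->control.insn_bytes));
}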
Best regards, Maxim Levitsky
Maxim Levitsky (14):
  KVM: x86: nSVM: restore int_vector in svm_clear_vintr
  KVM: x86: selftests: test simultaneous uses of V_IRQ from L1 and L0
  KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround
  KVM: x86: nSVM: don't copy pause related settings
  KVM: x86: nSVM: don't copy virt_ext from vmcb12
  KVM: x86: SVM: don't set VMLOAD/VMSAVE intercepts on vCPU reset
  KVM: x86: SVM: add warning for CVE-2021-3656
  KVM: x86: SVM: add module param to control LBR virtualization
  KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running
  KVM: x86: nSVM: implement nested LBR virtualization
  KVM: x86: nSVM: implement nested VMLOAD/VMSAVE
  KVM: x86: SVM: add module param to control TSC scaling
  KVM: x86: nSVM: implement nested TSC scaling
  KVM: x86: nSVM: support PAUSE filter threshold and count

 arch/x86/kvm/svm/nested.c                     | 105 +++++++--
 arch/x86/kvm/svm/svm.c                        | 218 +++++++++++++++---
 arch/x86/kvm/svm/svm.h                        |  20 +-
 arch/x86/kvm/vmx/vmx.c                        |   1 +
 arch/x86/kvm/x86.c                            |   1 +
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/x86_64/svm_int_ctl_test.c   | 128 ++++++++++
 8 files changed, 427 insertions(+), 48 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/svm_int_ctl_test.c
In svm_clear_vintr we try to restore the virtual interrupt injection that might be pending, but we fail to restore the interrupt vector.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com --- arch/x86/kvm/svm/svm.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 05e8d4d27969..b2e710a3fff6 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1566,6 +1566,8 @@ static void svm_clear_vintr(struct vcpu_svm *svm)
svm->vmcb->control.int_ctl |= svm->nested.ctl.int_ctl & V_IRQ_INJECTION_BITS_MASK; + + svm->vmcb->control.int_vector = svm->nested.ctl.int_vector; }
vmcb_mark_dirty(svm->vmcb, VMCB_INTR);
On 14/09/21 17:48, Maxim Levitsky wrote:
In svm_clear_vintr we try to restore the virtual interrupt injection that might be pending, but we fail to restore the interrupt vector.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
arch/x86/kvm/svm/svm.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 05e8d4d27969..b2e710a3fff6 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1566,6 +1566,8 @@ static void svm_clear_vintr(struct vcpu_svm *svm) svm->vmcb->control.int_ctl |= svm->nested.ctl.int_ctl & V_IRQ_INJECTION_BITS_MASK;
+	svm->vmcb->control.int_vector = svm->nested.ctl.int_vector;
 }
vmcb_mark_dirty(svm->vmcb, VMCB_INTR);
Queued, thanks.
Paolo
Test that if:
* L1 disables virtual interrupt masking, and INTR intercept.
* L1 sets up a virtual interrupt to be injected into L2 and enters L2 with interrupts disabled, so the virtual interrupt is pending.
* Now an external interrupt arrives in L1 and since L1 doesn't intercept it, it should be delivered to L2 when it enables interrupts.
To do this, L0 (ab)uses V_IRQ to set up an interrupt window, and returns to L2.
* L2 enables interrupts. This should trigger the interrupt window, injection of the external interrupt and delivery of the virtual interrupt that can now be done.
* Test that now L2 gets those interrupts.
This is the test that demonstrates the issue that was fixed in the previous patch.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com --- tools/testing/selftests/kvm/.gitignore | 1 + tools/testing/selftests/kvm/Makefile | 1 + .../selftests/kvm/x86_64/svm_int_ctl_test.c | 128 ++++++++++++++++++ 3 files changed, 130 insertions(+) create mode 100644 tools/testing/selftests/kvm/x86_64/svm_int_ctl_test.c
diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore index 36896d251977..eb98958b15e4 100644 --- a/tools/testing/selftests/kvm/.gitignore +++ b/tools/testing/selftests/kvm/.gitignore @@ -23,6 +23,7 @@ /x86_64/smm_test /x86_64/state_test /x86_64/svm_vmcall_test +/x86_64/svm_int_ctl_test /x86_64/sync_regs_test /x86_64/tsc_msrs_test /x86_64/userspace_msr_exit_test diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index c103873531e0..3b8b143daecc 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -56,6 +56,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/smm_test TEST_GEN_PROGS_x86_64 += x86_64/state_test TEST_GEN_PROGS_x86_64 += x86_64/vmx_preemption_timer_test TEST_GEN_PROGS_x86_64 += x86_64/svm_vmcall_test +TEST_GEN_PROGS_x86_64 += x86_64/svm_int_ctl_test TEST_GEN_PROGS_x86_64 += x86_64/sync_regs_test TEST_GEN_PROGS_x86_64 += x86_64/userspace_msr_exit_test TEST_GEN_PROGS_x86_64 += x86_64/vmx_apic_access_test diff --git a/tools/testing/selftests/kvm/x86_64/svm_int_ctl_test.c b/tools/testing/selftests/kvm/x86_64/svm_int_ctl_test.c new file mode 100644 index 000000000000..df04f56ce859 --- /dev/null +++ b/tools/testing/selftests/kvm/x86_64/svm_int_ctl_test.c @@ -0,0 +1,128 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * svm_int_ctl_test + * + * Copyright (C) 2021, Red Hat, Inc. + * + * Nested SVM testing: test simultaneous use of V_IRQ from L1 and L0. + */ + +#include "test_util.h" +#include "kvm_util.h" +#include "processor.h" +#include "svm_util.h" +#include "apic.h" + +#define VCPU_ID 0 + +static struct kvm_vm *vm; + +bool vintr_irq_called; +bool intr_irq_called; + +#define VINTR_IRQ_NUMBER 0x20 +#define INTR_IRQ_NUMBER 0x30 + +static void vintr_irq_handler(struct ex_regs *regs) +{ + vintr_irq_called = true; +} + +static void intr_irq_handler(struct ex_regs *regs) +{ + x2apic_write_reg(APIC_EOI, 0x00); + intr_irq_called = true; +} + +static void l2_guest_code(struct svm_test_data *svm) +{ + /* This code raises interrupt INTR_IRQ_NUMBER in the L1's LAPIC, + * and since L1 didn't enable virtual interrupt masking, + * L2 should receive it and not L1. + * + * L2 also has virtual interrupt 'VINTR_IRQ_NUMBER' pending in V_IRQ + * so it should also receive it after the following 'sti'. + */ + x2apic_write_reg(APIC_ICR, + APIC_DEST_SELF | APIC_INT_ASSERT | INTR_IRQ_NUMBER); + + __asm__ __volatile__( + "sti\n" + "nop\n" + ); + + GUEST_ASSERT(vintr_irq_called); + GUEST_ASSERT(intr_irq_called); + + __asm__ __volatile__( + "vmcall\n" + ); +} + +static void l1_guest_code(struct svm_test_data *svm) +{ + #define L2_GUEST_STACK_SIZE 64 + unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE]; + struct vmcb *vmcb = svm->vmcb; + + x2apic_enable(); + + /* Prepare for L2 execution. 
*/ + generic_svm_setup(svm, l2_guest_code, + &l2_guest_stack[L2_GUEST_STACK_SIZE]); + + /* No virtual interrupt masking */ + vmcb->control.int_ctl &= ~V_INTR_MASKING_MASK; + + /* No intercepts for real and virtual interrupts */ + vmcb->control.intercept &= ~(1ULL << INTERCEPT_INTR | INTERCEPT_VINTR); + + /* Make a virtual interrupt VINTR_IRQ_NUMBER pending */ + vmcb->control.int_ctl |= V_IRQ_MASK | (0x1 << V_INTR_PRIO_SHIFT); + vmcb->control.int_vector = VINTR_IRQ_NUMBER; + + run_guest(vmcb, svm->vmcb_gpa); + GUEST_ASSERT(vmcb->control.exit_code == SVM_EXIT_VMMCALL); + GUEST_DONE(); +} + +int main(int argc, char *argv[]) +{ + vm_vaddr_t svm_gva; + + nested_svm_check_supported(); + + vm = vm_create_default(VCPU_ID, 0, (void *) l1_guest_code); + + vm_init_descriptor_tables(vm); + vcpu_init_descriptor_tables(vm, VCPU_ID); + + vm_install_exception_handler(vm, VINTR_IRQ_NUMBER, vintr_irq_handler); + vm_install_exception_handler(vm, INTR_IRQ_NUMBER, intr_irq_handler); + + vcpu_alloc_svm(vm, &svm_gva); + vcpu_args_set(vm, VCPU_ID, 1, svm_gva); + + struct kvm_run *run = vcpu_state(vm, VCPU_ID); + struct ucall uc; + + vcpu_run(vm, VCPU_ID); + TEST_ASSERT(run->exit_reason == KVM_EXIT_IO, + "Got exit_reason other than KVM_EXIT_IO: %u (%s)\n", + run->exit_reason, + exit_reason_str(run->exit_reason)); + + switch (get_ucall(vm, VCPU_ID, &uc)) { + case UCALL_ABORT: + TEST_FAIL("%s", (const char *)uc.args[0]); + break; + /* NOT REACHED */ + case UCALL_DONE: + goto done; + default: + TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd); + } +done: + kvm_vm_free(vm); + return 0; +}
On 14/09/21 17:48, Maxim Levitsky wrote:
Test that if:
L1 disables virtual interrupt masking, and INTR intercept.
L1 sets up a virtual interrupt to be injected into L2 and enters L2 with interrupts disabled, so the virtual interrupt is pending.
Now an external interrupt arrives in L1 and since L1 doesn't intercept it, it should be delivered to L2 when it enables interrupts.
To do this, L0 (ab)uses V_IRQ to set up an interrupt window, and returns to L2.
L2 enables interrupts. This should trigger the interrupt window, injection of the external interrupt and delivery of the virtual interrupt that can now be done.
Test that now L2 gets those interrupts.
This is the test that demonstrates the issue that was fixed in the previous patch.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
tools/testing/selftests/kvm/.gitignore | 1 + tools/testing/selftests/kvm/Makefile | 1 + .../selftests/kvm/x86_64/svm_int_ctl_test.c | 128 ++++++++++++++++++ 3 files changed, 130 insertions(+) create mode 100644 tools/testing/selftests/kvm/x86_64/svm_int_ctl_test.c
diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore index 36896d251977..eb98958b15e4 100644 --- a/tools/testing/selftests/kvm/.gitignore +++ b/tools/testing/selftests/kvm/.gitignore @@ -23,6 +23,7 @@ /x86_64/smm_test /x86_64/state_test /x86_64/svm_vmcall_test +/x86_64/svm_int_ctl_test /x86_64/sync_regs_test /x86_64/tsc_msrs_test /x86_64/userspace_msr_exit_test diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index c103873531e0..3b8b143daecc 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -56,6 +56,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/smm_test TEST_GEN_PROGS_x86_64 += x86_64/state_test TEST_GEN_PROGS_x86_64 += x86_64/vmx_preemption_timer_test TEST_GEN_PROGS_x86_64 += x86_64/svm_vmcall_test +TEST_GEN_PROGS_x86_64 += x86_64/svm_int_ctl_test TEST_GEN_PROGS_x86_64 += x86_64/sync_regs_test TEST_GEN_PROGS_x86_64 += x86_64/userspace_msr_exit_test TEST_GEN_PROGS_x86_64 += x86_64/vmx_apic_access_test diff --git a/tools/testing/selftests/kvm/x86_64/svm_int_ctl_test.c b/tools/testing/selftests/kvm/x86_64/svm_int_ctl_test.c new file mode 100644 index 000000000000..df04f56ce859 --- /dev/null +++ b/tools/testing/selftests/kvm/x86_64/svm_int_ctl_test.c @@ -0,0 +1,128 @@ +// SPDX-License-Identifier: GPL-2.0-only +/*
- svm_int_ctl_test
- Copyright (C) 2021, Red Hat, Inc.
- Nested SVM testing: test simultaneous use of V_IRQ from L1 and L0.
- */
+#include "test_util.h" +#include "kvm_util.h" +#include "processor.h" +#include "svm_util.h" +#include "apic.h"
+#define VCPU_ID 0
+static struct kvm_vm *vm;
+bool vintr_irq_called; +bool intr_irq_called;
+#define VINTR_IRQ_NUMBER 0x20 +#define INTR_IRQ_NUMBER 0x30
+static void vintr_irq_handler(struct ex_regs *regs) +{
- vintr_irq_called = true;
+}
+static void intr_irq_handler(struct ex_regs *regs) +{
- x2apic_write_reg(APIC_EOI, 0x00);
- intr_irq_called = true;
+}
+static void l2_guest_code(struct svm_test_data *svm) +{
- /* This code raises interrupt INTR_IRQ_NUMBER in the L1's LAPIC,
* and since L1 didn't enable virtual interrupt masking,
* L2 should receive it and not L1.
*
* L2 also has virtual interrupt 'VINTR_IRQ_NUMBER' pending in V_IRQ
* so it should also receive it after the following 'sti'.
*/
- x2apic_write_reg(APIC_ICR,
APIC_DEST_SELF | APIC_INT_ASSERT | INTR_IRQ_NUMBER);
- __asm__ __volatile__(
"sti\n"
"nop\n"
- );
- GUEST_ASSERT(vintr_irq_called);
- GUEST_ASSERT(intr_irq_called);
- __asm__ __volatile__(
"vmcall\n"
- );
+}
+static void l1_guest_code(struct svm_test_data *svm) +{
- #define L2_GUEST_STACK_SIZE 64
- unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
- struct vmcb *vmcb = svm->vmcb;
- x2apic_enable();
- /* Prepare for L2 execution. */
- generic_svm_setup(svm, l2_guest_code,
&l2_guest_stack[L2_GUEST_STACK_SIZE]);
- /* No virtual interrupt masking */
- vmcb->control.int_ctl &= ~V_INTR_MASKING_MASK;
- /* No intercepts for real and virtual interrupts */
- vmcb->control.intercept &= ~(1ULL << INTERCEPT_INTR | INTERCEPT_VINTR);
- /* Make a virtual interrupt VINTR_IRQ_NUMBER pending */
- vmcb->control.int_ctl |= V_IRQ_MASK | (0x1 << V_INTR_PRIO_SHIFT);
- vmcb->control.int_vector = VINTR_IRQ_NUMBER;
- run_guest(vmcb, svm->vmcb_gpa);
- GUEST_ASSERT(vmcb->control.exit_code == SVM_EXIT_VMMCALL);
- GUEST_DONE();
+}
+int main(int argc, char *argv[]) +{
- vm_vaddr_t svm_gva;
- nested_svm_check_supported();
- vm = vm_create_default(VCPU_ID, 0, (void *) l1_guest_code);
- vm_init_descriptor_tables(vm);
- vcpu_init_descriptor_tables(vm, VCPU_ID);
- vm_install_exception_handler(vm, VINTR_IRQ_NUMBER, vintr_irq_handler);
- vm_install_exception_handler(vm, INTR_IRQ_NUMBER, intr_irq_handler);
- vcpu_alloc_svm(vm, &svm_gva);
- vcpu_args_set(vm, VCPU_ID, 1, svm_gva);
- struct kvm_run *run = vcpu_state(vm, VCPU_ID);
- struct ucall uc;
- vcpu_run(vm, VCPU_ID);
- TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
"Got exit_reason other than KVM_EXIT_IO: %u (%s)\n",
run->exit_reason,
exit_reason_str(run->exit_reason));
- switch (get_ucall(vm, VCPU_ID, &uc)) {
- case UCALL_ABORT:
TEST_FAIL("%s", (const char *)uc.args[0]);
break;
/* NOT REACHED */
- case UCALL_DONE:
goto done;
- default:
TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd);
- }
+done:
- kvm_vm_free(vm);
- return 0;
+}
Queued, thanks.
Paolo
The #GP SVM errata workaround made the #GP handler always emulate the SVM instructions.
However, these instructions #GP when the operand is not 4K aligned, and the workaround code didn't check this, so we ended up emulating these instructions anyway.
This is only an emulation accuracy bug, as there is no harm in KVM reading/writing unaligned vmcb images.
Fixes: 82a11e9c6fa2 ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions")
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com --- arch/x86/kvm/svm/svm.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index b2e710a3fff6..6645542df9bd 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2224,6 +2224,10 @@ static int gp_interception(struct kvm_vcpu *vcpu) if (error_code) goto reinject;
+ /* All SVM instructions expect page aligned RAX */ + if (svm->vmcb->save.rax & ~PAGE_MASK) + goto reinject; + /* Decode the instruction for usage later */ if (x86_decode_emulated_instruction(vcpu, 0, NULL, 0) != EMULATION_OK) goto reinject;
On 14/09/21 17:48, Maxim Levitsky wrote:
The #GP SVM errata workaround made the #GP handler always emulate the SVM instructions.
However, these instructions #GP when the operand is not 4K aligned, and the workaround code didn't check this, so we ended up emulating these instructions anyway.
This is only an emulation accuracy bug, as there is no harm in KVM reading/writing unaligned vmcb images.
Fixes: 82a11e9c6fa2 ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions")
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
arch/x86/kvm/svm/svm.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index b2e710a3fff6..6645542df9bd 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -2224,6 +2224,10 @@ static int gp_interception(struct kvm_vcpu *vcpu) if (error_code) goto reinject;
+	/* All SVM instructions expect page aligned RAX */
+	if (svm->vmcb->save.rax & ~PAGE_MASK)
+		goto reinject;
+
 	/* Decode the instruction for usage later */
 	if (x86_decode_emulated_instruction(vcpu, 0, NULL, 0) != EMULATION_OK)
 		goto reinject;
Queued, thanks.
Paolo
According to the APM, the CPU never modifies these settings. It loads them on VM entry and updates an internal copy instead.
Also don't load them from the vmcb12 as we don't expose these features to the nested guest yet.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com --- arch/x86/kvm/svm/nested.c | 8 -------- 1 file changed, 8 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 2545d0c61985..476e01f98035 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -551,9 +551,6 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) svm->vmcb->control.event_inj = svm->nested.ctl.event_inj; svm->vmcb->control.event_inj_err = svm->nested.ctl.event_inj_err;
- svm->vmcb->control.pause_filter_count = svm->nested.ctl.pause_filter_count; - svm->vmcb->control.pause_filter_thresh = svm->nested.ctl.pause_filter_thresh; - nested_svm_transition_tlb_flush(vcpu);
/* Enter Guest-Mode */ @@ -808,11 +805,6 @@ int nested_svm_vmexit(struct vcpu_svm *svm) vmcb12->control.event_inj = svm->nested.ctl.event_inj; vmcb12->control.event_inj_err = svm->nested.ctl.event_inj_err;
- vmcb12->control.pause_filter_count = - svm->vmcb->control.pause_filter_count; - vmcb12->control.pause_filter_thresh = - svm->vmcb->control.pause_filter_thresh; - nested_svm_copy_common_state(svm->nested.vmcb02.ptr, svm->vmcb01.ptr);
svm_switch_vmcb(svm, &svm->vmcb01);
These fields correspond to features that we don't yet expose to L2.
While there are currently no CVE-worthy features in this field, if AMD adds more features to it, that could allow guest escapes similar to CVE-2021-3653 and CVE-2021-3656.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com --- arch/x86/kvm/svm/nested.c | 1 - 1 file changed, 1 deletion(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 476e01f98035..4df59d9795b6 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -545,7 +545,6 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) (svm->nested.ctl.int_ctl & int_ctl_vmcb12_bits) | (svm->vmcb01.ptr->control.int_ctl & int_ctl_vmcb01_bits);
- svm->vmcb->control.virt_ext = svm->nested.ctl.virt_ext; svm->vmcb->control.int_vector = svm->nested.ctl.int_vector; svm->vmcb->control.int_state = svm->nested.ctl.int_state; svm->vmcb->control.event_inj = svm->nested.ctl.event_inj;
On 14/09/21 17:48, Maxim Levitsky wrote:
These fields correspond to features that we don't yet expose to L2.
While there are currently no CVE-worthy features in this field, if AMD adds more features to it, that could allow guest escapes similar to CVE-2021-3653 and CVE-2021-3656.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
arch/x86/kvm/svm/nested.c | 1 - 1 file changed, 1 deletion(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 476e01f98035..4df59d9795b6 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -545,7 +545,6 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) (svm->nested.ctl.int_ctl & int_ctl_vmcb12_bits) | (svm->vmcb01.ptr->control.int_ctl & int_ctl_vmcb01_bits);
- svm->vmcb->control.virt_ext = svm->nested.ctl.virt_ext; svm->vmcb->control.int_vector = svm->nested.ctl.int_vector; svm->vmcb->control.int_state = svm->nested.ctl.int_state; svm->vmcb->control.event_inj = svm->nested.ctl.event_inj;
Queued, thanks.
Paolo
Commit adc2a23734ac ("KVM: nSVM: improve SYSENTER emulation on AMD") made init_vmcb set the VMLOAD/VMSAVE intercepts unconditionally, and relied on svm_vcpu_after_set_cpuid to clear them when possible.
However, init_vmcb is also called when the vCPU is reset, and it is not followed by another call to svm_vcpu_after_set_cpuid because the CPUID is already set.
This mistake causes the VMLOAD/VMSAVE intercepts to be set when they are not needed, which harms the performance of the nested guest.
Fixes: adc2a23734ac ("KVM: nSVM: improve SYSENTER emulation on AMD")
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com --- arch/x86/kvm/svm/svm.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 6645542df9bd..861ac9f74331 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1199,8 +1199,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu) svm_set_intercept(svm, INTERCEPT_SHUTDOWN); svm_set_intercept(svm, INTERCEPT_VMRUN); svm_set_intercept(svm, INTERCEPT_VMMCALL); - svm_set_intercept(svm, INTERCEPT_VMLOAD); - svm_set_intercept(svm, INTERCEPT_VMSAVE); svm_set_intercept(svm, INTERCEPT_STGI); svm_set_intercept(svm, INTERCEPT_CLGI); svm_set_intercept(svm, INTERCEPT_SKINIT); @@ -1377,6 +1375,10 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu) svm->guest_state_loaded = false;
svm_switch_vmcb(svm, &svm->vmcb01); + + svm_set_intercept(svm, INTERCEPT_VMLOAD); + svm_set_intercept(svm, INTERCEPT_VMSAVE); + init_vmcb(vcpu);
svm_vcpu_init_msrpm(vcpu, svm->msrpm);
On 14/09/21 17:48, Maxim Levitsky wrote:
Commit adc2a23734ac ("KVM: nSVM: improve SYSENTER emulation on AMD") made init_vmcb set the VMLOAD/VMSAVE intercepts unconditionally, and relied on svm_vcpu_after_set_cpuid to clear them when possible.
However, init_vmcb is also called when the vCPU is reset, and it is not followed by another call to svm_vcpu_after_set_cpuid because the CPUID is already set.
This mistake causes the VMLOAD/VMSAVE intercepts to be set when they are not needed, which harms the performance of the nested guest.
Fixes: adc2a23734ac ("KVM: nSVM: improve SYSENTER emulation on AMD")
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
arch/x86/kvm/svm/svm.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 6645542df9bd..861ac9f74331 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1199,8 +1199,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu) svm_set_intercept(svm, INTERCEPT_SHUTDOWN); svm_set_intercept(svm, INTERCEPT_VMRUN); svm_set_intercept(svm, INTERCEPT_VMMCALL);
- svm_set_intercept(svm, INTERCEPT_VMLOAD);
- svm_set_intercept(svm, INTERCEPT_VMSAVE); svm_set_intercept(svm, INTERCEPT_STGI); svm_set_intercept(svm, INTERCEPT_CLGI); svm_set_intercept(svm, INTERCEPT_SKINIT);
@@ -1377,6 +1375,10 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu) svm->guest_state_loaded = false; svm_switch_vmcb(svm, &svm->vmcb01);
+	svm_set_intercept(svm, INTERCEPT_VMLOAD);
+	svm_set_intercept(svm, INTERCEPT_VMSAVE);
+
 	init_vmcb(vcpu);
svm_vcpu_init_msrpm(vcpu, svm->msrpm);
This needs to be redone after the latest refactoring of svm_vcpu_reset. I'll send a patch myself.
Paolo
Just in case, add a warning ensuring that on guest entry either both the VMLOAD and VMSAVE intercepts are enabled or virtual VMLOAD/VMSAVE is enabled.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com --- arch/x86/kvm/svm/svm.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 861ac9f74331..deeebd05f682 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3784,6 +3784,12 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu));
+ /* Check that CVE-2021-3656 can't happen again */ + if (!svm_is_intercept(svm, INTERCEPT_VMSAVE) || + !svm_is_intercept(svm, INTERCEPT_VMSAVE)) + WARN_ON(!(svm->vmcb->control.virt_ext & + VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK)); + sync_lapic_to_cr8(vcpu);
if (unlikely(svm->asid != svm->vmcb->control.asid)) {
On 14/09/21 17:48, Maxim Levitsky wrote:
Just in case, add a warning ensuring that on guest entry either both the VMLOAD and VMSAVE intercepts are enabled or virtual VMLOAD/VMSAVE is enabled.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
arch/x86/kvm/svm/svm.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 861ac9f74331..deeebd05f682 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3784,6 +3784,12 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu) WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu));
+	/* Check that CVE-2021-3656 can't happen again */
+	if (!svm_is_intercept(svm, INTERCEPT_VMSAVE) ||
+	    !svm_is_intercept(svm, INTERCEPT_VMSAVE))
+		WARN_ON(!(svm->vmcb->control.virt_ext &
+			VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK));
+
 	sync_lapic_to_cr8(vcpu);
if (unlikely(svm->asid != svm->vmcb->control.asid)) {
While it's nice to be "proactive", this does add some extra work. Maybe it should be under CONFIG_DEBUG_KERNEL. It could be useful to make it into its own function so we can add similar intercept invariants in the same place.
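Something along these lines, for example (the function and the config symbol are made up, just to illustrate):

#ifdef CONFIG_KVM_DEBUG_INVARIANTS	/* hypothetical Kconfig symbol */
static void svm_check_invariants(struct vcpu_svm *svm)
{
	/* CVE-2021-3656: VMLOAD/VMSAVE must be intercepted or virtualized */
	if (!svm_is_intercept(svm, INTERCEPT_VMLOAD) ||
	    !svm_is_intercept(svm, INTERCEPT_VMSAVE))
		WARN_ON_ONCE(!(svm->vmcb->control.virt_ext &
			       VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK));

	/* further intercept invariants would go here */
}
#else
static inline void svm_check_invariants(struct vcpu_svm *svm) { }
#endif

svm_vcpu_run() would then just call svm_check_invariants(svm) before entry.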
Paolo
On Thu, Sep 23, 2021, Paolo Bonzini wrote:
On 14/09/21 17:48, Maxim Levitsky wrote:
Just in case, add a warning ensuring that on guest entry either both the VMLOAD and VMSAVE intercepts are enabled or virtual VMLOAD/VMSAVE is enabled.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
arch/x86/kvm/svm/svm.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 861ac9f74331..deeebd05f682 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3784,6 +3784,12 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu) WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu));
+	/* Check that CVE-2021-3656 can't happen again */
+	if (!svm_is_intercept(svm, INTERCEPT_VMSAVE) ||
+	    !svm_is_intercept(svm, INTERCEPT_VMSAVE))
+		WARN_ON(!(svm->vmcb->control.virt_ext &
+			VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK));
+
 	sync_lapic_to_cr8(vcpu);

 	if (unlikely(svm->asid != svm->vmcb->control.asid)) {
While it's nice to be "proactive", this does add some extra work. Maybe it should be under CONFIG_DEBUG_KERNEL. It could be useful to make it into its own function so we can add similar intercept invariants in the same place.
I don't know that DEBUG_KERNEL will guard much; DEBUG_KERNEL=y is very common, e.g. it's on by default in the x86 defconfigs. I too agree it's nice to be proactive, but this isn't that different than, say, failing to intercept CR3 loads when shadow paging is enabled.
If we go down the path of effectively auditing KVM invariants, I'd rather we commit fully and (a) add a dedicated Kconfig that is highly unlikely to be turned on by accident and (b) audit a large number of invariants.
On 9/14/2021 11:48 PM, Maxim Levitsky wrote:
Just in case, add a warning ensuring that on guest entry either both the VMLOAD and VMSAVE intercepts are enabled or virtual VMLOAD/VMSAVE is enabled.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
arch/x86/kvm/svm/svm.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 861ac9f74331..deeebd05f682 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3784,6 +3784,12 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu) WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu));
+	/* Check that CVE-2021-3656 can't happen again */
+	if (!svm_is_intercept(svm, INTERCEPT_VMSAVE) ||
+	    !svm_is_intercept(svm, INTERCEPT_VMSAVE))
either one needs to be INTERCEPT_VMLOAD, right?
+		WARN_ON(!(svm->vmcb->control.virt_ext &
+			VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK));
+
 	sync_lapic_to_cr8(vcpu);
if (unlikely(svm->asid != svm->vmcb->control.asid)) {
On Tue, 2021-10-12 at 01:30 +0800, Xiaoyao Li wrote:
On 9/14/2021 11:48 PM, Maxim Levitsky wrote:
Just in case, add a warning ensuring that on guest entry either both the VMLOAD and VMSAVE intercepts are enabled or virtual VMLOAD/VMSAVE is enabled.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
arch/x86/kvm/svm/svm.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 861ac9f74331..deeebd05f682 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -3784,6 +3784,12 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu) WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu));
+	/* Check that CVE-2021-3656 can't happen again */
+	if (!svm_is_intercept(svm, INTERCEPT_VMSAVE) ||
+	    !svm_is_intercept(svm, INTERCEPT_VMSAVE))
either one needs to be INTERCEPT_VMLOAD, right?
Oops! Of course.
Best regards, Maxim Levitsky
+		WARN_ON(!(svm->vmcb->control.virt_ext &
+			VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK));
+
 	sync_lapic_to_cr8(vcpu);
if (unlikely(svm->asid != svm->vmcb->control.asid)) {
This is useful for debugging and also makes it consistent with the rest of the optional SVM features.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com --- arch/x86/kvm/svm/svm.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index deeebd05f682..b08c5d826208 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -186,6 +186,11 @@ module_param(vls, int, 0444); static int vgif = true; module_param(vgif, int, 0444);
+/* enable/disable LBR virtualization */ +static int lbrv = true; +module_param(lbrv, int, 0444); + + /* * enable / disable AVIC. Because the defaults differ for APICv * support between VMX and SVM we cannot use module_param_named. @@ -1059,6 +1064,13 @@ static __init int svm_hardware_setup(void) pr_info("Virtual GIF supported\n"); }
+ if (lbrv) { + if (!boot_cpu_has(X86_FEATURE_LBRV)) + lbrv = false; + else + pr_info("LBR virtualization supported\n"); + } + svm_set_cpu_caps();
/* @@ -2920,7 +2932,7 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) svm->tsc_aux = data; break; case MSR_IA32_DEBUGCTLMSR: - if (!boot_cpu_has(X86_FEATURE_LBRV)) { + if (!lbrv) { vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTL 0x%llx, nop\n", __func__, data); break;
When L2 is running without LBR virtualization, we should ensure that L1's LBR msrs continue to update as usual.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com --- arch/x86/kvm/svm/nested.c | 11 +++++ arch/x86/kvm/svm/svm.c | 98 +++++++++++++++++++++++++++++++-------- arch/x86/kvm/svm/svm.h | 2 + 3 files changed, 92 insertions(+), 19 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 4df59d9795b6..c6f09b591696 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -502,6 +502,9 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12 svm->vcpu.arch.dr6 = vmcb12->save.dr6 | DR6_ACTIVE_LOW; vmcb_mark_dirty(svm->vmcb, VMCB_DR); } + + if (unlikely(svm->vmcb01.ptr->control.virt_ext & LBR_CTL_ENABLE_MASK)) + svm_copy_lbrs(svm->vmcb01.ptr, svm->vmcb); }
static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) @@ -550,6 +553,9 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) svm->vmcb->control.event_inj = svm->nested.ctl.event_inj; svm->vmcb->control.event_inj_err = svm->nested.ctl.event_inj_err;
+ svm->vmcb->control.virt_ext = svm->vmcb01.ptr->control.virt_ext & + LBR_CTL_ENABLE_MASK; + nested_svm_transition_tlb_flush(vcpu);
/* Enter Guest-Mode */ @@ -808,6 +814,11 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
svm_switch_vmcb(svm, &svm->vmcb01);
+ if (unlikely(svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK)) { + svm_copy_lbrs(svm->nested.vmcb02.ptr, svm->vmcb); + svm_update_lbrv(vcpu); + } + /* * On vmexit the GIF is set to false and * no event can be injected in L1. diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index b08c5d826208..981cc9765b95 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -792,6 +792,17 @@ static void init_msrpm_offsets(void) } }
+void svm_copy_lbrs(struct vmcb *from_vmcb, struct vmcb *to_vmcb) +{ + to_vmcb->save.dbgctl = from_vmcb->save.dbgctl; + to_vmcb->save.br_from = from_vmcb->save.br_from; + to_vmcb->save.br_to = from_vmcb->save.br_to; + to_vmcb->save.last_excp_from = from_vmcb->save.last_excp_from; + to_vmcb->save.last_excp_to = from_vmcb->save.last_excp_to; + + vmcb_mark_dirty(to_vmcb, VMCB_LBR); +} + static void svm_enable_lbrv(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -801,6 +812,10 @@ static void svm_enable_lbrv(struct kvm_vcpu *vcpu) set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHTOIP, 1, 1); set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTFROMIP, 1, 1); set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTTOIP, 1, 1); + + /* Move the LBR msrs to the vmcb02 so that the guest can see them. */ + if (is_guest_mode(vcpu)) + svm_copy_lbrs(svm->vmcb01.ptr, svm->vmcb); }
static void svm_disable_lbrv(struct kvm_vcpu *vcpu) @@ -812,6 +827,63 @@ static void svm_disable_lbrv(struct kvm_vcpu *vcpu) set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHTOIP, 0, 0); set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTFROMIP, 0, 0); set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTTOIP, 0, 0); + + /* + * Move the LBR msrs back to the vmcb01 to avoid copying them + * on nested guest entries. + */ + if (is_guest_mode(vcpu)) + svm_copy_lbrs(svm->vmcb, svm->vmcb01.ptr); +} + +static int svm_get_lbr_msr(struct vcpu_svm *svm, u32 index) +{ + /* + * If the LBR virtualization is disabled, the LBR msrs are always + * kept in the vmcb01 to avoid copying them on nested guest entries. + * + * If nested, and the LBR virtualization is enabled/disabled, the msrs + * are moved between the vmcb01 and vmcb02 as needed. + */ + struct vmcb *vmcb = + (svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK) ? + svm->vmcb : svm->vmcb01.ptr; + + switch (index) { + case MSR_IA32_DEBUGCTLMSR: + return vmcb->save.dbgctl; + case MSR_IA32_LASTBRANCHFROMIP: + return vmcb->save.br_from; + case MSR_IA32_LASTBRANCHTOIP: + return vmcb->save.br_to; + case MSR_IA32_LASTINTFROMIP: + return vmcb->save.last_excp_from; + case MSR_IA32_LASTINTTOIP: + return vmcb->save.last_excp_to; + default: + KVM_BUG(false, svm->vcpu.kvm, + "%s: Unknown MSR 0x%x", __func__, index); + return 0; + } +} + +void svm_update_lbrv(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + bool enable_lbrv = svm_get_lbr_msr(svm, MSR_IA32_DEBUGCTLMSR) & + DEBUGCTLMSR_LBR; + + bool current_enable_lbrv = !!(svm->vmcb->control.virt_ext & + LBR_CTL_ENABLE_MASK); + + if (enable_lbrv == current_enable_lbrv) + return; + + if (enable_lbrv) + svm_enable_lbrv(vcpu); + else + svm_disable_lbrv(vcpu); }
void disable_nmi_singlestep(struct vcpu_svm *svm) @@ -2704,25 +2776,12 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_TSC_AUX: msr_info->data = svm->tsc_aux; break; - /* - * Nobody will change the following 5 values in the VMCB so we can - * safely return them on rdmsr. They will always be 0 until LBRV is - * implemented. - */ case MSR_IA32_DEBUGCTLMSR: - msr_info->data = svm->vmcb->save.dbgctl; - break; case MSR_IA32_LASTBRANCHFROMIP: - msr_info->data = svm->vmcb->save.br_from; - break; case MSR_IA32_LASTBRANCHTOIP: - msr_info->data = svm->vmcb->save.br_to; - break; case MSR_IA32_LASTINTFROMIP: - msr_info->data = svm->vmcb->save.last_excp_from; - break; case MSR_IA32_LASTINTTOIP: - msr_info->data = svm->vmcb->save.last_excp_to; + msr_info->data = svm_get_lbr_msr(svm, msr_info->index); break; case MSR_VM_HSAVE_PA: msr_info->data = svm->nested.hsave_msr; @@ -2940,12 +2999,13 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) if (data & DEBUGCTL_RESERVED_BITS) return 1;
- svm->vmcb->save.dbgctl = data; - vmcb_mark_dirty(svm->vmcb, VMCB_LBR); - if (data & (1ULL<<0)) - svm_enable_lbrv(vcpu); + if (svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK) + svm->vmcb->save.dbgctl = data; else - svm_disable_lbrv(vcpu); + svm->vmcb01.ptr->save.dbgctl = data; + + svm_update_lbrv(vcpu); + break; case MSR_VM_HSAVE_PA: /* diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 524d943f3efc..0c351e9c4d6d 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -417,6 +417,8 @@ u32 svm_msrpm_offset(u32 msr); u32 *svm_vcpu_alloc_msrpm(void); void svm_vcpu_init_msrpm(struct kvm_vcpu *vcpu, u32 *msrpm); void svm_vcpu_free_msrpm(u32 *msrpm); +void svm_copy_lbrs(struct vmcb *from_vmcb, struct vmcb *to_vmcb); +void svm_update_lbrv(struct kvm_vcpu *vcpu);
int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer); void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
This was tested with a kvm-unit-test that was developed for this purpose.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com --- arch/x86/kvm/svm/nested.c | 21 +++++++++++++++++++-- arch/x86/kvm/svm/svm.c | 9 +++++++++ arch/x86/kvm/svm/svm.h | 1 + 3 files changed, 29 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index c6f09b591696..aadbff9b6514 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -503,8 +503,19 @@ static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12 vmcb_mark_dirty(svm->vmcb, VMCB_DR); }
- if (unlikely(svm->vmcb01.ptr->control.virt_ext & LBR_CTL_ENABLE_MASK)) + if (unlikely(svm->lbrv_enabled && (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) { + svm_copy_lbrs(vmcb12, svm->vmcb); + /* + * TODO: to avoid TOC/TOU race, + * make sure that we only pick LBR enable bit from + * the guest. + */ + svm->vmcb->save.dbgctl &= LBR_CTL_ENABLE_MASK; + svm_update_lbrv(&svm->vcpu); + + } else if (unlikely(svm->vmcb01.ptr->control.virt_ext & LBR_CTL_ENABLE_MASK)) { svm_copy_lbrs(svm->vmcb01.ptr, svm->vmcb); + } }
static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) @@ -555,6 +566,9 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
svm->vmcb->control.virt_ext = svm->vmcb01.ptr->control.virt_ext & LBR_CTL_ENABLE_MASK; + if (svm->lbrv_enabled) + svm->vmcb->control.virt_ext |= + (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK);
nested_svm_transition_tlb_flush(vcpu);
@@ -814,7 +828,10 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
svm_switch_vmcb(svm, &svm->vmcb01);
- if (unlikely(svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK)) { + if (unlikely(svm->lbrv_enabled && (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) { + svm_copy_lbrs(svm->nested.vmcb02.ptr, vmcb12); + svm_update_lbrv(vcpu); + } else if (unlikely(svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK)) { svm_copy_lbrs(svm->nested.vmcb02.ptr, svm->vmcb); svm_update_lbrv(vcpu); } diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 981cc9765b95..66f99e8d804c 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -877,6 +877,10 @@ void svm_update_lbrv(struct kvm_vcpu *vcpu) bool current_enable_lbrv = !!(svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK);
+ if (unlikely(is_guest_mode(vcpu) && svm->lbrv_enabled)) + if (unlikely(svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK)) + enable_lbrv = true; + if (enable_lbrv == current_enable_lbrv) return;
@@ -1006,6 +1010,9 @@ static __init void svm_set_cpu_caps(void) if (npt_enabled) kvm_cpu_cap_set(X86_FEATURE_NPT);
+ if (lbrv) + kvm_cpu_cap_set(X86_FEATURE_LBRV); + /* Nested VM can receive #VMEXIT instead of triggering #GP */ kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK); } @@ -4081,6 +4088,8 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) svm->nrips_enabled = kvm_cpu_cap_has(X86_FEATURE_NRIPS) && guest_cpuid_has(vcpu, X86_FEATURE_NRIPS);
+ svm->lbrv_enabled = lbrv && guest_cpuid_has(vcpu, X86_FEATURE_LBRV); + svm_recalc_instruction_intercepts(vcpu, svm);
/* For sev guests, the memory encryption bit is not reserved in CR3. */ diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 0c351e9c4d6d..c9a81e18707d 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -161,6 +161,7 @@ struct vcpu_svm {
/* cached guest cpuid flags for faster access */ bool nrips_enabled : 1; + bool lbrv_enabled : 1;
u32 ldr_reg; u32 dfr_reg;
This was tested by booting L1, L2 and L3 (all Linux) and checking that no VMLOAD/VMSAVE vmexits happened.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com --- arch/x86/kvm/svm/nested.c | 35 +++++++++++++++++++++++++++++------ arch/x86/kvm/svm/svm.c | 7 +++++++ arch/x86/kvm/svm/svm.h | 12 +++++++++--- 3 files changed, 45 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index aadbff9b6514..29b5d0f85960 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -119,6 +119,20 @@ static void nested_svm_uninit_mmu_context(struct kvm_vcpu *vcpu) vcpu->arch.walk_mmu = &vcpu->arch.root_mmu; }
+static bool nested_vmcb_needs_vls_intercept(struct vcpu_svm *svm) +{ + if (!svm->v_vmload_vmsave_enabled) + return true; + + if (!nested_npt_enabled(svm)) + return true; + + if (!(svm->nested.ctl.virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK)) + return true; + + return false; +} + void recalc_intercepts(struct vcpu_svm *svm) { struct vmcb_control_area *c, *h, *g; @@ -159,8 +173,17 @@ void recalc_intercepts(struct vcpu_svm *svm) if (!intercept_smi) vmcb_clr_intercept(c, INTERCEPT_SMI);
- vmcb_set_intercept(c, INTERCEPT_VMLOAD); - vmcb_set_intercept(c, INTERCEPT_VMSAVE); + if (nested_vmcb_needs_vls_intercept(svm)) { + /* + * If the virtual VMLOAD/VMSAVE is not enabled for the L2, + * we must intercept these instructions to correctly + * emulate them in case L1 doesn't intercept them. + */ + vmcb_set_intercept(c, INTERCEPT_VMLOAD); + vmcb_set_intercept(c, INTERCEPT_VMSAVE); + } else { + WARN_ON(!(c->virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK)); + } }
static void copy_vmcb_control_area(struct vmcb_control_area *dst, @@ -387,10 +410,7 @@ static void nested_save_pending_event_to_vmcb12(struct vcpu_svm *svm, vmcb12->control.exit_int_info = exit_int_info; }
-static inline bool nested_npt_enabled(struct vcpu_svm *svm) -{ - return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE; -} +
static void nested_svm_transition_tlb_flush(struct kvm_vcpu *vcpu) { @@ -570,6 +590,9 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) svm->vmcb->control.virt_ext |= (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK);
+ if (!nested_vmcb_needs_vls_intercept(svm)) + svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK; + nested_svm_transition_tlb_flush(vcpu);
/* Enter Guest-Mode */ diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 66f99e8d804c..6504e40e0985 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1013,6 +1013,9 @@ static __init void svm_set_cpu_caps(void) if (lbrv) kvm_cpu_cap_set(X86_FEATURE_LBRV);
+ if (vls) + kvm_cpu_cap_set(X86_FEATURE_V_VMSAVE_VMLOAD); + /* Nested VM can receive #VMEXIT instead of triggering #GP */ kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK); } @@ -4090,6 +4093,8 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
svm->lbrv_enabled = lbrv && guest_cpuid_has(vcpu, X86_FEATURE_LBRV);
+ svm->v_vmload_vmsave_enabled = vls && guest_cpuid_has(vcpu, X86_FEATURE_V_VMSAVE_VMLOAD); + svm_recalc_instruction_intercepts(vcpu, svm);
/* For sev guests, the memory encryption bit is not reserved in CR3. */ @@ -4129,6 +4134,8 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SYSENTER_EIP, 0, 0); set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SYSENTER_ESP, 0, 0); + + svm->v_vmload_vmsave_enabled = false; } else { /* * If hardware supports Virtual VMLOAD VMSAVE then enable it diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index c9a81e18707d..029340a7fbcc 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -159,9 +159,10 @@ struct vcpu_svm { unsigned int3_injected; unsigned long int3_rip;
- /* cached guest cpuid flags for faster access */ - bool nrips_enabled : 1; - bool lbrv_enabled : 1; + /* optional nested SVM features that are enabled for this guest */ + bool nrips_enabled : 1; + bool lbrv_enabled : 1; + bool v_vmload_vmsave_enabled : 1;
u32 ldr_reg; u32 dfr_reg; @@ -409,6 +410,11 @@ static inline bool gif_set(struct vcpu_svm *svm) return !!(svm->vcpu.arch.hflags & HF_GIF_MASK); }
+static inline bool nested_npt_enabled(struct vcpu_svm *svm) +{ + return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE; +} + /* svm.c */ #define MSR_INVALID 0xffffffffU
This makes it easy to simulate a CPU without this feature.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com --- arch/x86/kvm/svm/svm.c | 24 ++++++++++++++++++------ 1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 6504e40e0985..001a5af842ba 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -191,6 +191,9 @@ static int lbrv = true; module_param(lbrv, int, 0444);
+static int tsc_scaling = true; +module_param(tsc_scaling, int, 0444); + /* * enable / disable AVIC. Because the defaults differ for APICv * support between VMX and SVM we cannot use module_param_named. @@ -471,7 +474,7 @@ static int has_svm(void) static void svm_hardware_disable(void) { /* Make sure we clean up behind us */ - if (static_cpu_has(X86_FEATURE_TSCRATEMSR)) + if (tsc_scaling) wrmsrl(MSR_AMD64_TSC_RATIO, TSC_RATIO_DEFAULT);
cpu_svm_disable(); @@ -514,6 +517,10 @@ static int svm_hardware_enable(void) wrmsrl(MSR_VM_HSAVE_PA, __sme_page_pa(sd->save_area));
if (static_cpu_has(X86_FEATURE_TSCRATEMSR)) { + /* + * Set the default value, even if we don't use TSC scaling + * to avoid having stale value in the msr + */ wrmsrl(MSR_AMD64_TSC_RATIO, TSC_RATIO_DEFAULT); __this_cpu_write(current_tsc_ratio, TSC_RATIO_DEFAULT); } @@ -1063,10 +1070,15 @@ static __init int svm_hardware_setup(void) if (boot_cpu_has(X86_FEATURE_FXSR_OPT)) kvm_enable_efer_bits(EFER_FFXSR);
- if (boot_cpu_has(X86_FEATURE_TSCRATEMSR)) { - kvm_has_tsc_control = true; - kvm_max_tsc_scaling_ratio = TSC_RATIO_MAX; - kvm_tsc_scaling_ratio_frac_bits = 32; + if (tsc_scaling) { + if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR)) { + tsc_scaling = false; + } else { + pr_info("TSC scaling supported\n"); + kvm_has_tsc_control = true; + kvm_max_tsc_scaling_ratio = TSC_RATIO_MAX; + kvm_tsc_scaling_ratio_frac_bits = 32; + } }
tsc_aux_uret_slot = kvm_add_user_return_msr(MSR_TSC_AUX); @@ -1543,7 +1555,7 @@ static void svm_prepare_guest_switch(struct kvm_vcpu *vcpu) vmsave(__sme_page_pa(sd->save_area)); }
- if (static_cpu_has(X86_FEATURE_TSCRATEMSR)) { + if (tsc_scaling) { u64 tsc_ratio = vcpu->arch.tsc_scaling_ratio; if (tsc_ratio != __this_cpu_read(current_tsc_ratio)) { __this_cpu_write(current_tsc_ratio, tsc_ratio);
This was tested by booting a nested guest with TSC=1GHz, observing the clocks, and doing about 100 cycles of migration.
Note that a QEMU patch is needed to support migration, because a new MSR needs to be placed in the migration state.
The patch will be sent to the qemu mailing list soon.
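For reference, this is roughly how the L1 and L2 TSC parameters compose when L2 runs, which is what kvm_calc_nested_tsc_offset()/kvm_calc_nested_tsc_multiplier() compute below. A standalone sketch, not part of the patch; it assumes MSR_AMD64_TSC_RATIO's 8.32 fixed-point format and that scaling is applied before the offset, i.e. guest_tsc = host_tsc * ratio + offset:

#include <stdint.h>

#define TSC_RATIO_FRAC_BITS 32	/* MSR_AMD64_TSC_RATIO: 8.32 fixed point */

/* ratio L0 programs for L2:  ratio02 = ratio01 * ratio12 */
static uint64_t nested_tsc_ratio(uint64_t ratio01, uint64_t ratio12)
{
	return (uint64_t)(((unsigned __int128)ratio01 * ratio12)
			  >> TSC_RATIO_FRAC_BITS);
}

/* offset L0 programs for L2:  offset02 = offset01 * ratio12 + offset12 */
static int64_t nested_tsc_offset(int64_t offset01, int64_t offset12,
				 uint64_t ratio12)
{
	return (int64_t)(((__int128)offset01 * (__int128)ratio12)
			 >> TSC_RATIO_FRAC_BITS) + offset12;
}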
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com --- arch/x86/kvm/svm/nested.c | 29 +++++++++++++++++++++++++++-- arch/x86/kvm/svm/svm.c | 30 ++++++++++++++++++++++++++++-- arch/x86/kvm/svm/svm.h | 5 +++++ arch/x86/kvm/vmx/vmx.c | 1 + arch/x86/kvm/x86.c | 1 + 5 files changed, 62 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 29b5d0f85960..4c26417f36b8 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -572,8 +572,17 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) if (nested_npt_enabled(svm)) nested_svm_init_mmu_context(vcpu);
- svm->vmcb->control.tsc_offset = vcpu->arch.tsc_offset = - vcpu->arch.l1_tsc_offset + svm->nested.ctl.tsc_offset; + vcpu->arch.tsc_offset = kvm_calc_nested_tsc_offset( + vcpu->arch.l1_tsc_offset, + svm->nested.ctl.tsc_offset, + svm->tsc_ratio_msr); + + svm->vmcb->control.tsc_offset = vcpu->arch.tsc_offset; + + if (svm->tsc_ratio_msr != kvm_default_tsc_scaling_ratio) { + WARN_ON(!svm->tsc_scaling_enabled); + nested_svm_update_tsc_ratio_msr(vcpu); + }
svm->vmcb->control.int_ctl = (svm->nested.ctl.int_ctl & int_ctl_vmcb12_bits) | @@ -872,6 +881,12 @@ int nested_svm_vmexit(struct vcpu_svm *svm) vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS); }
+ if (svm->tsc_ratio_msr != kvm_default_tsc_scaling_ratio) { + WARN_ON(!svm->tsc_scaling_enabled); + vcpu->arch.tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio; + svm_write_tsc_multiplier(vcpu, vcpu->arch.tsc_scaling_ratio); + } + svm->nested.ctl.nested_cr3 = 0;
/* @@ -1259,6 +1274,16 @@ int nested_svm_exit_special(struct vcpu_svm *svm) return NESTED_EXIT_CONTINUE; }
+void nested_svm_update_tsc_ratio_msr(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + + vcpu->arch.tsc_scaling_ratio = + kvm_calc_nested_tsc_multiplier(vcpu->arch.l1_tsc_scaling_ratio, + svm->tsc_ratio_msr); + svm_write_tsc_multiplier(vcpu, vcpu->arch.tsc_scaling_ratio); +} + static int svm_get_nested_state(struct kvm_vcpu *vcpu, struct kvm_nested_state __user *user_kvm_nested_state, u32 user_data_size) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 001a5af842ba..0b797351cfb9 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1023,6 +1023,9 @@ static __init void svm_set_cpu_caps(void) if (vls) kvm_cpu_cap_set(X86_FEATURE_V_VMSAVE_VMLOAD);
+ if (tsc_scaling) + kvm_cpu_cap_set(X86_FEATURE_TSCRATEMSR); + /* Nested VM can receive #VMEXIT instead of triggering #GP */ kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK); } @@ -1215,7 +1218,9 @@ static u64 svm_get_l2_tsc_offset(struct kvm_vcpu *vcpu)
static u64 svm_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu) { - return kvm_default_tsc_scaling_ratio; + struct vcpu_svm *svm = to_svm(vcpu); + + return svm->tsc_ratio_msr; }
static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) @@ -1227,7 +1232,7 @@ static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset) vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS); }
-static void svm_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier) +void svm_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier) { wrmsrl(MSR_AMD64_TSC_RATIO, multiplier); } @@ -1405,6 +1410,7 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
enable_gif(svm);
+ svm->tsc_ratio_msr = kvm_default_tsc_scaling_ratio; }
static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) @@ -2765,6 +2771,11 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) struct vcpu_svm *svm = to_svm(vcpu);
switch (msr_info->index) { + case MSR_AMD64_TSC_RATIO: + if (!msr_info->host_initiated && !svm->tsc_scaling_enabled) + return 1; + msr_info->data = svm->tsc_ratio_msr; + break; case MSR_STAR: msr_info->data = svm->vmcb01.ptr->save.star; break; @@ -2901,6 +2912,19 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) u32 ecx = msr->index; u64 data = msr->data; switch (ecx) { + case MSR_AMD64_TSC_RATIO: + if (!msr->host_initiated && !svm->tsc_scaling_enabled) + return 1; + + if (data & TSC_RATIO_RSVD) + return 1; + + svm->tsc_ratio_msr = data; + + if (svm->tsc_scaling_enabled && is_guest_mode(vcpu)) + nested_svm_update_tsc_ratio_msr(vcpu); + + break; case MSR_IA32_CR_PAT: if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data)) return 1; @@ -4107,6 +4131,8 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
svm->v_vmload_vmsave_enabled = vls && guest_cpuid_has(vcpu, X86_FEATURE_V_VMSAVE_VMLOAD);
+ svm->tsc_scaling_enabled = tsc_scaling && guest_cpuid_has(vcpu, X86_FEATURE_TSCRATEMSR); + svm_recalc_instruction_intercepts(vcpu, svm);
/* For sev guests, the memory encryption bit is not reserved in CR3. */ diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 029340a7fbcc..c8ea3b14da73 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -140,6 +140,8 @@ struct vcpu_svm { u64 next_rip;
u64 spec_ctrl; + + u64 tsc_ratio_msr; /* * Contains guest-controlled bits of VIRT_SPEC_CTRL, which will be * translated into the appropriate L2_CFG bits on the host to @@ -163,6 +165,7 @@ struct vcpu_svm { bool nrips_enabled : 1; bool lbrv_enabled : 1; bool v_vmload_vmsave_enabled : 1; + bool tsc_scaling_enabled : 1;
u32 ldr_reg; u32 dfr_reg; @@ -491,6 +494,8 @@ int nested_svm_check_permissions(struct kvm_vcpu *vcpu); int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr, bool has_error_code, u32 error_code); int nested_svm_exit_special(struct vcpu_svm *svm); +void nested_svm_update_tsc_ratio_msr(struct kvm_vcpu *vcpu); +void svm_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier); void nested_load_control_from_vmcb12(struct vcpu_svm *svm, struct vmcb_control_area *control); void nested_sync_control_from_vmcb02(struct vcpu_svm *svm); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index fada1055f325..e1d8f0df8172 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -6404,6 +6404,7 @@ static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index) case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC: return nested; case MSR_AMD64_VIRT_SPEC_CTRL: + case MSR_AMD64_TSC_RATIO: /* This is AMD only. */ return false; default: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 86539c1686fa..1b7881c7a516 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1374,6 +1374,7 @@ static const u32 emulated_msrs_all[] = { MSR_PLATFORM_INFO, MSR_MISC_FEATURES_ENABLES, MSR_AMD64_VIRT_SPEC_CTRL, + MSR_AMD64_TSC_RATIO, MSR_IA32_POWER_CTL, MSR_IA32_UCODE_REV,
Allow L1 to use these settings if L0 disables PAUSE interception (AKA cpu_pm=on)
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com --- arch/x86/kvm/svm/nested.c | 6 ++++++ arch/x86/kvm/svm/svm.c | 18 ++++++++++++++++++ arch/x86/kvm/svm/svm.h | 2 ++ 3 files changed, 26 insertions(+)
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index 4c26417f36b8..b47c745ec659 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -602,6 +602,12 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm) if (!nested_vmcb_needs_vls_intercept(svm)) svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
+ if (svm->pause_filter_enabled) + svm->vmcb->control.pause_filter_count = svm->nested.ctl.pause_filter_count; + + if (svm->pause_threshold_enabled) + svm->vmcb->control.pause_filter_thresh = svm->nested.ctl.pause_filter_thresh; + nested_svm_transition_tlb_flush(vcpu);
/* Enter Guest-Mode */ diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 0b797351cfb9..95125cd2e666 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1026,6 +1026,13 @@ static __init void svm_set_cpu_caps(void) if (tsc_scaling) kvm_cpu_cap_set(X86_FEATURE_TSCRATEMSR);
+ if (pause_filter_count) + kvm_cpu_cap_set(X86_FEATURE_PAUSEFILTER); + + if (pause_filter_thresh) + kvm_cpu_cap_set(X86_FEATURE_PFTHRESHOLD); + + /* Nested VM can receive #VMEXIT instead of triggering #GP */ kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK); } @@ -4133,6 +4140,17 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
svm->tsc_scaling_enabled = tsc_scaling && guest_cpuid_has(vcpu, X86_FEATURE_TSCRATEMSR);
+ if (kvm_pause_in_guest(vcpu->kvm)) { + svm->pause_filter_enabled = pause_filter_count > 0 && + guest_cpuid_has(vcpu, X86_FEATURE_PAUSEFILTER); + + svm->pause_threshold_enabled = pause_filter_thresh > 0 && + guest_cpuid_has(vcpu, X86_FEATURE_PFTHRESHOLD); + } else { + svm->pause_filter_enabled = false; + svm->pause_threshold_enabled = false; + } + svm_recalc_instruction_intercepts(vcpu, svm);
/* For sev guests, the memory encryption bit is not reserved in CR3. */ diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index c8ea3b14da73..7e679dd7a6e4 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -166,6 +166,8 @@ struct vcpu_svm { bool lbrv_enabled : 1; bool v_vmload_vmsave_enabled : 1; bool tsc_scaling_enabled : 1; + bool pause_filter_enabled : 1; + bool pause_threshold_enabled : 1;
u32 ldr_reg; u32 dfr_reg;
On 14/09/21 17:48, Maxim Levitsky wrote:
These are a few patches I was working on lately, all somewhat related to the two CVEs that I found recently.
The first 7 patches fix various minor bugs that relate to these CVEs.
The rest of the patches implement various optional SVM features, some of which the guest could enable anyway due to incorrect checking of virt_ext field.
The last patch is somewhat of an RFC; I would like to hear your opinion on it.
I also implemented nested TSC scaling while at it.
As for other optional SVM features, here is my summary of a few features I took a look at:
X86_FEATURE_DECODEASSISTS: this feature should make it easier for L1 to emulate an instruction on MMIO access, by not needing to read the guest memory but rather using the instruction bytes that the CPU already fetched.
The challenge of implementing this is that we sometimes inject #PF and #NPT synthetically, and in those cases we must be sure we set the correct instruction bytes. Also this feature adds assists for MOV CR/DR, INTn, and INVLPG, which aren't that interesting but must be supported as well to expose this feature to the nested guest.
X86_FEATURE_VGIF: might allow L2 to run L3 a bit faster, but given the already crazy complex logic we have around int_ctl and vgif, it is probably not worth it.
X86_FEATURE_VMCBCLEAN: should just be enabled, because otherwise L1 doesn't even attempt to set the clean bits. But we need to know if we can take advantage of these bits first.
X86_FEATURE_FLUSHBYASID X86_FEATURE_AVIC: these two features would be very good to enable, but that would require lots of work, and will be done eventually.
There are a few more nested SVM features that I haven't yet had a chance to take a look at.
Best regards, Maxim Levitsky
Maxim Levitsky (14):
  KVM: x86: nSVM: restore int_vector in svm_clear_vintr
  KVM: x86: selftests: test simultaneous uses of V_IRQ from L1 and L0
  KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround
  KVM: x86: nSVM: don't copy pause related settings
  KVM: x86: nSVM: don't copy virt_ext from vmcb12
  KVM: x86: SVM: don't set VMLOAD/VMSAVE intercepts on vCPU reset
  KVM: x86: SVM: add warning for CVE-2021-3656
  KVM: x86: SVM: add module param to control LBR virtualization
  KVM: x86: nSVM: correctly virtualize LBR msrs when L2 is running
  KVM: x86: nSVM: implement nested LBR virtualization
  KVM: x86: nSVM: implement nested VMLOAD/VMSAVE
  KVM: x86: SVM: add module param to control TSC scaling
  KVM: x86: nSVM: implement nested TSC scaling
  KVM: x86: nSVM: support PAUSE filter threshold and count

 arch/x86/kvm/svm/nested.c                     | 105 +++++++--
 arch/x86/kvm/svm/svm.c                        | 218 +++++++++++++++---
 arch/x86/kvm/svm/svm.h                        |  20 +-
 arch/x86/kvm/vmx/vmx.c                        |   1 +
 arch/x86/kvm/x86.c                            |   1 +
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/x86_64/svm_int_ctl_test.c   | 128 ++++++++++
 8 files changed, 427 insertions(+), 48 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/svm_int_ctl_test.c
Queued more patches, with 9-10-11-14 left now.
Paolo