Rework how KVM limits guest-unsupported xfeatures so that they are hidden only when saving state for userspace (KVM_GET_XSAVE), i.e. let userspace load all host-supported xfeatures (via KVM_SET_XSAVE) irrespective of what features have been exposed to the guest.
The effect on KVM_SET_XSAVE was knowingly done by commit ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0"):
As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2.
Preventing userspace from doing stupid things is usually a good idea, but in this case restricting KVM_SET_XSAVE actually exacerbated the problem that commit ad856280ddea was fixing. As reported by Tyler, rejecting KVM_SET_XSAVE for guest-unsupported xfeatures breaks live migration from a kernel without commit ad856280ddea, to a kernel with ad856280ddea. I.e. from a kernel that saves guest-unsupported xfeatures to a kernel that doesn't allow loading guest-unsupported xfeatures.
To make matters even worse, QEMU doesn't terminate if KVM_SET_XSAVE fails, and so the end result is that the live migration results in (possibly silent) guest data corruption instead of a failed migration.
Patch 1 refactors the FPU code to let KVM pass in a mask of which xfeatures to save, patch 2 fixes KVM by passing in guest_supported_xcr0 instead of modifying user_xfeatures directly.
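As a rough illustration of the behavior change on the KVM_SET_XSAVE side, here is a minimal userspace model. It is not kernel code: the enum, the function name, and the masks used with it are all hypothetical. The only thing taken from this series is the rule each policy applies to the xstate_bv that userspace tries to load.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the KVM_SET_XSAVE acceptance rule; not kernel code. */
enum set_xsave_policy {
	POLICY_GUEST_SUPPORTED,	/* behavior with commit ad856280ddea alone */
	POLICY_HOST_SUPPORTED,	/* historical ABI, restored by this series */
};

/* FP (bit 0) and SSE (bit 1) can always be saved and restored. */
#define MODEL_XFEATURE_MASK_FPSSE	0x3ull

static bool model_set_xsave_allowed(enum set_xsave_policy policy,
				    uint64_t host_xfeatures,
				    uint64_t guest_xfeatures,
				    uint64_t xstate_bv)
{
	uint64_t allowed = host_xfeatures;

	if (policy == POLICY_GUEST_SUPPORTED)
		allowed = guest_xfeatures | MODEL_XFEATURE_MASK_FPSSE;

	/* Reject the load if any requested bit falls outside the allowed set. */
	return !(xstate_bv & ~allowed);
}
```

With a host mask of 0x207 (FP, SSE, YMM, PKRU), a guest mask of 0x7, and an over-saved xstate_bv of 0x207 from an older kernel, the guest-supported policy rejects the load while the host-supported policy accepts it. If the target host itself only supports 0x7, both policies reject it, which is the scenario this series does not try to fix.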
Patches 3-5 are regression tests.
I have no objection if anyone wants patches 1 and 2 squashed together, I split them purely to make review easier.
Note, this doesn't fix the scenario where a guest is migrated from a "bad" to a "good" kernel and the target host doesn't support the over-saved set of xfeatures. I don't see a way to safely handle that in the kernel without an opt-in, which more or less defeats the purpose of handling it in KVM.
Sean Christopherson (5):
  x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer
  KVM: x86: Constrain guest-supported xfeatures only at KVM_GET_XSAVE{2}
  KVM: selftests: Touch relevant XSAVE state in guest for state test
  KVM: selftests: Load XSAVE state into untouched vCPU during state test
  KVM: selftests: Force load all supported XSAVE state in state test

 arch/x86/include/asm/fpu/api.h                |   3 +-
 arch/x86/kernel/fpu/core.c                    |   5 +-
 arch/x86/kernel/fpu/xstate.c                  |  12 +-
 arch/x86/kernel/fpu/xstate.h                  |   3 +-
 arch/x86/kvm/cpuid.c                          |   8 --
 arch/x86/kvm/x86.c                            |  37 +++---
 .../selftests/kvm/include/x86_64/processor.h  |  23 ++++
 .../testing/selftests/kvm/x86_64/state_test.c | 110 +++++++++++++++++-
 8 files changed, 168 insertions(+), 33 deletions(-)
base-commit: 5804c19b80bf625c6a9925317f845e497434d6d3
Plumb an xfeatures mask into __copy_xstate_to_uabi_buf() so that KVM can constrain which xfeatures are saved into the userspace buffer without having to modify the user_xfeatures field in KVM's guest_fpu state.
KVM's ABI for KVM_GET_XSAVE{2} is that features that are not exposed to the guest must not show up in the effective xstate_bv field of the buffer. Saving only the guest-supported xfeatures allows userspace to load the saved state on a different host with fewer xfeatures, so long as the target host supports the xfeatures that are exposed to the guest.
KVM currently sets user_xfeatures directly to restrict KVM_GET_XSAVE{2} to the set of guest-supported xfeatures, but doing so broke KVM's historical ABI for KVM_SET_XSAVE, which allows userspace to load any xfeatures that are supported by the *host*.
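The masking that the XSTATE_COPY_XSAVE path ends up doing can be sketched as a plain intersection. This is a hypothetical userspace model, not the kernel implementation: the function names and the in_use parameter (standing in for the features currently holding non-init state) are assumptions made for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical model of the XSTATE_COPY_XSAVE masking step: a feature shows
 * up in the saved header only if it is in use, permitted by user_xfeatures,
 * and requested by the caller's mask.
 */
static uint64_t model_saved_xstate_bv(uint64_t in_use,
				      uint64_t user_xfeatures,
				      uint64_t caller_mask)
{
	return in_use & user_xfeatures & caller_mask;
}

/* The saved state loads on a target only if the target supports every bit. */
static int model_loadable_on(uint64_t saved_bv, uint64_t target_host_xfeatures)
{
	return !(saved_bv & ~target_host_xfeatures);
}
```

Passing user_xfeatures itself as the caller mask, as this patch does for both the task path and KVM, leaves the result unchanged. A narrower caller mask trims the header without touching user_xfeatures, which is what the KVM fix that follows relies on.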
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/fpu/api.h |  3 ++-
 arch/x86/kernel/fpu/core.c     |  5 +++--
 arch/x86/kernel/fpu/xstate.c   |  7 +++++--
 arch/x86/kernel/fpu/xstate.h   |  3 ++-
 arch/x86/kvm/x86.c             | 23 ++++++++++-------------
 5 files changed, 22 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h index 31089b851c4f..a2be3aefff9f 100644 --- a/arch/x86/include/asm/fpu/api.h +++ b/arch/x86/include/asm/fpu/api.h @@ -157,7 +157,8 @@ static inline void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd) { static inline void fpu_sync_guest_vmexit_xfd_state(void) { } #endif
-extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru); +extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, + unsigned int size, u64 xfeatures, u32 pkru); extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, u64 xcr0, u32 *vpkru);
static inline void fpstate_set_confidential(struct fpu_guest *gfpu) diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index a86d37052a64..a21a4d0ecc34 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -369,14 +369,15 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest) EXPORT_SYMBOL_GPL(fpu_swap_kvm_fpstate);
void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, - unsigned int size, u32 pkru) + unsigned int size, u64 xfeatures, u32 pkru) { struct fpstate *kstate = gfpu->fpstate; union fpregs_state *ustate = buf; struct membuf mb = { .p = buf, .left = size };
if (cpu_feature_enabled(X86_FEATURE_XSAVE)) { - __copy_xstate_to_uabi_buf(mb, kstate, pkru, XSTATE_COPY_XSAVE); + __copy_xstate_to_uabi_buf(mb, kstate, xfeatures, pkru, + XSTATE_COPY_XSAVE); } else { memcpy(&ustate->fxsave, &kstate->regs.fxsave, sizeof(ustate->fxsave)); diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index cadf68737e6b..76408313ed7f 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -1049,6 +1049,7 @@ static void copy_feature(bool from_xstate, struct membuf *to, void *xstate, * __copy_xstate_to_uabi_buf - Copy kernel saved xstate to a UABI buffer * @to: membuf descriptor * @fpstate: The fpstate buffer from which to copy + * @xfeatures: The mask of xfeatures to save (XSAVE mode only) * @pkru_val: The PKRU value to store in the PKRU component * @copy_mode: The requested copy mode * @@ -1059,7 +1060,8 @@ static void copy_feature(bool from_xstate, struct membuf *to, void *xstate, * It supports partial copy but @to.pos always starts from zero. */ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, - u32 pkru_val, enum xstate_copy_mode copy_mode) + u64 xfeatures, u32 pkru_val, + enum xstate_copy_mode copy_mode) { const unsigned int off_mxcsr = offsetof(struct fxregs_state, mxcsr); struct xregs_state *xinit = &init_fpstate.regs.xsave; @@ -1083,7 +1085,7 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, break;
case XSTATE_COPY_XSAVE: - header.xfeatures &= fpstate->user_xfeatures; + header.xfeatures &= fpstate->user_xfeatures & xfeatures; break; }
@@ -1185,6 +1187,7 @@ void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk, enum xstate_copy_mode copy_mode) { __copy_xstate_to_uabi_buf(to, tsk->thread.fpu.fpstate, + tsk->thread.fpu.fpstate->user_xfeatures, tsk->thread.pkru, copy_mode); }
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h index a4ecb04d8d64..3518fb26d06b 100644 --- a/arch/x86/kernel/fpu/xstate.h +++ b/arch/x86/kernel/fpu/xstate.h @@ -43,7 +43,8 @@ enum xstate_copy_mode {
struct membuf; extern void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, - u32 pkru_val, enum xstate_copy_mode copy_mode); + u64 xfeatures, u32 pkru_val, + enum xstate_copy_mode copy_mode); extern void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk, enum xstate_copy_mode mode); extern int copy_uabi_from_kernel_to_xstate(struct fpstate *fpstate, const void *kbuf, u32 *pkru); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9f18b06bbda6..41d8e6c8570c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5382,17 +5382,6 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu, return 0; }
-static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu, - struct kvm_xsave *guest_xsave) -{ - if (fpstate_is_confidential(&vcpu->arch.guest_fpu)) - return; - - fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu, - guest_xsave->region, - sizeof(guest_xsave->region), - vcpu->arch.pkru); -}
static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu, u8 *state, unsigned int size) @@ -5400,8 +5389,16 @@ static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu, if (fpstate_is_confidential(&vcpu->arch.guest_fpu)) return;
- fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu, - state, size, vcpu->arch.pkru); + fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu, state, size, + vcpu->arch.guest_fpu.fpstate->user_xfeatures, + vcpu->arch.pkru); +} + +static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu, + struct kvm_xsave *guest_xsave) +{ + return kvm_vcpu_ioctl_x86_get_xsave2(vcpu, (void *)guest_xsave->region, + sizeof(guest_xsave->region)); }
static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
Mask off xfeatures that aren't exposed to the guest only when saving guest state via KVM_GET_XSAVE{2} instead of modifying user_xfeatures directly. Preserving the maximal set of xfeatures in user_xfeatures restores KVM's ABI for KVM_SET_XSAVE, which prior to commit ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0") allowed userspace to load xfeatures that are supported by the host, irrespective of what xfeatures are exposed to the guest.
There is no known use case where userspace *intentionally* loads xfeatures that aren't exposed to the guest, but the bug fixed by commit ad856280ddea was specifically that KVM_GET_XSAVE{2} would save xfeatures that weren't exposed to the guest, e.g. would lead to userspace unintentionally loading guest-unsupported xfeatures when live migrating a VM.
Restricting KVM_SET_XSAVE to guest-supported xfeatures is especially problematic for QEMU-based setups, as QEMU has a bug where instead of terminating the VM if KVM_SET_XSAVE fails, QEMU instead simply stops loading guest state, i.e. resumes the guest after live migration with incomplete guest state, and ultimately results in guest data corruption.
Note, letting userspace restore all host-supported xfeatures does not fix setups where a VM is migrated from a host *without* commit ad856280ddea, to a target with a subset of host-supported xfeatures. However there is no way to safely address that scenario, e.g. KVM could silently drop the unsupported features, but that would be a clear violation of KVM's ABI and so would require userspace to opt-in, at which point userspace could simply be updated to sanitize the to-be-loaded XSAVE state.
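For the remaining scenario, the userspace-side sanitizing mentioned above might look roughly like the sketch below. It assumes the non-compacted XSAVE layout, where XSTATE_BV is the first 8 bytes of the XSAVE header at byte offset 512. The function name and the target_xfeatures parameter are hypothetical, and how a VMM would obtain the target mask is left out.

```c
#include <stdint.h>
#include <string.h>

/* XSTATE_BV is the first field of the XSAVE header, after the legacy region. */
#define XSAVE_HDR_OFFSET	512

/*
 * Hypothetical helper: clear xstate_bv bits the target cannot load and
 * return the bits that were dropped so the caller can log or refuse.
 */
static uint64_t sanitize_xstate_bv(uint8_t *region, size_t size,
				   uint64_t target_xfeatures)
{
	uint64_t bv, dropped;

	/* Buffer too small to even hold XSTATE_BV: nothing to sanitize. */
	if (size < XSAVE_HDR_OFFSET + sizeof(bv))
		return 0;

	memcpy(&bv, region + XSAVE_HDR_OFFSET, sizeof(bv));
	dropped = bv & ~target_xfeatures;
	bv &= target_xfeatures;
	memcpy(region + XSAVE_HDR_OFFSET, &bv, sizeof(bv));

	return dropped;
}
```

Dropping a bit discards that component's saved data, so this is only reasonable for xfeatures that were never exposed to the guest. Making that decision is exactly the opt-in that has to live in userspace.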
Reported-by: Tyler Stachecki <stachecki.tyler@gmail.com>
Closes: https://lore.kernel.org/all/20230914010003.358162-1-tstachecki@bloomberg.net
Fixes: ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0")
Cc: stable@vger.kernel.org
Cc: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/fpu/xstate.c |  5 +----
 arch/x86/kvm/cpuid.c         |  8 --------
 arch/x86/kvm/x86.c           | 18 ++++++++++++++++--
 3 files changed, 17 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index 76408313ed7f..ef6906107c54 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -1539,10 +1539,7 @@ static int fpstate_realloc(u64 xfeatures, unsigned int ksize, fpregs_restore_userregs();
newfps->xfeatures = curfps->xfeatures | xfeatures; - - if (!guest_fpu) - newfps->user_xfeatures = curfps->user_xfeatures | xfeatures; - + newfps->user_xfeatures = curfps->user_xfeatures | xfeatures; newfps->xfd = curfps->xfd & ~xfeatures;
/* Do the final updates within the locked region */ diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 0544e30b4946..773132c3bf5a 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -360,14 +360,6 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) vcpu->arch.guest_supported_xcr0 = cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
- /* - * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if - * XSAVE/XCRO are not exposed to the guest, and even if XSAVE isn't - * supported by the host. - */ - vcpu->arch.guest_fpu.fpstate->user_xfeatures = vcpu->arch.guest_supported_xcr0 | - XFEATURE_MASK_FPSSE; - kvm_update_pv_runtime(vcpu);
vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 41d8e6c8570c..1e645f5b1e2c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5386,12 +5386,26 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu, static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu, u8 *state, unsigned int size) { + /* + * Only copy state for features that are enabled for the guest. The + * state itself isn't problematic, but setting bits in the header for + * features that are supported in *this* host but not exposed to the + * guest can result in KVM_SET_XSAVE failing when live migrating to a + * compatible host without the features that are NOT exposed to the + * guest. + * + * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if + * XSAVE/XCRO are not exposed to the guest, and even if XSAVE isn't + * supported by the host. + */ + u64 supported_xcr0 = vcpu->arch.guest_supported_xcr0 | + XFEATURE_MASK_FPSSE; + if (fpstate_is_confidential(&vcpu->arch.guest_fpu)) return;
fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu, state, size, - vcpu->arch.guest_fpu.fpstate->user_xfeatures, - vcpu->arch.pkru); + supported_xcr0, vcpu->arch.pkru); }
static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
On 9/27/23 17:19, Sean Christopherson wrote:
Mask off xfeatures that aren't exposed to the guest only when saving guest state via KVM_GET_XSAVE{2} instead of modifying user_xfeatures directly. Preserving the maximal set of xfeatures in user_xfeatures restores KVM's ABI for KVM_SET_XSAVE, which prior to commit ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0") allowed userspace to load xfeatures that are supported by the host, irrespective of what xfeatures are exposed to the guest.
There is no known use case where userspace *intentionally* loads xfeatures that aren't exposed to the guest, but the bug fixed by commit ad856280ddea was specifically that KVM_GET_XSAVE{2} would save xfeatures that weren't exposed to the guest, e.g. would lead to userspace unintentionally loading guest-unsupported xfeatures when live migrating a VM.
Restricting KVM_SET_XSAVE to guest-supported xfeatures is especially problematic for QEMU-based setups, as QEMU has a bug where instead of terminating the VM if KVM_SET_XSAVE fails, QEMU instead simply stops loading guest state, i.e. resumes the guest after live migration with incomplete guest state, and ultimately results in guest data corruption.
Note, letting userspace restore all host-supported xfeatures does not fix setups where a VM is migrated from a host *without* commit ad856280ddea, to a target with a subset of host-supported xfeatures. However there is no way to safely address that scenario, e.g. KVM could silently drop the unsupported features, but that would be a clear violation of KVM's ABI and so would require userspace to opt-in, at which point userspace could simply be updated to sanitize the to-be-loaded XSAVE state.
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
It's surprising (and nice) that this eliminates the !guest_fpu check in fpstate_realloc().
Modify supported XSAVE state in the "state test's" guest code so that saving and loading state via KVM_{G,S}ET_XSAVE actually does something useful, i.e. so that xstate_bv in XSAVE state isn't empty.
Punt on BNDCSR for now, it's easier to just stuff that xfeature from the host side.
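When checking by hand that xstate_bv is no longer empty, a small decoder can help. The sketch below is a hypothetical debugging helper, not part of the selftest. The bit positions follow the architectural xfeature numbering, which the XFEATURE_MASK_* definitions in the selftest header also use; the table and function names are made up.

```c
#include <stddef.h>
#include <stdint.h>

struct xfeature_name {
	unsigned int bit;
	const char *name;
};

/* Architectural xfeature bit positions; gaps are unnamed here. */
static const struct xfeature_name xfeature_names[] = {
	{ 0, "FP" },		{ 1, "SSE" },		{ 2, "YMM" },
	{ 3, "BNDREGS" },	{ 4, "BNDCSR" },	{ 5, "OPMASK" },
	{ 6, "ZMM_Hi256" },	{ 7, "Hi16_ZMM" },	{ 8, "PT" },
	{ 9, "PKRU" },		{ 10, "PASID" },	{ 11, "CET_USER" },
	{ 12, "CET_KERNEL" },	{ 15, "LBR" },		{ 17, "XTILE_CFG" },
	{ 18, "XTILE_DATA" },
};

/*
 * Store up to @max names of the xfeatures set in @xstate_bv into @names and
 * return how many named xfeatures are set in total.
 */
static size_t decode_xstate_bv(uint64_t xstate_bv, const char **names,
			       size_t max)
{
	size_t i, n = 0;

	for (i = 0; i < sizeof(xfeature_names) / sizeof(xfeature_names[0]); i++) {
		if (!(xstate_bv & (1ull << xfeature_names[i].bit)))
			continue;
		if (n < max)
			names[n] = xfeature_names[i].name;
		n++;
	}

	return n;
}
```

A component that is still in its init state does not set its bit, so a return value of zero for a freshly created vCPU is expected and is the case this patch is trying to get away from.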
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../selftests/kvm/include/x86_64/processor.h  | 14 ++++
 .../testing/selftests/kvm/x86_64/state_test.c | 77 +++++++++++++++++++
 2 files changed, 91 insertions(+)
diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h index 4fd042112526..6f66861175ad 100644 --- a/tools/testing/selftests/kvm/include/x86_64/processor.h +++ b/tools/testing/selftests/kvm/include/x86_64/processor.h @@ -68,6 +68,12 @@ struct xstate { #define XFEATURE_MASK_OPMASK BIT_ULL(5) #define XFEATURE_MASK_ZMM_Hi256 BIT_ULL(6) #define XFEATURE_MASK_Hi16_ZMM BIT_ULL(7) +#define XFEATURE_MASK_PT BIT_ULL(8) +#define XFEATURE_MASK_PKRU BIT_ULL(9) +#define XFEATURE_MASK_PASID BIT_ULL(10) +#define XFEATURE_MASK_CET_USER BIT_ULL(11) +#define XFEATURE_MASK_CET_KERNEL BIT_ULL(12) +#define XFEATURE_MASK_LBR BIT_ULL(15) #define XFEATURE_MASK_XTILE_CFG BIT_ULL(17) #define XFEATURE_MASK_XTILE_DATA BIT_ULL(18)
@@ -147,6 +153,7 @@ struct kvm_x86_cpu_feature { #define X86_FEATURE_CLWB KVM_X86_CPU_FEATURE(0x7, 0, EBX, 24) #define X86_FEATURE_UMIP KVM_X86_CPU_FEATURE(0x7, 0, ECX, 2) #define X86_FEATURE_PKU KVM_X86_CPU_FEATURE(0x7, 0, ECX, 3) +#define X86_FEATURE_OSPKE KVM_X86_CPU_FEATURE(0x7, 0, ECX, 4) #define X86_FEATURE_LA57 KVM_X86_CPU_FEATURE(0x7, 0, ECX, 16) #define X86_FEATURE_RDPID KVM_X86_CPU_FEATURE(0x7, 0, ECX, 22) #define X86_FEATURE_SGX_LC KVM_X86_CPU_FEATURE(0x7, 0, ECX, 30) @@ -553,6 +560,13 @@ static inline void xsetbv(u32 index, u64 value) __asm__ __volatile__("xsetbv" :: "a" (eax), "d" (edx), "c" (index)); }
+static inline void wrpkru(u32 pkru) +{ + /* Note, ECX and EDX are architecturally required to be '0'. */ + asm volatile(".byte 0x0f,0x01,0xef\n\t" + : : "a" (pkru), "c"(0), "d"(0)); +} + static inline struct desc_ptr get_gdt(void) { struct desc_ptr gdt; diff --git a/tools/testing/selftests/kvm/x86_64/state_test.c b/tools/testing/selftests/kvm/x86_64/state_test.c index 4c4925a8ab45..df3e93df4343 100644 --- a/tools/testing/selftests/kvm/x86_64/state_test.c +++ b/tools/testing/selftests/kvm/x86_64/state_test.c @@ -139,6 +139,83 @@ static void vmx_l1_guest_code(struct vmx_pages *vmx_pages) static void __attribute__((__flatten__)) guest_code(void *arg) { GUEST_SYNC(1); + + if (this_cpu_has(X86_FEATURE_XSAVE)) { + uint64_t supported_xcr0 = this_cpu_supported_xcr0(); + uint8_t buffer[4096]; + + memset(buffer, 0xcc, sizeof(buffer)); + + set_cr4(get_cr4() | X86_CR4_OSXSAVE); + GUEST_ASSERT(this_cpu_has(X86_FEATURE_OSXSAVE)); + + xsetbv(0, xgetbv(0) | supported_xcr0); + + /* + * Modify state for all supported xfeatures to take them out of + * their "init" state, i.e. to make them show up in XSTATE_BV. + * + * Note off-by-default features, e.g. AMX, are out of scope for + * this particular testcase as they have a different ABI. 
+ */ + GUEST_ASSERT(supported_xcr0 & XFEATURE_MASK_FP); + asm volatile ("fincstp"); + + GUEST_ASSERT(supported_xcr0 & XFEATURE_MASK_SSE); + asm volatile ("vmovdqu %0, %%xmm0" :: "m" (buffer)); + + if (supported_xcr0 & XFEATURE_MASK_YMM) + asm volatile ("vmovdqu %0, %%ymm0" :: "m" (buffer)); + + if (supported_xcr0 & XFEATURE_MASK_AVX512) { + asm volatile ("kmovq %0, %%k1" :: "r" (-1ull)); + asm volatile ("vmovupd %0, %%zmm0" :: "m" (buffer)); + asm volatile ("vmovupd %0, %%zmm16" :: "m" (buffer)); + } + + if (this_cpu_has(X86_FEATURE_MPX)) { + uint64_t bounds[2] = { 10, 0xffffffffull }; + uint64_t output[2] = { }; + + GUEST_ASSERT(supported_xcr0 & XFEATURE_MASK_BNDREGS); + GUEST_ASSERT(supported_xcr0 & XFEATURE_MASK_BNDCSR); + + /* + * Don't bother trying to get BNDCSR into the INUSE + * state. MSR_IA32_BNDCFGS doesn't count as it isn't + * managed via XSAVE/XRSTOR, and BNDCFGU can only be + * modified by XRSTOR. Stuffing XSTATE_BV in the host + * is simpler than doing XRSTOR here in the guest. + * + * However, temporarily enable MPX in BNDCFGS so that + * BNDMOV actually loads BND1. If MPX isn't *fully* + * enabled, all MPX instructions are treated as NOPs. + * + * Hand encode "bndmov (%rax),%bnd1" as support for MPX + * mnemonics/registers has been removed from gcc and + * clang (and was never fully supported by clang). + */ + wrmsr(MSR_IA32_BNDCFGS, BIT_ULL(0)); + asm volatile (".byte 0x66,0x0f,0x1a,0x08" :: "a" (bounds)); + /* + * Hand encode "bndmov %bnd1, (%rax)" to sanity check + * that BND1 actually got loaded. + */ + asm volatile (".byte 0x66,0x0f,0x1b,0x08" :: "a" (output)); + wrmsr(MSR_IA32_BNDCFGS, 0); + + GUEST_ASSERT_EQ(bounds[0], output[0]); + GUEST_ASSERT_EQ(bounds[1], output[1]); + } + if (this_cpu_has(X86_FEATURE_PKU)) { + GUEST_ASSERT(supported_xcr0 & XFEATURE_MASK_PKRU); + set_cr4(get_cr4() | X86_CR4_PKE); + GUEST_ASSERT(this_cpu_has(X86_FEATURE_OSPKE)); + + wrpkru(-1u); + } + } + GUEST_SYNC(2);
if (arg) {
Expand x86's state test to load XSAVE state into a "dummy" vCPU prior to KVM_SET_CPUID2, and again with an empty guest CPUID model. Except for off-by-default features, i.e. AMX, KVM's ABI for KVM_SET_XSAVE is that userspace is allowed to load xfeatures so long as they are supported by the host. This is a regression test for a combination of KVM bugs where the state saved by KVM_GET_XSAVE{2} could not be loaded via KVM_SET_XSAVE if the saved xstate_bv would load guest-unsupported xfeatures.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../testing/selftests/kvm/x86_64/state_test.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86_64/state_test.c b/tools/testing/selftests/kvm/x86_64/state_test.c index df3e93df4343..115b2cdf9279 100644 --- a/tools/testing/selftests/kvm/x86_64/state_test.c +++ b/tools/testing/selftests/kvm/x86_64/state_test.c @@ -231,9 +231,9 @@ static void __attribute__((__flatten__)) guest_code(void *arg) int main(int argc, char *argv[]) { vm_vaddr_t nested_gva = 0; - + struct kvm_cpuid2 empty_cpuid = {}; struct kvm_regs regs1, regs2; - struct kvm_vcpu *vcpu; + struct kvm_vcpu *vcpu, *vcpuN; struct kvm_vm *vm; struct kvm_x86_state *state; struct ucall uc; @@ -286,6 +286,21 @@ int main(int argc, char *argv[]) /* Restore state in a new VM. */ vcpu = vm_recreate_with_one_vcpu(vm); vcpu_load_state(vcpu, state); + + /* + * Restore XSAVE state in a dummy vCPU, first without doing + * KVM_SET_CPUID2, and then with an empty guest CPUID. Except + * for off-by-default xfeatures, e.g. AMX, KVM is supposed to + * allow KVM_SET_XSAVE regardless of guest CPUID. Manually + * load only XSAVE state, MSRs in particular have a much more + * convoluted ABI. + */ + vcpuN = __vm_vcpu_add(vm, vcpu->id + 1); + vcpu_xsave_set(vcpuN, state->xsave); + + vcpu_init_cpuid(vcpuN, &empty_cpuid); + vcpu_xsave_set(vcpuN, state->xsave); + kvm_x86_state_cleanup(state);
memset(&regs2, 0, sizeof(regs2));
Extend x86's state test to forcefully load *all* host-supported xfeatures by modifying xstate_bv in the saved state. Stuffing xstate_bv ensures that the selftest is verifying KVM's full ABI regardless of whether or not the guest code is successful in getting various xfeatures out of their INIT state, e.g. see the disaster that is/was MPX.
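The stuff-then-restore pattern can be sketched outside the selftest framework as follows. The helper names are hypothetical and the buffer is a plain byte array rather than a struct kvm_xsave. The only assumption carried over is that xstate_bv sits at byte offset 512 of the region, as it does in the test.

```c
#include <stdint.h>
#include <string.h>

#define XSTATE_BV_OFFSET	512

/* Read xstate_bv from the XSAVE header that follows the legacy region. */
static uint64_t xstate_bv_read(const uint8_t *region)
{
	uint64_t bv;

	memcpy(&bv, region + XSTATE_BV_OFFSET, sizeof(bv));
	return bv;
}

static void xstate_bv_write(uint8_t *region, uint64_t bv)
{
	memcpy(region + XSTATE_BV_OFFSET, &bv, sizeof(bv));
}

/*
 * Force every supported xfeature "on" in the header and return the previous
 * value so the caller can put the original snapshot back afterwards.
 */
static uint64_t xstate_bv_force(uint8_t *region, uint64_t supported_xcr0)
{
	uint64_t saved = xstate_bv_read(region);

	xstate_bv_write(region, supported_xcr0);
	return saved;
}
```

A caller would force the mask, attempt the load, then write the saved value back before the buffer is reused. memcpy() is used here only to keep the sketch independent of how the buffer is declared.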
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../selftests/kvm/include/x86_64/processor.h    |  9 +++++++++
 tools/testing/selftests/kvm/x86_64/state_test.c | 14 ++++++++++++++
 2 files changed, 23 insertions(+)
diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h index 6f66861175ad..25bc61dac5fb 100644 --- a/tools/testing/selftests/kvm/include/x86_64/processor.h +++ b/tools/testing/selftests/kvm/include/x86_64/processor.h @@ -922,6 +922,15 @@ static inline bool kvm_pmu_has(struct kvm_x86_pmu_feature feature) !kvm_cpu_has(feature.anti_feature); }
+static __always_inline uint64_t kvm_cpu_supported_xcr0(void) +{ + if (!kvm_cpu_has_p(X86_PROPERTY_SUPPORTED_XCR0_LO)) + return 0; + + return kvm_cpu_property(X86_PROPERTY_SUPPORTED_XCR0_LO) | + ((uint64_t)kvm_cpu_property(X86_PROPERTY_SUPPORTED_XCR0_HI) << 32); +} + static inline size_t kvm_cpuid2_size(int nr_entries) { return sizeof(struct kvm_cpuid2) + diff --git a/tools/testing/selftests/kvm/x86_64/state_test.c b/tools/testing/selftests/kvm/x86_64/state_test.c index 115b2cdf9279..88b58aab7207 100644 --- a/tools/testing/selftests/kvm/x86_64/state_test.c +++ b/tools/testing/selftests/kvm/x86_64/state_test.c @@ -230,6 +230,7 @@ static void __attribute__((__flatten__)) guest_code(void *arg)
int main(int argc, char *argv[]) { + uint64_t *xstate_bv, saved_xstate_bv; vm_vaddr_t nested_gva = 0; struct kvm_cpuid2 empty_cpuid = {}; struct kvm_regs regs1, regs2; @@ -294,12 +295,25 @@ int main(int argc, char *argv[]) * allow KVM_SET_XSAVE regardless of guest CPUID. Manually * load only XSAVE state, MSRs in particular have a much more * convoluted ABI. + * + * Load two versions of XSAVE state: one with the actual guest + * XSAVE state, and one with all supported features forced "on" + * in xstate_bv, e.g. to ensure that KVM allows loading all + * supported features, even if something goes awry in saving + * the original snapshot. */ + xstate_bv = (void *)&((uint8_t *)state->xsave->region)[512]; + saved_xstate_bv = *xstate_bv; + vcpuN = __vm_vcpu_add(vm, vcpu->id + 1); vcpu_xsave_set(vcpuN, state->xsave); + *xstate_bv = kvm_cpu_supported_xcr0(); + vcpu_xsave_set(vcpuN, state->xsave);
vcpu_init_cpuid(vcpuN, &empty_cpuid); vcpu_xsave_set(vcpuN, state->xsave); + *xstate_bv = saved_xstate_bv; + vcpu_xsave_set(vcpuN, state->xsave);
kvm_x86_state_cleanup(state);
On Wed, Sep 27, 2023 at 05:19:51PM -0700, Sean Christopherson wrote:
Rework how KVM limits guest-unsupported xfeatures to effectively hide only when saving state for userspace (KVM_GET_XSAVE), i.e. to let userspace load all host-supported xfeatures (via KVM_SET_XSAVE) irrespective of what features have been exposed to the guest.
Ok, IIUC your changes provide:
- KVM_GET_XSAVE will return only guest-supported xfeatures
- KVM_SET_XSAVE will allow user to set any xfeatures supported by host

Is that correct?
The effect on KVM_SET_XSAVE was knowingly done by commit ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0"):
As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2.
Preventing userspace from doing stupid things is usually a good idea, but in this case restricting KVM_SET_XSAVE actually exacerbated the problem that commit ad856280ddea was fixing. As reported by Tyler, rejecting KVM_SET_XSAVE for guest-unsupported xfeatures breaks live migration from a kernel without commit ad856280ddea, to a kernel with ad856280ddea. I.e. from a kernel that saves guest-unsupported xfeatures to a kernel that doesn't allow loading guest-unsupported xfeatures.
So this patch is supposed to fix migration of VM from a host with pre-ad856280ddea (OLD) kernel to a host with ad856280ddea + your set(NEW). Right?
Let's get the scenario here, where all machines are the same:
1 - VM created on OLD kernel with a host-supported xfeature F, which is not guest supported.
2 - VM is migrated to a NEW kernel/host, and xfeature F is loaded via KVM_SET_XSAVE.
3 - VM will be migrated to another host, qemu requests KVM_GET_XSAVE, which returns only guest-supported xfeatures, and this is passed to the next host.
4 - VM will be started on the 3rd host with guest-supported xfeatures, meaning xfeature F is filtered out, which is not good, because the VM will have fewer features compared to boot.
In fact, I notice something would possibly happen between 2 and 3, since qemu will run KVM_GET_XSAVE at kvm_cpu_synchronize_state() and KVM_SET_XSAVE at kvm_cpu_exec(), which happens quite often (when vcpu stops / resumes for some reason).
Also, even if I got something wrong, and for some reason qemu will be able to store the original VM xfeatures between migrations, we have the original issue ad856280ddea was dealing with: newer machines -> older machines migration:
1 - User gets a VM from an OLD kernel, with a newer host (more xfeatures).
2 - User migrates VM to NEW kernel, and we suppose qemu stores original xfeatures (it works). Migration can occur to newer or same gen hosts.
3 - At some point, if migration is attempted to an older host (fewer xfeatures), qemu will abort the VM.
To make matters even worse, QEMU doesn't terminate if KVM_SET_XSAVE fails, and so the end result is that the live migration results in (possibly silent) guest data corruption instead of a failed migration.
And this is something that really needs to be fixed on the QEMU side.
Patch 1 refactors the FPU code to let KVM pass in a mask of which xfeatures to save, patch 2 fixes KVM by passing in guest_supported_xcr0 instead of modifying user_xfeatures directly.
With my current understanding of this patchset, I would not recommend merging it, as it would introduce a lot of undesired behaviors.
Please let me know if I got something wrong, so I can review it again.
Thanks! Leo
On Wed, Oct 04, 2023 at 04:11:52AM -0300, Leonardo Bras wrote:
So this patch is supposed to fix migration of VM from a host with pre-ad856280ddea (OLD) kernel to a host with ad856280ddea + your set(NEW). Right?
Let's get the scenario here, where all machines are the same:
1 - VM created on OLD kernel with a host-supported xfeature F, which is not guest supported.
2 - VM is migrated to a NEW kernel/host, and xfeature F is loaded via KVM_SET_XSAVE.
3 - VM will be migrated to another host, qemu requests KVM_GET_XSAVE, which returns only guest-supported xfeatures, and this is passed to the next host.
4 - VM will be started on the 3rd host with guest-supported xfeatures, meaning xfeature F is filtered out, which is not good, because the VM will have fewer features compared to boot.
This is what I was (trying) to convey earlier...
See Sean's response here: https://lore.kernel.org/all/ZRMHY83W%2FVPjYyhy@google.com/
I'll copy the pertinent part of his very detailed response inline:
KVM *must* "trim" features when servicing KVM_GET_XSAVE{2}, because that's been KVM's ABI for a very long time, and userspace absolutely relies on that functionality to ensure that a VM can be migrated within a pool of heterogeneous systems so long as the features that are *exposed* to the guest are supported on all platforms.
My 2 cents: as an outsider with less familiarity with the KVM code, it is hard to understand the contract here with the guest/userspace. It seems there is a fundamental question of whether or not "superfluous" features, those being host-supported features which extend that which the guest is actually capable of, can be removed between the time that the guest boots and when it terminates, through however many live-migrations that may be.
Ultimately, this problem is not really fixable if said features cannot be removed.
Is there an RFC or document which captures expectations of this form?
On Wed, Oct 04, 2023, Tyler Stachecki wrote:
On Wed, Oct 04, 2023 at 04:11:52AM -0300, Leonardo Bras wrote:
So this patch is supposed to fix migration of VM from a host with pre-ad856280ddea (OLD) kernel to a host with ad856280ddea + your set(NEW). Right?
Let's get the scenario here, where all machines are the same:
1 - VM created on OLD kernel with a host-supported xfeature F, which is not guest supported.
2 - VM is migrated to a NEW kernel/host, and KVM_SET_XSAVE xfeature F.
3 - VM will be migrated to another host, qemu requests KVM_GET_XSAVE, which returns only guest-supported xfeatures, and this is passed to next host
4 - VM will be started on 3rd host with guest-supported xfeatures, meaning xfeature F is filtered-out, which is not good, because the VM will have less features compared to boot.
No, the VM will not have less features, because KVM_SET_XSAVE loads *data*, not features. On a host that supports xfeature F, the VM is running with garbage data no matter what, which is perfectly fine because from the guest's perspective, that xfeature and its associated data do not exist.
And in all likelihood, unless QEMU is doing something bizarre, the data that is loaded via KVM_SET_XSAVE will be the exact same data that is already present in the guest FPU state, as both will be in the init state.
On top of that, the data that is loaded via KVM_SET_XSAVE may not actually be loaded into hardware, i.e. may never be exposed to the guest. E.g. IIRC, the original issue was with PKRU. If PKU is supported by the host, but not exposed to the guest, KVM will run the guest with the *host's* PKRU value.
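The PKRU case can be pictured with a deliberately tiny model. This is a sketch of the behaviour described above, not KVM code: the function name and parameters are hypothetical, and "exposed to the guest" stands in for whatever conditions KVM actually checks before switching PKRU.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: which PKRU value the CPU holds while the guest runs.
 * If PKU is not exposed to the guest, the value loaded from the snapshot is
 * never put into hardware and the host's value stays in place.
 */
static uint32_t pkru_while_guest_runs(bool pku_exposed_to_guest,
				      uint32_t host_pkru,
				      uint32_t snapshot_pkru)
{
	return pku_exposed_to_guest ? snapshot_pkru : host_pkru;
}
```

With pku_exposed_to_guest == false, the result is host_pkru regardless of what the snapshot carried, which is why loading PKRU data for such a guest neither adds nor removes a feature.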
This is what I was (trying) to convey earlier...
See Sean's response here: https://lore.kernel.org/all/ZRMHY83W%2FVPjYyhy@google.com/
I'll copy the pertinent part of his very detailed response inline:
KVM *must* "trim" features when servicing KVM_GET_XSAVE{2}, because that's been KVM's ABI for a very long time, and userspace absolutely relies on that functionality to ensure that a VM can be migrated within a pool of heterogeneous systems so long as the features that are *exposed* to the guest are supported on all platforms.
My 2 cents: as an outsider with less familiarity with the KVM code, it is hard to understand the contract here with the guest/userspace. It seems there is a fundamental question of whether or not "superfluous" features, i.e. host-supported features that go beyond what the guest is actually capable of using, can be removed between the time that the guest boots and when it terminates, across however many live migrations that may be.
KVM's ABI has no formal notion of guest boot=>shutdown or live migration. The myriad KVM_GET_* APIs allow taking a snapshot of guest state, and the KVM_SET_* APIs allow loading a snapshot of guest state. Live migration is probably the most common use of those APIs, but there are other use cases.
That matters because KVM's contract with userspace for KVM_SET_XSAVE (or any other state save/load ioctl()) doesn't have a holistic view of the guest, e.g. KVM can't know that userspace is live migrating a VM, and that userspace's attempt to load data for an unsupported xfeature is ok because the xfeature isn't exposed to the guest.
In other words, at the time of KVM_SET_XSAVE, KVM has no way of knowing that an xfeature is superfluous. Normally, that's a complete non-issue because there is no superfluous xfeature data, as KVM's contract for KVM_GET_XSAVE{2} is that only necessary data is saved in the snapshot.
Unfortunately, the original bug that led to this mess broke the contract for KVM_GET_XSAVE{2}, and I don't see a safe way to work around that bug in KVM without an opt-in from userspace.
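To make the save-side contract concrete, here is a rough bitmask model. It is only a sketch under stated assumptions: the helper names are made up for illustration, and the only thing taken from the architecture is the XCR0 bit positions. The "oversaved" helper models the buggy behaviour discussed in this thread, where the snapshot carries every host-supported xfeature.

```c
#include <stdint.h>

/* xfeature bits, following the architectural XCR0 layout. */
#define XF_FP   (1ULL << 0)
#define XF_SSE  (1ULL << 1)
#define XF_YMM  (1ULL << 2)
#define XF_PKRU (1ULL << 9)

/* Contract: the snapshot holds only xfeatures that are exposed to the guest. */
static uint64_t snapshot_trimmed(uint64_t host_xfeatures, uint64_t guest_xfeatures)
{
	return host_xfeatures & guest_xfeatures;
}

/* Broken contract: the snapshot holds everything the host supports. */
static uint64_t snapshot_oversaved(uint64_t host_xfeatures, uint64_t guest_xfeatures)
{
	(void)guest_xfeatures;
	return host_xfeatures;
}

/* xfeatures present in a snapshot that the guest was never given. */
static uint64_t superfluous(uint64_t snapshot, uint64_t guest_xfeatures)
{
	return snapshot & ~guest_xfeatures;
}
```

For a host with FP|SSE|YMM|PKRU and a guest with FP|SSE|YMM, the trimmed snapshot has no superfluous bits, while the oversaved one carries PKRU. Note that superfluous() needs the guest's feature set as an input; a load-side ioctl that only sees the snapshot cannot compute it safely, which is the point made above.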
Ultimately, this problem is not really fixable if said features cannot be removed.
It's not about removing features. The change you're asking for is to have KVM *silently* drop data. Aside from the fact that such a change would break KVM's ABI, silently ignoring data that userspace has explicitly requested be loaded for a vCPU is incredibly dangerous.
E.g. a not too far fetched scenario would be:
1. xfeature X is supported on Host A and exposed to a guest
2. Host B is upgraded to a new kernel that has a bug that causes the kernel to disable support for X, even though X is supported in hardware
3. The guest is live migrated from Host A to Host B
At step #3, what will currently happen is that KVM_SET_XSAVE will fail with -EINVAL because userspace is attempting to load data that Host B is incapable of loading.
The change you're suggesting would result in KVM dropping the data for X and letting KVM_SET_XSAVE succeed, *for an xfeature that is exposed to the guest*. I.e. for all intents and purposes, KVM would deliberately corrupt guest data.
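The difference between the two behaviours can be sketched as follows. This is an illustrative model only: load_strict() and load_lossy() are hypothetical names, the masks are plain integers rather than real XSAVE buffers, and the bit chosen for X is arbitrary.

```c
#include <errno.h>
#include <stdint.h>

#define XF_X (1ULL << 2) /* stand-in for "xfeature X"; bit chosen arbitrarily */

/* Current behaviour as described above: refuse data the host cannot load. */
static int load_strict(uint64_t host_xfeatures, uint64_t snapshot,
		       uint64_t *loaded)
{
	if (snapshot & ~host_xfeatures)
		return -EINVAL;

	*loaded = snapshot;
	return 0;
}

/* Suggested alternative: silently drop whatever the host cannot load. */
static int load_lossy(uint64_t host_xfeatures, uint64_t snapshot,
		      uint64_t *loaded)
{
	*loaded = snapshot & host_xfeatures;
	return 0;
}
```

With Host B missing X, load_strict() fails and userspace gets a chance to abort the migration. load_lossy() reports success even though the guest's X state is gone, and nothing in its inputs tells it whether X was exposed to the guest.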
Is there an RFC or document which captures expectations of this form?
Not AFAIK. :-/
On Wed, Oct 04, 2023 at 07:51:17AM -0700, Sean Christopherson wrote:
KVM's ABI has no formal notion of guest boot=>shutdown or live migration. The myriad KVM_GET_* APIs allow taking a snapshot of guest state, and the KVM_SET_* APIs allow loading a snapshot of guest state. Live migration is probably the most common use of those APIs, but there are other use cases.
I think the lightbulb just clicked, it is really this:
No, the VM will not have less features, because KVM_SET_XSAVE loads *data*, not features [...]
I think I'm conflating the data vs. features aspect here and will have to revisit my understanding of the code...
Ultimately, this problem is not really fixable if said features cannot be removed.
It's not about removing features. The change you're asking for is to have KVM *silently* drop data. Aside from the fact that such a change would break KVM's ABI, silently ignoring data that userspace has explicitly requested be loaded for a vCPU is incredibly dangerous.
Sorry if it came off that way - I fully understand and am resigned to the "you break it, you keep both halves" nature of what I had initially proposed and that it is not a generally tractable solution.
That being said, I genuinely appreciate your jump to action on this problem!
Thanks, Tyler
On Wed, Oct 04, 2023, Tyler Stachecki wrote:
On Wed, Oct 04, 2023 at 07:51:17AM -0700, Sean Christopherson wrote:
It's not about removing features. The change you're asking for is to have KVM *silently* drop data. Aside from the fact that such a change would break KVM's ABI, silently ignoring data that userspace has explicitly requested be loaded for a vCPU is incredibly dangerous.
Sorry if it came off that way
No need to apologise, you got bit by a nasty kernel bug and are trying to find a solution. There's nothing wrong with that.
I fully understand and am resigned to the "you break it, you keep both halves" nature of what I had initially proposed and that it is not a generally tractable solution.
Yeah, the crux of the matter is that we have no control or even knowledge of who all is using KVM, with what userspace VMM, on what hardware, etc. E.g. if this bug were affecting our fleet and for some reason we couldn't address the problem in userspace, carrying a hack in KVM in our internal kernel would probably be a viable option because we can do a proper risk assessment. E.g. we know and control exactly what userspace we're running, the underlying hardware in affected pools, what features are exposed to the guest, etc. And we could revert the hack once all affected VMs had been sanitized.
On Wed, 27 Sep 2023 17:19:51 -0700, Sean Christopherson wrote:
Rework how KVM limits guest-unsupported xfeatures to effectively hide only when saving state for userspace (KVM_GET_XSAVE), i.e. to let userspace load all host-supported xfeatures (via KVM_SET_XSAVE) irrespective of what features have been exposed to the guest.
The effect on KVM_SET_XSAVE was knowingly done by commit ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0"):
[...]
Applied to kvm-x86 fpu, even though there is still ongoing discussion. I want to get this exposure in -next sooner rather than later. I'll keep this in its own branch so it'll be easier to rewrite/discard if necessary.
[1/5] x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer
      https://github.com/kvm-x86/linux/commit/2d287ec65e79
[2/5] KVM: x86: Constrain guest-supported xfeatures only at KVM_GET_XSAVE{2}
      https://github.com/kvm-x86/linux/commit/27526efb5cff
[3/5] KVM: selftests: Touch relevant XSAVE state in guest for state test
      https://github.com/kvm-x86/linux/commit/ff0654c71fb6
[4/5] KVM: selftests: Load XSAVE state into untouched vCPU during state test
      https://github.com/kvm-x86/linux/commit/d7b8762ec4a3
[5/5] KVM: selftests: Force load all supported XSAVE state in state test
      https://github.com/kvm-x86/linux/commit/afb2c7e27a7f