Live-migrations under qemu result in guest corruption when the following three conditions are all met:
* The source host CPU has capabilities that extend beyond those of the guest CPU's fpstate->user_xfeatures
* The source kernel emits guest_fpu->user_xfeatures with respect to the host CPU (i.e., it *does not* have the "Fixes:" commit)
* The destination kernel enforces that the xfeatures in the buffer given to the KVM_SET_XSAVE ioctl are compatible with the guest CPU (i.e., it *does* have the "Fixes:" commit)
When these conditions are met, the semantic changes to fpstate->user_xfeatures trigger a subtle bug in qemu that results in qemu failing to put the XSAVE architectural state into KVM.
qemu then both ceases to put the remaining (non-XSAVE) x86 architectural state into KVM and makes the fateful mistake of resuming the guest anyways. This usually results in immediate guest corruption, silent or not.
Due to the grave nature of this qemu bug, attempt to retain the behavior of old kernels by clamping the xfeatures specified in the buffer given to the KVM_SET_XSAVE ioctl so that it aligns with the guest's fpstate->user_xfeatures, instead of returning an error.
Fixes: ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0")
Cc: stable@vger.kernel.org
Cc: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Tyler Stachecki <tstachecki@bloomberg.net>
---
 arch/x86/kvm/x86.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)
```
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..baad160b592f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5407,11 +5407,21 @@ static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu,
 static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
 					struct kvm_xsave *guest_xsave)
 {
+	union fpregs_state *ustate = (union fpregs_state *)guest_xsave->region;
+	u64 user_xfeatures = vcpu->arch.guest_fpu.fpstate->user_xfeatures;
+
 	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
 		return 0;

-	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu,
-					      guest_xsave->region,
+	/*
+	 * In previous kernels, kvm_arch_vcpu_create() set the guest's fpstate
+	 * based on what the host CPU supported. Recent kernels changed this
+	 * and only accept ustate containing xfeatures that the guest CPU is
+	 * capable of supporting.
+	 */
+	ustate->xsave.header.xfeatures &= user_xfeatures;
+
+	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, ustate,
 					      kvm_caps.supported_xcr0,
 					      &vcpu->arch.pkru);
 }
```
On Wed, Sep 13, 2023 at 09:00:03PM -0400, Tyler Stachecki wrote:
qemu then both ceases to put the remaining (non-XSAVE) x86 architectural state into KVM and makes the fateful mistake of resuming the guest anyways. This usually results in immediate guest corruption, silent or not.
I just want to highlight that although this is probably more of a bug with respect to how qemu is handling things, the original patches from Leo are starting to appear in many distro stable kernels and are really putting a spanner in the works for maintaining VMs that are long-lived in nature.
At present, if you take the fix for the PKRU migration issue (or simply need a more recent kernel), you are left in a situation where live-migrating VMs from an unpatched kernel to one patched for the PKRU issue potentially crashes or corrupts scads of VMs.
There is no fix for qemu that I am aware of yet. Although I am willing to look into one if that is more palatable, I filed this patch on the premise of "don't break userspace"...
On Wed, Sep 13, 2023 at 09:00:03PM -0400, Tyler Stachecki wrote:
Live-migrations under qemu result in guest corruption when the following three conditions are all met:
The source host CPU has capabilities that extend beyond those of the guest CPU's fpstate->user_xfeatures
The source kernel emits guest_fpu->user_xfeatures with respect to the host CPU (i.e., it *does not* have the "Fixes:" commit)
The destination kernel enforces that the xfeatures in the buffer given to the KVM_SET_XSAVE ioctl are compatible with the guest CPU (i.e., it *does* have the "Fixes:" commit)
When these conditions are met, the semantic changes to fpstate->user_xfeatures trigger a subtle bug in qemu that results in qemu failing to put the XSAVE architectural state into KVM.
qemu then both ceases to put the remaining (non-XSAVE) x86 architectural state into KVM and makes the fateful mistake of resuming the guest anyways. This usually results in immediate guest corruption, silent or not.
Due to the grave nature of this qemu bug, attempt to retain the behavior of old kernels by clamping the xfeatures specified in the buffer given to the KVM_SET_XSAVE ioctl so that it aligns with the guest's fpstate->user_xfeatures, instead of returning an error.
So, IIUC, the xfeatures from the source guest will be different than the xfeatures of the target (destination) guest. Is that correct?
It does not seem right to me. I mean, from the guest viewpoint, some features will simply vanish during execution, and this could lead to major issues in the guest.
The idea here is that if the target (destination) host can't provide those features for the guest, then migration should fail.
I mean, qemu should fail the migration, and that's correct behavior. Is it what is happening?
Regards, Leo
Fixes: ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0")
Cc: stable@vger.kernel.org
Cc: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Tyler Stachecki <tstachecki@bloomberg.net>
---
 arch/x86/kvm/x86.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)
```
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..baad160b592f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5407,11 +5407,21 @@ static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu,
 static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
 					struct kvm_xsave *guest_xsave)
 {
+	union fpregs_state *ustate = (union fpregs_state *)guest_xsave->region;
+	u64 user_xfeatures = vcpu->arch.guest_fpu.fpstate->user_xfeatures;
+
 	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
 		return 0;

-	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu,
-					      guest_xsave->region,
+	/*
+	 * In previous kernels, kvm_arch_vcpu_create() set the guest's fpstate
+	 * based on what the host CPU supported. Recent kernels changed this
+	 * and only accept ustate containing xfeatures that the guest CPU is
+	 * capable of supporting.
+	 */
+	ustate->xsave.header.xfeatures &= user_xfeatures;
+
+	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, ustate,
 					      kvm_caps.supported_xcr0,
 					      &vcpu->arch.pkru);
 }
```
2.30.2
On Thu, Sep 14, 2023 at 04:15:54AM -0300, Leonardo Bras wrote:
So, IIUC, the xfeatures from the source guest will be different than the xfeatures of the target (destination) guest. Is that correct?
Correct.
It does not seem right to me. I mean, from the guest viewpoint, some features will simply vanish during execution, and this could lead to major issues in the guest.
My assumption is that the guest CPU model should confine access to registers that make sense for that (guest) CPU.
e.g., take a host CPU capable of AVX-512 running a guest CPU model that only has AVX-256. If the guest suddenly loses the top 256 bits of %zmm*, it should not really be perceivable as %ymm architecturally remains unchanged.
Though maybe I'm being too rash here? Is there a case where this assumption breaks down?
The idea here is that if the target (destination) host can't provide those features for the guest, then migration should fail.
I mean, qemu should fail the migration, and that's correct behavior. Is it what is happening?
Unfortunately, no, it is not... and that is the biggest concern right now.
I do see some discussion between Peter and you on this topic and see that there was an RFC to implement such behavior stemming from it, here: https://lore.kernel.org/qemu-devel/20220607230645.53950-1-peterx@redhat.com/
... though I do not believe that work ever landed in the tree. Looking at qemu's master branch now, the error from kvm_arch_put_registers is just discarded in do_kvm_cpu_synchronize_post_init...
```
static void do_kvm_cpu_synchronize_post_init(CPUState *cpu,
                                             run_on_cpu_data arg)
{
    kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);
    cpu->vcpu_dirty = false;
}
```
Best, Tyler
On 9/14/23 2:11 AM, Tyler Stachecki wrote:
On Thu, Sep 14, 2023 at 04:15:54AM -0300, Leonardo Bras wrote:
So, IIUC, the xfeatures from the source guest will be different than the xfeatures of the target (destination) guest. Is that correct?
Correct.
It does not seem right to me. I mean, from the guest viewpoint, some features will simply vanish during execution, and this could lead to major issues in the guest.
I fully agree with this.
I think the original commit ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0") is for the source server, not the destination server.
That is:
1. Without the commit (src and dst), something bad may happen.
2. With the commit on src, the issue is fixed.
3. With the commit only on dst, it is expected that the issue is not fixed.
Therefore, from the administrator's perspective, the bugfix should always be applied on the source server, in order for the migration to succeed.
BTW, we may not be able to use commit ad856280ddea in the Fixes tag.
My assumption is that the guest CPU model should confine access to registers that make sense for that (guest) CPU.
e.g., take a host CPU capable of AVX-512 running a guest CPU model that only has AVX-256. If the guest suddenly loses the top 256 bits of %zmm*, it should not really be perceivable as %ymm architecturally remains unchanged.
Though maybe I'm being too rash here? Is there a case where this assumption breaks down?
The idea here is that if the target (destination) host can't provide those features for the guest, then migration should fail.
I mean, qemu should fail the migration, and that's correct behavior. Is it what is happening?
Unfortunately, no, it is not... and that is biggest concern right now.
I do see some discussion between Peter and you on this topic and see that there was an RFC to implement such behavior stemming from it, here: https://lore.kernel.org/qemu-devel/20220607230645.53950-1-peterx@redhat.com/
I agree that bug is at QEMU side, not KVM side.
It is better to improve at QEMU side.
```
4508 int kvm_arch_put_registers(CPUState *cpu, int level)
4509 {
4510     X86CPU *x86_cpu = X86_CPU(cpu);
4511     int ret;
... ...
4546     ret = kvm_put_xsave(x86_cpu);
4547     if (ret < 0) {
4548         return ret;
4549     }
... ...
```
--> the rest of kvm_arch_put_registers() won't execute !!!
Thank you very much!
Dongli Zhang
... though I do not believe that work ever landed in the tree. Looking at qemu's master branch now, the error from kvm_arch_put_registers is just discarded in do_kvm_cpu_synchronize_post_init...
```
static void do_kvm_cpu_synchronize_post_init(CPUState *cpu,
                                             run_on_cpu_data arg)
{
    kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);
    cpu->vcpu_dirty = false;
}
```
Best, Tyler
On Thu, Sep 14, 2023 at 10:05:57AM -0700, Dongli Zhang wrote:
That is:
1. Without the commit (src and dst), something bad may happen.
2. With the commit on src, the issue is fixed.
3. With the commit only on dst, it is expected that the issue is not fixed.
Therefore, from the administrator's perspective, the bugfix should always be applied on the source server, in order for the migration to succeed.
I fully agree. Though, I think this boils down to: The commit must be on the source or something bad may happen.
It then follows that you cannot live-migrate guests off the source to patch it without potentially corrupting the guests currently running on that source...
Regards, Tyler
On Thu, Sep 14, 2023 at 08:58:42PM -0400, Tyler Stachecki wrote:
On Thu, Sep 14, 2023 at 10:05:57AM -0700, Dongli Zhang wrote:
That is:
1. Without the commit (src and dst), something bad may happen.
2. With the commit on src, the issue is fixed.
3. With the commit only on dst, it is expected that the issue is not fixed.
Therefore, from the administrator's perspective, the bugfix should always be applied on the source server, in order for the migration to succeed.
I fully agree. Though, I think this boils down to: The commit must be on the source or something bad may happen.
It then follows that you cannot live-migrate guests off the source to patch it without potentially corrupting the guests currently running on that source...
Well, the bug was a real bad issue, and even the solution does not solve all problems.
As we discussed, there is no way of safely removing any feature from the guest without potential issues. One potential solution would be having hosts that implement the missing guest features needed for the VMs, but this may be far from easy depending on the missing feature.
Other than that, all I can think of is removing the features from guest:
As you commented, there may be some features that would not be a problem to be removed, and also there may be features which are not used by the workload, and could be removed. But this would depend on the feature, and the workload, being a custom solution for every case.
For this (removing guest features), from the kernel side, I would suggest using SystemTap (and eBPF, IIRC). The procedure should be something like:
- Try to migrate VM from host with older kernel: fail
- Look at qemu error, which features are missing?
- Are those features safely removable from guest?
- If so, get a SystemTap / eBPF script masking out the undesired bits.
- Try the migration again, it should succeed.
IIRC, this could also be done on the qemu side, with a custom qemu:
- Try to migrate VM from host with older kernel: fail
- Look at qemu error, which features are missing?
- Are those features safely removable from guest?
- If so, get a custom qemu which masks out the undesired flags before the VM starts
- Live migrate (can be inside the source host) to the custom qemu
- Live migrate from custom qemu to target host.
- The custom qemu could be on an auxiliary host, and used only for this
Yes, it's hard, takes time, and may not solve every case, but it gets a higher chance of the VM surviving in the long run.
But keep in mind this is a hack. Taking features from a live guest is not supported in any way, and has a high chance of crashing the VM.
Best regards, Leo
Regards, Tyler
On Fri, Sep 15, 2023 at 04:41:20AM -0300, Leonardo Bras wrote:
Other than that, all I can think of is removing the features from guest:
As you commented, there may be some features that would not be a problem to be removed, and also there may be features which are not used by the workload, and could be removed. But this would depend on the feature, and the workload, being a custom solution for every case.
Yes, the "fixup back" should be refined to pointed and verified cases.
For this (removing guest features), from kernel side, I would suggest using SystemTap (and eBPF, IIRC). The procedures should be something like:
- Try to migrate VM from host with older kernel: fail
- Look at qemu error, which features are missing?
- Are those features safely removable from guest ?
- If so, get a SystemTap / eBPF script masking out the undesired bits.
- Try the migration again, it should succeed.
IIRC, this could also be done in qemu side, with a custom qemu:
- Try to migrate VM from host with older kernel: fail
- Look at qemu error, which features are missing?
- Are those features safely removable from guest ?
- If so, get a custom qemu which masks out the undesired flags before the VM starts
- Live migrate (can be inside the source host) to the custom qemu
- Live migrate from custom qemu to target host.
- The custom qemu could be on an auxiliary host, and used only for this
Yes, it's hard, takes time, and may not solve every case, but it gets a higher chance of the VM surviving in the long run.
Thank you for taking the time to thoroughly consider the issue and suggest some ways out - I really appreciate it.
But keep in mind this is a hack. Taking features from a live guest is not supported in any way, and has a high chance of crashing the VM.
OK - if there's no interest in the below, I will not push for including this patch in the kernel tree any longer. I do think the specific case below is what a vast majority of KVM users will struggle with in the near future, though:
I have a test environment with Broadwell-based guests (which have only AVX-256) running under Skylake (PKRU, AVX512, ...) hypervisors.
I added some pr_debug statements to a guest kernel running under a hypervisor, with said hypervisor containing neither your nor my patches, and printed the guest's view of `fpu_kernel_cfg.max_features` at boot. It was 0x7, or: XFEATURE_MASK_FP, XFEATURE_MASK_SSE, XFEATURE_MASK_YMM
Thus, I'm pretty sure that all that's happening here is that the guest's FP context is having PKRU/ZMM state saved and restored needlessly by the hypervisor. Stripping it on a live-migration does not seem to have any ill effects in all the testing I have done.
Cheers, Tyler
On Fri, Sep 15, 2023, Tyler Stachecki wrote:
On Fri, Sep 15, 2023 at 04:41:20AM -0300, Leonardo Bras wrote:
Other than that, all I can think of is removing the features from guest:
As you commented, there may be some features that would not be a problem to be removed, and also there may be features which are not used by the workload, and could be removed. But this would depend on the feature, and the workload, being a custom solution for every case.
Yes, the "fixup back" should be refined to pointed and verified cases.
For this (removing guest features), from kernel side, I would suggest using SystemTap (and eBPF, IIRC). The procedures should be something like:
- Try to migrate VM from host with older kernel: fail
- Look at qemu error, which features are missing?
- Are those features safely removable from guest ?
- If so, get a SystemTap / eBPF script masking out the undesired bits.
- Try the migration again, it should succeed.
IIRC, this could also be done in qemu side, with a custom qemu:
- Try to migrate VM from host with older kernel: fail
- Look at qemu error, which features are missing?
- Are those features safely removable from guest ?
- If so, get a custom qemu which masks out the undesired flags before the VM starts
- Live migrate (can be inside the source host) to the custom qemu
- Live migrate from custom qemu to target host.
- The custom qemu could be on an auxiliary host, and used only for this
Yes, it's hard, takes time, and may not solve every case, but it gets a higher chance of the VM surviving in the long run.
Thank you for taking the time to thoroughly consider the issue and suggest some ways out - I really appreciate it.
But keep in mind this is a hack. Taking features from a live guest is not supported in any way, and has a high chance of crashing the VM.
OK - if there's no interest in the below, I will not push for including this patch in the kernel tree any longer. I do think the specific case below is what a vast majority of KVM users will struggle with in the near future, though:
I have a test environment with Broadwell-based guests (which have only AVX-256) running under Skylake (PKRU, AVX512, ...) hypervisors.
I definitely don't want to take the proposed patch. As Leo pointed out, silently dropping features that userspace explicitly requests is a recipe for disaster.
However, I do agree with Tyler that this is an egregious kernel/KVM bug, as essentially requiring KVM_SET_XSAVE to be a subset of guest supported XCR0, i.e. guest CPUID, is a clearcut breakage of userspace. KVM_SET_XSAVE worked on kernel X and failed on kernel X+1; there's really no wiggle room there.
Luckily, I'm pretty sure there's no need to take features away from the guest in order to fix the bug Tyler is experiencing. Prior to commit ad856280ddea, KVM's ABI was that KVM_SET_XSAVE just needs a subset of the *host* features, i.e. this chunk from the changelog simply needs to be undone:
As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2.
That can be done by applying guest_supported_xcr0 to *only* the KVM_GET_XSAVE{2} path. It's not ideal since it means that KVM_GET_XSAVE{2} won't be consistent with the guest model if userspace does KVM_GET_XSAVE{2} before KVM_SET_CPUID, but practically speaking I don't think there's a real world userspace VMM that does that.
Compile tested only, and it needs a changelog, but I think this will do the trick:
---
 arch/x86/include/asm/fpu/api.h |  3 ++-
 arch/x86/kernel/fpu/core.c     |  5 +++--
 arch/x86/kernel/fpu/xstate.c   |  7 +++++--
 arch/x86/kernel/fpu/xstate.h   |  3 ++-
 arch/x86/kvm/cpuid.c           |  8 --------
 arch/x86/kvm/x86.c             | 37 ++++++++++++++++++++++------------
 6 files changed, 36 insertions(+), 27 deletions(-)

```
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 31089b851c4f..a2be3aefff9f 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -157,7 +157,8 @@ static inline void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd) {
 static inline void fpu_sync_guest_vmexit_xfd_state(void) { }
 #endif

-extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
+extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf,
+					   unsigned int size, u64 xfeatures, u32 pkru);
 extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf,
 					  u64 xcr0, u32 *vpkru);

 static inline void fpstate_set_confidential(struct fpu_guest *gfpu)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index a86d37052a64..a21a4d0ecc34 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -369,14 +369,15 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
 EXPORT_SYMBOL_GPL(fpu_swap_kvm_fpstate);

 void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf,
-				    unsigned int size, u32 pkru)
+				    unsigned int size, u64 xfeatures, u32 pkru)
 {
 	struct fpstate *kstate = gfpu->fpstate;
 	union fpregs_state *ustate = buf;
 	struct membuf mb = { .p = buf, .left = size };

 	if (cpu_feature_enabled(X86_FEATURE_XSAVE)) {
-		__copy_xstate_to_uabi_buf(mb, kstate, pkru, XSTATE_COPY_XSAVE);
+		__copy_xstate_to_uabi_buf(mb, kstate, xfeatures, pkru,
+					  XSTATE_COPY_XSAVE);
 	} else {
 		memcpy(&ustate->fxsave, &kstate->regs.fxsave,
 		       sizeof(ustate->fxsave));
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index cadf68737e6b..7d31033d176e 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1049,6 +1049,7 @@ static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
  * __copy_xstate_to_uabi_buf - Copy kernel saved xstate to a UABI buffer
  * @to:		membuf descriptor
  * @fpstate:	The fpstate buffer from which to copy
+ * @xfeatures:	Constraint which of user xfeatures to save (XSAVE only)
  * @pkru_val:	The PKRU value to store in the PKRU component
  * @copy_mode:	The requested copy mode
  *
@@ -1059,7 +1060,8 @@ static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
  * It supports partial copy but @to.pos always starts from zero.
  */
 void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
-			       u32 pkru_val, enum xstate_copy_mode copy_mode)
+			       u64 xfeatures, u32 pkru_val,
+			       enum xstate_copy_mode copy_mode)
 {
 	const unsigned int off_mxcsr = offsetof(struct fxregs_state, mxcsr);
 	struct xregs_state *xinit = &init_fpstate.regs.xsave;
@@ -1083,7 +1085,7 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
 		break;

 	case XSTATE_COPY_XSAVE:
-		header.xfeatures &= fpstate->user_xfeatures;
+		header.xfeatures &= fpstate->user_xfeatures & xfeatures;
 		break;
 	}

@@ -1185,6 +1187,7 @@ void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
 			     enum xstate_copy_mode copy_mode)
 {
 	__copy_xstate_to_uabi_buf(to, tsk->thread.fpu.fpstate,
+				  tsk->thread.fpu.fpstate->user_xfeatures,
 				  tsk->thread.pkru, copy_mode);
 }

diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index a4ecb04d8d64..3518fb26d06b 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -43,7 +43,8 @@ enum xstate_copy_mode {

 struct membuf;
 extern void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
-				      u32 pkru_val, enum xstate_copy_mode copy_mode);
+				      u64 xfeatures, u32 pkru_val,
+				      enum xstate_copy_mode copy_mode);
 extern void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
 				    enum xstate_copy_mode mode);
 extern int copy_uabi_from_kernel_to_xstate(struct fpstate *fpstate, const void *kbuf, u32 *pkru);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0544e30b4946..773132c3bf5a 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -360,14 +360,6 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vcpu->arch.guest_supported_xcr0 =
 		cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);

-	/*
-	 * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
-	 * XSAVE/XCRO are not exposed to the guest, and even if XSAVE isn't
-	 * supported by the host.
-	 */
-	vcpu->arch.guest_fpu.fpstate->user_xfeatures = vcpu->arch.guest_supported_xcr0 |
-						       XFEATURE_MASK_FPSSE;
-
 	kvm_update_pv_runtime(vcpu);

 	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9f18b06bbda6..734e2d69329b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5382,26 +5382,37 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
 	return 0;
 }

-static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
-					 struct kvm_xsave *guest_xsave)
-{
-	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
-		return;
-
-	fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu,
-				       guest_xsave->region,
-				       sizeof(guest_xsave->region),
-				       vcpu->arch.pkru);
-}
-
 static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu,
 					  u8 *state, unsigned int size)
 {
+	/*
+	 * Only copy state for features that are enabled for the guest.  The
+	 * state itself isn't problematic, but setting bits in the header for
+	 * features that are supported in *this* host but not exposed to the
+	 * guest can result in KVM_SET_XSAVE failing when live migrating to a
+	 * compatible host, i.e. a host without the features that are NOT
+	 * exposed to the guest.
+	 *
+	 * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
+	 * XSAVE/XCRO are not exposed to the guest, and even if XSAVE isn't
+	 * supported by the host.
+	 */
+	u64 supported_xcr0 = vcpu->arch.guest_supported_xcr0 |
+			     XFEATURE_MASK_FPSSE;
+
 	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
 		return;

-	fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu,
-				       state, size, vcpu->arch.pkru);
+	fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu, state, size,
+				       supported_xcr0, vcpu->arch.pkru);
+}
+
+static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
+					 struct kvm_xsave *guest_xsave)
+{
+	return kvm_vcpu_ioctl_x86_get_xsave2(vcpu, (void *)guest_xsave->region,
+					     sizeof(guest_xsave->region));
 }

 static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
```
base-commit: 5804c19b80bf625c6a9925317f845e497434d6d3
On Mon, Sep 25, 2023 at 02:26:47PM -0700, Sean Christopherson wrote:
On Fri, Sep 15, 2023, Tyler Stachecki wrote:
On Fri, Sep 15, 2023 at 04:41:20AM -0300, Leonardo Bras wrote:
Other than that, all I can think of is removing the features from guest:
As you commented, there may be some features that would not be a problem to be removed, and also there may be features which are not used by the workload, and could be removed. But this would depend on the feature, and the workload, being a custom solution for every case.
Yes, the "fixup back" should be refined to pointed and verified cases.
For this (removing guest features), from kernel side, I would suggest using SystemTap (and eBPF, IIRC). The procedures should be something like:
- Try to migrate VM from host with older kernel: fail
- Look at qemu error, which features are missing?
- Are those features safely removable from guest ?
- If so, get a SystemTap / eBPF script masking out the undesired bits.
- Try the migration again, it should succeed.
IIRC, this could also be done in qemu side, with a custom qemu:
- Try to migrate VM from host with older kernel: fail
- Look at qemu error, which features are missing?
- Are those features safely removable from guest ?
- If so, get a custom qemu which masks out the undesired flags before the VM starts
- Live migrate (can be inside the source host) to the custom qemu
- Live migrate from custom qemu to target host.
- The custom qemu could be on an auxiliary host, and used only for this
Yes, it's hard, takes time, and may not solve every case, but it gets a higher chance of the VM surviving in the long run.
Thank you for taking the time to thoroughly consider the issue and suggest some ways out - I really appreciate it.
But keep in mind this is a hack. Taking features from a live guest is not supported in any way, and has a high chance of crashing the VM.
OK - if there's no interest in the below, I will not push for including this patch in the kernel tree any longer. I do think the specific case below is what a vast majority of KVM users will struggle with in the near future, though:
I have a test environment with Broadwell-based guests (which have only AVX-256) running under Skylake (PKRU, AVX512, ...) hypervisors.
I definitely don't want to take the proposed patch. As Leo pointed out, silently dropping features that userspace explicitly requests is a recipe for disaster.
However, I do agree with Tyler that this is an egregious kernel/KVM bug, as essentially requiring KVM_SET_XSAVE to be a subset of guest supported XCR0, i.e. guest CPUID, is a clearcut breakage of userspace. KVM_SET_XSAVE worked on kernel X and failed on kernel X+1; there's really no wiggle room there.
Luckily, I'm pretty sure there's no need to take features away from the guest in order to fix the bug Tyler is experiencing. Prior to commit ad856280ddea, KVM's ABI was that KVM_SET_XSAVE just needs a subset of the *host* features, i.e. this chunk from the changelog simply needs to be undone:
As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2.
That can be done by applying guest_supported_xcr0 to *only* the KVM_GET_XSAVE{2} path. It's not ideal since it means that KVM_GET_XSAVE{2} won't be consistent with the guest model if userspace does KVM_GET_XSAVE{2} before KVM_SET_CPUID, but practically speaking I don't think there's a real world userspace VMM that does that.
Compile tested only, and it needs a changelog, but I think this will do the trick:
 arch/x86/include/asm/fpu/api.h |  3 ++-
 arch/x86/kernel/fpu/core.c     |  5 +++--
 arch/x86/kernel/fpu/xstate.c   |  7 +++++--
 arch/x86/kernel/fpu/xstate.h   |  3 ++-
 arch/x86/kvm/cpuid.c           |  8 --------
 arch/x86/kvm/x86.c             | 37 ++++++++++++++++++++++------------
 6 files changed, 36 insertions(+), 27 deletions(-)

```
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 31089b851c4f..a2be3aefff9f 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -157,7 +157,8 @@ static inline void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd) {
 static inline void fpu_sync_guest_vmexit_xfd_state(void) { }
 #endif

-extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
+extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf,
+					   unsigned int size, u64 xfeatures, u32 pkru);
 extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf,
 					  u64 xcr0, u32 *vpkru);

 static inline void fpstate_set_confidential(struct fpu_guest *gfpu)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index a86d37052a64..a21a4d0ecc34 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -369,14 +369,15 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
 EXPORT_SYMBOL_GPL(fpu_swap_kvm_fpstate);

 void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf,
-				    unsigned int size, u32 pkru)
+				    unsigned int size, u64 xfeatures, u32 pkru)
 {
 	struct fpstate *kstate = gfpu->fpstate;
 	union fpregs_state *ustate = buf;
 	struct membuf mb = { .p = buf, .left = size };

 	if (cpu_feature_enabled(X86_FEATURE_XSAVE)) {
-		__copy_xstate_to_uabi_buf(mb, kstate, pkru, XSTATE_COPY_XSAVE);
+		__copy_xstate_to_uabi_buf(mb, kstate, xfeatures, pkru,
+					  XSTATE_COPY_XSAVE);
 	} else {
 		memcpy(&ustate->fxsave, &kstate->regs.fxsave,
 		       sizeof(ustate->fxsave));
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index cadf68737e6b..7d31033d176e 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1049,6 +1049,7 @@ static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
  * __copy_xstate_to_uabi_buf - Copy kernel saved xstate to a UABI buffer
  * @to:		membuf descriptor
  * @fpstate:	The fpstate buffer from which to copy
+ * @xfeatures:	Constraint which of user xfeatures to save (XSAVE only)
  * @pkru_val:	The PKRU value to store in the PKRU component
  * @copy_mode:	The requested copy mode
  *
@@ -1059,7 +1060,8 @@ static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
  * It supports partial copy but @to.pos always starts from zero.
  */
 void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
-			       u32 pkru_val, enum xstate_copy_mode copy_mode)
+			       u64 xfeatures, u32 pkru_val,
+			       enum xstate_copy_mode copy_mode)
 {
 	const unsigned int off_mxcsr = offsetof(struct fxregs_state, mxcsr);
 	struct xregs_state *xinit = &init_fpstate.regs.xsave;
@@ -1083,7 +1085,7 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
 		break;

 	case XSTATE_COPY_XSAVE:
-		header.xfeatures &= fpstate->user_xfeatures;
+		header.xfeatures &= fpstate->user_xfeatures & xfeatures;
 		break;
 	}
```
This changes the consideration of the xfeatures copied *into* the uabi buffer with respect to the guest xfeatures IIUC (approx guest XCR0, less FP/SSE only).
IOW: we are still trimming guest xfeatures, just at the source...?
That being said: the patch that I gave silently allows unacceptable things to be accepted at the destination, whereas this maintains the status quo and signals an error when the destination cannot wholly process the uabi buffer (i.e., it is asked to restore more state than the destination processor has present).
The downside of my approach is above -- the flip side, though, is that this approach requires a patch to be applied on the source. However, we cannot apply a patch at the source until it is evacuated of VMs -- chicken-and-egg problem...
Unless I am grossly misunderstanding things here -- forgive me... :-)
It almost feels like userspace needs a flag to say: yes, old pre-Leo's-patched kernel was broken and sent more state than you might need or want -- eat what you can by default. If an additional flag is set, be conservative and ensure that you are capable of restoring the xfeatures specified in the uabi buffer.
Cheers, Tyler
 	}
@@ -1185,6 +1187,7 @@ void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
 			     enum xstate_copy_mode copy_mode)
 {
 	__copy_xstate_to_uabi_buf(to, tsk->thread.fpu.fpstate,
+				  tsk->thread.fpu.fpstate->user_xfeatures,
 				  tsk->thread.pkru, copy_mode);
 }
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index a4ecb04d8d64..3518fb26d06b 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -43,7 +43,8 @@ enum xstate_copy_mode {

 struct membuf;
 extern void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
-				      u32 pkru_val, enum xstate_copy_mode copy_mode);
+				      u64 xfeatures, u32 pkru_val,
+				      enum xstate_copy_mode copy_mode);
 extern void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
 				    enum xstate_copy_mode mode);
 extern int copy_uabi_from_kernel_to_xstate(struct fpstate *fpstate, const void *kbuf, u32 *pkru);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0544e30b4946..773132c3bf5a 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -360,14 +360,6 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vcpu->arch.guest_supported_xcr0 =
 		cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);

-	/*
-	 * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
-	 * XSAVE/XCRO are not exposed to the guest, and even if XSAVE isn't
-	 * supported by the host.
-	 */
-	vcpu->arch.guest_fpu.fpstate->user_xfeatures = vcpu->arch.guest_supported_xcr0 |
-						       XFEATURE_MASK_FPSSE;
-
 	kvm_update_pv_runtime(vcpu);
 	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9f18b06bbda6..734e2d69329b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5382,26 +5382,37 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
 	return 0;
 }

-static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
-					 struct kvm_xsave *guest_xsave)
-{
-	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
-		return;
-
-	fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu,
-				       guest_xsave->region,
-				       sizeof(guest_xsave->region),
-				       vcpu->arch.pkru);
-}
-
 static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu,
 					  u8 *state, unsigned int size)
 {
+	/*
+	 * Only copy state for features that are enabled for the guest.  The
+	 * state itself isn't problematic, but setting bits in the header for
+	 * features that are supported in *this* host but not exposed to the
+	 * guest can result in KVM_SET_XSAVE failing when live migrating to a
+	 * compatible host, i.e. a host without the features that are NOT
+	 * exposed to the guest.
+	 *
+	 * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
+	 * XSAVE/XCRO are not exposed to the guest, and even if XSAVE isn't
+	 * supported by the host.
+	 */
+	u64 supported_xcr0 = vcpu->arch.guest_supported_xcr0 |
+			     XFEATURE_MASK_FPSSE;
+
 	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
 		return;

-	fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu,
-				       state, size, vcpu->arch.pkru);
+	fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu, state, size,
+				       supported_xcr0, vcpu->arch.pkru);
+}
+
+static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
+					 struct kvm_xsave *guest_xsave)
+{
+	return kvm_vcpu_ioctl_x86_get_xsave2(vcpu, (void *)guest_xsave->region,
+					     sizeof(guest_xsave->region));
 }

 static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
base-commit: 5804c19b80bf625c6a9925317f845e497434d6d3
On Mon, Sep 25, 2023, Tyler Stachecki wrote:
On Mon, Sep 25, 2023 at 02:26:47PM -0700, Sean Christopherson wrote:
On Fri, Sep 15, 2023, Tyler Stachecki wrote:
OK - if there's no interest in the below, I will not push for including this patch in the kernel tree any longer. I do think the specific case below is what a vast majority of KVM users will struggle with in the near future, though:
I have a test environment with Broadwell-based (have only AVX-256) guests running under Skylake (PKRU, AVX512, ...) hypervisors.
I definitely don't want to take the proposed patch. As Leo pointed out, silently dropping features that userspace explicitly requests is a recipe for disaster.
However, I do agree with Tyler that is an egregious kernel/KVM bug, as essentially requiring KVM_SET_XSAVE to be a subset of guest supported XCR0, i.e. guest CPUID, is a clearcut breakage of userspace. KVM_SET_XSAVE worked on kernel X and failed on kernel X+1, there's really no wiggle room there.
Luckily, I'm pretty sure there's no need to take features away from the guest in order to fix the bug Tyler is experiencing. Prior to commit ad856280ddea, KVM's ABI was that KVM_SET_SAVE just needs a subset of the *host* features, i.e. this chunk from the changelog simply needs to be undone:
As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2.
That can be done by applying guest_supported_xcr0 to *only* the KVM_GET_XSAVE{2} path. It's not ideal since it means that KVM_GET_XSAVE{2} won't be consistent with the guest model if userspace does KVM_GET_XSAVE{2} before KVM_SET_CPUID, but practically speaking I don't think there's a real world userspace VMM that does that.
Compile tested only, and it needs a changelog, but I think this will do the trick:
...
@@ -1059,7 +1060,8 @@ static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
  * It supports partial copy but @to.pos always starts from zero.
  */
 void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
-			       u32 pkru_val, enum xstate_copy_mode copy_mode)
+			       u64 xfeatures, u32 pkru_val,
+			       enum xstate_copy_mode copy_mode)
 {
 	const unsigned int off_mxcsr = offsetof(struct fxregs_state, mxcsr);
 	struct xregs_state *xinit = &init_fpstate.regs.xsave;
@@ -1083,7 +1085,7 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
 		break;

 	case XSTATE_COPY_XSAVE:
-		header.xfeatures &= fpstate->user_xfeatures;
+		header.xfeatures &= fpstate->user_xfeatures & xfeatures;
 		break;
This changes the consideration of the xfeatures copied *into* the uabi buffer with respect to the guest xfeatures IIUC (approx guest XCR0, less FP/SSE only).
IOW: we are still trimming guest xfeatures, just at the source...?
KVM *must* "trim" features when servicing KVM_GET_XSAVE{2}, because that's been KVM's ABI for a very long time, and userspace absolutely relies on that functionality to ensure that a VM can be migrated within a pool of heterogeneous systems so long as the features that are *exposed* to the guest are supported on all platforms.
Before the big FPU rework, KVM manually filled xregs_state and directly "trimmed" based on the guest supported XCR0 (see fill_xsave() below).
The major difference between my proposed patch and the current code is that KVM trims *only* at KVM_GET_XSAVE{2}, which in addition to being KVM's historical ABI, (see load_xsave() below), ensures that the *destination* can load state irrespective of the possibly-not-yet-defined guest CPUID model.
Masking fpstate->user_xfeatures is buggy for another reason: it's destructive if userspace calls KVM_SET_CPUID multiple times. No real world userspace actually calls KVM_SET_CPUID to "expand" features, but it's technically possible and KVM is supposed to allow it.
static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
{
	struct xregs_state *xsave = &vcpu->arch.guest_fpu->state.xsave;
	u64 xstate_bv = xsave->header.xfeatures;
	u64 valid;

	/*
	 * Copy legacy XSAVE area, to avoid complications with CPUID
	 * leaves 0 and 1 in the loop below.
	 */
	memcpy(dest, xsave, XSAVE_HDR_OFFSET);

	/* Set XSTATE_BV */
	xstate_bv &= vcpu->arch.guest_supported_xcr0 | XFEATURE_MASK_FPSSE;	<= here lies the trimming
	*(u64 *)(dest + XSAVE_HDR_OFFSET) = xstate_bv;

	/*
	 * Copy each region from the possibly compacted offset to the
	 * non-compacted offset.
	 */
	valid = xstate_bv & ~XFEATURE_MASK_FPSSE;
	while (valid) {
		u64 feature = valid & -valid;
		int index = fls64(feature) - 1;
		void *src = get_xsave_addr(xsave, feature);

		if (src) {
			u32 size, offset, ecx, edx;
			cpuid_count(XSTATE_CPUID, index,
				    &size, &offset, &ecx, &edx);
			if (feature == XFEATURE_MASK_PKRU)
				memcpy(dest + offset, &vcpu->arch.pkru,
				       sizeof(vcpu->arch.pkru));
			else
				memcpy(dest + offset, src, size);
		}

		valid -= feature;
	}
}
static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
{
	struct xregs_state *xsave = &vcpu->arch.guest_fpu->state.xsave;
	u64 xstate_bv = *(u64 *)(src + XSAVE_HDR_OFFSET);
	u64 valid;

	/*
	 * Copy legacy XSAVE area, to avoid complications with CPUID
	 * leaves 0 and 1 in the loop below.
	 */
	memcpy(xsave, src, XSAVE_HDR_OFFSET);

	/* Set XSTATE_BV and possibly XCOMP_BV. */
	xsave->header.xfeatures = xstate_bv;	<= features NOT limited to guest support
	if (boot_cpu_has(X86_FEATURE_XSAVES))
		xsave->header.xcomp_bv = host_xcr0 | XSTATE_COMPACTION_ENABLED;

	/*
	 * Copy each region from the non-compacted offset to the
	 * possibly compacted offset.
	 */
	valid = xstate_bv & ~XFEATURE_MASK_FPSSE;
	while (valid) {
		u64 feature = valid & -valid;
		int index = fls64(feature) - 1;
		void *dest = get_xsave_addr(xsave, feature);

		if (dest) {
			u32 size, offset, ecx, edx;
			cpuid_count(XSTATE_CPUID, index,
				    &size, &offset, &ecx, &edx);
			if (feature == XFEATURE_MASK_PKRU)
				memcpy(&vcpu->arch.pkru, src + offset,
				       sizeof(vcpu->arch.pkru));
			else
				memcpy(dest, src + offset, size);
		}

		valid -= feature;
	}
}
static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
					struct kvm_xsave *guest_xsave)
{
	u64 xstate_bv = *(u64 *)&guest_xsave->region[XSAVE_HDR_OFFSET / sizeof(u32)];
	u32 mxcsr = *(u32 *)&guest_xsave->region[XSAVE_MXCSR_OFFSET / sizeof(u32)];

	if (boot_cpu_has(X86_FEATURE_XSAVE)) {
		/*
		 * Here we allow setting states that are not present in
		 * CPUID leaf 0xD, index 0, EDX:EAX.  This is for compatibility
		 * with old userspace.
		 */
		if (xstate_bv & ~kvm_supported_xcr0() ||	<= loading more than KVM supports is disallowed
		    mxcsr & ~mxcsr_feature_mask)
			return -EINVAL;
		load_xsave(vcpu, (u8 *)guest_xsave->region);
	} else {
		if (xstate_bv & ~XFEATURE_MASK_FPSSE ||
		    mxcsr & ~mxcsr_feature_mask)
			return -EINVAL;
		memcpy(&vcpu->arch.guest_fpu->state.fxsave,
		       guest_xsave->region, sizeof(struct fxregs_state));
	}
	return 0;
}
That being said: the patch that I gave silently allows unacceptable things to be accepted at the destination, whereas this maintains the status quo and signals an error when the destination cannot wholly process the uabi buffer (i.e., it is asked to restore more state than the destination processor has present).
The downside of my approach is above -- the flip side, though, is that this approach requires a patch to be applied on the source. However, we cannot apply a patch at the source until it is evacuated of VMs -- chicken-and-egg problem...
Unless I am grossly misunderstanding things here -- forgive me... :-)
I suspect you're overlooking the impact on the destination. Trimming features only when saving XSAVE state into the userspace buffer means that the destination will play nice with the "bad" snapshot. It won't help setups where a VM is being migrated from a host with more XSAVE features to a host with fewer XSAVE features, but I don't see a sane way to retroactively fix that situation. And IIUC, that's not the situation you're in (unless the Skylake host vs. Broadwell guests is only the test environment).
Silently dropping features would break KVM's ABI (see kvm_vcpu_ioctl_x86_set_xsave() above). Giving userspace a flag to opt-in is pointless because that requires a userspace update, and if userspace needs to be modified then it's simpler to just have userspace sanitize the xfeatures field.
On Tue, Sep 26, 2023, Sean Christopherson wrote:
Masking fpstate->user_xfeatures is buggy for another reason: it's destructive if userspace calls KVM_SET_CPUID multiple times. No real world userspace actually calls KVM_SET_CPUID to "expand" features, but it's technically possible and KVM is supposed to allow it.
This particular bit is wrong: KVM overwrites user_xfeatures, it doesn't AND it. I misremembered the code.
On Tue, Sep 26, 2023, Sean Christopherson wrote:
On Mon, Sep 25, 2023, Tyler Stachecki wrote:
@@ -1059,7 +1060,8 @@ static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
  * It supports partial copy but @to.pos always starts from zero.
  */
 void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
-			       u32 pkru_val, enum xstate_copy_mode copy_mode)
+			       u64 xfeatures, u32 pkru_val,
+			       enum xstate_copy_mode copy_mode)
 {
 	const unsigned int off_mxcsr = offsetof(struct fxregs_state, mxcsr);
 	struct xregs_state *xinit = &init_fpstate.regs.xsave;
@@ -1083,7 +1085,7 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
 		break;

 	case XSTATE_COPY_XSAVE:
-		header.xfeatures &= fpstate->user_xfeatures;
+		header.xfeatures &= fpstate->user_xfeatures & xfeatures;
 		break;
This changes the consideration of the xfeatures copied *into* the uabi buffer with respect to the guest xfeatures IIUC (approx guest XCR0, less FP/SSE only).
IOW: we are still trimming guest xfeatures, just at the source...?
KVM *must* "trim" features when servicing KVM_GET_XSAVE{2}, because that's been KVM's ABI for a very long time, and userspace absolutely relies on that functionality to ensure that a VM can be migrated within a pool of heterogeneous systems so long as the features that are *exposed* to the guest are supported on all platforms.
Before the big FPU rework, KVM manually filled xregs_state and directly "trimmed" based on the guest supported XCR0 (see fill_xsave() below).
The major difference between my proposed patch and the current code is that KVM trims *only* at KVM_GET_XSAVE{2}, which in addition to being KVM's historical ABI, (see load_xsave() below), ensures that the *destination* can load state irrespective of the possibly-not-yet-defined guest CPUID model.
...
That being said: the patch that I gave silently allows unacceptable things to be accepted at the destination, whereas this maintains the status quo and signals an error when the destination cannot wholly process the uabi buffer (i.e., it is asked to restore more state than the destination processor has present).
The downside of my approach is above -- the flip side, though, is that this approach requires a patch to be applied on the source. However, we cannot apply a patch at the source until it is evacuated of VMs -- chicken-and-egg problem...
Unless I am grossly misunderstanding things here -- forgive me... :-)
I suspect you're overlooking the impact on the destination. Trimming features only when saving XSAVE state into the userspace buffer means that the destination will play nice with the "bad" snapshot. It won't help setups where a VM is being migrated from a host with more XSAVE features to a host with fewer XSAVE features, but I don't see a sane way to retroactively fix that situation. And IIUC, that's not the situation you're in (unless the Skylake host vs. Broadwell guests is only the test environment).
One clarification: this comment from Tyler's proposed patch is somewhat misleading.
I do think KVM should filter state only at KVM_GET_XSAVE, because otherwise any off-by-default feature will have different behavior than on-by-default features, e.g. restoring AMX state requires KVM_SET_CPUID, whereas every other feature depends only on host capabilities.
+	/*
+	 * In previous kernels, kvm_arch_vcpu_create() set the guest's fpstate
+	 * based on what the host CPU supported. Recent kernels changed this
+	 * and only accept ustate containing xfeatures that the guest CPU is
+	 * capable of supporting.
+	 */
+	ustate->xsave.header.xfeatures &= user_xfeatures;
It's only the "realloc" path used when granting access to an off-by-default feature, i.e. AMX, that skips initialize user_xfeatures. __fpstate_reset() does initialize user_xfeatures to the default set for guest state, i.e. the problem is not in kvm_arch_vcpu_create().
The problem is specifically that KVM rejects KVM_SET_XSAVE if userspace limits guest supported xfeatures via KVM_SET_CPUID. In a perfect world, that would be desirable, e.g. KVM's ABI when loading MSRs from userspace is to (mostly) honor guest CPUID, but again this is a clear breakage of userspace. I.e. QEMU does KVM_SET_CPUID2, and *then* KVM_SET_XSAVE, which fails. If QEMU reversed those, KVM_SET_XSAVE would actually succeed.
There's another related oddity that will be fixed by my approach, assuming the realloc change is also reverted (I missed that in my pasted patch). Userspace *must* do KVM_SET_CPUID{2} in order to load off-by-default state, whereas there is no such requirement for on-by-default state.
On Tue, Sep 26, 2023, Sean Christopherson wrote:
There's another related oddity that will be fixed by my approach, assuming the realloc change is also reverted (I missed that in my pasted patch). Userspace *must* do KVM_SET_CPUID{2} in order to load off-by-default state, whereas there is no such requirement for on-by-default state.
Scratch that, KVM explicitly requires KVM_SET_CPUID2 to grant the guest access to off-by-default features, e.g. so that the kernel/KVM doesn't need to context switch AMX state if it's not exposed to the guest. Thankfully, that has always been true for XFD-based features, i.e. AMX, so it's safe to keep that behavior even though it diverges from on-by-default features.
On Thu, Sep 14, 2023 at 10:05:57AM -0700, Dongli Zhang wrote:
On 9/14/23 2:11 AM, Tyler Stachecki wrote:
On Thu, Sep 14, 2023 at 04:15:54AM -0300, Leonardo Bras wrote:
So, IIUC, the xfeatures from the source guest will be different than the xfeatures of the target (destination) guest. Is that correct?
Correct.
It does not seem right to me. I mean, from the guest viewpoint, some features will simply vanish during execution, and this could lead to major issues in the guest.
I fully agree with this.
I think the original commit ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0") is for the source server, not destination server.
That is:
Without the commit (src and dst), something bad may happen.
With the commit on src, issue is fixed.
With the commit only dst, it is expected that issue is not fixed.
Therefore, from the administrator's perspective, the bugfix should always be applied on the source server for the migration to succeed.
BTW, we may not be able to use commit ad856280ddea in the Fixes tag.
My assumption is that the guest CPU model should confine access to registers that make sense for that (guest) CPU.
e.g., take a host CPU capable of AVX-512 running a guest CPU model that only has AVX-256. If the guest suddenly loses the top 256 bits of %zmm*, it should not really be perceivable as %ymm architecturally remains unchanged.
Though maybe I'm being too rash here? Is there a case where this assumption breaks down?
The idea here is that if the target (destination) host can't provide those features for the guest, then migration should fail.
I mean, qemu should fail the migration, and that's correct behavior. Is it what is happening?
Unfortunately, no, it is not... and that is biggest concern right now.
I do see some discussion between Peter and you on this topic and see that there was an RFC to implement such behavior stemming from it, here: https://lore.kernel.org/qemu-devel/20220607230645.53950-1-peterx@redhat.com/
I agree that bug is at QEMU side, not KVM side.
It is better to improve at QEMU side.
I agree fixing QEMU is the best solution we have.
int kvm_arch_put_registers(CPUState *cpu, int level)
{
    X86CPU *x86_cpu = X86_CPU(cpu);
    int ret;
    ... ...
    ret = kvm_put_xsave(x86_cpu);
    if (ret < 0) {
        return ret;
    }
    ... ...    --> the rest of kvm_arch_put_registers() won't execute !!!
Thank you very much!
Dongli Zhang
... though I do not believe that work ever landed in the tree. Looking at qemu's master branch now, the error from kvm_arch_put_registers is just discarded in do_kvm_cpu_synchronize_post_init...
static void do_kvm_cpu_synchronize_post_init(CPUState *cpu, run_on_cpu_data arg)
{
    kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);
    cpu->vcpu_dirty = false;
}
Best, Tyler
On Thu, Sep 14, 2023 at 05:11:50AM -0400, Tyler Stachecki wrote:
On Thu, Sep 14, 2023 at 04:15:54AM -0300, Leonardo Bras wrote:
So, IIUC, the xfeatures from the source guest will be different than the xfeatures of the target (destination) guest. Is that correct?
Correct.
It does not seem right to me. I mean, from the guest viewpoint, some features will simply vanish during execution, and this could lead to major issues in the guest.
My assumption is that the guest CPU model should confine access to registers that make sense for that (guest) CPU.
e.g., take a host CPU capable of AVX-512 running a guest CPU model that only has AVX-256. If the guest suddenly loses the top 256 bits of %zmm*, it should not really be perceivable as %ymm architecturally remains unchanged.
Though maybe I'm being too rash here? Is there a case where this assumption breaks down?
There is no guarantee that it would be ok to simply remove a feature from the guest. Maybe it's fine for 99% of the cases for a given feature, but it could always go wrong.
The idea here is that if the target (destination) host can't provide those features for the guest, then migration should fail.
I mean, qemu should fail the migration, and that's correct behavior. Is it what is happening?
Unfortunately, no, it is not... and that is biggest concern right now.
I do see some discussion between Peter and you on this topic and see that there was an RFC to implement such behavior stemming from it, here: https://lore.kernel.org/qemu-devel/20220607230645.53950-1-peterx@redhat.com/
... though I do not believe that work ever landed in the tree. Looking at qemu's master branch now, the error from kvm_arch_put_registers is just discarded in do_kvm_cpu_synchronize_post_init...
This is wrong, then. QEMU should abort the migration in this case, so the VM is not lost.
Of course, with this issue fixed, there is another issue to deal with: - VMs running on hosts with older kernels get stuck on hosts without the fixes.
Thanks, Leo
static void do_kvm_cpu_synchronize_post_init(CPUState *cpu, run_on_cpu_data arg)
{
    kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);
    cpu->vcpu_dirty = false;
}
Best, Tyler