On Tue, Sep 26, 2023, Sean Christopherson wrote:
On Mon, Sep 25, 2023, Tyler Stachecki wrote:
@@ -1059,7 +1060,8 @@ static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
- It supports partial copy but @to.pos always starts from zero.
*/ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
u32 pkru_val, enum xstate_copy_mode copy_mode)
u64 xfeatures, u32 pkru_val,
enum xstate_copy_mode copy_mode)
{ const unsigned int off_mxcsr = offsetof(struct fxregs_state, mxcsr); struct xregs_state *xinit = &init_fpstate.regs.xsave; @@ -1083,7 +1085,7 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, break; case XSTATE_COPY_XSAVE:
header.xfeatures &= fpstate->user_xfeatures;
break;header.xfeatures &= fpstate->user_xfeatures & xfeatures;
This changes the consideration of the xfeatures copied *into* the uabi buffer with respect to the guest xfeatures IIUC (approx guest XCR0, less FP/SSE only).
IOW: we are still trimming guest xfeatures, just at the source...?
KVM *must* "trim" features when servicing KVM_GET_SAVE{2}, because that's been KVM's ABI for a very long time, and userspace absolutely relies on that functionality to ensure that a VM can be migrated within a pool of heterogenous systems so long as the features that are *exposed* to the guest are supported on all platforms.
Before the big FPU rework, KVM manually filled xregs_state and directly "trimmed" based on the guest supported XCR0 (see fill_xsave() below).
The major difference between my proposed patch and the current code is that KVM trims *only* at KVM_GET_XSAVE{2}, which in addition to being KVM's historical ABI, (see load_xsave() below), ensures that the *destination* can load state irrespective of the possibly-not-yet-defined guest CPUID model.
...
That being said: the patch that I gave siliently allows unacceptable things to be accepted at the destination, whereas this maintains status-quo and signals an error when the destination cannot wholly process the uabi buffer (i.e., asked to restore more state than the destination processor has present).
The downside of my approach is above -- the flip side, though is that this approach requires a patch to be applied on the source. However, we cannot apply a patch at the source until it is evacuated of VMs -- chicken and egg problem...
Unless I am grossly misunderstanding things here -- forgive me... :-)
I suspect you're overlooking the impact on the destination. Trimming features only when saving XSAVE state into the userspace buffer means that the destination will play nice with the "bad" snapshot. It won't help setups where a VM is being migrated from a host with more XSAVE features to a host with fewer XSAVE features, but I don't see a sane way to retroactively fix that situation. And IIUC, that's not the situation you're in (unless the Skylake host vs. Broadwell guests is only the test environment).
One clarification: this comment from Tyler's proposed patch is somewhat misleading.
I do think KVM should filter state only at KVM_GET_XSAVE, because otherwise any off-by-default feature will have different behavior than on-by-default features, e.g. restoring AMX state requires KVM_SET_CPUID, whereas every other features depends only on host capabilities.
+ /* + * In previous kernels, kvm_arch_vcpu_create() set the guest's fpstate + * based on what the host CPU supported. Recent kernels changed this + * and only accept ustate containing xfeatures that the guest CPU is + * capable of supporting. + */ + ustate->xsave.header.xfeatures &= user_xfeatures;
It's only the "realloc" path used when granting access to an off-by-default feature, i.e. AMX, that skips initialize user_xfeatures. __fpstate_reset() does initialize user_xfeatures to the default set for guest state, i.e. the problem is not in kvm_arch_vcpu_create().
The problem is specifically that KVM rejects KVM_SET_XSAVE if userspace limits guest supported xfeatures via KVM_SET_CPUID. In a perfect world, that would be desirable, e.g. KVM's ABI when loading MSRs from userspace is to (mostly) honor guest CPUID, but again this is a clear breakage of userspace. I.e. QEMU does KVM_SET_CPUID2, and *then* KVM_SET_XSAVE, which fails. If QEMU reversed those, KVM_SET_XSAVE would actually succeed.
There's another related oddity that will be fixed by my approach, assuming the realloc change is also reverted (I missed that in my pasted patch). Userspace *must* do KVM_SET_CPUID{2} in order to load off-by-default state, whereas there is no such requirement for on-by-default state.