On Thu, 07 Dec 2023 12:30:45 +0000, Mark Brown <broonie@kernel.org> wrote:
On Thu, Dec 07, 2023 at 08:39:46AM +0000, Marc Zyngier wrote:
Mark Brown <broonie@kernel.org> wrote:
#define HCRX_GUEST_FLAGS \
- (HCRX_EL2_SMPME | HCRX_EL2_TCR2En | \
+ (HCRX_EL2_SMPME | HCRX_EL2_TCR2En | HCRX_EL2_EnFPM | \
We really should start making all of these things conditional. See below.
Is there an overarching theory behind how these things are intended to work? I agree with you that I'd have expected more conditionality here; I was trying to fit in with the existing pattern. It's kind of hard to follow what the intention is, I think to some extent because things have evolved over time.
The intention is that *everything* becomes conditional, based on both the host CPU support and the feature set advertised to the guest. Which means that we don't enable stuff that isn't advertised to the guest, and instead make these things UNDEF. Just like on any real CPU implementation.
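As a rough illustration of that rule, and not the actual KVM code, the guest's HCRX_EL2 value could be computed per VM instead of coming from a fixed macro. The bit positions, the structures and every `sketch_` name below are hypothetical stand-ins chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions; the real ones are defined by the architecture. */
#define SKETCH_HCRX_TCR2En	(UINT64_C(1) << 0)
#define SKETCH_HCRX_EnFPM	(UINT64_C(1) << 1)

/* Hypothetical per-feature view of the host and of one guest. */
struct sketch_feat {
	bool host_has;		/* the host CPUs implement the feature */
	bool guest_sees;	/* the guest ID registers advertise it */
};

struct sketch_vm {
	struct sketch_feat tcr2;
	struct sketch_feat fpmr;
};

/* A feature is usable only if the host has it and the guest is told so. */
static bool sketch_feat_enabled(const struct sketch_feat *f)
{
	return f->host_has && f->guest_sees;
}

/*
 * Build the enable bits per VM. A bit left clear means the access traps
 * to the hypervisor, which can then inject an UNDEF into the guest.
 */
static uint64_t sketch_guest_hcrx(const struct sketch_vm *vm)
{
	uint64_t hcrx = 0;

	assert(vm);

	if (sketch_feat_enabled(&vm->tcr2))
		hcrx |= SKETCH_HCRX_TCR2En;
	if (sketch_feat_enabled(&vm->fpmr))
		hcrx |= SKETCH_HCRX_EnFPM;

	return hcrx;
}
```

With this shape, a guest that is not told about FPMR never gets EnFPM set, whatever the host supports.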
@@ -517,7 +519,6 @@ struct kvm_vcpu_arch {
 	enum fp_type fp_type;
 	unsigned int sve_max_vl;
 	u64 svcr;
-	u64 fpmr;
Why do this change here? Why isn't it done like that in the first place?
It didn't seem right to add the register to struct vcpu_sysreg before it was handled by KVM. As referenced in the cover letter, normally this wouldn't come up: KVM doesn't rely on the host kernel for managing register state, so we add KVM support and then enable the host kernel. For FPSIMD, though, we're reusing fpsimd_save(), so we need the host kernel support to be in place when we enable KVM.
That doesn't explain why you can't be upfront with it and populate the FPMR entry. In either case, you are wasting a u64.
 	CGT_MDCR_TDE,
@@ -279,6 +281,12 @@ static const struct trap_bits coarse_trap_bits[] = {
 		.mask		= HCR_TTLBOS,
 		.behaviour	= BEHAVE_FORWARD_ANY,
 	},
+	[CGT_HCRX_EnFPM] = {
+		.index		= HCRX_EL2,
+		.value		= HCRX_EL2_EnFPM,
+		.mask		= HCRX_EL2_EnFPM,
+		.behaviour	= BEHAVE_FORWARD_ANY,
This looks wrong. HCRX_EL2.EnFPM is an enable bit.
Right, it's the wrong way round.
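To make the polarity problem concrete, here is a minimal sketch. It assumes that a table entry matches, and the trap is forwarded, when (reg & mask) == value; that matching rule, the bit position and all `sketch_` names are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_EnFPM	(UINT64_C(1) << 0)	/* hypothetical bit position */

/* Simplified trap descriptor: the entry matches when (reg & mask) == value. */
struct sketch_trap_bits {
	uint64_t value;
	uint64_t mask;
};

static bool sketch_trap_fires(const struct sketch_trap_bits *tb, uint64_t reg)
{
	assert(tb);
	return (reg & tb->mask) == tb->value;
}

/* As posted: matches when the bit is set, i.e. when access is enabled. */
static const struct sketch_trap_bits sketch_posted = {
	.value	= SKETCH_EnFPM,
	.mask	= SKETCH_EnFPM,
};

/* Inverted for an enable bit: matches when the bit is clear. */
static const struct sketch_trap_bits sketch_inverted = {
	.value	= 0,
	.mask	= SKETCH_EnFPM,
};
```

Under that assumption, the posted entry would forward the trap exactly when the guest hypervisor has enabled access, which is the opposite of what an enable bit means.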
+static void *fpsimd_share_end(struct user_fpsimd_state *fpsimd)
+{
+	void *share_end = fpsimd + 1;
+	if (cpus_have_final_cap(ARM64_HAS_FPMR))
+		share_end += sizeof(u64);
+	return share_end;
+}
This is horrible. Why can't you just have a new structure wrapping both user_fpsimd_state and fpmr? This is going to break in subtle ways, just like the SVE/SME stuff.
I agree that it's not great. The main issue was that fpsimd_state is both already embedded in uw for hardened usercopy and very widely referenced by exactly which struct it's in, so I was taking a guess as to what would get the fewest objections. The obvious thing would be to add FPMR to uw and share the whole thing with the hypervisor; if people don't mind adding another field to uw, I could do that?
Either that, or you create a KVM-specific structure that contains these fields. If that results in KVM changes, so be it. But I won't take this sort of pointer arithmetic that assumes some pre-defined layout.
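A minimal sketch of the wrapper-structure alternative follows. The FPSIMD layout is a simplified stand-in rather than the kernel's real struct user_fpsimd_state, and the wrapper type and `sketch_` helpers are hypothetical names, not anything from the posted series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct user_fpsimd_state. */
struct sketch_user_fpsimd_state {
	uint64_t vregs[32][2];	/* 32 registers of 128 bits each */
	uint32_t fpsr;
	uint32_t fpcr;
};

/* Hypothetical wrapper: everything shared with the hypervisor lives here. */
struct sketch_host_fp_state {
	struct sketch_user_fpsimd_state fpsimd;
	uint64_t fpmr;
};

/* The end of the shared region follows from the type, not from a cap check. */
static void *sketch_share_end(struct sketch_host_fp_state *st)
{
	assert(st);
	return st + 1;
}

/* Members are reached by name, so no layout assumption needs checking. */
static uint64_t *sketch_fpmr(struct sketch_host_fp_state *st)
{
	assert(st);
	return &st->fpmr;
}
```

Because the compiler owns the layout, there is nothing for a WARN_ON_ONCE() to verify: the FPMR slot is inside the shared region by construction.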
 	vcpu->arch.host_fpsimd_state = kern_hyp_va(fpsimd);
+	if (cpus_have_final_cap(ARM64_HAS_FPMR)) {
+		WARN_ON_ONCE(&current->thread.fpmr + 1 != fpsimd_share_end(fpsimd));
How can this happen?
It shouldn't, but it'd be bad if it did so I put a check in to make sure we haven't messed up.
See my earlier point: you shouldn't have to check if you used a data structure.
+		vcpu->arch.host_fpmr = kern_hyp_va(&current->thread.fpmr);
+	}
We really need to stop piling the save/restore of stuff that isn't advertised to the guest.
I'm not clear what you're referencing here? The feature is advertised to the guest via the ID registers and in the past you've pushed back on making things where the state is just a single register like this optional. Do you mean that we should be making this conditional on the guest ID registers? If that is the case is there a plan for how that's supposed to work, set flags when kvm_vcpu_run_pid_change() happens for example?
See the beginning of this email. It is high time that we stop enabling everything by default, because this totally breaks VM migration. We already have a huge backlog of these things, and I don't want to add more of it.
Which means that at the very least, enabling *any* feature also comes with sanitising the state one way or another when this feature is disabled by userspace.
How this is being done is still a work in progress: my current plan is based on a set of trap bits that are computed on a per-VM basis, and some side state that indicates whether the trap handling is for emulation or feature-disabling purposes. This will probably reuse the NV infrastructure, which has an exhaustive list of the sysregs that can be trapped from EL0/EL1.
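One possible reading of that plan, sketched under heavy assumptions since the design is described as unfinished: each VM records, per trappable register, why the trap exists, and the handler picks its action from that side state. The register list, enums and `sketch_` functions are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of trappable registers. */
enum sketch_sysreg { SKETCH_REG_FPMR, SKETCH_REG_TCR2, SKETCH_NR_REGS };

/* Side state: why a given register traps for this VM. */
enum sketch_trap_reason {
	SKETCH_NO_TRAP,
	SKETCH_TRAP_EMULATE,	/* feature advertised, handled in software */
	SKETCH_TRAP_DISABLED,	/* feature hidden from the guest */
};

enum sketch_action { SKETCH_ACT_NONE, SKETCH_ACT_EMULATE, SKETCH_ACT_UNDEF };

struct sketch_vm {
	bool advertised[SKETCH_NR_REGS];	/* from the guest ID registers */
	bool emulated[SKETCH_NR_REGS];		/* needs software emulation */
	enum sketch_trap_reason reason[SKETCH_NR_REGS];
};

/* Computed once per VM, e.g. after userspace has set the ID registers. */
static void sketch_compute_traps(struct sketch_vm *vm)
{
	assert(vm);

	for (int r = 0; r < SKETCH_NR_REGS; r++) {
		if (!vm->advertised[r])
			vm->reason[r] = SKETCH_TRAP_DISABLED;
		else if (vm->emulated[r])
			vm->reason[r] = SKETCH_TRAP_EMULATE;
		else
			vm->reason[r] = SKETCH_NO_TRAP;
	}
}

/* The trap handler consults the side state to decide what to do. */
static enum sketch_action sketch_handle_trap(const struct sketch_vm *vm,
					     enum sketch_sysreg reg)
{
	assert(vm);

	switch (vm->reason[reg]) {
	case SKETCH_TRAP_DISABLED:
		return SKETCH_ACT_UNDEF;
	case SKETCH_TRAP_EMULATE:
		return SKETCH_ACT_EMULATE;
	default:
		return SKETCH_ACT_NONE;
	}
}
```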
At the very least, userspace shouldn't be able to observe the state that a guest isn't supposed to generate, and we should be mindful of not creating covert channels.
M.