On Mon, Feb 14, 2022 at 06:19:58PM +0000, Catalin Marinas wrote:
On Fri, Feb 11, 2022 at 06:13:58PM +0000, Mark Brown wrote:
We could preserve PSTATE.SM, though since all the other register state for streaming mode is shared with SVE, I would expect that we should be applying the SVE discard rules to it, and there is therefore no other state that should be retained.
So when clearing PSTATE.SM, the streaming SVE regs become unknown (well, the wording is a bit more verbose). I think this fits well with the proposal to drop the streaming SVE state entirely on syscalls.
They're preserved or zeroed, yes.
The ZA state I think is not affected by the PSTATE.SM change (early internal SME specs were listing this as unknown after SM clearing but I can't find it in the latest spec). However, after the syscall, the user won't be able to execute SME instructions until turning on PSTATE.SM again.
Yes, ZA is preserved unless PSTATE.ZA is disabled. There are some instructions that can be used to interact with it outside of streaming mode, a subset of the instructions for loading and storing values in ZA.
Would the libc wrappers preserve PSTATE.SM? What I find a bit confusing is that we only partially preserve some state while in streaming mode - the ZA registers but not the SVE ones.
I would expect that libc wrappers would expect to be called with streaming mode already disabled. That's what default functions in the PCS expect, and since without FA64 enabled a huge proportion of FPSIMD instructions and some SVE instructions become undefined, standard code could easily generate traps if it uses those instructions for anything. I wouldn't expect that libc would explicitly disable SME itself in standard configurations.
Is the user more likely to turn PSTATE.SM on for ZA processing or for SVE? If the former, we don't want to unnecessarily save/restore some SVE state that the user doesn't care
It's expected that any active work with ZA will require enabling streaming mode. You can't do any actual computation with it without doing so, and most of the work with ZA will involve using the streaming mode SVE registers as part of the computation (eg, collecting results in a Z register, or doing an operation to a ZA tile using the contents of a Z register as an operand).
It is also expected that some applications may prefer to execute what is mainly an SVE workload in streaming mode. As well as any performance relevant differences in the implementation choices the hardware makes, it is likely that some systems will have vector lengths available in streaming mode that are otherwise unavailable (eg, you might have PEs with 128 bit FPSIMD/SVE units and a 512 bit SMCU).
I don't have a good handle on which sort of usage is going to be more common, and I expect that the answer is going to be very system dependent, varying based on both the mix of applications running on the system at any given moment and the capabilities of the standard and streaming mode floating point implementations that the system has.
However, the existing syscall ABI for the Z and P registers (which is all the SVE register state, FFR is a magic P register) means that unless we treat streaming mode differently to non-streaming mode we'll be discarding whatever state is there anyway, so userspace by definition shouldn't have anything in there it expects to be preserved when it does a syscall. I'd rather not introduce an ABI that guarantees that we preserve the streaming mode SVE register state in cases where we discard (or can discard) the non-streaming SVE register state. That's both going to be more complicated to implement and more likely to cause unexpected differences that trip userspace up.
about (can we even trap SVE instructions independently of SME while in streaming mode?).
I'd need to check through but I don't believe so.
I'd find it clearer if we preserved PSTATE.SM and, w.r.t. the streaming SVE state, we somewhat follow the PCS and not restore the regs (input from the libc people welcomed).
Like I say we can do that easily enough, it's not something I expect to ever come up in practical usage though.
Having said that, as with ZA, userspace can just exit streaming mode to avoid any overhead that having it enabled introduces, and the common case is expected to be that it will have done so due to the PCS. It should be an extremely rare case: unlike keeping ZA active, there doesn't seem to be any case where it would be sensible to want to do this, and the PCS means you'd have to actively try to do so.
IIUC, the PCS introduced the notion of streaming-compatible functions that preserve the SM bit. If they are non-streaming, SM should be 0 on
Yes, it isn't the default though.
entry. It would be nice if we put the syscalls in one of these categories, so either mandate SM == 0 on entry or preserve (the latter being easier, I think, I haven't looked at what it takes to save/restore the streaming SVE state; I may change my mind after reviewing the other patches).
The streaming SVE state is identical to the SVE state, with two exceptions: the FFR predicate register, which is not present unless FA64 is available in the system and enabled, and the separately configured vector length.
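To put rough numbers on that, here is a small sketch in C of how the register state scales with the vector length. The helper names are made up for this mail and are not kernel functions; the only inputs are the architectural register widths (Z registers are one vector long, P registers and FFR have one bit per vector byte).

```c
#include <stdbool.h>
#include <stddef.h>

/* All sizes are in bytes; vl is the streaming or non-streaming vector length. */

/* Z0-Z31: 32 registers of vl bytes each. */
static size_t z_regs_size(size_t vl)
{
	return 32 * vl;
}

/* P0-P15: 16 registers of vl / 8 bytes each. */
static size_t p_regs_size(size_t vl)
{
	return 16 * (vl / 8);
}

/* FFR is the size of one predicate register, when it is present at all. */
static size_t ffr_size(size_t vl, bool have_ffr)
{
	return have_ffr ? vl / 8 : 0;
}

/* Hypothetical helper: total Z + P (+ FFR) state for a given vector length. */
static size_t sve_state_size(size_t vl, bool have_ffr)
{
	return z_regs_size(vl) + p_regs_size(vl) + ffr_size(vl, have_ffr);
}
```

For the example mentioned earlier, a 128 bit non-streaming unit (vl = 16, FFR present) and a 512 bit streaming unit without FA64 (vl = 64, no FFR) would be sve_state_size(16, true) and sve_state_size(64, false) respectively.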
It's sounding like we may as well just preserve SM. It shouldn't come up that often anyway, and if it causes performance problems we can probably optimise it, and/or userspace can simply not do that. Like I say I don't have particularly strong feelings; the current behaviour was just the easiest thing to implement and it doesn't seem like there is a use case. This is fine by me, I can do that for the next version.
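As a rough model of the rule being discussed (not the actual kernel code), something like the following C sketch: PSTATE.SM and PSTATE.ZA are left alone over a syscall while the Z and P registers get the SVE discard treatment. The struct and function names are hypothetical, and keeping the low 128 bits of each Z register (the V register view) is an assumption carried over from the existing non-streaming SVE syscall ABI.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_VL	256	/* bytes, architectural maximum (2048 bits) */
#define V_BYTES	16	/* low 128 bits of each Z register alias V0-V31 */

/* Hypothetical per-task model, for illustration only. */
struct task_model {
	bool sm;				/* PSTATE.SM */
	bool za;				/* PSTATE.ZA */
	unsigned char z[32][MAX_VL];		/* Z0-Z31 */
	unsigned char p[16][MAX_VL / 8];	/* P0-P15 */
};

/* Apply the SVE discard rules; leave SM and ZA exactly as they were. */
static void model_syscall(struct task_model *t)
{
	int i;

	/* Assumption: the V register view survives, the rest of Z is zeroed. */
	for (i = 0; i < 32; i++)
		memset(&t->z[i][V_BYTES], 0, MAX_VL - V_BYTES);

	/* Predicate registers are discarded entirely. */
	memset(t->p, 0, sizeof(t->p));

	/* t->sm and t->za are intentionally not touched. */
}
```

The point of the sketch is just that streaming and non-streaming mode take the same path through the discard logic, which is what keeps the ABI from growing a special case.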
[fork()/clone() behaviour]
(few hours later) I think instead of singling out fork() (clone3() actually), we can just say that new tasks (process/thread) always start with PSTATE.ZA == 0, PSTATE.SM == 0 (tbd for this) and TPIDR2_EL0 == 0 irrespective of any clone3() flags (even CLONE_SETTLS). The C library will have to implement the lazy ZA saving in the parent before the syscall and the child will automatically recover the state if it follows the PCS.
Works for me, I think forcing userspace to consider this is going to work out to be more robust.
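For the avoidance of doubt, this is how I read the proposed rule, as a minimal C sketch. The struct and function are hypothetical names for this mail rather than anything in the patches, and the PSTATE.SM part is still marked as to be decided above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the per-task SME state. */
struct sme_task_state {
	bool pstate_sm;
	bool pstate_za;
	uint64_t tpidr2_el0;
};

/*
 * Derive the state a new process or thread starts with. The clone flags
 * are accepted but deliberately ignored: under the proposal not even
 * CLONE_SETTLS changes the outcome.
 */
static struct sme_task_state new_task_state(struct sme_task_state parent,
					    uint64_t clone_flags)
{
	struct sme_task_state child = parent;

	(void)clone_flags;

	child.pstate_sm = false;	/* tbd in the discussion above */
	child.pstate_za = false;
	child.tpidr2_el0 = 0;

	return child;
}
```

With that in place any lazy ZA save has to happen in the parent before the syscall, since the child never sees a live ZA or a TPIDR2_EL0 pointer.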