On Fri, Feb 11, 2022 at 05:02:16PM +0000, Catalin Marinas wrote:
On Thu, Feb 10, 2022 at 07:45:49PM +0000, Mark Brown wrote:
If we don't preserve ZA then userspace will be forced to save it when enabled, which increases overall costs. If we do preserve ZA then it's no more expensive for the kernel to save it than userspace, we avoid the cost of restoring in the case where we return directly to userspace without context switching, and if we do future work to save more lazily then we may be able to avoid some of the saves.
Thanks for the explanation and the PCS pointer. I guess doing the lazy saving scheme in the syscall handler is a lot more painful (faults etc.) and it's a user-only ABI/PCS, so we shouldn't tie the kernel into it.
Yes, other than the considerations around clone() it's clearly more complicated to engage with.
Given that Linux doesn't plan to use the ZA registers itself, in most cases it won't need to restore anything. But we still need to save the ZA registers on context switch in case the thread wakes up on a different CPU. How often do you reckon would the user do a syscall with active ZA?
I would expect it to be very rare that userspace would want to do a syscall with ZA enabled, though obviously there's not a huge body of real world SME code to validate that against yet. The expected usage pattern is that both ZA and SM are only enabled for fairly brief bursts of intense computation and disabled when not actively used. It's possible that you will see things like logging during computation, or perhaps streaming data to/from a running algorithm incrementally during operation, generating syscalls, so I wouldn't be surprised to see it happen, but for most systems it should be a very small percentage of system calls.
What does that mean? Is this as per the sve.rst doc (unspecified but zeroed in practice)?
Yes, we will exit streaming mode and proceed as per sve.rst and the rest of the ABI.
So in this case we consider the syscall interface as non-streaming (as per the PCS terminology). Should we require that the PSTATE.SM is cleared by the user as well? Alternatively, we could make it streaming-compatible and just preserve it. Are there any drawbacks? kernel_neon_begin() could clear SM if needed.
In fact kernel_neon_begin() already disables PSTATE.SM since we need to account for the case where userspace was preempted rather than issued a syscall. We could require that PSTATE.SM is disabled by the user, though it's questionable what we could usefully and helpfully do about it if they forget other than disable it anyway or generate a signal.
We could preserve PSTATE.SM, though since all the other register state for streaming mode is shared with SVE I would expect that we should be applying the SVE discard rules to it and there is therefore no other state that should be retained. As things stand this would either result in more overhead or complicate the register save and restore a bit since if we're in streaming mode we currently assume that we should save and restore the full SVE register contents but normally in a syscall we only need to save and restore the FPSIMD subset. The overhead might go away anyway as a result of general work on syscall optimisation for SVE, though that work isn't done yet and may not end up working out that way.
Having said that, as with ZA, userspace can just exit streaming mode to avoid any overhead that having it enabled introduces, and the common case is expected to be that it will have done so due to the PCS, so it should be an extremely rare case. Unlike keeping ZA active, there doesn't seem to be any case where it would be sensible to want to do this, and the PCS means you'd have to actively try to do so.
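To make the save/restore trade-off above a little more concrete, here is a minimal sketch of the decision as described: streaming mode forces a full SVE save, while a plain syscall only needs the FPSIMD subset. The enum, the function name pick_fp_save() and the sve_live parameter are all hypothetical, and the non-syscall branch is an assumption added for completeness rather than something stated in the thread.

```c
#include <stdbool.h>

/* Which register image to save; names are illustrative only. */
enum fp_save_kind {
	FP_SAVE_FPSIMD,   /* FPSIMD subset only */
	FP_SAVE_SVE_FULL, /* full SVE register contents */
};

/*
 * Hypothetical helper, not kernel code.
 * streaming:  PSTATE.SM was set when we entered the kernel
 * in_syscall: entry was via a syscall rather than preemption
 * sve_live:   assumption: the task has live SVE state
 */
static enum fp_save_kind pick_fp_save(bool streaming, bool in_syscall,
				      bool sve_live)
{
	/* Streaming mode: currently treated as needing the full SVE state. */
	if (streaming)
		return FP_SAVE_SVE_FULL;

	/* SVE discard rules: a syscall only keeps the FPSIMD subset. */
	if (in_syscall)
		return FP_SAVE_FPSIMD;

	/* Assumed: on preemption, save whatever state is actually live. */
	return sve_live ? FP_SAVE_SVE_FULL : FP_SAVE_FPSIMD;
}
```

The first branch is where preserving PSTATE.SM across syscalls would either cost a full save or need a special case.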
Largely just because it's more complicated to implement copying the ZA backing store for this and it seemed more likely that someone would be surprised by a new process getting stuck carrying a potentially large copy of ZA around that it was unaware of than that someone would actually want that to happen. It's not a particularly strongly held opinion.
If PSTATE.ZA is valid and the user does a fork() (well, implemented as clone()), normally it expects a nearly identical state in the child. With clone(), if a new thread is created, we likely don't need the additional ZA state. We got away without having to think about this for SVE as the state is lost on syscall. Here we risk having a vaguely defined ABI - fork() is disabled on arm64 for example but we do have clone() and clone3().
Still thinking about this but maybe we could do something like always copy the ZA state unless CLONE_VM is passed for example. It is marginally more precise.
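As a rough illustration of the rule floated above, the sketch below decides whether a child inherits ZA purely from the clone flags. The helper name za_copy_on_clone() is hypothetical and not from the kernel tree; only CLONE_VM is a real flag, taken here from the uapi header.

```c
#include <stdbool.h>
#include <linux/sched.h>	/* CLONE_VM */

/*
 * Hypothetical policy helper, not kernel code: copy ZA to the child
 * unless it shares the parent's address space (CLONE_VM), i.e. treat
 * fork()-like clones as wanting a near-identical state and thread-like
 * clones as not needing the extra ZA backing store.
 */
static bool za_copy_on_clone(unsigned long clone_flags)
{
	return !(clone_flags & CLONE_VM);
}
```

With this rule a fork()-style clone (no CLONE_VM) keeps ZA and a thread-style clone drops it.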
We should definitely write this up a bit more explicitly whatever we do; like I say, I don't really have strong opinions here.
There's also the interaction with the lazy save state to consider - TPIDR2 is cleared if CLONE_SETTLS is specified which would interfere with any lazy state saving that had already happened, though hopefully userspace is taking care of that as part of setting up the new thread so I think it's fine.
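For reference, the TPIDR2 behaviour described above amounts to something like the sketch below. The helper name child_tpidr2() is hypothetical, and inheriting the parent's value when CLONE_SETTLS is absent is an assumption made for the sake of the example; only the clearing on CLONE_SETTLS comes from the description above.

```c
#include <stdint.h>
#include <linux/sched.h>	/* CLONE_SETTLS */

/*
 * Hypothetical helper, not kernel code: compute the TPIDR2 value a
 * new task starts with. TPIDR2 points at the lazy save buffer in the
 * PCS scheme, so clearing it drops any reference to a lazy save that
 * the parent had already set up.
 */
static uint64_t child_tpidr2(uint64_t parent_tpidr2, unsigned long clone_flags)
{
	if (clone_flags & CLONE_SETTLS)
		return 0;		/* cleared along with the new TLS */

	return parent_tpidr2;		/* assumed: otherwise inherited */
}
```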