When we collect a signal context with one of the SME modes enabled we will have enabled that mode behind the compiler and libc's back so they may issue some instructions not valid in streaming mode, causing spurious failures.
For the code prior to issuing the BRK to trigger signal handling we need to stay in streaming mode if we were already there since that's a part of the signal context the caller is trying to collect. Unfortunately this code includes a memset() which is likely to be heavily optimised and is likely to use FP instructions incompatible with streaming mode. We can avoid this happening by open coding the memset(), inserting a volatile assembly statement to avoid the compiler recognising what's being done and doing something in optimisation. This code is not performance critical so the inefficiency should not be an issue.
After collecting the context we can simply exit streaming mode, avoiding these issues. Use a full SMSTOP for safety to prevent any issues appearing with ZA.
Reported-by: Will Deacon will@kernel.org Signed-off-by: Mark Brown broonie@kernel.org --- Changes in v2: - Rebase onto v6.5-rc1. - Link to v1: https://lore.kernel.org/r/20230628-arm64-signal-memcpy-fix-v1-1-db3e0300829e... --- .../selftests/arm64/signal/test_signals_utils.h | 28 +++++++++++++++++++++- 1 file changed, 27 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/arm64/signal/test_signals_utils.h b/tools/testing/selftests/arm64/signal/test_signals_utils.h index 222093f51b67..db28409fd44b 100644 --- a/tools/testing/selftests/arm64/signal/test_signals_utils.h +++ b/tools/testing/selftests/arm64/signal/test_signals_utils.h @@ -60,13 +60,28 @@ static __always_inline bool get_current_context(struct tdescr *td, size_t dest_sz) { static volatile bool seen_already; + int i; + char *uc = (char *)dest_uc;
assert(td && dest_uc); /* it's a genuine invocation..reinit */ seen_already = 0; td->live_uc_valid = 0; td->live_sz = dest_sz; - memset(dest_uc, 0x00, td->live_sz); + + /* + * This is a memset() but we don't want the compiler to + * optimise it into either instructions or a library call + * which might be incompatible with streaming mode. + */ + for (i = 0; i < td->live_sz; i++) { + asm volatile("nop" + : "+m" (*dest_uc) + : + : "memory"); + uc[i] = 0; + } + td->live_uc = dest_uc; /* * Grab ucontext_t triggering a SIGTRAP. @@ -103,6 +118,17 @@ static __always_inline bool get_current_context(struct tdescr *td, : : "memory");
+ /* + * If we were grabbing a streaming mode context then we may + * have entered streaming mode behind the system's back and + * libc or compiler generated code might decide to do + * something invalid in streaming mode, or potentially even + * the state of ZA. Issue a SMSTOP to exit both now we have + * grabbed the state. + */ + if (td->feats_supported & FEAT_SME) + asm volatile("msr S0_3_C4_C6_3, xzr"); + /* * If we get here with seen_already==1 it implies the td->live_uc * context has been used to get back here....this probably means
--- base-commit: 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5 change-id: 20230628-arm64-signal-memcpy-fix-7de3b3c8fa10
Best regards,
Hi Mark,
On Wed, Jul 12, 2023 at 12:02:30PM +0100, Mark Brown wrote:
When we collect a signal context with one of the SME modes enabled we will have enabled that mode behind the compiler and libc's back so they may issue some instructions not valid in streaming mode, causing spurious failures.
For the code prior to issuing the BRK to trigger signal handling we need to stay in streaming mode if we were already there since that's a part of the signal context the caller is trying to collect. Unfortunately this code includes a memset() which is likely to be heavily optimised and is likely to use FP instructions incompatible with streaming mode. We can avoid this happening by open coding the memset(), inserting a volatile assembly statement to avoid the compiler recognising what's being done and doing something in optimisation. This code is not performance critical so the inefficiency should not be an issue.
After collecting the context we can simply exit streaming mode, avoiding these issues. Use a full SMSTOP for safety to prevent any issues appearing with ZA.
Thanks for looking at this. Comments inline.
diff --git a/tools/testing/selftests/arm64/signal/test_signals_utils.h b/tools/testing/selftests/arm64/signal/test_signals_utils.h index 222093f51b67..db28409fd44b 100644 --- a/tools/testing/selftests/arm64/signal/test_signals_utils.h +++ b/tools/testing/selftests/arm64/signal/test_signals_utils.h @@ -60,13 +60,28 @@ static __always_inline bool get_current_context(struct tdescr *td, size_t dest_sz) { static volatile bool seen_already;
- int i;
- char *uc = (char *)dest_uc;
assert(td && dest_uc); /* it's a genuine invocation..reinit */ seen_already = 0; td->live_uc_valid = 0; td->live_sz = dest_sz;
- memset(dest_uc, 0x00, td->live_sz);
- /*
* This is a memset() but we don't want the compiler to
* optimise it into either instructions or a library call
* which might be incompatible with streaming mode.
*/
- for (i = 0; i < td->live_sz; i++) {
asm volatile("nop"
: "+m" (*dest_uc)
:
: "memory");
I don't think it's save to use "+m" here, since the compiler can assume that the address is used exactly once in the asm. If a post-indexed addressing mode is generated, then you can end up with register corruption.
Stepping back, why not use either barrier() or OPTIMIZER_HIDE_VAR() instead?
The most robust fix would be to write all of the streaming mode code in asm, but I can appreciate that's a tonne of work for a testcase.
Will
On Thu, Jul 20, 2023 at 11:31:13AM +0100, Will Deacon wrote:
On Wed, Jul 12, 2023 at 12:02:30PM +0100, Mark Brown wrote:
- /*
* This is a memset() but we don't want the compiler to
* optimise it into either instructions or a library call
* which might be incompatible with streaming mode.
*/
- for (i = 0; i < td->live_sz; i++) {
asm volatile("nop"
: "+m" (*dest_uc)
:
: "memory");
I don't think it's save to use "+m" here, since the compiler can assume that the address is used exactly once in the asm. If a post-indexed addressing mode is generated, then you can end up with register corruption.
Stepping back, why not use either barrier() or OPTIMIZER_HIDE_VAR() instead?
That should work. I was mostly just open coding OPTIMIZER_HIDE_VAR() and noticed that memory constraints were a thing.
The most robust fix would be to write all of the streaming mode code in asm, but I can appreciate that's a tonne of work for a testcase.
It's probably more proportionate to add a dependency on toolchain support for SME, but that'd mean we hardly ever run the tests.
linux-kselftest-mirror@lists.linaro.org