On Wed, Dec 20, 2023 at 06:06:53PM -0600, Daniel Díaz wrote:
> We have been seeing this problem in other instances, specifically on the following kernels:
> - 5.15.132, 5.15.134-rc1, 5.15.135, 5.15.136-rc1, 5.15.142, 5.15.145-rc1
> - 6.1.42, 6.1.43, 6.1.51-rc1, 6.1.56-rc1, 6.1.59-rc1, 6.1.63
> - 6.3.10, 6.3.11
> - 6.4.7
> - 6.5.2, 6.5.10-rc2
This is a huge range of kernels, with some substantial reworkings of the FP code in between, and I note that v5.15 appears to have had only one change backported there (an incidental one related to ESR handling). That makes me think this is likely to be something that has been sitting there for a very long time and is unrelated to those specific versions or to any changes that went into them. I see you're still testing back to v4.19, which suggests the issue was introduced between v5.10 and v5.15. My change cccb78ce89c45a4 ("arm64/sve: Rework SVE access trap to convert state in registers") does jump out there, though I don't immediately see what the issue would be.
Looking at the list of versions you've posted, the earliest is from the very end of June, with others in July. Did something change in your test environment at around that time? I see that the logs you and Naresh posted both use a Debian 12/Bookworm based root filesystem, which was released a couple of weeks before this started appearing. Bookworm introduced glibc usage of SVE, which makes SVE usage much more common. Is this perhaps tied to you upgrading your root filesystems to Bookworm, or were you tracking testing before then?
> Most recent case is for the current 5.15 RC. Decoded stack trace is here:
> -----8<-----
> <4>[ 29.297166] ------------[ cut here ]------------
> <4>[ 29.298039] WARNING: CPU: 1 PID: 220 at arch/arm64/kernel/fpsimd.c:950 do_sve_acc (/builds/linux/arch/arm64/kernel/fpsimd.c:950 (discriminator 1))
That's an assert that we shouldn't take an SVE trap when SVE is already enabled for the thread. The backtrace Naresh originally supplied was a NULL pointer dereference while attempting to save SVE state during thread switch, which indicates that we think we need to save SVE state but don't have any storage allocated for it. It's very plausible that the two are the same underlying issue, but it's also not 100% a given. Can you double check exactly how similar the various issues you are seeing are, please?
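To make the two symptoms concrete, here is a toy userspace model of the two invariants involved. It is only a sketch and not the kernel code: struct toy_thread, the toy_* functions and the result codes are names made up for illustration, and the real logic in arch/arm64/kernel/fpsimd.c is considerably more involved.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-thread FP state; not the kernel's task_struct. */
struct toy_thread {
	bool sve_enabled;	/* models "SVE is enabled for this thread" */
	void *sve_state;	/* models the storage for saved SVE registers */
};

enum toy_result {
	TOY_OK,
	TOY_TRAP_WHILE_ENABLED,		/* models the WARNING in the SVE trap path */
	TOY_SAVE_WITHOUT_STORAGE	/* models the NULL dereference on save */
};

/* SVE access trap: only expected while SVE is still disabled for the thread. */
static enum toy_result toy_sve_trap(struct toy_thread *t, void *storage)
{
	if (t->sve_enabled)
		return TOY_TRAP_WHILE_ENABLED;

	t->sve_state = storage;
	t->sve_enabled = true;
	return TOY_OK;
}

/* Thread switch: if the flag says SVE is live, there must be storage to save into. */
static enum toy_result toy_thread_switch(const struct toy_thread *t)
{
	if (t->sve_enabled && t->sve_state == NULL)
		return TOY_SAVE_WITHOUT_STORAGE;

	return TOY_OK;
}
```

In this model both failures come from the flag and the rest of the thread's state disagreeing, which is why a single underlying bug could plausibly produce either report, though the model does nothing to show that it actually does.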
Coincidentally, I have been chasing some other things over the past week or two which might be different manifestations of the same underlying issue with current code, broadly in the area of the register state and task state getting out of sync.