Following kernel crash noticed while running selftests: ftrace: ftracetest-ktap on FVP models running stable-rc 6.5.8-rc2.
This is not an easy to reproduce issue and not seen on mainline and next. We are investigating this report.
Reported-by: Linux Kernel Functional Testing lkft@linaro.org Reported-by: Naresh Kamboju naresh.kamboju@linaro.org
Test log: ---------- kselftest: Running tests in ftrace TAP version 13 1..1 # timeout set to 0 # selftests: ftrace: ftracetest-ktap # TAP version 13 # 1..129 # ok 1 Basic trace file check # ok 2 Basic test for tracers # ok 3 Basic trace clock test # ok 4 Basic event tracing check # ok 5 Change the ringbuffer size # ok 6 Snapshot and tracing setting # ok 7 Snapshot and tracing_cpumask # ok 8 trace_pipe and trace_marker [ 471.689140] [ 471.689264] ********************************************************** [ 471.689422] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE ** [ 471.689574] ** ** [ 471.689716] ** trace_printk() being used. Allocating extra memory. ** [ 471.689878] ** ** [ 471.690031] ** This means that this is a DEBUG kernel and it is ** [ 471.690183] ** unsafe for production use. ** [ 471.690335] ** ** [ 471.690487] ** If you see this message and you are not debugging ** [ 471.690728] ** the kernel, report this immediately to your vendor! ** [ 471.690881] ** ** [ 471.691033] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE ** [ 471.691185] ********************************************************** [ 543.243648] hrtimer: interrupt took 11937170 ns [ 764.987161] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 764.992518] Mem abort info: [ 764.995330] ESR = 0x0000000096000044 [ 764.998562] EC = 0x25: DABT (current EL), IL = 32 bits [ 765.002434] SET = 0, FnV = 0 [ 765.005361] EA = 0, S1PTW = 0 [ 765.008327] FSC = 0x04: level 0 translation fault [ 765.012011] Data abort info: [ 765.014858] ISV = 0, ISS = 0x00000044, ISS2 = 0x00000000 [ 765.018797] CM = 0, WnR = 1, TnD = 0, TagAccess = 0 [ 765.022562] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 765.026438] user pgtable: 4k pages, 48-bit VAs, pgdp=00000008848bd000 [ 765.030782] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000 [ 765.037045] Internal error: Oops: 0000000096000044 [#1] PREEMPT SMP [ 765.038392] Modules linked in: ftrace_direct pl111_drm arm_spe_pmu drm_dma_helper crct10dif_ce panel_simple drm_kms_helper fuse drm dm_mod ip_tables x_tables [last unloaded: ftrace_direct] [ 765.044892] CPU: 3 PID: 808 Comm: rmmod Not tainted 6.5.8-rc2 #1 [ 765.046192] Hardware name: FVP Base RevC (DT) [ 765.047264] pstate: 234020c9 (nzCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 765.048693] pc : sve_save_state+0x4/0xf0 [ 765.049820] lr : fpsimd_save+0x1b8/0x218 [ 765.050933] sp : ffff800080a83ac0 [ 765.051871] x29: ffff800080a83ac0 x28: ffff000805257058 x27: 0000000000000001 [ 765.054108] x26: 0000000000000000 x25: ffffd7c64d332980 x24: 0000000000000000 [ 765.056341] x23: 0000000000000001 x22: ffff284232103000 x21: 0000000000000040 [ 765.058575] x20: ffff00087f7470b0 x19: ffffd7c64d6440b0 x18: 0000000000000000 [ 765.060811] x17: 0000000000000000 x16: ffff800080018000 x15: 0000000000000000 [ 765.063041] x14: 0000000000000000 x13: 0000000000000000 x12: 0000380a873b560e [ 765.065277] x11: ffffd7c64e0ae390 x10: ffff800080a83b10 x9 : ffffd7c64b5b7710 [ 765.067516] x8 : ffff800080a839b8 x7 : 000000000000001e x6 : ffff00080000c200 [ 765.069752] x5 : ffffd7c64b78cc30 x4 : 0000000000000000 x3 : 0000000000000000 [ 765.071983] x2 : 0000000000000001 x1 : ffff000805257820 x0 : 0000000000000880 [ 765.074221] Call trace: [ 765.075045] sve_save_state+0x4/0xf0 [ 765.076138] fpsimd_thread_switch+0x2c/0xe8 [ 765.077305] __switch_to+0x20/0x158 [ 765.078384] __schedule+0x2cc/0xb38 [ 765.079464] preempt_schedule_irq+0x44/0xa8 [ 765.080633] el1_interrupt+0x4c/0x68 [ 765.081691] el1h_64_irq_handler+0x18/0x28 [ 765.082829] el1h_64_irq+0x64/0x68 [ 765.083874] ftrace_return_to_handler+0x98/0x158 [ 765.085090] return_to_handler+0x20/0x48 [ 765.086205] do_sve_acc+0x64/0x128 [ 765.087272] el0_sve_acc+0x3c/0xa0 [ 765.088356] el0t_64_sync_handler+0x114/0x130 [ 765.089524] el0t_64_sync+0x190/0x198 [ 765.090712] Code: d51b4408 d65f03c0 d503201f d503245f (e5bb5800) [ 765.092024] ---[ end trace 0000000000000000 ]--- [ 765.904294] pstore: backend (efi_pstore) writing error (-5) [ 765.905531] note: rmmod[808] exited with irqs disabled
Links: test log link: - https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.5.y/build/v6.5.7-...
Details of tests: - https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2WrHwIIFdZp...
Build link: - https://storage.tuxsuite.com/public/linaro/lkft/builds/2WrHvExYOOOZVoxlISTdA...
-- Linaro LKFT https://lkft.linaro.org
On Tue, Oct 17, 2023 at 01:34:18PM +0530, Naresh Kamboju wrote:
Following kernel crash noticed while running selftests: ftrace: ftracetest-ktap on FVP models running stable-rc 6.5.8-rc2.
This is not an easy to reproduce issue and not seen on mainline and next. We are investigating this report.
To confirm have you seen this on other stables as well or is this only v6.5? For how long have you been seeing this?
[ 764.987161] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 765.074221] Call trace: [ 765.075045] sve_save_state+0x4/0xf0 [ 765.076138] fpsimd_thread_switch+0x2c/0xe8 [ 765.077305] __switch_to+0x20/0x158 [ 765.078384] __schedule+0x2cc/0xb38 [ 765.079464] preempt_schedule_irq+0x44/0xa8 [ 765.080633] el1_interrupt+0x4c/0x68 [ 765.081691] el1h_64_irq_handler+0x18/0x28 [ 765.082829] el1h_64_irq+0x64/0x68 [ 765.083874] ftrace_return_to_handler+0x98/0x158 [ 765.085090] return_to_handler+0x20/0x48 [ 765.086205] do_sve_acc+0x64/0x128 [ 765.087272] el0_sve_acc+0x3c/0xa0 [ 765.088356] el0t_64_sync_handler+0x114/0x130 [ 765.089524] el0t_64_sync+0x190/0x198
So something managed to get flagged as having SVE state without having the backing storage allocated. We *were* preempted in the SVE access handler which does the allocation but I can't see the path that would trigger that since we allocate the state before setting TIF_SVE. It's possible the compiler did something funky, a decode of the backtrace might help show that?
On Tue, Oct 17, 2023 at 01:22:14PM +0100, Mark Brown wrote:
On Tue, Oct 17, 2023 at 01:34:18PM +0530, Naresh Kamboju wrote:
Following kernel crash noticed while running selftests: ftrace: ftracetest-ktap on FVP models running stable-rc 6.5.8-rc2.
This is not an easy to reproduce issue and not seen on mainline and next. We are investigating this report.
To confirm have you seen this on other stables as well or is this only v6.5? For how long have you been seeing this?
[ 764.987161] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 765.074221] Call trace: [ 765.075045] sve_save_state+0x4/0xf0 [ 765.076138] fpsimd_thread_switch+0x2c/0xe8 [ 765.077305] __switch_to+0x20/0x158 [ 765.078384] __schedule+0x2cc/0xb38 [ 765.079464] preempt_schedule_irq+0x44/0xa8 [ 765.080633] el1_interrupt+0x4c/0x68 [ 765.081691] el1h_64_irq_handler+0x18/0x28 [ 765.082829] el1h_64_irq+0x64/0x68 [ 765.083874] ftrace_return_to_handler+0x98/0x158 [ 765.085090] return_to_handler+0x20/0x48 [ 765.086205] do_sve_acc+0x64/0x128 [ 765.087272] el0_sve_acc+0x3c/0xa0 [ 765.088356] el0t_64_sync_handler+0x114/0x130 [ 765.089524] el0t_64_sync+0x190/0x198
So something managed to get flagged as having SVE state without having the backing storage allocated. We *were* preempted in the SVE access handler which does the allocation but I can't see the path that would trigger that since we allocate the state before setting TIF_SVE. It's possible the compiler did something funky, a decode of the backtrace might help show that?
Having a vmlinux would be *really* helpful...
I tried generating fpsimd.o using the same config and the kernel.org crosstool GCC 13.2.0, code dump below. Assuming the code generation is the same as for Naresh, do_sve_acc+0x64 is at 0x191c, and is just after the call to sve_alloc().
So IIUC what's happening here is that sve_alloc() has been called, its entry has been traced, its body has been run, and in the process of tracing its return an IRQ has preempted the task and caused a reschedule.
So unless sve_alloc() failed, at the instant the IRQ was taken:
* `task->thread.sve_state` should be non-NULL * `task->thread_info.flags & TIF_SVE` should be 0
... so if `task->thread.sve_state` becomes NULL, I wonder if we end up accidentally blatting that as part of the context switch? I can't immedaitely see how/
Mark.
00000000000018b8 <do_sve_acc>: 18b8: d503201f nop 18bc: d503201f nop 18c0: d503233f paciasp 18c4: a9be7bfd stp x29, x30, [sp, #-32]! 18c8: 910003fd mov x29, sp 18cc: 1400000a b 18f4 <do_sve_acc+0x3c> 18d0: d503201f nop 18d4: f9408022 ldr x2, [x1, #256] 18d8: d2800003 mov x3, #0x0 // #0 18dc: 52800080 mov w0, #0x4 // #4 18e0: 52800021 mov w1, #0x1 // #1 18e4: 94000000 bl 0 <force_signal_inject> 18e8: a8c27bfd ldp x29, x30, [sp], #32 18ec: d50323bf autiasp 18f0: d65f03c0 ret 18f4: 90000000 adrp x0, 0 <system_cpucaps> 18f8: f9400000 ldr x0, [x0] 18fc: b6f7fec0 tbz x0, #62, 18d4 <do_sve_acc+0x1c> 1900: f9000bf3 str x19, [sp, #16] 1904: d5384113 mrs x19, sp_el0 1908: f9400260 ldr x0, [x19] 190c: 37b005e0 tbnz w0, #22, 19c8 <do_sve_acc+0x110> 1910: aa1303e0 mov x0, x19 1914: 52800021 mov w1, #0x1 // #1 1918: 94000000 bl 1140 <sve_alloc> 191c: f946be60 ldr x0, [x19, #3448] 1920: b4000480 cbz x0, 19b0 <do_sve_acc+0xf8> 1924: 97fffb59 bl 688 <get_cpu_fpsimd_context> 1928: 14000016 b 1980 <do_sve_acc+0xc8> 192c: d2a01000 mov x0, #0x800000 // #8388608 1930: f8e03260 ldsetal x0, x0, [x19] 1934: 36b80040 tbz w0, #23, 193c <do_sve_acc+0x84> 1938: d4210000 brk #0x800 193c: d5384113 mrs x19, sp_el0 1940: f9400260 ldr x0, [x19] 1944: 371802c0 tbnz w0, #3, 199c <do_sve_acc+0xe4> 1948: b94d8a73 ldr w19, [x19, #3464] 194c: 53047e73 lsr w19, w19, #4 1950: 51000673 sub w19, w19, #0x1 1954: aa1303e0 mov x0, x19 1958: 94000000 bl 0 <sve_set_vq> 195c: aa1303e1 mov x1, x19 1960: 52800020 mov w0, #0x1 // #1 1964: 94000000 bl 0 <sve_flush_live> 1968: 97fffbb4 bl 838 <fpsimd_bind_task_to_cpu> 196c: 97fffb61 bl 6f0 <put_cpu_fpsimd_context> 1970: f9400bf3 ldr x19, [sp, #16] 1974: a8c27bfd ldp x29, x30, [sp], #32 1978: d50323bf autiasp 197c: d65f03c0 ret 1980: f9800271 prfm pstl1strm, [x19] 1984: c85f7e60 ldxr x0, [x19] 1988: b2690001 orr x1, x0, #0x800000 198c: c802fe61 stlxr w2, x1, [x19] 1990: 35ffffa2 cbnz w2, 1984 <do_sve_acc+0xcc> 1994: d5033bbf dmb ish 1998: 17ffffe7 b 1934 <do_sve_acc+0x7c> 199c: aa1303e0 mov x0, x19 19a0: 97fffaf2 bl 568 <fpsimd_to_sve> 19a4: 52800040 mov w0, #0x2 // #2 19a8: b90d7260 str w0, [x19, #3440] 19ac: 17fffff0 b 196c <do_sve_acc+0xb4> 19b0: 52800120 mov w0, #0x9 // #9 19b4: 94000000 bl 0 <force_sig> 19b8: f9400bf3 ldr x19, [sp, #16] 19bc: a8c27bfd ldp x29, x30, [sp], #32 19c0: d50323bf autiasp 19c4: d65f03c0 ret 19c8: d4210000 brk #0x800 19cc: f9400bf3 ldr x19, [sp, #16] 19d0: 17ffffc1 b 18d4 <do_sve_acc+0x1c> 19d4: d503201f nop 19d8: d503201f nop 19dc: d503201f nop
On Tue, Oct 17, 2023 at 02:42:01PM +0100, Mark Rutland wrote:
So unless sve_alloc() failed, at the instant the IRQ was taken:
- `task->thread.sve_state` should be non-NULL
- `task->thread_info.flags & TIF_SVE` should be 0
... so if `task->thread.sve_state` becomes NULL, I wonder if we end up accidentally blatting that as part of the context switch? I can't immedaitely see how/
We're possibly missing a fpsimd_bind_task_to_cpu() somewhere since all the hilarity with KVM means that we don't use the task_struct to save state, though the task that's taking the SVE trap shouldn't be impacted there if it didn't set TIF_SVE yet. There *is* a window where we have TIF_SVE set but didn't yet do the rebind but that should be in a preempt disabled section.
On Tue, 17 Oct 2023 at 17:52, Mark Brown broonie@kernel.org wrote:
On Tue, Oct 17, 2023 at 01:34:18PM +0530, Naresh Kamboju wrote:
Following kernel crash noticed while running selftests: ftrace: ftracetest-ktap on FVP models running stable-rc 6.5.8-rc2.
This is not an easy to reproduce issue and not seen on mainline and next. We are investigating this report.
To confirm have you seen this on other stables as well or is this only v6.5? For how long have you been seeing this?
This is only seen on 6.5.8-rc2 and seen only once. I have checked on mainline / next and other stable branches and this crash is not seen anywhere else.
However, I will keep checking them on other branches and next and mainline.
[ 764.987161] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 765.074221] Call trace: [ 765.075045] sve_save_state+0x4/0xf0 [ 765.076138] fpsimd_thread_switch+0x2c/0xe8 [ 765.077305] __switch_to+0x20/0x158 [ 765.078384] __schedule+0x2cc/0xb38 [ 765.079464] preempt_schedule_irq+0x44/0xa8 [ 765.080633] el1_interrupt+0x4c/0x68 [ 765.081691] el1h_64_irq_handler+0x18/0x28 [ 765.082829] el1h_64_irq+0x64/0x68 [ 765.083874] ftrace_return_to_handler+0x98/0x158 [ 765.085090] return_to_handler+0x20/0x48 [ 765.086205] do_sve_acc+0x64/0x128 [ 765.087272] el0_sve_acc+0x3c/0xa0 [ 765.088356] el0t_64_sync_handler+0x114/0x130 [ 765.089524] el0t_64_sync+0x190/0x198
So something managed to get flagged as having SVE state without having the backing storage allocated. We *were* preempted in the SVE access handler which does the allocation but I can't see the path that would trigger that since we allocate the state before setting TIF_SVE. It's possible the compiler did something funky, a decode of the backtrace might help show that?
We have not uploaded vmlinux and System.map to this specific build. However, I have requested to have these files get uploaded for upcoming builds.
- Naresh
On Thu, Oct 19, 2023 at 09:07:02PM +0530, Naresh Kamboju wrote:
On Tue, 17 Oct 2023 at 17:52, Mark Brown broonie@kernel.org wrote:
On Tue, Oct 17, 2023 at 01:34:18PM +0530, Naresh Kamboju wrote:
Following kernel crash noticed while running selftests: ftrace: ftracetest-ktap on FVP models running stable-rc 6.5.8-rc2.
This is not an easy to reproduce issue and not seen on mainline and next. We are investigating this report.
To confirm have you seen this on other stables as well or is this only v6.5? For how long have you been seeing this?
This is only seen on 6.5.8-rc2 and seen only once. I have checked on mainline / next and other stable branches and this crash is not seen anywhere else.
However, I will keep checking them on other branches and next and mainline.
I see. If the bug has never been reproduced I don't think we can draw any conclusion about which branches are affected, I would be a bit surprised by something that affects v6.5 only.
Hello!
On Thu, 19 Oct 2023 at 11:26, Mark Brown broonie@kernel.org wrote:
On Thu, Oct 19, 2023 at 09:07:02PM +0530, Naresh Kamboju wrote:
On Tue, 17 Oct 2023 at 17:52, Mark Brown broonie@kernel.org wrote:
[...]
To confirm have you seen this on other stables as well or is this only v6.5? For how long have you been seeing this?
This is only seen on 6.5.8-rc2 and seen only once. I have checked on mainline / next and other stable branches and this crash is not seen anywhere else.
However, I will keep checking them on other branches and next and mainline.
I see. If the bug has never been reproduced I don't think we can draw any conclusion about which branches are affected, I would be a bit surprised by something that affects v6.5 only.
We have been seeing this problem in other instances, specifically on the following kernels: * 5.15.132, 5.15.134-rc1, 5.15.135, 5.15.136-rc1, 5.15.142, 5.15.145-rc1 * 6.1.42, 6.1.43, 6.1.51-rc1, 6.1.56-rc1, 6.1.59-rc1, 6.1.63 * 6.3.10, 6.3.11 * 6.4.7 * 6.5.2, 6.5.10-rc2
Most recent case is for the current 5.15 RC. Decoded stack trace is here: -----8<----- <4>[ 29.297166] ------------[ cut here ]------------ <4>[ 29.298039] WARNING: CPU: 1 PID: 220 at arch/arm64/kernel/fpsimd.c:950 do_sve_acc (/builds/linux/arch/arm64/kernel/fpsimd.c:950 (discriminator 1)) <4>[ 29.300418] Modules linked in: fuse drm dm_mod ip_tables x_tables <4>[ 29.302720] CPU: 1 PID: 220 Comm: systemd-udevd Not tainted 5.15.145-rc1 #1 <4>[ 29.303601] Hardware name: linux,dummy-virt (DT) <4>[ 29.304804] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) <4>[ 29.305615] pc : do_sve_acc (/builds/linux/arch/arm64/kernel/fpsimd.c:950 (discriminator 1)) <4>[ 29.306499] lr : do_sve_acc (/builds/linux/arch/arm64/include/asm/jump_label.h:38 /builds/linux/arch/arm64/include/asm/lse.h:24 /builds/linux/arch/arm64/include/asm/atomic.h:86 /builds/linux/include/linux/atomic/atomic-long.h:335 /builds/linux/include/asm-generic/bitops/atomic.h:42 /builds/linux/include/asm-generic/bitops/instrumented-atomic.h:71 /builds/linux/include/linux/thread_info.h:108 /builds/linux/arch/arm64/kernel/fpsimd.c:949) <4>[ 29.307466] sp : ffff8000083b3e60 <4>[ 29.308447] x29: ffff8000083b3e60 x28: ffff0000c2f42000 x27: 0000000000000000 <4>[ 29.310500] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 <4>[ 29.312034] x23: 0000000080001000 x22: 0000ffffb3c8a454 x21: 00000000ffffffff <4>[ 29.313338] x20: 0000000064000000 x19: ffff0000c2f42000 x18: 0000000000000000 <4>[ 29.314636] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 <4>[ 29.316090] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 <4>[ 29.317096] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffb047a36178f4 <4>[ 29.318131] x8 : ffff0000c2c4d888 x7 : 0000000000000000 x6 : 0000000000000200 <4>[ 29.319171] x5 : ffffb047a5c6f260 x4 : 0000000000000200 x3 : 0000000000000001 <4>[ 29.320180] x2 : 0000000000000000 x1 : ffff4fb95a329000 x0 : 0000000000800800 <4>[ 29.321599] Call trace: <4>[ 29.321996] do_sve_acc (/builds/linux/arch/arm64/kernel/fpsimd.c:950 (discriminator 1)) <4>[ 29.322674] el0_sve_acc (/builds/linux/arch/arm64/include/asm/daifflags.h:28 /builds/linux/arch/arm64/kernel/entry-common.c:129 /builds/linux/arch/arm64/kernel/entry-common.c:138 /builds/linux/arch/arm64/kernel/entry-common.c:529) <4>[ 29.323273] el0t_64_sync_handler (/builds/linux/arch/arm64/kernel/entry-common.c:639) <4>[ 29.323951] el0t_64_sync (/builds/linux/arch/arm64/kernel/entry.S:584) <4>[ 29.324598] ---[ end trace b31d3b95f436e1b5 ]--- ----->8-----
Complete log (and reproducer) here: https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2ZoYpOR79Ej...
Artifacts (including vmlinux and System.map) can be found here: https://storage.tuxsuite.com/public/linaro/lkft/builds/2ZoYoIewSKk26jISEo4Ql...
Greetings!
Daniel Díaz daniel.diaz@linaro.org
Hi Daniel,
Would you know what ftrace selftest testcase script is running when this crash happened? I think it depends on the selftest version but I don't know which one you run.
Thank you,
On Wed, 20 Dec 2023 18:06:53 -0600 Daniel Díaz daniel.diaz@linaro.org wrote:
Hello!
On Thu, 19 Oct 2023 at 11:26, Mark Brown broonie@kernel.org wrote:
On Thu, Oct 19, 2023 at 09:07:02PM +0530, Naresh Kamboju wrote:
On Tue, 17 Oct 2023 at 17:52, Mark Brown broonie@kernel.org wrote:
[...]
To confirm have you seen this on other stables as well or is this only v6.5? For how long have you been seeing this?
This is only seen on 6.5.8-rc2 and seen only once. I have checked on mainline / next and other stable branches and this crash is not seen anywhere else.
However, I will keep checking them on other branches and next and mainline.
I see. If the bug has never been reproduced I don't think we can draw any conclusion about which branches are affected, I would be a bit surprised by something that affects v6.5 only.
We have been seeing this problem in other instances, specifically on the following kernels:
- 5.15.132, 5.15.134-rc1, 5.15.135, 5.15.136-rc1, 5.15.142, 5.15.145-rc1
- 6.1.42, 6.1.43, 6.1.51-rc1, 6.1.56-rc1, 6.1.59-rc1, 6.1.63
- 6.3.10, 6.3.11
- 6.4.7
- 6.5.2, 6.5.10-rc2
Most recent case is for the current 5.15 RC. Decoded stack trace is here: -----8<----- <4>[ 29.297166] ------------[ cut here ]------------ <4>[ 29.298039] WARNING: CPU: 1 PID: 220 at arch/arm64/kernel/fpsimd.c:950 do_sve_acc (/builds/linux/arch/arm64/kernel/fpsimd.c:950 (discriminator 1)) <4>[ 29.300418] Modules linked in: fuse drm dm_mod ip_tables x_tables <4>[ 29.302720] CPU: 1 PID: 220 Comm: systemd-udevd Not tainted 5.15.145-rc1 #1 <4>[ 29.303601] Hardware name: linux,dummy-virt (DT) <4>[ 29.304804] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) <4>[ 29.305615] pc : do_sve_acc (/builds/linux/arch/arm64/kernel/fpsimd.c:950 (discriminator 1)) <4>[ 29.306499] lr : do_sve_acc (/builds/linux/arch/arm64/include/asm/jump_label.h:38 /builds/linux/arch/arm64/include/asm/lse.h:24 /builds/linux/arch/arm64/include/asm/atomic.h:86 /builds/linux/include/linux/atomic/atomic-long.h:335 /builds/linux/include/asm-generic/bitops/atomic.h:42 /builds/linux/include/asm-generic/bitops/instrumented-atomic.h:71 /builds/linux/include/linux/thread_info.h:108 /builds/linux/arch/arm64/kernel/fpsimd.c:949) <4>[ 29.307466] sp : ffff8000083b3e60 <4>[ 29.308447] x29: ffff8000083b3e60 x28: ffff0000c2f42000 x27: 0000000000000000 <4>[ 29.310500] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 <4>[ 29.312034] x23: 0000000080001000 x22: 0000ffffb3c8a454 x21: 00000000ffffffff <4>[ 29.313338] x20: 0000000064000000 x19: ffff0000c2f42000 x18: 0000000000000000 <4>[ 29.314636] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 <4>[ 29.316090] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 <4>[ 29.317096] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffb047a36178f4 <4>[ 29.318131] x8 : ffff0000c2c4d888 x7 : 0000000000000000 x6 : 0000000000000200 <4>[ 29.319171] x5 : ffffb047a5c6f260 x4 : 0000000000000200 x3 : 0000000000000001 <4>[ 29.320180] x2 : 0000000000000000 x1 : ffff4fb95a329000 x0 : 0000000000800800 <4>[ 29.321599] Call trace: <4>[ 29.321996] do_sve_acc (/builds/linux/arch/arm64/kernel/fpsimd.c:950 (discriminator 1)) <4>[ 29.322674] el0_sve_acc (/builds/linux/arch/arm64/include/asm/daifflags.h:28 /builds/linux/arch/arm64/kernel/entry-common.c:129 /builds/linux/arch/arm64/kernel/entry-common.c:138 /builds/linux/arch/arm64/kernel/entry-common.c:529) <4>[ 29.323273] el0t_64_sync_handler (/builds/linux/arch/arm64/kernel/entry-common.c:639) <4>[ 29.323951] el0t_64_sync (/builds/linux/arch/arm64/kernel/entry.S:584) <4>[ 29.324598] ---[ end trace b31d3b95f436e1b5 ]--- ----->8-----
Complete log (and reproducer) here: https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2ZoYpOR79Ej...
Artifacts (including vmlinux and System.map) can be found here: https://storage.tuxsuite.com/public/linaro/lkft/builds/2ZoYoIewSKk26jISEo4Ql...
Greetings!
Daniel Díaz daniel.diaz@linaro.org
Hello, Masami-san!
Sorry for the late reply -- winter break, expired artifacts, life, etc.
On Wed, 20 Dec 2023 at 18:34, Masami Hiramatsu mhiramat@kernel.org wrote:
Hi Daniel, Would you know what ftrace selftest testcase script is running when this crash happened? I think it depends on the selftest version but I don't know which one you run.
If I remember correctly, that actually happened during LTP testing. I found a new instance on 6.1 today, on ltp-controllers, so there were no kselftests involved.
I'll share more details in a bit.
Greetings!
Daniel Díaz daniel.diaz@linaro.org
On Wed, Dec 20, 2023 at 06:06:53PM -0600, Daniel Díaz wrote:
We have been seeing this problem in other instances, specifically on the following kernels:
- 5.15.132, 5.15.134-rc1, 5.15.135, 5.15.136-rc1, 5.15.142, 5.15.145-rc1
- 6.1.42, 6.1.43, 6.1.51-rc1, 6.1.56-rc1, 6.1.59-rc1, 6.1.63
- 6.3.10, 6.3.11
- 6.4.7
- 6.5.2, 6.5.10-rc2
This is a huge range of kernels with some substantial reworkings of the FP code, and I do note that v5.15 appears to have backported only one change there (an incidental one related to ESR handling). This makes me think this is likely to be something that's been sitting there for a very long time and is unrelated to those versions and any changes that went into them. I see you're still testing back to v4.19 which suggests an issue introduced between v5.10 and v5.15, my change cccb78ce89c45a4 ("arm64/sve: Rework SVE access trap to convert state in registers") does jump out there though I don't immediately see what the issue would be.
Looking at the list of versions you've posted the earliest is from the very end of June with others in July, was there something that changed in your test environment in broadly that time? I see that the logs you and Naresh posted are both using a Debian 12/Bookworm based root filesystem and that was released a couple of weeks before this started appearing, Bookworm introduced glibc usage of SVE which makes usage much more common. Is this perhaps tied to you upgrading your root filesystems to Bookworm or were you tracking testing before then?
Most recent case is for the current 5.15 RC. Decoded stack trace is here: -----8<----- <4>[ 29.297166] ------------[ cut here ]------------ <4>[ 29.298039] WARNING: CPU: 1 PID: 220 at arch/arm64/kernel/fpsimd.c:950 do_sve_acc (/builds/linux/arch/arm64/kernel/fpsimd.c:950 (discriminator 1))
That's an assert that we shouldn't take a SVE trap when SVE is alreadly enabled for the thread. The backtrace Naresh originally supplied was a NULL pointer dereference attempting to save SVE state (indicating that we think we're trying to save SVE state but don't have any storage allocated for it) during thread switch. It's very plausible that the two are the same underlying issue but it's also not 100% a given. Can you double check exactly how similar the various issues you are seeing are please?
I have coincidentally been chasing some other stuff in the past week or two which might potentially be different manifestations of the same underlying issue with current code, broadly in the area of the register state and task state getting out of sync.
Hello!
On 20/12/23 7:06 p. m., Mark Brown wrote:
On Wed, Dec 20, 2023 at 06:06:53PM -0600, Daniel Díaz wrote:
We have been seeing this problem in other instances, specifically on the following kernels:
- 5.15.132, 5.15.134-rc1, 5.15.135, 5.15.136-rc1, 5.15.142, 5.15.145-rc1
- 6.1.42, 6.1.43, 6.1.51-rc1, 6.1.56-rc1, 6.1.59-rc1, 6.1.63
- 6.3.10, 6.3.11
- 6.4.7
- 6.5.2, 6.5.10-rc2
[...]> Looking at the list of versions you've posted the earliest is from the
very end of June with others in July, was there something that changed in your test environment in broadly that time?
Yes! Important to note here is that I started collecting that date around that time, second half of 2023. It's possible that the same problem was there before but I didn't register that.
I see that the logs you and Naresh posted are both using a Debian 12/Bookworm based root filesystem and that was released a couple of weeks before this started appearing, Bookworm introduced glibc usage of SVE which makes usage much more common. Is this perhaps tied to you upgrading your root filesystems to Bookworm or were you tracking testing before then?
I don't have an easy way to swap the Debian version right now, I'll have to ask around, but that might be a good experiment.
This message appears so sporadically that it would be hard to determine with 100% certainty if the changes we make really fix the issue. We can try to figure out just how sporadic that error is, though, and go from there, but the numbers looks slim with this happening once every so many weeks.
Most recent case is for the current 5.15 RC. Decoded stack trace is here: -----8<----- <4>[ 29.297166] ------------[ cut here ]------------ <4>[ 29.298039] WARNING: CPU: 1 PID: 220 at arch/arm64/kernel/fpsimd.c:950 do_sve_acc (/builds/linux/arch/arm64/kernel/fpsimd.c:950 (discriminator 1))
That's an assert that we shouldn't take a SVE trap when SVE is alreadly enabled for the thread. The backtrace Naresh originally supplied was a NULL pointer dereference attempting to save SVE state (indicating that we think we're trying to save SVE state but don't have any storage allocated for it) during thread switch. It's very plausible that the two are the same underlying issue but it's also not 100% a given. Can you double check exactly how similar the various issues you are seeing are please?
I'm not sure of how similar the test cases are, but I'm inclined to think its occurrence is not related specifically to ftrace or one test case. It looks like these appear on FVP and Qemu-arm64, on several LTS branches, and while running different set of tests. On today's episode, we had it happening in the middle of ltp-controllers' cgroup_fj_stress:
-----8<----- <4>[ 619.431962] ------------[ cut here ]------------ <4>[ 619.432530] WARNING: CPU: 0 PID: 10382 at arch/arm64/kernel/fpsimd.c:1403 do_sve_acc (arch/arm64/kernel/fpsimd.c:1402) <4>[ 619.433889] Modules linked in: crct10dif_ce sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 fuse drm dm_mod ip_tables x_tables <4>[ 619.436649] CPU: 0 PID: 10382 Comm: cgroup_fj_stres Not tainted 6.1.76-rc2 #1 <4>[ 619.437435] Hardware name: linux,dummy-virt (DT) <4>[ 619.438319] pstate: 42400009 (nZcv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--) <4>[ 619.439680] pc : do_sve_acc (arch/arm64/kernel/fpsimd.c:1402) <4>[ 619.440003] lr : do_sve_acc (arch/arm64/kernel/fpsimd.c:227 arch/arm64/kernel/fpsimd.c:253 arch/arm64/kernel/fpsimd.c:1400) <4>[ 619.440572] sp : ffff80000f3e3e40 <4>[ 619.441022] x29: ffff80000f3e3e40 x28: ffff0000c4e42100 x27: 0000000000000000 <4>[ 619.441860] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 <4>[ 619.442789] x23: 0000000080001000 x22: 0000ffffaa47a454 x21: 0000000000000001 <4>[ 619.443893] x20: ffffb6cd7e13c0b0 x19: ffff0000c4e42100 x18: 0000000000000000 <4>[ 619.445033] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 <4>[ 619.446017] x14: 0000000000000000 x13: ffff80000f3e0000 x12: ffff80000f3e4000 <4>[ 619.446991] x11: 25fdfcd46891cd00 x10: 0000000000000200 x9 : 0000000000000000 <4>[ 619.447614] x8 : 0000000000800000 x7 : 0000000000000000 x6 : 000000000000003f <4>[ 619.448285] x5 : 0000000000000040 x4 : 0000000000000000 x3 : 0000000000000000 <4>[ 619.448963] x2 : 0000000000000008 x1 : ffffb6cd7c019eac x0 : ffffb6cd7c019eac <4>[ 619.449852] Call trace: <4>[ 619.450219] do_sve_acc (arch/arm64/kernel/fpsimd.c:1402) <4>[ 619.450983] el0_sve_acc (arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:142 arch/arm64/kernel/entry-common.c:550) <4>[ 619.451394] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:0) <4>[ 619.451849] el0t_64_sync (arch/arm64/kernel/entry.S:585) <4>[ 619.452333] ---[ end trace 0000000000000000 ]--- ----->8-----
The full log, reproducers, artifacts, can be found here:
https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2bgeMqo59p7...
Greetings!
Daniel Díaz daniel.diaz@linaro.org
On Tue, Jan 30, 2024 at 06:15:11PM -0600, Daniel Díaz wrote:
That's an assert that we shouldn't take a SVE trap when SVE is alreadly enabled for the thread. The backtrace Naresh originally supplied was a NULL pointer dereference attempting to save SVE state
...
given. Can you double check exactly how similar the various issues you are seeing are please?
I'm not sure of how similar the test cases are, but I'm inclined to think its occurrence is not related specifically to ftrace or one test case. It looks like these appear on FVP and Qemu-arm64, on several LTS
That question is not about the test that is running, that question is about the error that the kernel reports. There were two different but potentially related kernel errors in the reports that you and Naresh sent, I'm asking how similar the other cases actually are - is it more instances of these two issues, are there possibly other issues?
On Tue, 17 Oct 2023 13:34:18 +0530 Naresh Kamboju naresh.kamboju@linaro.org wrote:
Following kernel crash noticed while running selftests: ftrace: ftracetest-ktap on FVP models running stable-rc 6.5.8-rc2.
This is not an easy to reproduce issue and not seen on mainline and next. We are investigating this report.
Reported-by: Linux Kernel Functional Testing lkft@linaro.org Reported-by: Naresh Kamboju naresh.kamboju@linaro.org
Test log:
kselftest: Running tests in ftrace TAP version 13 1..1 # timeout set to 0 # selftests: ftrace: ftracetest-ktap # TAP version 13 # 1..129 # ok 1 Basic trace file check # ok 2 Basic test for tracers # ok 3 Basic trace clock test # ok 4 Basic event tracing check # ok 5 Change the ringbuffer size # ok 6 Snapshot and tracing setting # ok 7 Snapshot and tracing_cpumask # ok 8 trace_pipe and trace_marker
Sad part about ktap format is that it doesn't show you what test crashed. I had to run it to see what was next. That was:
ok 9 Test ftrace direct functions against tracers
Adding Mark Rutland and Florent, as they worked on the direct trampolines on Arm 64.
-- Steve
[ 471.689140] [ 471.689264] ********************************************************** [ 471.689422] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE ** [ 471.689574] ** ** [ 471.689716] ** trace_printk() being used. Allocating extra memory. ** [ 471.689878] ** ** [ 471.690031] ** This means that this is a DEBUG kernel and it is ** [ 471.690183] ** unsafe for production use. ** [ 471.690335] ** ** [ 471.690487] ** If you see this message and you are not debugging ** [ 471.690728] ** the kernel, report this immediately to your vendor! ** [ 471.690881] ** ** [ 471.691033] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE ** [ 471.691185] ********************************************************** [ 543.243648] hrtimer: interrupt took 11937170 ns [ 764.987161] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 764.992518] Mem abort info: [ 764.995330] ESR = 0x0000000096000044 [ 764.998562] EC = 0x25: DABT (current EL), IL = 32 bits [ 765.002434] SET = 0, FnV = 0 [ 765.005361] EA = 0, S1PTW = 0 [ 765.008327] FSC = 0x04: level 0 translation fault [ 765.012011] Data abort info: [ 765.014858] ISV = 0, ISS = 0x00000044, ISS2 = 0x00000000 [ 765.018797] CM = 0, WnR = 1, TnD = 0, TagAccess = 0 [ 765.022562] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 765.026438] user pgtable: 4k pages, 48-bit VAs, pgdp=00000008848bd000 [ 765.030782] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000 [ 765.037045] Internal error: Oops: 0000000096000044 [#1] PREEMPT SMP [ 765.038392] Modules linked in: ftrace_direct pl111_drm arm_spe_pmu drm_dma_helper crct10dif_ce panel_simple drm_kms_helper fuse drm dm_mod ip_tables x_tables [last unloaded: ftrace_direct] [ 765.044892] CPU: 3 PID: 808 Comm: rmmod Not tainted 6.5.8-rc2 #1 [ 765.046192] Hardware name: FVP Base RevC (DT) [ 765.047264] pstate: 234020c9 (nzCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 765.048693] pc : sve_save_state+0x4/0xf0 [ 765.049820] lr : fpsimd_save+0x1b8/0x218 [ 765.050933] sp : ffff800080a83ac0 [ 765.051871] x29: ffff800080a83ac0 x28: ffff000805257058 x27: 0000000000000001 [ 765.054108] x26: 0000000000000000 x25: ffffd7c64d332980 x24: 0000000000000000 [ 765.056341] x23: 0000000000000001 x22: ffff284232103000 x21: 0000000000000040 [ 765.058575] x20: ffff00087f7470b0 x19: ffffd7c64d6440b0 x18: 0000000000000000 [ 765.060811] x17: 0000000000000000 x16: ffff800080018000 x15: 0000000000000000 [ 765.063041] x14: 0000000000000000 x13: 0000000000000000 x12: 0000380a873b560e [ 765.065277] x11: ffffd7c64e0ae390 x10: ffff800080a83b10 x9 : ffffd7c64b5b7710 [ 765.067516] x8 : ffff800080a839b8 x7 : 000000000000001e x6 : ffff00080000c200 [ 765.069752] x5 : ffffd7c64b78cc30 x4 : 0000000000000000 x3 : 0000000000000000 [ 765.071983] x2 : 0000000000000001 x1 : ffff000805257820 x0 : 0000000000000880 [ 765.074221] Call trace: [ 765.075045] sve_save_state+0x4/0xf0 [ 765.076138] fpsimd_thread_switch+0x2c/0xe8 [ 765.077305] __switch_to+0x20/0x158 [ 765.078384] __schedule+0x2cc/0xb38 [ 765.079464] preempt_schedule_irq+0x44/0xa8 [ 765.080633] el1_interrupt+0x4c/0x68 [ 765.081691] el1h_64_irq_handler+0x18/0x28 [ 765.082829] el1h_64_irq+0x64/0x68 [ 765.083874] ftrace_return_to_handler+0x98/0x158 [ 765.085090] return_to_handler+0x20/0x48 [ 765.086205] do_sve_acc+0x64/0x128 [ 765.087272] el0_sve_acc+0x3c/0xa0 [ 765.088356] el0t_64_sync_handler+0x114/0x130 [ 765.089524] el0t_64_sync+0x190/0x198 [ 765.090712] Code: d51b4408 d65f03c0 d503201f d503245f (e5bb5800) [ 765.092024] ---[ end trace 0000000000000000 ]--- [ 765.904294] pstore: backend (efi_pstore) writing error (-5) [ 765.905531] note: rmmod[808] exited with irqs disabled
Links: test log link:
Details of tests:
Build link:
-- Linaro LKFT https://lkft.linaro.org