The following kernel oops was noticed while running the kselftests arm64 tests on qemu-arm64 on the stable-rc linux-6.6.y branch.
I have rebuilt vmlinux with the same config and decoded the stack dump.
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Logs:
====
# selftests: arm64: fp-stress
# TAP version 13
# 1..32
# # 2 CPUs, 5 SVE VLs, 5 SME VLs, SME2 absent
# # Will run for 10s
# # Started FPSIMD-0-0
<>
# # SVE-VL-64-0: Expected [3904000039044000390480003904c0003904000139044001390480013904c0013904000239044002390480023904c0023904000339044003390480033904c003]
<>
# # Finishing up...
# # SSVE-VL-16-0: Terminated by signal 15, no error, iterations=50467, signals=9
# # SVE-VL-16-0: Terminated by signal 15, no error, iterations=56669, signals=9
# # FPSIMD-1-0: Terminated by signal 15, no error, iterations=20632, signals=10
# # FPSIMD-0-0: Terminated by signal 15, no error, iterations=21549, signals=9
# # SSVE-VL-16-1: Terminated by signal 15, no error, iterations=49077, signals=10
# # ZA-VL-16-0: Terminated by signal 15, no error, iterations=24878, signals=9
# # ZA-VL-16-1: Terminated by signal 15, no error, iterations=22452, signals=10
# # SVE-VL-16-1: Terminated by signal 15, no error, iterations=49039, signals=10
# ok 1 FPSIMD-0-0
# # SVE-VL-256-<1>[ 88.160313] Unable to handle kernel paging request at virtual address 00550f0344550f02
<1>[ 88.161949] Mem abort info:
<1>[ 88.162574] ESR = 0x0000000096000004
<1>[ 88.163283] EC = 0x25: DABT (current EL), IL = 32 bits
<1>[ 88.164330] SET = 0, FnV = 0
<1>[ 88.164930] EA = 0, S1PTW = 0
<1>[ 88.165854] FSC = 0x04: level 0 translation fault
<1>[ 88.166852] Data abort info:
<1>[ 88.167463] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
<1>[ 88.168566] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
<1>[ 88.169558] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
<1>[ 88.170580] [00550f0344550f02] address between user and kernel address ranges
<0>[ 88.172317] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
<4>[ 88.173833] Modules linked in: crct10dif_ce sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 fuse drm backlight dm_mod ip_tables x_tables
<4>[ 88.177601] CPU: 1 PID: 1 Comm: systemd Not tainted 6.6.1-rc1 #1
0<4>[ 88.178992] Hardware name: linux,dummy-virt (DT)
<4>[ 88.180334] pstate: 224000c9 (nzCv daIF +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
<4>[ 88.181149] pc : percpu_ref_get_many (include/linux/percpu-refcount.h:174 (discriminator 2) include/linux/percpu-refcount.h:204 (discriminator 2))
<4>[ 88.182885] lr : percpu_ref_get_many (include/linux/percpu-refcount.h:174 (discriminator 2) include/linux/percpu-refcount.h:204 (discriminator 2))
<4>[ 88.183621] sp : ffff80008000bd80
<4>[ 88.184039] x29: ffff80008000bd80 x28: ffff0000c02c8000 x27: 000000000000000a
<4>[ 88.185245] x26: 0000000000000000 x25: 0000000000000002 x24: 0000000000000000
<4>[ 88.187718] x23: ffff0000c2306f40 x22: 0000000000000000 x21: 44550f0344550f02
<4>[ 88.188696] x20: 44550f0344550f02 x19: 0000000000000001 x18: 0000000000000000
<4>[ 88.189556] x17: ffff436cf77c7000 x16: ffff800080008000 x15: 0000000000000000
<4>[ 88.190568] x14: 0000000000000000 x13: ffff0000c2290026 x12: ffff80008002bcb4
<4>[ 88.191589] x11: 0000000000000040 x10: ffff0000c00ea0a8 x9 : ffffbc9405d93864
<4>[ 88.192573] x8 : ffff80008000bcd8 x7 : ffff0000c09fe000 x6 : ffff436cf77c7000
<4>[ 88.193523] x5 : ffff80008000bd40 x4 : fffffffffffffef8 x3 : 0000000000000040
<4>[ 88.194472] x2 : 0000000000000002 x1 : ffff0000c02c8000 x0 : 0000000000000001
<4>[ 88.195706] Call trace:
<4>[ 88.196098] percpu_ref_get_many (include/linux/percpu-refcount.h:174 (discriminator 2) include/linux/percpu-refcount.h:204 (discriminator 2))
<4>[ 88.196815] refill_obj_stock (mm/memcontrol.c:3339 (discriminator 2))
<4>[ 88.197367] obj_cgroup_uncharge (mm/memcontrol.c:3406)
<4>[ 88.197835] kmem_cache_free (include/linux/mm.h:1630 include/linux/mm.h:1849 include/linux/mm.h:1859 mm/slab.h:208 mm/slab.h:572 mm/slub.c:3804 mm/slub.c:3831)
<4>[ 88.198407] put_pid.part.0 (kernel/pid.c:118)
<4>[ 88.198870] delayed_put_pid (kernel/pid.c:127)
<4>[ 88.200527] rcu_core (arch/arm64/include/asm/preempt.h:13 (discriminator 1) kernel/rcu/tree.c:2146 (discriminator 1) kernel/rcu/tree.c:2403 (discriminator 1))
<4>[ 88.200978] rcu_core_si (kernel/rcu/tree.c:2421)
<4>[ 88.201972] __do_softirq (arch/arm64/include/asm/jump_label.h:21 include/linux/jump_label.h:207 include/trace/events/irq.h:142 kernel/softirq.c:554)
<4>[ 88.202587] ____do_softirq (arch/arm64/kernel/irq.c:81)
<4>[ 88.203049] call_on_irq_stack (arch/arm64/kernel/entry.S:892)
<4>[ 88.203544] do_softirq_own_stack (arch/arm64/kernel/irq.c:86)
<4>[ 88.204008] irq_exit_rcu (arch/arm64/include/asm/percpu.h:44 kernel/softirq.c:612 kernel/softirq.c:634 kernel/softirq.c:644)
<4>[ 88.204401] el1_interrupt (arch/arm64/include/asm/current.h:19 arch/arm64/kernel/entry-common.c:246 arch/arm64/kernel/entry-common.c:505 arch/arm64/kernel/entry-common.c:517)
<4>[ 88.205751] el1h_64_irq_handler (arch/arm64/kernel/entry-common.c:523)
<4>[ 88.206672] el1h_64_irq (arch/arm64/kernel/entry.S:591)
<4>[ 88.207329] map_id_range_down (kernel/user_namespace.c:299 kernel/user_namespace.c:319)
<4>[ 88.208250] make_kuid (kernel/user_namespace.c:412)
<4>[ 88.208826] inode_init_always (include/linux/fs.h:1343 (discriminator 1) fs/inode.c:174 (discriminator 1))
<4>[ 88.209678] alloc_inode (fs/inode.c:266 (discriminator 2))
<4>[ 88.210105] new_inode (fs/inode.c:1004 fs/inode.c:1030)
<4>[ 88.210542] proc_pid_make_inode (fs/proc/base.c:1898)
<4>[ 88.210963] proc_pid_instantiate (fs/proc/base.c:1949 fs/proc/base.c:3420)
<4>[ 88.211361] proc_pid_lookup (fs/proc/base.c:3464)
<4>[ 88.211762] proc_root_lookup (fs/proc/root.c:325 (discriminator 1))
<4>[ 88.212299] __lookup_slow (fs/namei.c:1694)
<4>[ 88.212739] walk_component (fs/namei.c:1711 fs/namei.c:2002)
<4>[ 88.213244] link_path_walk.part.0.constprop.0 (fs/namei.c:2331 (discriminator 1))
<4>[ 88.213803] path_openat (fs/namei.c:2254 (discriminator 1) fs/namei.c:3793 (discriminator 1))
<4>[ 88.214264] do_filp_open (fs/namei.c:3824)
<4>[ 88.214550] do_sys_openat2 (fs/open.c:1422)
<4>[ 88.215080] __arm64_sys_openat (fs/open.c:1448)
<4>[ 88.215495] invoke_syscall (arch/arm64/include/asm/current.h:19 arch/arm64/kernel/syscall.c:56)
<4>[ 88.215986] el0_svc_common.constprop.0 (include/linux/thread_info.h:127 (discriminator 2) arch/arm64/kernel/syscall.c:144 (discriminator 2))
<4>[ 88.216476] do_el0_svc (arch/arm64/kernel/syscall.c:156)
<4>[ 88.216910] el0_svc (arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:144 arch/arm64/kernel/entry-common.c:679)
<4>[ 88.217246] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:697)
<4>[ 88.217766] el0t_64_sync (arch/arm64/kernel/entry.S:595)
<0>[ 88.218477] Code: a90153f3 aa0003f4 aa0103f3 97f6f396 (f9400280)
All code
========
   0: a90153f3 stp x19, x20, [sp, #16]
   4: aa0003f4 mov x20, x0
   8: aa0103f3 mov x19, x1
   c: 97f6f396 bl 0xffffffffffdbce64
  10:* f9400280 ldr x0, [x20] <-- trapping instruction

Code starting with the faulting instruction
===========================================
   0: f9400280 ldr x0, [x20]
<4>[ 88.219947] ---[ end trace 0000000000000000 ]---
<0>[ 88.220779] Kernel panic - not syncing: Oops: Fatal exception in interrupt
<2>[ 88.221715] SMP: stopping secondary CPUs
<0>[ 88.226328] Kernel Offset: 0x3c9385a00000 from 0xffff800080000000
<0>[ 88.226953] PHYS_OFFSET: 0x40000000
<0>[ 88.227382] CPU features: 0x0,00000000,d1e2cf43,99e6773f
<0>[ 88.228141] Memory Limit: none
<0>[ 88.228905] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
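For readers cross-checking the "Mem abort info" block in the log, the following is a minimal sketch of how the reported ESR value splits into the fields the kernel prints. It assumes the standard AArch64 ESR layout for data aborts (EC in bits [31:26], IL in bit 25, ISS in bits [24:0], with FSC in ISS[5:0] and WnR in ISS[6]); the function name `decode_esr` is hypothetical and is not a kernel interface.

```python
def decode_esr(esr: int) -> dict:
    """Split an AArch64 ESR value into the fields shown in the oops.

    Assumes the data-abort ISS encoding; other exception classes use
    the ISS bits differently.
    """
    iss = esr & 0x1FFFFFF            # ISS: bits [24:0]
    return {
        "ec": (esr >> 26) & 0x3F,    # exception class: bits [31:26]
        "il": (esr >> 25) & 0x1,     # 1 means a 32-bit instruction
        "iss": iss,
        "fsc": iss & 0x3F,           # fault status code: ISS[5:0]
        "wnr": (iss >> 6) & 0x1,     # 0 = read, 1 = write
    }


if __name__ == "__main__":
    # ESR value taken from the log in this report.
    for name, value in decode_esr(0x96000004).items():
        print(f"{name} = {value:#x}")
```

With the value from the log this gives EC = 0x25, IL = 1, ISS = 0x4, FSC = 0x04 and WnR = 0, matching the decoded lines the kernel printed (a read that took a level 0 translation fault).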
Links:
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.6.y/build/v6.6-31...
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.6.y/build/v6.6-31...
- https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2XntnCQFUyH...

--
Linaro LKFT
https://lkft.linaro.org
On Tue, Nov 07, 2023 at 06:43:25PM +0530, Naresh Kamboju wrote:
# # SVE-VL-64-0: Expected [3904000039044000390480003904c0003904000139044001390480013904c0013904000239044002390480023904c0023904000339044003390480033904c003] <>
You've elided *lots* of error reports from the actual test which suggest that there is substantial memory corruption. It looks like tearing part way through loading or saving the values: the start of the vectors looks fine, but at some point they get what looks like a related process' data, e.g.:
# # SVE-VL-64-0: Expected [3904000039044000390480003904c0003904000139044001390480013904c0013904000239044002390480023904c0023904000339044003390480033904c003]
# # SVE-VL-64-0: Got [3904000039044000390480003904c000390480003904c00039040001390440013904000139044001390480013904c001390480013904c0013904000239044002]
This only appears to affect SVE and SME. I didn't spot any FPSIMD corruption, but then that is the smallest case (and I didn't notice any VL 16 cases either). It looks like the corruption is on the first thing we check each time (either register 0 or the highest ZA.H vector for ZA), and all the values do look like they were plausibly generated by fp-stress test programs.
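To make the tearing easier to see, here is a small sketch that compares the Expected and Got values quoted above and reports where they first diverge. Splitting the hex dump into 4-byte words (8 hex digits) is an assumption based on how the pattern repeats, and the helper names `split_words` and `first_mismatch` are hypothetical; this is not how fp-stress itself performs the check.

```python
# Expected/Got values copied from the SVE-VL-64-0 report above.
EXPECTED = (
    "3904000039044000390480003904c000"
    "3904000139044001390480013904c001"
    "3904000239044002390480023904c002"
    "3904000339044003390480033904c003"
)
GOT = (
    "3904000039044000390480003904c000"
    "390480003904c0003904000139044001"
    "3904000139044001390480013904c001"
    "390480013904c0013904000239044002"
)


def split_words(hex_str, word_hex_len=8):
    """Split a hex dump into fixed-size words (assumed 4 bytes each)."""
    return [hex_str[i:i + word_hex_len]
            for i in range(0, len(hex_str), word_hex_len)]


def first_mismatch(expected_hex, got_hex, word_hex_len=8):
    """Return the index of the first differing word, or None if all match."""
    pairs = zip(split_words(expected_hex, word_hex_len),
                split_words(got_hex, word_hex_len))
    for index, (exp_word, got_word) in enumerate(pairs):
        if exp_word != got_word:
            return index
    return None


if __name__ == "__main__":
    index = first_mismatch(EXPECTED, GOT)
    print("first differing word:", index)
    print("byte offset of that word:", index * 4)
```

For this pair the first four words match and the first difference is at word index 4, that is, 16 bytes into the register.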
Then we get what looks like memory corruption:
# # SVE-VL-256-<1>[ 88.160313] Unable to handle kernel paging request at virtual address 00550f0344550f02
<4>[ 88.195706] Call trace:
<4>[ 88.196098] percpu_ref_get_many (include/linux/percpu-refcount.h:174 (discriminator 2) include/linux/percpu-refcount.h:204 (discriminator 2))
<4>[ 88.196815] refill_obj_stock (mm/memcontrol.c:3339 (discriminator 2))
<4>[ 88.197367] obj_cgroup_uncharge (mm/memcontrol.c:3406)
<4>[ 88.197835] kmem_cache_free (include/linux/mm.h:1630 include/linux/mm.h:1849 include/linux/mm.h:1859 mm/slab.h:208 mm/slab.h:572 mm/slub.c:3804 mm/slub.c:3831)
<4>[ 88.198407] put_pid.part.0 (kernel/pid.c:118)
<4>[ 88.198870] delayed_put_pid (kernel/pid.c:127)
<4>[ 88.200527] rcu_core (arch/arm64/include/asm/preempt.h:13 (discriminator 1) kernel/rcu/tree.c:2146 (discriminator 1) kernel/rcu/tree.c:2403 (discriminator 1))
This all seems very surprising, especially given that AFAICT there are no changes in stable-6.6-rc for arch/arm64.
On Tue, 7 Nov 2023 at 19:51, Mark Brown <broonie@kernel.org> wrote:
This all seems very surprising, especially given that AFAICT there are no changes in stable-6.6-rc for arch/arm64.
We do not see this on mainline and next. Are the problems reported on stable-rc 6.6 and 6.5 due to running the latest kselftest on older kernels?
--
# # SSVE-VL-32-1: Mismatch: PID=641, iteration=0, reg=0
# # SSVE-VL-128-1: Got [<junk>]
# # SSVE-VL-256-1: Got [<junk>]

Unable to handle kernel paging request at virtual address 00740f0322740f02
0<1>[ 89.400173] Mem abort info:
<1>[ 89.400844] ESR = 0x0000000096000004
<1>[ 89.401974] EC = 0x25: DABT (current EL), IL = 32 bits
<1>[ 89.403399] SET = 0, FnV = 0
<1>[ 89.404421] EA = 0, S1PTW = 0
<1>[ 89.405317] FSC = 0x04: level 0 translation fault
<1>[ 89.406545] Data abort info:
<1>[ 89.407493] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
<1>[ 89.408785] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
<1>[ 89.410001] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
<1>[ 89.411485] [00740f0322740f02] address between user and kernel address ranges
<0>[ 89.413851] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
<4>[ 89.415573] Modules linked in: crct10dif_ce sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 fuse drm dm_mod ip_tables x_tables
<4>[ 89.419561] CPU: 1 PID: 22 Comm: ksoftirqd/1 Not tainted 6.5.11-rc1 #1
<4>[ 89.420795] Hardware name: linux,dummy-virt (DT)
<4>[ 89.422676] pstate: 624000c9 (nZCv daIF +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
<4>[ 89.424344] pc : refill_obj_stock+0x6c/0x250
<4>[ 89.426324] lr : refill_obj_stock+0x6c/0x250
<trim>
<4>[ 89.447170] Call trace:
<4>[ 89.447756] refill_obj_stock+0x6c/0x250
<4>[ 89.449033] obj_cgroup_uncharge+0x20/0x38
<4>[ 89.450457] kmem_cache_free+0xf8/0x500
<4>[ 89.451066] delayed_put_pid+0x50/0xb0
<4>[ 89.452401] rcu_core+0x3cc/0x950
<4>[ 89.453369] rcu_core_si+0x1c/0x30
<4>[ 89.454465] __do_softirq+0x118/0x438
<4>[ 89.455738] run_ksoftirqd+0x40/0xf8
<4>[ 89.456893] smpboot_thread_fn+0x1d0/0x248
<4>[ 89.457969] kthread+0xfc/0x1a0
<4>[ 89.459171] ret_from_fork+0x10/0x20
<0>[ 89.460445] Code: aa1603e0 97fffef8 aa0003f4 97f6cbf6 (f9400269)
<4>[ 89.462028] ---[ end trace 0000000000000000 ]---
<0>[ 89.463494] Kernel panic - not syncing: Oops: Fatal exception in interrupt
<2>[ 89.465046] SMP: stopping secondary CPUs
<0>[ 89.466327] Kernel Offset: 0x2dabffa00000 from 0xffff800080000000
<0>[ 89.467385] PHYS_OFFSET: 0x40000000
<0>[ 89.468131] CPU features: 0x00000000,68f167a1,cce6773f
<0>[ 89.469850] Memory Limit: none
<0>[ 89.470836] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
Links:
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.5.y/build/v6.5.10...
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.5.y/build/v6.5.10...
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.5.y/build/v6.5.10...
- Naresh
On Tue, Nov 07, 2023 at 08:14:59PM +0530, Naresh Kamboju wrote:
We do not see this on mainline and next. Are the problems reported on stable-rc 6.6 and 6.5 due to running the latest kselftest on older kernels?
There are also no backports that I can see in the selftests (at all, never mind just arm64). A small number of selftest changes for arm64 went in during the merge window, but nothing that looks super relevant.
Hi Mark,
On Tue, 7 Nov 2023 at 21:37, Mark Brown <broonie@kernel.org> wrote:
There are also no backports that I can see in the selftests (at all, never mind just arm64). A small number of selftest changes for arm64 went in during the merge window, but nothing that looks super relevant.
The QEMU version was updated from v8.0 to v8.1, and that is when we started getting these test failures.
- Naresh