The following kernel warning was noticed while running the kselftest arm64 sve-ptrace test on qemu-arm64, hosted on an Ampere Altra server.
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
/usr/bin/qemu-system-aarch64 -cpu max,pauth-impdef=on \
  -machine virt-2.10 \
  -nographic \
  -net nic,model=virtio,macaddr=BA:DD:AD:FC:09:12 \
  -net tap -m 4096 -monitor none \
  -kernel Image.gz --append "console=ttyAMA0 root=/dev/vda rw" \
  -hda lkft-kselftest-image-juno-20221114150409.rootfs.ext4 \
  -smp 4 -nographic
Boot log:
---------
[    0.000000] Linux version 6.0.9-rc1 (tuxmake@tuxmake) (aarch64-linux-gnu-gcc (Debian 11.3.0-6) 11.3.0, GNU ld (GNU Binutils for Debian) 2.39) #1 SMP PREEMPT @1668438377
[    0.000000] random: crng init done
[    0.000000] Machine model: linux,dummy-virt
# selftests: arm64: sve-ptrace
# ok 680 # SKIP SVE set FPSIMD get SVE for VL 2704
# ok 681 Set SVE VL 2720
[  422.607034] ------------[ cut here ]------------
[  422.615382] WARNING: CPU: 0 PID: 1111 at arch/arm64/kernel/fpsimd.c:464 fpsimd_save+0x170/0x1b0
[  422.617588] Modules linked in: cfg80211 bluetooth rfkill crct10dif_ce sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 fuse drm
[  422.619758] CPU: 0 PID: 1111 Comm: sve-ptrace Not tainted 6.0.9-rc1 #1
[  422.620402] Hardware name: linux,dummy-virt (DT)
[  422.620958] pstate: 804000c5 (Nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  422.621614] pc : fpsimd_save+0x170/0x1b0
[  422.621988] lr : fpsimd_save+0xd8/0x1b0
[  422.622307] sp : ffff800008f3bb00
[  422.622612] x29: ffff800008f3bb00 x28: ffffae14dd664bc0 x27: 0000000000000001
[  422.623519] x26: ffff0000ff773858 x25: 0000000000000000 x24: ffff0000c0994fa8
[  422.624102] x23: 0000000000000001 x22: 0000000000000100 x21: ffff0000ff75f0b0
[  422.624706] x20: ffff51ec22a8b000 x19: ffffae14dccd40b0 x18: 0000000000000000
[  422.625292] x17: ffff51ec22a8b000 x16: 0000000000000000 x15: 0000000000000000
[  422.626041] x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000002
[  422.626647] x11: ffffae14ddbee840 x10: 0000000000000312 x9 : ffffae14da818210
[  422.627326] x8 : ffff0000c09935c0 x7 : ffffae14de2b8d08 x6 : 0000000000000000
[  422.627889] x5 : 000000c91075a4a8 x4 : 0000000000000000 x3 : 0000000000000001
[  422.628487] x2 : ffff51ec22a8b000 x1 : 0000000000000204 x0 : 0000000000000010
[  422.629203] Call trace:
[  422.629579]  fpsimd_save+0x170/0x1b0
[  422.630014]  fpsimd_thread_switch+0x2c/0xc4
[  422.630431]  __switch_to+0x20/0x160
[  422.630745]  __schedule+0x380/0xb90
[  422.631038]  preempt_schedule_irq+0x4c/0x130
[  422.631386]  el1_interrupt+0x4c/0x64
[  422.631689]  el1h_64_irq_handler+0x18/0x24
[  422.632037]  el1h_64_irq+0x64/0x68
[  422.632335]  do_page_fault+0x31c/0x4d0
[  422.632660]  do_translation_fault+0xd8/0x100
[  422.632993]  do_mem_abort+0x58/0xb0
[  422.633311]  el0_ia+0x8c/0x134
[  422.633685]  el0t_64_sync_handler+0x134/0x140
[  422.634061]  el0t_64_sync+0x18c/0x190
[  422.634580] irq event stamp: 654
[  422.634923] hardirqs last enabled at (653): [<ffffae14dbeafc94>] exit_to_kernel_mode+0x34/0x130
[  422.635713] hardirqs last disabled at (654): [<ffffae14dbeb7700>] __schedule+0x3f0/0xb90
[  422.636309] softirqs last enabled at (650): [<ffffae14da810be4>] __do_softirq+0x514/0x62c
[  422.636877] softirqs last disabled at (637): [<ffffae14da8b4f58>] __irq_exit_rcu+0x164/0x19c
[  422.637446] ---[ end trace 0000000000000000 ]---
Full test log:
https://lkft.validation.linaro.org/scheduler/job/5847349#L2206
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.0.y/build/v6.0.8-...
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.0.y/build/v6.0.8-...

metadata:
  git_ref: linux-6.0.y
  git_repo: https://gitlab.com/Linaro/lkft/mirrors/stable/linux-stable-rc
  git_sha: f8896c3ebbcfcc053d9c27413bea3af94c01fd71
  git_describe: v6.0.8-191-gf8896c3ebbcf
  kernel_version: 6.0.9-rc1
  kernel-config: https://builds.tuxbuild.com/2HXisCgbMlQAU85bS1QC4TvzydK/config
  build-url: https://gitlab.com/Linaro/lkft/mirrors/stable/linux-stable-rc/-/pipelines/69...
  artifact-location: https://builds.tuxbuild.com/2HXisCgbMlQAU85bS1QC4TvzydK
  toolchain: gcc-11

--
Linaro LKFT
https://lkft.linaro.org
On Tue, Nov 15, 2022, at 08:27, Naresh Kamboju wrote:
> The following kernel warning was noticed while running the kselftest arm64 sve-ptrace test on qemu-arm64, hosted on an Ampere Altra server.
>
> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
>
> /usr/bin/qemu-system-aarch64 -cpu max,pauth-impdef=on \
>   -machine virt-2.10 \
>   -nographic \
>   -net nic,model=virtio,macaddr=BA:DD:AD:FC:09:12 \
>   -net tap -m 4096 -monitor none \
>   -kernel Image.gz --append "console=ttyAMA0 root=/dev/vda rw" \
>   -hda lkft-kselftest-image-juno-20221114150409.rootfs.ext4 \
>   -smp 4 -nographic
Hi Naresh,
Have you tried what happens if you run the same thing on an x86 machine? I would expect them to behave the same way, but it's possible something goes wrong with the guest CPU if this ends up using some (but not all) of the logic from KVM that would use '-cpu host' instead of '-cpu max'. Note that the Neoverse CPU in the Altra machine does not support SVE.
Another thing you could easily try is the same command line as above with the possible combinations of '-cpu host' (replacing '-cpu max') and '-enable-kvm'. Do you always get the same result?
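As a rough sketch of the combinations suggested above, the loop below only prints candidate command lines instead of running them. The helper name `print_qemu_variants` is hypothetical. The base arguments are copied from the report, while the CPU properties, networking, kernel command line and disk options are left out for brevity. QEMU only accepts '-cpu host' together with a hardware accelerator such as KVM, so that combination is skipped when '-enable-kvm' is absent.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the QEMU command lines worth comparing.
# Base arguments come from the report; several options are omitted for brevity.
print_qemu_variants() {
    base='-machine virt-2.10 -nographic -m 4096 -smp 4 -kernel Image.gz'
    for cpu in max host; do
        for kvm in '' '-enable-kvm'; do
            # '-cpu host' needs a hardware accelerator, so skip it for plain TCG.
            if [ "$cpu" = host ] && [ -z "$kvm" ]; then
                continue
            fi
            # ${kvm:+...} expands to nothing when $kvm is empty.
            echo "qemu-system-aarch64 -cpu $cpu ${kvm:+$kvm }$base"
        done
    done
}

print_qemu_variants
```

Comparing the sve-ptrace result across the printed variants would show whether the warning depends on the CPU model or on the accelerator in use.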
> Boot log:
> ---------
> [    0.000000] Linux version 6.0.9-rc1 (tuxmake@tuxmake) (aarch64-linux-gnu-gcc (Debian 11.3.0-6) 11.3.0, GNU ld (GNU Binutils for Debian) 2.39) #1 SMP PREEMPT @1668438377
> [    0.000000] random: crng init done
> [    0.000000] Machine model: linux,dummy-virt
>
> # selftests: arm64: sve-ptrace
> # ok 680 # SKIP SVE set FPSIMD get SVE for VL 2704
> # ok 681 Set SVE VL 2720
>
> [  422.607034] ------------[ cut here ]------------
> [  422.615382] WARNING: CPU: 0 PID: 1111 at arch/arm64/kernel/fpsimd.c:464 fpsimd_save+0x170/0x1b0
> [  422.617588] Modules linked in: cfg80211 bluetooth rfkill crct10dif_ce sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 fuse drm
> [  422.619758] CPU: 0 PID: 1111 Comm: sve-ptrace Not tainted 6.0.9-rc1 #1
> [  422.620402] Hardware name: linux,dummy-virt (DT)
> [  422.620958] pstate: 804000c5 (Nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [  422.621614] pc : fpsimd_save+0x170/0x1b0
> [  422.621988] lr : fpsimd_save+0xd8/0x1b0
> [  422.622307] sp : ffff800008f3bb00
> [  422.622612] x29: ffff800008f3bb00 x28: ffffae14dd664bc0 x27: 0000000000000001
> [  422.623519] x26: ffff0000ff773858 x25: 0000000000000000 x24: ffff0000c0994fa8
> [  422.624102] x23: 0000000000000001 x22: 0000000000000100 x21: ffff0000ff75f0b0
> [  422.624706] x20: ffff51ec22a8b000 x19: ffffae14dccd40b0 x18: 0000000000000000
> [  422.625292] x17: ffff51ec22a8b000 x16: 0000000000000000 x15: 0000000000000000
> [  422.626041] x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000002
> [  422.626647] x11: ffffae14ddbee840 x10: 0000000000000312 x9 : ffffae14da818210
> [  422.627326] x8 : ffff0000c09935c0 x7 : ffffae14de2b8d08 x6 : 0000000000000000
> [  422.627889] x5 : 000000c91075a4a8 x4 : 0000000000000000 x3 : 0000000000000001
> [  422.628487] x2 : ffff51ec22a8b000 x1 : 0000000000000204 x0 : 0000000000000010
> [  422.629203] Call trace:
> [  422.629579]  fpsimd_save+0x170/0x1b0
> [  422.630014]  fpsimd_thread_switch+0x2c/0xc4
This is the location of the WARN_ON(). It checks that the vector length read back by sve_get_vl() matches the expected value. If for some reason the guest ends up with the vector size of the host CPU, this would warn.
	if (IS_ENABLED(CONFIG_ARM64_SVE) && save_sve_regs) {
		/* Get the configured VL from RDVL, will account for SM */
		if (WARN_ON(sve_get_vl() != vl)) {
			/*
> [  422.630431]  __switch_to+0x20/0x160
> [  422.630745]  __schedule+0x380/0xb90
> [  422.631038]  preempt_schedule_irq+0x4c/0x130
> [  422.631386]  el1_interrupt+0x4c/0x64
> [  422.631689]  el1h_64_irq_handler+0x18/0x24
> [  422.632037]  el1h_64_irq+0x64/0x68
> [  422.632335]  do_page_fault+0x31c/0x4d0
> [  422.632660]  do_translation_fault+0xd8/0x100
> [  422.632993]  do_mem_abort+0x58/0xb0
> [  422.633311]  el0_ia+0x8c/0x134
> [  422.633685]  el0t_64_sync_handler+0x134/0x140
> [  422.634061]  el0t_64_sync+0x18c/0x190
> [  422.634580] irq event stamp: 654
> [  422.634923] hardirqs last enabled at (653): [<ffffae14dbeafc94>] exit_to_kernel_mode+0x34/0x130
> [  422.635713] hardirqs last disabled at (654): [<ffffae14dbeb7700>] __schedule+0x3f0/0xb90
> [  422.636309] softirqs last enabled at (650): [<ffffae14da810be4>] __do_softirq+0x514/0x62c
> [  422.636877] softirqs last disabled at (637): [<ffffae14da8b4f58>] __irq_exit_rcu+0x164/0x19c
> [  422.637446] ---[ end trace 0000000000000000 ]---
>
> Full test log:
> https://lkft.validation.linaro.org/scheduler/job/5847349#L2206
> https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.0.y/build/v6.0.8-...
> https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.0.y/build/v6.0.8-...
On Tue, Nov 15, 2022 at 09:22:53AM +0100, Arnd Bergmann wrote:
> Have you tried what happens if you run the same thing on an x86 machine? I would expect them to behave the same way, but it's possible something goes wrong with the guest CPU if this ends up using some (but not all) of the logic from KVM that would use '-cpu host' instead of '-cpu max'. Note that the Neoverse CPU in the Altra machine does not support SVE.
I'm finding it hard to think of a failure pattern that would make it through VL discovery then fail at runtime but also not obviously trigger any issues in syscall-abi...
> Another thing you could easily try is the same command line as above with the possible combinations of '-cpu host' (replacing '-cpu max') and '-enable-kvm'. Do you always get the same result?
The machine parameter accel={tcg,kvm} is useful for forcing a specific backend. It's probably wise to force TCG if the job might end up running on a native architecture.
BTW there's some other funky stuff going on with that job: the syscall-abi test is stopped with a timeout after 45 seconds (as is sve-ptrace), which appears to be coming from a harness somewhere. The selection of FP tests run also seems to miss fp-stress.
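A minimal sketch of forcing the backend through the machine option, based on the command line from the report. The function name `qemu_machine_arg` is made up for illustration; only the accel= machine property and the virt-2.10 machine type come from the thread, and the printed command is deliberately incomplete.

```shell
#!/bin/sh
# Hypothetical helper: build the -machine argument with an explicit accelerator.
qemu_machine_arg() {
    accel=$1
    case "$accel" in
        tcg|kvm) echo "-machine virt-2.10,accel=$accel" ;;
        *) echo "unsupported accelerator: $accel" >&2; return 1 ;;
    esac
}

# Force TCG so the guest is always emulated, whatever the host architecture.
# The command is only printed here, not executed.
echo "qemu-system-aarch64 -cpu max,pauth-impdef=on $(qemu_machine_arg tcg) -nographic"
```

With the accelerator pinned this way, a run on an arm64 host and a run on an x86 host should at least be using the same emulation path.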
On Tue, Nov 15, 2022 at 12:57:53PM +0530, Naresh Kamboju wrote:
> The following kernel warning was noticed while running the kselftest arm64 sve-ptrace test on qemu-arm64, hosted on an Ampere Altra server.
>
> [  422.607034] ------------[ cut here ]------------
> [  422.615382] WARNING: CPU: 0 PID: 1111 at arch/arm64/kernel/fpsimd.c:464 fpsimd_save+0x170/0x1b0
> [  422.617588] Modules linked in: cfg80211 bluetooth rfkill crct10dif_ce sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 fuse drm
Without the ability to reproduce this, or more information, this isn't really actionable. Since I'm not seeing any changes in the stable queue that look in the least bit relevant, I'm guessing that it has just happened once?
You mention that this is hosted on an Altra, but it looks like you're running the TCG backend. If there's some reason to expect that qemu might be unstable when hosted on that platform, it's probably worth looping the qemu people in.