The following kernel crash was noticed on qemu-arm64 while running the LTP syscalls set_robust_list test case on Linux next 6.6.0-rc7-next-20231026 and 6.6.0-rc7-next-20231025.
BAD: next-20231025
Good: next-20231024

Reported-by: Linux Kernel Functional Testing lkft@linaro.org
Reported-by: Naresh Kamboju naresh.kamboju@linaro.org
Log:
----
<1>[ 203.119139] Unable to handle kernel unknown 43 at virtual address 0001ffff9e2e7d78
<1>[ 203.119838] Mem abort info:
<1>[ 203.120064] ESR = 0x000000009793002b
<1>[ 203.121040] EC = 0x25: DABT (current EL), IL = 32 bits
set_robust_list01 1 TPASS : set_robust_list: retval = -1 (expected -1), errno = 22 (expected 22)
set_robust_list01 2 TPASS : set_robust_list: retval = 0 (expected 0), errno = 0 (expected 0)
<1>[ 203.124496] SET = 0, FnV = 0
<1>[ 203.124778] EA = 0, S1PTW = 0
<1>[ 203.125029] FSC = 0x2b: unknown 43
<1>[ 203.126470] Data abort info:
<1>[ 203.126710] Access size = 4 byte(s)
<1>[ 203.126969] SSE = 0, SRT = 19
<1>[ 203.127708] SF = 0, AR = 0
<1>[ 203.128213] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
<1>[ 203.128788] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
<1>[ 203.130416] user pgtable: 4k pages, 52-bit VAs, pgdp=000000010606a780
<1>[ 203.130817] [0001ffff9e2e7d78] pgd=0000000000000000
<0>[ 203.132603] Internal error: Oops: 000000009793002b [#1] PREEMPT SMP
<4>[ 203.133483] Modules linked in: btrfs blake2b_generic libcrc32c xor xor_neon raid6_pq zstd_compress crct10dif_ce sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 fuse drm backlight dm_mod ip_tables x_tables
<4>[ 203.135177] CPU: 1 PID: 653 Comm: set_robust_list Not tainted 6.6.0-rc7-next-20231026 #1
<4>[ 203.135642] Hardware name: linux,dummy-virt (DT)
<4>[ 203.136609] pstate: 83400009 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
<4>[ 203.137028] pc : handle_futex_death (kernel/futex/core.c:661 (discriminator 6))
<4>[ 203.138844] lr : handle_futex_death (arch/arm64/include/asm/uaccess.h:46 (discriminator 1) kernel/futex/core.c:661 (discriminator 1))
<4>[ 203.139132] sp : ffff8000805c3c10
<4>[ 203.139356] x29: ffff8000805c3c10 x28: 0000ffffbf187740 x27: d53bd04035000220
<4>[ 203.140366] x26: 0000000000000000 x25: fff00000c6195280 x24: fff00000c6195280
<4>[ 203.141055] x23: 0000000000000001 x22: ffffa4e6aeef09d0 x21: 0001ffff9e2e7d78
<4>[ 203.141771] x20: 0001ffff9e2e7d78 x19: 0001ffff9e2e7d78 x18: ffff8000805c3cf8
<4>[ 203.142457] x17: 0000000000000000 x16: ffffa4e6aeae7078 x15: 000000000000000a
<4>[ 203.143134] x14: 0000000000000000 x13: 1ffe000018258661 x12: ffff8000805c3cf8
<4>[ 203.143809] x11: 0000000000000000 x10: fff00000c12c3308 x9 : ffffa4e6ad0e5748
<4>[ 203.144504] x8 : ffff8000805c3c38 x7 : 0000000000000000 x6 : 0000000000000001
<4>[ 203.145186] x5 : 0000000000000000 x4 : fff00000c6195280 x3 : 0000000000000000
<4>[ 203.145929] x2 : 0000000000000000 x1 : 000ffffffffffffc x0 : 0001ffff9e2e7d78
<4>[ 203.147032] Call trace:
<4>[ 203.147254] handle_futex_death (kernel/futex/core.c:661 (discriminator 6))
<4>[ 203.147560] exit_robust_list (kernel/futex/core.c:828)
<4>[ 203.148348] futex_exit_release (kernel/futex/core.c:1035 (discriminator 1) kernel/futex/core.c:1131 (discriminator 1))
<4>[ 203.148891] exit_mm_release (kernel/fork.c:1657)
<4>[ 203.149669] do_exit (kernel/exit.c:541 kernel/exit.c:858)
<4>[ 203.149897] do_group_exit (kernel/exit.c:1002)
<4>[ 203.150209] __arm64_sys_exit_group (kernel/exit.c:1032)
<4>[ 203.150980] invoke_syscall (arch/arm64/include/asm/current.h:19 arch/arm64/kernel/syscall.c:56)
<4>[ 203.151234] el0_svc_common.constprop.0 (include/linux/thread_info.h:127 (discriminator 2) arch/arm64/kernel/syscall.c:144 (discriminator 2))
<4>[ 203.151999] do_el0_svc (arch/arm64/kernel/syscall.c:156)
<4>[ 203.152231] el0_svc (arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:144 arch/arm64/kernel/entry-common.c:679)
<4>[ 203.152936] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:697)
<4>[ 203.153518] el0t_64_sync (arch/arm64/kernel/entry.S:595)
<0>[ 203.154424] Code: d50323bf d65f03c0 9248fa93 52800002 (b8400a73)
All code
========
   0:   d50323bf   autiasp
   4:   d65f03c0   ret
   8:   9248fa93   and x19, x20, #0xff7fffffffffffff
   c:   52800002   mov w2, #0x0 // #0
  10:*  b8400a73   ldtr w19, [x19]   <-- trapping instruction

Code starting with the faulting instruction
===========================================
   0:   b8400a73   ldtr w19, [x19]
<4>[ 203.155308] ---[ end trace 0000000000000000 ]---
<1>[ 203.156234] Fixing recursive fault but reboot is needed!
<3>[ 203.157116] BUG: using smp_processor_id() in preemptible [00000000] code: set_robust_list/653
<4>[ 203.158116] caller is debug_smp_processor_id (lib/smp_processor_id.c:61)
<4>[ 203.158983] CPU: 1 PID: 653 Comm: set_robust_list Tainted: G D 6.6.0-rc7-next-20231026 #1
<4>[ 203.159451] Hardware name: linux,dummy-virt (DT)
<4>[ 203.159990] Call trace:
<4>[ 203.160394] dump_backtrace (arch/arm64/kernel/stacktrace.c:235)
<4>[ 203.160625] show_stack (arch/arm64/kernel/stacktrace.c:242)
<4>[ 203.160854] dump_stack_lvl (lib/dump_stack.c:107)
<4>[ 203.161869] dump_stack (lib/dump_stack.c:114)
<4>[ 203.162093] check_preemption_disabled (arch/arm64/include/asm/current.h:19 arch/arm64/include/asm/preempt.h:54 lib/smp_processor_id.c:53)
<4>[ 203.162898] debug_smp_processor_id (lib/smp_processor_id.c:61)
<4>[ 203.163176] __schedule (kernel/sched/core.c:6578 (discriminator 1))
<4>[ 203.163894] do_task_dead (kernel/sched/core.c:6705)
<4>[ 203.164143] make_task_dead (arch/arm64/include/asm/atomic_ll_sc.h:95 (discriminator 3) arch/arm64/include/asm/atomic.h:49 (discriminator 3) include/linux/atomic/atomic-arch-fallback.h:747 (discriminator 3) include/linux/atomic/atomic-instrumented.h:253 (discriminator 3) include/linux/refcount.h:193 (discriminator 3) include/linux/refcount.h:250 (discriminator 3) include/linux/refcount.h:267 (discriminator 3) kernel/exit.c:979 (discriminator 3))
<4>[ 203.164871] die (arch/arm64/kernel/traps.c:239)
<4>[ 203.165093] die_kernel_fault (arch/arm64/mm/fault.c:321)
<4>[ 203.165905] do_mem_abort (arch/arm64/mm/fault.c:850)
<4>[ 203.166149] el1_abort (arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:399)
<4>[ 203.166864] el1h_64_sync_handler (arch/arm64/kernel/entry-common.c:486)
<4>[ 203.167173] el1h_64_sync (arch/arm64/kernel/entry.S:590)
<4>[ 203.167824] handle_futex_death (kernel/futex/core.c:661 (discriminator 6))
<4>[ 203.168329] exit_robust_list (kernel/futex/core.c:828)
<4>[ 203.168829] futex_exit_release (kernel/futex/core.c:1035 (discriminator 1) kernel/futex/core.c:1131 (discriminator 1))
<4>[ 203.169375] exit_mm_release (kernel/fork.c:1657)
<4>[ 203.169884] do_exit (kernel/exit.c:541 kernel/exit.c:858)
<4>[ 203.170372] do_group_exit (kernel/exit.c:1002)
<4>[ 203.170857] __arm64_sys_exit_group (kernel/exit.c:1032)
<4>[ 203.171643] invoke_syscall (arch/arm64/include/asm/current.h:19 arch/arm64/kernel/syscall.c:56)
<4>[ 203.172281] el0_svc_common.constprop.0 (include/linux/thread_info.h:127 (discriminator 2) arch/arm64/kernel/syscall.c:144 (discriminator 2))
<4>[ 203.172815] do_el0_svc (arch/arm64/kernel/syscall.c:156)
<4>[ 203.173284] el0_svc (arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:144 arch/arm64/kernel/entry-common.c:679)
<4>[ 203.173769] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:697)
<4>[ 203.174052] el0t_64_sync (arch/arm64/kernel/entry.S:595)
Links:
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20231026/tes...
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20231026/tes...
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20231026/tes...

--
Linaro LKFT
https://lkft.linaro.org
On Thu, Oct 26, 2023 at 08:11:26PM +0530, Naresh Kamboju wrote:
<1>[ 203.119139] Unable to handle kernel unknown 43 at virtual address 0001ffff9e2e7d78
<1>[ 203.119838] Mem abort info:
<1>[ 203.120064] ESR = 0x000000009793002b
<1>[ 203.121040] EC = 0x25: DABT (current EL), IL = 32 bits
set_robust_list01 1 TPASS : set_robust_list: retval = -1 (expected -1), errno = 22 (expected 22)
set_robust_list01 2 TPASS : set_robust_list: retval = 0 (expected 0), errno = 0 (expected 0)
<1>[ 203.124496] SET = 0, FnV = 0
<1>[ 203.124778] EA = 0, S1PTW = 0
<1>[ 203.125029] FSC = 0x2b: unknown 43
It looks like this is fallout from the LPA2 enablement.
According to the latest ARM ARM (ARM DDI 0487J.a), page D19-6475, that "unknown 43" (0x2b / 0b101011) is the DFSC for a level -1 translation fault:
0b101011 When FEAT_LPA2 is implemented: Translation fault, level -1.
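For anyone following along, the fields printed in the log can be recovered from the raw ESR value by hand. The following is a minimal standalone sketch in C, not kernel code: the `esr_fields` struct, the `decode_esr()` and `check_logged_esr()` helpers and the `EX_ESR_*` macro names are invented for this illustration, and only the bit positions are taken from the ESR_ELx layout for data aborts.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names only; these are not the kernel's macros. */
#define EX_ESR_EC_SHIFT   26      /* EC lives in bits [31:26] */
#define EX_ESR_EC_MASK    0x3fu
#define EX_ESR_IL_SHIFT   25      /* IL is bit 25 */
#define EX_ESR_WNR_SHIFT  6       /* WnR is bit 6 for data aborts */
#define EX_ESR_FSC_MASK   0x3fu   /* DFSC lives in bits [5:0] */

struct esr_fields {
	unsigned int ec;   /* exception class */
	unsigned int il;   /* 1 means a 32-bit instruction */
	unsigned int wnr;  /* 1 means the access was a write */
	unsigned int fsc;  /* fault status code */
};

/* Split a raw ESR value into the fields shown under "Mem abort info". */
static struct esr_fields decode_esr(uint64_t esr)
{
	struct esr_fields f;

	f.ec  = (unsigned int)((esr >> EX_ESR_EC_SHIFT) & EX_ESR_EC_MASK);
	f.il  = (unsigned int)((esr >> EX_ESR_IL_SHIFT) & 1u);
	f.wnr = (unsigned int)((esr >> EX_ESR_WNR_SHIFT) & 1u);
	f.fsc = (unsigned int)(esr & EX_ESR_FSC_MASK);
	return f;
}

/* Example use with the ESR value reported in the log. */
static void check_logged_esr(void)
{
	struct esr_fields f = decode_esr(0x9793002bULL);

	assert(f.ec == 0x25);   /* DABT (current EL) */
	assert(f.il == 1);      /* IL = 32 bits */
	assert(f.wnr == 0);     /* WnR = 0, i.e. a read */
	assert(f.fsc == 0x2b);  /* the "unknown 43" in the log */
}
```

Feeding in 0x9793002b gives EC = 0x25, IL = 1, WnR = 0 and FSC = 0x2b (decimal 43), which matches the "Mem abort info" and "Data abort info" lines in the report.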
It's triggered here by an LDTR in a get_user() on a bogus userspace address. The exception is expected, and it's supposed to be handled via the exception fixups, but the LPA2 patches didn't update the fault_info table entries for all the level -1 faults, and so those all get handled by do_bad() and don't call fixup_exception(), causing them to be fatal.
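The effect of a missing table entry can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `toy_*` names are hypothetical, the handlers just return whether the fault was recovered, and an unpopulated slot is treated like do_bad(). It is not the real arch/arm64/mm/fault.c logic, which does considerably more before reaching the exception fixups.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy handler: returns true if the fault was recovered from. */
typedef bool (*toy_handler)(bool has_fixup);

/* Stand-in for do_bad(): never consults the fixup, so never recovers. */
static bool toy_do_bad(bool has_fixup)
{
	(void)has_fixup;
	return false;
}

/* Stand-in for a translation fault handler: recovers if a fixup exists. */
static bool toy_do_translation_fault(bool has_fixup)
{
	return has_fixup;
}

struct toy_fault_info {
	toy_handler fn;      /* NULL is treated as toy_do_bad */
	const char *name;
};

/* Table without a level -1 entry: slot 0x2b stays empty. */
static const struct toy_fault_info toy_table_unpatched[64] = {
	[0x04] = { toy_do_translation_fault, "level 0 translation fault" },
	[0x05] = { toy_do_translation_fault, "level 1 translation fault" },
	[0x06] = { toy_do_translation_fault, "level 2 translation fault" },
	[0x07] = { toy_do_translation_fault, "level 3 translation fault" },
};

/* Same table with the level -1 translation fault slot filled in. */
static const struct toy_fault_info toy_table_patched[64] = {
	[0x04] = { toy_do_translation_fault, "level 0 translation fault" },
	[0x05] = { toy_do_translation_fault, "level 1 translation fault" },
	[0x06] = { toy_do_translation_fault, "level 2 translation fault" },
	[0x07] = { toy_do_translation_fault, "level 3 translation fault" },
	[0x2b] = { toy_do_translation_fault, "level -1 translation fault" },
};

/* Index the table by the low six bits of the ESR and run the handler. */
static bool toy_mem_abort(const struct toy_fault_info *table,
			  uint64_t esr, bool has_fixup)
{
	const struct toy_fault_info *inf;

	assert(table != NULL);
	inf = &table[esr & 0x3fu];
	return (inf->fn ? inf->fn : toy_do_bad)(has_fixup);
}
```

In this model, `toy_mem_abort(toy_table_unpatched, 0x9793002bULL, true)` is not recovered even though a fixup exists, while the same call with `toy_table_patched` is. That mirrors the description above: the get_user() fault is expected, but it is only survivable if the table routes it to a handler that honours the fixup.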
It should be relatively simple to update the fault_info table for the level -1 faults, but given the other issues we're seeing I think it's probably worth dropping the LPA2 patches for the moment.
Mark.
On Thu, 26 Oct 2023 at 17:30, Mark Rutland mark.rutland@arm.com wrote:
Thanks for the analysis Mark.
I agree that this should not be difficult to fix, but given the other CI problems and identified loose ends, I am not going to object to dropping this partially or entirely at this point. I'm sure everybody will be thrilled to go over those 60 patches again after I rebase them onto v6.7-rc1 :-)
On Thu, 26 Oct 2023 at 21:09, Ard Biesheuvel ardb@kernel.org wrote:
I am happy to test any proposed fix patch.
- Naresh
On Fri, 27 Oct 2023 at 12:57, Naresh Kamboju naresh.kamboju@linaro.org wrote:
Thanks Naresh. Patch attached.
On Sat, 28 Oct 2023 at 13:12, Ard Biesheuvel ardb@kernel.org wrote:
This patch did not solve the reported problem.

Test log links:
- https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/naresh/tests/2XTP1lXcU...
- Naresh
On Mon, 30 Oct 2023 at 09:07, Naresh Kamboju naresh.kamboju@linaro.org wrote:
Oops, sorry about that.
Fixed patch attached.
On Mon, 30 Oct 2023 at 13:45, Ard Biesheuvel ardb@kernel.org wrote:
Tested-by: Linux Kernel Functional Testing lkft@linaro.org
- Naresh
Hi Ard,
Your V2 patch works perfectly. Thanks for providing a fix patch.
- Naresh
On Mon, Oct 30, 2023 at 09:14:56AM +0100, Ard Biesheuvel wrote:
From 97dea432bceadfcece84484609374c277afc2c81 Mon Sep 17 00:00:00 2001
From: Ard Biesheuvel ardb@kernel.org
Date: Sat, 28 Oct 2023 09:40:29 +0200
Subject: [PATCH v2] Add missing ESR decoding for level -1 translation faults

Signed-off-by: Ard Biesheuvel ardb@kernel.org
As a heads-up, looking at this some more, we'll also need to rework the usage of ESR_ELx_FSC_TYPE and ESR_ELx_FSC_LEVEL, since those no longer work correctly for the Level -1 xFSC values. ESR_ELx_FSC_TYPE is 0x3c and ESR_ELx_FSC_LEVEL is 0x3, and they work on the basis that the xFSC fault types are encoded as xxxxyy, where xxxx is the type and yy is the level (0 to 3).
That didn't expand naturally to level -1. For example, Level {0,1,2,3} translation faults get reported as 0b0001xx, where the xx encodes the level, while Level -1 translation faults get reported as 0b101011.
That ends up affecting:
* All the is_${FOO}_fault() predicate functions, e.g. is_translation_fault(), is_el1_permission_fault() and is_spurious_el1_translation_fault().
* Places where we synthesize an xFSC value, e.g. set_thread_esr()
* A bunch of KVM code, due to the use of kvm_vcpu_trap_get_fault_type()
... and we probably need to remove ESR_ELx_FSC_TYPE and ESR_ELx_FSC_LEVEL entirely to avoid the possibility of misuse.
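The mismatch can be seen with a few lines of standalone C. This is a sketch for illustration only: the mask values 0x3c and 0x3 and the encodings 0b0001xx and 0b101011 are the ones quoted above, but the `EX_*` macros and both helper functions are hypothetical and are not the kernel's existing predicates or any proposed replacement.

```c
#include <assert.h>
#include <stdbool.h>

#define EX_FSC_TYPE_MASK         0x3cu  /* value quoted for ESR_ELx_FSC_TYPE */
#define EX_FSC_LEVEL_MASK        0x03u  /* value quoted for ESR_ELx_FSC_LEVEL */
#define EX_FSC_TYPE_TRANSLATION  0x04u  /* 0b0001xx, levels 0 to 3 */
#define EX_FSC_TRANSLATION_LM1   0x2bu  /* 0b101011, level -1 */

/* Masking the level -1 encoding yields type 0x28, not the 0x04 type. */
static_assert((EX_FSC_TRANSLATION_LM1 & EX_FSC_TYPE_MASK) == 0x28u,
	      "level -1 does not fit the xxxxyy scheme");

/* Type/level masking in the xxxxyy style: misses level -1 entirely. */
static bool masked_is_translation_fault(unsigned int fsc)
{
	return (fsc & EX_FSC_TYPE_MASK) == EX_FSC_TYPE_TRANSLATION;
}

/*
 * Explicit comparison against each encoding. Returns true for a
 * translation fault and stores the level (-1 to 3) in *level.
 */
static bool explicit_translation_fault_level(unsigned int fsc, int *level)
{
	if (fsc == EX_FSC_TRANSLATION_LM1) {
		*level = -1;
		return true;
	}
	if (fsc >= 0x04u && fsc <= 0x07u) {
		*level = (int)(fsc & EX_FSC_LEVEL_MASK);
		return true;
	}
	return false;
}
```

With these definitions, `masked_is_translation_fault(0x2b)` is false, and masking 0x2b with 0x3 would report level 3, whereas the explicit helper reports a translation fault at level -1. Comparing against full encodings is one way to avoid the problem; whether that is the right shape for the real rework is left open here.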
Mark.
 arch/arm64/mm/fault.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 2e5d1e238af9..13f192691060 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -780,18 +780,18 @@ static const struct fault_info fault_info[] = {
 	{ do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 1 translation fault" },
 	{ do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 2 translation fault" },
 	{ do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 3 translation fault" },
-	{ do_bad, SIGKILL, SI_KERNEL, "unknown 8" },
+	{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 0 access flag fault" },
 	{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 1 access flag fault" },
 	{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 2 access flag fault" },
 	{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 3 access flag fault" },
-	{ do_bad, SIGKILL, SI_KERNEL, "unknown 12" },
+	{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 0 permission fault" },
 	{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 1 permission fault" },
 	{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 2 permission fault" },
 	{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 3 permission fault" },
 	{ do_sea, SIGBUS, BUS_OBJERR, "synchronous external abort" },
 	{ do_tag_check_fault, SIGSEGV, SEGV_MTESERR, "synchronous tag check fault" },
 	{ do_bad, SIGKILL, SI_KERNEL, "unknown 18" },
-	{ do_bad, SIGKILL, SI_KERNEL, "unknown 19" },
+	{ do_sea, SIGKILL, SI_KERNEL, "level -1 (translation table walk)" },
 	{ do_sea, SIGKILL, SI_KERNEL, "level 0 (translation table walk)" },
 	{ do_sea, SIGKILL, SI_KERNEL, "level 1 (translation table walk)" },
 	{ do_sea, SIGKILL, SI_KERNEL, "level 2 (translation table walk)" },
@@ -799,7 +799,7 @@ static const struct fault_info fault_info[] = {
 	{ do_sea, SIGBUS, BUS_OBJERR, "synchronous parity or ECC error" }, // Reserved when RAS is implemented
 	{ do_bad, SIGKILL, SI_KERNEL, "unknown 25" },
 	{ do_bad, SIGKILL, SI_KERNEL, "unknown 26" },
-	{ do_bad, SIGKILL, SI_KERNEL, "unknown 27" },
+	{ do_sea, SIGKILL, SI_KERNEL, "level -1 synchronous parity error (translation table walk)" }, // Reserved when RAS is implemented
 	{ do_sea, SIGKILL, SI_KERNEL, "level 0 synchronous parity error (translation table walk)" }, // Reserved when RAS is implemented
 	{ do_sea, SIGKILL, SI_KERNEL, "level 1 synchronous parity error (translation table walk)" }, // Reserved when RAS is implemented
 	{ do_sea, SIGKILL, SI_KERNEL, "level 2 synchronous parity error (translation table walk)" }, // Reserved when RAS is implemented
@@ -813,9 +813,9 @@ static const struct fault_info fault_info[] = {
 	{ do_bad, SIGKILL, SI_KERNEL, "unknown 38" },
 	{ do_bad, SIGKILL, SI_KERNEL, "unknown 39" },
 	{ do_bad, SIGKILL, SI_KERNEL, "unknown 40" },
-	{ do_bad, SIGKILL, SI_KERNEL, "unknown 41" },
+	{ do_bad, SIGKILL, SI_KERNEL, "level -1 address size fault" },
 	{ do_bad, SIGKILL, SI_KERNEL, "unknown 42" },
-	{ do_bad, SIGKILL, SI_KERNEL, "unknown 43" },
+	{ do_translation_fault, SIGSEGV, SEGV_MAPERR, "level -1 translation fault" },
 	{ do_bad, SIGKILL, SI_KERNEL, "unknown 44" },
 	{ do_bad, SIGKILL, SI_KERNEL, "unknown 45" },
 	{ do_bad, SIGKILL, SI_KERNEL, "unknown 46" },
--
2.42.0.820.g83a721a137-goog
On Thu, Oct 26, 2023 at 05:39:11PM +0200, Ard Biesheuvel wrote:
On Thu, 26 Oct 2023 at 17:30, Mark Rutland mark.rutland@arm.com wrote:
On Thu, Oct 26, 2023 at 08:11:26PM +0530, Naresh Kamboju wrote:
Following kernel crash noticed on qemu-arm64 while running LTP syscalls set_robust_list test case running Linux next 6.6.0-rc7-next-20231026 and 6.6.0-rc7-next-20231025.
BAD: next-20231025 Good: next-20231024
Reported-by: Linux Kernel Functional Testing lkft@linaro.org
Reported-by: Naresh Kamboju naresh.kamboju@linaro.org

Log:
<1>[ 203.119139] Unable to handle kernel unknown 43 at virtual address 0001ffff9e2e7d78
<1>[ 203.119838] Mem abort info:
<1>[ 203.120064] ESR = 0x000000009793002b
<1>[ 203.121040] EC = 0x25: DABT (current EL), IL = 32 bits
set_robust_list01 1 TPASS : set_robust_list: retval = -1 (expected -1), errno = 22 (expected 22)
set_robust_list01 2 TPASS : set_robust_list: retval = 0 (expected 0), errno = 0 (expected 0)
<1>[ 203.124496] SET = 0, FnV = 0
<1>[ 203.124778] EA = 0, S1PTW = 0
<1>[ 203.125029] FSC = 0x2b: unknown 43
It looks like this is fallout from the LPA2 enablement.
According to the latest ARM ARM (ARM DDI 0487J.a), page D19-6475, that "unknown 43" (0x2b / 0b101011) is the DFSC for a level -1 translation fault:
0b101011 When FEAT_LPA2 is implemented: Translation fault, level -1.
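As a sanity check, the fields printed in the log can be pulled out of the raw ESR value by hand. This is a minimal sketch, assuming the usual ESR_ELx data abort layout (EC in bits [31:26], IL in bit 25, SRT in bits [20:16], WnR in bit 6, DFSC in bits [5:0]); the helper names are made up for illustration and are not kernel functions.

```c
#include <stdint.h>

/* Exception class, bits [31:26]. */
static unsigned int esr_ec(uint64_t esr)
{
    return (unsigned int)((esr >> 26) & 0x3f);
}

/* Instruction length bit, bit 25 (1 means a 32-bit instruction). */
static unsigned int esr_il(uint64_t esr)
{
    return (unsigned int)((esr >> 25) & 0x1);
}

/* Syndrome register transfer (the Xt/Wt register number), bits [20:16]. */
static unsigned int esr_srt(uint64_t esr)
{
    return (unsigned int)((esr >> 16) & 0x1f);
}

/* Write-not-read, bit 6 (0 means the access was a read). */
static unsigned int esr_wnr(uint64_t esr)
{
    return (unsigned int)((esr >> 6) & 0x1);
}

/* Data fault status code, bits [5:0]. */
static unsigned int esr_dfsc(uint64_t esr)
{
    return (unsigned int)(esr & 0x3f);
}
```

With the ESR from the report, esr_ec(0x9793002b) gives 0x25 and esr_dfsc(0x9793002b) gives 0x2b, which matches the "EC = 0x25" and "FSC = 0x2b" lines in the log; SRT decodes to 19 and WnR to 0, also as logged.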
It's triggered here by an LDTR in a get_user() on a bogus userspace address. The exception is expected, and it's supposed to be handled via the exception fixups, but the LPA2 patches didn't update the fault_info table entries for all the level -1 faults, and so those all get handled by do_bad() and don't call fixup_exception(), causing them to be fatal.
It should be relatively simple to update the fault_info table for the level -1 faults, but given the other issues we're seeing I think it's probably worth dropping the LPA2 patches for the moment.
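The failure mode described above can be modelled in a few lines. This is a heavily simplified userspace sketch, not the code in arch/arm64/mm/fault.c: the struct, the sketch_* handlers and the has_fixup flag are all assumptions made for illustration, with "handled" standing in for the exception fixup path being reached.

```c
#include <stdbool.h>
#include <stddef.h>

enum { FSC_LEVEL_M1_TRANSLATION = 0x2b };

/* A handler returns true if the fault was dealt with. */
typedef bool (*fault_fn)(bool has_fixup);

struct fault_entry {
    fault_fn fn;
    const char *name;
};

/* Stand-in for do_bad(): never consults the fixups, so never handles. */
static bool sketch_do_bad(bool has_fixup)
{
    (void)has_fixup;
    return false;
}

/* Stand-in for a handler that ends up trying the exception fixups. */
static bool sketch_do_translation_fault(bool has_fixup)
{
    return has_fixup;
}

/* Entry 43 before and after the table update; other slots are left empty. */
static const struct fault_entry before_fix[64] = {
    [FSC_LEVEL_M1_TRANSLATION] = { sketch_do_bad, "unknown 43" },
};

static const struct fault_entry after_fix[64] = {
    [FSC_LEVEL_M1_TRANSLATION] = { sketch_do_translation_fault,
                                   "level -1 translation fault" },
};

/* Index the table by the low six bits of the ESR and run the handler. */
static bool handle_abort(const struct fault_entry *table, unsigned int esr,
                         bool has_fixup)
{
    const struct fault_entry *e = &table[esr & 0x3f];

    return e->fn != NULL ? e->fn(has_fixup) : false;
}
```

In this model, handle_abort(before_fix, 0x9793002b, true) is unhandled even though a fixup exists, which corresponds to the fatal Oops in the report, while the same call against after_fix is handled.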
Thanks for the analysis Mark.
I agree that this should not be difficult to fix, but given the other CI problems and identified loose ends, I am not going to object to dropping this partially or entirely at this point. I'm sure everybody will be thrilled to go over those 60 patches again after I rebase them onto v6.7-rc1 :-)
FWIW, I'm more than happy to try; the issue has largely been finding the time. Hopefully that'll be a bit easier after LPC!
Mark.