While running LTP controllers following kernel crash noticed on qemu-x86_64 compat mode with stable-rc 6.3.4-rc2.
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Linux version 6.3.4-rc2 (tuxmake@tuxmake) (x86_64-linux-gnu-gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC @1684862676
..
./runltp -f controllers
...
cpuset_inherit 11 TPASS: cpus: Inherited information is right!
cpuset_inherit 13 TPASS: mems: Inherited information is right!
<4>[ 1130.117922] int3: 0000 [#1] PREEMPT SMP PTI
<4>[ 1130.118132] CPU: 0 PID: 32748 Comm: cpuset_inherit_ Not tainted 6.3.4-rc2 #1
<4>[ 1130.118216] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
<4>[ 1130.118320] RIP: 0010:__alloc_pages+0xeb/0x340
<4>[ 1130.118605] Code: 48 c1 e0 04 48 8d 84 01 00 13 00 00 48 89 45 a8 8b 05 d9 31 cf 01 85 c0 0f 85 05 02 00 00 89 d8 c1 e8 03 83 e0 03 89 45 c0 66 <90> 41 89 df 41 be 01 00 00 00 f6 c7 04 75 66 44 89 e6 89 df e8 ec
<4>[ 1130.118653] RSP: 0018:ffffa3d085d07b08 EFLAGS: 00000246
<4>[ 1130.118694] RAX: 0000000000000000 RBX: 0000000000400dc0 RCX: ffffa2b9ffffa000
<4>[ 1130.118706] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000400dc0
<4>[ 1130.118717] RBP: ffffa3d085d07b60 R08: 00007fffffffe000 R09: 00007fffffffefff
<4>[ 1130.118728] R10: ffffa2b981faaa0c R11: 0000000000000000 R12: 0000000000000000
<4>[ 1130.118739] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fffffffefec
<4>[ 1130.118783] FS: 0000000000000000(0003) GS:ffffa2b9fbc00000(0063) knlGS:00000000f7f05880
<4>[ 1130.118798] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
<4>[ 1130.118810] CR2: 00000000f7c10bec CR3: 00000001085ba000 CR4: 00000000000006f0
<4>[ 1130.118899] Call Trace:
<4>[ 1130.118974] <TASK>
<4>[ 1130.119069] alloc_pages+0x94/0x140
<4>[ 1130.119128] get_zeroed_page+0x1d/0x50
<4>[ 1130.119142] __pud_alloc+0x33/0xe0
<4>[ 1130.119156] __handle_mm_fault+0x50c/0x1310
<4>[ 1130.119175] handle_mm_fault+0xf8/0x320
<4>[ 1130.119187] ? check_vma_flags+0x53/0x130
<4>[ 1130.119199] __get_user_pages+0x1ed/0x600
<4>[ 1130.119214] get_user_pages_remote+0x137/0x3c0
<4>[ 1130.119229] get_arg_page+0x65/0x150
<4>[ 1130.119245] copy_string_kernel+0xd7/0x1e0
<4>[ 1130.119258] do_execveat_common.isra.0+0x11e/0x240
<4>[ 1130.119272] __ia32_compat_sys_execve+0x3f/0x50
<4>[ 1130.119285] __do_fast_syscall_32+0x6b/0xf0
<4>[ 1130.119300] do_fast_syscall_32+0x38/0x80
<4>[ 1130.119312] do_SYSENTER_32+0x23/0x30
<4>[ 1130.119324] entry_SYSENTER_compat_after_hwframe+0x7f/0x91
<4>[ 1130.119374] RIP: 0023:0xf7f0a579
<4>[ 1130.119570] Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
<4>[ 1130.119578] RSP: 002b:00000000ffcc16e8 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
<4>[ 1130.119594] RAX: ffffffffffffffda RBX: 00000000086cc480 RCX: 00000000086d8810
<4>[ 1130.119600] RDX: 00000000086dc490 RSI: 00000000086cc480 RDI: 0000000000000020
<4>[ 1130.119605] RBP: 00000000086d6270 R08: 0000000000000000 R09: 0000000000000000
<4>[ 1130.119610] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
<4>[ 1130.119614] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
<4>[ 1130.119652] </TASK>
<4>[ 1130.119698] Modules linked in:
<4>[ 1130.148538] ---[ end trace 0000000000000000 ]---
<4>[ 1130.148708] RIP: 0010:__alloc_pages+0xeb/0x340
<4>[ 1130.148907] Code: 48 c1 e0 04 48 8d 84 01 00 13 00 00 48 89 45 a8 8b 05 d9 31 cf 01 85 c0 0f 85 05 02 00 00 89 d8 c1 e8 03 83 e0 03 89 45 c0 66 <90> 41 89 df 41 be 01 00 00 00 f6 c7 04 75 66 44 89 e6 89 df e8 ec
<4>[ 1130.148923] RSP: 0018:ffffa3d085d07b08 EFLAGS: 00000246
<4>[ 1130.148947] RAX: 0000000000000000 RBX: 0000000000400dc0 RCX: ffffa2b9ffffa000
<4>[ 1130.148952] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000400dc0
<4>[ 1130.148958] RBP: ffffa3d085d07b60 R08: 00007fffffffe000 R09: 00007fffffffefff
<4>[ 1130.148963] R10: ffffa2b981faaa0c R11: 0000000000000000 R12: 0000000000000000
<4>[ 1130.148968] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fffffffefec
<4>[ 1130.148974] FS: 0000000000000000(0003) GS:ffffa2b9fbc00000(0063) knlGS:00000000f7f05880
<4>[ 1130.148981] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
<4>[ 1130.148987] CR2: 00000000f7c10bec CR3: 00000001085ba000 CR4: 00000000000006f0
<0>[ 1130.149129] Kernel panic - not syncing: Fatal exception in interrupt
<0>[ 1130.152835] Kernel Offset: 0x8400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Links:
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.3-...
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.3-...
- https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.3-...
-- Linaro LKFT https://lkft.linaro.org
On Wed, May 24, 2023, at 11:02, Naresh Kamboju wrote:
While running LTP controllers following kernel crash noticed on qemu-x86_64 compat mode with stable-rc 6.3.4-rc2.
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Linux version 6.3.4-rc2 (tuxmake@tuxmake) (x86_64-linux-gnu-gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC @1684862676
..
./runltp -f controllers
...
cpuset_inherit 11 TPASS: cpus: Inherited information is right!
cpuset_inherit 13 TPASS: mems: Inherited information is right!
<4>[ 1130.117922] int3: 0000 [#1] PREEMPT SMP PTI
<4>[ 1130.118132] CPU: 0 PID: 32748 Comm: cpuset_inherit_ Not tainted 6.3.4-rc2 #1
<4>[ 1130.118216] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
<4>[ 1130.118320] RIP: 0010:__alloc_pages+0xeb/0x340
<4>[ 1130.118605] Code: 48 c1 e0 04 48 8d 84 01 00 13 00 00 48 89 45 a8 8b 05 d9 31 cf 01 85 c0 0f 85 05 02 00 00 89 d8 c1 e8 03 83 e0 03 89 45 c0 66 <90> 41 89 df 41 be 01 00 00 00 f6 c7 04 75 66 44 89 e6 89 df e8 ec
I haven't figured out what is going on here, but I tracked down the trapping instruction <90> to the middle of the 'xchg %ax,%ax' two-byte nop in:
ffffffff814218f4: 83 e0 03             and    $0x3,%eax
ffffffff814218f7: 89 45 c0             mov    %eax,-0x40(%rbp)
ffffffff814218fa: 66 90                xchg   %ax,%ax
ffffffff814218fc: 41 89 df             mov    %ebx,%r15d
ffffffff814218ff: 41 be 01 00 00 00    mov    $0x1,%r14d
which in turn is the cpusets_enabled() check in prepare_alloc_pages().
static inline bool cpusets_enabled(void)
{
	return static_branch_unlikely(&cpusets_enabled_key);
}

static __always_inline bool arch_static_branch(struct static_key *key, bool branch)
{
	asm_volatile_goto("1:"
		"jmp %l[l_yes] # objtool NOPs this \n\t"
		JUMP_TABLE_ENTRY
		: : "i" (key), "i" (2 | branch) : : l_yes);

	return false;
l_yes:
	return true;
}
I don't see any changes related to this between 6.3.3 and 6.3.4-rc2.
Arnd
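As a cross-check of the analysis above, the position of the marked byte can be recovered directly from the `Code:` line of the oops. The sketch below is only an illustration: `parse_code_line` is a hypothetical helper, not a kernel tool, and it assumes the usual oops layout in which the byte at the reported RIP is wrapped in `<..>`. It confirms that the marked `90` is preceded by `66`, i.e. it is the second byte of the two-byte nop. Note that an int3 trap reports the address of the byte after the 0xCC opcode, so a marker one byte past an instruction start is what a breakpoint on that instruction's first byte would produce; whether that is what happened here is not established in the thread.

```python
import re


def parse_code_line(code):
    """Return (raw bytes, index of the <xx>-marked byte) for an oops 'Code:' payload."""
    data = []
    marked = None
    for index, token in enumerate(code.split()):
        match = re.fullmatch(r"<([0-9a-fA-F]{2})>", token)
        if match:
            marked = index          # this byte sits at the reported RIP
            token = match.group(1)
        data.append(int(token, 16))
    if marked is None:
        raise ValueError("no <xx> marker found in Code: line")
    return bytes(data), marked


# Code: payload copied from the report above.
CODE = (
    "48 c1 e0 04 48 8d 84 01 00 13 00 00 48 89 45 a8 8b 05 d9 31 cf 01 "
    "85 c0 0f 85 05 02 00 00 89 d8 c1 e8 03 83 e0 03 89 45 c0 66 <90> "
    "41 89 df 41 be 01 00 00 00 f6 c7 04 75 66 44 89 e6 89 df e8 ec"
)

data, idx = parse_code_line(CODE)
rip_offset = 0xEB                   # from "RIP: 0010:__alloc_pages+0xeb/0x340"
dump_start = rip_offset - idx       # function offset of the first dumped byte

print(f"marked byte 0x{data[idx]:02x} at dump index {idx}")
print(f"previous byte 0x{data[idx - 1]:02x}")
print(f"dump starts at __alloc_pages+0x{dump_start:x}")
```

The kernel tree's `scripts/decodecode` does a fuller job by disassembling the dump; this sketch only locates the marker.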
From: Arnd Bergmann
Sent: 24 May 2023 12:18
On Wed, May 24, 2023, at 11:02, Naresh Kamboju wrote:
While running LTP controllers following kernel crash noticed on qemu-x86_64 compat mode with stable-rc 6.3.4-rc2.
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Linux version 6.3.4-rc2 (tuxmake@tuxmake) (x86_64-linux-gnu-gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC @1684862676
..
./runltp -f controllers
...
cpuset_inherit 11 TPASS: cpus: Inherited information is right!
cpuset_inherit 13 TPASS: mems: Inherited information is right!
<4>[ 1130.117922] int3: 0000 [#1] PREEMPT SMP PTI
<4>[ 1130.118132] CPU: 0 PID: 32748 Comm: cpuset_inherit_ Not tainted 6.3.4-rc2 #1
<4>[ 1130.118216] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
<4>[ 1130.118320] RIP: 0010:__alloc_pages+0xeb/0x340
<4>[ 1130.118605] Code: 48 c1 e0 04 48 8d 84 01 00 13 00 00 48 89 45 a8 8b 05 d9 31 cf 01 85 c0 0f 85 05 02 00 00 89 d8 c1 e8 03 83 e0 03 89 45 c0 66 <90> 41 89 df 41 be 01 00 00 00 f6 c7 04 75 66 44 89 e6 89 df e8 ec
I haven't figured out what is going on here, but I tracked down the trapping instruction <90> to the middle of the 'xchg %ax,%ax' two-byte nop in:
ffffffff814218f4: 83 e0 03             and    $0x3,%eax
ffffffff814218f7: 89 45 c0             mov    %eax,-0x40(%rbp)
ffffffff814218fa: 66 90                xchg   %ax,%ax
ffffffff814218fc: 41 89 df             mov    %ebx,%r15d
ffffffff814218ff: 41 be 01 00 00 00    mov    $0x1,%r14d
which in turn is the cpusets_enabled() check in prepare_alloc_pages().
Does that code actually match the call/return stack?
It is pretty much impossible to get a trap on an 0x90 byte. I think you'd need to jump to it and then get a page fault.
So I bet that isn't the code that was actually being executed. So either the fault address is garbage or something horrid(tm) has happened to the page tables.
David
On Wed, May 24, 2023 at 02:32:20PM +0530, Naresh Kamboju wrote:
While running LTP controllers following kernel crash noticed on qemu-x86_64 compat mode with stable-rc 6.3.4-rc2.
Both your reports are stable-rc 6.3.4-rc2; can I assume that stable 6.3.3 is good?
Either way, could you please:
1) try linus/master
2) bisect stable-rc
I don't immediately see a patch in that tree that would cause either of these things.
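For readers unfamiliar with the bisect request above, the idea can be sketched as a binary search over an ordered history. This is a minimal model only: `first_bad`, the revision labels, and the `is_bad` callback are all hypothetical, and in practice `git bisect` drives the search while each step means building and booting a kernel. The sketch assumes the oldest revision is good, the newest is bad, and every revision after the first bad one is also bad; if the crash is intermittent, each step may need several test runs before a revision can be called good.

```python
from typing import Callable, Sequence


def first_bad(revisions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search an ordered history for the first bad revision.

    Assumes revisions[0] is known good and revisions[-1] is known bad.
    """
    lo, hi = 0, len(revisions) - 1      # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid                    # first bad revision is at or before mid
        else:
            lo = mid                    # first bad revision is after mid
    return revisions[hi]


# Hypothetical history: labels are placeholders, not real commits.
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
tested = []


def crashes(rev: str) -> bool:
    """Stand-in for 'build, boot and run the tests'; c5 onwards is bad here."""
    tested.append(rev)
    return history.index(rev) >= 5


culprit = first_bad(history, crashes)
print(culprit, "found after testing", tested)
```

With eight revisions the search needs only three test runs, which is why bisecting is practical even for long histories.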
Hi Peter,
On Wed, 24 May 2023 at 19:37, Peter Zijlstra <peterz@infradead.org> wrote:
On Wed, May 24, 2023 at 02:32:20PM +0530, Naresh Kamboju wrote:
While running LTP controllers following kernel crash noticed on qemu-x86_64 compat mode with stable-rc 6.3.4-rc2.
Both your reports are stable-rc 6.3.4-rc2; can I assume that stable 6.3.3 is good?
It was not good. Starting from 6.3.1-rc1, these issues were there on both i386 and x86_64.
I need to check back on other branches and compare it with Linux mainline and Linux next master branches.
Either way, could you please:
- try linus/master
- bisect stable-rc
I don't immediately see a patch in that tree that would cause either of these things.
Thanks for asking these questions. I should have included this information in my earlier email. I have been noticing this from day one on stable-rc 6.3.1-rc1.
As per your suggestions, I will try to reproduce on other trees and branches and get back to you.
FYI, These are running in AWS cloud as qemu-i386 and qemu-x86_64.
A few old links showing the history of the problem. https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-6.3.y/build/v6.3.3-...
i386:
====
Boot failed due to the following kernel crash.
<6>[ 2.078988] sched_clock: Marking stable (2023078833, 55554488)->(2088116191, -9482870)
<4>[ 2.081669] int3: 0000 [#1] PREEMPT SMP
<4>[ 2.082070] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.3.3-rc1 #1
<4>[ 2.082174] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
<4>[ 2.082326] EIP: sched_clock_cpu+0xa/0x2b0

i386: while running LTP controllers tests
====
<4>[ 888.113619] int3: 0000 [#1] PREEMPT SMP
<4>[ 888.113966] CPU: 0 PID: 8805 Comm: pids.sh Not tainted 6.3.1-rc1 #1
<4>[ 888.114134] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
<4>[ 888.114360] EIP: get_page_from_freelist+0xf1/0xc70

x86_64: while running LTP controllers tests
======
<4>[ 3182.753415] int3: 0000 [#1] PREEMPT SMP PTI
<4>[ 3182.755092] CPU: 0 PID: 69163 Comm: cgroup_fj_stres Not tainted 6.3.1-rc1 #1
<4>[ 3182.755228] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
<4>[ 3182.755394] RIP: 0010:__alloc_pages+0xeb/0x340

x86_64: while running LTP tracing tests
======
<4>[ 52.392251] int3: 0000 [#1] PREEMPT SMP PTI
<4>[ 52.392648] CPU: 0 PID: 331 Comm: journal-offline Not tainted 6.3.3-rc1 #1
<4>[ 52.392794] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
<4>[ 52.393070] RIP: 0010:syscall_trace_enter.constprop.0+0x1/0x1b0
- Naresh
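To compare reports like the ones above side by side, the key fields of each oops (trap type, task, kernel version, faulting symbol) can be pulled out with a few regular expressions. The following is a rough sketch, not an existing tool: `summarize_oops` and the regexes are assumptions tuned to the log lines quoted in this thread, and they would need adjusting for tainted kernels or task names containing spaces.

```python
import re

# Patterns written against the log excerpts in this thread only.
TRAP_RE = re.compile(r"\] (?P<trap>\w+): [0-9a-f]{4} \[#\d+\]")
COMM_RE = re.compile(r"Comm: (?P<comm>\S+) Not tainted (?P<version>\S+)")
IP_RE = re.compile(
    r"\b[ER]IP: (?:[0-9a-f]{4}:)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+(?P<offset>0x[0-9a-f]+)/"
)


def summarize_oops(text):
    """Extract trap type, task name, kernel version and faulting symbol."""
    trap = TRAP_RE.search(text)
    comm = COMM_RE.search(text)
    ip = IP_RE.search(text)
    return {
        "trap": trap.group("trap") if trap else None,
        "comm": comm.group("comm") if comm else None,
        "version": comm.group("version") if comm else None,
        "symbol": ip.group("symbol") if ip else None,
        "offset": ip.group("offset") if ip else None,
    }


# Excerpts copied from the reports above.
I386_LTP = """<4>[ 888.113619] int3: 0000 [#1] PREEMPT SMP
<4>[ 888.113966] CPU: 0 PID: 8805 Comm: pids.sh Not tainted 6.3.1-rc1 #1
<4>[ 888.114360] EIP: get_page_from_freelist+0xf1/0xc70"""

X86_64_TRACING = """<4>[ 52.392251] int3: 0000 [#1] PREEMPT SMP PTI
<4>[ 52.392648] CPU: 0 PID: 331 Comm: journal-offline Not tainted 6.3.3-rc1 #1
<4>[ 52.393070] RIP: 0010:syscall_trace_enter.constprop.0+0x1/0x1b0"""

for sample in (I386_LTP, X86_64_TRACING):
    print(summarize_oops(sample))
```

Every excerpt in this message reports the same trap type (`int3`) at a different symbol, which is the pattern such a summary makes easy to see.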
On Wed, May 24, 2023 at 09:39:50PM +0530, Naresh Kamboju wrote:
FYI, These are running in AWS cloud as qemu-i386 and qemu-x86_64.
Are these hosted on x86 and using KVM or are they hosted on Graviton and using TCG x86 ?
Supposedly TCG x86 is known 'funny' and if that's what you're using it would be very good to confirm the problem on x86 hardware.
On Wed, May 24, 2023, at 19:54, Peter Zijlstra wrote:
On Wed, May 24, 2023 at 09:39:50PM +0530, Naresh Kamboju wrote:
FYI, These are running in AWS cloud as qemu-i386 and qemu-x86_64.
Are these hosted on x86 and using KVM or are they hosted on Graviton and using TCG x86 ?
Supposedly TCG x86 is known 'funny' and if that's what you're using it would be very good to confirm the problem on x86 hardware.
Even on x86 cloud instances you are likely to run with TCG if the host does not support nested virtualization. So the question really is what specific cloud instance type this was running on, and if KVM was actually used or not. From what I could find on the web, Amazon EC2 only supports KVM guests inside of bare-metal instances but not any of the normal virtualized ones, while other providers using KVM (Google, Microsoft, ...) do support nested guests.
Arnd
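One way to answer the KVM-or-TCG question on the machine that launches QEMU is to check whether the CPU advertises hardware virtualization and whether `/dev/kvm` is usable. The sketch below is an assumption-laden illustration: `cpu_virt_flags` and `kvm_usable` are hypothetical helper names, and the check only says whether KVM could be used. QEMU still has to be asked for it (for example with `-accel kvm`), so the QEMU command line remains the authoritative source.

```python
import os


def cpu_virt_flags(cpuinfo_text):
    """Return the hardware virtualization flags (vmx, svm) found in cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, values = line.partition(":")
            found |= {"vmx", "svm"} & set(values.split())
    return found


def kvm_usable(dev_path="/dev/kvm"):
    """True if the KVM device node exists and is readable and writable."""
    return os.access(dev_path, os.R_OK | os.W_OK)


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as handle:
        flags = cpu_virt_flags(handle.read())
    print("virtualization flags:", sorted(flags) or "none")
    print("/dev/kvm usable:", kvm_usable())
```

If no flag is listed or `/dev/kvm` is not usable on the host, QEMU can only be running the guest under TCG.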
On Thu, 25 May 2023 at 02:03, Arnd Bergmann <arnd@arndb.de> wrote:
On Wed, May 24, 2023, at 19:54, Peter Zijlstra wrote:
On Wed, May 24, 2023 at 09:39:50PM +0530, Naresh Kamboju wrote:
FYI, These are running in AWS cloud as qemu-i386 and qemu-x86_64.
Are these hosted on x86 and using KVM or are they hosted on Graviton and using TCG x86 ?
Supposedly TCG x86 is known 'funny' and if that's what you're using it would be very good to confirm the problem on x86 hardware.
I see the following logs while booting.
<3>[ 1.834686] kvm_intel: VMX not supported by CPU 0
<3>[ 1.835860] kvm_amd: SVM not supported by CPU 0, not amd or hygon
And they are running on x86 machines.