Regressions found on qemu-x86_64 with compat mode (64-bit kernel running on 32-bit userspace) while running LTP tracing test suite on Linux next-20250605 tag kernel.
Regressions found on
- LTP tracing

Regression Analysis:
- New regression? Yes
- Reproducible? Intermittent
Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
Reported-by: Linux Kernel Functional Testing lkft@linaro.org
## Test log ftrace-stress-test: <12>[ 21.971153] /usr/local/bin/kirk[277]: starting test ftrace-stress-test (ftrace_stress_test.sh 90) <4>[ 58.997439] Oops: int3: 0000 [#1] SMP PTI <4>[ 58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted 6.15.0-next-20250605 #1 PREEMPT(voluntary) <4>[ 58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 <4>[ 58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50 <4>[ 58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15 12 e4 fe <4>[ 58.998610] RSP: 0018:ffff9494007bbe98 EFLAGS: 00000246 <4>[ 58.998715] RAX: ffff912a042edd00 RBX: 000000000000000b RCX: 0000000000000000 <4>[ 58.998727] RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff912a00f2c8c0 <4>[ 58.998737] RBP: ffff9494007bbeb8 R08: 0000000000000000 R09: 0000000000000000 <4>[ 58.998748] R10: 0000000000000000 R11: 0000000000000000 R12: ffff912a00f2c8c0 <4>[ 58.998759] R13: ffff912a00f2c840 R14: 0000000000000006 R15: 0000000000000000 <4>[ 58.998804] FS: 0000000000000000(0000) GS:ffff912ad7cbf000(0063) knlGS:00000000f7f05580 <4>[ 58.998821] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 <4>[ 58.998832] CR2: 00000000f7d8f890 CR3: 000000010124e000 CR4: 00000000000006f0 <4>[ 58.998915] Call Trace: <4>[ 58.999010] <TASK> <4>[ 58.999077] ? file_close_fd+0x32/0x60 <4>[ 58.999147] __ia32_sys_close+0x18/0x90 <4>[ 58.999172] ia32_sys_call+0x1c3c/0x27e0 <4>[ 58.999183] __do_fast_syscall_32+0x79/0x1e0 <4>[ 58.999194] do_fast_syscall_32+0x37/0x80 <4>[ 58.999203] do_SYSENTER_32+0x23/0x30 <4>[ 58.999211] entry_SYSENTER_compat_after_hwframe+0x84/0x8e <4>[ 58.999254] RIP: 0023:0xf7f0c579 <4>[ 58.999459] Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 2e 8d b4 26 00 00 00 00 8d b4 26 00 00 00 <4>[ 58.999466] RSP: 002b:00000000fff98500 EFLAGS: 00000206 ORIG_RAX: 0000000000000006 <4>[ 58.999479] RAX: ffffffffffffffda RBX: 000000000000000b RCX: 0000000000000000 <4>[ 58.999484] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 <4>[ 58.999488] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 <4>[ 58.999492] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 <4>[ 58.999497] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 <4>[ 58.999534] </TASK> <4>[ 58.999579] Modules linked in: <4>[ 58.999895] ---[ end trace 0000000000000000 ]--- <4>[ 58.999892] Oops: int3: 0000 [#2] SMP PTI <4>[ 58.999997] RIP: 0010:_raw_spin_lock+0x5/0x50 <4>[ 59.000008] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15 12 e4 fe <4>[ 59.000010] CPU: 1 UID: 0 PID: 339 Comm: sh Tainted: G D 6.15.0-next-20250605 #1 PREEMPT(voluntary) <4>[ 59.000014] RSP: 0018:ffff9494007bbe98 EFLAGS: 00000246 <4>[ 59.000021] RAX: ffff912a042edd00 RBX: 000000000000000b RCX: 0000000000000000 <4>[ 59.000026] RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff912a00f2c8c0 <4>[ 59.000030] RBP: ffff9494007bbeb8 R08: 0000000000000000 R09: 0000000000000000 <4>[ 59.000040] R10: 0000000000000000 R11: 0000000000000000 R12: ffff912a00f2c8c0 <4>[ 59.000044] R13: ffff912a00f2c840 R14: 0000000000000006 R15: 0000000000000000 <4>[ 59.000049] FS: 0000000000000000(0000) 
GS:ffff912ad7cbf000(0063) knlGS:00000000f7f05580 <4>[ 59.000054] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 <4>[ 59.000059] CR2: 00000000f7d8f890 CR3: 000000010124e000 CR4: 00000000000006f0 <4>[ 59.000070] Tainted: [D]=DIE <4>[ 59.000080] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 <4>[ 59.000085] RIP: 0010:_raw_spin_lock+0x5/0x50 <4>[ 59.000101] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15 12 e4 fe <4>[ 59.000108] RSP: 0018:ffff9494000e0e88 EFLAGS: 00000097 <4>[ 59.000117] RAX: 0000000000010002 RBX: ffff912a7bd29500 RCX: ffff912a7bd2a400 <0>[ 59.000179] Kernel panic - not syncing: Fatal exception in interrupt <0>[ 60.592321] Shutting down cpus with NMI <0>[ 60.593242] Kernel Offset: 0x20800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) <0>[ 60.618536] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
## Source
* Kernel version: 6.15.0-next-20250605
* Git tree: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git
* Git sha: 4f27f06ec12190c7c62c722e99ab6243dea81a94

## Build
* Test log: https://qa-reports.linaro.org/api/testruns/28675335/log_file/
* Build link: https://storage.tuxsuite.com/public/linaro/lkft/builds/2y4whKazVqJKOUFD08taH...
* Kernel config: https://storage.tuxsuite.com/public/linaro/lkft/builds/2y4whKazVqJKOUFD08taH...
-- Linaro LKFT https://lkft.linaro.org
On Thu, 5 Jun 2025 17:12:10 +0530 Naresh Kamboju naresh.kamboju@linaro.org wrote:
Regressions found on qemu-x86_64 with compat mode (64-bit kernel running on 32-bit userspace) while running LTP tracing test suite on Linux next-20250605 tag kernel.
Regressions found on
- LTP tracing
Regression Analysis:
- New regression? Yes
- Reproducible? Intermittent
Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
Reported-by: Linux Kernel Functional Testing lkft@linaro.org
## Test log
ftrace-stress-test:
<12>[ 21.971153] /usr/local/bin/kirk[277]: starting test ftrace-stress-test (ftrace_stress_test.sh 90)
<4>[ 58.997439] Oops: int3: 0000 [#1] SMP PTI
<4>[ 58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted 6.15.0-next-20250605 #1 PREEMPT(voluntary)
<4>[ 58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
<4>[ 58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
Interesting. This hits a stray int3 for ftrace on _raw_spin_lock.
Here is the compiled code of _raw_spin_lock.
ffffffff825daa00 <_raw_spin_lock>:
ffffffff825daa00:       f3 0f 1e fa             endbr64
ffffffff825daa04:       e8 47 a6 d5 fe          call   ffffffff81335050 <__fentry__>
Since the int3 exception is raised after the 1-byte int3 has been decoded, the reported RIP `_raw_spin_lock+0x05` is not an instruction boundary.
<4>[ 58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15 12 e4 fe
And the call has already been modified back to a 5-byte NOP by the time the code is dumped. Thus it may have hit the intermediate int3 used while transforming the code.
   e8 47 a6 d5 fe
(first step)
   cc 47 a6 d5 fe
(second step)
   cc 1f 44 00 00   <- hit?
(third step)
   0f 1f 44 00 00   <- handle int3
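For reference, a minimal userspace sketch of just these byte transitions (it only models the buffer contents; the real logic, including the INT3 handler and the sync IPIs, lives in smp_text_poke_batch_finish() in arch/x86/kernel/alternative.c, and the byte values are the ones quoted above):

------
/*
 * Sketch of the three-step transition discussed above:
 * call __fentry__ (e8 ..) -> int3 + old tail -> int3 + new tail -> nop5.
 * It does not model the cross-CPU synchronization between the steps.
 */
#include <stdio.h>
#include <string.h>

static void dump(const char *what, const unsigned char *p)
{
    printf("%-13s %02x %02x %02x %02x %02x\n", what, p[0], p[1], p[2], p[3], p[4]);
}

int main(void)
{
    /* Bytes taken from the disassembly quoted in this thread. */
    unsigned char site[5] = { 0xe8, 0x47, 0xa6, 0xd5, 0xfe };      /* call __fentry__ */
    const unsigned char nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 }; /* 5-byte NOP */

    dump("original:", site);

    site[0] = 0xcc;                     /* first step: plant INT3 */
    dump("first step:", site);

    memcpy(site + 1, nop5 + 1, 4);      /* second step: write the new tail */
    dump("second step:", site);         /* cc 1f 44 00 00  <- the "hit?" state */

    site[0] = nop5[0];                  /* third step: restore the first byte */
    dump("third step:", site);          /* 0f 1f 44 00 00 */

    return 0;
}
------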
It is a very unlikely scenario (and I'm not sure qemu can correctly emulate it). But suppose a CPU hits the int3 (cc) at _raw_spin_lock()+0x4 before another CPU runs the third step in smp_text_poke_batch_finish(): if that other CPU runs the third step and clears text_poke_array_refs before the first CPU runs smp_text_poke_int3_handler(), then smp_text_poke_int3_handler() returns 0 and causes exactly this problem.
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Finish second step.
Hit int3 (*)
                                        Finish third step.
                                        Run smp_text_poke_sync_each_cpu().(**)
                                        Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
Failed to get text_poke_array_refs[cpu0]
Oops: int3
But as I said, it is very unlikely, because as far as I know:
(*) smp_text_poke_int3_handler() is called directly from exc_int3(), which is a kind of NMI, so other interrupts should not run.
(**) In the third step, smp_text_poke_batch_finish() sends an IPI for a sync core after removing the int3, so any int3 exception handling should already be finished.
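To make the window concrete, here is a toy, single-threaded replay of the interleaving in the diagram above. The text_poke_array_refs name is borrowed from this thread; try_get_ref() and the rest are illustrative only, not the kernel's API:

------
/* Toy userspace model of the ordering above, using C11 atomics. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 2

static atomic_int text_poke_array_refs[NR_CPUS];

/* "Take a ref unless it is already zero" -- the handler's claim check. */
static bool try_get_ref(atomic_int *ref)
{
    int old = atomic_load(ref);

    while (old != 0) {
        if (atomic_compare_exchange_weak(ref, &old, old + 1))
            return true;
    }
    return false;   /* patching already finished: this INT3 is not ours */
}

int main(void)
{
    const int cpu0 = 0;

    /* CPU1: smp_text_poke_batch_finish() starts; refs are live. */
    atomic_store(&text_poke_array_refs[cpu0], 1);

    /* CPU0: decodes the INT3 planted by the earlier steps right here. */

    /* CPU1: third step, sync IPI, then drops the per-CPU refs. */
    atomic_store(&text_poke_array_refs[cpu0], 0);

    /* CPU0: only now reaches its #BP handler. */
    if (!try_get_ref(&text_poke_array_refs[cpu0]))
        printf("handler returns 0 -> do_int3() fails -> die(\"int3\")\n");

    return 0;
}
------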
Has this bug become easier to reproduce recently?
Thanks,
On Mon, 9 Jun 2025 22:09:34 +0900 Masami Hiramatsu (Google) mhiramat@kernel.org wrote:
[...]
Here is the compiled code of _raw_spin_lock.
ffffffff825daa00 <_raw_spin_lock>:
ffffffff825daa00:       f3 0f 1e fa             endbr64
ffffffff825daa04:       e8 47 a6 d5 fe          call   ffffffff81335050 <__fentry__>
Since the int3 exception is raised after the 1-byte int3 has been decoded, the reported RIP `_raw_spin_lock+0x05` is not an instruction boundary.
<4>[ 58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15 12 e4 fe
And the call has already been modified back to a 5-byte NOP by the time the code is dumped. Thus it may have hit the intermediate int3 used while transforming the code.
   e8 47 a6 d5 fe
(first step)
   cc 47 a6 d5 fe
(second step)
   cc 1f 44 00 00   <- hit?
(third step)
   0f 1f 44 00 00   <- handle int3
It is a very unlikely scenario (and I'm not sure qemu can correctly emulate it). But suppose a CPU hits the int3 (cc) at _raw_spin_lock()+0x4 before another CPU runs the third step in smp_text_poke_batch_finish(): if that other CPU runs the third step and clears text_poke_array_refs before the first CPU runs smp_text_poke_int3_handler(), then smp_text_poke_int3_handler() returns 0 and causes exactly this problem.
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Finish second step.
Hit int3 (*)
                                        Finish third step.
                                        Run smp_text_poke_sync_each_cpu().(**)
                                        Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
Failed to get text_poke_array_refs[cpu0]
Oops: int3
But as I said, it is very unlikely, because as far as I know:
(*) smp_text_poke_int3_handler() is called directly from exc_int3(), which is a kind of NMI, so other interrupts should not run.
(**) In the third step, smp_text_poke_batch_finish() sends an IPI for a sync core after removing the int3, so any int3 exception handling should already be finished.
Maybe one possible scenario is to hit the int3 after the third step somehow (on I-cache?).
------
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Start the third step. (remove INT3)
                                        on_each_cpu(do_sync_core)
do_sync_core(do SERIALIZE)
                                        Finish the third step.
Hit INT3 (from I-cache?)
                                        Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
Failed to get text_poke_array_refs[cpu0]
Oops: int3
------
The SERIALIZE instruction may flush the pipeline, so the processor needs to re-fetch the instruction. But it is not guaranteed to reload it from memory, because SERIALIZE does not invalidate the cache.
If that hypothesis is correct, we need to invalidate the cache (flush the TLB) in the third step, before the do_sync_core().
Or, if that is uncertain, we can simply keep the kernel away from die("int3") by retrying the new instruction when the INT3 has disappeared.
Thank you,
On Tue, 10 Jun 2025 17:41:36 +0900 Masami Hiramatsu (Google) mhiramat@kernel.org wrote:
The SERIALIZE instruction may flush the pipeline, so the processor needs to re-fetch the instruction. But it is not guaranteed to reload it from memory, because SERIALIZE does not invalidate the cache.
From my understanding, an IPI on a CPU is equivalent to a smp_mb() on that CPU. There shouldn't be any need for flushing the cache.
If that hypothesis is correct, we need to invalidate the cache (flush the TLB) in the third step, before the do_sync_core().
I'm not sure how the TLB would be affected.
-- Steve
Or, if that is uncertain, we can simply keep the kernel away from die("int3") by retrying the new instruction when the INT3 has disappeared.
On Mon, 9 Jun 2025 at 18:39, Masami Hiramatsu mhiramat@kernel.org wrote:
On Thu, 5 Jun 2025 17:12:10 +0530 Naresh Kamboju naresh.kamboju@linaro.org wrote:
Regressions found on qemu-x86_64 with compat mode (64-bit kernel running on 32-bit userspace) while running LTP tracing test suite on Linux next-20250605 tag kernel.
Regressions found on
- LTP tracing
Regression Analysis:
- New regression? Yes
- Reproducible? Intermittent
Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
Reported-by: Linux Kernel Functional Testing lkft@linaro.org
## Test log
ftrace-stress-test:
<12>[ 21.971153] /usr/local/bin/kirk[277]: starting test ftrace-stress-test (ftrace_stress_test.sh 90)
<4>[ 58.997439] Oops: int3: 0000 [#1] SMP PTI
<4>[ 58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted 6.15.0-next-20250605 #1 PREEMPT(voluntary)
<4>[ 58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
<4>[ 58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
Interesting. This hits a stray int3 for ftrace on _raw_spin_lock.
Here is the compiled code of _raw_spin_lock.
ffffffff825daa00 <_raw_spin_lock>:
ffffffff825daa00:       f3 0f 1e fa             endbr64
ffffffff825daa04:       e8 47 a6 d5 fe          call   ffffffff81335050 <__fentry__>
Since the int3 exception is raised after the 1-byte int3 has been decoded, the reported RIP `_raw_spin_lock+0x05` is not an instruction boundary.
<4>[ 58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15 12 e4 fe
And the call has already been modified back to a 5-byte NOP by the time the code is dumped. Thus it may have hit the intermediate int3 used while transforming the code.
   e8 47 a6 d5 fe
(first step)
   cc 47 a6 d5 fe
(second step)
   cc 1f 44 00 00   <- hit?
(third step)
   0f 1f 44 00 00   <- handle int3
It is a very unlikely scenario (and I'm not sure qemu can correctly emulate it). But suppose a CPU hits the int3 (cc) at _raw_spin_lock()+0x4 before another CPU runs the third step in smp_text_poke_batch_finish(): if that other CPU runs the third step and clears text_poke_array_refs before the first CPU runs smp_text_poke_int3_handler(), then smp_text_poke_int3_handler() returns 0 and causes exactly this problem.
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Finish second step.
Hit int3 (*)
                                        Finish third step.
                                        Run smp_text_poke_sync_each_cpu().(**)
                                        Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
Failed to get text_poke_array_refs[cpu0]
Oops: int3
But as I said, it is very unlikely, because as far as I know:
(*) smp_text_poke_int3_handler() is called directly from exc_int3(), which is a kind of NMI, so other interrupts should not run.
(**) In the third step, smp_text_poke_batch_finish() sends an IPI for a sync core after removing the int3, so any int3 exception handling should already be finished.
Has this bug become easier to reproduce recently?
Yes. It is easy to reproduce.
Thanks,
-- Masami Hiramatsu (Google) mhiramat@kernel.org
- Naresh
On Tue, 10 Jun 2025 18:50:05 +0530 Naresh Kamboju naresh.kamboju@linaro.org wrote:
On Mon, 9 Jun 2025 at 18:39, Masami Hiramatsu mhiramat@kernel.org wrote:
On Thu, 5 Jun 2025 17:12:10 +0530 Naresh Kamboju naresh.kamboju@linaro.org wrote:
Regressions found on qemu-x86_64 with compat mode (64-bit kernel running on 32-bit userspace) while running LTP tracing test suite on Linux next-20250605 tag kernel.
Regressions found on
- LTP tracing
Regression Analysis:
- New regression? Yes
- Reproducible? Intermittent
Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
Reported-by: Linux Kernel Functional Testing lkft@linaro.org
## Test log
ftrace-stress-test:
<12>[ 21.971153] /usr/local/bin/kirk[277]: starting test ftrace-stress-test (ftrace_stress_test.sh 90)
<4>[ 58.997439] Oops: int3: 0000 [#1] SMP PTI
<4>[ 58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted 6.15.0-next-20250605 #1 PREEMPT(voluntary)
<4>[ 58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
<4>[ 58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
Interesting. This hits a stray int3 for ftrace on _raw_spin_lock.
Here is the compiled code of _raw_spin_lock.
ffffffff825daa00 <_raw_spin_lock>:
ffffffff825daa00:       f3 0f 1e fa             endbr64
ffffffff825daa04:       e8 47 a6 d5 fe          call   ffffffff81335050 <__fentry__>
Since the int3 exception is raised after the 1-byte int3 has been decoded, the reported RIP `_raw_spin_lock+0x05` is not an instruction boundary.
<4>[ 58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15 12 e4 fe
And the call has already been modified back to a 5-byte NOP by the time the code is dumped. Thus it may have hit the intermediate int3 used while transforming the code.
   e8 47 a6 d5 fe
(first step)
   cc 47 a6 d5 fe
(second step)
   cc 1f 44 00 00   <- hit?
(third step)
   0f 1f 44 00 00   <- handle int3
It is a very unlikely scenario (and I'm not sure qemu can correctly emulate it). But suppose a CPU hits the int3 (cc) at _raw_spin_lock()+0x4 before another CPU runs the third step in smp_text_poke_batch_finish(): if that other CPU runs the third step and clears text_poke_array_refs before the first CPU runs smp_text_poke_int3_handler(), then smp_text_poke_int3_handler() returns 0 and causes exactly this problem.
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Finish second step.
Hit int3 (*)
                                        Finish third step.
                                        Run smp_text_poke_sync_each_cpu().(**)
                                        Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
Failed to get text_poke_array_refs[cpu0]
Oops: int3
But as I said, it is very unlikely, because as far as I know:
(*) smp_text_poke_int3_handler() is called directly from exc_int3(), which is a kind of NMI, so other interrupts should not run.
(**) In the third step, smp_text_poke_batch_finish() sends an IPI for a sync core after removing the int3, so any int3 exception handling should already be finished.
Has this bug become easier to reproduce recently?
Yes. It is easy to reproduce.
Good, can you test the following two patches (I'll send a series)? I think [1/2] may avoid the kernel crash but still show a warning, and [2/2] may fix it, if my guess is correct.
Thank you,
Thanks,
-- Masami Hiramatsu (Google) mhiramat@kernel.org
- Naresh
From: Masami Hiramatsu (Google) mhiramat@kernel.org
An Oops caused by a stray INT3 was reported by LKFT.
## Test log
ftrace-stress-test:
<12>[ 21.971153] /usr/local/bin/kirk[277]: starting test ftrace-stress-test (ftrace_stress_test.sh 90)
<4>[ 58.997439] Oops: int3: 0000 [#1] SMP PTI
<4>[ 58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted 6.15.0-next-20250605 #1 PREEMPT(voluntary)
<4>[ 58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
<4>[ 58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
<4>[ 58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15 12 e4 fe
But the INT3 (cc) is not shown in the dumped code. This means there is a chance of handling an INT3 exception after the INT3 has been replaced with the original instruction.
To get the kernel out of this situation, when the kernel fails to handle the INT3, check whether there is still an INT3 at the trapped address. If there is not, retry executing the new instruction.
Reported-by: Linux Kernel Functional Testing lkft@linaro.org
Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg...
Signed-off-by: Masami Hiramatsu (Google) mhiramat@kernel.org
---
 arch/x86/kernel/traps.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c5c897a86418..f489e86c1b5e 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -880,6 +880,29 @@ static void do_int3_user(struct pt_regs *regs)
 	cond_local_irq_disable(regs);
 }
 
+static int handle_disappeared_int3(struct pt_regs *regs)
+{
+	unsigned long addr = instruction_pointer(regs) - INT3_INSN_SIZE;
+	unsigned char opcode;
+	int ret;
+
+	/*
+	 * Evacuate the kernel from disappeared int3, which was there when
+	 * the exception happens, but it is removed now by another CPU.
+	 */
+	ret = copy_from_kernel_nofault(&opcode, (void *)addr, INT3_INSN_SIZE);
+	if (ret < 0)
+		return ret;
+	if (opcode == INT3_INSN_OPCODE)
+		return -EFAULT;
+
+	/* There is no INT3 here. Retry with the new instruction. */
+	WARN_ONCE(1, "A disappeared INT3 was handled at %pS.", (void *)addr);
+	instruction_pointer_set(regs, addr);
+
+	return 0;
+}
+
 DEFINE_IDTENTRY_RAW(exc_int3)
 {
 	/*
@@ -907,7 +930,7 @@ DEFINE_IDTENTRY_RAW(exc_int3)
 		irqentry_state_t irq_state = irqentry_nmi_enter(regs);
 
 		instrumentation_begin();
-		if (!do_int3(regs))
+		if (!do_int3(regs) && handle_disappeared_int3(regs) < 0)
 			die("int3", regs, 0);
 		instrumentation_end();
 		irqentry_nmi_exit(regs, irq_state);
From: Masami Hiramatsu (Google) mhiramat@kernel.org
Invalidate the cache after replacing INT3 with the new instruction. This will prevent the other CPUs seeing the removed INT3 in their cache after serializing the pipeline.
LKFT reported an oops by INT3 but there is no INT3 shown in the dumped code. This means the INT3 is removed after the CPU hits INT3.
## Test log
ftrace-stress-test:
<12>[ 21.971153] /usr/local/bin/kirk[277]: starting test ftrace-stress-test (ftrace_stress_test.sh 90)
<4>[ 58.997439] Oops: int3: 0000 [#1] SMP PTI
<4>[ 58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted 6.15.0-next-20250605 #1 PREEMPT(voluntary)
<4>[ 58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
<4>[ 58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
<4>[ 58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15 12 e4 fe
Maybe one possible scenario is to hit the int3 after the third step somehow (on I-cache).
------
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Start the third step. (remove INT3)
                                        on_each_cpu(do_sync_core)
do_sync_core(do SERIALIZE)
                                        Finish the third step.
Hit INT3 (from I-cache)
                                        Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
Failed to get text_poke_array_refs[cpu0]
Oops: int3
------
The SERIALIZE instruction flushes the pipeline, so the processor needs to re-fetch the instruction. But it is not guaranteed to reload it from memory, because SERIALIZE does not invalidate the cache.
To prevent reloading the replaced INT3, we need to invalidate the cache (flush the TLB) in the third step, before the do_sync_core().
Reported-by: Linux Kernel Functional Testing lkft@linaro.org
Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg...
Signed-off-by: Masami Hiramatsu (Google) mhiramat@kernel.org
---
 arch/x86/kernel/alternative.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ecfe7b497cad..1b606db48017 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -2949,8 +2949,16 @@ void smp_text_poke_batch_finish(void)
 			do_sync++;
 	}
 
-	if (do_sync)
+	if (do_sync) {
+		/*
+		 * Flush the instructions on the cache, then serialize the
+		 * pipeline of each CPU.
+		 */
+		flush_tlb_kernel_range((unsigned long)text_poke_addr(&text_poke_array.vec[0]),
+				       (unsigned long)text_poke_addr(text_poke_array.vec +
+						       text_poke_array.nr_entries - 1));
 		smp_text_poke_sync_each_cpu();
+	}
 
 	/*
 	 * Remove and wait for refs to be zero.
On Tue, 10 Jun 2025 23:47:48 +0900 "Masami Hiramatsu (Google)" mhiramat@kernel.org wrote:
Maybe one possible scenario is to hit the int3 after the third step somehow (on I-cache).
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Start the third step. (remove INT3)
                                        on_each_cpu(do_sync_core)
do_sync_core(do SERIALIZE)
                                        Finish the third step.
Hit INT3 (from I-cache)
                                        Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
I believe your analysis identifies the issue here. The commit that changed the ref counter from a global to per-CPU didn't cause the issue; it just made the race window bigger.
Failed to get text_poke_array_refs[cpu0]
Oops: int3
The SERIALIZE instruction flushes the pipeline, so the processor needs to re-fetch the instruction. But it is not guaranteed to reload it from memory, because SERIALIZE does not invalidate the cache.
To prevent reloading the replaced INT3, we need to invalidate the cache (flush the TLB) in the third step, before the do_sync_core().
Reported-by: Linux Kernel Functional Testing lkft@linaro.org
Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg...
Signed-off-by: Masami Hiramatsu (Google) mhiramat@kernel.org
---
 arch/x86/kernel/alternative.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ecfe7b497cad..1b606db48017 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -2949,8 +2949,16 @@ void smp_text_poke_batch_finish(void)
 			do_sync++;
 	}
 
-	if (do_sync)
+	if (do_sync) {
+		/*
+		 * Flush the instructions on the cache, then serialize the
+		 * pipeline of each CPU.
The IPI interrupt should flush the cache. And the TLB should not be an issue here. If anything, this may work just because it will make the race smaller.
I'm thinking this may be a QEMU bug. If QEMU doesn't flush the icache on an IPI, then this would indeed be a problem.
-- Steve
+		 */
+		flush_tlb_kernel_range((unsigned long)text_poke_addr(&text_poke_array.vec[0]),
+				       (unsigned long)text_poke_addr(text_poke_array.vec +
+						       text_poke_array.nr_entries - 1));
 		smp_text_poke_sync_each_cpu();
+	}
 
 	/*
 	 * Remove and wait for refs to be zero.
On Tue, 10 Jun 2025 11:50:30 -0400 Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 10 Jun 2025 23:47:48 +0900 "Masami Hiramatsu (Google)" mhiramat@kernel.org wrote:
Maybe one possible scenario is to hit the int3 after the third step somehow (on I-cache).
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Start the third step. (remove INT3)
                                        on_each_cpu(do_sync_core)
do_sync_core(do SERIALIZE)
                                        Finish the third step.
Hit INT3 (from I-cache)
                                        Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
I believe your analysis identifies the issue here. The commit that changed the ref counter from a global to per-CPU didn't cause the issue; it just made the race window bigger.
Agreed. That is a suspicious commit, but even so, as you said, it might just make the bug easier to hit. I wrote the refcount as a per-CPU array here to reflect the current code.
Failed to get text_poke_array_refs[cpu0]
Oops: int3
The SERIALIZE instruction flushes the pipeline, so the processor needs to re-fetch the instruction. But it is not guaranteed to reload it from memory, because SERIALIZE does not invalidate the cache.
To prevent reloading the replaced INT3, we need to invalidate the cache (flush the TLB) in the third step, before the do_sync_core().
Reported-by: Linux Kernel Functional Testing lkft@linaro.org
Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg...
Signed-off-by: Masami Hiramatsu (Google) mhiramat@kernel.org
---
 arch/x86/kernel/alternative.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ecfe7b497cad..1b606db48017 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -2949,8 +2949,16 @@ void smp_text_poke_batch_finish(void)
 			do_sync++;
 	}
 
-	if (do_sync)
+	if (do_sync) {
+		/*
+		 * Flush the instructions on the cache, then serialize the
+		 * pipeline of each CPU.
The IPI interrupt should flush the cache. And the TLB should not be an issue here. If anything, this may work just because it will make the race smaller.
I'm not sure; I'm searching for it in the Intel SDM.
I'm thinking this may be a QEMU bug. If QEMU doesn't flush the icache on an IPI, then this would indeed be a problem.
Does qemu manage its icache? (Is it even possible to manage it?) And I guess it is using KVM to run the VM, so the actual cache or TLB operations are done by KVM.
Thanks,
-- Steve
+		 */
+		flush_tlb_kernel_range((unsigned long)text_poke_addr(&text_poke_array.vec[0]),
+				       (unsigned long)text_poke_addr(text_poke_array.vec +
+						       text_poke_array.nr_entries - 1));
 		smp_text_poke_sync_each_cpu();
+	}
 
 	/*
 	 * Remove and wait for refs to be zero.
On Tue, 10 Jun 2025 11:50:30 -0400 Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 10 Jun 2025 23:47:48 +0900 "Masami Hiramatsu (Google)" mhiramat@kernel.org wrote:
Maybe one possible scenario is to hit the int3 after the third step somehow (on I-cache).
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Start the third step. (remove INT3)
                                        on_each_cpu(do_sync_core)
do_sync_core(do SERIALIZE)
                                        Finish the third step.
Hit INT3 (from I-cache)
                                        Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
I believe your analysis identifies the issue here. The commit that changed the ref counter from a global to per-CPU didn't cause the issue; it just made the race window bigger.
Ah, OK. That seems easier to explain. Since we use a trap gate for #BP, it does not clear IF automatically. Thus there is a time window between executing the INT3 (from the icache, or one already in the pipeline) and its handler disabling interrupts. If the IPI is received in that time window, this bug happens.
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Start the third step. (remove INT3)
Hit INT3 (from icache/pipeline)
                                        on_each_cpu(do_sync_core)
----
do_sync_core(do SERIALIZE)
----
                                        Finish the third step.
Handle #BP including CLI
                                        Clear text_poke_array_refs[cpu0]
preparing stack
Start smp_text_poke_int3_handler()
Failed to get text_poke_array_refs[cpu0]
In this case, the per-CPU text_poke_array_refs makes the time window bigger, because clearing text_poke_array_refs is faster.
If this is correct, flushing the cache does not matter (though it can make the window smaller).
One possible solution is to send the IPI again, which ensures the current #BP handler has exited. That can make the window small enough.
Another solution is to remove the WARN_ONCE() from [1/2], which means we accept this scenario but avoid the catastrophic result.
Thank you,
[ I just noticed that you continued on the thread without the x86 folks Cc ]
On Wed, 11 Jun 2025 19:26:10 +0900 Masami Hiramatsu (Google) mhiramat@kernel.org wrote:
On Tue, 10 Jun 2025 11:50:30 -0400 Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 10 Jun 2025 23:47:48 +0900 "Masami Hiramatsu (Google)" mhiramat@kernel.org wrote:
Maybe one possible scenario is to hit the int3 after the third step somehow (on I-cache).
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Start the third step. (remove INT3)
                                        on_each_cpu(do_sync_core)
do_sync_core(do SERIALIZE)
                                        Finish the third step.
Hit INT3 (from I-cache)
                                        Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
I believe your analysis identifies the issue here. The commit that changed the ref counter from a global to per-CPU didn't cause the issue; it just made the race window bigger.
Ah, OK. That seems easier to explain. Since we use a trap gate for #BP, it does not clear IF automatically. Thus there is a time window between executing the INT3 (from the icache, or one already in the pipeline) and its handler disabling interrupts. If the IPI is received in that time window, this bug happens.
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Start the third step. (remove INT3)
Hit INT3 (from icache/pipeline)
                                        on_each_cpu(do_sync_core)
----
do_sync_core(do SERIALIZE)
----
                                        Finish the third step.
Handle #BP including CLI
                                        Clear text_poke_array_refs[cpu0]
preparing stack
Start smp_text_poke_int3_handler()
Failed to get text_poke_array_refs[cpu0]
In this case, the per-CPU text_poke_array_refs makes the time window bigger, because clearing text_poke_array_refs is faster.
If this is correct, flushing the cache does not matter (though it can make the window smaller).
One possible solution is to send the IPI again, which ensures the current #BP handler has exited. That can make the window small enough.
Another solution is to remove the WARN_ONCE() from [1/2], which means we accept this scenario but avoid the catastrophic result.
If interrupts are enabled when the break point hits and just enters the int3 handler, does that also mean it can schedule?
If that's the case, then we either have to remove the WARN_ONCE() or we would have to do something like a synchronize_rcu_tasks().
-- Steve
On Wed, 11 Jun 2025 10:20:10 -0400 Steven Rostedt rostedt@goodmis.org wrote:
If interrupts are enabled when the break point hits and just enters the int3 handler, does that also mean it can schedule?
I added this:
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c5c897a86418..0f3153322ad2 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -854,6 +854,8 @@ static bool do_int3(struct pt_regs *regs)
 {
 	int res;
 
+	if (!irqs_disabled())
+		printk("IRQS NOT DISABLED\n");
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
 	if (kgdb_ll_trap(DIE_INT3, "int3", regs, 0, X86_TRAP_BP, SIGTRAP) == NOTIFY_STOP)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ecfe7b497cad..2856805d9ed1 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -2728,6 +2728,12 @@ noinstr int smp_text_poke_int3_handler(struct pt_regs *regs)
 	int ret = 0;
 	void *ip;
 
+	if (!irqs_disabled()) {
+		instrumentation_begin();
+		printk("IRQS NOT DISABLED\n");
+		instrumentation_end();
+	}
+
 	if (user_mode(regs))
 		return 0;
And it didn't trigger when enabling function tracing. Are you sure interrupts are enabled here?
-- Steve
On Wed, 11 Jun 2025 11:42:43 -0400 Steven Rostedt rostedt@goodmis.org wrote:
On Wed, 11 Jun 2025 10:20:10 -0400 Steven Rostedt rostedt@goodmis.org wrote:
If interrupts are enabled when the break point hits and just enters the int3 handler, does that also mean it can schedule?
I added this:
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c5c897a86418..0f3153322ad2 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -854,6 +854,8 @@ static bool do_int3(struct pt_regs *regs)
 {
 	int res;
 
+	if (!irqs_disabled())
+		printk("IRQS NOT DISABLED\n");
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
 	if (kgdb_ll_trap(DIE_INT3, "int3", regs, 0, X86_TRAP_BP, SIGTRAP) == NOTIFY_STOP)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ecfe7b497cad..2856805d9ed1 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -2728,6 +2728,12 @@ noinstr int smp_text_poke_int3_handler(struct pt_regs *regs)
 	int ret = 0;
 	void *ip;
 
+	if (!irqs_disabled()) {
+		instrumentation_begin();
+		printk("IRQS NOT DISABLED\n");
+		instrumentation_end();
+	}
+
 	if (user_mode(regs))
 		return 0;
And it didn't trigger when enabling function tracing. Are you sure interrupts are enabled here?
Oops, I was looking at Xen's code. I confirmed that asm_exc_int3 is registered as GATE_INTERRUPT. Hmm. Thus this might be a qemu bug, as Peter said, because there is no window for the IPI to be delivered after hitting #BP.
Thank you,
-- Steve
On Tue, Jun 10, 2025 at 11:47:48PM +0900, Masami Hiramatsu (Google) wrote:
From: Masami Hiramatsu (Google) mhiramat@kernel.org
Invalidate the cache after replacing INT3 with the new instruction. This will prevent the other CPUs seeing the removed INT3 in their cache after serializing the pipeline.
LKFT reported an oops by INT3 but there is no INT3 shown in the dumped code. This means the INT3 is removed after the CPU hits INT3.
## Test log
ftrace-stress-test:
<12>[ 21.971153] /usr/local/bin/kirk[277]: starting test ftrace-stress-test (ftrace_stress_test.sh 90)
<4>[ 58.997439] Oops: int3: 0000 [#1] SMP PTI
<4>[ 58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted 6.15.0-next-20250605 #1 PREEMPT(voluntary)
<4>[ 58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
<4>[ 58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
<4>[ 58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15 12 e4 fe
Maybe one possible scenario is to hit the int3 after the third step somehow (on I-cache).
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Start the third step. (remove INT3)
                                        on_each_cpu(do_sync_core)
do_sync_core(do SERIALIZE)
                                        Finish the third step.
Hit INT3 (from I-cache)
                                        Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
Failed to get text_poke_array_refs[cpu0]
Oops: int3
The SERIALIZE instruction flushes the pipeline, so the processor needs to re-fetch the instruction. But it is not guaranteed to reload it from memory, because SERIALIZE does not invalidate the cache.
To prevent reloading the replaced INT3, we need to invalidate the cache (flush the TLB) in the third step, before the do_sync_core().
This sounds all sorts of wrong. x86 is supposed to be cache-coherent. A store should cause the invalidation per MESI and all that. This means the only place where the old instruction can stick around is in the uarch micro-ops cache and all that, and SERIALIZE will very much flush those.
Also, TLB flush != I$ flush. There is clflush_cache_range() for this. But still, this really should not be needed.
Also, this is all qemu, and qemu is known to have gotten this terribly wrong in the past.
If you all cannot reproduce on real hardware, I'm considering this a qemu bug.
On Wed, 11 Jun 2025 13:30:01 +0200 Peter Zijlstra peterz@infradead.org wrote:
On Tue, Jun 10, 2025 at 11:47:48PM +0900, Masami Hiramatsu (Google) wrote:
From: Masami Hiramatsu (Google) mhiramat@kernel.org
Invalidate the cache after replacing INT3 with the new instruction. This will prevent the other CPUs seeing the removed INT3 in their cache after serializing the pipeline.
LKFT reported an oops by INT3 but there is no INT3 shown in the dumped code. This means the INT3 is removed after the CPU hits INT3.
## Test log
ftrace-stress-test:
<12>[ 21.971153] /usr/local/bin/kirk[277]: starting test ftrace-stress-test (ftrace_stress_test.sh 90)
<4>[ 58.997439] Oops: int3: 0000 [#1] SMP PTI
<4>[ 58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted 6.15.0-next-20250605 #1 PREEMPT(voluntary)
<4>[ 58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
<4>[ 58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
<4>[ 58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15 12 e4 fe
Maybe one possible scenario is to hit the int3 after the third step somehow (on I-cache).
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Start the third step. (remove INT3)
                                        on_each_cpu(do_sync_core)
do_sync_core(do SERIALIZE)
                                        Finish the third step.
Hit INT3 (from I-cache)
                                        Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
Failed to get text_poke_array_refs[cpu0]
Oops: int3
The SERIALIZE instruction flushes the pipeline, so the processor needs to re-fetch the instruction. But it is not guaranteed to reload it from memory, because SERIALIZE does not invalidate the cache.
To prevent reloading the replaced INT3, we need to invalidate the cache (flush the TLB) in the third step, before the do_sync_core().
This sounds all sorts of wrong. x86 is supposed to be cache-coherent. A store should cause the invalidation per MESI and all that. This means the only place where the old instruction can stick around is in the uarch micro-ops cache and all that, and SERIALIZE will very much flush those.
OK, thanks for pointing it out!
Also, TLB flush != I$ flush. There is clflush_cache_range() for this. But still, this really should not be needed.
Also, this is all qemu, and qemu is known to have gotten this terribly wrong in the past.
What about KVM? We need to ask Naresh how it is run on that machine. Naresh, can you tell us how the VM is running? Does it use KVM? And if so, how is KVM configured (it may depend on the real hardware)?
If you all cannot reproduce on real hardware, I'm considering this a qemu bug.
OK, if it is a qemu bug, I will drop [2/2], but I think we still need [1/2] to avoid the kernel crash (turning it into a warning message without a dump).
Thank you,
On Thu, 12 Jun 2025 at 05:47, Masami Hiramatsu mhiramat@kernel.org wrote:
On Wed, 11 Jun 2025 13:30:01 +0200 Peter Zijlstra peterz@infradead.org wrote:
On Tue, Jun 10, 2025 at 11:47:48PM +0900, Masami Hiramatsu (Google) wrote:
From: Masami Hiramatsu (Google) mhiramat@kernel.org
Invalidate the cache after replacing INT3 with the new instruction. This will prevent the other CPUs seeing the removed INT3 in their cache after serializing the pipeline.
LKFT reported an oops by INT3 but there is no INT3 shown in the dumped code. This means the INT3 is removed after the CPU hits INT3.
## Test log
ftrace-stress-test:
<12>[ 21.971153] /usr/local/bin/kirk[277]: starting test ftrace-stress-test (ftrace_stress_test.sh 90)
<4>[ 58.997439] Oops: int3: 0000 [#1] SMP PTI
<4>[ 58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted 6.15.0-next-20250605 #1 PREEMPT(voluntary)
<4>[ 58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
<4>[ 58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
<4>[ 58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15 12 e4 fe
Maybe one possible scenario is to hit the int3 after the third step somehow (on I-cache).
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Start the third step. (remove INT3)
                                        on_each_cpu(do_sync_core)
do_sync_core(do SERIALIZE)
                                        Finish the third step.
Hit INT3 (from I-cache)
                                        Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
Failed to get text_poke_array_refs[cpu0]
Oops: int3
The SERIALIZE instruction flushes the pipeline, so the processor needs to re-fetch the instruction. But it is not guaranteed to reload it from memory, because SERIALIZE does not invalidate the cache.
To prevent reloading the replaced INT3, we need to invalidate the cache (flush the TLB) in the third step, before the do_sync_core().
This sounds all sorts of wrong. x86 is supposed to be cache-coherent. A store should cause the invalidation per MESI and all that. This means the only place where the old instruction can stick around is in the uarch micro-ops cache and all that, and SERIALIZE will very much flush those.
OK, thanks for pointing it out!
Also, TLB flush != I$ flush. There is clflush_cache_range() for this. But still, this really should not be needed.
Also, this is all qemu, and qemu is known to have gotten this terribly wrong in the past.
What about KVM? We need to ask Naresh how it is run on that machine. Naresh, can you tell us how the VM is running? Does it use KVM? And if so, how is KVM configured (it may depend on the real hardware)?
We do not use KVM; we are running QEMU version 10.0.0.
If you all cannot reproduce on real hardware, I'm considering this a qemu bug.
It is reproducible intermittently on the x86_64 device and the qemu-x86 device, with and without compat mode.
This link shows how intermittent it is on the Linux next tree.
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250606/tes...
- Naresh
OK, if it is a qemu bug, I will drop [2/2], but I think we still need [1/2] to avoid the kernel crash (turning it into a warning message without a dump).
Thank you,
-- Masami Hiramatsu (Google) mhiramat@kernel.org
On Thu, 12 Jun 2025 21:54:05 +0530 Naresh Kamboju naresh.kamboju@linaro.org wrote:
On Thu, 12 Jun 2025 at 05:47, Masami Hiramatsu mhiramat@kernel.org wrote:
On Wed, 11 Jun 2025 13:30:01 +0200 Peter Zijlstra peterz@infradead.org wrote:
On Tue, Jun 10, 2025 at 11:47:48PM +0900, Masami Hiramatsu (Google) wrote:
From: Masami Hiramatsu (Google) mhiramat@kernel.org
Invalidate the cache after replacing INT3 with the new instruction. This will prevent the other CPUs seeing the removed INT3 in their cache after serializing the pipeline.
LKFT reported an oops by INT3 but there is no INT3 shown in the dumped code. This means the INT3 is removed after the CPU hits INT3.
## Test log
ftrace-stress-test:
<12>[ 21.971153] /usr/local/bin/kirk[277]: starting test ftrace-stress-test (ftrace_stress_test.sh 90)
<4>[ 58.997439] Oops: int3: 0000 [#1] SMP PTI
<4>[ 58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted 6.15.0-next-20250605 #1 PREEMPT(voluntary)
<4>[ 58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
<4>[ 58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50
<4>[ 58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15 12 e4 fe
Maybe one possible scenario is to hit the int3 after the third step somehow (on I-cache).
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Start the third step. (remove INT3)
                                        on_each_cpu(do_sync_core)
do_sync_core(do SERIALIZE)
                                        Finish the third step.
Hit INT3 (from I-cache)
                                        Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
Failed to get text_poke_array_refs[cpu0]
Oops: int3
The SERIALIZE instruction flushes the pipeline, so the processor needs to re-fetch the instruction. But it is not guaranteed to reload it from memory, because SERIALIZE does not invalidate the cache.
To prevent reloading the replaced INT3, we need to invalidate the cache (flush the TLB) in the third step, before the do_sync_core().
This sounds all sorts of wrong. x86 is supposed to be cache-coherent. A store should cause the invalidation per MESI and all that. This means the only place where the old instruction can stick around is in the uarch micro-ops cache and all that, and SERIALIZE will very much flush those.
OK, thanks for pointing it out!
Also, TLB flush != I$ flush. There is clflush_cache_range() for this. But still, this really should not be needed.
Also, this is all qemu, and qemu is known to have gotten this terribly wrong in the past.
What about KVM? We need to ask Naresh how it is run on that machine. Naresh, can you tell us how the VM is running? Does it use KVM? And if so, how is KVM configured (it may depend on the real hardware)?
We do not use KVM; we are running QEMU version 10.0.0.
If you all cannot reproduce on real hardware, I'm considering this a qemu bug.
It is reproducible intermittently on the x86_64 device and the qemu-x86 device, with and without compat mode.
Interesting, so it seems this is not a KVM/qemu issue but a real bug in the INT3 handling (maybe text_poke?).
This link shows how intermittent it is on the Linux next tree.
I found this example, where the INT3 was not removed but the kernel still failed to handle it.
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250501/tes...
[ 77.103476] Oops: int3: 0000 [#1] SMP PTI [ 77.103481] CPU: 2 UID: 0 PID: 10062 Comm: cat Not tainted 6.15.0-rc4-next-20250501 #1 PREEMPT_{RT,(full)} [ 77.103484] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.7 12/07/2021 [ 77.103485] RIP: 0010:kmem_cache_alloc_noprof+0x10a/0x2c0 [ 77.103490] Code: 4c 89 e7 e8 28 e4 cd 00 66 90 f7 c5 00 00 40 00 0f 85 89 01 00 00 f6 43 09 20 0f 85 7f 01 00 00 4c 8b 24 24 48 8b 74 24 38 cc <1f> 44 00 00 48 8b 44 24 08 65 48 2b 05 cd e5 23 02 0f 85 8e 01 00 [ 77.103491] RSP: 0018:ffffa0954960bac0 EFLAGS: 00000202 [ 77.103493] RAX: 0000000000000001 RBX: ffff9105c0229700 RCX: 0000000000000007 [ 77.103494] RDX: ffff9105c7589180 RSI: ffffffffb6fc247e RDI: ffff9105c7589180 [ 77.103495] RBP: 0000000000000cc0 R08: 0000000000000006 R09: 00000000000000c0 [ 77.103496] R10: ffffa0954960bbb8 R11: ffff9105cd06310c R12: ffff9105c3583300 [ 77.103497] R13: 00000000000000c0 R14: ffffffffb6fc247e R15: ffff9105c3b7c200 [ 77.103499] FS: 0000000000000000(0000) GS:ffff9109668b7000(0000) knlGS:0000000000000000 [ 77.103500] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 77.103501] CR2: 00007ffdbe756f50 CR3: 0000000103e24003 CR4: 00000000003726f0 [ 77.103502] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 77.103503] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 77.103504] Call Trace: [ 77.103505] <TASK> [ 77.103507] vm_area_dup+0x1e/0xe0 [ 77.103510] __split_vma+0xa0/0x320 [ 77.103513] vms_gather_munmap_vmas+0xab/0x230 [ 77.103514] __mmap_region+0x211/0xb80 [ 77.103521] do_mmap+0x3fa/0x5a0 [ 77.103524] vm_mmap_pgoff+0xfc/0x1d0 [ 77.103528] ksys_mmap_pgoff+0x149/0x1f0 [ 77.103531] ? do_syscall_64+0x7e/0x1d0 [ 77.103535] do_syscall_64+0xb2/0x1d0 [ 77.103537] entry_SYSCALL_64_after_hwframe+0x77/0x7f
The code pattern looks like a text_poke_batch()
cc <1f> 44 00 00 = BYTES_NOP5 with INT3.
But since it is not at the entry of the symbol, it may not be an ftrace entry; maybe a tracepoint?
-------
void *kmem_cache_alloc_noprof(struct kmem_cache *s, gfp_t gfpflags)
{
	void *ret = slab_alloc_node(s, NULL, gfpflags, NUMA_NO_NODE, _RET_IP_,
				    s->object_size);

	trace_kmem_cache_alloc(_RET_IP_, ret, s, gfpflags, NUMA_NO_NODE);

	return ret;
}
-------
Hmm, it seems like smp_text_poke_batch_finish() was caught in the first step (adding INT3 on a NOP) or in the second step (right before removing the INT3).
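If it helps when reading such dumps, here is a small standalone check of that byte pattern; the only facts assumed are the 5-byte NOP (0f 1f 44 00 00) and the INT3 opcode (cc), both visible in the dumps quoted in this thread:

------
/*
 * Classify the bytes at the trapped address: RIP points one byte past
 * the INT3, so the interesting byte is at rip - 1.  The dump above shows
 * "cc <1f> 44 00 00", i.e. INT3 followed by the tail of a 5-byte NOP.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Bytes around RIP, copied from the kmem_cache_alloc_noprof oops. */
    const unsigned char around_rip[5] = { 0xcc, 0x1f, 0x44, 0x00, 0x00 };
    const unsigned char nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

    if (around_rip[0] == 0xcc && !memcmp(around_rip + 1, nop5 + 1, 4))
        printf("INT3 over a 5-byte NOP site (poke still in progress)\n");
    else if (!memcmp(around_rip, nop5, 5))
        printf("plain 5-byte NOP (poke already completed)\n");
    else
        printf("something else\n");

    return 0;
}
------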
Thanks,
- Naresh
OK, if it is a qemu bug, I will drop [2/2], but I think we still need [1/2] to avoid the kernel crash (turning it into a warning message without a dump).
Thank you,
-- Masami Hiramatsu (Google) mhiramat@kernel.org
On Tue, 10 Jun 2025 18:50:05 +0530 Naresh Kamboju naresh.kamboju@linaro.org wrote:
Has this bug become easier to reproduce recently?
Yes. It is easy to reproduce.
Can you test before and after this commit:
4334336e769b ("x86/alternatives: Improve code-patching scalability by removing false sharing in poke_int3_handler()")
I think that may be the culprit.
Even if Masami's patches work, I want to know what exactly caused it.
-- Steve
On Tue, 10 Jun 2025 at 20:22, Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 10 Jun 2025 18:50:05 +0530 Naresh Kamboju naresh.kamboju@linaro.org wrote:
Has this bug become easier to reproduce recently?
Yes. It is easy to reproduce.
Can you test before and after this commit:
4334336e769b ("x86/alternatives: Improve code-patching scalability by removing false sharing in poke_int3_handler()")
I think that may be the culprit.
Even if Masami's patches work, I want to know what exactly caused it.
Steven,
Since the reported regressions are intermittent, it is not easy to bisect. However, the commit was merged into the Linux next-20250414 tag, and from next-20250415 onwards we started noticing this regression on both x86_64 devices and qemu-x86_64, intermittently, with and without compat mode.
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250606/tes...
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250606/tes...
And the above commit landed in Linus's master branch on 2025-05-13, after which we started noticing this regression intermittently on x86 with and without compat mode.
- https://qa-reports.linaro.org/lkft/linux-mainline-master/build/v6.16-rc1/tes...
Masami San,
case 1) compat mode x86_64 (64-bit kernel + 32-bit rootfs)
I have tested your patch on top of the Linux next-20250606 tag on real x86_64 (64-bit kernel + 32-bit rootfs) hardware for 7 test runs.
ftrace_regression01 - pass
ftrace_regression02 - pass
ftrace-stress-test - pass
dynamic_debug01 - Hangs (No crash log on serial console)
Case 1.1) The above case was also noticed on qemu-x86_64 with compat mode, with 12 test runs.
- https://lkft.validation.linaro.org/scheduler/job/8312811#L1687
case 2) x86_64 (64-bit kernel + 64-bit rootfs)
I have tested your patch on top of the Linux next-20250606 tag on real x86_64 (64-bit kernel + 64-bit rootfs) hardware for 4 runs; 3 of these runs failed with kernel warnings, a kernel BUG, and an invalid opcode while running LTP tracing test cases.
Here I am sharing the crash log snippets, the boot and test log links, and the build link.
Test logs: [ 112.596591] Ring buffer clock went backwards: 113864910133 -> 112596588266 [ 115.829620] cat (5762) used greatest stack depth: 10936 bytes left [ 120.922517] ------------[ cut here ]------------ [ 120.927198] WARNING: CPU: 2 PID: 6639 at kernel/trace/trace_functions_graph.c:985 print_graph_entry+0x579/0x590 [ 120.937364] Modules linked in: x86_pkg_temp_thermal [ 120.942405] CPU: 2 UID: 0 PID: 6639 Comm: cat Tainted: G S 6.15.0-next-20250606 #1 PREEMPT(voluntary) [ 120.953380] Tainted: [S]=CPU_OUT_OF_SPEC [ 120.957477] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.7 12/07/2021 [ 120.965036] RIP: 0010:print_graph_entry+0x579/0x590
Run 1:
- https://lkft.validation.linaro.org/scheduler/job/8311136#L1700
ftrace-stress-test: [ 58.963898] /usr/local/bin/kirk[340]: starting test ftrace-stress-test (ftrace_stress_test.sh 90) [ 60.316588] ------------[ cut here ]------------ [ 60.316588] ------------[ cut here ]------------ [ 60.316590] ------------[ cut here ]------------ [ 60.316593] ------------[ cut here ]------------ [ 60.316593] ------------[ cut here ]------------ [ 60.316594] ------------[ cut here ]------------ [ 60.316594] kernel BUG at kernel/entry/common.c:328! [ 60.316594] kernel BUG at kernel/entry/common.c:328! [ 60.316595] kernel BUG at kernel/entry/common.c:328! [ 60.316600] Oops: invalid opcode: 0000 [#1] SMP PTI [ 60.316604] CPU: 2 UID: 0 PID: 1556 Comm: sh Tainted: G S 6.15.0-next-20250606 #1 PREEMPT(voluntary) [ 60.316608] Tainted: [S]=CPU_OUT_OF_SPEC [ 60.316609] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.7 12/07/2021 [ 60.316614] ------------[ cut here ]------------ [ 60.316615] kernel BUG at kernel/entry/common.c:328! [ 60.316617] Oops: invalid opcode: 0000 [#2] SMP PTI [ 60.316620] CPU: 2 UID: 0 PID: 1556 Comm: sh Tainted: G S 6.15.0-next-20250606 #1 PREEMPT(voluntary) [ 60.316622] Tainted: [S]=CPU_OUT_OF_SPEC [ 60.316623] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.7 12/07/2021 [ 60.316625] RIP: 0010:irqentry_nmi_enter+0x6c/0x70
Run 2:
- https://lkft.validation.linaro.org/scheduler/job/8311138#L1703
ftrace-stress-test: [ 78.877495] /usr/local/bin/kirk[343]: starting test ftrace-stress-test (ftrace_stress_test.sh 90) [ 78.977303] Scheduler tracepoints stat_sleep, stat_iowait, stat_blocked and stat_runtime require the kernel parameter schedstats=enable or kernel.sched_schedstats=1 [ 82.299799] cat (2322) used greatest stack depth: 11520 bytes left [ 82.327708] cat (2327) used greatest stack depth: 11256 bytes left [ 82.632183] cat (2375) used greatest stack depth: 10992 bytes left [ 137.335901] ------------[ cut here ]------------ [ 137.335901] ------------[ cut here ]------------ [ 137.335902] ------------[ cut here ]------------ [ 137.335907] kernel BUG at kernel/entry/common.c:328! [ 137.335908] ------------[ cut here ]------------ [ 137.335909] ------------[ cut here ]------------ [ 137.335912] kernel BUG at kernel/entry/common.c:328! [ 137.335912] kernel BUG at kernel/entry/common.c:328! [ 137.335915] Oops: invalid opcode: 0000 [#1] SMP PTI [ 137.335921] CPU: 0 UID: 0 PID: 544 Comm: sh Tainted: G S 6.15.0-next-20250606 #1 PREEMPT(voluntary) [ 137.335926] Tainted: [S]=CPU_OUT_OF_SPEC [ 137.335929] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.7 12/07/2021 [ 137.335937] ------------[ cut here ]------------ [ 137.335939] kernel BUG at kernel/entry/common.c:328! [ 137.335945] Oops: invalid opcode: 0000 [#2] SMP PTI [ 137.335949] CPU: 0 UID: 0 PID: 544 Comm: sh Tainted: G S 6.15.0-next-20250606 #1 PREEMPT(voluntary) [ 137.335953] Tainted: [S]=CPU_OUT_OF_SPEC [ 137.335956] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.7 12/07/2021 [ 137.335959] RIP: 0010:irqentry_nmi_enter+0x6c/0x70
Run 3:
- https://lkft.validation.linaro.org/scheduler/job/8311139#L1703
Build log:
- https://storage.tuxsuite.com/public/linaro/naresh/builds/2yM9krm5KgE5a57QFvO...
- Naresh
-- Steve
On Thu, 12 Jun 2025 18:39:41 +0530 Naresh Kamboju naresh.kamboju@linaro.org wrote:
On Tue, 10 Jun 2025 at 20:22, Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 10 Jun 2025 18:50:05 +0530 Naresh Kamboju naresh.kamboju@linaro.org wrote:
Has this bug become easier to reproduce recently?
Yes. It is easy to reproduce.
Can you test before and after this commit:
4334336e769b ("x86/alternatives: Improve code-patching scalability by removing false sharing in poke_int3_handler()")
I think that may be the culprit.
Even if Masami's patches work, I want to know what exactly caused it.
Steven,
Since the reported regressions are intermittent, it is not easy to bisect. However, the commit was merged into the Linux next-20250414 tag, and from next-20250415 onwards we started noticing this regression on both x86_64 devices and qemu-x86_64, intermittently, with and without compat mode.
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250606/tes...
- https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20250606/tes...
And the above commit landed in Linus's master branch on 2025-05-13, after which we started noticing this regression intermittently on x86 with and without compat mode.
Masami San,
case 1) compat mode x86_64 (64-bit kernel + 32-bit rootfs)
I have tested your patch on top of the Linux next-20250606 tag on real x86_64 (64-bit kernel + 32-bit rootfs) hardware for 7 test runs.
ftrace_regression01 - pass
ftrace_regression02 - pass
ftrace-stress-test - pass
dynamic_debug01 - Hangs (No crash log on serial console)
Hm, this last one seems to have a different cause.
Case 1.1) The above case was also noticed on qemu-x86_64 with compat mode, across 12 test runs.
Case 2) x86_64 (64-bit kernel + 64-bit rootfs): I have tested your patch on top of the Linux next-20250606 tag on real x86_64 hardware (64-bit kernel + 64-bit rootfs) for 4 runs; 3 of those runs failed with kernel warnings, a kernel BUG, and an invalid opcode while running LTP tracing test cases.
Here I am sharing the crash log snippets, along with the boot/test log links and the build link.
Test logs:
[ 112.596591] Ring buffer clock went backwards: 113864910133 -> 112596588266
[ 115.829620] cat (5762) used greatest stack depth: 10936 bytes left
[ 120.922517] ------------[ cut here ]------------
[ 120.927198] WARNING: CPU: 2 PID: 6639 at kernel/trace/trace_functions_graph.c:985 print_graph_entry+0x579/0x590
[ 120.937364] Modules linked in: x86_pkg_temp_thermal
[ 120.942405] CPU: 2 UID: 0 PID: 6639 Comm: cat Tainted: G S 6.15.0-next-20250606 #1 PREEMPT(voluntary)
[ 120.953380] Tainted: [S]=CPU_OUT_OF_SPEC
[ 120.957477] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.7 12/07/2021
[ 120.965036] RIP: 0010:print_graph_entry+0x579/0x590
Run 1:
The warning came from;

----
	/* Save this function pointer to see if the exit matches */
	if (call->depth < FTRACE_RETFUNC_DEPTH &&
	    !WARN_ON_ONCE(call->depth < 0))
		cpu_data->enter_funcs[call->depth] = call->func;
}
----
Hit the "call->depth < 0". Thus this is function graph tracer's problem.
ftrace-stress-test: [ 58.963898] /usr/local/bin/kirk[340]: starting test ftrace-stress-test (ftrace_stress_test.sh 90)
[ 60.316588] ------------[ cut here ]------------
[ 60.316588] ------------[ cut here ]------------
[ 60.316590] ------------[ cut here ]------------
[ 60.316593] ------------[ cut here ]------------
[ 60.316593] ------------[ cut here ]------------
[ 60.316594] ------------[ cut here ]------------
[ 60.316594] kernel BUG at kernel/entry/common.c:328!
[ 60.316594] kernel BUG at kernel/entry/common.c:328!
[ 60.316595] kernel BUG at kernel/entry/common.c:328!
[ 60.316600] Oops: invalid opcode: 0000 [#1] SMP PTI
[ 60.316604] CPU: 2 UID: 0 PID: 1556 Comm: sh Tainted: G S 6.15.0-next-20250606 #1 PREEMPT(voluntary)
[ 60.316608] Tainted: [S]=CPU_OUT_OF_SPEC
[ 60.316609] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.7 12/07/2021
[ 60.316614] ------------[ cut here ]------------
[ 60.316615] kernel BUG at kernel/entry/common.c:328!
[ 60.316617] Oops: invalid opcode: 0000 [#2] SMP PTI
[ 60.316620] CPU: 2 UID: 0 PID: 1556 Comm: sh Tainted: G S 6.15.0-next-20250606 #1 PREEMPT(voluntary)
[ 60.316622] Tainted: [S]=CPU_OUT_OF_SPEC
[ 60.316623] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.7 12/07/2021
[ 60.316625] RIP: 0010:irqentry_nmi_enter+0x6c/0x70
Run 2:
Interesting. This hits the maximum number of nested NMIs.
/*
 * nmi_enter() can nest up to 15 times; see NMI_BITS.
 */
#define __nmi_enter()						\
	do {							\
		lockdep_off();					\
		arch_nmi_enter();				\
		BUG_ON(in_nmi() == NMI_MASK);			\	<=====
		__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET); \
	} while (0)
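As a side note, the nesting limit itself is easy to see in a userspace sketch (illustrative only; the bit layout below is an assumption made for the sketch, not taken from this thread): 15 nested entries fit in the 4-bit NMI field, and the 16th trips the BUG_ON(in_nmi() == NMI_MASK) analogue.

-----
#include <assert.h>
#include <stdio.h>

#define NMI_BITS	4
#define NMI_SHIFT	20			/* assumed bit position, for the sketch only */
#define NMI_OFFSET	(1UL << NMI_SHIFT)
#define NMI_MASK	(((1UL << NMI_BITS) - 1) << NMI_SHIFT)

static unsigned long preempt_count;

static void nmi_enter_sketch(void)
{
	/* Mirrors BUG_ON(in_nmi() == NMI_MASK) in __nmi_enter(). */
	assert((preempt_count & NMI_MASK) != NMI_MASK);
	preempt_count += NMI_OFFSET;
}

int main(void)
{
	for (int i = 0; i < 16; i++) {		/* the 16th nested entry asserts */
		nmi_enter_sketch();
		printf("nested NMIs: %d\n", i + 1);
	}
	return 0;
}
-----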
ftrace-stress-test: [ 78.877495] /usr/local/bin/kirk[343]: starting test ftrace-stress-test (ftrace_stress_test.sh 90)
[ 78.977303] Scheduler tracepoints stat_sleep, stat_iowait, stat_blocked and stat_runtime require the kernel parameter schedstats=enable or kernel.sched_schedstats=1
[ 82.299799] cat (2322) used greatest stack depth: 11520 bytes left
[ 82.327708] cat (2327) used greatest stack depth: 11256 bytes left
[ 82.632183] cat (2375) used greatest stack depth: 10992 bytes left
[ 137.335901] ------------[ cut here ]------------
[ 137.335901] ------------[ cut here ]------------
[ 137.335902] ------------[ cut here ]------------
[ 137.335907] kernel BUG at kernel/entry/common.c:328!
[ 137.335908] ------------[ cut here ]------------
[ 137.335909] ------------[ cut here ]------------
[ 137.335912] kernel BUG at kernel/entry/common.c:328!
[ 137.335912] kernel BUG at kernel/entry/common.c:328!
[ 137.335915] Oops: invalid opcode: 0000 [#1] SMP PTI
[ 137.335921] CPU: 0 UID: 0 PID: 544 Comm: sh Tainted: G S 6.15.0-next-20250606 #1 PREEMPT(voluntary)
[ 137.335926] Tainted: [S]=CPU_OUT_OF_SPEC
[ 137.335929] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.7 12/07/2021
[ 137.335937] ------------[ cut here ]------------
[ 137.335939] kernel BUG at kernel/entry/common.c:328!
[ 137.335945] Oops: invalid opcode: 0000 [#2] SMP PTI
[ 137.335949] CPU: 0 UID: 0 PID: 544 Comm: sh Tainted: G S 6.15.0-next-20250606 #1 PREEMPT(voluntary)
[ 137.335953] Tainted: [S]=CPU_OUT_OF_SPEC
[ 137.335956] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.7 12/07/2021
[ 137.335959] RIP: 0010:irqentry_nmi_enter+0x6c/0x70
Run 3:
This is the same as Run 2, and clearer.
In do_int3(), if we hit an int3 that has "disappeared" (one that the text_poke int3 handler does not claim), the handling falls through. This means kprobe_int3_handler() is hit, which calls get_kprobe() to find the corresponding kprobe. But,
ffffffff8150a040 <get_kprobe>:
ffffffff8150a040:	f3 0f 1e fa		endbr64
ffffffff8150a044:	e8 07 b0 e2 ff		call   ffffffff81335050 <__fentry__>
ffffffff8150a049:	48 b8 eb 83 b5 80 46	movabs $0x61c8864680b583eb,%rax
ffffffff8150a050:	86 c8 61
It hits ftrace, is hooked by fgraph, and eventually returns via ftrace_return_to_handler().
[ 137.338572] RIP: 0010:ftrace_return_to_handler+0xd5/0x1f0
[ 137.338577] Code: 00 89 55 c8 48 85 ff 74 07 4c 89 b7 80 00 00 00 49 8b 94 24 38 0b 00 00 48 98 48 8b 04 c2 48 c1 e8 0c 0f b7 c0 48 89 45 b8 cc <90> 48 8b 05 e3 ac c2 01 48 63 80 f8 00 00 00 48 0f a3 45 b8 72 39
This address is;
$ eu-addr2line -fi -e vmlinux ftrace_return_to_handler+0xd5
arch_static_branch inlined at /builds/linux/kernel/trace/fgraph.c:839:6 in ftrace_return_to_handler
/builds/linux/arch/x86/include/asm/jump_label.h:36:2
__ftrace_return_to_handler
/builds/linux/kernel/trace/fgraph.c:839:6
ftrace_return_to_handler
/builds/linux/kernel/trace/fgraph.c:874:9
It is a static_branch, which also uses text_poke.
-----
#ifdef CONFIG_HAVE_STATIC_CALL
	if (static_branch_likely(&fgraph_do_direct)) {	<======
		if (test_bit(fgraph_direct_gops->idx, &bitmap))
			static_call(fgraph_retfunc)(&trace, fgraph_direct_gops, fregs);
-----
But actually, this static_branch modifies the kernel code with smp_text_poke_single() (note: this is a wrapper around the smp_text_poke_batch APIs).
And this int3 is MISSED by smp_text_poke_int3_handler() again, goes through kprobes, hits ftrace (fgraph), and causes this loop.
So the fundamental issue is that smp_text_poke_batch missed handling INT3.
I guess some text_poke user does not take text_mutex?
Thank you,
Build log:
-- Steve
On Fri, 13 Jun 2025 17:27:53 +0900 Masami Hiramatsu (Google) mhiramat@kernel.org wrote:
In do_int3(), if we hit an int3 that has "disappeared" (one that the text_poke int3 handler does not claim), the handling falls through. This means kprobe_int3_handler() is hit, which calls get_kprobe() to find the corresponding kprobe. But,
ffffffff8150a040 <get_kprobe>:
ffffffff8150a040:	f3 0f 1e fa		endbr64
ffffffff8150a044:	e8 07 b0 e2 ff		call   ffffffff81335050 <__fentry__>
ffffffff8150a049:	48 b8 eb 83 b5 80 46	movabs $0x61c8864680b583eb,%rax
ffffffff8150a050:	86 c8 61
BTW, I think this get_kprobe() should be "notrace" because it is called from the int3 handler.
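For illustration, a userspace sketch of the notrace idea (hypothetical, not the actual kernel change): gcc's -finstrument-functions stands in for __fentry__, and the no_instrument_function attribute plays the role of notrace, keeping the lookup helper out of the tracer's own path.

-----
/* Build with: gcc -finstrument-functions notrace_sketch.c */
#include <stdio.h>

#define notrace __attribute__((no_instrument_function))

/* These hooks stand in for the ftrace handler; they must themselves be
 * excluded from instrumentation to avoid recursion. */
notrace void __cyg_profile_func_enter(void *fn, void *caller)
{
	fprintf(stderr, "traced entry: %p\n", fn);
}

notrace void __cyg_profile_func_exit(void *fn, void *caller)
{
}

/* Without "notrace", this lookup would itself be traced, the userspace
 * analogue of get_kprobe() being hooked via __fentry__ above. */
notrace static int lookup(int key)
{
	return key * 2;
}

int main(void)
{
	return lookup(21) == 42 ? 0 : 1;
}
-----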
Thanks,
On Fri, 13 Jun 2025 17:27:53 +0900 Masami Hiramatsu (Google) mhiramat@kernel.org wrote:
Run 3:
This is the same as Run 2, and clearer.
In do_int3(), if we hit an int3 that has "disappeared" (one that the text_poke int3 handler does not claim), the handling falls through. This means kprobe_int3_handler() is hit, which calls get_kprobe() to find the corresponding kprobe. But,
ffffffff8150a040 <get_kprobe>:
ffffffff8150a040:	f3 0f 1e fa		endbr64
ffffffff8150a044:	e8 07 b0 e2 ff		call   ffffffff81335050 <__fentry__>
ffffffff8150a049:	48 b8 eb 83 b5 80 46	movabs $0x61c8864680b583eb,%rax
ffffffff8150a050:	86 c8 61
It hits ftrace, is hooked by fgraph, and eventually returns via ftrace_return_to_handler().
[ 137.338572] RIP: 0010:ftrace_return_to_handler+0xd5/0x1f0
[ 137.338577] Code: 00 89 55 c8 48 85 ff 74 07 4c 89 b7 80 00 00 00 49 8b 94 24 38 0b 00 00 48 98 48 8b 04 c2 48 c1 e8 0c 0f b7 c0 48 89 45 b8 cc <90> 48 8b 05 e3 ac c2 01 48 63 80 f8 00 00 00 48 0f a3 45 b8 72 39
This address is;
$ eu-addr2line -fi -e vmlinux ftrace_return_to_handler+0xd5
arch_static_branch inlined at /builds/linux/kernel/trace/fgraph.c:839:6 in ftrace_return_to_handler
/builds/linux/arch/x86/include/asm/jump_label.h:36:2
__ftrace_return_to_handler
/builds/linux/kernel/trace/fgraph.c:839:6
ftrace_return_to_handler
/builds/linux/kernel/trace/fgraph.c:874:9
It is a static_branch, which also uses text_poke.
#ifdef CONFIG_HAVE_STATIC_CALL
	if (static_branch_likely(&fgraph_do_direct)) {	<======
		if (test_bit(fgraph_direct_gops->idx, &bitmap))
			static_call(fgraph_retfunc)(&trace, fgraph_direct_gops, fregs);
But actually, this static_branch modifies the kernel code with smp_text_poke_single() (note: this is a wrapper around the smp_text_poke_batch APIs).
And this int3 is MISSED by smp_text_poke_int3_handler() again, goes through kprobes, hits ftrace (fgraph), and causes this loop.
So the fundamental issue is that smp_text_poke_batch missed handling INT3.
I guess some text_poke user does not take text_mutex?
Hmm, I've checked the smp_text_poke_* users, but there seems to be no problem. Basically, those smp_text_poke_* users lock text_mutex, and the other suspicious variable, ftrace_start_up, is also set under ftrace_lock. ftrace_arch_code_modify_post_process() is also paired with ftrace_arch_code_modify_prepare() and runs under ftrace_lock.
smp_text_poke_single()
  ftrace_mod_jmp()
    ftrace_enable_ftrace_graph_caller()
      ftrace_modify_all_code() -> see [*1]
    ftrace_disable_ftrace_graph_caller()
      ftrace_modify_all_code() -> see [*1]
  ftrace_update_ftrace_func()
    update_ftrace_func()
      ftrace_modify_all_code() -> see [*1]
smp_text_poke_batch_add()
  arch_jump_label_transform_queue() -> lock text_mutex
  ftrace_replace_code()
    ftrace_modify_all_code() <------ [*1]
      arch_ftrace_update_code()
        ftrace_run_update_code() -> lock text_mutex
  ftrace_modify_code_direct() (only if ftrace_poke_late != 0)
    ftrace_make_nop()
      __ftrace_replace_code() <---- [*3]
        ftrace_replace_code(weak) --> Not used on x86 (overridden)
          ftrace_modify_all_code() <--- [*1]
            arch_ftrace_update_code() <---- [*4]
              ftrace_run_update_code() -> lock text_mutex
            __ftrace_modify_code()
              ftrace_run_stop_machine()
                arch_ftrace_update_code(weak) -> overridden on x86, see [*4]
      ftrace_module_enable() -> lock text_mutex (see below)
    ftrace_init_nop()
      ftrace_nop_initialize()
        ftrace_update_code()
          ftrace_module_enable() -> lock text_mutex
            prepare_coming_module()
              load_module()
          ftrace_process_locs() -> lock ftrace_lock
            ftrace_init() -> OK (ftrace_poke_late == 0 because it's early)
            ftrace_module_init() -> OK (ftrace_poke_late == 0 because module is not live)
              load_module()
    ftrace_make_call()
      __ftrace_replace_code() -> see [*3]
smp_text_poke_batch_finish()
  arch_jump_label_transform_apply() -> lock text_mutex
  ftrace_arch_code_modify_post_process() -> must be OK because this unlocks text_mutex
    ftrace_run_update_code() -> paired with ftrace_arch_code_modify_prepare()
    ftrace_module_enable() -> paired with ftrace_arch_code_modify_prepare() (depends on ftrace_lock && ftrace_start_up)
  ftrace_replace_code()
    ftrace_modify_all_code() -> see [*1]
ftrace_start_up <is this variable set under ftrace_lock?>
  ftrace_startup()
    ftrace_startup_subops()
      register_ftrace_graph() -> lock ftrace_lock
    register_ftrace_function_probe() -> lock ftrace_lock
    register_ftrace_function_nolock() -> lock ftrace_lock
  ftrace_shutdown()
    unregister_ftrace_function() -> lock ftrace_lock
ftrace_arch_code_modify_prepare() <this sets ftrace_poke_late = 1>
  ftrace_module_enable() -> lock ftrace_lock
  ftrace_run_update_code()
    ftrace_run_modify_code()
      ftrace_ops_update_code()
        __ftrace_hash_move_and_update_ops()
        ftrace_update_ops()
          ftrace_startup_subops()
            register_ftrace_graph() -> lock ftrace_lock
          ftrace_shutdown_subops()
            unregister_ftrace_graph() -> lock ftrace_lock
          ftrace_hash_move_and_update_subops()
            ftrace_hash_move_and_update_ops() -> [*2]
        ftrace_hash_move_and_update_ops() <-- [*2]
          process_mod_list() -> lock ftrace_lock
          register_ftrace_function_probe() -> lock ftrace_lock
          unregister_ftrace_function_probe_func() -> lock ftrace_lock
          ftrace_set_hash() -> lock ftrace_lock
        ftrace_regex_release() -> lock ftrace_lock
        unregister_ftrace_function_probe_func() -> lock ftrace_lock
    ftrace_startup_enable()
      ftrace_startup_all()
        ftrace_pid_reset() -> lock ftrace_lock
        pid_write() -> lock ftrace_lock
    ftrace_startup()
      ftrace_startup_subops()
        register_ftrace_graph() -> lock ftrace_lock
      register_ftrace_function_probe() -> lock ftrace_lock
      register_ftrace_function_nolock() -> lock ftrace_lock
    ftrace_startup_sysctl()
      ftrace_enable_sysctl() -> lock ftrace_lock
    ftrace_shutdown()
      ftrace_shutdown_subops()
        unregister_ftrace_graph() -> lock ftrace_lock
      unregister_ftrace_function_probe_func() -> lock ftrace_lock
      ftrace_destroy_filter_files() -> lock ftrace_lock
      unregister_ftrace_function() -> lock ftrace_lock
    ftrace_shutdown_sysctl()
      ftrace_enable_sysctl() -> lock ftrace_lock
Thanks,
On Mon, 16 Jun 2025 16:36:59 +0900 Masami Hiramatsu (Google) mhiramat@kernel.org wrote:
So the fundamental issue is that smp_text_poke_batch missed handling INT3.
I guess some text_poke user does not take text_mutex?
Hmm, I've checked the smp_text_poke_* users, but there seems to be no problem. Basically, those smp_text_poke_* users lock text_mutex, and the other suspicious variable, ftrace_start_up, is also set under ftrace_lock. ftrace_arch_code_modify_post_process() is also paired with ftrace_arch_code_modify_prepare() and runs under ftrace_lock.
Eventually, I found a bug in text_poke, and jump_label (tracepoint) hit the bug.
jump_label uses two different APIs (single and batch), each of which independently takes the text_mutex lock:

smp_text_poke_single()
  __jump_label_transform()
    jump_label_transform() --> lock text_mutex

smp_text_poke_batch_add()
  arch_jump_label_transform_queue() -> lock text_mutex

smp_text_poke_batch_finish()
  arch_jump_label_transform_apply() -> lock text_mutex
This is allowed by commit 8a6a1b4e0ef1 ("x86/alternatives: Remove the mixed-patching restriction on smp_text_poke_single()"), but smp_text_poke_single() still expects that the batched APIs are run within the same text_mutex lock region. Thus, if a user calls those APIs in the order below:

arch_jump_label_transform_queue(addr1)
jump_label_transform(addr2)
arch_jump_label_transform_apply()

and addr1 > addr2, the bsearch on the array does not work and fails to handle the int3!
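To make the lookup failure concrete, here is a minimal userspace sketch (not the kernel's int3 handler; the addresses and the tiny array are made up): bsearch() is only defined for sorted input, so an address appended out of order can become unfindable, the analogue of the int3 handler failing to claim a just-poked int3.

-----
#include <stdio.h>
#include <stdlib.h>

struct poke_loc {
	unsigned long addr;
};

static int cmp_key(const void *key, const void *elt)
{
	unsigned long a = *(const unsigned long *)key;
	unsigned long b = ((const struct poke_loc *)elt)->addr;

	return (a > b) - (a < b);
}

int main(void)
{
	/* 0x2000 and 0x3000 were queued by the batch API (sorted); 0x1000 was
	 * then appended by the single-poke path, leaving the array unsorted. */
	struct poke_loc vec[] = { { 0x2000 }, { 0x3000 }, { 0x1000 } };
	unsigned long key = 0x1000;

	void *hit = bsearch(&key, vec, 3, sizeof(vec[0]), cmp_key);
	printf("0x%lx: %s\n", key, hit ? "found" : "MISSED -> int3 left unhandled");
	return 0;
}
-----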
This can explain the "disappeared int3" case: if this happens right before the int3 is overwritten, that int3 will already have been overwritten by the time the int3 handler dumps the code, but text_poke_array_refs is still 1.

It seems that commit c8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs") introduced this problem, because it makes text_poke_batch and text_poke_single share the global array. Before that commit, text_poke_single (text_poke_bp) used its own local variable.

To fix this issue, use smp_text_poke_batch_add() in smp_text_poke_single(), which checks that the array stays sorted and that the array index does not overflow.
Please test below;
From e2a49c7cefb4148ea3142c752396d39f103c9f4d Mon Sep 17 00:00:00 2001
From: "Masami Hiramatsu (Google)" mhiramat@kernel.org
Date: Tue, 17 Jun 2025 19:18:37 +0900
Subject: [PATCH] x86: alternative: Fix int3 handling failure from broken text_poke array
Since smp_text_poke_single() does not expect another text_poke request to already be queued, it can leave text_poke_array unsorted or cause a buffer overflow on text_poke_array.vec[]. This will cause an Oops on int3, or a kernel page fault if the buffer overflows.

Use smp_text_poke_batch_add() instead of __smp_text_poke_batch_add() so that it correctly flushes the queue if needed.
Reported-by: Linux Kernel Functional Testing lkft@linaro.org
Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg...
Fixes: c8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs")
Signed-off-by: Masami Hiramatsu (Google) mhiramat@kernel.org
---
 arch/x86/kernel/alternative.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ecfe7b497cad..8038951650c6 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -3107,6 +3107,6 @@ void __ref smp_text_poke_batch_add(void *addr, const void *opcode, size_t len, c
  */
 void __ref smp_text_poke_single(void *addr, const void *opcode, size_t len, const void *emulate)
 {
-	__smp_text_poke_batch_add(addr, opcode, len, emulate);
+	smp_text_poke_batch_add(addr, opcode, len, emulate);
 	smp_text_poke_batch_finish();
 }
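For readers who want the shape of the fix without the surrounding alternative.c context, here is a simplified, self-contained sketch (the helper names and the MAX constant are illustrative, not the real code; only the flush-before-add idea comes from the patch): the public batch-add flushes the pending queue whenever appending would overflow the array or break its sorted order, which is exactly the check the raw __smp_text_poke_batch_add() path skips.

-----
#include <stdio.h>

#define MAX_ENTRIES 4

struct poke_entry {
	unsigned long addr;
};

static struct poke_entry queue[MAX_ENTRIES];
static int queued;

static void queue_flush(void)			/* stands in for smp_text_poke_batch_finish() */
{
	printf("flushing %d queued pokes\n", queued);
	queued = 0;
}

static void raw_add(unsigned long addr)		/* like __smp_text_poke_batch_add(): no checks */
{
	queue[queued++].addr = addr;
}

static void checked_add(unsigned long addr)	/* like smp_text_poke_batch_add() */
{
	/* Flush first if appending would overflow or leave the queue unsorted. */
	if (queued == MAX_ENTRIES || (queued && queue[queued - 1].addr > addr))
		queue_flush();
	raw_add(addr);
}

int main(void)
{
	checked_add(0x2000);
	checked_add(0x3000);
	checked_add(0x1000);	/* lower address: forces a flush instead of corrupting the order */
	queue_flush();
	return 0;
}
-----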
Hi Masami,
On Tue, 17 Jun 2025 at 16:12, Masami Hiramatsu mhiramat@kernel.org wrote:
On Mon, 16 Jun 2025 16:36:59 +0900 Masami Hiramatsu (Google) mhiramat@kernel.org wrote:
So the fundamental issue is that smp_text_poke_batch missed handling INT3.
I guess some text_poke user does not take text_mutex?
Hmm, I've checked the smp_text_poke_* users, but there seems to be no problem. Basically, those smp_text_poke_* users lock text_mutex, and the other suspicious variable, ftrace_start_up, is also set under ftrace_lock. ftrace_arch_code_modify_post_process() is also paired with ftrace_arch_code_modify_prepare() and runs under ftrace_lock.
Eventually, I found a bug in text_poke, and jump_label (tracepoint) hit the bug.
jump_label uses two different APIs (single and batch), each of which independently takes the text_mutex lock:

smp_text_poke_single()
  __jump_label_transform()
    jump_label_transform() --> lock text_mutex

smp_text_poke_batch_add()
  arch_jump_label_transform_queue() -> lock text_mutex

smp_text_poke_batch_finish()
  arch_jump_label_transform_apply() -> lock text_mutex
This is allowed by commit 8a6a1b4e0ef1 ("x86/alternatives: Remove the mixed-patching restriction on smp_text_poke_single()"), but smp_text_poke_single() still expects that the batched APIs are run within the same text_mutex lock region. Thus, if a user calls those APIs in the order below:

arch_jump_label_transform_queue(addr1)
jump_label_transform(addr2)
arch_jump_label_transform_apply()

and addr1 > addr2, the bsearch on the array does not work and fails to handle the int3!
This can explain the "disappeared int3" case: if this happens right before the int3 is overwritten, that int3 will already have been overwritten by the time the int3 handler dumps the code, but text_poke_array_refs is still 1.

It seems that commit c8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs") introduced this problem, because it makes text_poke_batch and text_poke_single share the global array. Before that commit, text_poke_single (text_poke_bp) used its own local variable.

To fix this issue, use smp_text_poke_batch_add() in smp_text_poke_single(), which checks that the array stays sorted and that the array index does not overflow.
Please test below;
Do you mean only this single patch on top of Linux next?
From e2a49c7cefb4148ea3142c752396d39f103c9f4d Mon Sep 17 00:00:00 2001
From: "Masami Hiramatsu (Google)" mhiramat@kernel.org
Date: Tue, 17 Jun 2025 19:18:37 +0900
Subject: [PATCH] x86: alternative: Fix int3 handling failure from broken text_poke array
Since smp_text_poke_single() does not expect another text_poke request to already be queued, it can leave text_poke_array unsorted or cause a buffer overflow on text_poke_array.vec[]. This will cause an Oops on int3, or a kernel page fault if the buffer overflows.

Use smp_text_poke_batch_add() instead of __smp_text_poke_batch_add() so that it correctly flushes the queue if needed.
Reported-by: Linux Kernel Functional Testing lkft@linaro.org
Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg...
Fixes: c8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs")
Signed-off-by: Masami Hiramatsu (Google) mhiramat@kernel.org
 arch/x86/kernel/alternative.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ecfe7b497cad..8038951650c6 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -3107,6 +3107,6 @@ void __ref smp_text_poke_batch_add(void *addr, const void *opcode, size_t len, c
  */
 void __ref smp_text_poke_single(void *addr, const void *opcode, size_t len, const void *emulate)
 {
-	__smp_text_poke_batch_add(addr, opcode, len, emulate);
+	smp_text_poke_batch_add(addr, opcode, len, emulate);
 	smp_text_poke_batch_finish();
 }
2.50.0.rc2.692.g299adb8693-goog
-- Masami Hiramatsu (Google) mhiramat@kernel.org
On Tue, 17 Jun 2025 17:40:25 +0530 Naresh Kamboju naresh.kamboju@linaro.org wrote:
Please test below;
Do you mean only this single patch on top of Linux next?
Looking at Masami's analysis, yeah, I think you only need that one patch.
-- Steve
On Tue, 17 Jun 2025 at 17:55, Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 17 Jun 2025 17:40:25 +0530 Naresh Kamboju naresh.kamboju@linaro.org wrote:
Please test below;
Do you mean only this single patch on top of Linux next?
Looking at Masami's analysis, yeah, I think you only need that one patch.
Testing is in progress.
-- Steve
On Tue, 17 Jun 2025 19:41:59 +0900 Masami Hiramatsu (Google) mhiramat@kernel.org wrote:
Eventually, I found a bug in text_poke, and jump_label (tracepoint) hit the bug.
jump_label uses two different APIs (single and batch), each of which independently takes the text_mutex lock:

smp_text_poke_single()
  __jump_label_transform()
    jump_label_transform() --> lock text_mutex

smp_text_poke_batch_add()
  arch_jump_label_transform_queue() -> lock text_mutex

smp_text_poke_batch_finish()
  arch_jump_label_transform_apply() -> lock text_mutex
This is allowed by commit 8a6a1b4e0ef1 ("x86/alternatives: Remove the mixed-patching restriction on smp_text_poke_single()"), but smp_text_poke_single() still expects that the batched APIs are run within the same text_mutex lock region. Thus, if a user calls those APIs in the order below:

arch_jump_label_transform_queue(addr1)
jump_label_transform(addr2)
arch_jump_label_transform_apply()

and addr1 > addr2, the bsearch on the array does not work and fails to handle the int3!
Nice catch!
This can explain the "disappeared int3" case: if this happens right before the int3 is overwritten, that int3 will already have been overwritten by the time the int3 handler dumps the code, but text_poke_array_refs is still 1.

It seems that commit c8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs") introduced this problem, because it makes text_poke_batch and text_poke_single share the global array. Before that commit, text_poke_single (text_poke_bp) used its own local variable.

To fix this issue, use smp_text_poke_batch_add() in smp_text_poke_single(), which checks that the array stays sorted and that the array index does not overflow.
Please test below;
From e2a49c7cefb4148ea3142c752396d39f103c9f4d Mon Sep 17 00:00:00 2001
From: "Masami Hiramatsu (Google)" mhiramat@kernel.org Date: Tue, 17 Jun 2025 19:18:37 +0900 Subject: [PATCH] x86: alternative: Fix int3 handling failure from broken text_poke array
Since smp_text_poke_single() does not expect another text_poke request to already be queued, it can leave text_poke_array unsorted or cause a buffer overflow on text_poke_array.vec[]. This will cause an Oops on int3, or a kernel page fault if the buffer overflows.
I would add more of what you found above to the change log. Also, I don't think the issue that was triggered was because of a buffer overflow; it was because an entry was added to text_poke_array out of order, causing the bsearch to fail.
Please add to the change log that the issue is that smp_text_poke_single() can be called while smp_text_poke_batch*() is being used. The locking is around the called functions but nothing prevents them from being intermingled.
This means that if we have:
CPU 0                        CPU 1                        CPU 2
-----                        -----                        -----
smp_text_poke_batch_add()
                             smp_text_poke_single() <<-- Adds out of order
                                                          <int3>
                                                          [ Fails to find address in text_poke_array ]
                                                          OOPS!
No overflow. This could possibly happen with just two entries!
Use smp_text_poke_batch_add() instead of __smp_text_poke_batch_add() so that it correctly flushes the queue if needed.
Reported-by: Linux Kernel Functional Testing lkft@linaro.org
Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg...
Fixes: c8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs")
Signed-off-by: Masami Hiramatsu (Google)
Reviewed-by: Steven Rostedt (Google) rostedt@goodmis.org
-- Steve
mhiramat@kernel.org
---
 arch/x86/kernel/alternative.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ecfe7b497cad..8038951650c6 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -3107,6 +3107,6 @@ void __ref smp_text_poke_batch_add(void *addr, const void *opcode, size_t len, c
  */
 void __ref smp_text_poke_single(void *addr, const void *opcode, size_t len, const void *emulate)
 {
-	__smp_text_poke_batch_add(addr, opcode, len, emulate);
+	smp_text_poke_batch_add(addr, opcode, len, emulate);
 	smp_text_poke_batch_finish();
 }
On Tue, 17 Jun 2025 10:29:51 -0400 Steven Rostedt rostedt@goodmis.org wrote:
From e2a49c7cefb4148ea3142c752396d39f103c9f4d Mon Sep 17 00:00:00 2001
From: "Masami Hiramatsu (Google)" mhiramat@kernel.org Date: Tue, 17 Jun 2025 19:18:37 +0900 Subject: [PATCH] x86: alternative: Fix int3 handling failure from broken text_poke array
Since smp_text_poke_single() does not expect another text_poke request to already be queued, it can leave text_poke_array unsorted or cause a buffer overflow on text_poke_array.vec[]. This will cause an Oops on int3, or a kernel page fault if the buffer overflows.
I would add more of what you found above to the change log. Also, I don't think the issue that was triggered was because of a buffer overflow; it was because an entry was added to text_poke_array out of order, causing the bsearch to fail.
There are two patterns of bugs I saw: one is "Oops: int3" and the other is "#PF in smp_text_poke_batch_finish (or smp_text_poke_int3_handler)". The latter comes from the buffer overflow.
-----
[ 164.164215] BUG: unable to handle page fault for address: ffffffff32c00000
[ 164.166999] #PF: supervisor read access in kernel mode
[ 164.169096] #PF: error_code(0x0000) - not-present page
[ 164.171143] PGD 8364b067 P4D 8364b067 PUD 0
[ 164.172954] Oops: Oops: 0000 [#1] SMP PTI
[ 164.174581] CPU: 4 UID: 0 PID: 2702 Comm: sh Tainted: G W 6.15.0-next-20250606-00002-g75b4e49588c2 #239 PREEMPT(voluntary)
[ 164.179193] Tainted: [W]=WARN
[ 164.180926] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 164.184696] RIP: 0010:smp_text_poke_batch_finish+0xb9/0x400
[ 164.186873] Code: e4 4c 8d 6d c2 85 c9 74 39 48 63 03 b9 01 00 00 00 4c 89 ea 41 83 c4 01 48 c7 c7 d0 f7 f7 b2 48 83 c3 10 48 8d b0 00 00 c0 b2 <0f> b6 80 00 00 c0 b2 88 43 ff e8 68 e3 ff ff 44 3b 25 d1 29 5f 02
-----
This is because smp_text_poke_single() overwrites text_poke_array.vec[TEXT_POKE_ARRAY_MAX], which overlaps nr_entries (and the variables next to text_poke_array).
-----
static struct smp_text_poke_array {
	struct smp_text_poke_loc vec[TEXT_POKE_ARRAY_MAX];
	int nr_entries;
} text_poke_array;
-----
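The layout consequence can be reproduced in a few lines of userspace C (tiny, made-up sizes; only the array-followed-by-counter layout mirrors the struct above): a store to vec[TEXT_POKE_ARRAY_MAX] lands on nr_entries, so the element count turns into garbage, which is consistent with the later walk of the array faulting as in the #PF above.

-----
#include <stdio.h>

#define TEXT_POKE_ARRAY_MAX 4		/* tiny stand-in for the real limit */

struct smp_text_poke_loc {
	unsigned long addr;
};

static struct smp_text_poke_array {
	struct smp_text_poke_loc vec[TEXT_POKE_ARRAY_MAX];
	int nr_entries;
} text_poke_array;

int main(void)
{
	text_poke_array.nr_entries = TEXT_POKE_ARRAY_MAX;	/* array already full */

	/* What an unchecked append amounts to when the array is full:
	 * vec[nr_entries] is one element past the end of vec[]. */
	text_poke_array.vec[text_poke_array.nr_entries].addr = 0xdeadbeef;

	/* nr_entries is now garbage, so any later walk or bsearch over vec[]
	 * uses a bogus element count. */
	printf("nr_entries after the stray store: %d\n", text_poke_array.nr_entries);
	return 0;
}
-----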
Please add to the change log that the issue is that smp_text_poke_single() can be called while smp_text_poke_batch*() is being used. The locking is around the called functions but nothing prevents them from being intermingled.
OK.
This means that if we have:
CPU 0                        CPU 1                        CPU 2
smp_text_poke_batch_add()
                             smp_text_poke_single() <<-- Adds out of order
                                                          <int3>
                                                          [ Fails to find address in text_poke_array ]
                                                          OOPS!
Thanks for the chart!
No overflow. This could possibly happen with just two entries!
Yes, that is actually what I observed (via a debug patch).
Use smp_text_poke_batch_add() instead of __smp_text_poke_batch_add() so that it correctly flushes the queue if needed.
Reported-by: Linux Kernel Functional Testing lkft@linaro.org
Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg...
Fixes: c8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs")
Signed-off-by: Masami Hiramatsu (Google)
Reviewed-by: Steven Rostedt (Google) rostedt@goodmis.org
Thank you!
-- Steve
mhiramat@kernel.org
---
 arch/x86/kernel/alternative.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ecfe7b497cad..8038951650c6 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -3107,6 +3107,6 @@ void __ref smp_text_poke_batch_add(void *addr, const void *opcode, size_t len, c
  */
 void __ref smp_text_poke_single(void *addr, const void *opcode, size_t len, const void *emulate)
 {
-	__smp_text_poke_batch_add(addr, opcode, len, emulate);
+	smp_text_poke_batch_add(addr, opcode, len, emulate);
 	smp_text_poke_batch_finish();
 }
On Wed, 18 Jun 2025 08:40:22 +0900 Masami Hiramatsu (Google) mhiramat@kernel.org wrote:
I would add more of what you found above to the change log. Also, I don't think the issue that was triggered was because of a buffer overflow; it was because an entry was added to text_poke_array out of order, causing the bsearch to fail.
There are two patterns of bugs I saw: one is "Oops: int3" and the other is "#PF in smp_text_poke_batch_finish (or smp_text_poke_int3_handler)". The latter comes from the buffer overflow.
[ 164.164215] BUG: unable to handle page fault for address: ffffffff32c00000
[ 164.166999] #PF: supervisor read access in kernel mode
[ 164.169096] #PF: error_code(0x0000) - not-present page
[ 164.171143] PGD 8364b067 P4D 8364b067 PUD 0
[ 164.172954] Oops: Oops: 0000 [#1] SMP PTI
[ 164.174581] CPU: 4 UID: 0 PID: 2702 Comm: sh Tainted: G W 6.15.0-next-20250606-00002-g75b4e49588c2 #239 PREEMPT(voluntary)
[ 164.179193] Tainted: [W]=WARN
[ 164.180926] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 164.184696] RIP: 0010:smp_text_poke_batch_finish+0xb9/0x400
[ 164.186873] Code: e4 4c 8d 6d c2 85 c9 74 39 48 63 03 b9 01 00 00 00 4c 89 ea 41 83 c4 01 48 c7 c7 d0 f7 f7 b2 48 83 c3 10 48 8d b0 00 00 c0 b2 <0f> b6 80 00 00 c0 b2 88 43 ff e8 68 e3 ff ff 44 3b 25 d1 29 5f 02
This is because smp_text_poke_single() overwrites text_poke_array.vec[TEXT_POKE_ARRAY_MAX], which overlaps nr_entries (and the variables next to text_poke_array).
Interesting. It must be that the stress test was able to get in and add a bunch of individual entries while a batch was being performed.
Still, both are a bug and solved by the same solution ;-)
(Two for the price of one!)
-- Steve
On Tue, 17 Jun 2025 at 16:12, Masami Hiramatsu mhiramat@kernel.org wrote:
On Mon, 16 Jun 2025 16:36:59 +0900 Masami Hiramatsu (Google) mhiramat@kernel.org wrote:
So the fundamental issue is that smp_text_poke_batch missed handling INT3.
I guess some text_poke user does not take text_mutex?
Hmm, I've checked the smp_text_poke_* users, but there seems to be no problem. Basically, those smp_text_poke_* users lock text_mutex, and the other suspicious variable, ftrace_start_up, is also set under ftrace_lock. ftrace_arch_code_modify_post_process() is also paired with ftrace_arch_code_modify_prepare() and runs under ftrace_lock.
Eventually, I found a bug in text_poke, and jump_label (tracepoint) hit the bug.
jump_label uses two different APIs (single and batch), each of which independently takes the text_mutex lock:

smp_text_poke_single()
  __jump_label_transform()
    jump_label_transform() --> lock text_mutex

smp_text_poke_batch_add()
  arch_jump_label_transform_queue() -> lock text_mutex

smp_text_poke_batch_finish()
  arch_jump_label_transform_apply() -> lock text_mutex
This is allowed by commit 8a6a1b4e0ef1 ("x86/alternatives: Remove the mixed-patching restriction on smp_text_poke_single()"), but smp_text_poke_single() still expects that the batched APIs are run within the same text_mutex lock region. Thus, if a user calls those APIs in the order below:

arch_jump_label_transform_queue(addr1)
jump_label_transform(addr2)
arch_jump_label_transform_apply()

and addr1 > addr2, the bsearch on the array does not work and fails to handle the int3!
This can explain the "disappeared int3" case: if this happens right before the int3 is overwritten, that int3 will already have been overwritten by the time the int3 handler dumps the code, but text_poke_array_refs is still 1.

It seems that commit c8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs") introduced this problem, because it makes text_poke_batch and text_poke_single share the global array. Before that commit, text_poke_single (text_poke_bp) used its own local variable.

To fix this issue, use smp_text_poke_batch_add() in smp_text_poke_single(), which checks that the array stays sorted and that the array index does not overflow.
Please test below;
From e2a49c7cefb4148ea3142c752396d39f103c9f4d Mon Sep 17 00:00:00 2001
From: "Masami Hiramatsu (Google)" mhiramat@kernel.org
Date: Tue, 17 Jun 2025 19:18:37 +0900
Subject: [PATCH] x86: alternative: Fix int3 handling failure from broken text_poke array
Since smp_text_poke_single() does not expect another text_poke request to already be queued, it can leave text_poke_array unsorted or cause a buffer overflow on text_poke_array.vec[]. This will cause an Oops on int3, or a kernel page fault if the buffer overflows.

Use smp_text_poke_batch_add() instead of __smp_text_poke_batch_add() so that it correctly flushes the queue if needed.
I’ve applied the patch on top of Linux next-20250617 and ran the LTP tracing tests. I'm happy to report that the previously observed kernel panic has been resolved.
Tested-by: Linux Kernel Functional Testing lkft@linaro.org
Reported-by: Linux Kernel Functional Testing lkft@linaro.org
Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg...
Fixes: c8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs")
Signed-off-by: Masami Hiramatsu (Google) mhiramat@kernel.org
 arch/x86/kernel/alternative.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ecfe7b497cad..8038951650c6 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -3107,6 +3107,6 @@ void __ref smp_text_poke_batch_add(void *addr, const void *opcode, size_t len, c
  */
 void __ref smp_text_poke_single(void *addr, const void *opcode, size_t len, const void *emulate)
 {
-	__smp_text_poke_batch_add(addr, opcode, len, emulate);
+	smp_text_poke_batch_add(addr, opcode, len, emulate);
 	smp_text_poke_batch_finish();
 }
2.50.0.rc2.692.g299adb8693-goog
-- Masami Hiramatsu (Google) mhiramat@kernel.org
On Tue, 17 Jun 2025 22:15:20 +0530 Naresh Kamboju naresh.kamboju@linaro.org wrote:
On Tue, 17 Jun 2025 at 16:12, Masami Hiramatsu mhiramat@kernel.org wrote:
On Mon, 16 Jun 2025 16:36:59 +0900 Masami Hiramatsu (Google) mhiramat@kernel.org wrote:
So the fundamental issue is that smp_text_poke_batch missed handling INT3.
I guess some text_poke user does not take text_mutex?
Hmm, I've checked the smp_text_poke_* users, but there seems to be no problem. Basically, those smp_text_poke_* users lock text_mutex, and the other suspicious variable, ftrace_start_up, is also set under ftrace_lock. ftrace_arch_code_modify_post_process() is also paired with ftrace_arch_code_modify_prepare() and runs under ftrace_lock.
Eventually, I found a bug in text_poke, and jump_label (tracepoint) hit the bug.
jump_label uses two different APIs (single and batch), each of which independently takes the text_mutex lock:

smp_text_poke_single()
  __jump_label_transform()
    jump_label_transform() --> lock text_mutex

smp_text_poke_batch_add()
  arch_jump_label_transform_queue() -> lock text_mutex

smp_text_poke_batch_finish()
  arch_jump_label_transform_apply() -> lock text_mutex
This is allowed by commit 8a6a1b4e0ef1 ("x86/alternatives: Remove the mixed-patching restriction on smp_text_poke_single()"), but smp_text_poke_single() still expects that the batched APIs are run within the same text_mutex lock region. Thus, if a user calls those APIs in the order below:

arch_jump_label_transform_queue(addr1)
jump_label_transform(addr2)
arch_jump_label_transform_apply()

and addr1 > addr2, the bsearch on the array does not work and fails to handle the int3!
This can explain the "disappeared int3" case: if this happens right before the int3 is overwritten, that int3 will already have been overwritten by the time the int3 handler dumps the code, but text_poke_array_refs is still 1.

It seems that commit c8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs") introduced this problem, because it makes text_poke_batch and text_poke_single share the global array. Before that commit, text_poke_single (text_poke_bp) used its own local variable.

To fix this issue, use smp_text_poke_batch_add() in smp_text_poke_single(), which checks that the array stays sorted and that the array index does not overflow.
Please test below;
From e2a49c7cefb4148ea3142c752396d39f103c9f4d Mon Sep 17 00:00:00 2001
From: "Masami Hiramatsu (Google)" mhiramat@kernel.org
Date: Tue, 17 Jun 2025 19:18:37 +0900
Subject: [PATCH] x86: alternative: Fix int3 handling failure from broken text_poke array
Since smp_text_poke_single() does not expect another text_poke request to already be queued, it can leave text_poke_array unsorted or cause a buffer overflow on text_poke_array.vec[]. This will cause an Oops on int3, or a kernel page fault if the buffer overflows.

Use smp_text_poke_batch_add() instead of __smp_text_poke_batch_add() so that it correctly flushes the queue if needed.
I’ve applied the patch on top of Linux next-20250617 and ran the LTP tracing tests. I'm happy to report that the previously observed kernel panic has been resolved.
Tested-by: Linux Kernel Functional Testing lkft@linaro.org
Thank you for testing! This is a good chance for me to set up the LTP environment locally :)
Thanks!
Reported-by: Linux Kernel Functional Testing lkft@linaro.org
Closes: https://lore.kernel.org/all/CA+G9fYsLu0roY3DV=tKyqP7FEKbOEETRvTDhnpPxJGbA=Cg...
Fixes: c8976ade0c1b ("x86/alternatives: Simplify smp_text_poke_single() by using tp_vec and existing APIs")
Signed-off-by: Masami Hiramatsu (Google) mhiramat@kernel.org
 arch/x86/kernel/alternative.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index ecfe7b497cad..8038951650c6 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -3107,6 +3107,6 @@ void __ref smp_text_poke_batch_add(void *addr, const void *opcode, size_t len, c
  */
 void __ref smp_text_poke_single(void *addr, const void *opcode, size_t len, const void *emulate)
 {
-	__smp_text_poke_batch_add(addr, opcode, len, emulate);
+	smp_text_poke_batch_add(addr, opcode, len, emulate);
 	smp_text_poke_batch_finish();
 }
2.50.0.rc2.692.g299adb8693-goog
-- Masami Hiramatsu (Google) mhiramat@kernel.org
On Wed, 18 Jun 2025 08:05:54 +0900 Masami Hiramatsu (Google) mhiramat@kernel.org wrote:
Tested-by: Linux Kernel Functional Testing lkft@linaro.org
Thank you for testing! This is a good chance for me to set up the LTP environment locally :)
It's a beast and so far, it continues to fail to build for me :-p
-- Steve
[ Adding x86 and text_poke folks ]
On Thu, 5 Jun 2025 17:12:10 +0530 Naresh Kamboju naresh.kamboju@linaro.org wrote:
Regressions found on qemu-x86_64 with compat mode (64-bit kernel running on 32-bit userspace) while running LTP tracing test suite on Linux next-20250605 tag kernel.
Regressions found on
- LTP tracing
Regression Analysis:
- New regression? Yes
- Reproducible? Intermittent
Test regression: qemu-x86_64-compat mode ltp tracing Oops int3 kernel panic
Reported-by: Linux Kernel Functional Testing lkft@linaro.org
## Test log
ftrace-stress-test: <12>[ 21.971153] /usr/local/bin/kirk[277]: starting test ftrace-stress-test (ftrace_stress_test.sh 90)
<4>[ 58.997439] Oops: int3: 0000 [#1] SMP PTI
Did anything change with text_poke? Ftrace just happens to stress text_poke more than anything else, as it updates tens of thousands of locations at a time.
The ftrace code hasn't changed in a while, but I think there have been updates to text_poke.
Modifying code and adding/removing the int3 handler need to be synchronized correctly, or something like this bug can happen.
-- Steve
<4>[ 58.998089] CPU: 0 UID: 0 PID: 323 Comm: sh Not tainted 6.15.0-next-20250605 #1 PREEMPT(voluntary) <4>[ 58.998152] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 <4>[ 58.998260] RIP: 0010:_raw_spin_lock+0x5/0x50 <4>[ 58.998563] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15 12 e4 fe <4>[ 58.998610] RSP: 0018:ffff9494007bbe98 EFLAGS: 00000246 <4>[ 58.998715] RAX: ffff912a042edd00 RBX: 000000000000000b RCX: 0000000000000000 <4>[ 58.998727] RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff912a00f2c8c0 <4>[ 58.998737] RBP: ffff9494007bbeb8 R08: 0000000000000000 R09: 0000000000000000 <4>[ 58.998748] R10: 0000000000000000 R11: 0000000000000000 R12: ffff912a00f2c8c0 <4>[ 58.998759] R13: ffff912a00f2c840 R14: 0000000000000006 R15: 0000000000000000 <4>[ 58.998804] FS: 0000000000000000(0000) GS:ffff912ad7cbf000(0063) knlGS:00000000f7f05580 <4>[ 58.998821] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 <4>[ 58.998832] CR2: 00000000f7d8f890 CR3: 000000010124e000 CR4: 00000000000006f0 <4>[ 58.998915] Call Trace: <4>[ 58.999010] <TASK> <4>[ 58.999077] ? file_close_fd+0x32/0x60 <4>[ 58.999147] __ia32_sys_close+0x18/0x90 <4>[ 58.999172] ia32_sys_call+0x1c3c/0x27e0 <4>[ 58.999183] __do_fast_syscall_32+0x79/0x1e0 <4>[ 58.999194] do_fast_syscall_32+0x37/0x80 <4>[ 58.999203] do_SYSENTER_32+0x23/0x30 <4>[ 58.999211] entry_SYSENTER_compat_after_hwframe+0x84/0x8e <4>[ 58.999254] RIP: 0023:0xf7f0c579 <4>[ 58.999459] Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 2e 8d b4 26 00 00 00 00 8d b4 26 00 00 00 <4>[ 58.999466] RSP: 002b:00000000fff98500 EFLAGS: 00000206 ORIG_RAX: 0000000000000006 <4>[ 58.999479] RAX: ffffffffffffffda RBX: 000000000000000b RCX: 0000000000000000 <4>[ 58.999484] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 <4>[ 58.999488] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 <4>[ 58.999492] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000 <4>[ 58.999497] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 <4>[ 58.999534] </TASK> <4>[ 58.999579] Modules linked in: <4>[ 58.999895] ---[ end trace 0000000000000000 ]--- <4>[ 58.999892] Oops: int3: 0000 [#2] SMP PTI <4>[ 58.999997] RIP: 0010:_raw_spin_lock+0x5/0x50 <4>[ 59.000008] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15 12 e4 fe <4>[ 59.000010] CPU: 1 UID: 0 PID: 339 Comm: sh Tainted: G D 6.15.0-next-20250605 #1 PREEMPT(voluntary) <4>[ 59.000014] RSP: 0018:ffff9494007bbe98 EFLAGS: 00000246 <4>[ 59.000021] RAX: ffff912a042edd00 RBX: 000000000000000b RCX: 0000000000000000 <4>[ 59.000026] RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff912a00f2c8c0 <4>[ 59.000030] RBP: ffff9494007bbeb8 R08: 0000000000000000 R09: 0000000000000000 <4>[ 59.000040] R10: 0000000000000000 R11: 0000000000000000 R12: ffff912a00f2c8c0 <4>[ 59.000044] R13: ffff912a00f2c840 R14: 0000000000000006 R15: 0000000000000000 <4>[ 59.000049] FS: 0000000000000000(0000) GS:ffff912ad7cbf000(0063) knlGS:00000000f7f05580 <4>[ 59.000054] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 <4>[ 59.000059] CR2: 00000000f7d8f890 CR3: 000000010124e000 CR4: 00000000000006f0 <4>[ 
59.000070] Tainted: [D]=DIE <4>[ 59.000080] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 <4>[ 59.000085] RIP: 0010:_raw_spin_lock+0x5/0x50 <4>[ 59.000101] Code: 5d e9 ff 12 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f <1f> 44 00 00 55 48 89 e5 53 48 89 fb bf 01 00 00 00 e8 15 12 e4 fe <4>[ 59.000108] RSP: 0018:ffff9494000e0e88 EFLAGS: 00000097 <4>[ 59.000117] RAX: 0000000000010002 RBX: ffff912a7bd29500 RCX: ffff912a7bd2a400 <0>[ 59.000179] Kernel panic - not syncing: Fatal exception in interrupt <0>[ 60.592321] Shutting down cpus with NMI <0>[ 60.593242] Kernel Offset: 0x20800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) <0>[ 60.618536] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
## Source
- Kernel version: 6.15.0-next-20250605
- Git tree: https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next.git
- Git sha: 4f27f06ec12190c7c62c722e99ab6243dea81a94
## Build
- Test log: https://qa-reports.linaro.org/api/testruns/28675335/log_file/
- Build link: https://storage.tuxsuite.com/public/linaro/lkft/builds/2y4whKazVqJKOUFD08taH...
- Kernel config:
https://storage.tuxsuite.com/public/linaro/lkft/builds/2y4whKazVqJKOUFD08taH...
-- Linaro LKFT https://lkft.linaro.org