On Tue, 10 Jun 2025 11:50:30 -0400 Steven Rostedt rostedt@goodmis.org wrote:
On Tue, 10 Jun 2025 23:47:48 +0900 "Masami Hiramatsu (Google)" mhiramat@kernel.org wrote:
Maybe one possible scenario is to hit the int3 after the third step somehow (on I-cache).
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Start the third step. (remove INT3)
                                        on_each_cpu(do_sync_core)
do_sync_core(do SERIALIZE)
                                        Finish the third step.
Hit INT3 (from I-cache)
                                        Clear text_poke_array_refs[cpu0]
Start smp_text_poke_int3_handler()
I believe your analysis is the issue here. The commit that changed the ref counter from a global to per cpu didn't cause the issue, it just made the race window bigger.
Ah, OK. That seems easier to explain. Since we use the trap gate for #BP, it does not clear IF automatically. Thus there is a time window between executing the INT3 from the icache (or one already in the pipeline) and its handler disabling interrupts. If the IPI is received in that time window, this bug happens.
<CPU0>                                  <CPU1>
                                        Start smp_text_poke_batch_finish().
                                        Start the third step. (remove INT3)
Hit INT3 (from icache/pipeline)
                                        on_each_cpu(do_sync_core)
----
do_sync_core(do SERIALIZE)
----
                                        Finish the third step.
Handle #BP including CLI
                                        Clear text_poke_array_refs[cpu0]
preparing stack
Start smp_text_poke_int3_handler()
Failed to get text_poke_array_refs[cpu0]
In this case, the per-cpu text_poke_array_refs makes the time window bigger, because clearing text_poke_array_refs is faster.
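The ordering in the timeline above can be replayed in user space to check the reasoning. This is only a rough sketch, not the kernel code: it assumes the handler takes its reference with an increment-if-not-zero operation, and refs_cpu0, try_get_ref(), put_ref() and handler_gets_ref() are hypothetical names standing in for one CPU's slot of text_poke_array_refs and the paths around it. The two CPUs are collapsed into a fixed sequence of steps, so it shows the ordering problem, not real concurrency.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for CPU0's slot of text_poke_array_refs. */
static atomic_int refs_cpu0;

/* Take a reference only if the count is non-zero (assumed get pattern). */
static bool try_get_ref(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference taken by try_get_ref(). */
static void put_ref(atomic_int *ref)
{
	int old = atomic_fetch_sub(ref, 1);

	assert(old > 0);
}

/*
 * Replay the timeline as a fixed sequence of steps.
 *
 * ipi_before_cli == true:  CPU0 has already hit the stale INT3, but the
 *   sync IPI is serviced before the #BP path disables interrupts, so
 *   CPU1 finishes the third step and clears the reference first.
 * ipi_before_cli == false: the handler takes and drops its reference
 *   before CPU1 clears it.
 *
 * Returns whether the simulated int3 handler got its reference.
 */
bool handler_gets_ref(bool ipi_before_cli)
{
	bool got;

	atomic_store(&refs_cpu0, 1);		/* CPU1: patching in progress */

	if (ipi_before_cli) {
		atomic_store(&refs_cpu0, 0);	/* CPU1: clear refs[cpu0] */
		got = try_get_ref(&refs_cpu0);	/* CPU0: handler starts late */
	} else {
		got = try_get_ref(&refs_cpu0);	/* CPU0: handler starts first */
		if (got)
			put_ref(&refs_cpu0);
		atomic_store(&refs_cpu0, 0);	/* CPU1: clear refs[cpu0] */
	}
	return got;
}
```

Calling handler_gets_ref(true) corresponds to the failing case in the diagram: the INT3 has already been executed, but by the time the handler looks at the reference it is gone.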
If this is correct, flushing the cache does not matter (it can make the window smaller).
One possible solution is to send the IPI again, which ensures the current #BP handler has exited. That can make the window small enough.
Another solution is removing WARN_ONCE() from [1/2], which means we accept this scenario but avoid a catastrophic result.
Thank you,