Re: [PATCH] sched/core: Fix potential deadlock on rq lock

11 Sep 2025


On Thu, Sep 11, 2025 at 03:53:58PM +0200, Peter Zijlstra wrote:
...
On Thu, Sep 11, 2025 at 12:42:49PM +0000, Wang Tao wrote:
...
When CPU 1 enters the nohz_full state, the kworker on CPU 0 executes
sched_tick_remote(), takes the lock on CPU 1's rq, and can trigger the
warning WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3). Printing the warning
message requires holding the console_sem semaphore. Meanwhile, the print
task on CPU 1's rq cannot acquire console_sem, so it joins the wait queue
and enters the UNINTERRUPTIBLE state until console_sem is released.
When the task on CPU 0 releases console_sem, it wakes up the waiting task.
That wakeup goes through try_to_wake_up(), which attempts to acquire the
lock on CPU 1's rq again, resulting in a deadlock.
The triggering scenario is as follows:
CPU0								CPU1
sched_tick_remote
WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3)
report_bug							con_write
printk
console_unlock
   							do_con_write
   							console_lock
   							down(&console_sem)
   							list_add_tail(&waiter.list, &sem->wait_list);
up(&console_sem)
wake_up_q(&wake_q)
try_to_wake_up
__task_rq_lock
_raw_spin_lock
This patch fixes the issue by deferring all printk console printing
while the rq lock is held.
Fixes: d84b31313ef8 ("sched/isolation: Offload residual 1Hz scheduler tick")
Signed-off-by: Wang Tao <wangtao554@huawei.com>

I fundamentally hate that deferred thing and consider it a printk bug.
But really, if you trip that WARN, fix it and the problem goes away.

And probably it triggers a lot of false positives. An overloaded housekeeping
CPU can easily be off for 2 seconds. We should make it 30 seconds.
Thanks.
-- 
Frederic Weisbecker
SUSE Labs
