From: Zqiang <qiang.zhang@linux.dev>
[ Upstream commit a257e974210320ede524f340ffe16bf4bf0dda1e ]
For PREEMPT_RT=y kernels, deferred_irq_workfn() is executed in the per-CPU irq_work/* task context with interrupts enabled. If the rq returned by container_of() is the current CPU's rq, the following scenario may occur:
  lock(&rq->__lock);
  <Interrupt>
    lock(&rq->__lock);
This commit uses IRQ_WORK_INIT_HARD() instead of init_irq_work() to initialize rq->scx.deferred_irq_work, so that deferred_irq_workfn() is always invoked in hard-irq context.
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
## Comprehensive Analysis
### 1. Commit Message and Problem Analysis

The commit addresses a **critical deadlock** scenario in the `sched_ext` (BPF extensible scheduler) subsystem.

- **The Issue:** On `CONFIG_PREEMPT_RT` (Real-Time) kernels, `irq_work` items initialized with `init_irq_work()` default to running in a per-CPU thread context with interrupts enabled. The work function `deferred_irq_workfn()` acquires the runqueue lock (`raw_spin_rq_lock(rq)`). If an interrupt occurs while this lock is held, and the interrupt handler also attempts to acquire `rq->__lock` (a standard scheduler pattern), a deadlock occurs (A-A deadlock).
- **The Fix:** The commit changes the initialization of `deferred_irq_work` to use `IRQ_WORK_INIT_HARD()`. This forces the work function to execute in **hard interrupt context** (with interrupts disabled), preventing the nested interrupt scenario that causes the deadlock.
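The A-A deadlock described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `model_rq`, `model_lock()`, `model_unlock()` and `model_work_deadlocks()` are hypothetical names invented for illustration, not kernel APIs, and the "interrupt" is simulated by a nested lock attempt on the same CPU.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_rq {
	bool locked;	/* stands in for rq->__lock being held by this CPU */
};

/*
 * Model a non-recursive spinlock acquire on a single CPU.
 * Returns false when the lock is already held, which is the point
 * where a real raw spinlock would spin forever.
 */
static bool model_lock(struct model_rq *rq)
{
	if (rq->locked)
		return false;
	rq->locked = true;
	return true;
}

static void model_unlock(struct model_rq *rq)
{
	assert(rq->locked);
	rq->locked = false;
}

/*
 * Model the work function's critical section. When interrupts are
 * enabled, an interrupt whose handler takes the same lock fires while
 * the lock is held. Returns true if that handler would self-deadlock.
 */
static bool model_work_deadlocks(bool irqs_enabled)
{
	struct model_rq rq = { .locked = false };
	bool deadlock = false;

	if (!model_lock(&rq))
		return true;

	if (irqs_enabled) {
		/* <Interrupt>: the handler tries to take the lock again. */
		deadlock = !model_lock(&rq);
	}

	model_unlock(&rq);
	return deadlock;
}
```

In this model, `model_work_deadlocks(true)` corresponds to the thread-context case with interrupts enabled, and `model_work_deadlocks(false)` to the hard-irq case where the nested interrupt cannot occur.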
### 2. Deep Code Research & Verification

- **Subsystem Context:** `sched_ext` was merged in Linux v6.12. The buggy code exists in all stable kernels starting from v6.12.y up to the current v6.17.y. Older LTS kernels (6.6.y, 6.1.y) do not contain `sched_ext` and are unaffected.
- **Code Mechanics:**
  - **Buggy Code:** `init_irq_work(&rq->scx.deferred_irq_work, deferred_irq_workfn);` relies on defaults which are unsafe for this locking pattern on PREEMPT_RT.
  - **Corrected Code:** `rq->scx.deferred_irq_work = IRQ_WORK_INIT_HARD(deferred_irq_workfn);` explicitly sets the `IRQ_WORK_HARD_IRQ` flag.
- **Precedent:** This pattern is well-established in the scheduler core (e.g., `rto_push_work` in `kernel/sched/topology.c` uses `IRQ_WORK_INIT_HARD` for the exact same reason).
- **Correctness:** `deferred_irq_workfn` calls `run_deferred`, which uses `raw_spin_rq_lock`. These locks are safe to take in hard-irq context. The fix is technically sound and adheres to locking rules.
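How the flag affects the execution context can be sketched in userspace as follows. This is a simplified model of the behavior described above, not the kernel's implementation: `MODEL_WORK_HARD_IRQ`, `model_irq_work`, `model_ctx` and `model_run_context()` are hypothetical names, and the model assumes that non-RT kernels run every work item in hard-irq context.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag for this model; not the kernel's definition. */
#define MODEL_WORK_HARD_IRQ 0x1u

enum model_ctx {
	MODEL_CTX_HARDIRQ,	/* runs with interrupts disabled */
	MODEL_CTX_THREAD,	/* runs in a per-CPU thread, interrupts enabled */
};

struct model_irq_work {
	unsigned int flags;
};

/*
 * Decide where a work item runs in this model:
 * - without PREEMPT_RT, always in hard-irq context (assumption);
 * - with PREEMPT_RT, in hard-irq context only if the HARD flag is set.
 */
static enum model_ctx model_run_context(const struct model_irq_work *w,
					bool preempt_rt)
{
	assert(w);

	if (!preempt_rt)
		return MODEL_CTX_HARDIRQ;

	return (w->flags & MODEL_WORK_HARD_IRQ) ? MODEL_CTX_HARDIRQ
						: MODEL_CTX_THREAD;
}
```

Under these assumptions, a work item without the flag moves to thread context only when PREEMPT_RT is enabled, which matches why the bug is specific to RT kernels.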
### 3. Stable Kernel Rules Evaluation

- **Fixes a real bug?** **Yes.** It fixes a reproducible deadlock that causes system hangs.
- **Important issue?** **Yes.** Deadlocks are critical failures, especially on Real-Time systems where reliability is paramount.
- **Obviously correct?** **Yes.** The fix is a one-line change using a standard kernel macro specifically designed for this purpose.
- **Small and contained?** **Yes.** One line changed, no external dependencies.
- **No new features?** **Yes.** This is a pure bug fix for existing functionality.
### 4. Risk Assessment

- **Regression Risk:** **Very Low.** The change only affects the execution context of the work item. On non-RT kernels, `irq_work` often runs in hard-irq context anyway, so the behavior change is minimal. On RT kernels, this restores the required behavior to prevent deadlocks.
- **User Impact:** Users running `sched_ext` on Real-Time kernels are at risk of random system freezes without this fix.
### Conclusion

This commit is a textbook example of stable material. It fixes a severe bug (deadlock) in a supported feature (`sched_ext`) using a minimal, well-understood solution. While it lacks a "Cc: stable" tag, the nature of the bug (deadlock) and the surgical nature of the fix make it a mandatory backport for all stable trees containing `sched_ext` (v6.12+).
**YES**
 kernel/sched/ext.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index e1b502ef1243c..fa64fdb6e9796 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5280,7 +5280,7 @@ void __init init_sched_ext_class(void)
 		BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_kick_if_idle, GFP_KERNEL, n));
 		BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_preempt, GFP_KERNEL, n));
 		BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_wait, GFP_KERNEL, n));
-		init_irq_work(&rq->scx.deferred_irq_work, deferred_irq_workfn);
+		rq->scx.deferred_irq_work = IRQ_WORK_INIT_HARD(deferred_irq_workfn);
 		init_irq_work(&rq->scx.kick_cpus_irq_work, kick_cpus_irq_workfn);
 
 		if (cpu_online(cpu))