* Paul E. McKenney paulmck@kernel.org [230913 07:28]:
On Wed, Sep 13, 2023 at 01:01:39PM +0200, Peter Zijlstra wrote:
On Tue, Sep 12, 2023 at 08:56:47PM -0400, Liam R. Howlett wrote:
Initial booting is setting the task flag to idle (PF_IDLE) by the call path sched_init() -> init_idle(). Having the task idle and calling call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be set. Subsequent calls to any cond_resched() will enable IRQs, potentially earlier than the IRQ setup has completed. Recent changes have caused just this scenario and IRQs have been enabled early.
This causes a warning later in start_kernel() as interrupts are enabled before they are fully set up.
Fix this issue by clearing the PF_IDLE flag on return from sched_init() and restore the flag in rest_init(). Although the boot task was marked as idle since (at least) d80e4fda576d, I am not sure that it is wrong to do so. The forced context-switch on idle task was introduced in the tiny_rcu update, so I'm going to claim this fixes 5f6130fa52ee.
Link: https://lore.kernel.org/linux-mm/87v8cv22jh.fsf@mail.lhotse/ Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+Zou... Fixes: 5f6130fa52ee ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task") Cc: stable@vger.kernel.org Cc: Geert Uytterhoeven geert@linux-m68k.org Cc: "Paul E. McKenney" paulmck@kernel.org Cc: Christophe Leroy christophe.leroy@csgroup.eu Cc: Andreas Schwab schwab@linux-m68k.org Cc: Matthew Wilcox willy@infradead.org Cc: Peng Zhang zhangpeng.00@bytedance.com Cc: Peter Zijlstra peterz@infradead.org Cc: Ingo Molnar mingo@redhat.com Cc: Juri Lelli juri.lelli@redhat.com Cc: Vincent Guittot vincent.guittot@linaro.org Cc: Andrew Morton akpm@linux-foundation.org Cc: "Mike Rapoport (IBM)" rppt@kernel.org Cc: Vlastimil Babka vbabka@suse.cz Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com
init/main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/init/main.c b/init/main.c index ad920fac325c..f74772acf612 100644 --- a/init/main.c +++ b/init/main.c @@ -696,7 +696,7 @@ noinline void __ref __noreturn rest_init(void) */ rcu_read_lock(); tsk = find_task_by_pid_ns(pid, &init_pid_ns);
- tsk->flags |= PF_NO_SETAFFINITY;
- tsk->flags |= PF_NO_SETAFFINITY | PF_IDLE; set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id())); rcu_read_unlock();
@@ -938,6 +938,8 @@ void start_kernel(void) * time - but meanwhile we still have a functioning scheduler. */ sched_init();
- /* Avoid early context switch, rest_init() restores PF_IDLE */
- current->flags &= ~PF_IDLE;
if (WARN(!irqs_disabled(), "Interrupts were enabled *very* early, fixing it\n"))
Hurmph... so since this is about IRQs, would it not make sense to have the | PF_IDLE near 'early_boot_irqs_disabled = false' ?
I was thinking that this isn't an idle thread until the end of boot, so leave it set as not idle until the end.
Or, alternatively, make the tinyrcu thing check that variable?
We could do that, but do we really the decidedly non-idle early boot sequence designated as idle?
call_rcu() tiny is called more than this code, so unless there is another reason we want to check the IRQ status it seemed best to change the boot task flag. I mean, I think the is_idle_task() check is valid in every other context, right?
What surprises me is that this is just now showing up. The ingredients for this one have been in place for almost 10 years.
Am I just that lucky?
Cheers, Liam