On Sat, 23 Sep 2023 01:14:08 +0000
"Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:
During RCU-boost testing with the TREE03 rcutorture config, I found that after a few hours, the machine locks up.
On tracing, I found that there is a live lock between 2 CPUs. One CPU has an RT task running, while another CPU, which also has an RT task running, is being offlined. During this offlining, all threads are migrated off it. The migration thread is repeatedly scheduled to migrate the actively running tasks off the CPU being offlined. This results in a live lock because select_fallback_rq() keeps picking the CPU that an RT task is already running on, and the migrated task then just gets pushed back to the CPU being offlined.
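To make the cycle concrete, it looks roughly like this (a simplified timeline based on the description above, not an actual trace):

  CPU A (being offlined, RT task)       CPU B (running another RT task)
  -------------------------------       -------------------------------
  migration thread runs,
  select_fallback_rq() picks CPU B -->  migrated task arrives on CPU B
                                        RT push logic looks for a lower
                                        priority CPU and still considers
  task lands back on CPU A        <--   CPU A, so it pushes the task back
  offlining cannot complete,
  migration thread runs again, and the cycle repeats.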
It is pointless anyway to pick a CPU that is being offlined as a push target, since the task will only be migrated away from it again. It can also add unwanted latency for the task.
Fix these issues by having RT skip CPUs that are not 'active' for scheduling, using cpu_active_mask. Other parts of core.c already use cpu_active_mask to prevent tasks from being placed on CPUs that are going offline.
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
 kernel/sched/cpupri.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index a286e726eb4b..42c40cfdf836 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -101,6 +101,7 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
 
 	if (lowest_mask) {
 		cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
+		cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
 
 		/*
 		 * We have to ensure that we have at least one bit
What happens if the cpu_active_mask changes right here?
Is this just making the race window smaller?
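Roughly, the window I have in mind looks like this (a simplified sketch of the select-then-push sequence, not the actual code or call chain):

	cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);

	/*
	 * A CPU can be cleared from cpu_active_mask right here, after
	 * the check above but before the caller acts on lowest_mask...
	 */

	cpu = cpumask_any(lowest_mask);
	/*
	 * ...yet that CPU can still be chosen here, so the push can once
	 * again target a CPU that is on its way offline.
	 */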
Something tells me the fix is going to be a bit more involved. But as I'm getting ready for Paris, I can't look at it at the moment.
-- Steve