When a newly idle CPU runs idle balance, the idle (swapper) thread has not actually been switched in yet. The current thread is still a normal task that is about to give up the CPU, which is why the CPU is trying to pull a task onto itself.
At this point, however, rq->cfs.h_nr_running still accounts for this normal task. The scheduler therefore believes the CPU has one running task on it and adds that task to the running-task sum of the schedule group.
Finally, group_has_capacity() compares the number of running tasks with the number of CPUs. If all the other CPUs in the group have genuinely running tasks, the group is considered to have no spare 'capacity', and the balancer skips migrating any misfit task from another schedule group in the same schedule domain.
This patch fixes the nr_running accounting for a newly idle CPU: when update_sg_lb_stats() examines the newly idle destination CPU, it no longer adds that CPU's running count to the schedule group's total.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)
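For illustration only, and not part of the patch: a minimal userspace sketch of the accounting problem described in the commit message. All toy_* names are hypothetical stand-ins rather than kernel APIs, and the capacity check is reduced to the task-count comparison; the in-kernel group_has_capacity() may consider more than this.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's types. */
enum toy_idle_type { TOY_NOT_IDLE, TOY_NEWLY_IDLE };

struct toy_rq {
	unsigned int h_nr_running;	/* CFS tasks accounted on this CPU */
};

struct toy_group_stats {
	unsigned int group_weight;	/* number of CPUs in the group */
	unsigned int sum_nr_running;	/* running tasks summed over the group */
};

/*
 * Sum the running tasks of a group. With 'patched' set, the newly idle
 * destination CPU is skipped, mirroring the idea of the change.
 */
static struct toy_group_stats
toy_update_stats(const struct toy_rq *rqs, size_t ncpus,
		 enum toy_idle_type idle, size_t dst_cpu, bool patched)
{
	struct toy_group_stats sgs = {
		.group_weight = (unsigned int)ncpus,
		.sum_nr_running = 0,
	};

	for (size_t i = 0; i < ncpus; i++) {
		/* The outgoing task on dst_cpu is about to release the CPU. */
		if (patched && idle == TOY_NEWLY_IDLE && i == dst_cpu)
			continue;
		sgs.sum_nr_running += rqs[i].h_nr_running;
	}
	return sgs;
}

/* Simplified capacity check: fewer running tasks than CPUs. */
static bool toy_group_has_capacity(const struct toy_group_stats *sgs)
{
	return sgs->sum_nr_running < sgs->group_weight;
}
```

With four CPUs that each report h_nr_running = 1, and CPU 0 being the newly idle destination, the unpatched sum is 4 (no capacity) while the patched sum is 3 (capacity available).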
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f5fb04f..6ebf7c7 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7154,9 +7154,22 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		sgs->group_load += load;
 		sgs->group_util += cpu_util(i);
-		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
-		nr_running = rq->nr_running;
+		/*
+		 * If destination CPU is one new idle CPU, that means current
+		 * task is occupying CPU so h_nr_running = 1 but in fact this
+		 * task is going to release CPU for idle balance.
+		 *
+		 * Here should not account this task into running number, so
+		 * give more chance for task migration onto this idle CPU.
+		 */
+		if (env->idle == CPU_NEWLY_IDLE && env->dst_cpu == i)
+			nr_running = 0;
+		else {
+			sgs->sum_nr_running += rq->cfs.h_nr_running;
+			nr_running = rq->nr_running;
+		}
+
 		if (nr_running > 1)
 			*overload = true;
 
-- 
2.7.4