The blocked load and shares of root cfs_rqs are currently only updated by the CPU owning the rq. That means if a CPU suddenly goes from being busy to totally idle, its load and shares are not updated.
Schedutil works around this problem by ignoring the util of CPUs that were last updated more than a tick ago. However, the stale load does impact task placement: code paths that look at load and util (in particular the slow path of select_task_rq_fair) can leave the idle CPUs unused while other CPUs become unnecessarily overloaded. Furthermore, the stale shares can impact CPU time allotment.
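To illustrate the schedutil workaround described above, here is a minimal userspace sketch of a "skip anything older than one tick" filter. It is not the kernel code: the struct, the function name max_fresh_util() and the TICK_NS value are all hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tick length: 4 ms, as with HZ=250. The real value depends
 * on the kernel configuration. */
#define TICK_NS 4000000ULL

/* Hypothetical per-CPU record: the util value and when it was last refreshed. */
struct cpu_util_sample {
	uint64_t last_update_ns;
	unsigned long util;
};

/*
 * Return the highest util among CPUs whose sample is at most one tick old.
 * Samples older than that are treated as stale and ignored, so a CPU that
 * went idle without refreshing its util cannot inflate the result.
 */
unsigned long max_fresh_util(const struct cpu_util_sample *s, size_t n,
			     uint64_t now_ns)
{
	unsigned long max = 0;

	assert(s != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (now_ns - s[i].last_update_ns > TICK_NS)
			continue;	/* stale sample: skip it */
		if (s[i].util > max)
			max = s[i].util;
	}
	return max;
}
```

Note that a filter like this only protects the consumer that applies it. Task placement reads the load directly, which is why the stale values still matter there.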
Two complementary solutions are proposed here:

1. When a task wakes up, if necessary an idle CPU is woken as if to perform a NOHZ idle balance, which is then aborted once the load of NOHZ idle CPUs has been updated. This solves the problem but brings with it extra CPU wakeups, which have an energy cost.

2. During newly-idle load balancing, the load of remote nohz-idle CPUs in the sched_domain is updated. When all of the idle CPUs were updated in that step, the nohz.next_update field is pushed further into the future. This field is used to determine the need for triggering the newly-added NOHZ kick. So if such newly-idle balances are happening often enough, no additional CPU wakeups are required to keep all the CPUs' loads updated.
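The interaction between the two solutions can be sketched with a small userspace model. This is only an illustration under stated assumptions, not the patch itself: struct nohz_model, the function names, the bitmask representation of a sched_domain span and the UPDATE_PERIOD value are all hypothetical. Only the role of next_update mirrors the nohz.next_update field described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CPUS	8
#define UPDATE_PERIOD	32	/* hypothetical refresh period, in ticks */

/* Hypothetical model of the nohz bookkeeping. */
struct nohz_model {
	bool idle[MAX_CPUS];		/* CPU is currently nohz-idle */
	uint64_t last_update[MAX_CPUS];	/* when its blocked load was refreshed */
	uint64_t next_update;		/* models nohz.next_update */
};

/* Solution 1, decision step: a wakeup needs a kick once next_update is due. */
bool wakeup_needs_kick(const struct nohz_model *m, uint64_t now)
{
	return now >= m->next_update;
}

/* Solution 1, kick step: the woken CPU refreshes every nohz-idle CPU. */
void nohz_kick_update(struct nohz_model *m, uint64_t now)
{
	assert(m);
	for (int cpu = 0; cpu < MAX_CPUS; cpu++)
		if (m->idle[cpu])
			m->last_update[cpu] = now;
	m->next_update = now + UPDATE_PERIOD;
}

/*
 * Solution 2: a newly-idle balance refreshes the nohz-idle CPUs inside its
 * domain (given here as a bitmask). next_update is pushed back only if no
 * nohz-idle CPU was left outside the domain.
 */
void newidle_update(struct nohz_model *m, uint32_t domain_mask, uint64_t now)
{
	bool all_covered = true;

	assert(m);
	for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
		if (!m->idle[cpu])
			continue;
		if (domain_mask & (1u << cpu))
			m->last_update[cpu] = now;
		else
			all_covered = false;
	}
	if (all_covered)
		m->next_update = now + UPDATE_PERIOD;
}
```

In this model, as long as newidle_update() covers every idle CPU at least once per UPDATE_PERIOD, wakeup_needs_kick() keeps returning false and no extra wakeups occur.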
[eas-dev] Patch 2/3 here is to highlight a change I made from Vincent's original patch, so that it can be reviewed more easily - if the modification is accepted then I'll squash it before posting this to LKML proper.
Brendan Jackman (2):
  sched/fair: Refactor nohz blocked load updates
  sched/fair: Update blocked load from newly idle balance

Vincent Guittot (1):
  sched: force update of blocked load of idle cpus
 kernel/sched/core.c  |   1 +
 kernel/sched/fair.c  | 106 ++++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |   2 +
 3 files changed, 96 insertions(+), 13 deletions(-)

--
2.14.1