The blocked load and shares of root cfs_rqs are currently updated only by the CPU owning the rq. That means that if a CPU goes suddenly from being busy to totally idle, its load and shares are not updated.
Schedutil works around this problem by ignoring the util of CPUs that were last updated more than a tick ago. However the stale load does impact task placement: elements that look at load and util (in particular the slow path of select_task_rq_fair) can leave the idle CPUs unused while other CPUs become unnecessarily overloaded. Furthermore the stale shares can impact CPU time allotment.
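For reference, the schedutil workaround mentioned above looks roughly like this in the shared-policy path (a paraphrased sketch from memory, not part of this series; field names approximate):

	delta_ns = time - j_sg_cpu->last_update;
	if (delta_ns > TICK_NSEC) {
		/*
		 * Util was last updated over a tick ago: the CPU is
		 * probably idle with stale blocked load, so ignore it.
		 */
		j_sg_cpu->iowait_boost = 0;
		continue;
	}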
Two complementary solutions are proposed here:

1. When a task wakes up, if necessary an idle CPU is woken as if to
   perform a NOHZ idle balance, which is then aborted once the load of
   NOHZ idle CPUs has been updated. This solves the problem but brings
   with it extra CPU wakeups, which have an energy cost.

2. During newly-idle load balancing, the load of remote nohz-idle CPUs
   in the sched_domain is updated. When all of the idle CPUs were
   updated in that step, the nohz.next_update field is pushed further
   into the future. This field is used to determine the need for
   triggering the newly-added NOHZ kick. So if such newly-idle balances
   are happening often enough, no additional CPU wakeups are required
   to keep all the CPUs' loads updated.
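In rough pseudocode, the combined effect of the two mechanisms looks like this (condensed from the patches below, not a literal quote):

	/* 1. Wakeup path: kick an idle CPU to refresh stale load */
	if (nohz_kick_needed(cpu_rq(new_cpu), /*only_update=*/true))
		nohz_balancer_kick(true);	/* ILB updates load, then aborts */

	/* 2. Newly-idle balance: refresh remote load, defer the kick */
	for_each_cpu_and(cpu, sched_domain_span(sd), nohz.idle_cpus_mask)
		update_blocked_averages(cpu);
	if (cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(sd)))
		nohz.next_update = jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD);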
[eas-dev] Patch 2/3 here is to highlight a change I made from Vincent's original patch, so that it can be reviewed more easily - if the modification is accepted then I'll squash it before posting this to LKML proper.
Brendan Jackman (2):
  sched/fair: Refactor nohz blocked load updates
  sched/fair: Update blocked load from newly idle balance
Vincent Guittot (1):
  sched: force update of blocked load of idle cpus
 kernel/sched/core.c  |   1 +
 kernel/sched/fair.c  | 106 ++++++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h |   2 +
 3 files changed, 96 insertions(+), 13 deletions(-)
--
2.14.1
From: Vincent Guittot <vincent.guittot@linaro.org>
When idle, the blocked load of CPUs will be updated only when an idle load balance is triggered, which may never happen. Because of this uncertainty about when idle load balance will run, the utilization, load and shares of idle cfs_rqs can stay artificially high and steal shares and running time from the busy cfs_rqs of the task group. Add a new, light idle load balance state which ensures that blocked loads are periodically updated and decayed but does not perform any task migration.
The remote load updates are rate-limited, so that they are not performed with a shorter period than LOAD_AVG_PERIOD (i.e. the PELT half-life). This is the period after which a stale load carries a known 50% error.
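As a standalone sanity check of that 50% figure (illustrative userspace C, not kernel code; assumes the usual PELT constants of 1ms windows with y^32 = 0.5):

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		double y = pow(0.5, 1.0 / 32.0);      /* per-ms decay, y^32 = 0.5 */
		double stale = 1024.0;                /* load recorded at idle entry */
		double decayed = stale * pow(y, 32);  /* true value 32ms later */

		/* Prints "error = 50%": a load left stale for one
		 * LOAD_AVG_PERIOD is double the correctly decayed value. */
		printf("error = %.0f%%\n", 100.0 * (stale - decayed) / stale);
		return 0;
	}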
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[Switched remote update interval to use PELT half life]
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
---
 kernel/sched/fair.c  | 77 ++++++++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h |  1 +
 2 files changed, 72 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 85d1ec1c3b39..cf27fa6548af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5976,6 +5976,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 	return min_cap * 1024 < task_util(p) * capacity_margin;
 }

+static inline bool nohz_kick_needed(struct rq *rq, bool only_update);
+static void nohz_balancer_kick(bool only_update);
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6074,6 +6077,11 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	}
 	rcu_read_unlock();

+#ifdef CONFIG_NO_HZ_COMMON
+	if (nohz_kick_needed(cpu_rq(new_cpu), true))
+		nohz_balancer_kick(true);
+#endif
+
 	return new_cpu;
 }

@@ -8653,6 +8661,7 @@ static struct {
 	cpumask_var_t idle_cpus_mask;
 	atomic_t nr_cpus;
 	unsigned long next_balance;	/* in jiffy units */
+	unsigned long next_update;	/* in jiffy units */
 } nohz ____cacheline_aligned;

 static inline int find_new_ilb(void)
@@ -8670,7 +8679,7 @@ static inline int find_new_ilb(void)
  * nohz_load_balancer CPU (if there is one) otherwise fallback to any idle
  * CPU (if there is one).
  */
-static void nohz_balancer_kick(void)
+static void nohz_balancer_kick(bool only_update)
 {
 	int ilb_cpu;

@@ -8683,6 +8692,10 @@ static void nohz_balancer_kick(void)

 	if (test_and_set_bit(NOHZ_BALANCE_KICK, nohz_flags(ilb_cpu)))
 		return;
+
+	if (only_update)
+		set_bit(NOHZ_STATS_KICK, nohz_flags(ilb_cpu));
+
 	/*
 	 * Use smp_send_reschedule() instead of resched_cpu().
 	 * This way we generate a sched IPI on the target cpu which
@@ -8770,6 +8783,8 @@ void nohz_balance_enter_idle(int cpu)
 	atomic_inc(&nohz.nr_cpus);
 	set_bit(NOHZ_TICK_STOPPED, nohz_flags(cpu));
 }
+#else
+static inline void nohz_balancer_kick(bool only_update) {}
 #endif

 static DEFINE_SPINLOCK(balancing);
@@ -8803,6 +8818,17 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)

 	update_blocked_averages(cpu);

+#ifdef CONFIG_NO_HZ_COMMON
+	if ((idle == CPU_MAX_IDLE_TYPES) ||
+	    test_bit(NOHZ_STATS_KICK, nohz_flags(cpu))) {
+		/*
+		 * We only want to update blocked load and shares, which
+		 * has just been done above.
+		 */
+		return;
+	}
+#endif
+
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
 		/*
@@ -8901,6 +8927,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 {
 	int this_cpu = this_rq->cpu;
 	struct rq *rq;
+	struct sched_domain *sd;
 	int balance_cpu;
 	/* Earliest time when we have to do rebalance again */
 	unsigned long next_balance = jiffies + 60*HZ;
@@ -8910,6 +8937,32 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 	    !test_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu)))
 		goto end;

+	/*
+	 * This cpu is going to update the blocked load of idle CPUs either
+	 * before doing a rebalancing or just to keep metrics up to date. We
+	 * can safely update the next update timestamp.
+	 */
+	rcu_read_lock();
+	sd = rcu_dereference(this_rq->sd);
+	/*
+	 * Check whether there is a sched_domain available for this cpu.
+	 * The last other cpu can have been unplugged since the ILB has been
+	 * triggered and the sched_domain can now be null. The idle balance
+	 * sequence will quickly be aborted as there are no more idle CPUs.
+	 */
+	if (sd)
+		nohz.next_update = jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD);
+	rcu_read_unlock();
+
+	/*
+	 * This idle load balance softirq has been triggered only to update
+	 * the blocked load and shares of idle CPUs. Set the idle state to an
+	 * undefined state so that the load balance will be aborted just
+	 * after the update of blocked load.
+	 */
+	if (test_bit(NOHZ_STATS_KICK, nohz_flags(this_cpu)))
+		idle = CPU_MAX_IDLE_TYPES;
+
 	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
 		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
 			continue;
@@ -8936,7 +8989,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 		cpu_load_update_idle(rq);
 		rq_unlock_irq(rq, &rf);

-		rebalance_domains(rq, CPU_IDLE);
+		rebalance_domains(rq, idle);
 	}

 	if (time_after(next_balance, rq->next_balance)) {
@@ -8967,7 +9020,7 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
  * - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
  *   domain span are idle.
  */
-static inline bool nohz_kick_needed(struct rq *rq)
+static inline bool nohz_kick_needed(struct rq *rq, bool only_update)
 {
 	unsigned long now = jiffies;
 	struct sched_domain_shared *sds;
@@ -8975,7 +9028,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
 	int nr_busy, i, cpu = rq->cpu;
 	bool kick = false;

-	if (unlikely(rq->idle_balance))
+	if (unlikely(rq->idle_balance) && !only_update)
 		return false;

 	/*
@@ -8992,6 +9045,13 @@ static inline bool nohz_kick_needed(struct rq *rq)
 	if (likely(!atomic_read(&nohz.nr_cpus)))
 		return false;

+	if (only_update) {
+		if (time_before(now, nohz.next_update))
+			return false;
+		else
+			return true;
+	}
+
 	if (time_before(now, nohz.next_balance))
 		return false;

@@ -9041,6 +9101,7 @@ static inline bool nohz_kick_needed(struct rq *rq)
 }
 #else
 static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
+static inline bool nohz_kick_needed(struct rq *rq, bool only_update) { return false; }
 #endif

 /*
@@ -9063,6 +9124,9 @@ static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
 	 */
 	nohz_idle_balance(this_rq, idle);
 	rebalance_domains(this_rq, idle);
+#ifdef CONFIG_NO_HZ_COMMON
+	clear_bit(NOHZ_STATS_KICK, nohz_flags(this_rq->cpu));
+#endif
 }

 /*
@@ -9077,8 +9141,8 @@ void trigger_load_balance(struct rq *rq)
 	if (time_after_eq(jiffies, rq->next_balance))
 		raise_softirq(SCHED_SOFTIRQ);
 #ifdef CONFIG_NO_HZ_COMMON
-	if (nohz_kick_needed(rq))
-		nohz_balancer_kick();
+	if (nohz_kick_needed(rq, false))
+		nohz_balancer_kick(false);
 #endif
 }

@@ -9657,6 +9721,7 @@ __init void init_sched_fair_class(void)

 #ifdef CONFIG_NO_HZ_COMMON
 	nohz.next_balance = jiffies;
+	nohz.next_update = jiffies;
 	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
 #endif
 #endif /* SMP */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14db76cd496f..6f95ef653f73 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1978,6 +1978,7 @@ extern void cfs_bandwidth_usage_dec(void);
 enum rq_nohz_flag_bits {
 	NOHZ_TICK_STOPPED,
 	NOHZ_BALANCE_KICK,
+	NOHZ_STATS_KICK
 };

 #define nohz_flags(cpu)	(&cpu_rq(cpu)->nohz_flags)
--
2.14.1
By moving the call to update_blocked_averages outside of rebalance_domains, we can avoid the use of CPU_MAX_IDLE_TYPES as a marker for aborting the load balance.
Vincent, if you agree with this I'll squash it into the parent before
posting to LKML.
---
 kernel/sched/fair.c | 36 ++++++++++++------------------------
 1 file changed, 12 insertions(+), 24 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cf27fa6548af..9085caf49c76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8816,19 +8816,6 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 	int need_serialize, need_decay = 0;
 	u64 max_cost = 0;

-	update_blocked_averages(cpu);
-
-#ifdef CONFIG_NO_HZ_COMMON
-	if ((idle == CPU_MAX_IDLE_TYPES) ||
-	    test_bit(NOHZ_STATS_KICK, nohz_flags(cpu))) {
-		/*
-		 * We only want to update blocked load and shares, which
-		 * has just been done above.
-		 */
-		return;
-	}
-#endif
-
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
 		/*
@@ -8954,15 +8941,6 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 		nohz.next_update = jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD);
 	rcu_read_unlock();

-	/*
-	 * This idle load balance softirq has been triggered only to update
-	 * the blocked load and shares of idle CPUs. Set the idle state to an
-	 * undefined state so that the load balance will be aborted just
-	 * after the update of blocked load.
-	 */
-	if (test_bit(NOHZ_STATS_KICK, nohz_flags(this_cpu)))
-		idle = CPU_MAX_IDLE_TYPES;
-
 	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
 		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
 			continue;
@@ -8989,7 +8967,15 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 		cpu_load_update_idle(rq);
 		rq_unlock_irq(rq, &rf);

-		rebalance_domains(rq, idle);
+		update_blocked_averages(balance_cpu);
+		/*
+		 * This idle load balance softirq may have been
+		 * triggered only to update the blocked load and shares
+		 * of idle CPUs (which we have just done for
+		 * balance_cpu). In that case skip the actual balance.
+		 */
+		if (!test_bit(NOHZ_STATS_KICK, nohz_flags(this_cpu)))
+			rebalance_domains(rq, idle);
 	}

 	if (time_after(next_balance, rq->next_balance)) {
@@ -9123,7 +9109,9 @@ static __latent_entropy void run_rebalance_domains(struct softirq_action *h)
 	 * and abort nohz_idle_balance altogether if we pull some load.
 	 */
 	nohz_idle_balance(this_rq, idle);
-	rebalance_domains(this_rq, idle);
+	update_blocked_averages(this_rq->cpu);
+	if (!test_bit(NOHZ_STATS_KICK, nohz_flags(this_rq->cpu)))
+		rebalance_domains(this_rq, idle);
 #ifdef CONFIG_NO_HZ_COMMON
 	clear_bit(NOHZ_STATS_KICK, nohz_flags(this_rq->cpu));
 #endif
--
2.14.1
Hi Brendan,
On 23 October 2017 at 15:32, Brendan Jackman <brendan.jackman@arm.com> wrote:
> By moving the call to update_blocked_averages outside of
> rebalance_domains, we can avoid the use of CPU_MAX_IDLE_TYPES as a
> marker for aborting the load balance.
>
> Vincent, if you agree with this I'll squash it into the parent before
> posting to LKML.
Looks good to me
Vincent
We now have a NOHZ kick to avoid the load of idle CPUs becoming stale. This is good, but it brings about CPU wakeups, which have an energy cost. As an alternative to waking CPUs up to decay blocked load, we can sometimes do it during newly idle balance. If the newly idle balance is on a domain that covers all the currently nohz-idle CPUs, we push the value of nohz.next_update into the future. That means that if such newly idle balances happen often enough, we never need to wake up a CPU just to update load.
Since we're doing this new update inside a for_each_domain() loop, we need to avoid doing multiple updates on the same CPU in the same idle_balance(). A tick stamp is set on the rq in update_blocked_averages() as a simple way to do this. Using a simple jiffies-based timestamp, as opposed to the last_update_time of the root cfs_rq's sched_avg, means we can do this without taking the rq lock.
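To illustrate the locking point with a hypothetical helper (not part of the patch):

	/*
	 * A plain read of the jiffies stamp is safe without rq->lock
	 * because unsigned long is a single word and cannot tear. By
	 * contrast, the root cfs_rq's sched_avg.last_update_time is a
	 * u64, which can't be read atomically on 32-bit architectures
	 * without the lock or a special accessor.
	 */
	static inline bool blocked_load_updated_this_tick(struct rq *rq)
	{
		return rq->last_blocked_load_update_tick == jiffies;
	}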
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
---
 kernel/sched/core.c  |  1 +
 kernel/sched/fair.c  | 40 ++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  1 +
 3 files changed, 36 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da523a0..d8e71fd27806 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5923,6 +5923,7 @@ void __init sched_init(void)
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ_COMMON
 		rq->last_load_update_tick = jiffies;
+		rq->last_blocked_load_update_tick = jiffies;
 		rq->nohz_flags = 0;
 #endif
 #ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9085caf49c76..0eacac05b834 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7062,6 +7062,7 @@ static void update_blocked_averages(int cpu)
 		if (cfs_rq_is_decayed(cfs_rq))
 			list_del_leaf_cfs_rq(cfs_rq);
 	}
+	rq->last_blocked_load_update_tick = jiffies;
 	rq_unlock_irqrestore(rq, &rf);
 }

@@ -7121,6 +7122,7 @@ static inline void update_blocked_averages(int cpu)
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
+	rq->last_blocked_load_update_tick = jiffies;
 	rq_unlock_irqrestore(rq, &rf);
 }

@@ -7615,6 +7617,15 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
 }
 #endif /* CONFIG_NUMA_BALANCING */

+#ifdef CONFIG_NO_HZ_COMMON
+static struct {
+	cpumask_var_t idle_cpus_mask;
+	atomic_t nr_cpus;
+	unsigned long next_balance;	/* in jiffy units */
+	unsigned long next_update;	/* in jiffy units */
+} nohz ____cacheline_aligned;
+#endif
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -7633,6 +7644,29 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;

+#ifdef CONFIG_NO_HZ_COMMON
+	if (env->idle == CPU_NEWLY_IDLE) {
+		int cpu;
+
+		/* Update the stats of NOHZ idle CPUs in the sd */
+		for_each_cpu_and(cpu, sched_domain_span(env->sd),
+				 nohz.idle_cpus_mask) {
+			struct rq *rq = cpu_rq(cpu);
+
+			/* ... Unless we've already done so since the last tick */
+			if (jiffies > rq->last_blocked_load_update_tick)
+				update_blocked_averages(cpu);
+		}
+	}
+	/*
+	 * If we've just updated all of the NOHZ idle CPUs, we can push back
+	 * nohz.next_update, which will prevent an unnecessary wakeup for the
+	 * nohz stats kick.
+	 */
+	if (cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd)))
+		nohz.next_update = jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD);
+#endif
+
 	load_idx = get_sd_load_idx(env->sd, env->idle);

 	do {
@@ -8657,12 +8691,6 @@ static inline int on_null_domain(struct rq *rq)
  * needed, they will kick the idle load balancer, which then does idle
  * load balancing for all the idle CPUs.
  */
-static struct {
-	cpumask_var_t idle_cpus_mask;
-	atomic_t nr_cpus;
-	unsigned long next_balance;	/* in jiffy units */
-	unsigned long next_update;	/* in jiffy units */
-} nohz ____cacheline_aligned;

 static inline int find_new_ilb(void)
 {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6f95ef653f73..6be8938bb977 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -681,6 +681,7 @@ struct rq {
 #ifdef CONFIG_NO_HZ_COMMON
 #ifdef CONFIG_SMP
 	unsigned long last_load_update_tick;
+	unsigned long last_blocked_load_update_tick;
 #endif /* CONFIG_SMP */
 	unsigned long nohz_flags;
 #endif /* CONFIG_NO_HZ_COMMON */
--
2.14.1
Hi Brendan,
On 23/10/17 14:32, Brendan Jackman wrote:
> The blocked load and shares of root cfs_rqs are currently updated only
> by the CPU owning the rq. That means that if a CPU goes suddenly from
> being busy to totally idle, its load and shares are not updated.
>
> Schedutil works around this problem by ignoring the util of CPUs that
> were last updated more than a tick ago. However the stale load does
> impact task placement: elements that look at load and util (in
> particular the slow path of select_task_rq_fair) can leave the idle
> CPUs unused while other CPUs become unnecessarily overloaded.
> Furthermore the stale shares can impact CPU time allotment.
>
> Two complementary solutions are proposed here:
>
> 1. When a task wakes up, if necessary an idle CPU is woken as if to
>    perform a NOHZ idle balance, which is then aborted once the load of
>    NOHZ idle CPUs has been updated. This solves the problem but brings
>    with it extra CPU wakeups, which have an energy cost.
>
> 2. During newly-idle load balancing, the load of remote nohz-idle CPUs
>    in the sched_domain is updated. When all of the idle CPUs were
>    updated in that step, the nohz.next_update field is pushed further
>    into the future. This field is used to determine the need for
>    triggering the newly-added NOHZ kick. So if such newly-idle balances
>    are happening often enough, no additional CPU wakeups are required
>    to keep all the CPUs' loads updated.
>
> [eas-dev] Patch 2/3 here is to highlight a change I made from Vincent's
> original patch, so that it can be reviewed more easily - if the
> modification is accepted then I'll squash it before posting this to
> LKML proper.
I already reviewed those patches last week and I'm fine with this extra change [patch 2/3] as well, so please go ahead and post it to LKML.
--
Dietmar