During load balance, the scheduler evaluates the number of tasks that a group of CPUs can handle. The current method assumes that tasks have a fixed load of SCHED_LOAD_SCALE and that CPUs have a default capacity of SCHED_CAPACITY_SCALE. This assumption generates wrong decisions by creating ghost cores or by removing real ones when the original capacity of a CPU differs from the default SCHED_CAPACITY_SCALE. We no longer try to evaluate the number of available cores based on group_capacity; instead, we evaluate the usage of a group and compare it with its capacity.
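To illustrate the idea, the new check boils down to comparing a group's usage against its capacity, with imbalance_pct as the threshold. A minimal sketch (simplified from the group_is_overloaded()/group_has_capacity() helpers introduced later in this series, not the exact code):

static inline bool group_is_overloaded_sketch(unsigned long group_usage,
					      unsigned long group_capacity,
					      unsigned int imbalance_pct)
{
	/*
	 * The group cannot handle more work when the usage of its CPUs,
	 * scaled by imbalance_pct, exceeds the capacity left for CFS tasks.
	 */
	return (group_capacity * 100) < (group_usage * imbalance_pct);
}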
This patchset mainly replaces the old capacity method with a new one and keeps the policy almost unchanged, although we could certainly take advantage of this new statistic in several other places of the load balance.
The utilization_avg_contrib is based on the current implementation of load average tracking. I also have a version of utilization_avg_contrib that is based on the new implementation proposal [1], but I haven't provided the patches and results because [1] is still under review. I can provide changes on top of [1] that adapt how utilization_avg_contrib is computed to the new mechanism.
Changes since V6:
- add group usage tracking
- fix some commit messages
- minor fixes like comments and argument order

Changes since V5:
- remove patches that have been merged since v5: patches 01, 02, 03, 04, 05, 07
- update commit logs and add more details on the purpose of the patches
- fix/remove useless code with the rebase on patchset [2]
- remove capacity_orig in sched_group_capacity as it is not used
- move code into the right patch
- add some helper functions to factorize code

Changes since V4:
- rebase to manage conflicts with changes in selection of busiest group [4]

Changes since V3:
- add usage_avg_contrib statistic which sums the running time of tasks on a rq
- use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization
- fix replacement of power by capacity
- update some comments

Changes since V2:
- rebase on top of capacity renaming
- fix wake_affine statistic update
- rework nohz_kick_needed
- optimize the active migration of a task from a CPU with reduced capacity
- rename group_activity to group_utilization and remove unused total_utilization
- repair SD_PREFER_SIBLING and use it for SMT level
- reorder patchset to gather patches with same topics

Changes since V1:
- add 3 fixes
- correct some commit messages
- replace capacity computation by activity
- take into account current cpu capacity
[1] https://lkml.org/lkml/2014/7/18/110
[2] https://lkml.org/lkml/2014/7/25/589
Morten Rasmussen (1):
  sched: Track group sched_entity usage contributions

Vincent Guittot (6):
  sched: add per rq cpu_capacity_orig
  sched: move cfs task on a CPU with higher capacity
  sched: add utilization_avg_contrib
  sched: get CPU's usage statistic
  sched: replace capacity_factor by usage
  sched: add SD_PREFER_SIBLING for SMT level

 include/linux/sched.h |  21 +++-
 kernel/sched/core.c   |  15 +--
 kernel/sched/debug.c  |  12 ++-
 kernel/sched/fair.c   | 283 ++++++++++++++++++++++++++++++--------------------
 kernel/sched/sched.h  |  11 +-
 5 files changed, 209 insertions(+), 133 deletions(-)
This new field, cpu_capacity_orig, reflects the original capacity of a CPU before it is altered by rt tasks and/or IRQ.

cpu_capacity_orig will be used in several places to detect when the capacity of a CPU has been noticeably reduced, so that we can trigger a load balance to look for a CPU with better capacity. As an example, we can detect when a CPU handles a significant amount of irq (with CONFIG_IRQ_TIME_ACCOUNTING) but is seen as an idle CPU by the scheduler, whereas CPUs that are really idle are available.

In addition, this new cpu_capacity_orig will be used to evaluate the usage of a CPU by CFS tasks.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
---
 kernel/sched/core.c  | 2 +-
 kernel/sched/fair.c  | 8 +++++++-
 kernel/sched/sched.h | 1 +
 3 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c84bdc0..45ae52d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7087,7 +7087,7 @@ void __init sched_init(void) #ifdef CONFIG_SMP rq->sd = NULL; rq->rd = NULL; - rq->cpu_capacity = SCHED_CAPACITY_SCALE; + rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE; rq->post_schedule = 0; rq->active_balance = 0; rq->next_balance = jiffies; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index bd61cff..c3674da 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4089,6 +4089,11 @@ static unsigned long capacity_of(int cpu) return cpu_rq(cpu)->cpu_capacity; }
+static unsigned long capacity_orig_of(int cpu) +{ + return cpu_rq(cpu)->cpu_capacity_orig; +} + static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu); @@ -5776,6 +5781,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
capacity >>= SCHED_CAPACITY_SHIFT;
+ cpu_rq(cpu)->cpu_capacity_orig = capacity; sdg->sgc->capacity_orig = capacity;
if (sched_feat(ARCH_CAPACITY)) @@ -5837,7 +5843,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu) * Runtime updates will correct capacity_orig. */ if (unlikely(!rq->sd)) { - capacity_orig += capacity_of(cpu); + capacity_orig += capacity_orig_of(cpu); capacity += capacity_of(cpu); continue; } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 6130251..12483d8 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -579,6 +579,7 @@ struct rq { struct sched_domain *sd;
unsigned long cpu_capacity; + unsigned long cpu_capacity_orig;
unsigned char idle_balance; /* For active balancing */
If the CPU is used for handling a lot of IRQs, trigger a load balance to check whether it's worth moving its tasks to another CPU that has more capacity.

As a side note, this will not generate more spurious ilb because we already trigger an ilb if there is more than 1 busy cpu. If this cpu is the only one that has a task, we will trigger the ilb once to migrate the task.

The nohz_kick_needed function has been cleaned up a bit while adding the new test.

env.src_cpu and env.src_rq must be set unconditionally because they are used in need_active_balance, which is called even if busiest->nr_running equals 1.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 70 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 50 insertions(+), 20 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c3674da..9075dee 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5896,6 +5896,18 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group) }
/* + * Check whether the capacity of the rq has been noticeably reduced by side + * activity. The imbalance_pct is used for the threshold. + * Return true is the capacity is reduced + */ +static inline int +check_cpu_capacity(struct rq *rq, struct sched_domain *sd) +{ + return ((rq->cpu_capacity * sd->imbalance_pct) < + (rq->cpu_capacity_orig * 100)); +} + +/* * Group imbalance indicates (and tries to solve) the problem where balancing * groups is inadequate due to tsk_cpus_allowed() constraints. * @@ -6567,6 +6579,14 @@ static int need_active_balance(struct lb_env *env) */ if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu) return 1; + + /* + * The src_cpu's capacity is reduced because of other + * sched_class or IRQs, we trig an active balance to move the + * task + */ + if (check_cpu_capacity(env->src_rq, sd)) + return 1; }
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2); @@ -6668,6 +6688,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
schedstat_add(sd, lb_imbalance[idle], env.imbalance);
+ env.src_cpu = busiest->cpu; + env.src_rq = busiest; + ld_moved = 0; if (busiest->nr_running > 1) { /* @@ -6677,8 +6700,6 @@ static int load_balance(int this_cpu, struct rq *this_rq, * correctly treated as an imbalance. */ env.flags |= LBF_ALL_PINNED; - env.src_cpu = busiest->cpu; - env.src_rq = busiest; env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
more_balance: @@ -7378,10 +7399,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
/* * Current heuristic for kicking the idle load balancer in the presence - * of an idle cpu is the system. + * of an idle cpu in the system. * - This rq has more than one task. - * - At any scheduler domain level, this cpu's scheduler group has multiple - * busy cpu's exceeding the group's capacity. + * - This rq has at least one CFS task and the capacity of the CPU is + * significantly reduced because of RT tasks or IRQs. + * - At parent of LLC scheduler domain level, this cpu's scheduler group has + * multiple busy cpu. * - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler * domain span are idle. */ @@ -7391,9 +7414,10 @@ static inline int nohz_kick_needed(struct rq *rq) struct sched_domain *sd; struct sched_group_capacity *sgc; int nr_busy, cpu = rq->cpu; + bool kick = false;
if (unlikely(rq->idle_balance)) - return 0; + return false;
/* * We may be recently in ticked or tickless idle mode. At the first @@ -7407,38 +7431,44 @@ static inline int nohz_kick_needed(struct rq *rq) * balancing. */ if (likely(!atomic_read(&nohz.nr_cpus))) - return 0; + return false;
if (time_before(now, nohz.next_balance)) - return 0; + return false;
if (rq->nr_running >= 2) - goto need_kick; + return true;
rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_busy, cpu)); - if (sd) { sgc = sd->groups->sgc; nr_busy = atomic_read(&sgc->nr_busy_cpus);
- if (nr_busy > 1) - goto need_kick_unlock; + if (nr_busy > 1) { + kick = true; + goto unlock; + } + }
- sd = rcu_dereference(per_cpu(sd_asym, cpu)); + sd = rcu_dereference(rq->sd); + if (sd) { + if ((rq->cfs.h_nr_running >= 1) && + check_cpu_capacity(rq, sd)) { + kick = true; + goto unlock; + } + }
+ sd = rcu_dereference(per_cpu(sd_asym, cpu)); if (sd && (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu)) - goto need_kick_unlock; - - rcu_read_unlock(); - return 0; + kick = true;
-need_kick_unlock: +unlock: rcu_read_unlock(); -need_kick: - return 1; + return kick; } #else static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
On Tue, Oct 07, 2014 at 02:13:32PM +0200, Vincent Guittot wrote:
+++ b/kernel/sched/fair.c
@@ -5896,6 +5896,18 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
 }
 
 /*
+ * Check whether the capacity of the rq has been noticeably reduced by side
+ * activity. The imbalance_pct is used for the threshold.
+ * Return true is the capacity is reduced
+ */
+static inline int
+check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
+{
+	return ((rq->cpu_capacity * sd->imbalance_pct) <
+				(rq->cpu_capacity_orig * 100));
+}
+
+/*
  * Group imbalance indicates (and tries to solve) the problem where balancing
  * groups is inadequate due to tsk_cpus_allowed() constraints.
  *
@@ -6567,6 +6579,14 @@ static int need_active_balance(struct lb_env *env)
 		 */
 		if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
 			return 1;
+
+		/*
+		 * The src_cpu's capacity is reduced because of other
+		 * sched_class or IRQs, we trig an active balance to move the
+		 * task
+		 */
+		if (check_cpu_capacity(env->src_rq, sd))
+			return 1;
 	}
So does it make sense to first check if there's a better candidate at all? By this time we've already iterated the current SD while trying regular load balancing, so we could know this.
On 9 October 2014 13:23, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Oct 07, 2014 at 02:13:32PM +0200, Vincent Guittot wrote:
+++ b/kernel/sched/fair.c
@@ -5896,6 +5896,18 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
 }
 
 /*
+ * Check whether the capacity of the rq has been noticeably reduced by side
+ * activity. The imbalance_pct is used for the threshold.
+ * Return true is the capacity is reduced
+ */
+static inline int
+check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
+{
+	return ((rq->cpu_capacity * sd->imbalance_pct) <
+				(rq->cpu_capacity_orig * 100));
+}
+
+/*
  * Group imbalance indicates (and tries to solve) the problem where balancing
  * groups is inadequate due to tsk_cpus_allowed() constraints.
  *
@@ -6567,6 +6579,14 @@ static int need_active_balance(struct lb_env *env)
 		 */
 		if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
 			return 1;
+
+		/*
+		 * The src_cpu's capacity is reduced because of other
+		 * sched_class or IRQs, we trig an active balance to move the
+		 * task
+		 */
+		if (check_cpu_capacity(env->src_rq, sd))
+			return 1;
 	}
So does it make sense to first check if there's a better candidate at all? By this time we've already iterated the current SD while trying regular load balancing, so we could know this.
I'm not sure I completely catch your point. Normally, f_b_g and f_b_q have already looked for the best candidate when we call need_active_balance, and src_cpu has been elected. Or have I missed your point?
On Thu, Oct 09, 2014 at 04:59:36PM +0200, Vincent Guittot wrote:
On 9 October 2014 13:23, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Oct 07, 2014 at 02:13:32PM +0200, Vincent Guittot wrote:
+++ b/kernel/sched/fair.c
@@ -5896,6 +5896,18 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
 }
 
 /*
+ * Check whether the capacity of the rq has been noticeably reduced by side
+ * activity. The imbalance_pct is used for the threshold.
+ * Return true is the capacity is reduced
+ */
+static inline int
+check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
+{
+	return ((rq->cpu_capacity * sd->imbalance_pct) <
+				(rq->cpu_capacity_orig * 100));
+}
+
+/*
  * Group imbalance indicates (and tries to solve) the problem where balancing
  * groups is inadequate due to tsk_cpus_allowed() constraints.
  *
@@ -6567,6 +6579,14 @@ static int need_active_balance(struct lb_env *env)
 		 */
 		if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
 			return 1;
+
+		/*
+		 * The src_cpu's capacity is reduced because of other
+		 * sched_class or IRQs, we trig an active balance to move the
+		 * task
+		 */
+		if (check_cpu_capacity(env->src_rq, sd))
+			return 1;
 	}
So does it make sense to first check if there's a better candidate at all? By this time we've already iterated the current SD while trying regular load balancing, so we could know this.
i'm not sure to completely catch your point. Normally, f_b_g and f_b_q have already looked at the best candidate when we call need_active_balance. And src_cpu has been elected. Or i have missed your point ?
Yep you did indeed miss my point.
So I've always disliked this patch for its arbitrary nature, why unconditionally try and active balance every time there is 'some' RT/IRQ usage, it could be all CPUs are over that arbitrary threshold and we'll end up active balancing for no point.
So, since we've already iterated all CPUs in our domain back in update_sd_lb_stats() we could have computed the CFS fraction:
1024 * capacity / capacity_orig
for every cpu and collected the min/max of this. Then we can compute if src is significantly (and there I suppose we can indeed use imb) affected compared to others.
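In other words, something along these lines (purely an illustrative sketch of the suggestion; the min_cfs_frac/max_cfs_frac fields and this exact check do not exist in the posted series, and the plumbing to carry these values down to need_active_balance() is omitted):

	/* in update_sd_lb_stats(), while iterating the CPUs of the domain */
	unsigned long cfs_frac = (SCHED_CAPACITY_SCALE * capacity_of(i)) /
						capacity_orig_of(i);
	sds->min_cfs_frac = min(sds->min_cfs_frac, cfs_frac);
	sds->max_cfs_frac = max(sds->max_cfs_frac, cfs_frac);

	/* later, e.g. in need_active_balance(): only bother if src is
	 * significantly worse off than the best cpu in the domain */
	unsigned long src_frac = (SCHED_CAPACITY_SCALE * capacity_of(env->src_cpu)) /
						capacity_orig_of(env->src_cpu);
	if (src_frac * sd->imbalance_pct < sds->max_cfs_frac * 100)
		return 1;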
On 9 October 2014 17:30, Peter Zijlstra peterz@infradead.org wrote:
On Thu, Oct 09, 2014 at 04:59:36PM +0200, Vincent Guittot wrote:
On 9 October 2014 13:23, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Oct 07, 2014 at 02:13:32PM +0200, Vincent Guittot wrote:
+++ b/kernel/sched/fair.c
@@ -5896,6 +5896,18 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
 }
 
 /*
+ * Check whether the capacity of the rq has been noticeably reduced by side
+ * activity. The imbalance_pct is used for the threshold.
+ * Return true is the capacity is reduced
+ */
+static inline int
+check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
+{
+	return ((rq->cpu_capacity * sd->imbalance_pct) <
+				(rq->cpu_capacity_orig * 100));
+}
+
+/*
  * Group imbalance indicates (and tries to solve) the problem where balancing
  * groups is inadequate due to tsk_cpus_allowed() constraints.
  *
@@ -6567,6 +6579,14 @@ static int need_active_balance(struct lb_env *env)
 		 */
 		if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
 			return 1;
+
+		/*
+		 * The src_cpu's capacity is reduced because of other
+		 * sched_class or IRQs, we trig an active balance to move the
+		 * task
+		 */
+		if (check_cpu_capacity(env->src_rq, sd))
+			return 1;
 	}
So does it make sense to first check if there's a better candidate at all? By this time we've already iterated the current SD while trying regular load balancing, so we could know this.
i'm not sure to completely catch your point. Normally, f_b_g and f_b_q have already looked at the best candidate when we call need_active_balance. And src_cpu has been elected. Or i have missed your point ?
Yep you did indeed miss my point.
So I've always disliked this patch for its arbitrary nature, why unconditionally try and active balance every time there is 'some' RT/IRQ usage, it could be all CPUs are over that arbitrary threshold and we'll end up active balancing for no point.
So, since we've already iterated all CPUs in our domain back in update_sd_lb_stats() we could have computed the CFS fraction:
1024 * capacity / capacity_orig
for every cpu and collected the min/max of this. Then we can compute if src is significantly (and there I suppose we can indeed use imb) affected compared to others.
OK, so we should put an additional check in f_b_g (and probably in f_b_q too) to be sure that we jump to force_balance only if there will be a real gain from moving the task to the local group (from the point of view of the capacity available to the task).
Add new statistics which reflect the average time a task is running on the CPU and the sum of these running times for the tasks on a runqueue. The latter is named utilization_avg_contrib.

This patch is based on the usage metric that was proposed in the 1st versions of the per-entity load tracking patchset by Paul Turner <pjt@google.com> but that was removed afterwards. This version differs from the original one in that it is not linked to task_group.

The rq's utilization_avg_contrib will be used to check whether a rq is overloaded or not, instead of trying to compute how many tasks a group of CPUs can handle.

Rename runnable_avg_period into avg_period as it is now used with both runnable_avg_sum and running_avg_sum.

Add some descriptions of the variables to explain their differences.
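Roughly, the per-task metric added below is the fraction of its average period spent actually running, scaled to SCHED_LOAD_SCALE (see __update_task_entity_utilization() in the diff; the scale_load()/scale_load_down() details are omitted in this summary):

	utilization_avg_contrib = (running_avg_sum * SCHED_LOAD_SCALE) /
						(avg_period + 1);

while load_avg_contrib keeps using runnable_avg_sum scaled by the task's weight.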
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h | 21 +++++++++++++---
 kernel/sched/debug.c  |  9 ++++---
 kernel/sched/fair.c   | 70 +++++++++++++++++++++++++++++++++++++++------------
 kernel/sched/sched.h  |  8 +++++-
 4 files changed, 84 insertions(+), 24 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 18f5262..f6662b6 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1071,15 +1071,28 @@ struct load_weight { };
struct sched_avg { + u64 last_runnable_update; + s64 decay_count; + /* + * utilization_avg_contrib describes the amount of time that a + * sched_entity is running on a CPU. It is based on running_avg_sum + * and is scaled in the range [0..SCHED_LOAD_SCALE]. + * load_avg_contrib described the the amount of time that a + * sched_entity is runnable on a rq. It is based on both + * runnable_avg_sum and the weight of the task. + */ + unsigned long load_avg_contrib, utilization_avg_contrib; /* * These sums represent an infinite geometric series and so are bound * above by 1024/(1-y). Thus we only need a u32 to store them for all * choices of y < 1-2^(-32)*1024. + * running_avg_sum reflects the time that the sched_entity is + * effectively running on the CPU. + * runnable_avg_sum represents the amount of time a sched_entity is on + * a runqueue which includes the running time that is monitored by + * running_avg_sum. */ - u32 runnable_avg_sum, runnable_avg_period; - u64 last_runnable_update; - s64 decay_count; - unsigned long load_avg_contrib; + u32 runnable_avg_sum, avg_period, running_avg_sum; };
#ifdef CONFIG_SCHEDSTATS diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index ce33780..e0fbc0f 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -71,7 +71,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group if (!se) { struct sched_avg *avg = &cpu_rq(cpu)->avg; P(avg->runnable_avg_sum); - P(avg->runnable_avg_period); + P(avg->avg_period); return; }
@@ -94,7 +94,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group P(se->load.weight); #ifdef CONFIG_SMP P(se->avg.runnable_avg_sum); - P(se->avg.runnable_avg_period); + P(se->avg.avg_period); P(se->avg.load_avg_contrib); P(se->avg.decay_count); #endif @@ -214,6 +214,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq) cfs_rq->runnable_load_avg); SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg", cfs_rq->blocked_load_avg); + SEQ_printf(m, " .%-30s: %ld\n", "utilization_load_avg", + cfs_rq->utilization_load_avg); #ifdef CONFIG_FAIR_GROUP_SCHED SEQ_printf(m, " .%-30s: %ld\n", "tg_load_contrib", cfs_rq->tg_load_contrib); @@ -628,7 +630,8 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m) P(se.load.weight); #ifdef CONFIG_SMP P(se.avg.runnable_avg_sum); - P(se.avg.runnable_avg_period); + P(se.avg.running_avg_sum); + P(se.avg.avg_period); P(se.avg.load_avg_contrib); P(se.avg.decay_count); #endif diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9075dee..d6de526 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -678,8 +678,8 @@ void init_task_runnable_average(struct task_struct *p)
p->se.avg.decay_count = 0; slice = sched_slice(task_cfs_rq(p), &p->se) >> 10; - p->se.avg.runnable_avg_sum = slice; - p->se.avg.runnable_avg_period = slice; + p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice; + p->se.avg.avg_period = slice; __update_task_entity_contrib(&p->se); } #else @@ -1548,7 +1548,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period) *period = now - p->last_task_numa_placement; } else { delta = p->se.avg.runnable_avg_sum; - *period = p->se.avg.runnable_avg_period; + *period = p->se.avg.avg_period; }
p->last_sum_exec_runtime = runtime; @@ -2294,7 +2294,8 @@ static u32 __compute_runnable_contrib(u64 n) */ static __always_inline int __update_entity_runnable_avg(u64 now, struct sched_avg *sa, - int runnable) + int runnable, + int running) { u64 delta, periods; u32 runnable_contrib; @@ -2320,7 +2321,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, sa->last_runnable_update = now;
/* delta_w is the amount already accumulated against our next period */ - delta_w = sa->runnable_avg_period % 1024; + delta_w = sa->avg_period % 1024; if (delta + delta_w >= 1024) { /* period roll-over */ decayed = 1; @@ -2333,7 +2334,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now, delta_w = 1024 - delta_w; if (runnable) sa->runnable_avg_sum += delta_w; - sa->runnable_avg_period += delta_w; + if (running) + sa->running_avg_sum += delta_w; + sa->avg_period += delta_w;
delta -= delta_w;
@@ -2343,20 +2346,26 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum, periods + 1); - sa->runnable_avg_period = decay_load(sa->runnable_avg_period, + sa->running_avg_sum = decay_load(sa->running_avg_sum, + periods + 1); + sa->avg_period = decay_load(sa->avg_period, periods + 1);
/* Efficiently calculate \sum (1..n_period) 1024*y^i */ runnable_contrib = __compute_runnable_contrib(periods); if (runnable) sa->runnable_avg_sum += runnable_contrib; - sa->runnable_avg_period += runnable_contrib; + if (running) + sa->running_avg_sum += runnable_contrib; + sa->avg_period += runnable_contrib; }
/* Remainder of delta accrued against u_0` */ if (runnable) sa->runnable_avg_sum += delta; - sa->runnable_avg_period += delta; + if (running) + sa->running_avg_sum += delta; + sa->avg_period += delta;
return decayed; } @@ -2408,7 +2417,7 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
/* The fraction of a cpu used by this cfs_rq */ contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT, - sa->runnable_avg_period + 1); + sa->avg_period + 1); contrib -= cfs_rq->tg_runnable_contrib;
if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) { @@ -2461,7 +2470,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
static inline void update_rq_runnable_avg(struct rq *rq, int runnable) { - __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable); + __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable, + runnable); __update_tg_runnable_avg(&rq->avg, &rq->cfs); } #else /* CONFIG_FAIR_GROUP_SCHED */ @@ -2479,7 +2489,7 @@ static inline void __update_task_entity_contrib(struct sched_entity *se)
/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */ contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight); - contrib /= (se->avg.runnable_avg_period + 1); + contrib /= (se->avg.avg_period + 1); se->avg.load_avg_contrib = scale_load(contrib); }
@@ -2498,6 +2508,27 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se) return se->avg.load_avg_contrib - old_contrib; }
+ +static inline void __update_task_entity_utilization(struct sched_entity *se) +{ + u32 contrib; + + /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */ + contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE); + contrib /= (se->avg.avg_period + 1); + se->avg.utilization_avg_contrib = scale_load(contrib); +} + +static long __update_entity_utilization_avg_contrib(struct sched_entity *se) +{ + long old_contrib = se->avg.utilization_avg_contrib; + + if (entity_is_task(se)) + __update_task_entity_utilization(se); + + return se->avg.utilization_avg_contrib - old_contrib; +} + static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq, long load_contrib) { @@ -2514,7 +2545,7 @@ static inline void update_entity_load_avg(struct sched_entity *se, int update_cfs_rq) { struct cfs_rq *cfs_rq = cfs_rq_of(se); - long contrib_delta; + long contrib_delta, utilization_delta; u64 now;
/* @@ -2526,18 +2557,22 @@ static inline void update_entity_load_avg(struct sched_entity *se, else now = cfs_rq_clock_task(group_cfs_rq(se));
- if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq)) + if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq, + cfs_rq->curr == se)) return;
contrib_delta = __update_entity_load_avg_contrib(se); + utilization_delta = __update_entity_utilization_avg_contrib(se);
if (!update_cfs_rq) return;
- if (se->on_rq) + if (se->on_rq) { cfs_rq->runnable_load_avg += contrib_delta; - else + cfs_rq->utilization_load_avg += utilization_delta; + } else { subtract_blocked_load_contrib(cfs_rq, -contrib_delta); + } }
/* @@ -2612,6 +2647,7 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq, }
cfs_rq->runnable_load_avg += se->avg.load_avg_contrib; + cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib; /* we force update consideration on load-balancer moves */ update_cfs_rq_blocked_load(cfs_rq, !wakeup); } @@ -2630,6 +2666,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq, update_cfs_rq_blocked_load(cfs_rq, !sleep);
cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib; + cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib; if (sleep) { cfs_rq->blocked_load_avg += se->avg.load_avg_contrib; se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter); @@ -2967,6 +3004,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) */ update_stats_wait_end(cfs_rq, se); __dequeue_entity(cfs_rq, se); + update_entity_load_avg(se, 1); }
update_stats_curr_start(cfs_rq, se); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 12483d8..35eca7d 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -343,8 +343,14 @@ struct cfs_rq { * Under CFS, load is tracked on a per-entity basis and aggregated up. * This allows for the description of both thread and group usage (in * the FAIR_GROUP_SCHED case). + * runnable_load_avg is the sum of the load_avg_contrib of the + * sched_entities on the rq. + * blocked_load_avg is similar to runnable_load_avg except that its + * the blocked sched_entities on the rq. + * utilization_load_avg is the sum of the average running time of the + * sched_entities on the rq. */ - unsigned long runnable_load_avg, blocked_load_avg; + unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg; atomic64_t decay_counter; u64 last_decay; atomic_long_t removed_load;
On 07/10/14 13:13, Vincent Guittot wrote:
Add new statistics which reflect the average time a task is running on the CPU and the sum of these running time of the tasks on a runqueue. The latter is named utilization_avg_contrib.
This patch is based on the usage metric that was proposed in the 1st versions of the per-entity load tracking patchset by Paul Turner pjt@google.com but that has be removed afterward. This version differs from the original one in the sense that it's not linked to task_group.
The rq's utilization_avg_contrib will be used to check if a rq is overloaded or not instead of trying to compute how many task a group of CPUs can handle
Rename runnable_avg_period into avg_period as it is now used with both runnable_avg_sum and running_avg_sum
Add some descriptions of the variables to explain their differences
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
include/linux/sched.h | 21 +++++++++++++--- kernel/sched/debug.c | 9 ++++--- kernel/sched/fair.c | 70 +++++++++++++++++++++++++++++++++++++++------------ kernel/sched/sched.h | 8 +++++- 4 files changed, 84 insertions(+), 24 deletions(-)
[...]
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9075dee..d6de526 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -678,8 +678,8 @@ void init_task_runnable_average(struct task_struct *p)
 
 	p->se.avg.decay_count = 0;
 	slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
-	p->se.avg.runnable_avg_sum = slice;
-	p->se.avg.runnable_avg_period = slice;
+	p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice;
+	p->se.avg.avg_period = slice;
 	__update_task_entity_contrib(&p->se);
Didn't you miss a call to __update_task_entity_utilization here?
Another thing from the naming perspective: You use the term 'utilization' for 'load' (e.g. existing __update_entity_load_avg_contrib versus new __update_entity_utilization_avg_contrib) but for the __update_task_entity_contrib function, you use the term 'utilization' for 'contrib'. IMHO, kind of hard to grasp.
 }
 #else
@@ -1548,7 +1548,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
 		*period = now - p->last_task_numa_placement;
 	} else {
 		delta = p->se.avg.runnable_avg_sum;
-		*period = p->se.avg.runnable_avg_period;
+		*period = p->se.avg.avg_period;
 	}
 
 	p->last_sum_exec_runtime = runtime;
From: Morten Rasmussen morten.rasmussen@arm.com
Adds usage contribution tracking for group entities. Unlike se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group entities is the sum of se->avg.utilization_avg_contrib for all entities on the group runqueue. It is _not_ influenced in any way by the task group h_load. Hence it represents the actual cpu usage of the group, not its intended load contribution, which may differ significantly from the utilization on lightly utilized systems.
cc: Paul Turner <pjt@google.com>
cc: Ben Segall <bsegall@google.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/debug.c | 3 +++
 kernel/sched/fair.c  | 5 +++++
 2 files changed, 8 insertions(+)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index e0fbc0f..efb47ed 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -94,8 +94,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group P(se->load.weight); #ifdef CONFIG_SMP P(se->avg.runnable_avg_sum); + P(se->avg.running_avg_sum); P(se->avg.avg_period); P(se->avg.load_avg_contrib); + P(se->avg.utilization_avg_contrib); P(se->avg.decay_count); #endif #undef PN @@ -633,6 +635,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m) P(se.avg.running_avg_sum); P(se.avg.avg_period); P(se.avg.load_avg_contrib); + P(se.avg.utilization_avg_contrib); P(se.avg.decay_count); #endif P(policy); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d6de526..d3e9067 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2381,6 +2381,8 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se) return 0;
se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays); + se->avg.utilization_avg_contrib = + decay_load(se->avg.utilization_avg_contrib, decays); se->avg.decay_count = 0;
return decays; @@ -2525,6 +2527,9 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
if (entity_is_task(se)) __update_task_entity_utilization(se); + else + se->avg.utilization_avg_contrib = + group_cfs_rq(se)->utilization_load_avg;
return se->avg.utilization_avg_contrib - old_contrib; }
Vincent Guittot vincent.guittot@linaro.org writes:
From: Morten Rasmussen morten.rasmussen@arm.com
Adds usage contribution tracking for group entities. Unlike se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group entities is the sum of se->avg.utilization_avg_contrib for all entities on the group runqueue. It is _not_ influenced in any way by the task group h_load. Hence it is representing the actual cpu usage of the group, not its intended load contribution which may differ significantly from the utilization on lightly utilized systems.
Just noting that this version also has usage disappear immediately when a task blocks, although it does what you probably want on migration.
Also, group-ses don't ever use their running_avg_sum so it's kinda a waste, but I'm not sure it's worth doing anything about.
cc: Paul Turner pjt@google.com cc: Ben Segall bsegall@google.com
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com
kernel/sched/debug.c | 3 +++ kernel/sched/fair.c | 5 +++++ 2 files changed, 8 insertions(+)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e0fbc0f..efb47ed 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -94,8 +94,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->load.weight);
 #ifdef CONFIG_SMP
 	P(se->avg.runnable_avg_sum);
+	P(se->avg.running_avg_sum);
 	P(se->avg.avg_period);
 	P(se->avg.load_avg_contrib);
+	P(se->avg.utilization_avg_contrib);
 	P(se->avg.decay_count);
 #endif
 #undef PN
@@ -633,6 +635,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.avg.running_avg_sum);
 	P(se.avg.avg_period);
 	P(se.avg.load_avg_contrib);
+	P(se.avg.utilization_avg_contrib);
 	P(se.avg.decay_count);
 #endif
 	P(policy);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d6de526..d3e9067 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2381,6 +2381,8 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
 		return 0;
 
 	se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
+	se->avg.utilization_avg_contrib =
+		decay_load(se->avg.utilization_avg_contrib, decays);
 	se->avg.decay_count = 0;
 
 	return decays;
@@ -2525,6 +2527,9 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
 
 	if (entity_is_task(se))
 		__update_task_entity_utilization(se);
+	else
+		se->avg.utilization_avg_contrib =
+			group_cfs_rq(se)->utilization_load_avg;
 
 	return se->avg.utilization_avg_contrib - old_contrib;
 }
On 7 October 2014 22:15, bsegall@google.com wrote:
Vincent Guittot vincent.guittot@linaro.org writes:
From: Morten Rasmussen morten.rasmussen@arm.com
Adds usage contribution tracking for group entities. Unlike se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group entities is the sum of se->avg.utilization_avg_contrib for all entities on the group runqueue. It is _not_ influenced in any way by the task group h_load. Hence it is representing the actual cpu usage of the group, not its intended load contribution which may differ significantly from the utilization on lightly utilized systems.
Just noting that this version also has usage disappear immediately when a task blocks, although it does what you probably want on migration.
Yes. For this patchset, we prefer to stay aligned with the current implementation, which only takes runnable tasks into account, in order to cap the change in behavior.

IMHO, the use of blocked tasks in utilization_avg_contrib should be part of a dedicated patchset with dedicated non-regression tests.
Also, group-ses don't ever use their running_avg_sum so it's kinda a waste, but I'm not sure it's worth doing anything about.
On Tue, Oct 07, 2014 at 09:15:39PM +0100, bsegall@google.com wrote:
Vincent Guittot vincent.guittot@linaro.org writes:
From: Morten Rasmussen morten.rasmussen@arm.com
Adds usage contribution tracking for group entities. Unlike se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group entities is the sum of se->avg.utilization_avg_contrib for all entities on the group runqueue. It is _not_ influenced in any way by the task group h_load. Hence it is representing the actual cpu usage of the group, not its intended load contribution which may differ significantly from the utilization on lightly utilized systems.
Just noting that this version also has usage disappear immediately when a task blocks, although it does what you probably want on migration.
Yes, as it was discussed at Ksummit, we should include blocked usage. It gives a much more stable metric.
Also, group-ses don't ever use their running_avg_sum so it's kinda a waste, but I'm not sure it's worth doing anything about.
Yeah, but since we still have to maintain runnable_avg_sum for group-ses, there isn't much that can be saved, I think. Maybe avoid calling update_entity_load_avg() from set_next_entity() for group-ses, but I'm not sure how much it helps.
Monitor the usage level of each group of each sched_domain level. The usage is the amount of cpu_capacity that is currently used on a CPU or group of CPUs. We use the utilization_load_avg to evaluate the usage level of each group.
The utilization_avg_contrib only takes into account the running time, not the uArch, so utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] and reflects the running load on the CPU. We have to scale the utilization with the capacity of the CPU to get the usage of the latter. The usage can then be compared with the available capacity.

Frequency scaling invariance is not taken into account in this patchset; it will be addressed in another patchset.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d3e9067..7364ed4 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4551,6 +4551,17 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }
+static int get_cpu_usage(int cpu) +{ + unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg; + unsigned long capacity = capacity_orig_of(cpu); + + if (usage >= SCHED_LOAD_SCALE) + return capacity + 1; + + return (usage * capacity) >> SCHED_LOAD_SHIFT; +} + /* * select_task_rq_fair: Select target runqueue for the waking task in domains * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, @@ -5679,6 +5690,7 @@ struct sg_lb_stats { unsigned long sum_weighted_load; /* Weighted load of group's tasks */ unsigned long load_per_task; unsigned long group_capacity; + unsigned long group_usage; /* Total usage of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ unsigned int group_capacity_factor; unsigned int idle_cpus; @@ -6053,6 +6065,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, load = source_load(i, load_idx);
sgs->group_load += load; + sgs->group_usage += get_cpu_usage(i); sgs->sum_nr_running += rq->cfs.h_nr_running;
if (rq->nr_running > 1)
On Tue, Oct 07, 2014 at 02:13:35PM +0200, Vincent Guittot wrote:
Monitor the usage level of each group of each sched_domain level. The usage is the amount of cpu_capacity that is currently used on a CPU or group of CPUs. We use the utilization_load_avg to evaluate the usage level of each group.
The utilization_avg_contrib only takes into account the running time but not the uArch so the utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the running load on the CPU. We have to scale the utilization with the capacity of the CPU to get the usage of the latter. The usage can then be compared with the available capacity.
You say cpu_capacity, but in actual fact you use capacity_orig and fail to justify/clarify this.
The frequency scaling invariance is not taken into account in this patchset, it will be solved in another patchset
Maybe explain what the specific invariance issue is that is skipped over for now.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3e9067..7364ed4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4551,6 +4551,17 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	return target;
 }
 
+static int get_cpu_usage(int cpu)
+{
+	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+	unsigned long capacity = capacity_orig_of(cpu);
+
+	if (usage >= SCHED_LOAD_SCALE)
+		return capacity + 1;
Like Morten I'm confused by that +1 thing.
+
+	return (usage * capacity) >> SCHED_LOAD_SHIFT;
+}
A comment with that function that it returns capacity units might clarify shift confusion Morten raised the other day.
On 9 October 2014 13:36, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Oct 07, 2014 at 02:13:35PM +0200, Vincent Guittot wrote:
Monitor the usage level of each group of each sched_domain level. The usage is the amount of cpu_capacity that is currently used on a CPU or group of CPUs. We use the utilization_load_avg to evaluate the usage level of each group.
The utilization_avg_contrib only takes into account the running time but not the uArch so the utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the running load on the CPU. We have to scale the utilization with the capacity of the CPU to get the usage of the latter. The usage can then be compared with the available capacity.
You say cpu_capacity, but in actual fact you use capacity_orig and fail to justify/clarify this.
You're right, it's cpu_capacity_orig, not cpu_capacity.

cpu_capacity is the compute capacity available for CFS tasks once we have removed the capacity that is used by RT tasks.

We want to compare the utilization of the CPU (utilization_avg_contrib, which is in the range [0..SCHED_LOAD_SCALE]) with the available capacity (cpu_capacity, which is in the range [0..cpu_capacity_orig]). A utilization_avg_contrib equal to SCHED_LOAD_SCALE means that the CPU is fully utilized, so all of cpu_capacity_orig is used. So we scale utilization_avg_contrib from [0..SCHED_LOAD_SCALE] into cpu_usage in the range [0..cpu_capacity_orig].
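For example (illustrative numbers only): with cpu_capacity_orig = 640 and utilization_load_avg = 512 (half of SCHED_LOAD_SCALE = 1024), the scaled usage is 512 * 640 / 1024 = 320, which can then be compared directly with the cpu_capacity left for CFS tasks on that CPU.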
The frequency scaling invariance is not taken into account in this patchset, it will be solved in another patchset
Maybe explain what the specific invariance issue is that is skipped over for now.
OK. I can add a description of the fact that if the core runs slower, tasks will use more of the CPU's running time for the same job, so the usage of the CPU, which is based on the amount of time, will increase.
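(Illustrative numbers: a job that needs 4 ms of CPU time every 16 ms uses 25% of the CPU at full speed; at half the maximum frequency the same job takes 8 ms, so the time-based usage doubles to 50% even though the amount of work is unchanged.)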
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3e9067..7364ed4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4551,6 +4551,17 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	return target;
 }
 
+static int get_cpu_usage(int cpu)
+{
+	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+	unsigned long capacity = capacity_orig_of(cpu);
+
+	if (usage >= SCHED_LOAD_SCALE)
+		return capacity + 1;
Like Morten I'm confused by that +1 thing.
OK. The goal was to point out the erroneous case where the usage is out of range, but if it generates confusion, I can remove it.
+
+	return (usage * capacity) >> SCHED_LOAD_SHIFT;
+}
A comment with that function that it returns capacity units might clarify shift confusion Morten raised the other day.
ok. i will add a comment
On Thu, Oct 09, 2014 at 03:57:28PM +0200, Vincent Guittot wrote:
On 9 October 2014 13:36, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Oct 07, 2014 at 02:13:35PM +0200, Vincent Guittot wrote:
Monitor the usage level of each group of each sched_domain level. The usage is the amount of cpu_capacity that is currently used on a CPU or group of CPUs. We use the utilization_load_avg to evaluate the usage level of each group.
The utilization_avg_contrib only takes into account the running time but not the uArch so the utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the running load on the CPU. We have to scale the utilization with the capacity of the CPU to get the usage of the latter. The usage can then be compared with the available capacity.
You say cpu_capacity, but in actual fact you use capacity_orig and fail to justify/clarify this.
you're right it's cpu_capacity_orig no cpu_capacity
cpu_capacity is the compute capacity available for CFS task once we have removed the capacity that is used by RT tasks.
But why? When you compare the sum of usage against the capacity, you want matching units. Otherwise your usage will far exceed capacity in the presence of RT tasks; that doesn't seem to make sense to me.
We want to compare the utilization of the CPU (utilization_avg_contrib which is in the range [0..SCHED_LOAD_SCALE]) with available capacity (cpu_capacity which is in the range [0..cpu_capacity_orig]) An utilization_avg_contrib equals to SCHED_LOAD_SCALE means that the CPU is fully utilized so all cpu_capacity_orig are used. so we scale the utilization_avg_contrib from [0..SCHED_LOAD_SCALE] into cpu_usage in the range [0..cpu_capacity_orig]
Right, I got that, the usage thing converts from 'utilization' to fraction of capacity, so that we can then compare it against the total capacity.
But if, as per the above, we were to use the same capacity for both sides, it's a pointless factor and we could've immediately compared the 'utilization' number against its unit.
+static int get_cpu_usage(int cpu)
+{
+	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+	unsigned long capacity = capacity_orig_of(cpu);
+
+	if (usage >= SCHED_LOAD_SCALE)
+		return capacity + 1;
Like Morten I'm confused by that +1 thing.
ok. the goal was to point out the erroneous case where usage is out of the range but if it generates confusion, it can remove it
Well, the fact that you clip makes that point, returning a value outside of the specified range doesn't.
On 9 October 2014 17:12, Peter Zijlstra peterz@infradead.org wrote:
+static int get_cpu_usage(int cpu)
+{
+	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+	unsigned long capacity = capacity_orig_of(cpu);
+
+	if (usage >= SCHED_LOAD_SCALE)
+		return capacity + 1;
Like Morten I'm confused by that +1 thing.
ok. the goal was to point out the erroneous case where usage is out of the range but if it generates confusion, it can remove it
Well, the fact that you clip makes that point, returning a value outside of the specified range doesn't.
I meant removing the +1.
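i.e. presumably something like this (just a sketch of the outcome of this exchange, not code from the posted series):

static int get_cpu_usage(int cpu)
{
	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
	unsigned long capacity = capacity_orig_of(cpu);

	/* clip the usage to the original capacity of the CPU */
	if (usage >= SCHED_LOAD_SCALE)
		return capacity;

	/* returns a value in capacity units, i.e. in [0..capacity_orig] */
	return (usage * capacity) >> SCHED_LOAD_SHIFT;
}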
The scheduler tries to compute how many tasks a group of CPUs can handle by assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is SCHED_CAPACITY_SCALE, but capacity_factor hardly works for SMT systems: it sometimes works for big cores but fails to do the right thing for little cores.

Below are two examples to illustrate the problem that this patch solves:

1 - capacity_factor makes the assumption that the max capacity of a CPU is SCHED_CAPACITY_SCALE and that the load of a thread is always SCHED_LOAD_SCALE. It compares the output of these figures with the sum of nr_running to decide if a group is overloaded or not.

But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE (640 as an example), a group of 3 CPUs will have a max capacity_factor of 2 (div_round_closest(3*640/1024) = 2), which means that it will be seen as overloaded if we have only one task per CPU.

2 - Then, if the default capacity of a CPU is greater than SCHED_CAPACITY_SCALE (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4 (at max, and thanks to the fix [0] for SMT systems that prevents the appearance of ghost CPUs) but if one CPU is fully used by rt tasks (and its capacity is reduced to nearly nothing), the capacity_factor of the group will still be 4 (div_round_closest(3*1512/1024) = 5, which is capped to 4 with [0]).

So, this patch tries to solve this issue by removing capacity_factor and replacing it with the 2 following metrics:
- The available CPU capacity for CFS tasks, which is already used by load_balance.
- The usage of the CPU by the CFS tasks. For the latter, I have re-introduced the utilization_avg_contrib, which is used to compute the usage of a CPU by CFS tasks.

Finally, sched_group->sched_group_capacity->capacity_orig has been removed because it is no longer used during load balance.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  |  12 -----
 kernel/sched/fair.c  | 131 ++++++++++++++++++++------------------------------
 kernel/sched/sched.h |   2 +-
 3 files changed, 51 insertions(+), 94 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 45ae52d..37fb92c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5373,17 +5373,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level, break; }
- /* - * Even though we initialize ->capacity to something semi-sane, - * we leave capacity_orig unset. This allows us to detect if - * domain iteration is still funny without causing /0 traps. - */ - if (!group->sgc->capacity_orig) { - printk(KERN_CONT "\n"); - printk(KERN_ERR "ERROR: domain->cpu_capacity not set\n"); - break; - } - if (!cpumask_weight(sched_group_cpus(group))) { printk(KERN_CONT "\n"); printk(KERN_ERR "ERROR: empty group\n"); @@ -5868,7 +5857,6 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) * die on a /0 trap. */ sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span); - sg->sgc->capacity_orig = sg->sgc->capacity;
/* * Make sure the first group of this domain contains the diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7364ed4..7ee71d8 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5692,11 +5692,10 @@ struct sg_lb_stats { unsigned long group_capacity; unsigned long group_usage; /* Total usage of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ - unsigned int group_capacity_factor; unsigned int idle_cpus; unsigned int group_weight; enum group_type group_type; - int group_has_free_capacity; + int group_no_capacity; #ifdef CONFIG_NUMA_BALANCING unsigned int nr_numa_running; unsigned int nr_preferred_running; @@ -5837,7 +5836,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu) capacity >>= SCHED_CAPACITY_SHIFT;
cpu_rq(cpu)->cpu_capacity_orig = capacity; - sdg->sgc->capacity_orig = capacity;
if (sched_feat(ARCH_CAPACITY)) capacity *= arch_scale_freq_capacity(sd, cpu); @@ -5860,7 +5858,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu) { struct sched_domain *child = sd->child; struct sched_group *group, *sdg = sd->groups; - unsigned long capacity, capacity_orig; + unsigned long capacity; unsigned long interval;
interval = msecs_to_jiffies(sd->balance_interval); @@ -5872,7 +5870,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu) return; }
- capacity_orig = capacity = 0; + capacity = 0;
if (child->flags & SD_OVERLAP) { /* @@ -5892,19 +5890,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu) * Use capacity_of(), which is set irrespective of domains * in update_cpu_capacity(). * - * This avoids capacity/capacity_orig from being 0 and + * This avoids capacity from being 0 and * causing divide-by-zero issues on boot. - * - * Runtime updates will correct capacity_orig. */ if (unlikely(!rq->sd)) { - capacity_orig += capacity_orig_of(cpu); capacity += capacity_of(cpu); continue; }
sgc = rq->sd->groups->sgc; - capacity_orig += sgc->capacity_orig; capacity += sgc->capacity; } } else { @@ -5915,42 +5909,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
group = child->groups; do { - capacity_orig += group->sgc->capacity_orig; capacity += group->sgc->capacity; group = group->next; } while (group != child->groups); }
- sdg->sgc->capacity_orig = capacity_orig; sdg->sgc->capacity = capacity; }
 /*
- * Try and fix up capacity for tiny siblings, this is needed when
- * things like SD_ASYM_PACKING need f_b_g to select another sibling
- * which on its own isn't powerful enough.
- *
- * See update_sd_pick_busiest() and check_asym_packing().
- */
-static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
-{
-	/*
-	 * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
-	 */
-	if (!(sd->flags & SD_SHARE_CPUCAPACITY))
-		return 0;
-
-	/*
-	 * If ~90% of the cpu_capacity is still there, we're good.
-	 */
-	if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
-		return 1;
-
-	return 0;
-}
-
-/*
  * Check whether the capacity of the rq has been noticeably reduced by side
  * activity. The imbalance_pct is used for the threshold.
  * Return true is the capacity is reduced
@@ -5996,38 +5963,37 @@ static inline int sg_imbalanced(struct sched_group *group)
 	return group->sgc->imbalance;
 }
 
-/*
- * Compute the group capacity factor.
- *
- * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit unit capacity with that.
- */
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
+static inline bool
+group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
 {
-	unsigned int capacity_factor, smt, cpus;
-	unsigned int capacity, capacity_orig;
+	if ((sgs->group_capacity * 100) >
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
 
-	capacity = group->sgc->capacity;
-	capacity_orig = group->sgc->capacity_orig;
-	cpus = group->group_weight;
+	if (sgs->sum_nr_running < sgs->group_weight)
+		return true;
 
-	/* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
-	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
-	capacity_factor = cpus / smt; /* cores */
+	return false;
+}
 
-	capacity_factor = min_t(unsigned,
-		capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
-	if (!capacity_factor)
-		capacity_factor = fix_small_capacity(env->sd, group);
+static inline bool
+group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
+{
+	if (sgs->sum_nr_running <= sgs->group_weight)
+		return false;
+
+	if ((sgs->group_capacity * 100) <
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
 
-	return capacity_factor;
+	return false;
 }
 
-static enum group_type
-group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
+static enum group_type group_classify(struct lb_env *env,
+				struct sched_group *group,
+				struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_nr_running > sgs->group_capacity_factor)
+	if (sgs->group_no_capacity)
 		return group_overloaded;
 
 	if (sg_imbalanced(group))
@@ -6088,11 +6054,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
 	sgs->group_weight = group->group_weight;
-	sgs->group_capacity_factor = sg_capacity_factor(env, group);
-	sgs->group_type = group_classify(group, sgs);
 
-	if (sgs->group_capacity_factor > sgs->sum_nr_running)
-		sgs->group_has_free_capacity = 1;
+	sgs->group_no_capacity = group_is_overloaded(env, sgs);
+	sgs->group_type = group_classify(env, group, sgs);
 }
 
 /**
@@ -6214,17 +6178,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	/*
 	 * In case the child domain prefers tasks go to siblings
-	 * first, lower the sg capacity factor to one so that we'll try
+	 * first, lower the sg capacity to one so that we'll try
 	 * and move all the excess tasks away. We lower the capacity
 	 * of a group only if the local group has the capacity to fit
-	 * these excess tasks, i.e. nr_running < group_capacity_factor. The
+	 * these excess tasks, i.e. group_capacity > 0. The
 	 * extra check prevents the case where you always pull from the
 	 * heaviest group when it is already under-utilized (possible
 	 * with a large weight task outweighs the tasks on the system).
 	 */
 	if (prefer_sibling && sds->local &&
-	    sds->local_stat.group_has_free_capacity)
-		sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
+	    group_has_capacity(env, &sds->local_stat)) {
+		if (sgs->sum_nr_running > 1)
+			sgs->group_no_capacity = 1;
+		sgs->group_capacity = min(sgs->group_capacity,
+					SCHED_CAPACITY_SCALE);
+	}
 
 	if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 		sds->busiest = sg;
@@ -6403,11 +6371,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type == group_overloaded) {
-		load_above_capacity =
-			(busiest->sum_nr_running - busiest->group_capacity_factor);
-
-		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
-		load_above_capacity /= busiest->group_capacity;
+		load_above_capacity = busiest->sum_nr_running *
+					SCHED_LOAD_SCALE;
+		if (load_above_capacity > busiest->group_capacity)
+			load_above_capacity -= busiest->group_capacity;
+		else
+			load_above_capacity = ~0UL;
 	}
 
 	/*
@@ -6470,6 +6439,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
+	/* ASYM feature bypasses nice load balance check */
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
 		return sds.busiest;
@@ -6490,8 +6460,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
-	    !busiest->group_has_free_capacity)
+	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
+	    busiest->group_no_capacity)
 		goto force_balance;
 
 	/*
@@ -6550,7 +6520,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 	int i;
 
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
-		unsigned long capacity, capacity_factor, wl;
+		unsigned long capacity, wl;
 		enum fbq_type rt;
 
 		rq = cpu_rq(i);
@@ -6579,9 +6549,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			continue;
 
 		capacity = capacity_of(i);
-		capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
-		if (!capacity_factor)
-			capacity_factor = fix_small_capacity(env->sd, group);
 
 		wl = weighted_cpuload(i);
 
@@ -6589,7 +6556,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu capacity.
 		 */
-		if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
+
+		if (rq->nr_running == 1 && wl > env->imbalance &&
+		    !check_cpu_capacity(rq, env->sd))
 			continue;
 
 		/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 35eca7d..7963981 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -759,7 +759,7 @@ struct sched_group_capacity {
 	 * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
 	 * for a single CPU.
 	 */
-	unsigned int capacity, capacity_orig;
+	unsigned int capacity;
 	unsigned long next_update;
 	int imbalance; /* XXX unrelated to capacity but shared group state */
 
 	/*
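Before the review discussion below, it may help to see the two new predicates in isolation. The following is a small stand-alone sketch (ordinary userspace C, not kernel code: sg_lb_stats is reduced to the four fields the tests read, and imbalance_pct is passed in directly, here assumed to be 117 as at the MC level). It shows that the two tests deliberately leave a band where a group is neither "has capacity" nor "overloaded", so the group state does not flip at exactly 100% usage:

#include <stdio.h>
#include <stdbool.h>

struct stats {
	unsigned long group_capacity;	/* sum of the CPUs' capacities */
	unsigned long group_usage;	/* sum of the CPUs' usage */
	unsigned int sum_nr_running;
	unsigned int group_weight;	/* number of CPUs in the group */
};

/* Same comparisons as the new helpers in the patch above. */
static bool group_has_capacity(struct stats *s, unsigned int imbalance_pct)
{
	if (s->group_capacity * 100 > s->group_usage * imbalance_pct)
		return true;

	if (s->sum_nr_running < s->group_weight)
		return true;

	return false;
}

static bool group_is_overloaded(struct stats *s, unsigned int imbalance_pct)
{
	if (s->sum_nr_running <= s->group_weight)
		return false;

	if (s->group_capacity * 100 < s->group_usage * imbalance_pct)
		return true;

	return false;
}

int main(void)
{
	/* 2 CPUs, 2 running tasks, usage just below the capacity:
	 * neither predicate fires, so the group sits in the middle band. */
	struct stats s = {
		.group_capacity	= 2048,
		.group_usage	= 2000,
		.sum_nr_running	= 2,
		.group_weight	= 2,
	};

	printf("has_capacity=%d overloaded=%d\n",
	       group_has_capacity(&s, 117),
	       group_is_overloaded(&s, 117));	/* prints 0 0 */
	return 0;
}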
On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
+static inline bool
+group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
+{
+	if ((sgs->group_capacity * 100) >
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
Why the imb_pct there? We're looking for 100% utilization, not 130 or whatnot, right?
+	if (sgs->sum_nr_running < sgs->group_weight)
+		return true;
With the code as it stands, this is the cheaper test (no mults) so why is it second?
+	return false;
+}
+
+static inline bool
+group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
+{
+	if (sgs->sum_nr_running <= sgs->group_weight)
+		return false;
+
+	if ((sgs->group_capacity * 100) <
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
+
+	return false;
 }
Same thing here wrt imb_pct
On 9 October 2014 14:16, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
+static inline bool
+group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
+{
+	if ((sgs->group_capacity * 100) >
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
Why the imb_pct there? We're looking for 100% utilization, not 130 or whatnot, right?
Having exactly 100% is quite difficult because of various rounding, so I have added a margin/threshold to prevent any excessive change of the state. I just use the same margin/threshold as in other places of the load balance.

So the current threshold depends on the sched_domain level; it's around 14% at the MC level.
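Spelling that margin out (taking imbalance_pct = 117, consistent with the ~14% figure above, and a group_capacity of 1024): group_has_capacity() holds while

	group_capacity * 100 > group_usage * imbalance_pct
	1024 * 100 = 102400 > group_usage * 117
	group_usage < 102400 / 117 ~= 875

so the group stops being treated as having spare capacity once its usage goes above roughly 85% of its capacity, i.e. a margin of about 14-15%.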
On Thu, Oct 09, 2014 at 04:18:02PM +0200, Vincent Guittot wrote:
On 9 October 2014 14:16, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
+static inline bool
+group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
+{
+	if ((sgs->group_capacity * 100) >
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
Why the imb_pct there? We're looking for 100% utilization, not 130 or whatnot, right?
Having exactly 100% is quite difficult because of various rounding. So i have added a margin/threshold to prevent any excessive change of the state. I have just to use the same margin/threshold than in other place in load balance.
Yet you failed to mention this anywhere. Also does it really matter?
On 9 October 2014 17:18, Peter Zijlstra peterz@infradead.org wrote:
On Thu, Oct 09, 2014 at 04:18:02PM +0200, Vincent Guittot wrote:
On 9 October 2014 14:16, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
+static inline bool
+group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
+{
+	if ((sgs->group_capacity * 100) >
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
Why the imb_pct there? We're looking for 100% utilization, not 130 or whatnot, right?
Having exactly 100% is quite difficult because of various rounding. So i have added a margin/threshold to prevent any excessive change of the state. I have just to use the same margin/threshold than in other place in load balance.
Yet you failed to mention this anywhere. Also does it really matter?
Yes, I think it matters because it gives a more stable view of the "overloaded" and "has free capacity" states of the CPU. One additional point is that the imbalance_pct ensures that a cpu/group will not be seen as having capacity if its available capacity is only 1-5%, which would generate spurious task migrations.
I will add these details in the commit log and in a comment in the code
Hi Vincent, On 10/9/14, 10:18 PM, Vincent Guittot wrote:
On 9 October 2014 14:16, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
+static inline bool
+group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
+{
+	if ((sgs->group_capacity * 100) >
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
Why the imb_pct there? We're looking for 100% utilization, not 130 or whatnot, right?
Having exactly 100% is quite difficult because of various rounding.
Could you give some examples about the various rounding?
Regards, Wanpeng Li
On 23 November 2014 at 02:03, Wanpeng Li kernellwp@gmail.com wrote:
Hi Vincent, On 10/9/14, 10:18 PM, Vincent Guittot wrote:
On 9 October 2014 14:16, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
+static inline bool
+group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
+{
+	if ((sgs->group_capacity * 100) >
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
Why the imb_pct there? We're looking for 100% utilization, not 130 or whatnot, right?
Having exactly 100% is quite difficult because of various rounding.
Could you give some examples about the various rounding?
As an example, while I was monitoring the runnable_load_sum and the runnable_load_period, they were varying a bit around the max value but were not always the same, which can change the load_avg_contrib by a few units.
Regards, Vincent
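A rough stand-alone illustration of that rounding effect (this only assumes the contribution is scaled as sum * 1024 / (period + 1), in the spirit of the per-entity tracking, with 47742 as the saturated value of the decayed sums; the exact kernel arithmetic differs in its details):

#include <stdio.h>

int main(void)
{
	unsigned long period = 47742;			/* saturated decayed period */
	unsigned long sums[] = { 47742, 47700, 47600 };	/* small jitter while monitoring */
	int i;

	for (i = 0; i < 3; i++)
		printf("contrib = %lu\n", sums[i] * 1024 / (period + 1));

	/* prints 1023, 1023, 1020: the result hovers a few units below
	 * 1024, so a strict "usage == capacity" test would almost never
	 * be true, hence the need for a margin. */
	return 0;
}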
On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
@@ -6214,17 +6178,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	/*
 	 * In case the child domain prefers tasks go to siblings
+	 * first, lower the sg capacity to one so that we'll try
 	 * and move all the excess tasks away. We lower the capacity
 	 * of a group only if the local group has the capacity to fit
+	 * these excess tasks, i.e. group_capacity > 0. The
 	 * extra check prevents the case where you always pull from the
 	 * heaviest group when it is already under-utilized (possible
 	 * with a large weight task outweighs the tasks on the system).
 	 */
 	if (prefer_sibling && sds->local &&
+	    group_has_capacity(env, &sds->local_stat)) {
+		if (sgs->sum_nr_running > 1)
+			sgs->group_no_capacity = 1;
+		sgs->group_capacity = min(sgs->group_capacity,
+					SCHED_CAPACITY_SCALE);
+	}

 	if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 		sds->busiest = sg;

@@ -6490,8 +6460,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;

 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
-	    !busiest->group_has_free_capacity)
+	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
+	    busiest->group_no_capacity)
 		goto force_balance;

 	/*
This is two calls to group_has_capacity() on the local group. Why not compute once in update_sd_lb_stats()?
On 9 October 2014 16:16, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
@@ -6214,17 +6178,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	/*
 	 * In case the child domain prefers tasks go to siblings
+	 * first, lower the sg capacity to one so that we'll try
 	 * and move all the excess tasks away. We lower the capacity
 	 * of a group only if the local group has the capacity to fit
+	 * these excess tasks, i.e. group_capacity > 0. The
 	 * extra check prevents the case where you always pull from the
 	 * heaviest group when it is already under-utilized (possible
 	 * with a large weight task outweighs the tasks on the system).
 	 */
 	if (prefer_sibling && sds->local &&
+	    group_has_capacity(env, &sds->local_stat)) {
+		if (sgs->sum_nr_running > 1)
+			sgs->group_no_capacity = 1;
+		sgs->group_capacity = min(sgs->group_capacity,
+					SCHED_CAPACITY_SCALE);
+	}

 	if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 		sds->busiest = sg;

@@ -6490,8 +6460,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;

 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
-	    !busiest->group_has_free_capacity)
+	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
+	    busiest->group_no_capacity)
 		goto force_balance;

 	/*
This is two calls to group_has_capacity() on the local group. Why not compute once in update_sd_lb_stats()?
Mainly because of its place in the code: it is not always used during load balance, unlike group_no_capacity.

Now that I have said that, I just noticed that it's better to move the call to the last tested condition:
+	if (env->idle == CPU_NEWLY_IDLE && busiest->group_no_capacity &&
+	    group_has_capacity(env, local))
On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
@@ -6214,17 +6178,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	/*
 	 * In case the child domain prefers tasks go to siblings
-	 * first, lower the sg capacity factor to one so that we'll try
+	 * first, lower the sg capacity to one so that we'll try
 	 * and move all the excess tasks away. We lower the capacity
 	 * of a group only if the local group has the capacity to fit
-	 * these excess tasks, i.e. nr_running < group_capacity_factor. The
+	 * these excess tasks, i.e. group_capacity > 0. The
 	 * extra check prevents the case where you always pull from the
 	 * heaviest group when it is already under-utilized (possible
 	 * with a large weight task outweighs the tasks on the system).
 	 */
 	if (prefer_sibling && sds->local &&
-	    sds->local_stat.group_has_free_capacity)
-		sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
+	    group_has_capacity(env, &sds->local_stat)) {
+		if (sgs->sum_nr_running > 1)
+			sgs->group_no_capacity = 1;
+		sgs->group_capacity = min(sgs->group_capacity,
+					SCHED_CAPACITY_SCALE);
+	}

 	if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 		sds->busiest = sg;
So this is your PREFER_SIBLING implementation, why is this a good one?
That is, the current PREFER_SIBLING works because we account against nr_running, and setting it to 1 makes 2 tasks too much and we end up moving stuff away.
But if I understand things right, we're now measuring tasks in 'utilization' against group_capacity, so setting group_capacity to CAPACITY_SCALE means we can end up with many tasks on one cpu before we move over to another group, right?
So I think that for 'idle' systems we want to do the nr_running/work-conserving thing -- get as many cpus running 'something' and avoid queueing like the plague.
Then when there's some queueing, we want to go do the utilization thing, basically minimize queueing by leveling utilization.
Once all cpus are fully utilized, we switch to fair/load based balancing and try and get equal load on cpus.
Does that make sense?
If so, how about adding a group_type and splitting group_other into say group_idle and group_util:
enum group_type {
	group_idle = 0,
	group_util,
	group_imbalanced,
	group_overloaded,
}
we change group_classify() into something like:
	if (sgs->group_usage > sgs->group_capacity)
		return group_overloaded;

	if (sg_imbalanced(group))
		return group_imbalanced;

	if (sgs->nr_running < sgs->weight)
		return group_idle;

	return group_util;
And then have update_sd_pick_busiest() something like:
	if (sgs->group_type > busiest->group_type)
		return true;

	if (sgs->group_type < busiest->group_type)
		return false;

	switch (sgs->group_type) {
	case group_idle:
		if (sgs->nr_running < busiest->nr_running)
			return false;
		break;

	case group_util:
		if (sgs->group_usage < busiest->group_usage)
			return false;
		break;

	default:
		if (sgs->avg_load < busiest->avg_load)
			return false;
		break;
	}
....
And then some calculate_imbalance() magic to complete it..
If we have that, we can play tricks with the exact busiest condition in update_sd_pick_busiest() to implement PREFER_SIBLING or so.
Makes sense?
On 9 October 2014 16:58, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Oct 07, 2014 at 02:13:36PM +0200, Vincent Guittot wrote:
@@ -6214,17 +6178,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	/*
 	 * In case the child domain prefers tasks go to siblings
-	 * first, lower the sg capacity factor to one so that we'll try
+	 * first, lower the sg capacity to one so that we'll try
 	 * and move all the excess tasks away. We lower the capacity
 	 * of a group only if the local group has the capacity to fit
-	 * these excess tasks, i.e. nr_running < group_capacity_factor. The
+	 * these excess tasks, i.e. group_capacity > 0. The
 	 * extra check prevents the case where you always pull from the
 	 * heaviest group when it is already under-utilized (possible
 	 * with a large weight task outweighs the tasks on the system).
 	 */
 	if (prefer_sibling && sds->local &&
-	    sds->local_stat.group_has_free_capacity)
-		sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
+	    group_has_capacity(env, &sds->local_stat)) {
+		if (sgs->sum_nr_running > 1)
+			sgs->group_no_capacity = 1;
+		sgs->group_capacity = min(sgs->group_capacity,
+					SCHED_CAPACITY_SCALE);
+	}

 	if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 		sds->busiest = sg;
So this is your PREFER_SIBLING implementation, why is this a good one?
That is, the current PREFER_SIBLING works because we account against nr_running, and setting it to 1 makes 2 tasks too much and we end up moving stuff away.
But if I understand things right, we're now measuring tasks in 'utilization' against group_capacity, so setting group_capacity to CAPACITY_SCALE, means we can end up with many tasks on the one cpu before we move over to another group, right?
I would say no, because we don't have to be overloaded to migrate a task, so the load balance can occur before we become overloaded. The main difference is that a group is now tagged as overloaded more accurately than previously.

The update of the capacity field is only used when calculating the imbalance and the average load on the sched_domain; at that point the group has already been classified (group_overloaded or other). If more than 1 task is running, we overwrite the status by setting group_no_capacity to true. The updated capacity will then be used to cap the maximum amount of load that will be pulled, in order to ensure that the busiest group will not become idle. So for the prefer-sibling case, we don't take the utilization and capacity metrics into account but simply check whether more than 1 task is running in this group (which is what you propose below).

Then we keep the same mechanism as with capacity_factor for calculating the imbalance.

This update of PREFER_SIBLING is quite similar to the previous one, except that we directly use nr_running instead of group_capacity_factor and we force the group state to "no more capacity".
Regards, Vincent
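To summarise the explanation above, here is the update_sd_lb_stats() hunk under discussion once more, annotated with the intent described in this reply (the comments are editorial; the code is the one quoted earlier):

	if (prefer_sibling && sds->local &&
	    group_has_capacity(env, &sds->local_stat)) {
		/*
		 * The group has already been classified; with more than
		 * one task on a sibling-preferring level, overwrite its
		 * state so that later checks see it as having no spare
		 * capacity and excess tasks get moved away.
		 */
		if (sgs->sum_nr_running > 1)
			sgs->group_no_capacity = 1;
		/*
		 * The capped capacity is only used by calculate_imbalance()
		 * and the sched_domain averages, to limit how much load is
		 * pulled so that the busiest group does not end up idle.
		 */
		sgs->group_capacity = min(sgs->group_capacity,
					SCHED_CAPACITY_SCALE);
	}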
Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that the scheduler will put at least 1 task per core.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 37fb92c..731f2ad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6165,6 +6165,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 	 */
 	if (sd->flags & SD_SHARE_CPUCAPACITY) {
+		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178;	/* ~15% */