During load balance, the scheduler evaluates the number of tasks that a group of CPUs can handle. The current method assumes that tasks have a fixed load of SCHED_LOAD_SCALE and that CPUs have a default capacity of SCHED_CAPACITY_SCALE. This assumption generates wrong decisions by creating ghost cores or by removing real ones when the original capacity of the CPUs differs from the default SCHED_CAPACITY_SCALE. We no longer try to evaluate the number of available cores based on the group_capacity; instead, we evaluate the usage of a group and compare it with its capacity.
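To make the new decision concrete, here is a minimal standalone C sketch (not part of the patchset) of the comparison described above; the structure and the imbalance_pct value of 125 mirror the group_is_overloaded() helper introduced later in the series, and the numbers are purely illustrative.

#include <stdio.h>
#include <stdbool.h>

struct group_stats {
	unsigned long capacity;		/* capacity left for CFS tasks */
	unsigned long usage;		/* sum of the CPUs' usage */
	unsigned int sum_nr_running;	/* tasks running in the group */
	unsigned int weight;		/* number of CPUs in the group */
};

/* Overloaded when the usage, scaled by imbalance_pct, exceeds the capacity. */
static bool group_is_overloaded(const struct group_stats *s, unsigned int imbalance_pct)
{
	if (s->sum_nr_running <= s->weight)
		return false;
	return (s->capacity * 100) < (s->usage * imbalance_pct);
}

int main(void)
{
	/* two little CPUs of capacity 640 each, 3 tasks, both CPUs ~90% busy */
	struct group_stats s = { 1280, 1180, 3, 2 };

	printf("overloaded: %s\n", group_is_overloaded(&s, 125) ? "yes" : "no");
	return 0;
}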
This patchset mainly replaces the old capacity method with a new one and keeps the policy almost unchanged, although we could certainly take advantage of this new statistic in several other places of the load balance.
The utilization_avg_contrib is based on the current implementation of load-average tracking. I also have a version of utilization_avg_contrib that is based on the new implementation proposal [1], but I haven't provided the patches and results as [1] is still under review. I can provide changes on top of [1] to adapt how utilization_avg_contrib is computed to the new mechanism.
Change since V5:
 - remove patches that have been merged since v5: patches 01, 02, 03, 04, 05, 07
 - update commit log and add more details on the purpose of the patches
 - fix/remove useless code with the rebase on patchset [2]
 - remove capacity_orig in sched_group_capacity as it is not used
 - move code into the right patch
 - add some helper functions to factorize code
Change since V4:
 - rebase to manage conflicts with changes in selection of busiest group [4]
Change since V3:
 - add usage_avg_contrib statistic which sums the running time of tasks on a rq
 - use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization
 - fix replacement of power by capacity
 - update some comments
Change since V2:
 - rebase on top of the capacity renaming
 - fix wake_affine statistic update
 - rework nohz_kick_needed
 - optimize the active migration of a task from a CPU with reduced capacity
 - rename group_activity to group_utilization and remove the unused total_utilization
 - repair SD_PREFER_SIBLING and use it for the SMT level
 - reorder the patchset to gather patches with the same topics
Change since V1:
 - add 3 fixes
 - correct some commit messages
 - replace capacity computation by activity
 - take into account current cpu capacity
[1] https://lkml.org/lkml/2014/7/18/110
[2] https://lkml.org/lkml/2014/7/25/589
Vincent Guittot (6):
  sched: add per rq cpu_capacity_orig
  sched: move cfs task on a CPU with higher capacity
  sched: add utilization_avg_contrib
  sched: get CPU's usage statistic
  sched: replace capacity_factor by usage
  sched: add SD_PREFER_SIBLING for SMT level
 include/linux/sched.h |  19 +++-
 kernel/sched/core.c   |  15 +--
 kernel/sched/debug.c  |   9 +-
 kernel/sched/fair.c   | 276 ++++++++++++++++++++++++++++++--------------------
 kernel/sched/sched.h  |  11 +-
 5 files changed, 199 insertions(+), 131 deletions(-)
This new field cpu_capacity_orig reflects the original capacity of a CPU before it is altered by frequency scaling, rt tasks and/or IRQs.
The cpu_capacity_orig will be used in several places to detect when the capacity of a CPU has been noticeably reduced so that we can trigger a load balance to look for a CPU with better capacity. As an example, we can detect when a CPU handles a significant amount of IRQ time (with CONFIG_IRQ_TIME_ACCOUNTING) but is seen as idle by the scheduler, whereas CPUs that are really idle are available.
In addition, this new cpu_capacity_orig will be used to evaluate the usage of the CPU by CFS tasks.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
---
 kernel/sched/core.c  | 2 +-
 kernel/sched/fair.c  | 8 +++++++-
 kernel/sched/sched.h | 1 +
 3 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2a93b87..e20f203 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7056,7 +7056,7 @@ void __init sched_init(void) #ifdef CONFIG_SMP rq->sd = NULL; rq->rd = NULL; - rq->cpu_capacity = SCHED_CAPACITY_SCALE; + rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE; rq->post_schedule = 0; rq->active_balance = 0; rq->next_balance = jiffies; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2a1e6ac..622f8b0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4092,6 +4092,11 @@ static unsigned long capacity_of(int cpu) return cpu_rq(cpu)->cpu_capacity; }
+static unsigned long capacity_orig_of(int cpu) +{ + return cpu_rq(cpu)->cpu_capacity_orig; +} + static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu); @@ -5765,6 +5770,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
capacity >>= SCHED_CAPACITY_SHIFT;
+ cpu_rq(cpu)->cpu_capacity_orig = capacity; sdg->sgc->capacity_orig = capacity;
if (sched_feat(ARCH_CAPACITY)) @@ -5826,7 +5832,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu) * Runtime updates will correct capacity_orig. */ if (unlikely(!rq->sd)) { - capacity_orig += capacity_of(cpu); + capacity_orig += capacity_orig_of(cpu); capacity += capacity_of(cpu); continue; } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 1bc6aad..f332e45 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -575,6 +575,7 @@ struct rq { struct sched_domain *sd;
unsigned long cpu_capacity; + unsigned long cpu_capacity_orig;
unsigned char idle_balance; /* For active balancing */
If the CPU is used for handling a lot of IRQs, trigger a load balance to check whether it's worth moving its tasks to another CPU that has more capacity.
As a side note, this will not generate more spurious ilb kicks because we already trigger an ilb if there is more than 1 busy cpu. If this cpu is the only one that has a task, we will trigger the ilb once to migrate the task.
The nohz_kick_needed function has been cleaned up a bit while adding the new test.

env.src_cpu and env.src_rq must be set unconditionally because they are used in need_active_balance, which is called even if busiest->nr_running equals 1.
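For reference, a small standalone program (not part of the patch) showing the threshold that the check_cpu_capacity() test added here implements: with the usual imbalance_pct of 125, a CPU is flagged once less than roughly 80% of its original capacity is left for CFS tasks. The numbers are purely illustrative.

#include <stdio.h>

static int check_cpu_capacity(unsigned long capacity, unsigned long capacity_orig,
			      unsigned int imbalance_pct)
{
	return (capacity * imbalance_pct) < (capacity_orig * 100);
}

int main(void)
{
	/* IRQ/rt pressure left only 430 out of 1024 for CFS tasks */
	printf("reduced: %d\n", check_cpu_capacity(430, 1024, 125));	/* 1 */
	/* 900 out of 1024 left: still above the ~80% threshold */
	printf("reduced: %d\n", check_cpu_capacity(900, 1024, 125));	/* 0 */
	return 0;
}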
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 70 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 50 insertions(+), 20 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 622f8b0..7422044 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5885,6 +5885,18 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group) }
/* + * Check whether the capacity of the rq has been noticeably reduced by side + * activity. The imbalance_pct is used for the threshold. + * Return true is the capacity is reduced + */ +static inline int +check_cpu_capacity(struct rq *rq, struct sched_domain *sd) +{ + return ((rq->cpu_capacity * sd->imbalance_pct) < + (rq->cpu_capacity_orig * 100)); +} + +/* * Group imbalance indicates (and tries to solve) the problem where balancing * groups is inadequate due to tsk_cpus_allowed() constraints. * @@ -6555,6 +6567,14 @@ static int need_active_balance(struct lb_env *env) */ if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu) return 1; + + /* + * The src_cpu's capacity is reduced because of other + * sched_class or IRQs, we trig an active balance to move the + * task + */ + if (check_cpu_capacity(env->src_rq, sd)) + return 1; }
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2); @@ -6656,6 +6676,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
schedstat_add(sd, lb_imbalance[idle], env.imbalance);
+ env.src_cpu = busiest->cpu; + env.src_rq = busiest; + ld_moved = 0; if (busiest->nr_running > 1) { /* @@ -6665,8 +6688,6 @@ static int load_balance(int this_cpu, struct rq *this_rq, * correctly treated as an imbalance. */ env.flags |= LBF_ALL_PINNED; - env.src_cpu = busiest->cpu; - env.src_rq = busiest; env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
more_balance: @@ -7372,10 +7393,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
/* * Current heuristic for kicking the idle load balancer in the presence - * of an idle cpu is the system. + * of an idle cpu in the system. * - This rq has more than one task. - * - At any scheduler domain level, this cpu's scheduler group has multiple - * busy cpu's exceeding the group's capacity. + * - This rq has at least one CFS task and the capacity of the CPU is + * significantly reduced because of RT tasks or IRQs. + * - At parent of LLC scheduler domain level, this cpu's scheduler group has + * multiple busy cpu. * - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler * domain span are idle. */ @@ -7385,9 +7408,10 @@ static inline int nohz_kick_needed(struct rq *rq) struct sched_domain *sd; struct sched_group_capacity *sgc; int nr_busy, cpu = rq->cpu; + bool kick = false;
if (unlikely(rq->idle_balance)) - return 0; + return false;
/* * We may be recently in ticked or tickless idle mode. At the first @@ -7401,38 +7425,44 @@ static inline int nohz_kick_needed(struct rq *rq) * balancing. */ if (likely(!atomic_read(&nohz.nr_cpus))) - return 0; + return false;
if (time_before(now, nohz.next_balance)) - return 0; + return false;
if (rq->nr_running >= 2) - goto need_kick; + return true;
rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_busy, cpu)); - if (sd) { sgc = sd->groups->sgc; nr_busy = atomic_read(&sgc->nr_busy_cpus);
- if (nr_busy > 1) - goto need_kick_unlock; + if (nr_busy > 1) { + kick = true; + goto unlock; + } + }
- sd = rcu_dereference(per_cpu(sd_asym, cpu)); + sd = rcu_dereference(rq->sd); + if (sd) { + if ((rq->cfs.h_nr_running >= 1) && + check_cpu_capacity(rq, sd)) { + kick = true; + goto unlock; + } + }
+ sd = rcu_dereference(per_cpu(sd_asym, cpu)); if (sd && (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu)) - goto need_kick_unlock; - - rcu_read_unlock(); - return 0; + kick = true;
-need_kick_unlock: +unlock: rcu_read_unlock(); -need_kick: - return 1; + return kick; } #else static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
Add new statistics which reflect the average time a task is running on the CPU and the sum of these running times for the tasks on a runqueue. The latter is named utilization_avg_contrib.

This patch is based on the usage metric that was proposed in the 1st versions of the per-entity load tracking patchset by Paul Turner pjt@google.com but that was removed afterwards. This version differs from the original one in that it is not linked to task_group.

The rq's utilization_avg_contrib will be used to check whether a rq is overloaded or not, instead of trying to compute how many tasks a group of CPUs can handle.

Rename runnable_avg_period into avg_period as it is now used with both runnable_avg_sum and running_avg_sum.

Add some descriptions of the variables to explain their differences.
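As a rough illustration of the difference between the two sums, here is a much-simplified standalone sketch (not the kernel code: the real __update_entity_runnable_avg() also decays the sums geometrically every 1024us period, which is omitted here to keep the example short).

#include <stdio.h>

struct simple_avg {
	unsigned int runnable_avg_sum;	/* time spent runnable (including running) */
	unsigned int running_avg_sum;	/* time spent actually running on the CPU */
	unsigned int avg_period;	/* total elapsed time being tracked */
};

static void account(struct simple_avg *sa, unsigned int delta_us, int runnable, int running)
{
	if (runnable)
		sa->runnable_avg_sum += delta_us;
	if (running)		/* running implies runnable */
		sa->running_avg_sum += delta_us;
	sa->avg_period += delta_us;
}

int main(void)
{
	struct simple_avg sa = { 0, 0, 0 };

	account(&sa, 600, 1, 1);	/* 600us running on the CPU */
	account(&sa, 400, 1, 0);	/* 400us waiting on the runqueue */
	account(&sa, 1000, 0, 0);	/* 1000us sleeping */

	/* utilization ~ running/period, load ~ runnable/period (weighted) */
	printf("running %u, runnable %u, period %u\n",
	       sa.running_avg_sum, sa.runnable_avg_sum, sa.avg_period);
	return 0;
}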
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h | 19 +++++++++++---
 kernel/sched/debug.c  |  9 ++++---
 kernel/sched/fair.c   | 68 +++++++++++++++++++++++++++++++++++++++------------
 kernel/sched/sched.h  |  8 +++++-
 4 files changed, 81 insertions(+), 23 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 48ae6c4..51df220 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1071,15 +1071,26 @@ struct load_weight { };
struct sched_avg { + u64 last_runnable_update; + s64 decay_count; + /* + * utilization_avg_contrib describes the amount of time that a + * sched_entity is running on a CPU. It is based on running_avg_sum + * and is scaled in the range [0..SCHED_LOAD_SCALE]. + * load_avg_contrib described the the amount of time that a + * sched_entity is runnable on a rq. It is based on both + * runnable_avg_sum and the weight of the task. + */ + unsigned long load_avg_contrib, utilization_avg_contrib; /* * These sums represent an infinite geometric series and so are bound * above by 1024/(1-y). Thus we only need a u32 to store them for all * choices of y < 1-2^(-32)*1024. + * runnable_avg_sum represents the amount of time a sched_entity is on + * the runqueue whereas running_avg_sum reflects the time the + * sched_entity is effectively running on the runqueue. */ - u32 runnable_avg_sum, runnable_avg_period; - u64 last_runnable_update; - s64 decay_count; - unsigned long load_avg_contrib; + u32 runnable_avg_sum, avg_period, running_avg_sum; };
#ifdef CONFIG_SCHEDSTATS diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index c7fe1ea0..1761db9 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -71,7 +71,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group if (!se) { struct sched_avg *avg = &cpu_rq(cpu)->avg; P(avg->runnable_avg_sum); - P(avg->runnable_avg_period); + P(avg->avg_period); return; }
@@ -94,7 +94,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group P(se->load.weight); #ifdef CONFIG_SMP P(se->avg.runnable_avg_sum); - P(se->avg.runnable_avg_period); + P(se->avg.avg_period); P(se->avg.load_avg_contrib); P(se->avg.decay_count); #endif @@ -215,6 +215,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq) cfs_rq->runnable_load_avg); SEQ_printf(m, " .%-30s: %ld\n", "blocked_load_avg", cfs_rq->blocked_load_avg); + SEQ_printf(m, " .%-30s: %ld\n", "utilization_load_avg", + cfs_rq->utilization_load_avg); #ifdef CONFIG_FAIR_GROUP_SCHED SEQ_printf(m, " .%-30s: %ld\n", "tg_load_contrib", cfs_rq->tg_load_contrib); @@ -631,7 +633,8 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m) P(se.load.weight); #ifdef CONFIG_SMP P(se.avg.runnable_avg_sum); - P(se.avg.runnable_avg_period); + P(se.avg.running_avg_sum); + P(se.avg.avg_period); P(se.avg.load_avg_contrib); P(se.avg.decay_count); #endif diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7422044..2cf153d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -677,8 +677,8 @@ void init_task_runnable_average(struct task_struct *p)
p->se.avg.decay_count = 0; slice = sched_slice(task_cfs_rq(p), &p->se) >> 10; - p->se.avg.runnable_avg_sum = slice; - p->se.avg.runnable_avg_period = slice; + p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice; + p->se.avg.avg_period = slice; __update_task_entity_contrib(&p->se); } #else @@ -1547,7 +1547,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period) *period = now - p->last_task_numa_placement; } else { delta = p->se.avg.runnable_avg_sum; - *period = p->se.avg.runnable_avg_period; + *period = p->se.avg.avg_period; }
p->last_sum_exec_runtime = runtime; @@ -2297,7 +2297,8 @@ static u32 __compute_runnable_contrib(u64 n) */ static __always_inline int __update_entity_runnable_avg(u64 now, struct sched_avg *sa, - int runnable) + int runnable, + int running) { u64 delta, periods; u32 runnable_contrib; @@ -2323,7 +2324,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, sa->last_runnable_update = now;
/* delta_w is the amount already accumulated against our next period */ - delta_w = sa->runnable_avg_period % 1024; + delta_w = sa->avg_period % 1024; if (delta + delta_w >= 1024) { /* period roll-over */ decayed = 1; @@ -2336,7 +2337,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now, delta_w = 1024 - delta_w; if (runnable) sa->runnable_avg_sum += delta_w; - sa->runnable_avg_period += delta_w; + if (running) + sa->running_avg_sum += delta_w; + sa->avg_period += delta_w;
delta -= delta_w;
@@ -2346,20 +2349,26 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum, periods + 1); - sa->runnable_avg_period = decay_load(sa->runnable_avg_period, + sa->running_avg_sum = decay_load(sa->running_avg_sum, + periods + 1); + sa->avg_period = decay_load(sa->avg_period, periods + 1);
/* Efficiently calculate \sum (1..n_period) 1024*y^i */ runnable_contrib = __compute_runnable_contrib(periods); if (runnable) sa->runnable_avg_sum += runnable_contrib; - sa->runnable_avg_period += runnable_contrib; + if (running) + sa->running_avg_sum += runnable_contrib; + sa->avg_period += runnable_contrib; }
/* Remainder of delta accrued against u_0` */ if (runnable) sa->runnable_avg_sum += delta; - sa->runnable_avg_period += delta; + if (running) + sa->running_avg_sum += delta; + sa->avg_period += delta;
return decayed; } @@ -2411,7 +2420,7 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
/* The fraction of a cpu used by this cfs_rq */ contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT, - sa->runnable_avg_period + 1); + sa->avg_period + 1); contrib -= cfs_rq->tg_runnable_contrib;
if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) { @@ -2464,7 +2473,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
static inline void update_rq_runnable_avg(struct rq *rq, int runnable) { - __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable); + __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable, + runnable); __update_tg_runnable_avg(&rq->avg, &rq->cfs); } #else /* CONFIG_FAIR_GROUP_SCHED */ @@ -2482,7 +2492,7 @@ static inline void __update_task_entity_contrib(struct sched_entity *se)
/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */ contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight); - contrib /= (se->avg.runnable_avg_period + 1); + contrib /= (se->avg.avg_period + 1); se->avg.load_avg_contrib = scale_load(contrib); }
@@ -2501,6 +2511,27 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se) return se->avg.load_avg_contrib - old_contrib; }
+ +static inline void __update_task_entity_utilization(struct sched_entity *se) +{ + u32 contrib; + + /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */ + contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE); + contrib /= (se->avg.avg_period + 1); + se->avg.utilization_avg_contrib = scale_load(contrib); +} + +static long __update_entity_utilization_avg_contrib(struct sched_entity *se) +{ + long old_contrib = se->avg.utilization_avg_contrib; + + if (entity_is_task(se)) + __update_task_entity_utilization(se); + + return se->avg.utilization_avg_contrib - old_contrib; +} + static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq, long load_contrib) { @@ -2517,7 +2548,7 @@ static inline void update_entity_load_avg(struct sched_entity *se, int update_cfs_rq) { struct cfs_rq *cfs_rq = cfs_rq_of(se); - long contrib_delta; + long contrib_delta, utilization_delta; u64 now;
/* @@ -2529,16 +2560,20 @@ static inline void update_entity_load_avg(struct sched_entity *se, else now = cfs_rq_clock_task(group_cfs_rq(se));
- if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq)) + if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq, + cfs_rq->curr == se)) return;
contrib_delta = __update_entity_load_avg_contrib(se); + utilization_delta = __update_entity_utilization_avg_contrib(se);
if (!update_cfs_rq) return;
- if (se->on_rq) + if (se->on_rq) { cfs_rq->runnable_load_avg += contrib_delta; + cfs_rq->utilization_load_avg += utilization_delta; + } else subtract_blocked_load_contrib(cfs_rq, -contrib_delta); } @@ -2615,6 +2650,7 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq, }
cfs_rq->runnable_load_avg += se->avg.load_avg_contrib; + cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib; /* we force update consideration on load-balancer moves */ update_cfs_rq_blocked_load(cfs_rq, !wakeup); } @@ -2633,6 +2669,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq, update_cfs_rq_blocked_load(cfs_rq, !sleep);
cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib; + cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib; if (sleep) { cfs_rq->blocked_load_avg += se->avg.load_avg_contrib; se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter); @@ -2970,6 +3007,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) */ update_stats_wait_end(cfs_rq, se); __dequeue_entity(cfs_rq, se); + update_entity_load_avg(se, 1); }
update_stats_curr_start(cfs_rq, se); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index f332e45..3ccb136 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -339,8 +339,14 @@ struct cfs_rq { * Under CFS, load is tracked on a per-entity basis and aggregated up. * This allows for the description of both thread and group usage (in * the FAIR_GROUP_SCHED case). + * runnable_load_avg is the sum of the load_avg_contrib of the + * sched_entities on the rq. + * blocked_load_avg is similar to runnable_load_avg except that its + * the blocked sched_entities on the rq. + * utilization_load_avg is the sum of the average running time of the + * sched_entities on the rq. */ - unsigned long runnable_load_avg, blocked_load_avg; + unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg; atomic64_t decay_counter; u64 last_decay; atomic_long_t removed_load;
On Tue, Sep 23, 2014 at 06:08:02PM +0200, Vincent Guittot wrote:
struct sched_avg {
- u64 last_runnable_update;
- s64 decay_count;
- /*
* utilization_avg_contrib describes the amount of time that a
* sched_entity is running on a CPU. It is based on running_avg_sum
* and is scaled in the range [0..SCHED_LOAD_SCALE].
* load_avg_contrib described the the amount of time that a
* sched_entity is runnable on a rq. It is based on both
* runnable_avg_sum and the weight of the task.
*/
unsigned long load_avg_contrib, utilization_avg_contrib;
/*
 * These sums represent an infinite geometric series and so are bound
 * above by 1024/(1-y). Thus we only need a u32 to store them for all
 * choices of y < 1-2^(-32)*1024.
* runnable_avg_sum represents the amount of time a sched_entity is on
* the runqueue whereas running_avg_sum reflects the time the
* sched_entity is effectively running on the runqueue.
I would say: 'running on the cpu'. I would further clarify that runnable also includes running; the above could be read such that runnable is only the time spent waiting on the queue, excluding the time spent on the cpu.
*/
- u32 runnable_avg_sum, avg_period, running_avg_sum;
};
On 3 October 2014 16:15, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Sep 23, 2014 at 06:08:02PM +0200, Vincent Guittot wrote:
struct sched_avg {
u64 last_runnable_update;
s64 decay_count;
/*
* utilization_avg_contrib describes the amount of time that a
* sched_entity is running on a CPU. It is based on running_avg_sum
* and is scaled in the range [0..SCHED_LOAD_SCALE].
* load_avg_contrib described the the amount of time that a
* sched_entity is runnable on a rq. It is based on both
* runnable_avg_sum and the weight of the task.
*/
unsigned long load_avg_contrib, utilization_avg_contrib;
/*
 * These sums represent an infinite geometric series and so are bound
 * above by 1024/(1-y). Thus we only need a u32 to store them for all
 * choices of y < 1-2^(-32)*1024.
* runnable_avg_sum represents the amount of time a sched_entity is on
* the runqueue whereas running_avg_sum reflects the time the
* sched_entity is effectively running on the runqueue.
I would say: 'running on the cpu'. I would further clarify that runnable also includes running; the above could be read such that runnable is only the time spent waiting on the queue, excluding the time spent on the cpu.
ok
*/
u32 runnable_avg_sum, avg_period, running_avg_sum;
};
On Tue, Sep 23, 2014 at 06:08:02PM +0200, Vincent Guittot wrote:
+++ b/kernel/sched/sched.h @@ -339,8 +339,14 @@ struct cfs_rq { * Under CFS, load is tracked on a per-entity basis and aggregated up. * This allows for the description of both thread and group usage (in * the FAIR_GROUP_SCHED case).
* runnable_load_avg is the sum of the load_avg_contrib of the
* sched_entities on the rq.
* blocked_load_avg is similar to runnable_load_avg except that its
* the blocked sched_entities on the rq.
Strictly speaking blocked entities aren't on a rq as such, but yeah, no idea how to better put it. Just being pedantic, which isn't helpful I guess :-)
* utilization_load_avg is the sum of the average running time of the
* sched_entities on the rq. */
So I think there was some talk about a blocked_utilization thingy, which would track the avg running time of the tasks currently asleep, right?
- unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg; atomic64_t decay_counter; u64 last_decay; atomic_long_t removed_load;
On 3 October 2014 16:36, Peter Zijlstra peterz@infradead.org wrote:
* utilization_load_avg is the sum of the average running time of the
* sched_entities on the rq. */
So I think there was some talk about a blocked_utilization thingy, which would track the avg running time of the tasks currently asleep, right?
yes. Do you mean that we should anticipate and rename utilization_load_avg into utilization_runnable_avg to make space for a utilization_blocked_avg that could be added in future ?
unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg; atomic64_t decay_counter; u64 last_decay; atomic_long_t removed_load;
On Fri, Oct 03, 2014 at 04:51:01PM +0200, Vincent Guittot wrote:
On 3 October 2014 16:36, Peter Zijlstra peterz@infradead.org wrote:
* utilization_load_avg is the sum of the average running time of the
* sched_entities on the rq. */
So I think there was some talk about a blocked_utilization thingy, which would track the avg running time of the tasks currently asleep, right?
yes. Do you mean that we should anticipate and rename utilization_load_avg into utilization_runnable_avg to make space for a utilization_blocked_avg that could be added in future ?
nah, just trying to put things straight in my brain, including what is 'missing'.
On Fri, Oct 03, 2014 at 04:14:51PM +0100, Peter Zijlstra wrote:
On Fri, Oct 03, 2014 at 04:51:01PM +0200, Vincent Guittot wrote:
On 3 October 2014 16:36, Peter Zijlstra peterz@infradead.org wrote:
* utilization_load_avg is the sum of the average running time of the
* sched_entities on the rq. */
So I think there was some talk about a blocked_utilization thingy, which would track the avg running time of the tasks currently asleep, right?
yes. Do you mean that we should anticipate and rename utilization_load_avg into utilization_runnable_avg to make space for a utilization_blocked_avg that could be added in future ?
nah, just trying to put things straight in my brain, including what is 'missing'.
As Ben pointed out in the scale-invariance thread, we need blocked utilization. I fully agree with that. It doesn't make any sense not to include it. In fact I do have the patch already.
If you want to rename utilization_load_avg you should name it utilization_running_avg, not utilization_runnable_avg :) Or even better, something shorter.
Monitor the usage level of each group of each sched_domain level. The usage is the amount of cpu_capacity that is currently used on a CPU or group of CPUs. We use the utilization_load_avg to evaluate the usage level of each group.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2cf153d..4097e3f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4523,6 +4523,17 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }
+static int get_cpu_usage(int cpu) +{ + unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg; + unsigned long capacity = capacity_orig_of(cpu); + + if (usage >= SCHED_LOAD_SCALE) + return capacity + 1; + + return (usage * capacity) >> SCHED_LOAD_SHIFT; +} + /* * select_task_rq_fair: Select target runqueue for the waking task in domains * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, @@ -5663,6 +5674,7 @@ struct sg_lb_stats { unsigned long sum_weighted_load; /* Weighted load of group's tasks */ unsigned long load_per_task; unsigned long group_capacity; + unsigned long group_usage; /* Total usage of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ unsigned int group_capacity_factor; unsigned int idle_cpus; @@ -6037,6 +6049,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, load = source_load(i, load_idx);
sgs->group_load += load; + sgs->group_usage += get_cpu_usage(i); sgs->sum_nr_running += rq->cfs.h_nr_running;
if (rq->nr_running > 1)
On 23/09/14 17:08, Vincent Guittot wrote:
Monitor the usage level of each group of each sched_domain level. The usage is the amount of cpu_capacity that is currently used on a CPU or group of CPUs. We use the utilization_load_avg to evaluate the usage level of each group.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2cf153d..4097e3f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4523,6 +4523,17 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; } +static int get_cpu_usage(int cpu) +{
- unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
- unsigned long capacity = capacity_orig_of(cpu);
- if (usage >= SCHED_LOAD_SCALE)
return capacity + 1;
Why are you returning rq->cpu_capacity_orig + 1 (1025) when utilization_load_avg is greater than or equal to 1024, and not usage or (usage * capacity) >> SCHED_LOAD_SHIFT too?

In case the weight of a sched group is greater than 1, you might lose the information that the whole sched group is over-utilized too.
You add up the individual cpu usage values for a group by sgs->group_usage += get_cpu_usage(i) in update_sg_lb_stats and later use sgs->group_usage in group_is_overloaded to compare it against sgs->group_capacity (taking imbalance_pct into consideration).
- return (usage * capacity) >> SCHED_LOAD_SHIFT;
Nit-pick: Since you're multiplying by a capacity value (rq->cpu_capacity_orig) you should shift by SCHED_CAPACITY_SHIFT.
Just to make sure: You do this scaling of usage by cpu_capacity_orig here only to cater for the fact that cpu_capacity_orig might be uarch scaled (by arch_scale_cpu_capacity, !SMT) in update_cpu_capacity while utilization_load_avg is currently not. We don't even uArch scale on ARM TC2 big.LITTLE platform in mainline today due to the missing clock-frequency property in the device tree.
I think it's hard for people to grasp that your patch-set takes uArch scaling of capacity into consideration but not frequency scaling of capacity (via arch_scale_freq_capacity, not used at the moment).
+}
/*
- select_task_rq_fair: Select target runqueue for the waking task in domains
- that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5663,6 +5674,7 @@ struct sg_lb_stats { unsigned long sum_weighted_load; /* Weighted load of group's tasks */ unsigned long load_per_task; unsigned long group_capacity;
- unsigned long group_usage; /* Total usage of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ unsigned int group_capacity_factor; unsigned int idle_cpus;
@@ -6037,6 +6049,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, load = source_load(i, load_idx); sgs->group_load += load;
sgs->sum_nr_running += rq->cfs.h_nr_running;sgs->group_usage += get_cpu_usage(i);
if (rq->nr_running > 1)
On 25 September 2014 21:05, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 23/09/14 17:08, Vincent Guittot wrote:
Monitor the usage level of each group of each sched_domain level. The usage is the amount of cpu_capacity that is currently used on a CPU or group of CPUs. We use the utilization_load_avg to evaluate the usage level of each group.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2cf153d..4097e3f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4523,6 +4523,17 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }
+static int get_cpu_usage(int cpu) +{
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
unsigned long capacity = capacity_orig_of(cpu);
if (usage >= SCHED_LOAD_SCALE)
return capacity + 1;
Why are you returning rq->cpu_capacity_orig + 1 (1025) when utilization_load_avg is greater than or equal to 1024, and not usage or (usage * capacity) >> SCHED_LOAD_SHIFT too?
The usage can't be higher than the full capacity of the CPU because it's about the running time on this CPU. Nevertheless, usage can be higher than SCHED_LOAD_SCALE because of unfortunate rounding in avg_period and running_load_avg or just after migrating tasks until the average stabilizes with the new running time.
In case the weight of a sched group is greater than 1, you might lose the information that the whole sched group is over-utilized too.
that's exactly for sched_group with more than 1 CPU that we need to cap the usage of a CPU to 100%. Otherwise, the group could be seen as overloaded (CPU0 usage at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity
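A standalone sketch of the arithmetic behind this example (not part of the thread's patches; it assumes an original capacity of 1024 for both CPUs and mirrors the capping done in get_cpu_usage(), with illustrative utilization values):

#include <stdio.h>

#define SCHED_LOAD_SCALE	1024UL
#define SCHED_LOAD_SHIFT	10

static unsigned long cpu_usage(unsigned long util, unsigned long cap_orig)
{
	if (util >= SCHED_LOAD_SCALE)	/* cap transient overshoot to 100% */
		return cap_orig + 1;
	return (util * cap_orig) >> SCHED_LOAD_SHIFT;
}

int main(void)
{
	unsigned long cap_orig = 1024, group_capacity = 2 * cap_orig;
	unsigned long util0 = 1239, util1 = 819;	/* ~121% and ~80% */
	unsigned long uncapped = util0 + util1;
	unsigned long capped = cpu_usage(util0, cap_orig) + cpu_usage(util1, cap_orig);

	printf("uncapped %lu vs %lu -> %s\n", uncapped, group_capacity,
	       uncapped > group_capacity ? "looks overloaded" : "fits");
	printf("capped   %lu vs %lu -> %s\n", capped, group_capacity,
	       capped > group_capacity ? "looks overloaded" : "fits");
	return 0;
}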
You add up the individual cpu usage values for a group by sgs->group_usage += get_cpu_usage(i) in update_sg_lb_stats and later use sgs->group_usage in group_is_overloaded to compare it against sgs->group_capacity (taking imbalance_pct into consideration).
return (usage * capacity) >> SCHED_LOAD_SHIFT;
Nit-pick: Since you're multiplying by a capacity value (rq->cpu_capacity_orig) you should shift by SCHED_CAPACITY_SHIFT.
we want to compare the output of the function with some capacity figures so i think that >> SCHED_LOAD_SHIFT is the right operation.
Just to make sure: You do this scaling of usage by cpu_capacity_orig here only to cater for the fact that cpu_capacity_orig might be uarch scaled (by arch_scale_cpu_capacity, !SMT) in update_cpu_capacity while
I do this for any system with CPUs that have an original capacity that is different from SCHED_CAPACITY_SCALE so it's for both uArch and SMT.
utilization_load_avg is currently not. We don't even uArch scale on ARM TC2 big.LITTLE platform in mainline today due to the missing clock-frequency property in the device tree.
sorry i don't catch your point
I think it's hard for people to grasp that your patch-set takes uArch scaling of capacity into consideration but not frequency scaling of capacity (via arch_scale_freq_capacity, not used at the moment).
+}
/*
- select_task_rq_fair: Select target runqueue for the waking task in domains
- that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5663,6 +5674,7 @@ struct sg_lb_stats { unsigned long sum_weighted_load; /* Weighted load of group's tasks */ unsigned long load_per_task; unsigned long group_capacity;
unsigned long group_usage; /* Total usage of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ unsigned int group_capacity_factor; unsigned int idle_cpus;
@@ -6037,6 +6049,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, load = source_load(i, load_idx);
sgs->group_load += load;
sgs->group_usage += get_cpu_usage(i); sgs->sum_nr_running += rq->cfs.h_nr_running; if (rq->nr_running > 1)
On Fri, Sep 26, 2014 at 01:17:43PM +0100, Vincent Guittot wrote:
On 25 September 2014 21:05, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 23/09/14 17:08, Vincent Guittot wrote:
Monitor the usage level of each group of each sched_domain level. The usage is the amount of cpu_capacity that is currently used on a CPU or group of CPUs. We use the utilization_load_avg to evaluate the usage level of each group.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2cf153d..4097e3f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4523,6 +4523,17 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }
+static int get_cpu_usage(int cpu) +{
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
unsigned long capacity = capacity_orig_of(cpu);
if (usage >= SCHED_LOAD_SCALE)
return capacity + 1;
Why are you returning rq->cpu_capacity_orig + 1 (1025) when utilization_load_avg is greater than or equal to 1024, and not usage or (usage * capacity) >> SCHED_LOAD_SHIFT too?
The usage can't be higher than the full capacity of the CPU because it's about the running time on this CPU. Nevertheless, usage can be higher than SCHED_LOAD_SCALE because of unfortunate rounding in avg_period and running_load_avg or just after migrating tasks until the average stabilizes with the new running time.
I fully agree that the cpu usage should be capped to capacity, but why do you return capacity + 1? I would just return capacity, no?
Now that you have gotten rid of 'usage' everywhere else, shouldn't this function be renamed to get_cpu_utilization()?
On 26/09/14 13:17, Vincent Guittot wrote:
On 25 September 2014 21:05, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 23/09/14 17:08, Vincent Guittot wrote:
[...]
+static int get_cpu_usage(int cpu) +{
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
unsigned long capacity = capacity_orig_of(cpu);
if (usage >= SCHED_LOAD_SCALE)
return capacity + 1;
Why are you returning rq->cpu_capacity_orig + 1 (1025) when utilization_load_avg is greater than or equal to 1024, and not usage or (usage * capacity) >> SCHED_LOAD_SHIFT too?
The usage can't be higher than the full capacity of the CPU because it's about the running time on this CPU. Nevertheless, usage can be higher than SCHED_LOAD_SCALE because of unfortunate rounding in avg_period and running_load_avg or just after migrating tasks until the average stabilizes with the new running time.
Ok, I got it now, thanks!
When running 'hackbench -p -T -s 10 -l 1' on TC2, the usage for a cpu goes occasionally also much higher than SCHED_LOAD_SCALE. After all, p->se.avg.running_avg_sum is initialized to slice in init_task_runnable_average.
In case the weight of a sched group is greater than 1, you might lose the information that the whole sched group is over-utilized too.
that's exactly for sched_group with more than 1 CPU that we need to cap the usage of a CPU to 100%. Otherwise, the group could be seen as overloaded (CPU0 usage at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity
Makes sense, we don't want to do anything in this case on a sched level (e.g. DIE), the appropriate level below (e.g. MC) should balance this out first. Got it!
You add up the individual cpu usage values for a group by sgs->group_usage += get_cpu_usage(i) in update_sg_lb_stats and later use sgs->group_usage in group_is_overloaded to compare it against sgs->group_capacity (taking imbalance_pct into consideration).
return (usage * capacity) >> SCHED_LOAD_SHIFT;
Nit-pick: Since you're multiplying by a capacity value (rq->cpu_capacity_orig) you should shift by SCHED_CAPACITY_SHIFT.
we want to compare the output of the function with some capacity figures so i think that >> SCHED_LOAD_SHIFT is the right operation.
Just to make sure: You do this scaling of usage by cpu_capacity_orig here only to cater for the fact that cpu_capacity_orig might be uarch scaled (by arch_scale_cpu_capacity, !SMT) in update_cpu_capacity while
I do this for any system with CPUs that have an original capacity that is different from SCHED_CAPACITY_SCALE so it's for both uArch and SMT.
Understood so your current patch-set is doing uArch scaling for capacity and since you're not doing uArch scaling for utilization, you do this '* capacity) >> SCHED_LOAD_SHIFT' thing. Correct?
utilization_load_avg is currently not. We don't even uArch scale on ARM TC2 big.LITTLE platform in mainline today due to the missing clock-frequency property in the device tree.
sorry i don't catch your point
With mainline dts file for ARM TC2, the rq->cpu_capacity-orig is 1024 for all 5 cpus (A15's and A7's). The arm topology shim layer barfs a
/cpus/cpu@x missing clock-frequency property
per cpu in this case and doesn't scale the capacity. Only when I add
clock-frequency = <xxxxxxxxx>;
per cpuX node into the dts file, I get a system with asymmetric rq->cpu_capacity_orig values (606 for an A7 and 1441 for an A15).
I think it's hard for people to grasp that your patch-set takes uArch scaling of capacity into consideration but not frequency scaling of capacity (via arch_scale_freq_capacity, not used at the moment).
[...]
Hi Vincent, On 9/26/14, 8:17 PM, Vincent Guittot wrote:
On 25 September 2014 21:05, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 23/09/14 17:08, Vincent Guittot wrote:
Monitor the usage level of each group of each sched_domain level. The usage is the amount of cpu_capacity that is currently used on a CPU or group of CPUs. We use the utilization_load_avg to evaluate the usage level of each group.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2cf153d..4097e3f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4523,6 +4523,17 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }
+static int get_cpu_usage(int cpu) +{
unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
unsigned long capacity = capacity_orig_of(cpu);
if (usage >= SCHED_LOAD_SCALE)
return capacity + 1;
Why are you returning rq->cpu_capacity_orig + 1 (1025) when utilization_load_avg is greater than or equal to 1024, and not usage or (usage * capacity) >> SCHED_LOAD_SHIFT too?
The usage can't be higher than the full capacity of the CPU because it's about the running time on this CPU. Nevertheless, usage can be higher than SCHED_LOAD_SCALE because of unfortunate rounding in avg_period and running_load_avg or just after migrating tasks until the average stabilizes with the new running time.
In case the weight of a sched group is greater than 1, you might lose the information that the whole sched group is over-utilized too.
that's exactly for sched_group with more than 1 CPU that we need to cap the usage of a CPU to 100%. Otherwise, the group could be seen as overloaded (CPU0 usage at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity
You add up the individual cpu usage values for a group by sgs->group_usage += get_cpu_usage(i) in update_sg_lb_stats and later use sgs->group_usage in group_is_overloaded to compare it against sgs->group_capacity (taking imbalance_pct into consideration).
return (usage * capacity) >> SCHED_LOAD_SHIFT;
Nit-pick: Since you're multiplying by a capacity value (rq->cpu_capacity_orig) you should shift by SCHED_CAPACITY_SHIFT.
we want to compare the output of the function with some capacity figures so i think that >> SCHED_LOAD_SHIFT is the right operation.
Could you explain more why '>> SCHED_LOAD_SHIFT' instead of '>> SCHED_CAPACITY_SHIFT'?
Regards, Wanpeng Li
Just to make sure: You do this scaling of usage by cpu_capacity_orig here only to cater for the fact that cpu_capacity_orig might be uarch scaled (by arch_scale_cpu_capacity, !SMT) in update_cpu_capacity while
I do this for any system with CPUs that have an original capacity that is different from SCHED_CAPACITY_SCALE so it's for both uArch and SMT.
utilization_load_avg is currently not. We don't even uArch scale on ARM TC2 big.LITTLE platform in mainline today due to the missing clock-frequency property in the device tree.
sorry i don't catch your point
I think it's hard for people to grasp that your patch-set takes uArch scaling of capacity into consideration but not frequency scaling of capacity (via arch_scale_freq_capacity, not used at the moment).
+}
- /*
- select_task_rq_fair: Select target runqueue for the waking task in domains
- that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5663,6 +5674,7 @@ struct sg_lb_stats { unsigned long sum_weighted_load; /* Weighted load of group's tasks */ unsigned long load_per_task; unsigned long group_capacity;
unsigned long group_usage; /* Total usage of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ unsigned int group_capacity_factor; unsigned int idle_cpus;
@@ -6037,6 +6049,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, load = source_load(i, load_idx);
sgs->group_load += load;
sgs->group_usage += get_cpu_usage(i); sgs->sum_nr_running += rq->cfs.h_nr_running; if (rq->nr_running > 1)
On 21 November 2014 06:36, Wanpeng Li kernellwp@gmail.com wrote:
Hi Vincent,
On 9/26/14, 8:17 PM, Vincent Guittot wrote:
[snip]
You add up the individual cpu usage values for a group by sgs->group_usage += get_cpu_usage(i) in update_sg_lb_stats and later use sgs->group_usage in group_is_overloaded to compare it against sgs->group_capacity (taking imbalance_pct into consideration).
return (usage * capacity) >> SCHED_LOAD_SHIFT;
Nit-pick: Since you're multiplying by a capacity value (rq->cpu_capacity_orig) you should shift by SCHED_CAPACITY_SHIFT.
we want to compare the output of the function with some capacity figures so i think that >> SCHED_LOAD_SHIFT is the right operation.
Could you explain more why '>> SCHED_LOAD_SHIFT' instead of '>> SCHED_CAPACITY_SHIFT'?
The return of get_cpu_usage is going to be compared with capacity so we need to return CAPACITY unit
usage unit is LOAD, capacity unit is CAPACITY so we have to divide by LOAD unit to return CAPACITY unit
LOAD unit * CAPACITY unit / LOAD unit -> CAPACITY unit
Regards, Vincent
Regards, Wanpeng Li
Just to make sure: You do this scaling of usage by cpu_capacity_orig here only to cater for the fact that cpu_capacity_orig might be uarch scaled (by arch_scale_cpu_capacity, !SMT) in update_cpu_capacity while
I do this for any system with CPUs that have an original capacity that is different from SCHED_CAPACITY_SCALE so it's for both uArch and SMT.
utilization_load_avg is currently not. We don't even uArch scale on ARM TC2 big.LITTLE platform in mainline today due to the missing clock-frequency property in the device tree.
sorry i don't catch your point
I think it's hard for people to grasp that your patch-set takes uArch scaling of capacity into consideration but not frequency scaling of capacity (via arch_scale_freq_capacity, not used at the moment).
+}
- /*
- select_task_rq_fair: Select target runqueue for the waking task in
domains
- that have the 'sd_flag' flag set. In practice, this is
SD_BALANCE_WAKE, @@ -5663,6 +5674,7 @@ struct sg_lb_stats { unsigned long sum_weighted_load; /* Weighted load of group's tasks */ unsigned long load_per_task; unsigned long group_capacity;
unsigned long group_usage; /* Total usage of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ unsigned int group_capacity_factor; unsigned int idle_cpus;
@@ -6037,6 +6049,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, load = source_load(i, load_idx);
sgs->group_load += load;
sgs->group_usage += get_cpu_usage(i); sgs->sum_nr_running += rq->cfs.h_nr_running; if (rq->nr_running > 1)
The scheduler tries to compute how many tasks a group of CPUs can handle by assuming that a task's load is SCHED_LOAD_SCALE and that a CPU's capacity is SCHED_CAPACITY_SCALE, but the capacity_factor hardly works for SMT systems: it sometimes works for big cores but fails to do the right thing for little cores.
Below are two examples to illustrate the problem that this patch solves:
1 - capacity_factor makes the assumption that the max capacity of a CPU is SCHED_CAPACITY_SCALE and that the load of a thread is always SCHED_LOAD_SCALE. It compares the capacity factor derived from these figures with the sum of nr_running to decide whether a group is overloaded or not.
But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE (640 as an example), a group of 3 CPUs will have a max capacity_factor of 2 (div_round_closest(3*640/1024) = 2), which means that it will be seen as overloaded even if we have only one task per CPU.
2 - Conversely, if the default capacity of a CPU is greater than SCHED_CAPACITY_SCALE (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4 (at max, thanks to the fix [0] for SMT systems that prevents the appearance of ghost CPUs), but if one CPU is fully used by an rt task (and its capacity is reduced to nearly nothing), the capacity_factor of the group will still be 4 (div_round_closest(3*1512/1024) = 4).
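The arithmetic of both examples can be checked with a small standalone program (illustrative only, not part of the patch); DIV_ROUND_CLOSEST is re-implemented here as the kernel macro behaves for positive values.

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024UL
#define DIV_ROUND_CLOSEST(x, d)	(((x) + (d) / 2) / (d))

int main(void)
{
	/* example 1: 3 little CPUs of capacity 640 each */
	printf("capacity_factor = %lu (3 CPUs, so 3 tasks already look like overload)\n",
	       DIV_ROUND_CLOSEST(3 * 640UL, SCHED_CAPACITY_SCALE));

	/* example 2: 4 big CPUs of capacity 1512, one fully eaten by an rt task */
	printf("capacity_factor = %lu (group still considered able to take 4 tasks)\n",
	       DIV_ROUND_CLOSEST(3 * 1512UL, SCHED_CAPACITY_SCALE));
	return 0;
}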
So, this patch tries to solve this issue by removing capacity_factor and replacing it with the 2 following metrics:
 - the available CPU capacity for CFS tasks, which is already used by load_balance;
 - the usage of the CPU by the CFS tasks.
For the latter, I have re-introduced the utilization_avg_contrib, which is in the range [0..SCHED_LOAD_SCALE] whatever the capacity of the CPU is.
This implementation of utilization_avg_contrib doesn't solve the scale-invariance problem, so I have to scale the utilization with the original capacity of the CPU in order to get the CPU usage and compare it with the capacity. Once scale invariance has been added to utilization_avg_contrib, we will remove the scaling of utilization_avg_contrib by cpu_capacity_orig in get_cpu_usage. But scale invariance will come in another patchset.
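As an illustration of this interim scaling, here is a standalone sketch (not the patch itself; the A7/A15 capacity values of 606 and 1441 are taken from the TC2 discussion earlier in the thread and are used purely as examples). The same 50% utilization translates into different usage figures depending on the CPU's original capacity:

#include <stdio.h>

#define SCHED_LOAD_SHIFT	10	/* SCHED_LOAD_SCALE = 1024 */

static unsigned long get_usage(unsigned long util, unsigned long capacity_orig)
{
	/* LOAD unit * CAPACITY unit / LOAD unit -> CAPACITY unit */
	return (util * capacity_orig) >> SCHED_LOAD_SHIFT;
}

int main(void)
{
	unsigned long util = 512;	/* 50% of SCHED_LOAD_SCALE */

	printf("usage on an A7  (capacity 606):  %lu\n", get_usage(util, 606));
	printf("usage on an A15 (capacity 1441): %lu\n", get_usage(util, 1441));
	return 0;
}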
Finally, the sched_group->sched_group_capacity->capacity_orig has been removed because it is no longer used during load balance.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  |  12 -----
 kernel/sched/fair.c  | 131 ++++++++++++++++++++------------------------------
 kernel/sched/sched.h |   2 +-
 3 files changed, 52 insertions(+), 93 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e20f203..c7c8ac4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5342,17 +5342,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level, break; }
- /* - * Even though we initialize ->capacity to something semi-sane, - * we leave capacity_orig unset. This allows us to detect if - * domain iteration is still funny without causing /0 traps. - */ - if (!group->sgc->capacity_orig) { - printk(KERN_CONT "\n"); - printk(KERN_ERR "ERROR: domain->cpu_capacity not set\n"); - break; - } - if (!cpumask_weight(sched_group_cpus(group))) { printk(KERN_CONT "\n"); printk(KERN_ERR "ERROR: empty group\n"); @@ -5837,7 +5826,6 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) * die on a /0 trap. */ sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span); - sg->sgc->capacity_orig = sg->sgc->capacity;
/* * Make sure the first group of this domain contains the diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4097e3f..2ba278c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5676,11 +5676,10 @@ struct sg_lb_stats { unsigned long group_capacity; unsigned long group_usage; /* Total usage of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ - unsigned int group_capacity_factor; unsigned int idle_cpus; unsigned int group_weight; enum group_type group_type; - int group_has_free_capacity; + int group_out_of_capacity; #ifdef CONFIG_NUMA_BALANCING unsigned int nr_numa_running; unsigned int nr_preferred_running; @@ -5821,7 +5820,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu) capacity >>= SCHED_CAPACITY_SHIFT;
cpu_rq(cpu)->cpu_capacity_orig = capacity; - sdg->sgc->capacity_orig = capacity;
if (sched_feat(ARCH_CAPACITY)) capacity *= arch_scale_freq_capacity(sd, cpu); @@ -5844,7 +5842,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu) { struct sched_domain *child = sd->child; struct sched_group *group, *sdg = sd->groups; - unsigned long capacity, capacity_orig; + unsigned long capacity; unsigned long interval;
interval = msecs_to_jiffies(sd->balance_interval); @@ -5856,7 +5854,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu) return; }
- capacity_orig = capacity = 0; + capacity = 0;
if (child->flags & SD_OVERLAP) { /* @@ -5876,19 +5874,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu) * Use capacity_of(), which is set irrespective of domains * in update_cpu_capacity(). * - * This avoids capacity/capacity_orig from being 0 and + * This avoids capacity from being 0 and * causing divide-by-zero issues on boot. - * - * Runtime updates will correct capacity_orig. */ if (unlikely(!rq->sd)) { - capacity_orig += capacity_orig_of(cpu); capacity += capacity_of(cpu); continue; }
sgc = rq->sd->groups->sgc; - capacity_orig += sgc->capacity_orig; capacity += sgc->capacity; } } else { @@ -5899,42 +5893,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
group = child->groups; do { - capacity_orig += group->sgc->capacity_orig; capacity += group->sgc->capacity; group = group->next; } while (group != child->groups); }
- sdg->sgc->capacity_orig = capacity_orig; sdg->sgc->capacity = capacity; }
/* - * Try and fix up capacity for tiny siblings, this is needed when - * things like SD_ASYM_PACKING need f_b_g to select another sibling - * which on its own isn't powerful enough. - * - * See update_sd_pick_busiest() and check_asym_packing(). - */ -static inline int -fix_small_capacity(struct sched_domain *sd, struct sched_group *group) -{ - /* - * Only siblings can have significantly less than SCHED_CAPACITY_SCALE - */ - if (!(sd->flags & SD_SHARE_CPUCAPACITY)) - return 0; - - /* - * If ~90% of the cpu_capacity is still there, we're good. - */ - if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29) - return 1; - - return 0; -} - -/* * Check whether the capacity of the rq has been noticeably reduced by side * activity. The imbalance_pct is used for the threshold. * Return true is the capacity is reduced @@ -5980,38 +5947,37 @@ static inline int sg_imbalanced(struct sched_group *group) return group->sgc->imbalance; }
-/* - * Compute the group capacity factor. - * - * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by - * first dividing out the smt factor and computing the actual number of cores - * and limit unit capacity with that. - */ -static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group) +static inline int group_has_free_capacity(struct sg_lb_stats *sgs, + struct lb_env *env) { - unsigned int capacity_factor, smt, cpus; - unsigned int capacity, capacity_orig; + if ((sgs->group_capacity * 100) > + (sgs->group_usage * env->sd->imbalance_pct)) + return true;
- capacity = group->sgc->capacity; - capacity_orig = group->sgc->capacity_orig; - cpus = group->group_weight; + if (sgs->sum_nr_running < sgs->group_weight) + return true;
- /* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */ - smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig); - capacity_factor = cpus / smt; /* cores */ + return false; +}
- capacity_factor = min_t(unsigned, - capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE)); - if (!capacity_factor) - capacity_factor = fix_small_capacity(env->sd, group); +static inline int group_is_overloaded(struct sg_lb_stats *sgs, + struct lb_env *env) +{ + if (sgs->sum_nr_running <= sgs->group_weight) + return false;
- return capacity_factor; + if ((sgs->group_capacity * 100) < + (sgs->group_usage * env->sd->imbalance_pct)) + return true; + + return false; }
static enum group_type -group_classify(struct sched_group *group, struct sg_lb_stats *sgs) +group_classify(struct sched_group *group, struct sg_lb_stats *sgs, + struct lb_env *env) { - if (sgs->sum_nr_running > sgs->group_capacity_factor) + if (group_is_overloaded(sgs, env)) return group_overloaded;
if (sg_imbalanced(group)) @@ -6072,11 +6038,10 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
sgs->group_weight = group->group_weight; - sgs->group_capacity_factor = sg_capacity_factor(env, group); - sgs->group_type = group_classify(group, sgs);
- if (sgs->group_capacity_factor > sgs->sum_nr_running) - sgs->group_has_free_capacity = 1; + sgs->group_type = group_classify(group, sgs, env); + + sgs->group_out_of_capacity = group_is_overloaded(sgs, env); }
/** @@ -6198,17 +6163,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* * In case the child domain prefers tasks go to siblings - * first, lower the sg capacity factor to one so that we'll try + * first, lower the sg capacity to one so that we'll try * and move all the excess tasks away. We lower the capacity * of a group only if the local group has the capacity to fit - * these excess tasks, i.e. nr_running < group_capacity_factor. The + * these excess tasks, i.e. group_capacity > 0. The * extra check prevents the case where you always pull from the * heaviest group when it is already under-utilized (possible * with a large weight task outweighs the tasks on the system). */ if (prefer_sibling && sds->local && - sds->local_stat.group_has_free_capacity) - sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U); + group_has_free_capacity(&sds->local_stat, env)) { + if (sgs->sum_nr_running > 1) + sgs->group_out_of_capacity = 1; + sgs->group_capacity = min(sgs->group_capacity, + SCHED_CAPACITY_SCALE); + }
if (update_sd_pick_busiest(env, sds, sg, sgs)) { sds->busiest = sg; @@ -6387,11 +6356,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s */ if (busiest->group_type == group_overloaded && local->group_type == group_overloaded) { - load_above_capacity = - (busiest->sum_nr_running - busiest->group_capacity_factor); - - load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE); - load_above_capacity /= busiest->group_capacity; + load_above_capacity = busiest->sum_nr_running * + SCHED_LOAD_SCALE; + if (load_above_capacity > busiest->group_capacity) + load_above_capacity -= busiest->group_capacity; + else + load_above_capacity = ~0UL; }
/* @@ -6454,6 +6424,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env) local = &sds.local_stat; busiest = &sds.busiest_stat;
+ /* ASYM feature bypasses nice load balance check */ if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) && check_asym_packing(env, &sds)) return sds.busiest; @@ -6474,8 +6445,9 @@ static struct sched_group *find_busiest_group(struct lb_env *env) goto force_balance;
/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */ - if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity && - !busiest->group_has_free_capacity) + if (env->idle == CPU_NEWLY_IDLE && + group_has_free_capacity(local, env) && + busiest->group_out_of_capacity) goto force_balance;
/* @@ -6533,7 +6505,7 @@ static struct rq *find_busiest_queue(struct lb_env *env, int i;
for_each_cpu_and(i, sched_group_cpus(group), env->cpus) { - unsigned long capacity, capacity_factor, wl; + unsigned long capacity, wl; enum fbq_type rt;
rq = cpu_rq(i); @@ -6562,9 +6534,6 @@ static struct rq *find_busiest_queue(struct lb_env *env, continue;
capacity = capacity_of(i); - capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE); - if (!capacity_factor) - capacity_factor = fix_small_capacity(env->sd, group);
wl = weighted_cpuload(i);
@@ -6572,7 +6541,9 @@ static struct rq *find_busiest_queue(struct lb_env *env, * When comparing with imbalance, use weighted_cpuload() * which is not scaled with the cpu capacity. */ - if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance) + + if (rq->nr_running == 1 && wl > env->imbalance && + !check_cpu_capacity(rq, env->sd)) continue;
/* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 3ccb136..8d224c5 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -750,7 +750,7 @@ struct sched_group_capacity { * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity * for a single CPU. */ - unsigned int capacity, capacity_orig; + unsigned int capacity; unsigned long next_update; int imbalance; /* XXX unrelated to capacity but shared group state */ /*
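For reference while reading the find_busiest_queue() hunk above: check_cpu_capacity() is the helper introduced earlier in this series ("sched: move cfs task on a CPU with higher capacity"). A minimal sketch of the check, assuming the rq fields named in this series (the exact kernel code may differ):

/*
 * Sketch: true when the capacity left for CFS tasks (cpu_capacity) has
 * dropped noticeably below the original capacity (cpu_capacity_orig).
 * E.g. with an imbalance_pct of 125 this means more than ~20% of the
 * CPU is consumed by rt tasks and/or irq.
 */
static inline int check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
{
        return ((rq->cpu_capacity * sd->imbalance_pct) <
                                (rq->cpu_capacity_orig * 100));
}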
On 23/09/14 17:08, Vincent Guittot wrote:
[...]
Finally, the sched_group->sched_group_capacity->capacity_orig has been removed because it's more used during load balance.
So you're not forced to call it rq->cpu_capacity_orig any more, you could use rq->cpu_capacity_max instead.
[...]
This review (by PeterZ) during v5 of your patch-set recommended some renaming (e.g. s/group_has_free_capacity/group_has_capacity and s/group_out_of_capacity/group_no_capacity) as well as reordering of the parameters, which I agree with:
https://lkml.org/lkml/2014/9/11/706
-/*
- Compute the group capacity factor.
- Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- first dividing out the smt factor and computing the actual number of cores
- and limit unit capacity with that.
- */
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group) +static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
s/static inline int/static inline bool
struct lb_env *env)
{
unsigned int capacity_factor, smt, cpus;
unsigned int capacity, capacity_orig;
if ((sgs->group_capacity * 100) >
(sgs->group_usage * env->sd->imbalance_pct))
return true;
capacity = group->sgc->capacity;
capacity_orig = group->sgc->capacity_orig;
cpus = group->group_weight;
if (sgs->sum_nr_running < sgs->group_weight)
return true;
/* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
capacity_factor = cpus / smt; /* cores */
return false;
+}
capacity_factor = min_t(unsigned,
capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
if (!capacity_factor)
capacity_factor = fix_small_capacity(env->sd, group);
+static inline int group_is_overloaded(struct sg_lb_stats *sgs,
s/static inline int/static inline bool
struct lb_env *env)
+{
if (sgs->sum_nr_running <= sgs->group_weight)
return false;
return capacity_factor;
if ((sgs->group_capacity * 100) <
(sgs->group_usage * env->sd->imbalance_pct))
return true;
return false;
}
static enum group_type -group_classify(struct sched_group *group, struct sg_lb_stats *sgs) +group_classify(struct sched_group *group, struct sg_lb_stats *sgs,
struct lb_env *env)
{
if (sgs->sum_nr_running > sgs->group_capacity_factor)
if (group_is_overloaded(sgs, env)) return group_overloaded; if (sg_imbalanced(group))
@@ -6072,11 +6038,10 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
sgs->group_weight = group->group_weight;
sgs->group_capacity_factor = sg_capacity_factor(env, group);
sgs->group_type = group_classify(group, sgs);
if (sgs->group_capacity_factor > sgs->sum_nr_running)
sgs->group_has_free_capacity = 1;
sgs->group_type = group_classify(group, sgs, env);
sgs->group_out_of_capacity = group_is_overloaded(sgs, env);
In case sgs->group_type is group_overloaded you could set sgs->group_out_of_capacity to 1 without calling group_is_overloaded again.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7cdf271e8e52..52d441c92a4f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6037,7 +6037,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_type = group_classify(group, sgs, env);
- sgs->group_out_of_capacity = group_is_overloaded(sgs, env); + if (sgs->group_type == group_overloaded) + sgs->group_out_of_capacity = 1; }
}
/** @@ -6198,17 +6163,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
[...]
On 24 September 2014 19:48, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 23/09/14 17:08, Vincent Guittot wrote:
[snip]
This review (by PeterZ) during v5 of your patch-set recommended some renaming (e.g. s/group_has_free_capacity/group_has_capacity and s/group_out_of_capacity/group_no_capacity as well as reordering of the parameters which I agree with:
Ah... you're right, these changes slipped through my renaming session
-/*
[snip]
if (sgs->group_capacity_factor > sgs->sum_nr_running)
sgs->group_has_free_capacity = 1;
sgs->group_type = group_classify(group, sgs, env);
sgs->group_out_of_capacity = group_is_overloaded(sgs, env);
In case sgs->group_type is group_overloaded you could set sgs->group_out_of_capacity to 1 without calling group_is_overloaded again.
I prefer to keep sgs->group_out_of_capacity = group_is_overloaded(sgs, env) and use it in group_classify in case of future changes in the classification order
Regards, Vincent
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7cdf271e8e52..52d441c92a4f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6037,7 +6037,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_type = group_classify(group, sgs, env);
sgs->group_out_of_capacity = group_is_overloaded(sgs, env);
if (sgs->group_type == group_overloaded)
sgs->group_out_of_capacity = 1;
}
}
/** @@ -6198,17 +6163,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
[...]
On 25/09/14 09:35, Vincent Guittot wrote:
On 24 September 2014 19:48, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 23/09/14 17:08, Vincent Guittot wrote:
[snip]
This review (by PeterZ) during v5 of your patch-set recommended some renaming (e.g. s/group_has_free_capacity/group_has_capacity and s/group_out_of_capacity/group_no_capacity as well as reordering of the parameters which I agree with:
Ah... you're right, these changes have passed through my seance of renaming
What about the ordering of the function parameters in group_has_capacity, group_is_overloaded and group_type group_classify?
All the existing *load balance* related functions in fair.c seem to follow this (struct lb_env *env, struct sd_lb_stats *sds, struct sched_group *group, struct sg_lb_stats *sgs) order.
-/*
[snip]
if (sgs->group_capacity_factor > sgs->sum_nr_running)
sgs->group_has_free_capacity = 1;
sgs->group_type = group_classify(group, sgs, env);
sgs->group_out_of_capacity = group_is_overloaded(sgs, env);
In case sgs->group_type is group_overloaded you could set sgs->group_out_of_capacity to 1 without calling group_is_overloaded again.
I prefer to keep sgs->group_out_of_capacity = group_is_overloaded(sgs, env) and use it in group_classify in case of future changes in the classification order
Ok, but then group_is_overloaded is called twice at the end of update_sg_lb_stats with exactly the same result. Looks weird in my traces.
[...]
On 25 September 2014 21:19, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 25/09/14 09:35, Vincent Guittot wrote:
[snip]
In case sgs->group_type is group_overloaded you could set sgs->group_out_of_capacity to 1 without calling group_is_overloaded again.
I prefer to keep sgs->group_out_of_capacity = group_is_overloaded(sgs, env) and use it in group_classify in case of future changes in the classification order
Ok, but than group_is_overloaded is called twice at the end of update_sg_lb_stats with exactly the same result. Looks weird in my traces.
Sorry, I don't get your point. group_is_overloaded() is called once for group_out_of_capacity, which is then used in group_classify, so it should be called only once
[...]
On 26/09/14 13:39, Vincent Guittot wrote:
On 25 September 2014 21:19, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 25/09/14 09:35, Vincent Guittot wrote:
[snip]
In case sgs->group_type is group_overloaded you could set sgs->group_out_of_capacity to 1 without calling group_is_overloaded again.
I prefer to keep sgs->group_out_of_capacity = group_is_overloaded(sgs, env) and use it in group_classify in case of future changes in the classification order
Ok, but than group_is_overloaded is called twice at the end of update_sg_lb_stats with exactly the same result. Looks weird in my traces.
sorry i don't catch your point. group_is_overloaded() is called once for group_out_of_capacity which is then used in group_classify so it should be called only once
You're right, I made a mistake in my local setup while replacing this patch with the new version. Problem doesn't exist any more.
[...]
The scheduler tries to compute how many tasks a group of CPUs can handle by assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is SCHED_CAPACITY_SCALE, but the capacity_factor hardly works for SMT systems; it sometimes works for big cores but fails to do the right thing for little cores.
Below are two examples to illustrate the problem that this patch solves:
1 - capacity_factor makes the assumption that the max capacity of a CPU is SCHED_CAPACITY_SCALE and that the load of a thread is always SCHED_LOAD_SCALE. It compares the output of these figures with the sum of nr_running to decide if a group is overloaded or not.
But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE (640 as an example), a group of 3 CPUs will have a max capacity_factor of 2 (div_round_closest(3*640/1024) = 2), which means that it will be seen as overloaded even though we have only one task per CPU.
2 - Then, if the default capacity of a CPU is greater than SCHED_CAPACITY_SCALE (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4 at max (thanks to the fix [0] for SMT systems that prevents the appearance of ghost CPUs), but if one CPU is fully used by an rt task (and its capacity is reduced to nearly nothing), the capacity_factor of the group will still be 4 (div_round_closest(3*1512/1024) = 4), so the lost CPU goes unnoticed. A short standalone sketch of this arithmetic follows below.
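To make the arithmetic of the two examples concrete, here is a standalone user-space sketch (illustrative only: DIV_ROUND_CLOSEST is re-implemented locally and the capacity values are the ones quoted above):

#include <stdio.h>

#define SCHED_CAPACITY_SCALE    1024
#define DIV_ROUND_CLOSEST(x, d) (((x) + (d) / 2) / (d))

int main(void)
{
        /* Example 1: 3 little CPUs with an original capacity of 640 each. */
        unsigned int little = DIV_ROUND_CLOSEST(3 * 640, SCHED_CAPACITY_SCALE);

        /*
         * Example 2: 4 big CPUs with an original capacity of 1512 each, but
         * one CPU is fully consumed by an rt task, so only ~3 CPUs worth of
         * capacity is left for CFS.
         */
        unsigned int big = DIV_ROUND_CLOSEST(3 * 1512, SCHED_CAPACITY_SCALE);

        /* Prints 2: one task per CPU already looks like an overloaded group. */
        printf("little group capacity_factor = %u\n", little);
        /* Prints 4: the CPU lost to the rt task goes completely unnoticed. */
        printf("big group capacity_factor = %u\n", big);

        return 0;
}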
So, this patch tries to solve this issue by removing capacity_factor and replacing it with the two following metrics:
- the available CPU capacity for CFS tasks, which is already used by load_balance;
- the usage of the CPU by the CFS tasks.
For the latter, I have re-introduced the utilization_avg_contrib which is in the range [0..SCHED_CPU_LOAD] whatever the capacity of the CPU is.
This implementation of utilization_avg_contrib doesn't solve the scaling invariance problem, so I have to scale the utilization with the original capacity of the CPU in order to get the CPU usage and compare it with the capacity. Once scaling invariance has been added to utilization_avg_contrib, we will remove the scaling of utilization_avg_contrib by cpu_capacity_orig in get_cpu_usage. But the scaling invariance will come in another patchset.
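As a rough sketch of the scaling described above, the usage of a CPU can be derived from utilization_avg_contrib roughly like this (the real helper is added in the "sched: get CPU's usage statistic" patch of this series; the exact code may differ):

/*
 * Sketch: scale the (not yet scale-invariant) utilization sum of the rq
 * by the CPU's original capacity so that it can be compared directly
 * with the group capacity. The result is capped once the CPU is full.
 */
static int get_cpu_usage(int cpu)
{
        unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
        unsigned long capacity = capacity_orig_of(cpu);

        if (usage >= SCHED_LOAD_SCALE)
                return capacity + 1;

        return (usage * capacity) >> SCHED_LOAD_SHIFT;
}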
Finally, the sched_group->sched_group_capacity->capacity_orig has been removed because it's more used during load balance.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org ---
Hi,
This update of the patch takes into account the missing renamings as pointed out by Dietmar
Regards, Vincent
kernel/sched/core.c | 12 ----- kernel/sched/fair.c | 131 ++++++++++++++++++++------------------------------- kernel/sched/sched.h | 2 +- 3 files changed, 51 insertions(+), 94 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e20f203..c7c8ac4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5342,17 +5342,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level, break; }
- /* - * Even though we initialize ->capacity to something semi-sane, - * we leave capacity_orig unset. This allows us to detect if - * domain iteration is still funny without causing /0 traps. - */ - if (!group->sgc->capacity_orig) { - printk(KERN_CONT "\n"); - printk(KERN_ERR "ERROR: domain->cpu_capacity not set\n"); - break; - } - if (!cpumask_weight(sched_group_cpus(group))) { printk(KERN_CONT "\n"); printk(KERN_ERR "ERROR: empty group\n"); @@ -5837,7 +5826,6 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu) * die on a /0 trap. */ sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span); - sg->sgc->capacity_orig = sg->sgc->capacity;
/* * Make sure the first group of this domain contains the diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4097e3f..f305f17 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5676,11 +5676,10 @@ struct sg_lb_stats { unsigned long group_capacity; unsigned long group_usage; /* Total usage of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ - unsigned int group_capacity_factor; unsigned int idle_cpus; unsigned int group_weight; enum group_type group_type; - int group_has_free_capacity; + int group_no_capacity; #ifdef CONFIG_NUMA_BALANCING unsigned int nr_numa_running; unsigned int nr_preferred_running; @@ -5821,7 +5820,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu) capacity >>= SCHED_CAPACITY_SHIFT;
cpu_rq(cpu)->cpu_capacity_orig = capacity; - sdg->sgc->capacity_orig = capacity;
if (sched_feat(ARCH_CAPACITY)) capacity *= arch_scale_freq_capacity(sd, cpu); @@ -5844,7 +5842,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu) { struct sched_domain *child = sd->child; struct sched_group *group, *sdg = sd->groups; - unsigned long capacity, capacity_orig; + unsigned long capacity; unsigned long interval;
interval = msecs_to_jiffies(sd->balance_interval); @@ -5856,7 +5854,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu) return; }
- capacity_orig = capacity = 0; + capacity = 0;
if (child->flags & SD_OVERLAP) { /* @@ -5876,19 +5874,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu) * Use capacity_of(), which is set irrespective of domains * in update_cpu_capacity(). * - * This avoids capacity/capacity_orig from being 0 and + * This avoids capacity from being 0 and * causing divide-by-zero issues on boot. - * - * Runtime updates will correct capacity_orig. */ if (unlikely(!rq->sd)) { - capacity_orig += capacity_orig_of(cpu); capacity += capacity_of(cpu); continue; }
sgc = rq->sd->groups->sgc; - capacity_orig += sgc->capacity_orig; capacity += sgc->capacity; } } else { @@ -5899,42 +5893,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
group = child->groups; do { - capacity_orig += group->sgc->capacity_orig; capacity += group->sgc->capacity; group = group->next; } while (group != child->groups); }
- sdg->sgc->capacity_orig = capacity_orig; sdg->sgc->capacity = capacity; }
/* - * Try and fix up capacity for tiny siblings, this is needed when - * things like SD_ASYM_PACKING need f_b_g to select another sibling - * which on its own isn't powerful enough. - * - * See update_sd_pick_busiest() and check_asym_packing(). - */ -static inline int -fix_small_capacity(struct sched_domain *sd, struct sched_group *group) -{ - /* - * Only siblings can have significantly less than SCHED_CAPACITY_SCALE - */ - if (!(sd->flags & SD_SHARE_CPUCAPACITY)) - return 0; - - /* - * If ~90% of the cpu_capacity is still there, we're good. - */ - if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29) - return 1; - - return 0; -} - -/* * Check whether the capacity of the rq has been noticeably reduced by side * activity. The imbalance_pct is used for the threshold. * Return true is the capacity is reduced @@ -5980,38 +5947,37 @@ static inline int sg_imbalanced(struct sched_group *group) return group->sgc->imbalance; }
-/* - * Compute the group capacity factor. - * - * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by - * first dividing out the smt factor and computing the actual number of cores - * and limit unit capacity with that. - */ -static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group) +static inline bool +group_has_capacity(struct sg_lb_stats *sgs, struct lb_env *env) { - unsigned int capacity_factor, smt, cpus; - unsigned int capacity, capacity_orig; + if ((sgs->group_capacity * 100) > + (sgs->group_usage * env->sd->imbalance_pct)) + return true;
- capacity = group->sgc->capacity; - capacity_orig = group->sgc->capacity_orig; - cpus = group->group_weight; + if (sgs->sum_nr_running < sgs->group_weight) + return true;
- /* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */ - smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig); - capacity_factor = cpus / smt; /* cores */ + return false; +}
- capacity_factor = min_t(unsigned, - capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE)); - if (!capacity_factor) - capacity_factor = fix_small_capacity(env->sd, group); +static inline bool +group_is_overloaded(struct sg_lb_stats *sgs, struct lb_env *env) +{ + if (sgs->sum_nr_running <= sgs->group_weight) + return false; + + if ((sgs->group_capacity * 100) < + (sgs->group_usage * env->sd->imbalance_pct)) + return true;
- return capacity_factor; + return false; }
-static enum group_type -group_classify(struct sched_group *group, struct sg_lb_stats *sgs) +static enum group_type group_classify(struct sched_group *group, + struct sg_lb_stats *sgs, + struct lb_env *env) { - if (sgs->sum_nr_running > sgs->group_capacity_factor) + if (sgs->group_no_capacity) return group_overloaded;
if (sg_imbalanced(group)) @@ -6072,11 +6038,9 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
sgs->group_weight = group->group_weight; - sgs->group_capacity_factor = sg_capacity_factor(env, group); - sgs->group_type = group_classify(group, sgs);
- if (sgs->group_capacity_factor > sgs->sum_nr_running) - sgs->group_has_free_capacity = 1; + sgs->group_no_capacity = group_is_overloaded(sgs, env); + sgs->group_type = group_classify(group, sgs, env); }
/** @@ -6198,17 +6162,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* * In case the child domain prefers tasks go to siblings - * first, lower the sg capacity factor to one so that we'll try + * first, lower the sg capacity to one so that we'll try * and move all the excess tasks away. We lower the capacity * of a group only if the local group has the capacity to fit - * these excess tasks, i.e. nr_running < group_capacity_factor. The + * these excess tasks, i.e. group_capacity > 0. The * extra check prevents the case where you always pull from the * heaviest group when it is already under-utilized (possible * with a large weight task outweighs the tasks on the system). */ if (prefer_sibling && sds->local && - sds->local_stat.group_has_free_capacity) - sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U); + group_has_capacity(&sds->local_stat, env)) { + if (sgs->sum_nr_running > 1) + sgs->group_no_capacity = 1; + sgs->group_capacity = min(sgs->group_capacity, + SCHED_CAPACITY_SCALE); + }
if (update_sd_pick_busiest(env, sds, sg, sgs)) { sds->busiest = sg; @@ -6387,11 +6355,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s */ if (busiest->group_type == group_overloaded && local->group_type == group_overloaded) { - load_above_capacity = - (busiest->sum_nr_running - busiest->group_capacity_factor); - - load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE); - load_above_capacity /= busiest->group_capacity; + load_above_capacity = busiest->sum_nr_running * + SCHED_LOAD_SCALE; + if (load_above_capacity > busiest->group_capacity) + load_above_capacity -= busiest->group_capacity; + else + load_above_capacity = ~0UL; }
/* @@ -6454,6 +6423,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env) local = &sds.local_stat; busiest = &sds.busiest_stat;
+ /* ASYM feature bypasses nice load balance check */ if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) && check_asym_packing(env, &sds)) return sds.busiest; @@ -6474,8 +6444,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env) goto force_balance;
/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */ - if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity && - !busiest->group_has_free_capacity) + if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(local, env) && + busiest->group_no_capacity) goto force_balance;
/* @@ -6533,7 +6503,7 @@ static struct rq *find_busiest_queue(struct lb_env *env, int i;
for_each_cpu_and(i, sched_group_cpus(group), env->cpus) { - unsigned long capacity, capacity_factor, wl; + unsigned long capacity, wl; enum fbq_type rt;
rq = cpu_rq(i); @@ -6562,9 +6532,6 @@ static struct rq *find_busiest_queue(struct lb_env *env, continue;
capacity = capacity_of(i); - capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE); - if (!capacity_factor) - capacity_factor = fix_small_capacity(env->sd, group);
wl = weighted_cpuload(i);
@@ -6572,7 +6539,9 @@ static struct rq *find_busiest_queue(struct lb_env *env, * When comparing with imbalance, use weighted_cpuload() * which is not scaled with the cpu capacity. */ - if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance) + + if (rq->nr_running == 1 && wl > env->imbalance && + !check_cpu_capacity(rq, env->sd)) continue;
/* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 3ccb136..8d224c5 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -750,7 +750,7 @@ struct sched_group_capacity { * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity * for a single CPU. */ - unsigned int capacity, capacity_orig; + unsigned int capacity; unsigned long next_update; int imbalance; /* XXX unrelated to capacity but shared group state */ /*
On 23/09/14 17:08, Vincent Guittot wrote:
The scheduler tries to compute how many tasks a group of CPUs can handle by assuming that a task's load is SCHED_LOAD_SCALE and a CPU capacity is SCHED_CAPACITY_SCALE but the capacity_factor is hardly working for SMT system, it sometimes works for big cores but fails to do the right thing for little cores.
Below are two examples to illustrate the problem that this patch solves:
1 - capacity_factor makes the assumption that max capacity of a CPU is SCHED_CAPACITY_SCALE and the load of a thread is always is SCHED_LOAD_SCALE. It compares the output of these figures with the sum of nr_running to decide if a group is overloaded or not.
But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE (640 as an example), a group of 3 CPUS will have a max capacity_factor of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be seen as overloaded if we have only one task per CPU.
2 - Then, if the default capacity of a CPU is greater than SCHED_CAPACITY_SCALE (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4 (at max and thanks to the fix[0] for SMT system that prevent the apparition of ghost CPUs) but if one CPU is fully used by a rt task (and its capacity is reduced to nearly nothing), the capacity factor of the group will still be 4 (div_round_closest(3*1512/1024) = 5).
So, this patch tries to solve this issue by removing capacity_factor and replacing it with the 2 following metrics : -The available CPU's capacity for CFS tasks which is the already used by load_balance. -The usage of the CPU by the CFS tasks. For the latter, I have re-introduced the utilization_avg_contrib which is in the range [0..SCHED_CPU_LOAD] whatever the capacity of the CPU is.
IMHO, this last sentence is misleading. The usage of a cpu can be temporarily unbounded (in case a lot of tasks have just been spawned on this cpu, testcase: hackbench) but it converges very quickly towards a value between [0..1024]. Your implementation is already handling this case by capping usage to cpu_rq(cpu)->capacity_orig + 1. BTW, I couldn't find the definition of SCHED_CPU_LOAD.
[...]
On Tue, Sep 23, 2014 at 05:08:04PM +0100, Vincent Guittot wrote:
Below are two examples to illustrate the problem that this patch solves:
1 - capacity_factor makes the assumption that max capacity of a CPU is SCHED_CAPACITY_SCALE and the load of a thread is always is SCHED_LOAD_SCALE. It compares the output of these figures with the sum of nr_running to decide if a group is overloaded or not.
But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE (640 as an example), a group of 3 CPUS will have a max capacity_factor of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be seen as overloaded if we have only one task per CPU.
I did some testing on TC2, which has the setup you describe above on the A7 cluster when the clock-frequency property is set in DT. The two A15s have max capacities above 1024. When using sysbench with five threads I still get three tasks on the two A15s and two tasks on the three A7s, leaving one cpu idle (on average).
Using cpu utilization (usage) does correctly identify the A15 cluster as overloaded and it even gets as far as selecting the A15 running two tasks as the source cpu in load_balance(). However, load_balance() bails out without pulling the task due to calculate_imbalance() returning a zero imbalance. calculate_imbalance() bails out just before the hunk you changed due to comparison of the sched_group avg_loads. sgs->avg_load is basically the sum of load in the group divided by its capacity. Since load isn't scaled the avg_load of the overloaded A15 group is slightly _lower_ than the partially idle A7 group. Hence calculate_imbalance() bails out, which isn't what we want.
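To put rough numbers on this (illustrative capacities only, close to the TC2 values quoted in this thread, with an unscaled load of 1024 per task):

#include <stdio.h>

#define SCHED_CAPACITY_SCALE    1024UL

int main(void)
{
        /* 2 A15s of original capacity ~1512, 3 A7s of capacity 640. */
        unsigned long a15_capacity = 2 * 1512, a7_capacity = 3 * 640;
        /* 3 tasks ended up on the A15 group, 2 on the A7 group. */
        unsigned long a15_load = 3 * 1024, a7_load = 2 * 1024;

        /* avg_load = group load * SCHED_CAPACITY_SCALE / group capacity */
        printf("A15 avg_load = %lu\n", a15_load * SCHED_CAPACITY_SCALE / a15_capacity);
        printf("A7  avg_load = %lu\n", a7_load * SCHED_CAPACITY_SCALE / a7_capacity);

        /*
         * Prints ~1040 for the A15 group and ~1092 for the A7 group: the
         * overloaded A15 group ends up with the lower avg_load, so
         * calculate_imbalance() sees nothing to pull.
         */
        return 0;
}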
I think we need to have a closer look at the imbalance calculation code and any other users of sgs->avg_load to get rid of all code making assumptions about cpu capacity.
Morten
On 2 October 2014 18:57, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Tue, Sep 23, 2014 at 05:08:04PM +0100, Vincent Guittot wrote:
Below are two examples to illustrate the problem that this patch solves:
1 - capacity_factor makes the assumption that max capacity of a CPU is SCHED_CAPACITY_SCALE and the load of a thread is always is SCHED_LOAD_SCALE. It compares the output of these figures with the sum of nr_running to decide if a group is overloaded or not.
But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE (640 as an example), a group of 3 CPUS will have a max capacity_factor of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be seen as overloaded if we have only one task per CPU.
I did some testing on TC2 which has the setup you describe above on the A7 cluster when the clock-frequency property is set in DT. The two A15s have max capacities above 1024. When using sysbench with five threads I still get three tasks on the two A15s and two tasks on the three A7 leaving one cpu idle (on average).
Using cpu utilization (usage) does correctly identify the A15 cluster as overloaded and it even gets as far as selecting the A15 running two tasks as the source cpu in load_balance(). However, load_balance() bails out without pulling the task due to calculate_imbalance() returning a zero imbalance. calculate_imbalance() bails out just before the hunk you changed due to comparison of the sched_group avg_loads. sgs->avg_load is basically the sum of load in the group divided by its capacity. Since load isn't scaled the avg_load of the overloaded A15 group is slightly _lower_ than the partially idle A7 group. Hence calculate_imbalance() bails out, which isn't what we want.
I think we need to have a closer look at the imbalance calculation code and any other users of sgs->avg_load to get rid of all code making assumptions about cpu capacity.
We already had this discussion during the review of a previous version https://lkml.org/lkml/2014/6/3/422 and my answer has not changed; this patch is necessary to solve the 1 task per CPU issue of HMP systems but is not enough. I have a patch for solving the imbalance calculation issue and I have planned to send it once this patch is in good shape to be accepted by Peter.
I don't want to mix this patch and the next one because they are addressing different problems: this one is about how to evaluate the capacity of a system and detect when a group is overloaded, and the next one will handle the case where the balance of the system can't rely on the average load figures of the group because we have a significant capacity difference between groups. Note that I haven't specifically mentioned HMP for the last patch because SMP systems can also take advantage of it
Regards, Vincent
Morten
On Fri, Oct 03, 2014 at 08:24:23AM +0100, Vincent Guittot wrote:
On 2 October 2014 18:57, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Tue, Sep 23, 2014 at 05:08:04PM +0100, Vincent Guittot wrote:
Below are two examples to illustrate the problem that this patch solves:
1 - capacity_factor makes the assumption that max capacity of a CPU is SCHED_CAPACITY_SCALE and the load of a thread is always is SCHED_LOAD_SCALE. It compares the output of these figures with the sum of nr_running to decide if a group is overloaded or not.
But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE (640 as an example), a group of 3 CPUS will have a max capacity_factor of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be seen as overloaded if we have only one task per CPU.
I did some testing on TC2 which has the setup you describe above on the A7 cluster when the clock-frequency property is set in DT. The two A15s have max capacities above 1024. When using sysbench with five threads I still get three tasks on the two A15s and two tasks on the three A7 leaving one cpu idle (on average).
Using cpu utilization (usage) does correctly identify the A15 cluster as overloaded and it even gets as far as selecting the A15 running two tasks as the source cpu in load_balance(). However, load_balance() bails out without pulling the task due to calculate_imbalance() returning a zero imbalance. calculate_imbalance() bails out just before the hunk you changed due to comparison of the sched_group avg_loads. sgs->avg_load is basically the sum of load in the group divided by its capacity. Since load isn't scaled the avg_load of the overloaded A15 group is slightly _lower_ than the partially idle A7 group. Hence calculate_imbalance() bails out, which isn't what we want.
I think we need to have a closer look at the imbalance calculation code and any other users of sgs->avg_load to get rid of all code making assumptions about cpu capacity.
We already had this discussion during the review of a previous version https://lkml.org/lkml/2014/6/3/422 and my answer has not changed; This patch is necessary to solve the 1 task per CPU issue of the HMP system but is not enough. I have a patch for solving the imbalance calculation issue and i have planned to send it once this patch will be in a good shape for being accepted by Peter.
I don't want to mix this patch and the next one because there are addressing different problem: this one is how evaluate the capacity of a system and detect when a group is overloaded and the next one will handle the case when the balance of the system can't rely on the average load figures of the group because we have a significant capacity difference between groups. Not that i haven't specifically mentioned HMP for the last patch because SMP system can also take advantage of it
You do mention 'big' and 'little' cores in your commit message and quote example numbers which are identical to the cpu capacities for TC2. That led me to believe that this patch would address the issues we see for HMP systems. But that is clearly wrong. I would suggest that you drop mentioning big and little cores and stick to only describing cpu capacity reductions due to rt tasks and irq to avoid any confusion about the purpose of the patch. Maybe explicitly say that non-default cpu capacities (capacity_orig) aren't addressed yet.
I think the two problems you describe are very much related. This patch set is half the solution of the HMP balancing problem. We just need the last bit for avg_load and then we can add scale-invariance on top.
IMHO, it would be good to have all the bits and pieces for cpu capacity scaling and scaling of per-entity load-tracking on the table so we fit things together. We only have patches for parts of the solution posted so far.
Morten
On 3 October 2014 11:35, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Fri, Oct 03, 2014 at 08:24:23AM +0100, Vincent Guittot wrote:
On 2 October 2014 18:57, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Tue, Sep 23, 2014 at 05:08:04PM +0100, Vincent Guittot wrote:
Below are two examples to illustrate the problem that this patch solves:
1 - capacity_factor makes the assumption that max capacity of a CPU is SCHED_CAPACITY_SCALE and the load of a thread is always is SCHED_LOAD_SCALE. It compares the output of these figures with the sum of nr_running to decide if a group is overloaded or not.
But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE (640 as an example), a group of 3 CPUS will have a max capacity_factor of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be seen as overloaded if we have only one task per CPU.
I did some testing on TC2 which has the setup you describe above on the A7 cluster when the clock-frequency property is set in DT. The two A15s have max capacities above 1024. When using sysbench with five threads I still get three tasks on the two A15s and two tasks on the three A7 leaving one cpu idle (on average).
Using cpu utilization (usage) does correctly identify the A15 cluster as overloaded and it even gets as far as selecting the A15 running two tasks as the source cpu in load_balance(). However, load_balance() bails out without pulling the task due to calculate_imbalance() returning a zero imbalance. calculate_imbalance() bails out just before the hunk you changed due to comparison of the sched_group avg_loads. sgs->avg_load is basically the sum of load in the group divided by its capacity. Since load isn't scaled the avg_load of the overloaded A15 group is slightly _lower_ than the partially idle A7 group. Hence calculate_imbalance() bails out, which isn't what we want.
I think we need to have a closer look at the imbalance calculation code and any other users of sgs->avg_load to get rid of all code making assumptions about cpu capacity.
We already had this discussion during the review of a previous version https://lkml.org/lkml/2014/6/3/422 and my answer has not changed; This patch is necessary to solve the 1 task per CPU issue of the HMP system but is not enough. I have a patch for solving the imbalance calculation issue and i have planned to send it once this patch will be in a good shape for being accepted by Peter.
I don't want to mix this patch and the next one because there are addressing different problem: this one is how evaluate the capacity of a system and detect when a group is overloaded and the next one will handle the case when the balance of the system can't rely on the average load figures of the group because we have a significant capacity difference between groups. Not that i haven't specifically mentioned HMP for the last patch because SMP system can also take advantage of it
You do mention 'big' and 'little' cores in your commit message and quote example numbers with are identical to the cpu capacities for TC2. That
By last patch, I mean the patch about imbalance that I haven't sent yet, but it's not related to this patch
lead me to believe that this patch would address the issues we see for HMP systems. But that is clearly wrong. I would suggest that you drop
This patch addresses one issue: correctly detect how much capacity we have and correctly detect when a group is overloaded; HMP systems clearly fall in this category, but they are not the only ones. This is the only purpose of this patch, not the "1 task per CPU issue" that you mentioned.
The second patch is about correctly balancing the tasks on a system where the capacity of one group is significantly less than that of another group. It has nothing to do with capacity computation and overload detection, but it will use these corrected/new metrics to make a better choice.
Then, there is the "1 task per CPU issue on HMP" that you mentioned, and the latter will be solved thanks to these 2 patchsets, but this is not the only/main target of these patchsets, so I don't want to reduce them to "solve an HMP issue"
mentioning big and little cores and stick to only describing cpu capacity reductions due to rt tasks and irq to avoid any confusion about the purpose of the patch. Maybe explicitly say that non-default cpu capacities (capacity_orig) isn't addressed yet.
But they are addressed by this patchset; you just mixed various goals together (see above)
I think the two problems you describe are very much related. This patch set is half the solution of the HMP balancing problem. We just need the last bit for avg_load and then we can add scale-invariance on top.
I don't see the link between scale invariance and a bad load-balancing choice. The fact that the current load balancer puts more tasks than CPUs in a group on an HMP system should not influence or break scale-invariant load tracking, should it?
Now, I could send the other patchset, but I'm afraid that this would generate more confusion than help in the review process.
Regards, Vincent
IMHO, it would be good to have all the bits and pieces for cpu capacity scaling and scaling of per-entity load-tracking on the table so we fit things together. We only have patches for parts of the solution posted so far.
Morten
Hi Vincent, On 10/3/14, 8:50 PM, Vincent Guittot wrote:
On 3 October 2014 11:35, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Fri, Oct 03, 2014 at 08:24:23AM +0100, Vincent Guittot wrote:
On 2 October 2014 18:57, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Tue, Sep 23, 2014 at 05:08:04PM +0100, Vincent Guittot wrote:
Below are two examples to illustrate the problem that this patch solves:
1 - capacity_factor makes the assumption that max capacity of a CPU is SCHED_CAPACITY_SCALE and the load of a thread is always is SCHED_LOAD_SCALE. It compares the output of these figures with the sum of nr_running to decide if a group is overloaded or not.
But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE (640 as an example), a group of 3 CPUS will have a max capacity_factor of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be seen as overloaded if we have only one task per CPU.
I did some testing on TC2 which has the setup you describe above on the A7 cluster when the clock-frequency property is set in DT. The two A15s have max capacities above 1024. When using sysbench with five threads I still get three tasks on the two A15s and two tasks on the three A7 leaving one cpu idle (on average).
Using cpu utilization (usage) does correctly identify the A15 cluster as overloaded and it even gets as far as selecting the A15 running two tasks as the source cpu in load_balance(). However, load_balance() bails out without pulling the task due to calculate_imbalance() returning a zero imbalance. calculate_imbalance() bails out just before the hunk you changed due to comparison of the sched_group avg_loads. sgs->avg_load is basically the sum of load in the group divided by its capacity. Since load isn't scaled the avg_load of the overloaded A15 group is slightly _lower_ than the partially idle A7 group. Hence calculate_imbalance() bails out, which isn't what we want.
I think we need to have a closer look at the imbalance calculation code and any other users of sgs->avg_load to get rid of all code making assumptions about cpu capacity.
We already had this discussion during the review of a previous version https://lkml.org/lkml/2014/6/3/422 and my answer has not changed; This patch is necessary to solve the 1 task per CPU issue of the HMP system but is not enough. I have a patch for solving the imbalance calculation issue and i have planned to send it once this patch will be in a good shape for being accepted by Peter.
I don't want to mix this patch and the next one because there are addressing different problem: this one is how evaluate the capacity of a system and detect when a group is overloaded and the next one will handle the case when the balance of the system can't rely on the average load figures of the group because we have a significant capacity difference between groups. Not that i haven't specifically mentioned HMP for the last patch because SMP system can also take advantage of it
You do mention 'big' and 'little' cores in your commit message and quote example numbers with are identical to the cpu capacities for TC2. That
By last patch, i mean the patch about imbalance that i haven't sent yet, but it's not related with this patch
lead me to believe that this patch would address the issues we see for HMP systems. But that is clearly wrong. I would suggest that you drop
This patch addresses one issue: correctly detect how much capacity we have and correctly detect when the group is overloaded; HMP system
What's the meaning of "HMP system"?
Regards, Wanpeng Li
clearly falls in this category but not only. This is the only purpose of this patch and not the "1 task per CPU issue" that you mentioned.
The second patch is about correctly balance the tasks on system where the capacity of a group is significantly less than another group. It has nothing to do in capacity computation and overload detection but it will use these corrected/new metrics to make a better choice.
Then, there is the "1 task per CPU issue on HMP" that you mentioned and this latter will be solved thanks to these 2 patchsets but this is not the only/main target of these patchsets so i don't want to reduce them into: "solve an HMP issue"
mentioning big and little cores and stick to only describing cpu capacity reductions due to rt tasks and irq to avoid any confusion about the purpose of the patch. Maybe explicitly say that non-default cpu capacities (capacity_orig) isn't addressed yet.
But they are addressed by this patchset. you just mixed various goal together (see above)
I think the two problems you describe are very much related. This patch set is half the solution of the HMP balancing problem. We just need the last bit for avg_load and then we can add scale-invariance on top.
i don't see the link between scale invariance and a bad load-balancing choice. The fact that the current load balancer puts more tasks than CPUs in a group on HMP system should not influence or break the scale invariance load tracking. Isn't it ?
Now, i could send the other patchset but i'm afraid that this will generate more confusion than help in the process review.
Regards, Vincent
IMHO, it would be good to have all the bits and pieces for cpu capacity scaling and scaling of per-entity load-tracking on the table so we fit things together. We only have patches for parts of the solution posted so far.
Morten
On 23 November 2014 at 01:22, Wanpeng Li kernellwp@gmail.com wrote:
Hi Vincent,
On 10/3/14, 8:50 PM, Vincent Guittot wrote:
On 3 October 2014 11:35, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Fri, Oct 03, 2014 at 08:24:23AM +0100, Vincent Guittot wrote:
On 2 October 2014 18:57, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Tue, Sep 23, 2014 at 05:08:04PM +0100, Vincent Guittot wrote:
Below are two examples to illustrate the problem that this patch solves:
1 - capacity_factor makes the assumption that max capacity of a CPU is SCHED_CAPACITY_SCALE and the load of a thread is always is SCHED_LOAD_SCALE. It compares the output of these figures with the sum of nr_running to decide if a group is overloaded or not.
But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE (640 as an example), a group of 3 CPUS will have a max capacity_factor of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be seen as overloaded if we have only one task per CPU.
I did some testing on TC2 which has the setup you describe above on the A7 cluster when the clock-frequency property is set in DT. The two A15s have max capacities above 1024. When using sysbench with five threads I still get three tasks on the two A15s and two tasks on the three A7 leaving one cpu idle (on average).
Using cpu utilization (usage) does correctly identify the A15 cluster as overloaded and it even gets as far as selecting the A15 running two tasks as the source cpu in load_balance(). However, load_balance() bails out without pulling the task due to calculate_imbalance() returning a zero imbalance. calculate_imbalance() bails out just before the hunk you changed due to comparison of the sched_group avg_loads. sgs->avg_load is basically the sum of load in the group divided by its capacity. Since load isn't scaled the avg_load of the overloaded A15 group is slightly _lower_ than the partially idle A7 group. Hence calculate_imbalance() bails out, which isn't what we want.
I think we need to have a closer look at the imbalance calculation code and any other users of sgs->avg_load to get rid of all code making assumptions about cpu capacity.
We already had this discussion during the review of a previous version https://lkml.org/lkml/2014/6/3/422 and my answer has not changed; This patch is necessary to solve the 1 task per CPU issue of the HMP system but is not enough. I have a patch for solving the imbalance calculation issue and i have planned to send it once this patch will be in a good shape for being accepted by Peter.
I don't want to mix this patch and the next one because there are addressing different problem: this one is how evaluate the capacity of a system and detect when a group is overloaded and the next one will handle the case when the balance of the system can't rely on the average load figures of the group because we have a significant capacity difference between groups. Not that i haven't specifically mentioned HMP for the last patch because SMP system can also take advantage of it
You do mention 'big' and 'little' cores in your commit message and quote example numbers with are identical to the cpu capacities for TC2. That
By last patch, i mean the patch about imbalance that i haven't sent yet, but it's not related with this patch
lead me to believe that this patch would address the issues we see for HMP systems. But that is clearly wrong. I would suggest that you drop
This patch addresses one issue: correctly detect how much capacity we have and correctly detect when the group is overloaded; HMP system
What's the meaning of "HMP system"?
Heterogeneous Multi-Processor (HMP) systems are systems that gather processors with different micro-architectures and/or different max frequencies. The main result is that the processors don't have the same maximum capacity or the same DMIPS/W efficiency
Regards, Vincent
Regards, Wanpeng Li
clearly falls in this category but not only. This is the only purpose of this patch and not the "1 task per CPU issue" that you mentioned.
The second patch is about correctly balance the tasks on system where the capacity of a group is significantly less than another group. It has nothing to do in capacity computation and overload detection but it will use these corrected/new metrics to make a better choice.
Then, there is the "1 task per CPU issue on HMP" that you mentioned and this latter will be solved thanks to these 2 patchsets but this is not the only/main target of these patchsets so i don't want to reduce them into: "solve an HMP issue"
mentioning big and little cores and stick to only describing cpu capacity reductions due to rt tasks and irq to avoid any confusion about the purpose of the patch. Maybe explicitly say that non-default cpu capacities (capacity_orig) isn't addressed yet.
But they are addressed by this patchset. you just mixed various goal together (see above)
I think the two problems you describe are very much related. This patch set is half the solution of the HMP balancing problem. We just need the last bit for avg_load and then we can add scale-invariance on top.
i don't see the link between scale invariance and a bad load-balancing choice. The fact that the current load balancer puts more tasks than CPUs in a group on HMP system should not influence or break the scale invariance load tracking. Isn't it ?
Now, i could send the other patchset but i'm afraid that this will generate more confusion than help in the process review.
Regards, Vincent
IMHO, it would be good to have all the bits and pieces for cpu capacity scaling and scaling of per-entity load-tracking on the table so we fit things together. We only have patches for parts of the solution posted so far.
Morten
On Tue, Sep 23, 2014 at 06:08:04PM +0200, Vincent Guittot wrote:
This implementation of utilization_avg_contrib doesn't solve the scaling in-variance problem, so i have to scale the utilization with original capacity of the CPU in order to get the CPU usage and compare it with the capacity. Once the scaling invariance will have been added in utilization_avg_contrib, we will remove the scale of utilization_avg_contrib by cpu_capacity_orig in get_cpu_usage. But the scaling invariance will come in another patchset.
I would have expected this in the previous patch that introduced that lot. Including a few words on how/why the cpu_capacity is a 'good' approximation etc..
Finally, the sched_group->sched_group_capacity->capacity_orig has been removed because it's more used during load balance.
That sentence is a contradiction, I expect there's a negative gone missing someplace.
On 3 October 2014 17:38, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Sep 23, 2014 at 06:08:04PM +0200, Vincent Guittot wrote:
This implementation of utilization_avg_contrib doesn't solve the scaling in-variance problem, so i have to scale the utilization with original capacity of the CPU in order to get the CPU usage and compare it with the capacity. Once the scaling invariance will have been added in utilization_avg_contrib, we will remove the scale of utilization_avg_contrib by cpu_capacity_orig in get_cpu_usage. But the scaling invariance will come in another patchset.
I would have expected this in the previous patch that introduced that lot. Including a few words on how/why the cpu_capacity is a 'good' approximation etc..
That's fair, I can move the explanation
Finally, the sched_group->sched_group_capacity->capacity_orig has been removed because it's more used during load balance.
That sentence is a contradiction, I expect there's a negative gone missing someplace.
yes, the "no" of "because it is no more used during load balance" is missing
add the SD_PREFER_SIBLING flag for SMT level in order to ensure that the scheduler will put at least 1 task per core.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/core.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c7c8ac4..f72663e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6134,6 +6134,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu) */
if (sd->flags & SD_SHARE_CPUCAPACITY) { + sd->flags |= SD_PREFER_SIBLING; sd->imbalance_pct = 110; sd->smt_gain = 1178; /* ~15% */
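For context, the flag is consumed one level up during load balance; a simplified sketch of the relevant lines in update_sd_lb_stats() (not the exact kernel code):

        struct sched_domain *child = env->sd->child;
        int prefer_sibling = 0;

        if (child && child->flags & SD_PREFER_SIBLING)
                prefer_sibling = 1;

        /*
         * With SD_PREFER_SIBLING set at SMT level, the parent (core) level
         * treats an SMT group running more than one task as having no spare
         * capacity (provided the local group has some), so excess tasks are
         * spread across cores before being stacked on hardware threads of
         * the same core.
         */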
On 09/23/2014 09:38 PM, Vincent Guittot wrote:
add the SD_PREFER_SIBLING flag for SMT level in order to ensure that the scheduler will put at least 1 task per core.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/core.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c7c8ac4..f72663e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6134,6 +6134,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu) */
if (sd->flags & SD_SHARE_CPUCAPACITY) {
sd->imbalance_pct = 110; sd->smt_gain = 1178; /* ~15% */sd->flags |= SD_PREFER_SIBLING;
Reviewed-by: Preeti U. Murthy preeti@linux.vnet.ibm.com
On 24 September 2014 14:27, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
On 09/23/2014 09:38 PM, Vincent Guittot wrote:
add the SD_PREFER_SIBLING flag for SMT level in order to ensure that the scheduler will put at least 1 task per core.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/core.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c7c8ac4..f72663e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6134,6 +6134,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu) */
if (sd->flags & SD_SHARE_CPUCAPACITY) {
sd->flags |= SD_PREFER_SIBLING; sd->imbalance_pct = 110; sd->smt_gain = 1178; /* ~15% */
Reviewed-by: Preeti U. Murthy preeti@linux.vnet.ibm.com
Thanks