Part of this patchset was previously part of the larger tasks packing patchset [1]. I have split the latter into (at least) 3 different patchsets to make things easier:
- configuration of sched_domain topology [2]
- update and consolidation of cpu_capacity (this patchset)
- tasks packing algorithm
SMT systems are no longer the only ones where a CPU can have an original capacity that differs from the default value. We need to extend the use of (cpu_)capacity_orig to all kinds of platforms so the scheduler will have both the maximum capacity (cpu_capacity_orig/capacity_orig) and the current capacity (cpu_capacity/capacity) of CPUs and sched_groups. A new function, arch_scale_cpu_capacity, has been created to replace arch_scale_smt_capacity, which is SMT-specific, in the computation of the capacity of a CPU.
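For reference, the weak default added by patch "sched: Allow all archs to set the capacity_orig" (quoted here from the patch itself) keeps the previous SMT behaviour and falls back to the default scale otherwise:

unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
	/* previous SMT behaviour: share smt_gain between the HW threads */
	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
		return sd->smt_gain / sd->span_weight;

	/* default original capacity */
	return SCHED_CAPACITY_SCALE;
}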
During load balance, the scheduler evaluates the number of tasks that a group of CPUs can handle. The current method assumes that tasks have a fixed load of SCHED_LOAD_SCALE and that CPUs have a default capacity of SCHED_CAPACITY_SCALE. This assumption leads to wrong decisions by creating ghost cores or by removing real ones when the original capacity of CPUs is different from the default SCHED_CAPACITY_SCALE. Instead of trying to evaluate the number of available cores based on the group_capacity, we now detect when the group is fully utilized.
Now that we have the original capacity of CPUs and their activity/utilization, we can evaluate more accurately the capacity and the level of utilization of a group of CPUs.
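As a rough illustration of the idea (a hypothetical helper, not the code of the series), detecting a fully utilized group then becomes a direct comparison between the group's utilization and its original capacity instead of a count of (possibly ghost) cores:

/*
 * Illustration only -- helper and arguments are hypothetical names:
 * the group has no spare capacity left for CFS tasks when its
 * utilization reaches its original capacity.
 */
static inline bool group_is_fully_utilized(unsigned long group_utilization,
					   unsigned long group_capacity_orig)
{
	return group_utilization >= group_capacity_orig;
}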
This patchset mainly replaces the old capacity method with a new one and keeps the policy almost unchanged, although we could certainly take advantage of this new statistic in several other places of the load balance.
Test results (done on v4; no test has been done on v5, which is only a rebase):
Below are the results of 4 kinds of tests:
- hackbench -l 500 -s 4096
- perf bench sched pipe -l 400000
- scp of a 100MB file on the platform
- ebizzy with various numbers of threads
on 4 kernels:
- tip          = tip/sched/core
- step1        = tip + patches(1-8)
- patchset     = tip + whole patchset
- patchset+irq = tip + this patchset + irq accounting
Each test has been run 6 times, and the table below shows the stdev and the diff compared to the tip kernel.
Dual A7                        tip        |   +step1           |   +patchset        |   patchset+irq
                               stdev      |  results  stdev    |  results  stdev    |  results  stdev
hackbench (lower is better)  (+/-)0.64%   |  -0.19% (+/-)0.73% |   0.58% (+/-)1.29% |   0.20% (+/-)1.00%
perf      (lower is better)  (+/-)0.28%   |   1.22% (+/-)0.17% |   1.29% (+/-)0.06% |   2.85% (+/-)0.33%
scp                          (+/-)4.81%   |   2.61% (+/-)0.28% |   2.39% (+/-)0.22% |  82.18% (+/-)3.30%
ebizzy -t 1                  (+/-)2.31%   |  -1.32% (+/-)1.90% |  -0.79% (+/-)2.88% |   3.10% (+/-)2.32%
ebizzy -t 2                  (+/-)0.70%   |   8.29% (+/-)6.66% |   1.93% (+/-)5.47% |   2.72% (+/-)5.72%
ebizzy -t 4                  (+/-)3.54%   |   5.57% (+/-)8.00% |   0.36% (+/-)9.00% |   2.53% (+/-)3.17%
ebizzy -t 6                  (+/-)2.36%   |  -0.43% (+/-)3.29% |  -1.93% (+/-)3.47% |   0.57% (+/-)0.75%
ebizzy -t 8                  (+/-)1.65%   |  -0.45% (+/-)0.93% |  -1.95% (+/-)1.52% |  -1.18% (+/-)1.61%
ebizzy -t 10                 (+/-)2.55%   |  -0.98% (+/-)3.06% |  -1.18% (+/-)6.17% |  -2.33% (+/-)3.28%
ebizzy -t 12                 (+/-)6.22%   |   0.17% (+/-)5.63% |   2.98% (+/-)7.11% |   1.19% (+/-)4.68%
ebizzy -t 14                 (+/-)5.38%   |  -0.14% (+/-)5.33% |   2.49% (+/-)4.93% |   1.43% (+/-)6.55%

Quad A15                       tip        |   +patchset1       |   +patchset2       |   patchset+irq
                               stdev      |  results  stdev    |  results  stdev    |  results  stdev
hackbench (lower is better)  (+/-)0.78%   |   0.87% (+/-)1.72% |   0.91% (+/-)2.02% |   3.30% (+/-)2.02%
perf      (lower is better)  (+/-)2.03%   |  -0.31% (+/-)0.76% |  -2.38% (+/-)1.37% |   1.42% (+/-)3.14%
scp                          (+/-)0.04%   |   0.51% (+/-)1.37% |   1.79% (+/-)0.84% |   1.77% (+/-)0.38%
ebizzy -t 1                  (+/-)0.41%   |   2.05% (+/-)0.38% |   2.08% (+/-)0.24% |   0.17% (+/-)0.62%
ebizzy -t 2                  (+/-)0.78%   |   0.60% (+/-)0.63% |   0.43% (+/-)0.48% |   1.61% (+/-)0.38%
ebizzy -t 4                  (+/-)0.58%   |  -0.10% (+/-)0.97% |  -0.65% (+/-)0.76% |  -0.75% (+/-)0.86%
ebizzy -t 6                  (+/-)0.31%   |   1.07% (+/-)1.12% |  -0.16% (+/-)0.87% |  -0.76% (+/-)0.22%
ebizzy -t 8                  (+/-)0.95%   |  -0.30% (+/-)0.85% |  -0.79% (+/-)0.28% |  -1.66% (+/-)0.21%
ebizzy -t 10                 (+/-)0.31%   |   0.04% (+/-)0.97% |  -1.44% (+/-)1.54% |  -0.55% (+/-)0.62%
ebizzy -t 12                 (+/-)8.35%   |  -1.89% (+/-)7.64% |   0.75% (+/-)5.30% |  -1.18% (+/-)8.16%
ebizzy -t 14                 (+/-)13.17%  |   6.22% (+/-)4.71% |   5.25% (+/-)9.14% |   5.87% (+/-)5.77%
I haven't been able to fully test the patchset on an SMT system to check that the regression reported by Preeti has been solved, but the various tests that I have done don't show any regression so far. The correction of the SD_PREFER_SIBLING mode and its use at SMT level should have fixed the regression.
The usage_avg_contrib is based on the current implementation of the load avg tracking. I also have a version of usage_avg_contrib that is based on the new implementation [3], but I haven't provided the patches and results as [3] is still under review. I can provide changes on top of [3] to adapt how usage_avg_contrib is computed to the new mechanism.
Changes since V4:
- rebase to manage conflicts with changes in selection of busiest group [4]

Changes since V3:
- add usage_avg_contrib statistic which sums the running time of tasks on a rq
- use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization
- fix replacement of power by capacity
- update some comments

Changes since V2:
- rebase on top of capacity renaming
- fix wake_affine statistic update
- rework nohz_kick_needed
- optimize the active migration of a task from a CPU with reduced capacity
- rename group_activity to group_utilization and remove unused total_utilization
- repair SD_PREFER_SIBLING and use it for SMT level
- reorder patchset to gather patches with same topics

Changes since V1:
- add 3 fixes
- correct some commit messages
- replace capacity computation by activity
- take into account current cpu capacity
[1] https://lkml.org/lkml/2013/10/18/121
[2] https://lkml.org/lkml/2014/3/19/377
[3] https://lkml.org/lkml/2014/7/18/110
[4] https://lkml.org/lkml/2014/7/25/589
Vincent Guittot (12):
  sched: fix imbalance flag reset
  sched: remove a wake_affine condition
  sched: fix avg_load computation
  sched: Allow all archs to set the capacity_orig
  ARM: topology: use new cpu_capacity interface
  sched: add per rq cpu_capacity_orig
  sched: test the cpu's capacity in wake affine
  sched: move cfs task on a CPU with higher capacity
  sched: add usage_load_avg
  sched: get CPU's utilization statistic
  sched: replace capacity_factor by utilization
  sched: add SD_PREFER_SIBLING for SMT level
 arch/arm/kernel/topology.c |   4 +-
 include/linux/sched.h      |   4 +-
 kernel/sched/core.c        |   3 +-
 kernel/sched/fair.c        | 356 ++++++++++++++++++++++++++-------------------
 kernel/sched/sched.h       |   3 +-
 5 files changed, 211 insertions(+), 159 deletions(-)
The imbalance flag can stay set even though there is no imbalance.

Let's assume that we have 3 tasks that run on a dual-core / dual-cluster system. There will be some idle load balances that are triggered during tick. Unfortunately, the tick is also used to queue background work, so we can reach a situation where a short work item has been queued on a CPU that already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task onto the idle CPU. The waiting task is a worker thread that is pinned to a CPU, so an imbalance due to a pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU, but the imbalance flag will trigger useless active load balancing between the idle CPU and the busy CPU.

We need to reset the imbalance flag as soon as we have reached a balanced state. If all tasks are pinned, we don't consider that a balanced state and leave the imbalance flag set.
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d3427a8..085e853 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6765,10 +6765,8 @@ static int load_balance(int this_cpu, struct rq *this_rq, if (sd_parent) { int *group_imbalance = &sd_parent->groups->sgc->imbalance;
- if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) { + if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) *group_imbalance = 1; - } else if (*group_imbalance) - *group_imbalance = 0; }
/* All tasks on this runqueue were pinned by CPU affinity */ @@ -6779,7 +6777,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, env.loop_break = sched_nr_migrate_break; goto redo; } - goto out_balanced; + goto out_all_pinned; } }
@@ -6853,6 +6851,23 @@ static int load_balance(int this_cpu, struct rq *this_rq, goto out;
out_balanced: + /* + * We reach balance although we may have faced some affinity + * constraints. Clear the imbalance flag if it was set. + */ + if (sd_parent) { + int *group_imbalance = &sd_parent->groups->sgc->imbalance; + + if (*group_imbalance) + *group_imbalance = 0; + } + +out_all_pinned: + /* + * We reach balance because all tasks are pinned at this level so + * we can't migrate them. Let the imbalance flag set so parent level + * can try to migrate them. + */ schedstat_inc(sd, lb_balanced[idle]);
sd->nr_balance_failed = 0;
I have tried to understand the meaning of the condition:
 (this_load <= load &&
  this_load + target_load(prev_cpu, idx) <= tl_per_task)
but I failed to find a use case that can take advantage of it, and I haven't found a clear description in the previous commits' logs. Furthermore, the comment of the condition refers to the task_hot function that was used before being replaced by the current condition:
/*
 * This domain has SD_WAKE_AFFINE and
 * p is cache cold in this domain, and
 * there is no bad imbalance.
 */
If we look more deeply at the condition below:
 this_load + target_load(prev_cpu, idx) <= tl_per_task
When sync is clear, we have:
 tl_per_task = runnable_load_avg / nr_running
 this_load   = max(runnable_load_avg, cpuload[idx])
 target_load = max(runnable_load_avg', cpuload'[idx])
Matching the condition implies that runnable_load_avg' == 0 and nr_running <= 1, since this_load is at least runnable_load_avg whereas tl_per_task cannot be larger than it. This in turn implies that runnable_load_avg == 0 too, because of the condition this_load <= load; but if this_load is null, balanced is already set and the test is redundant.

If sync is set, it's not as straightforward as above (especially if cgroups are involved), but the policy should be similar, as we have removed a task that's going to sleep in order to get more accurate load and this_load values.

The current conclusion is that these additional conditions don't give any benefit, so we can remove them.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 30 ++++++------------------------
 1 file changed, 6 insertions(+), 24 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 085e853..87b9dc7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4282,7 +4282,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) { s64 this_load, load; int idx, this_cpu, prev_cpu; - unsigned long tl_per_task; struct task_group *tg; unsigned long weight; int balanced; @@ -4340,32 +4339,15 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) balanced = this_eff_load <= prev_eff_load; } else balanced = true; - - /* - * If the currently running task will sleep within - * a reasonable amount of time then attract this newly - * woken task: - */ - if (sync && balanced) - return 1; - schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts); - tl_per_task = cpu_avg_load_per_task(this_cpu);
- if (balanced || - (this_load <= load && - this_load + target_load(prev_cpu, idx) <= tl_per_task)) { - /* - * This domain has SD_WAKE_AFFINE and - * p is cache cold in this domain, and - * there is no bad imbalance. - */ - schedstat_inc(sd, ttwu_move_affine); - schedstat_inc(p, se.statistics.nr_wakeups_affine); + if (!balanced) + return 0;
- return 1; - } - return 0; + schedstat_inc(sd, ttwu_move_affine); + schedstat_inc(p, se.statistics.nr_wakeups_affine); + + return 1; }
/*
The computation of avg_load and avg_load_per_task should only take into account the number of cfs tasks. The non-cfs tasks are already taken into account by decreasing the CPU's capacity, and they will be tracked in the CPU's utilization (group_utilization) in the next patches.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 87b9dc7..b85e9f7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4092,7 +4092,7 @@ static unsigned long capacity_of(int cpu)
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
+	unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running);
 	unsigned long load_avg = rq->cfs.runnable_load_avg;
 
 	if (nr_running)
@@ -5985,7 +5985,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		load = source_load(i, load_idx);
 
 		sgs->group_load += load;
-		sgs->sum_nr_running += rq->nr_running;
+		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
 		if (rq->nr_running > 1)
 			*overload = true;
Hi Vincent,
On 08/26/2014 04:36 PM, Vincent Guittot wrote:
The computation of avg_load and avg_load_per_task should only takes into account the number of cfs tasks. The non cfs task are already taken into account by decreasing the cpu's capacity and they will be tracked in the CPU's utilization (group_utilization) of the next patches
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 87b9dc7..b85e9f7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4092,7 +4092,7 @@ static unsigned long capacity_of(int cpu) static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu);
- unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running); unsigned long load_avg = rq->cfs.runnable_load_avg;
if (nr_running)
@@ -5985,7 +5985,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, load = source_load(i, load_idx);
sgs->group_load += load;
sgs->sum_nr_running += rq->nr_running;
sgs->sum_nr_running += rq->cfs.h_nr_running;
if (rq->nr_running > 1) *overload = true;
Why do we probe rq->nr_running while we do load balancing? Should we not be probing cfs_rq->nr_running instead? We are interested, after all, in load balancing fair tasks, right? The reason I ask is that I was wondering if we need to make a change similar to the above in more places in load balancing.

To cite examples: the above check says a cpu is overloaded when rq->nr_running > 1. However, if these tasks happen to be rt tasks, we would anyway not be able to load balance them. So while I was looking through this patch, I noticed this and wanted to cross-verify whether we are checking rq->nr_running on purpose in some places in load balancing; another example being in nohz_kick_needed().
Regards Preeti U Murthy
On 30 August 2014 14:00, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
On 08/26/2014 04:36 PM, Vincent Guittot wrote:
The computation of avg_load and avg_load_per_task should only takes into account the number of cfs tasks. The non cfs task are already taken into account by decreasing the cpu's capacity and they will be tracked in the CPU's utilization (group_utilization) of the next patches
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 87b9dc7..b85e9f7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4092,7 +4092,7 @@ static unsigned long capacity_of(int cpu) static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu);
unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running); unsigned long load_avg = rq->cfs.runnable_load_avg; if (nr_running)
@@ -5985,7 +5985,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, load = source_load(i, load_idx);
sgs->group_load += load;
sgs->sum_nr_running += rq->nr_running;
sgs->sum_nr_running += rq->cfs.h_nr_running; if (rq->nr_running > 1) *overload = true;
Why do we probe rq->nr_running while we do load balancing? Should not we be probing cfs_rq->nr_running instead? We are interested after all in load balancing fair tasks right? The reason I ask this is, I was wondering if we need to make the above similar change in more places in load balancing.
Hi Preeti,
Yes, we should probably test rq->cfs.h_nr_running > 0 before setting overload.
Sorry for this late answer, the email was lost in my messy inbox
Vincent
To cite examples: The above check says a cpu is overloaded when rq->nr_running > 1. However if these tasks happen to be rt tasks, we would anyway not be able to load balance. So while I was looking through this patch, I noticed this and wanted to cross verify if we are checking rq->nr_running on purpose in some places in load balancing; another example being in nohz_kick_needed().
Regards Preeti U Murthy
On Wed, 2014-09-03 at 13:09 +0200, Vincent Guittot wrote:
On 30 August 2014 14:00, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
On 08/26/2014 04:36 PM, Vincent Guittot wrote:
The computation of avg_load and avg_load_per_task should only takes into account the number of cfs tasks. The non cfs task are already taken into account by decreasing the cpu's capacity and they will be tracked in the CPU's utilization (group_utilization) of the next patches
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 87b9dc7..b85e9f7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4092,7 +4092,7 @@ static unsigned long capacity_of(int cpu) static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu);
unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running); unsigned long load_avg = rq->cfs.runnable_load_avg; if (nr_running)
@@ -5985,7 +5985,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, load = source_load(i, load_idx);
sgs->group_load += load;
sgs->sum_nr_running += rq->nr_running;
sgs->sum_nr_running += rq->cfs.h_nr_running; if (rq->nr_running > 1) *overload = true;
Why do we probe rq->nr_running while we do load balancing? Should not we be probing cfs_rq->nr_running instead? We are interested after all in load balancing fair tasks right? The reason I ask this is, I was wondering if we need to make the above similar change in more places in load balancing.
Hi Preeti,
Yes, we should probably the test rq->cfs.h_nr_running > 0 before setting overload.
The overload indicator is used for knowing when we can totally avoid load balancing to a cpu that is about to go idle. We can avoid load balancing when no cpu has more than 1 task. So if you have say just one fair task and multiple deadline tasks on a cpu, and another cpu about to go idle, you should turn on normal load balancing in the idle path by setting overload to true.
So setting overload should be set based on rq->nr_running and not on rq->cfs.h_nr_running.
Thanks.
Tim
On 4 September 2014 01:43, Tim Chen tim.c.chen@linux.intel.com wrote:
On Wed, 2014-09-03 at 13:09 +0200, Vincent Guittot wrote:
On 30 August 2014 14:00, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
On 08/26/2014 04:36 PM, Vincent Guittot wrote:
The computation of avg_load and avg_load_per_task should only takes into account the number of cfs tasks. The non cfs task are already taken into account by decreasing the cpu's capacity and they will be tracked in the CPU's utilization (group_utilization) of the next patches
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 87b9dc7..b85e9f7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4092,7 +4092,7 @@ static unsigned long capacity_of(int cpu) static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu);
unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running); unsigned long load_avg = rq->cfs.runnable_load_avg; if (nr_running)
@@ -5985,7 +5985,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, load = source_load(i, load_idx);
sgs->group_load += load;
sgs->sum_nr_running += rq->nr_running;
sgs->sum_nr_running += rq->cfs.h_nr_running; if (rq->nr_running > 1) *overload = true;
Why do we probe rq->nr_running while we do load balancing? Should not we be probing cfs_rq->nr_running instead? We are interested after all in load balancing fair tasks right? The reason I ask this is, I was wondering if we need to make the above similar change in more places in load balancing.
Hi Preeti,
Yes, we should probably the test rq->cfs.h_nr_running > 0 before setting overload.
The overload indicator is used for knowing when we can totally avoid load balancing to a cpu that is about to go idle. We can avoid load balancing when no cpu has more than 1 task. So if you have say just one fair task and multiple deadline tasks on a cpu, and another cpu about to go idle, you should turn on normal load balancing in the idle path by setting overload to true.
Newly idle load balancing can only affect CFS tasks, so triggering a load_balance because a cpu is overloaded by rt tasks only will not change anything.
So setting overload should be set based on rq->nr_running and not on rq->cfs.h_nr_running.
We should probably use both values like below
if ((rq->nr_running > 1) && (rq->cfs.h_nr_running > 0))
Regards, Vincent
Thanks.
Tim
On Thu, 2014-09-04 at 09:17 +0200, Vincent Guittot wrote:
On 4 September 2014 01:43, Tim Chen tim.c.chen@linux.intel.com wrote:
On Wed, 2014-09-03 at 13:09 +0200, Vincent Guittot wrote:
On 30 August 2014 14:00, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
On 08/26/2014 04:36 PM, Vincent Guittot wrote:
The computation of avg_load and avg_load_per_task should only takes into account the number of cfs tasks. The non cfs task are already taken into account by decreasing the cpu's capacity and they will be tracked in the CPU's utilization (group_utilization) of the next patches
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 87b9dc7..b85e9f7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4092,7 +4092,7 @@ static unsigned long capacity_of(int cpu) static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu);
unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running); unsigned long load_avg = rq->cfs.runnable_load_avg; if (nr_running)
@@ -5985,7 +5985,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, load = source_load(i, load_idx);
sgs->group_load += load;
sgs->sum_nr_running += rq->nr_running;
sgs->sum_nr_running += rq->cfs.h_nr_running; if (rq->nr_running > 1) *overload = true;
Why do we probe rq->nr_running while we do load balancing? Should not we be probing cfs_rq->nr_running instead? We are interested after all in load balancing fair tasks right? The reason I ask this is, I was wondering if we need to make the above similar change in more places in load balancing.
Hi Preeti,
Yes, we should probably the test rq->cfs.h_nr_running > 0 before setting overload.
The overload indicator is used for knowing when we can totally avoid load balancing to a cpu that is about to go idle. We can avoid load balancing when no cpu has more than 1 task. So if you have say just one fair task and multiple deadline tasks on a cpu, and another cpu about to go idle, you should turn on normal load balancing in the idle path by setting overload to true.
The newly idle load balancing can only affect CFS tasks so triggering a load_balance because a cpu is overloaded by rt tasks only, will not change anything.
So setting overload should be set based on rq->nr_running and not on rq->cfs.h_nr_running.
We should probably use both values like below
if ((rq->nr_running > 1) && ( rq->cfs.h_nr_running > 0))
Yes, this modification is the correct one that takes care of the condition I objected to previously.
Thanks.
Tim
On 08/26/2014 04:36 PM, Vincent Guittot wrote:
The computation of avg_load and avg_load_per_task should only takes into account the number of cfs tasks. The non cfs task are already taken into account by decreasing the cpu's capacity and they will be tracked in the CPU's utilization (group_utilization) of the next patches
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 87b9dc7..b85e9f7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4092,7 +4092,7 @@ static unsigned long capacity_of(int cpu) static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu);
- unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running); unsigned long load_avg = rq->cfs.runnable_load_avg;
if (nr_running)
@@ -5985,7 +5985,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, load = source_load(i, load_idx);
sgs->group_load += load;
sgs->sum_nr_running += rq->nr_running;
sgs->sum_nr_running += rq->cfs.h_nr_running;
Yes this was one of the concerns I had around the usage of rq->nr_running. Looks good to me.
if (rq->nr_running > 1) *overload = true;
Reviewed-by: Preeti U Murthy preeti@linux.vnet.ibm.com
capacity_orig is only changed for systems with an SMT sched_domain level in order to reflect the lower capacity of CPUs. Heterogeneous systems also have to reflect an original capacity that is different from the default value.

Create a more generic function arch_scale_cpu_capacity that can also be used by non-SMT platforms to set capacity_orig.

The weak behavior of arch_scale_cpu_capacity is the previous SMT one, in order to keep backward compatibility in the use of capacity_orig.

arch_scale_smt_capacity and default_scale_smt_capacity have been removed, as they were not used anywhere other than in arch_scale_cpu_capacity.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 25 ++++++-------------------
 1 file changed, 6 insertions(+), 19 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b85e9f7..8176bda 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5705,19 +5705,12 @@ unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu) return default_scale_capacity(sd, cpu); }
-static unsigned long default_scale_smt_capacity(struct sched_domain *sd, int cpu) +unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) { - unsigned long weight = sd->span_weight; - unsigned long smt_gain = sd->smt_gain; + if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1)) + return sd->smt_gain / sd->span_weight;
- smt_gain /= weight; - - return smt_gain; -} - -unsigned long __weak arch_scale_smt_capacity(struct sched_domain *sd, int cpu) -{ - return default_scale_smt_capacity(sd, cpu); + return SCHED_CAPACITY_SCALE; }
static unsigned long scale_rt_capacity(int cpu) @@ -5756,18 +5749,12 @@ static unsigned long scale_rt_capacity(int cpu)
static void update_cpu_capacity(struct sched_domain *sd, int cpu) { - unsigned long weight = sd->span_weight; unsigned long capacity = SCHED_CAPACITY_SCALE; struct sched_group *sdg = sd->groups;
- if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) { - if (sched_feat(ARCH_CAPACITY)) - capacity *= arch_scale_smt_capacity(sd, cpu); - else - capacity *= default_scale_smt_capacity(sd, cpu); + capacity *= arch_scale_cpu_capacity(sd, cpu);
- capacity >>= SCHED_CAPACITY_SHIFT; - } + capacity >>= SCHED_CAPACITY_SHIFT;
sdg->sgc->capacity_orig = capacity;
* Vincent Guittot vincent.guittot@linaro.org [2014-08-26 13:06:47]:
capacity_orig is only changed for system with a SMT sched_domain level in order to reflect the lower capacity of CPUs. Heterogenous system also have to reflect an original capacity that is different from the default value.
Create a more generic function arch_scale_cpu_capacity that can be also used by non SMT platform to set capacity_orig.
The weak behavior of arch_scale_cpu_capacity is the previous SMT one in order to keep backward compatibility in the use of capacity_orig.
arch_scale_smt_capacity and default_scale_smt_capacity have been removed as they were not use elsewhere than in arch_scale_cpu_capacity.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Reviewed-by: Kamalesh Babulal kamalesh@linux.vnet.ibm.com
Hi Vincent,
On 08/26/2014 04:36 PM, Vincent Guittot wrote:
capacity_orig is only changed for system with a SMT sched_domain level in order
I think I had asked this before, but why only capacity_orig? The capacity of a group is also being updated the same way. This patch fixes the capacity of a group to reflect the capacity of the heterogeneous CPUs in it, this capacity being both the full capacity of the group (capacity_orig) and the capacity available for the fair tasks. So I feel that in the subject as well as the changelog it would suffice to say 'capacity'.
to reflect the lower capacity of CPUs. Heterogenous system also have to reflect an original capacity that is different from the default value.
Create a more generic function arch_scale_cpu_capacity that can be also used by non SMT platform to set capacity_orig.
The weak behavior of arch_scale_cpu_capacity is the previous SMT one in order to keep backward compatibility in the use of capacity_orig.
arch_scale_smt_capacity and default_scale_smt_capacity have been removed as they were not use elsewhere than in arch_scale_cpu_capacity.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 25 ++++++------------------- 1 file changed, 6 insertions(+), 19 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b85e9f7..8176bda 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5705,19 +5705,12 @@ unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu) return default_scale_capacity(sd, cpu); }
-static unsigned long default_scale_smt_capacity(struct sched_domain *sd, int cpu) +unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) {
- unsigned long weight = sd->span_weight;
- unsigned long smt_gain = sd->smt_gain;
- if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
return sd->smt_gain / sd->span_weight;
- smt_gain /= weight;
- return smt_gain;
-}
-unsigned long __weak arch_scale_smt_capacity(struct sched_domain *sd, int cpu) -{
- return default_scale_smt_capacity(sd, cpu);
- return SCHED_CAPACITY_SCALE;
}
static unsigned long scale_rt_capacity(int cpu) @@ -5756,18 +5749,12 @@ static unsigned long scale_rt_capacity(int cpu)
static void update_cpu_capacity(struct sched_domain *sd, int cpu) {
unsigned long weight = sd->span_weight; unsigned long capacity = SCHED_CAPACITY_SCALE; struct sched_group *sdg = sd->groups;
if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
if (sched_feat(ARCH_CAPACITY))
Aren't you missing this check above? I understand that it is not crucial, but that would also mean removing ARCH_CAPACITY sched_feat altogether, wouldn't it?
Regards Preeti U Murthy
capacity *= arch_scale_smt_capacity(sd, cpu);
else
capacity *= default_scale_smt_capacity(sd, cpu);
- capacity *= arch_scale_cpu_capacity(sd, cpu);
capacity >>= SCHED_CAPACITY_SHIFT;
- }
capacity >>= SCHED_CAPACITY_SHIFT;
sdg->sgc->capacity_orig = capacity;
On 30 August 2014 19:07, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
On 08/26/2014 04:36 PM, Vincent Guittot wrote:
capacity_orig is only changed for system with a SMT sched_domain level in order
I think I had asked this before, but why only capacity_orig? The capacity of a group is also being updated the same way. This patch fixes the capacity of a group to reflect the capacity of the heterogeneous CPUs in it, this capacity being both the full capacity of the group: capacity_orig and the capacity available for the fair tasks. So I feel in the subject as well as the changelog it would suffice to say 'capacity'.
IIRC, we discussed that point on a previous version. The patch changes only the behavior of capacity_orig; the behavior of capacity stays unchanged, as all archs can already set the capacity, whereas capacity_orig was configurable only if SD_SHARE_CPUCAPACITY was set in the sched_domain. This is no longer the case with this patch, which creates arch_scale_cpu_capacity for setting capacity_orig.
to reflect the lower capacity of CPUs. Heterogenous system also have to reflect an original capacity that is different from the default value.
Create a more generic function arch_scale_cpu_capacity that can be also used by non SMT platform to set capacity_orig.
The weak behavior of arch_scale_cpu_capacity is the previous SMT one in order to keep backward compatibility in the use of capacity_orig.
arch_scale_smt_capacity and default_scale_smt_capacity have been removed as they were not use elsewhere than in arch_scale_cpu_capacity.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 25 ++++++------------------- 1 file changed, 6 insertions(+), 19 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b85e9f7..8176bda 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5705,19 +5705,12 @@ unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu) return default_scale_capacity(sd, cpu); }
-static unsigned long default_scale_smt_capacity(struct sched_domain *sd, int cpu) +unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) {
unsigned long weight = sd->span_weight;
unsigned long smt_gain = sd->smt_gain;
if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
return sd->smt_gain / sd->span_weight;
smt_gain /= weight;
return smt_gain;
-}
-unsigned long __weak arch_scale_smt_capacity(struct sched_domain *sd, int cpu) -{
return default_scale_smt_capacity(sd, cpu);
return SCHED_CAPACITY_SCALE;
}
static unsigned long scale_rt_capacity(int cpu) @@ -5756,18 +5749,12 @@ static unsigned long scale_rt_capacity(int cpu)
static void update_cpu_capacity(struct sched_domain *sd, int cpu) {
unsigned long weight = sd->span_weight; unsigned long capacity = SCHED_CAPACITY_SCALE; struct sched_group *sdg = sd->groups;
if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
if (sched_feat(ARCH_CAPACITY))
Aren't you missing this check above? I understand that it is not crucial, but that would also mean removing ARCH_CAPACITY sched_feat altogether, wouldn't it?
Peter has proposed to remove all those checks and to keep only the default behavior, because no arch uses arch_scale_smt_capacity. ARCH_CAPACITY is, however, still used in update_cpu_capacity.
Regards, Vincent
Regards Preeti U Murthy
capacity *= arch_scale_smt_capacity(sd, cpu);
else
capacity *= default_scale_smt_capacity(sd, cpu);
capacity *= arch_scale_cpu_capacity(sd, cpu);
capacity >>= SCHED_CAPACITY_SHIFT;
}
capacity >>= SCHED_CAPACITY_SHIFT; sdg->sgc->capacity_orig = capacity;
On 09/01/2014 01:35 PM, Vincent Guittot wrote:
On 30 August 2014 19:07, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
On 08/26/2014 04:36 PM, Vincent Guittot wrote:
capacity_orig is only changed for system with a SMT sched_domain level in order
I think I had asked this before, but why only capacity_orig? The capacity of a group is also being updated the same way. This patch fixes the capacity of a group to reflect the capacity of the heterogeneous CPUs in it, this capacity being both the full capacity of the group: capacity_orig and the capacity available for the fair tasks. So I feel in the subject as well as the changelog it would suffice to say 'capacity'.
IIRC, we discussed that point on a former version. The patch changes only the behavior of capacity_orig but the behavior of capacity stays unchanged as all archs can already set the capacity whereas the capacity_orig was configurable only if the SD_SHARE_CPUCAPACITY was set in the sched_domain. This is no more the case with this patch which creates arch_scale_cpu_capacity for setting capacity_orig.
Yes, sorry I overlooked that.
Reviewed-by: Preeti U. Murthy preeti@linux.vnet.ibm.com
Regards Preeti U Murthy
On Sat, Aug 30, 2014 at 10:37:40PM +0530, Preeti U Murthy wrote:
- if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
if (sched_feat(ARCH_CAPACITY))
Aren't you missing this check above? I understand that it is not crucial, but that would also mean removing ARCH_CAPACITY sched_feat altogether, wouldn't it?
Yes he's missing that, I added the below bit on top.
So the argument last time: lkml.kernel.org/r/20140709105721.GT19379@twins.programming.kicks-ass.net was that you cannot put sched_feat(ARCH_CAPACITY) inside a weak arch_* function. The test has to be outside, seeing how it needs to decide to call the arch function at all (or revert to the default implementation).
---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5700,7 +5700,7 @@ unsigned long __weak arch_scale_freq_cap
 	return default_scale_capacity(sd, cpu);
 }
 
-unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
 	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
 		return sd->smt_gain / sd->span_weight;
@@ -5708,6 +5708,11 @@ unsigned long __weak arch_scale_cpu_capa
 	return SCHED_CAPACITY_SCALE;
 }
 
+unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+{
+	return default_scale_cpu_capacity(sd, cpu);
+}
+
 static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -5747,7 +5752,10 @@ static void update_cpu_capacity(struct s
 	unsigned long capacity = SCHED_CAPACITY_SCALE;
 	struct sched_group *sdg = sd->groups;
 
-	capacity *= arch_scale_cpu_capacity(sd, cpu);
+	if (sched_feat(ARCH_CAPACITY))
+		capacity *= arch_scale_cpu_capacity(sd, cpu);
+	else
+		capacity *= default_scale_cpu_capacity(sd, cpu);
 
 	capacity >>= SCHED_CAPACITY_SHIFT;
On 10 September 2014 15:50, Peter Zijlstra peterz@infradead.org wrote:
On Sat, Aug 30, 2014 at 10:37:40PM +0530, Preeti U Murthy wrote:
- if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
if (sched_feat(ARCH_CAPACITY))
Aren't you missing this check above? I understand that it is not crucial, but that would also mean removing ARCH_CAPACITY sched_feat altogether, wouldn't it?
Yes he's missing that, I added the below bit on top.
FWIW, your changes are fine for me
So the argument last time: lkml.kernel.org/r/20140709105721.GT19379@twins.programming.kicks-ass.net was that you cannot put sched_feat(ARCH_CAPACITY) inside a weak arch_* function. The test has to be outside, seeing how it needs to decide to call the arch function at all (or revert to the default implementation).
--- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5700,7 +5700,7 @@ unsigned long __weak arch_scale_freq_cap return default_scale_capacity(sd, cpu); }
-unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) +static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu) { if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1)) return sd->smt_gain / sd->span_weight; @@ -5708,6 +5708,11 @@ unsigned long __weak arch_scale_cpu_capa return SCHED_CAPACITY_SCALE; }
+unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) +{
return default_scale_cpu_capacity(sd, cpu);
+}
static unsigned long scale_rt_capacity(int cpu) { struct rq *rq = cpu_rq(cpu); @@ -5747,7 +5752,10 @@ static void update_cpu_capacity(struct s unsigned long capacity = SCHED_CAPACITY_SCALE; struct sched_group *sdg = sd->groups;
capacity *= arch_scale_cpu_capacity(sd, cpu);
if (sched_feat(ARCH_CAPACITY))
capacity *= arch_scale_cpu_capacity(sd, cpu);
else
capacity *= default_scale_cpu_capacity(sd, cpu); capacity >>= SCHED_CAPACITY_SHIFT;
On 09/10/2014 07:20 PM, Peter Zijlstra wrote:
On Sat, Aug 30, 2014 at 10:37:40PM +0530, Preeti U Murthy wrote:
- if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
if (sched_feat(ARCH_CAPACITY))
Aren't you missing this check above? I understand that it is not crucial, but that would also mean removing ARCH_CAPACITY sched_feat altogether, wouldn't it?
Yes he's missing that, I added the below bit on top.
So the argument last time: lkml.kernel.org/r/20140709105721.GT19379@twins.programming.kicks-ass.net was that you cannot put sched_feat(ARCH_CAPACITY) inside a weak arch_* function. The test has to be outside, seeing how it needs to decide to call the arch function at all (or revert to the default implementation).
--- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5700,7 +5700,7 @@ unsigned long __weak arch_scale_freq_cap return default_scale_capacity(sd, cpu); }
-unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) +static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu) { if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1)) return sd->smt_gain / sd->span_weight; @@ -5708,6 +5708,11 @@ unsigned long __weak arch_scale_cpu_capa return SCHED_CAPACITY_SCALE; }
+unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) +{
- return default_scale_cpu_capacity(sd, cpu);
+}
static unsigned long scale_rt_capacity(int cpu) { struct rq *rq = cpu_rq(cpu); @@ -5747,7 +5752,10 @@ static void update_cpu_capacity(struct s unsigned long capacity = SCHED_CAPACITY_SCALE; struct sched_group *sdg = sd->groups;
- capacity *= arch_scale_cpu_capacity(sd, cpu);
if (sched_feat(ARCH_CAPACITY))
capacity *= arch_scale_cpu_capacity(sd, cpu);
else
capacity *= default_scale_cpu_capacity(sd, cpu);
capacity >>= SCHED_CAPACITY_SHIFT;
Alright, I see now. Thanks!
Regards Preeti U Murthy
Use the new arch_scale_cpu_capacity in order to reflect the original capacity of a CPU, instead of arch_scale_freq_capacity, which is more related to scaling the capacity according to the frequency.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/arm/kernel/topology.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index e35d880..89cfdd6 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
  */
 static DEFINE_PER_CPU(unsigned long, cpu_scale);
 
-unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
 	return per_cpu(cpu_scale, cpu);
 }
@@ -166,7 +166,7 @@ static void update_cpu_capacity(unsigned int cpu)
 	set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
 
 	printk(KERN_INFO "CPU%u: update cpu_capacity %lu\n",
-		cpu, arch_scale_freq_capacity(NULL, cpu));
+		cpu, arch_scale_cpu_capacity(NULL, cpu));
 }
#else
On Tue, 26 Aug 2014, Vincent Guittot wrote:
Use the new arch_scale_cpu_capacity in order to reflect the original capacity of a CPU instead of arch_scale_freq_capacity which is more linked to a scaling of the capacity linked to the frequency.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Acked-by: Nicolas Pitre nico@linaro.org
arch/arm/kernel/topology.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index e35d880..89cfdd6 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -42,7 +42,7 @@ */ static DEFINE_PER_CPU(unsigned long, cpu_scale); -unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu) +unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) { return per_cpu(cpu_scale, cpu); } @@ -166,7 +166,7 @@ static void update_cpu_capacity(unsigned int cpu) set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity); printk(KERN_INFO "CPU%u: update cpu_capacity %lu\n",
cpu, arch_scale_freq_capacity(NULL, cpu));
cpu, arch_scale_cpu_capacity(NULL, cpu));
}
#else
1.9.1
This new field, cpu_capacity_orig, reflects the available capacity of a CPU, unlike cpu_capacity, which reflects the current capacity that can be altered by frequency and rt tasks.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  | 2 +-
 kernel/sched/fair.c  | 1 +
 kernel/sched/sched.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a394f99..7c3b237 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7003,7 +7003,7 @@ void __init sched_init(void) #ifdef CONFIG_SMP rq->sd = NULL; rq->rd = NULL; - rq->cpu_capacity = SCHED_CAPACITY_SCALE; + rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE; rq->post_schedule = 0; rq->active_balance = 0; rq->next_balance = jiffies; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8176bda..17c16cc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5756,6 +5756,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
capacity >>= SCHED_CAPACITY_SHIFT;
+ cpu_rq(cpu)->cpu_capacity_orig = capacity; sdg->sgc->capacity_orig = capacity;
if (sched_feat(ARCH_CAPACITY)) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index aa0f73b..7c0a74e 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -575,6 +575,7 @@ struct rq { struct sched_domain *sd;
unsigned long cpu_capacity; + unsigned long cpu_capacity_orig;
unsigned char idle_balance; /* For active balancing */
* Vincent Guittot vincent.guittot@linaro.org [2014-08-26 13:06:49]:
This new field cpu_capacity_orig reflects the available capacity of a CPUs unlike the cpu_capacity which reflects the current capacity that can be altered by frequency and rt tasks.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Reviewed-by: Kamalesh Babulal kamalesh@linux.vnet.ibm.com
On 27 August 2014 15:32, Kamalesh Babulal kamalesh@linux.vnet.ibm.com wrote:
- Vincent Guittot vincent.guittot@linaro.org [2014-08-26 13:06:49]:
This new field cpu_capacity_orig reflects the available capacity of a CPUs unlike the cpu_capacity which reflects the current capacity that can be altered by frequency and rt tasks.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Reviewed-by: Kamalesh Babulal kamalesh@linux.vnet.ibm.com
Thanks
On Tue, Aug 26, 2014 at 01:06:49PM +0200, Vincent Guittot wrote:
This new field cpu_capacity_orig reflects the available capacity of a CPUs unlike the cpu_capacity which reflects the current capacity that can be altered by frequency and rt tasks.
No real objection to the patch as such, but the Changelog fails to tell us why you want this.
On 10 September 2014 15:53, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Aug 26, 2014 at 01:06:49PM +0200, Vincent Guittot wrote:
This new field cpu_capacity_orig reflects the available capacity of a CPUs unlike the cpu_capacity which reflects the current capacity that can be altered by frequency and rt tasks.
No real objection to the patch as such, but the Changelog fails to tell us why you want this.
cpu_capacity_orig is used in several places in the following patches to detect when a CPU's capacity has been noticeably reduced, so we can trigger a load balance to look for a CPU with full capacity. As an example, we can detect when a CPU handles a significant amount of irq (with CONFIG_IRQ_TIME_ACCOUNTING) but is seen as an idle CPU by the scheduler, whereas really idle CPUs are available.
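The test added later in the series (in need_active_balance() and nohz_kick_needed()) is essentially the following; the helper name here is only for illustration:

/*
 * Sketch of the check used by the later patches: the CPU is considered
 * to have lost a significant part of its original capacity (to irq
 * and/or rt activity) when the remaining capacity falls below roughly
 * 100/imbalance_pct of cpu_capacity_orig.
 */
static inline bool cpu_capacity_reduced(int cpu, struct sched_domain *sd)
{
	return (capacity_orig_of(cpu) * 100) >
	       (capacity_of(cpu) * sd->imbalance_pct);
}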
On Tue, 26 Aug 2014, Vincent Guittot wrote:
This new field cpu_capacity_orig reflects the available capacity of a CPUs
s/a CPUs/a CPU/
unlike the cpu_capacity which reflects the current capacity that can be altered by frequency and rt tasks.
Shouldn't this be described as the "highest possible capacity" instead of only "available capacity" to make it clearer? In which case, shouldn't it be called capacity_max instead of capacity_orig? With "orig" we may think of "origin", i.e. the capacity that was available at boot time or the like. If, for example, the CPU was booted with a low P-state, then its max capacity is different from the original one.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/core.c | 2 +- kernel/sched/fair.c | 1 + kernel/sched/sched.h | 1 + 3 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a394f99..7c3b237 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7003,7 +7003,7 @@ void __init sched_init(void) #ifdef CONFIG_SMP rq->sd = NULL; rq->rd = NULL;
rq->cpu_capacity = SCHED_CAPACITY_SCALE;
rq->post_schedule = 0; rq->active_balance = 0; rq->next_balance = jiffies;rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8176bda..17c16cc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5756,6 +5756,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu) capacity >>= SCHED_CAPACITY_SHIFT;
- cpu_rq(cpu)->cpu_capacity_orig = capacity; sdg->sgc->capacity_orig = capacity;
if (sched_feat(ARCH_CAPACITY)) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index aa0f73b..7c0a74e 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -575,6 +575,7 @@ struct rq { struct sched_domain *sd; unsigned long cpu_capacity;
- unsigned long cpu_capacity_orig;
unsigned char idle_balance; /* For active balancing */ -- 1.9.1
On 11 September 2014 21:02, Nicolas Pitre nicolas.pitre@linaro.org wrote:
On Tue, 26 Aug 2014, Vincent Guittot wrote:
This new field cpu_capacity_orig reflects the available capacity of a CPUs
s/a CPUs/a CPU/
good catch
unlike the cpu_capacity which reflects the current capacity that can be altered by frequency and rt tasks.
Shouldn't this be described as the "highest possible capacity" instead of only "available capacity" to make it clearer? In which case, shouldn't it be called capacity_max instead of capacity_orig? With "orig" we may think "origin" as the capacity that was available at boot time or the like. If for example the CPU was booted with a low P-state then its max capacity is different from the original one.
I have used the orig suffix to be aligned with capacity_orig, which is a sum of cpu_capacity_orig. I agree that orig stands for original, in the sense that it's the original capacity before being altered by scale_freq and scale_rt.
Vincent
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/core.c | 2 +- kernel/sched/fair.c | 1 + kernel/sched/sched.h | 1 + 3 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a394f99..7c3b237 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7003,7 +7003,7 @@ void __init sched_init(void) #ifdef CONFIG_SMP rq->sd = NULL; rq->rd = NULL;
rq->cpu_capacity = SCHED_CAPACITY_SCALE;
rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE; rq->post_schedule = 0; rq->active_balance = 0; rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8176bda..17c16cc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5756,6 +5756,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
capacity >>= SCHED_CAPACITY_SHIFT;
cpu_rq(cpu)->cpu_capacity_orig = capacity; sdg->sgc->capacity_orig = capacity; if (sched_feat(ARCH_CAPACITY))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index aa0f73b..7c0a74e 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -575,6 +575,7 @@ struct rq { struct sched_domain *sd;
unsigned long cpu_capacity;
unsigned long cpu_capacity_orig; unsigned char idle_balance; /* For active balancing */
-- 1.9.1
Currently the task always wakes affine on this_cpu if the latter is idle. Before waking up the task on this_cpu, we check that this_cpu's capacity is not significantly reduced because of RT tasks or irq activity.

Use cases where the number of irqs and/or the time spent under irq is significant will benefit from this, because the task that is woken up by an irq or softirq will not use the same CPU as the irq (and softirq) but an idle one.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 17c16cc..18db43e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4281,6 +4281,7 @@ static int wake_wide(struct task_struct *p) static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) { s64 this_load, load; + s64 this_eff_load, prev_eff_load; int idx, this_cpu, prev_cpu; struct task_group *tg; unsigned long weight; @@ -4324,21 +4325,21 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) * Otherwise check if either cpus are near enough in load to allow this * task to be woken on this_cpu. */ - if (this_load > 0) { - s64 this_eff_load, prev_eff_load; + this_eff_load = 100; + this_eff_load *= capacity_of(prev_cpu); + + prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2; + prev_eff_load *= capacity_of(this_cpu);
- this_eff_load = 100; - this_eff_load *= capacity_of(prev_cpu); + if (this_load > 0) { this_eff_load *= this_load + effective_load(tg, this_cpu, weight, weight);
- prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2; - prev_eff_load *= capacity_of(this_cpu); prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight); + } + + balanced = this_eff_load <= prev_eff_load;
- balanced = this_eff_load <= prev_eff_load; - } else - balanced = true; schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
if (!balanced)
On Tue, Aug 26, 2014 at 01:06:50PM +0200, Vincent Guittot wrote:
Currently the task always wakes affine on this_cpu if the latter is idle. Before waking up the task on this_cpu, we check that this_cpu capacity is not significantly reduced because of RT tasks or irq activity.
Use case where the number of irq and/or the time spent under irq is important will take benefit of this because the task that is woken up by irq or softirq will not use the same CPU than irq (and softirq) but a idle one.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Rik, you've been 'fighting' wake_affine, any objections?
If the CPU is used for handling a lot of IRQs, trigger a load balance to check whether it's worth moving its tasks to another CPU that has more capacity.

As a sidenote, this will not generate more spurious ilbs, because we already trigger an ilb if there is more than 1 busy cpu. If this cpu is the only one that has a task, we will trigger the ilb once to migrate the task.

The nohz_kick_needed function has been cleaned up a bit while adding the new test.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 69 +++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 49 insertions(+), 20 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 18db43e..60ae1ce 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env, return true; }
+ /* + * The group capacity is reduced probably because of activity from other + * sched class or interrupts which use part of the available capacity + */ + if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity * + env->sd->imbalance_pct)) + return true; + return false; }
@@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env) struct sched_domain *sd = env->sd;
if (env->idle == CPU_NEWLY_IDLE) { + int src_cpu = env->src_cpu;
/* * ASYM_PACKING needs to force migrate tasks from busy but * higher numbered CPUs in order to pack all tasks in the * lowest numbered CPUs. */ - if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu) + if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu) + return 1; + + /* + * If the CPUs share their cache and the src_cpu's capacity is + * reduced because of other sched_class or IRQs, we trig an + * active balance to move the task + */ + if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) * + sd->imbalance_pct)) return 1; }
@@ -6643,6 +6661,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
schedstat_add(sd, lb_imbalance[idle], env.imbalance);
+ env.src_cpu = busiest->cpu; + ld_moved = 0; if (busiest->nr_running > 1) { /* @@ -6652,7 +6672,6 @@ static int load_balance(int this_cpu, struct rq *this_rq, * correctly treated as an imbalance. */ env.flags |= LBF_ALL_PINNED; - env.src_cpu = busiest->cpu; env.src_rq = busiest; env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
@@ -7359,10 +7378,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
/* * Current heuristic for kicking the idle load balancer in the presence - * of an idle cpu is the system. + * of an idle cpu in the system. * - This rq has more than one task. - * - At any scheduler domain level, this cpu's scheduler group has multiple - * busy cpu's exceeding the group's capacity. + * - This rq has at least one CFS task and the capacity of the CPU is + * significantly reduced because of RT tasks or IRQs. + * - At parent of LLC scheduler domain level, this cpu's scheduler group has + * multiple busy cpu. * - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler * domain span are idle. */ @@ -7372,9 +7393,10 @@ static inline int nohz_kick_needed(struct rq *rq) struct sched_domain *sd; struct sched_group_capacity *sgc; int nr_busy, cpu = rq->cpu; + bool kick = false;
if (unlikely(rq->idle_balance)) - return 0; + return false;
/* * We may be recently in ticked or tickless idle mode. At the first @@ -7388,38 +7410,45 @@ static inline int nohz_kick_needed(struct rq *rq) * balancing. */ if (likely(!atomic_read(&nohz.nr_cpus))) - return 0; + return false;
if (time_before(now, nohz.next_balance)) - return 0; + return false;
if (rq->nr_running >= 2) - goto need_kick; + return true;
rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_busy, cpu)); - if (sd) { sgc = sd->groups->sgc; nr_busy = atomic_read(&sgc->nr_busy_cpus);
- if (nr_busy > 1) - goto need_kick_unlock; + if (nr_busy > 1) { + kick = true; + goto unlock; + } + }
- sd = rcu_dereference(per_cpu(sd_asym, cpu)); + sd = rcu_dereference(rq->sd); + if (sd) { + if ((rq->cfs.h_nr_running >= 1) && + ((rq->cpu_capacity * sd->imbalance_pct) < + (rq->cpu_capacity_orig * 100))) { + kick = true; + goto unlock; + } + }
+ sd = rcu_dereference(per_cpu(sd_asym, cpu)); if (sd && (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu)) - goto need_kick_unlock; + kick = true;
+unlock: rcu_read_unlock(); - return 0; - -need_kick_unlock: - rcu_read_unlock(); -need_kick: - return 1; + return kick; } #else static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
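For readers skimming the diff, the kick decision added here boils down to the following condensed sketch (not the literal patch; RCU locking, the early bail-outs and the unchanged SD_ASYM_PACKING test are omitted):

/*
 * Condensed sketch of nohz_kick_needed() after this patch. @sd_busy is
 * the per_cpu(sd_busy) domain, @sd_rq is rq->sd.
 */
static inline bool nohz_kick_needed_sketch(struct rq *rq,
					   struct sched_domain *sd_busy,
					   struct sched_domain *sd_rq)
{
	/* More than one runnable task on this rq. */
	if (rq->nr_running >= 2)
		return true;

	/* More than one busy CPU in this CPU's sched_group at that level. */
	if (sd_busy && atomic_read(&sd_busy->groups->sgc->nr_busy_cpus) > 1)
		return true;

	/*
	 * At least one CFS task while RT tasks or IRQs use a significant
	 * part of the CPU capacity.
	 */
	if (sd_rq && rq->cfs.h_nr_running >= 1 &&
	    (rq->cpu_capacity * sd_rq->imbalance_pct) <
	    (rq->cpu_capacity_orig * 100))
		return true;

	return false;
}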
Hi Vincent,
On 08/26/2014 04:36 PM, Vincent Guittot wrote:
If the CPU is used for handling lot of IRQs, trig a load balance to check if it's worth moving its tasks on another CPU that has more capacity.
As a sidenote, this will note generate more spurious ilb because we already trig an ilb if there is more than 1 busy cpu. If this cpu is the only one that has a task, we will trig the ilb once for migrating the task.
The nohz_kick_needed function has been cleaned up a bit while adding the new test
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 69 +++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 49 insertions(+), 20 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 18db43e..60ae1ce 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env, return true; }
- /*
* The group capacity is reduced probably because of activity from other
* sched class or interrupts which use part of the available capacity
*/
- if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
env->sd->imbalance_pct))
Wouldn't the check on avg_load let us know if we are packing more tasks in this group than its capacity? Isn't that the metric we are more interested in?
return true;
- return false;
}
@@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env) struct sched_domain *sd = env->sd;
if (env->idle == CPU_NEWLY_IDLE) {
int src_cpu = env->src_cpu;
/*
- ASYM_PACKING needs to force migrate tasks from busy but
- higher numbered CPUs in order to pack all tasks in the
- lowest numbered CPUs.
*/
if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
return 1;
/*
* If the CPUs share their cache and the src_cpu's capacity is
* reduced because of other sched_class or IRQs, we trig an
* active balance to move the task
*/
if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
}sd->imbalance_pct)) return 1;
@@ -6643,6 +6661,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
schedstat_add(sd, lb_imbalance[idle], env.imbalance);
- env.src_cpu = busiest->cpu;
- ld_moved = 0; if (busiest->nr_running > 1) { /*
@@ -6652,7 +6672,6 @@ static int load_balance(int this_cpu, struct rq *this_rq, * correctly treated as an imbalance. */ env.flags |= LBF_ALL_PINNED;
env.src_rq = busiest; env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);env.src_cpu = busiest->cpu;
@@ -7359,10 +7378,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
/*
- Current heuristic for kicking the idle load balancer in the presence
- of an idle cpu is the system.
- of an idle cpu in the system.
- This rq has more than one task.
- At any scheduler domain level, this cpu's scheduler group has multiple
busy cpu's exceeding the group's capacity.
- This rq has at least one CFS task and the capacity of the CPU is
significantly reduced because of RT tasks or IRQs.
- At parent of LLC scheduler domain level, this cpu's scheduler group has
*/
multiple busy cpu.
- For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
domain span are idle.
@@ -7372,9 +7393,10 @@ static inline int nohz_kick_needed(struct rq *rq) struct sched_domain *sd; struct sched_group_capacity *sgc; int nr_busy, cpu = rq->cpu;
bool kick = false;
if (unlikely(rq->idle_balance))
return 0;
return false;
/*
- We may be recently in ticked or tickless idle mode. At the first
@@ -7388,38 +7410,45 @@ static inline int nohz_kick_needed(struct rq *rq) * balancing. */ if (likely(!atomic_read(&nohz.nr_cpus)))
return 0;
return false;
if (time_before(now, nohz.next_balance))
return 0;
return false;
if (rq->nr_running >= 2)
Will this check ^^ not catch those cases which this patch is targeting?
Regards Preeti U Murthy
On 30 August 2014 19:50, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
On 08/26/2014 04:36 PM, Vincent Guittot wrote:
If the CPU is used for handling lot of IRQs, trig a load balance to check if it's worth moving its tasks on another CPU that has more capacity.
As a sidenote, this will note generate more spurious ilb because we already trig an ilb if there is more than 1 busy cpu. If this cpu is the only one that has a task, we will trig the ilb once for migrating the task.
The nohz_kick_needed function has been cleaned up a bit while adding the new test
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 69 +++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 49 insertions(+), 20 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 18db43e..60ae1ce 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env, return true; }
/*
* The group capacity is reduced probably because of activity from other
* sched class or interrupts which use part of the available capacity
*/
if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
env->sd->imbalance_pct))
Wouldn't the check on avg_load let us know if we are packing more tasks in this group than its capacity ? Isn't that the metric we are more interested in?
With this patch, we don't want to pack; we prefer to spread the task to another CPU than the one which handles the interrupts, if the latter uses a significant part of the CPU capacity.
return true;
return false;
}
@@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env) struct sched_domain *sd = env->sd;
if (env->idle == CPU_NEWLY_IDLE) {
int src_cpu = env->src_cpu; /* * ASYM_PACKING needs to force migrate tasks from busy but * higher numbered CPUs in order to pack all tasks in the * lowest numbered CPUs. */
if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
return 1;
/*
* If the CPUs share their cache and the src_cpu's capacity is
* reduced because of other sched_class or IRQs, we trig an
* active balance to move the task
*/
if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
sd->imbalance_pct)) return 1; }
@@ -6643,6 +6661,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
schedstat_add(sd, lb_imbalance[idle], env.imbalance);
env.src_cpu = busiest->cpu;
ld_moved = 0; if (busiest->nr_running > 1) { /*
@@ -6652,7 +6672,6 @@ static int load_balance(int this_cpu, struct rq *this_rq, * correctly treated as an imbalance. */ env.flags |= LBF_ALL_PINNED;
env.src_cpu = busiest->cpu; env.src_rq = busiest; env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
@@ -7359,10 +7378,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
/*
- Current heuristic for kicking the idle load balancer in the presence
- of an idle cpu is the system.
- of an idle cpu in the system.
- This rq has more than one task.
- At any scheduler domain level, this cpu's scheduler group has multiple
busy cpu's exceeding the group's capacity.
- This rq has at least one CFS task and the capacity of the CPU is
significantly reduced because of RT tasks or IRQs.
- At parent of LLC scheduler domain level, this cpu's scheduler group has
*/
multiple busy cpu.
- For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
domain span are idle.
@@ -7372,9 +7393,10 @@ static inline int nohz_kick_needed(struct rq *rq) struct sched_domain *sd; struct sched_group_capacity *sgc; int nr_busy, cpu = rq->cpu;
bool kick = false; if (unlikely(rq->idle_balance))
return 0;
return false; /* * We may be recently in ticked or tickless idle mode. At the first
@@ -7388,38 +7410,45 @@ static inline int nohz_kick_needed(struct rq *rq) * balancing. */ if (likely(!atomic_read(&nohz.nr_cpus)))
return 0;
return false; if (time_before(now, nohz.next_balance))
return 0;
return false; if (rq->nr_running >= 2)
Will this check ^^ not catch those cases which this patch is targeting?
This patch is not about how many tasks are running but about whether the capacity of the CPU is reduced because of side activity like interrupts, which is only visible in the capacity value (with the IRQ_TIME_ACCOUNTING config) but not in nr_running. Even if the capacity is reduced because of RT tasks, nothing ensures that the RT task will be running when the tick fires.
Regards, Vincent
On 09/01/2014 02:15 PM, Vincent Guittot wrote:
On 30 August 2014 19:50, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
index 18db43e..60ae1ce 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env, return true; }
/*
* The group capacity is reduced probably because of activity from other
* sched class or interrupts which use part of the available capacity
*/
if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
env->sd->imbalance_pct))
Wouldn't the check on avg_load let us know if we are packing more tasks in this group than its capacity ? Isn't that the metric we are more interested in?
With this patch, we don't want to pack but we prefer to spread the task on another CPU than the one which handles the interruption if latter uses a significant part of the CPU capacity.
return true;
return false;
}
@@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env) struct sched_domain *sd = env->sd;
if (env->idle == CPU_NEWLY_IDLE) {
int src_cpu = env->src_cpu; /* * ASYM_PACKING needs to force migrate tasks from busy but * higher numbered CPUs in order to pack all tasks in the * lowest numbered CPUs. */
if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
return 1;
/*
* If the CPUs share their cache and the src_cpu's capacity is
* reduced because of other sched_class or IRQs, we trig an
* active balance to move the task
*/
if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
sd->imbalance_pct)) return 1; }
@@ -6643,6 +6661,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
schedstat_add(sd, lb_imbalance[idle], env.imbalance);
env.src_cpu = busiest->cpu;
ld_moved = 0; if (busiest->nr_running > 1) { /*
@@ -6652,7 +6672,6 @@ static int load_balance(int this_cpu, struct rq *this_rq, * correctly treated as an imbalance. */ env.flags |= LBF_ALL_PINNED;
env.src_cpu = busiest->cpu; env.src_rq = busiest; env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
@@ -7359,10 +7378,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
/*
- Current heuristic for kicking the idle load balancer in the presence
- of an idle cpu is the system.
- of an idle cpu in the system.
- This rq has more than one task.
- At any scheduler domain level, this cpu's scheduler group has multiple
busy cpu's exceeding the group's capacity.
- This rq has at least one CFS task and the capacity of the CPU is
significantly reduced because of RT tasks or IRQs.
- At parent of LLC scheduler domain level, this cpu's scheduler group has
*/
multiple busy cpu.
- For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
domain span are idle.
@@ -7372,9 +7393,10 @@ static inline int nohz_kick_needed(struct rq *rq) struct sched_domain *sd; struct sched_group_capacity *sgc; int nr_busy, cpu = rq->cpu;
bool kick = false; if (unlikely(rq->idle_balance))
return 0;
return false; /* * We may be recently in ticked or tickless idle mode. At the first
@@ -7388,38 +7410,45 @@ static inline int nohz_kick_needed(struct rq *rq) * balancing. */ if (likely(!atomic_read(&nohz.nr_cpus)))
return 0;
return false; if (time_before(now, nohz.next_balance))
return 0;
return false; if (rq->nr_running >= 2)
Will this check ^^ not catch those cases which this patch is targeting?
This patch is not about how many tasks are running but if the capacity of the CPU is reduced because of side activity like interruptions which are only visible in the capacity value (with IRQ_TIME_ACCOUNTING config) but not in nr_running. Even if the capacity is reduced because of RT tasks, nothing ensures that the RT task will be running when the tick fires
Regards, Vincent
Regards Preeti U Murthy
goto need_kick;
return true; rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_busy, cpu));
if (sd) { sgc = sd->groups->sgc; nr_busy = atomic_read(&sgc->nr_busy_cpus);
if (nr_busy > 1)
goto need_kick_unlock;
if (nr_busy > 1) {
kick = true;
goto unlock;
}
}
sd = rcu_dereference(per_cpu(sd_asym, cpu));
sd = rcu_dereference(rq->sd);
if (sd) {
if ((rq->cfs.h_nr_running >= 1) &&
((rq->cpu_capacity * sd->imbalance_pct) <
(rq->cpu_capacity_orig * 100))) {
Ok I understand your explanation above. But I was wondering if you would need to add this check around rq->cfs.h_nr_running >= 1 in the above two cases as well.
I have actually raised this concern over whether we should be using rq->nr_running or cfs_rq->nr_running while we do load balancing, in reply to your patch 3. While all our load measurements are about the cfs_rq load, we use rq->nr_running, which may include tasks from other sched classes as well. We divide by it to get the average load per task during load balancing, which is wrong, isn't it? Similarly, during nohz_kick_needed(), we trigger load balancing based on rq->nr_running again.
In this patch too, even if you know that the CPU is being dominated by tasks that do not belong to the cfs class, you would be triggering a futile load balance if there are no fair tasks to move.
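To make the concern concrete, the per-task average used by the load balancer is computed roughly like this (a sketch mirroring cpu_avg_load_per_task(): the load sum is CFS-only while the divisor counts every sched class):

/*
 * Sketch: dividing the CFS-only load average of the rq by a task count
 * that also includes RT/deadline tasks can give a skewed per-task load.
 */
static unsigned long avg_cfs_load_per_task_sketch(struct rq *rq)
{
	unsigned long cfs_load = rq->cfs.runnable_load_avg;	/* CFS load only */
	unsigned int nr = rq->nr_running;			/* all sched classes */

	return nr ? cfs_load / nr : 0;
}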
Regards Preeti U Murthy
On 3 September 2014 11:11, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
On 09/01/2014 02:15 PM, Vincent Guittot wrote:
On 30 August 2014 19:50, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
index 18db43e..60ae1ce 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env, return true; }
/*
* The group capacity is reduced probably because of activity from other
* sched class or interrupts which use part of the available capacity
*/
if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
env->sd->imbalance_pct))
Wouldn't the check on avg_load let us know if we are packing more tasks in this group than its capacity ? Isn't that the metric we are more interested in?
With this patch, we don't want to pack but we prefer to spread the task on another CPU than the one which handles the interruption if latter uses a significant part of the CPU capacity.
return true;
return false;
}
@@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env) struct sched_domain *sd = env->sd;
if (env->idle == CPU_NEWLY_IDLE) {
int src_cpu = env->src_cpu; /* * ASYM_PACKING needs to force migrate tasks from busy but * higher numbered CPUs in order to pack all tasks in the * lowest numbered CPUs. */
if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
return 1;
/*
* If the CPUs share their cache and the src_cpu's capacity is
* reduced because of other sched_class or IRQs, we trig an
* active balance to move the task
*/
if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
sd->imbalance_pct)) return 1; }
@@ -6643,6 +6661,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
schedstat_add(sd, lb_imbalance[idle], env.imbalance);
env.src_cpu = busiest->cpu;
ld_moved = 0; if (busiest->nr_running > 1) { /*
@@ -6652,7 +6672,6 @@ static int load_balance(int this_cpu, struct rq *this_rq, * correctly treated as an imbalance. */ env.flags |= LBF_ALL_PINNED;
env.src_cpu = busiest->cpu; env.src_rq = busiest; env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
@@ -7359,10 +7378,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
/*
- Current heuristic for kicking the idle load balancer in the presence
- of an idle cpu is the system.
- of an idle cpu in the system.
- This rq has more than one task.
- At any scheduler domain level, this cpu's scheduler group has multiple
busy cpu's exceeding the group's capacity.
- This rq has at least one CFS task and the capacity of the CPU is
significantly reduced because of RT tasks or IRQs.
- At parent of LLC scheduler domain level, this cpu's scheduler group has
*/
multiple busy cpu.
- For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
domain span are idle.
@@ -7372,9 +7393,10 @@ static inline int nohz_kick_needed(struct rq *rq) struct sched_domain *sd; struct sched_group_capacity *sgc; int nr_busy, cpu = rq->cpu;
bool kick = false; if (unlikely(rq->idle_balance))
return 0;
return false; /* * We may be recently in ticked or tickless idle mode. At the first
@@ -7388,38 +7410,45 @@ static inline int nohz_kick_needed(struct rq *rq) * balancing. */ if (likely(!atomic_read(&nohz.nr_cpus)))
return 0;
return false; if (time_before(now, nohz.next_balance))
return 0;
return false; if (rq->nr_running >= 2)
Will this check ^^ not catch those cases which this patch is targeting?
This patch is not about how many tasks are running but if the capacity of the CPU is reduced because of side activity like interruptions which are only visible in the capacity value (with IRQ_TIME_ACCOUNTING config) but not in nr_running. Even if the capacity is reduced because of RT tasks, nothing ensures that the RT task will be running when the tick fires
Regards, Vincent
Regards Preeti U Murthy
goto need_kick;
return true; rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_busy, cpu));
if (sd) { sgc = sd->groups->sgc; nr_busy = atomic_read(&sgc->nr_busy_cpus);
if (nr_busy > 1)
goto need_kick_unlock;
if (nr_busy > 1) {
kick = true;
goto unlock;
}
}
sd = rcu_dereference(per_cpu(sd_asym, cpu));
sd = rcu_dereference(rq->sd);
if (sd) {
if ((rq->cfs.h_nr_running >= 1) &&
((rq->cpu_capacity * sd->imbalance_pct) <
(rq->cpu_capacity_orig * 100))) {
Ok I understand your explanation above. But I was wondering if you would need to add this check around rq->cfs.h_nr_running >= 1 in the above two cases as well.
Yes, you're right for the test if (rq->nr_running >= 2).
It's not so straightforward for nr_busy_cpus, which reflects how many CPUs have not stopped their tick; the scheduler assumes that the latter are busy with cfs tasks.
I have actually raised this concern over whether we should be using rq->nr_running or cfs_rq->nr_running while we do load balancing in reply to your patch3. While all our load measurements are about the cfs_rq
I have just replied to your comments on patch 3. Sorry for the delay.
load, we use rq->nr_running, which may include tasks from other sched classes as well. We divide them to get average load per task during load balancing which is wrong, isn't it? Similarly during nohz_kick_needed(), we trigger load balancing based on rq->nr_running again.
In this patch too, even if you know that the cpu is being dominated by tasks that do not belong to cfs class, you would be triggering a futile load balance if there are no fair tasks to move.
This patch adds one additional condition that is based on rq->cfs.h_nr_running, so it should not trigger any futile load balance. I have also taken advantage of this patch to clean up nohz_kick_needed() as proposed by Peter, but the conditions are the same as previously (except the one with rq->cfs.h_nr_running).
I can prepare another patchset that will address the concerns that you raised for nohz_kick_needed() and in patch 3, but I would prefer not to include them in this patchset, which is large enough and whose subject is a bit different. Does that seem OK to you?
Regards, Vincent
Regards Preeti U Murthy
On 09/03/2014 05:14 PM, Vincent Guittot wrote:
On 3 September 2014 11:11, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
On 09/01/2014 02:15 PM, Vincent Guittot wrote:
On 30 August 2014 19:50, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
index 18db43e..60ae1ce 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env, return true; }
/*
* The group capacity is reduced probably because of activity from other
* sched class or interrupts which use part of the available capacity
*/
if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
env->sd->imbalance_pct))
Wouldn't the check on avg_load let us know if we are packing more tasks in this group than its capacity ? Isn't that the metric we are more interested in?
With this patch, we don't want to pack but we prefer to spread the task on another CPU than the one which handles the interruption if latter uses a significant part of the CPU capacity.
return true;
return false;
}
@@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env) struct sched_domain *sd = env->sd;
if (env->idle == CPU_NEWLY_IDLE) {
int src_cpu = env->src_cpu; /* * ASYM_PACKING needs to force migrate tasks from busy but * higher numbered CPUs in order to pack all tasks in the * lowest numbered CPUs. */
if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
return 1;
/*
* If the CPUs share their cache and the src_cpu's capacity is
* reduced because of other sched_class or IRQs, we trig an
* active balance to move the task
*/
if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
sd->imbalance_pct)) return 1; }
@@ -6643,6 +6661,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
schedstat_add(sd, lb_imbalance[idle], env.imbalance);
env.src_cpu = busiest->cpu;
ld_moved = 0; if (busiest->nr_running > 1) { /*
@@ -6652,7 +6672,6 @@ static int load_balance(int this_cpu, struct rq *this_rq, * correctly treated as an imbalance. */ env.flags |= LBF_ALL_PINNED;
env.src_cpu = busiest->cpu; env.src_rq = busiest; env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
@@ -7359,10 +7378,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
/*
- Current heuristic for kicking the idle load balancer in the presence
- of an idle cpu is the system.
- of an idle cpu in the system.
- This rq has more than one task.
- At any scheduler domain level, this cpu's scheduler group has multiple
busy cpu's exceeding the group's capacity.
- This rq has at least one CFS task and the capacity of the CPU is
significantly reduced because of RT tasks or IRQs.
- At parent of LLC scheduler domain level, this cpu's scheduler group has
*/
multiple busy cpu.
- For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
domain span are idle.
@@ -7372,9 +7393,10 @@ static inline int nohz_kick_needed(struct rq *rq) struct sched_domain *sd; struct sched_group_capacity *sgc; int nr_busy, cpu = rq->cpu;
bool kick = false; if (unlikely(rq->idle_balance))
return 0;
return false; /* * We may be recently in ticked or tickless idle mode. At the first
@@ -7388,38 +7410,45 @@ static inline int nohz_kick_needed(struct rq *rq) * balancing. */ if (likely(!atomic_read(&nohz.nr_cpus)))
return 0;
return false; if (time_before(now, nohz.next_balance))
return 0;
return false; if (rq->nr_running >= 2)
Will this check ^^ not catch those cases which this patch is targeting?
This patch is not about how many tasks are running but if the capacity of the CPU is reduced because of side activity like interruptions which are only visible in the capacity value (with IRQ_TIME_ACCOUNTING config) but not in nr_running. Even if the capacity is reduced because of RT tasks, nothing ensures that the RT task will be running when the tick fires
Regards, Vincent
Regards Preeti U Murthy
goto need_kick;
return true; rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_busy, cpu));
if (sd) { sgc = sd->groups->sgc; nr_busy = atomic_read(&sgc->nr_busy_cpus);
if (nr_busy > 1)
goto need_kick_unlock;
if (nr_busy > 1) {
kick = true;
goto unlock;
}
}
sd = rcu_dereference(per_cpu(sd_asym, cpu));
sd = rcu_dereference(rq->sd);
if (sd) {
if ((rq->cfs.h_nr_running >= 1) &&
((rq->cpu_capacity * sd->imbalance_pct) <
(rq->cpu_capacity_orig * 100))) {
Ok I understand your explanation above. But I was wondering if you would need to add this check around rq->cfs.h_nr_running >= 1 in the above two cases as well.
yes you're right for the test if (rq->nr_running >= 2).
It's not so straight forward for nr_busy_cpus which reflects how many CPUs have not stopped their tick. The scheduler assumes that the latter are busy with cfs tasks
I have actually raised this concern over whether we should be using rq->nr_running or cfs_rq->nr_running while we do load balancing in reply to your patch3. While all our load measurements are about the cfs_rq
I have just replied to your comments on patch 3. Sorry for the delay
load, we use rq->nr_running, which may include tasks from other sched classes as well. We divide them to get average load per task during load balancing which is wrong, isn't it? Similarly during nohz_kick_needed(), we trigger load balancing based on rq->nr_running again.
In this patch too, even if you know that the cpu is being dominated by tasks that do not belong to cfs class, you would be triggering a futile load balance if there are no fair tasks to move.
This patch adds one additional condition that is based on rq->cfs.h_nr_running so it should not trigger any futile load balance. Then, I have also take advantage of this patch to clean up nohz_kick_needed as proposed by Peter but the conditions are the same than previously (except the one with rq->cfs.h_nr_running)
I can prepare another patchset that will solve the concerns that you raised for nohz_kick_needed and in patch 3 but i would prefer not include them in this patchset which is large enough and which subject is a bit different. Does it seem ok for you ?
Sure Vincent, thanks! I have in fact sent out a mail raising my concern over rq->nr_running. If others agree that the issue exists, maybe we can work on this in the next patchset, which can clean this up in all the places necessary and not just in nohz_kick_needed().
Regards Preeti U Murthy
Regards, Vincent
Regards Preeti U Murthy
On 3 September 2014 14:26, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
On 09/03/2014 05:14 PM, Vincent Guittot wrote:
On 3 September 2014 11:11, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
On 09/01/2014 02:15 PM, Vincent Guittot wrote:
[snip]
Ok I understand your explanation above. But I was wondering if you would need to add this check around rq->cfs.h_nr_running >= 1 in the above two cases as well.
yes you're right for the test if (rq->nr_running >= 2).
It's not so straight forward for nr_busy_cpus which reflects how many CPUs have not stopped their tick. The scheduler assumes that the latter are busy with cfs tasks
I have actually raised this concern over whether we should be using rq->nr_running or cfs_rq->nr_running while we do load balancing in reply to your patch3. While all our load measurements are about the cfs_rq
I have just replied to your comments on patch 3. Sorry for the delay
load, we use rq->nr_running, which may include tasks from other sched classes as well. We divide them to get average load per task during load balancing which is wrong, isn't it? Similarly during nohz_kick_needed(), we trigger load balancing based on rq->nr_running again.
In this patch too, even if you know that the cpu is being dominated by tasks that do not belong to cfs class, you would be triggering a futile load balance if there are no fair tasks to move.
This patch adds one additional condition that is based on rq->cfs.h_nr_running so it should not trigger any futile load balance. Then, I have also take advantage of this patch to clean up nohz_kick_needed as proposed by Peter but the conditions are the same than previously (except the one with rq->cfs.h_nr_running)
I can prepare another patchset that will solve the concerns that you raised for nohz_kick_needed and in patch 3 but i would prefer not include them in this patchset which is large enough and which subject is a bit different. Does it seem ok for you ?
Sure Vincent, thanks! I have in fact sent out a mail raising my concern over rq->nr_running. If others agree on the issue to be existing, maybe we can work on this next patchset that can clean this up in all places necessary and not just in nohz_kick_needed().
OK, let's continue this discussion on the other thread.
Regards, Vincent
Regards Preeti U Murthy
Regards, Vincent
Regards Preeti U Murthy
On Wed, Sep 03, 2014 at 05:56:04PM +0530, Preeti U Murthy wrote:
On 09/03/2014 05:14 PM, Vincent Guittot wrote:
On 3 September 2014 11:11, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
On 09/01/2014 02:15 PM, Vincent Guittot wrote:
On 30 August 2014 19:50, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
index 18db43e..60ae1ce 100644
Guys, really, trim excess crap. This email had over 150 lines of garbage and 5 pointless quote levels or so before there was anything useful.
Hi Vincent,
On 08/26/2014 04:36 PM, Vincent Guittot wrote:
If the CPU is used for handling lot of IRQs, trig a load balance to check if it's worth moving its tasks on another CPU that has more capacity.
As a sidenote, this will note generate more spurious ilb because we already trig an ilb if there is more than 1 busy cpu. If this cpu is the only one that has a task, we will trig the ilb once for migrating the task.
The nohz_kick_needed function has been cleaned up a bit while adding the new test
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
So I see that there are added checks in your previous patches on whether the CPU capacity for CFS tasks is good enough to run tasks on the CPU. My concern is: although they appear sensible, would they trigger a large increase in the number of times we load balance?
Ebizzy would not test this aspect, right? There are no real-time tasks/interrupts generated.
Besides, what is the column that says patchset+irq? What is the irq accounting patchset that you refer to in your cover letter?
Regards Preeti U Murthy
On 5 September 2014 14:06, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
On 08/26/2014 04:36 PM, Vincent Guittot wrote:
If the CPU is used for handling lot of IRQs, trig a load balance to check if it's worth moving its tasks on another CPU that has more capacity.
As a sidenote, this will note generate more spurious ilb because we already trig an ilb if there is more than 1 busy cpu. If this cpu is the only one that has a task, we will trig the ilb once for migrating the task.
The nohz_kick_needed function has been cleaned up a bit while adding the new test
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
So I see that there are added checks in your previous patches on if the cpu capacity for CFS tasks is good enough to run tasks on the cpu. My concern is although they appear sensible, would they trigger an increase in the number of times we load balance to a large extent.
Ebizzy would not test this aspect right? There are no real time tasks/interrupts that get generated.
Yes, ebizzy doesn't test this part but checks for non-regression.
The scp test is the one that I use to check this patch and the previous one, but a test with some cfs and rt tasks should also do the job. We can see an increase of 82% for the dual core when CONFIG_IRQ_TIME_ACCOUNTING is enabled.
Besides, what is the column that says patchset+irq? What is the irq accounting patchset that you refer to in your cover letter?
It refers to CONFIG_IRQ_TIME_ACCOUNTING, which includes the time spent in interrupt context in the computation of scale_rt_capacity().
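Roughly, the mechanism is the following (a simplified sketch of the scale_rt_capacity() idea, not the exact kernel code): the time accounted to RT and, with CONFIG_IRQ_TIME_ACCOUNTING, to IRQ context is removed from the capacity left for CFS, so cpu_capacity drops below cpu_capacity_orig.

#include <linux/math64.h>

/*
 * Sketch only: @rt_avg is the decayed RT+IRQ time over the averaging
 * @period; what remains of @capacity_orig is the capacity seen by CFS.
 */
static unsigned long cfs_capacity_sketch(unsigned long capacity_orig,
					 u64 rt_avg, u64 period)
{
	u64 available = (rt_avg < period) ? period - rt_avg : 0;

	return div64_u64(capacity_orig * available, period);
}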
Regards, Vincent
Regards Preeti U Murthy
On Tue, Aug 26, 2014 at 01:06:51PM +0200, Vincent Guittot wrote:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 18db43e..60ae1ce 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env, return true; }
- /*
* The group capacity is reduced probably because of activity from other
* sched class or interrupts which use part of the available capacity
*/
- if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
env->sd->imbalance_pct))
return true;
- return false;
}
This is unlikely to have worked as intended. You will never reach this, except on PowerPC >= 7. All other machines will have bailed at the !ASYM_PACKING check above this.
On 11 September 2014 12:07, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Aug 26, 2014 at 01:06:51PM +0200, Vincent Guittot wrote:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 18db43e..60ae1ce 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env, return true; }
/*
* The group capacity is reduced probably because of activity from other
* sched class or interrupts which use part of the available capacity
*/
if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
env->sd->imbalance_pct))
return true;
return false;
}
This is unlikely to have worked as intended. You will never reach this, except on PowerPC >= 7. All other machines will have bailed at the !ASYM_PACKING check above this.
Ah yes, I missed that change while rebasing on Rik's patches. My use case falls into this wider test now that we always select a busiest group.
On Tue, Aug 26, 2014 at 01:06:51PM +0200, Vincent Guittot wrote:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 18db43e..60ae1ce 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env, return true; }
- /*
* The group capacity is reduced probably because of activity from other
* sched class or interrupts which use part of the available capacity
*/
- if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
env->sd->imbalance_pct))
return true;
- return false;
} @@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env) struct sched_domain *sd = env->sd; if (env->idle == CPU_NEWLY_IDLE) {
int src_cpu = env->src_cpu;
/* * ASYM_PACKING needs to force migrate tasks from busy but * higher numbered CPUs in order to pack all tasks in the * lowest numbered CPUs. */
if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
return 1;
/*
* If the CPUs share their cache and the src_cpu's capacity is
* reduced because of other sched_class or IRQs, we trig an
* active balance to move the task
*/
if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
}sd->imbalance_pct)) return 1;
Should you not also check -- in both cases -- that the destination is any better?
Also, there's some obvious repetition going on there, maybe add a helper?
Also, both sites should probably ensure they're operating in the non-saturated/overloaded scenario. Because as soon as we're completely saturated we want SMP nice etc. and that all already works right (presumably).
On 11 September 2014 12:13, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Aug 26, 2014 at 01:06:51PM +0200, Vincent Guittot wrote:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 18db43e..60ae1ce 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env, return true; }
/*
* The group capacity is reduced probably because of activity from other
* sched class or interrupts which use part of the available capacity
*/
if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
env->sd->imbalance_pct))
return true;
return false;
}
@@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env) struct sched_domain *sd = env->sd;
if (env->idle == CPU_NEWLY_IDLE) {
int src_cpu = env->src_cpu; /* * ASYM_PACKING needs to force migrate tasks from busy but * higher numbered CPUs in order to pack all tasks in the * lowest numbered CPUs. */
if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
return 1;
/*
* If the CPUs share their cache and the src_cpu's capacity is
* reduced because of other sched_class or IRQs, we trig an
* active balance to move the task
*/
if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
sd->imbalance_pct)) return 1; }
Should you not also check -- in both cases -- that the destination is any better?
That case should have been handled earlier when calculating the imbalance, which should be null if the destination is worse than the source.
But I haven't formally checked that calculate_imbalance() correctly handles that case.
Also, there's some obvious repetition going on there, maybe add a helper?
yes
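Something along these lines could work (only a sketch; the helper name check_cpu_capacity() is a suggestion, not code from this series):

/*
 * Sketch of a possible helper: true when the capacity left for CFS
 * tasks on @cpu is significantly lower than its original capacity,
 * i.e. RT tasks or IRQs use a noticeable part of the CPU.
 */
static inline bool check_cpu_capacity(int cpu, struct sched_domain *sd)
{
	return (capacity_of(cpu) * sd->imbalance_pct) <
	       (capacity_orig_of(cpu) * 100);
}

Both the need_active_balance() and nohz_kick_needed() sites could then call it instead of open-coding the comparison.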
Also, both sites should probably ensure they're operating in the non-saturated/overloaded scenario. Because as soon as we're completely saturated we want SMP nice etc. and that all already works right (presumably).
If both are overloaded, calculate_imbalance() will cap the max load that can be pulled, so the busiest group will not become idle.
On Tue, Aug 26, 2014 at 01:06:51PM +0200, Vincent Guittot wrote:
if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
sd->imbalance_pct))
Note that capacity_orig_of() is introduced a lot later on in the series. Patch 10 iirc, so this will not actually compile.
Add new statistics which reflect the average time a task is running on the CPU (usage_avg_contrib) and the sum of these running times on a runqueue (usage_load_avg).
This patch is based on the usage metric that was proposed in the first versions of the per-entity load tracking patchset but that was removed afterwards. This version differs from the original one in that it is not linked to task_group.
The rq's usage will be used to check whether a rq is overloaded or not, instead of trying to compute how many tasks a group of CPUs can handle.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- include/linux/sched.h | 4 ++-- kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++----- kernel/sched/sched.h | 2 +- 3 files changed, 45 insertions(+), 8 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 5c2c885..7dfd584 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1073,10 +1073,10 @@ struct sched_avg { * above by 1024/(1-y). Thus we only need a u32 to store them for all * choices of y < 1-2^(-32)*1024. */ - u32 runnable_avg_sum, runnable_avg_period; + u32 runnable_avg_sum, runnable_avg_period, running_avg_sum; u64 last_runnable_update; s64 decay_count; - unsigned long load_avg_contrib; + unsigned long load_avg_contrib, usage_avg_contrib; };
#ifdef CONFIG_SCHEDSTATS diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 60ae1ce..1fd2131 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -676,7 +676,7 @@ void init_task_runnable_average(struct task_struct *p)
p->se.avg.decay_count = 0; slice = sched_slice(task_cfs_rq(p), &p->se) >> 10; - p->se.avg.runnable_avg_sum = slice; + p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice; p->se.avg.runnable_avg_period = slice; __update_task_entity_contrib(&p->se); } @@ -2289,7 +2289,8 @@ static u32 __compute_runnable_contrib(u64 n) */ static __always_inline int __update_entity_runnable_avg(u64 now, struct sched_avg *sa, - int runnable) + int runnable, + int running) { u64 delta, periods; u32 runnable_contrib; @@ -2328,6 +2329,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now, delta_w = 1024 - delta_w; if (runnable) sa->runnable_avg_sum += delta_w; + if (running) + sa->running_avg_sum += delta_w; sa->runnable_avg_period += delta_w;
delta -= delta_w; @@ -2338,6 +2341,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum, periods + 1); + sa->running_avg_sum = decay_load(sa->running_avg_sum, + periods + 1); sa->runnable_avg_period = decay_load(sa->runnable_avg_period, periods + 1);
@@ -2345,12 +2350,16 @@ static __always_inline int __update_entity_runnable_avg(u64 now, runnable_contrib = __compute_runnable_contrib(periods); if (runnable) sa->runnable_avg_sum += runnable_contrib; + if (running) + sa->running_avg_sum += runnable_contrib; sa->runnable_avg_period += runnable_contrib; }
/* Remainder of delta accrued against u_0` */ if (runnable) sa->runnable_avg_sum += delta; + if (running) + sa->running_avg_sum += delta; sa->runnable_avg_period += delta;
return decayed; @@ -2490,6 +2499,27 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se) return se->avg.load_avg_contrib - old_contrib; }
+ +static inline void __update_task_entity_usage(struct sched_entity *se) +{ + u32 contrib; + + /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */ + contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE); + contrib /= (se->avg.runnable_avg_period + 1); + se->avg.usage_avg_contrib = scale_load(contrib); +} + +static long __update_entity_usage_avg_contrib(struct sched_entity *se) +{ + long old_contrib = se->avg.usage_avg_contrib; + + if (entity_is_task(se)) + __update_task_entity_usage(se); + + return se->avg.usage_avg_contrib - old_contrib; +} + static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq, long load_contrib) { @@ -2506,7 +2536,7 @@ static inline void update_entity_load_avg(struct sched_entity *se, int update_cfs_rq) { struct cfs_rq *cfs_rq = cfs_rq_of(se); - long contrib_delta; + long contrib_delta, usage_delta; u64 now;
/* @@ -2518,16 +2548,20 @@ static inline void update_entity_load_avg(struct sched_entity *se, else now = cfs_rq_clock_task(group_cfs_rq(se));
- if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq)) + if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq, + cfs_rq->curr == se)) return;
contrib_delta = __update_entity_load_avg_contrib(se); + usage_delta = __update_entity_usage_avg_contrib(se);
if (!update_cfs_rq) return;
- if (se->on_rq) + if (se->on_rq) { cfs_rq->runnable_load_avg += contrib_delta; + cfs_rq->usage_load_avg += usage_delta; + } else subtract_blocked_load_contrib(cfs_rq, -contrib_delta); } @@ -2604,6 +2638,7 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq, }
cfs_rq->runnable_load_avg += se->avg.load_avg_contrib; + cfs_rq->usage_load_avg += se->avg.usage_avg_contrib; /* we force update consideration on load-balancer moves */ update_cfs_rq_blocked_load(cfs_rq, !wakeup); } @@ -2622,6 +2657,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq, update_cfs_rq_blocked_load(cfs_rq, !sleep);
cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib; + cfs_rq->usage_load_avg -= se->avg.usage_avg_contrib; if (sleep) { cfs_rq->blocked_load_avg += se->avg.load_avg_contrib; se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter); @@ -2959,6 +2995,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) */ update_stats_wait_end(cfs_rq, se); __dequeue_entity(cfs_rq, se); + update_entity_load_avg(se, 1); }
update_stats_curr_start(cfs_rq, se); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 7c0a74e..d625fbb 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -340,7 +340,7 @@ struct cfs_rq { * This allows for the description of both thread and group usage (in * the FAIR_GROUP_SCHED case). */ - unsigned long runnable_load_avg, blocked_load_avg; + unsigned long runnable_load_avg, blocked_load_avg, usage_load_avg; atomic64_t decay_counter; u64 last_decay; atomic_long_t removed_load;
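As a quick sanity check of the arithmetic in __update_task_entity_usage() above (the numbers below are purely illustrative): a task that has been running for about half of its tracked period ends up with a usage_avg_contrib of roughly half of SCHED_LOAD_SCALE.

/*
 * Illustrative only: mirrors __update_task_entity_usage() with
 * SCHED_LOAD_SCALE = 1024 and hypothetical decayed sums.
 */
static u32 usage_contrib_example(void)
{
	u32 running_avg_sum = 23000;	/* running ~half of the time */
	u32 runnable_avg_period = 46000;

	return running_avg_sum * 1024 / (runnable_avg_period + 1); /* == 511 */
}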
The update of the update_rq_runnable_avg() interface was missing for CONFIG_FAIR_GROUP_SCHED in the original patch "[PATCH v5 09/12] sched: add usage_load_avg".
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org ---
Hi Peter,
Do you prefer that I send a new version of the "[PATCH v5 09/12] sched: add usage_load_avg" patch with the fix, or is this fix enough?
Regards, Vincent
kernel/sched/fair.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1fd2131..cbe3eff 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2462,7 +2462,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
static inline void update_rq_runnable_avg(struct rq *rq, int runnable) { - __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable); + __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable, + runnable); __update_tg_runnable_avg(&rq->avg, &rq->cfs); } #else /* CONFIG_FAIR_GROUP_SCHED */
On Thu, Sep 04, 2014 at 09:34:39AM +0200, Vincent Guittot wrote:
The update of update_rq_runnable_avg interface is missing for CONFIG_FAIR_GROUP_SCHED in the original patch "[PATCH v5 09/12] sched: add usage_load_avg"
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Hi Peter,
Do you prefer that I sent a new version of "[PATCH v5 09/12] sched: add usage_load_avg" patch with the fix or this fix is enough ?
This is fine, I hand edited the previous patch.
On Tue, Aug 26, 2014 at 01:06:52PM +0200, Vincent Guittot wrote:
index 5c2c885..7dfd584 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1073,10 +1073,10 @@ struct sched_avg { * above by 1024/(1-y). Thus we only need a u32 to store them for all * choices of y < 1-2^(-32)*1024. */
- u32 runnable_avg_sum, runnable_avg_period;
- u32 runnable_avg_sum, runnable_avg_period, running_avg_sum;
Seeing how we use runnable_avg_period for both runnable and running, does it make sense to remove the runnable part of it from the name?
Also, 4 byte hole here, not sure we've got anything useful to stuff in it though.
u64 last_runnable_update; s64 decay_count;
- unsigned long load_avg_contrib;
- unsigned long load_avg_contrib, usage_avg_contrib;
};
Man, I should go look at Yuyang's rewrite of this all again. I just tried to figure out the decay stuff and my head hurts ;-)
On 11 September 2014 13:17, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Aug 26, 2014 at 01:06:52PM +0200, Vincent Guittot wrote:
index 5c2c885..7dfd584 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1073,10 +1073,10 @@ struct sched_avg { * above by 1024/(1-y). Thus we only need a u32 to store them for all * choices of y < 1-2^(-32)*1024. */
u32 runnable_avg_sum, runnable_avg_period;
u32 runnable_avg_sum, runnable_avg_period, running_avg_sum;
Seeing how we use runnable_avg_period for both runnable and running, does it make sense to remove the runnable part of it from the name?
That's right.
Also, 4 byte hole here, not sure we've got anything useful to stuff in it though.
I can move all the u32 declarations to the end of the struct, unless they have been put before the u64 for a good reason.
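For instance, a layout along these lines (a sketch only) groups the three u32 members at the end; whether it actually shrinks anything depends on the alignment of the enclosing structure, since the padding only moves to the tail:

struct sched_avg {
	u64 last_runnable_update;
	s64 decay_count;
	unsigned long load_avg_contrib, usage_avg_contrib;
	u32 runnable_avg_sum, runnable_avg_period, running_avg_sum;
};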
u64 last_runnable_update; s64 decay_count;
unsigned long load_avg_contrib;
unsigned long load_avg_contrib, usage_avg_contrib;
};
Man, I should go look at Yuyang's rewrite of this all again. I just tried to figure out the decay stuff and my head hurts ;-)
On 11 September 2014 13:17, Peter Zijlstra peterz@infradead.org wrote:
};
Man, I should go look at Yuyang's rewrite of this all again. I just tried to figure out the decay stuff and my head hurts ;-)
Regarding Yuyang's rewrite, I have a patch on top of his patches that adds the running figure, which I haven't sent yet.
On Tue, Aug 26, 2014 at 12:06:52PM +0100, Vincent Guittot wrote:
Add new statistics which reflect the average time a task is running on the CPU and the sum of the tasks' running on a runqueue. The latter is named usage_avg_contrib.
This patch is based on the usage metric that was proposed in the 1st versions of the per-entity load tracking patchset but that has be removed afterward. This version differs from the original one in the sense that it's not linked to task_group.
The rq's usage_avg_contrib will be used to check if a rq is overloaded or not instead of trying to compute how many task a group of CPUs can handle
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
I should have read this patch before I did almost the same patch as part of a series to introduce scale-invariance, which I am about to post :(
The only difference I see is slightly different naming and that, AFAICT, task group usage is not accounted for in this patch. Can we add the support for task groups as well? I can provide a patch based on this one if you want.
Also, since more than half this patch comes directly from PJT's original patch I would add "by Paul Turner pjt@google.com" somewhere in the text above.
[...]
+static inline void __update_task_entity_usage(struct sched_entity *se) +{
- u32 contrib;
- /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
- contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
- contrib /= (se->avg.runnable_avg_period + 1);
- se->avg.usage_avg_contrib = scale_load(contrib);
+}
+static long __update_entity_usage_avg_contrib(struct sched_entity *se) +{
- long old_contrib = se->avg.usage_avg_contrib;
- if (entity_is_task(se))
__update_task_entity_usage(se);
Groups are never updated?
As said above. I have code that does it.
Morten
- return se->avg.usage_avg_contrib - old_contrib;
+}
On 15 September 2014 21:15, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Tue, Aug 26, 2014 at 12:06:52PM +0100, Vincent Guittot wrote:
Add new statistics which reflect the average time a task is running on the CPU and the sum of the tasks' running on a runqueue. The latter is named usage_avg_contrib.
This patch is based on the usage metric that was proposed in the 1st versions of the per-entity load tracking patchset but that has be removed afterward. This version differs from the original one in the sense that it's not linked to task_group.
The rq's usage_avg_contrib will be used to check if a rq is overloaded or not instead of trying to compute how many task a group of CPUs can handle
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
I should have read this patch before I did almost the same patch for as part a series to introduce scale-invariance which I am about to post :(
The only difference I see is slightly different naming and that, AFAICT, task group usage is not accounted for in this patch. Can we add the support for task groups as well? I can provide a patch based on this one if you want.
Also, since more than half this patch comes directly from PJT's original patch I would add "by Paul Turner pjt@google.com" somewhere in the text above.
Sorry, it was obvious to me that the per-entity load tracking patchset was done by PJT, but I can surely replace "the per-entity load tracking patchset" with "the per-entity load tracking patchset by Paul Turner pjt@google.com".
[...]
+static inline void __update_task_entity_usage(struct sched_entity *se) +{
u32 contrib;
/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
contrib /= (se->avg.runnable_avg_period + 1);
se->avg.usage_avg_contrib = scale_load(contrib);
+}
+static long __update_entity_usage_avg_contrib(struct sched_entity *se) +{
long old_contrib = se->avg.usage_avg_contrib;
if (entity_is_task(se))
__update_task_entity_usage(se);
Groups are never updated?
As said above. I have code that does it.
Yes, feel free to send a patch on top of this one that adds the group support.
Vincent
Morten
return se->avg.usage_avg_contrib - old_contrib;
+}
Monitor the utilization level of each group at each sched_domain level. The utilization is the amount of cpu_capacity that is currently used on a CPU or group of CPUs. We use usage_load_avg to evaluate this utilization level. In the special case where the CPU is fully loaded by more than one task, the activity level is set above the cpu_capacity in order to reflect the overload of the CPU.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1fd2131..2f95d1c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4126,6 +4126,11 @@ static unsigned long capacity_of(int cpu) return cpu_rq(cpu)->cpu_capacity; }
+static unsigned long capacity_orig_of(int cpu) +{ + return cpu_rq(cpu)->cpu_capacity_orig; +} + static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu); @@ -4514,6 +4519,17 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }
+static int get_cpu_utilization(int cpu) +{ + unsigned long usage = cpu_rq(cpu)->cfs.usage_load_avg; + unsigned long capacity = capacity_of(cpu); + + if (usage >= SCHED_LOAD_SCALE) + return capacity + 1; + + return (usage * capacity) >> SCHED_LOAD_SHIFT; +} + /* * select_task_rq_fair: Select target runqueue for the waking task in domains * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, @@ -5657,6 +5673,7 @@ struct sg_lb_stats { unsigned long sum_weighted_load; /* Weighted load of group's tasks */ unsigned long load_per_task; unsigned long group_capacity; + unsigned long group_utilization; /* Total utilization of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ unsigned int group_capacity_factor; unsigned int idle_cpus; @@ -6011,6 +6028,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, load = source_load(i, load_idx);
sgs->group_load += load; + sgs->group_utilization += get_cpu_utilization(i); sgs->sum_nr_running += rq->cfs.h_nr_running;
if (rq->nr_running > 1) @@ -6188,7 +6206,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd /* Now, start updating sd_lb_stats */ sds->total_load += sgs->group_load; sds->total_capacity += sgs->group_capacity; - sg = sg->next; } while (sg != env->sd->groups);
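As an illustration of what get_cpu_utilization() from the hunk above returns, here is a hedged user-space transcription with hypothetical inputs instead of the rq fields; the constants are the default 1024/10 values.

/* User-space transcription of get_cpu_utilization() above, with
 * hypothetical inputs instead of the rq fields. */
#include <stdio.h>

#define SCHED_LOAD_SCALE	1024UL
#define SCHED_LOAD_SHIFT	10

static unsigned long cpu_utilization(unsigned long usage_load_avg,
				     unsigned long capacity)
{
	/* saturated (or transiently above 1): report "more than full" */
	if (usage_load_avg >= SCHED_LOAD_SCALE)
		return capacity + 1;

	/* otherwise scale the [0..1024] usage into capacity units */
	return (usage_load_avg * capacity) >> SCHED_LOAD_SHIFT;
}

int main(void)
{
	/* a CPU with 640 of capacity left for cfs, about half busy */
	printf("%lu\n", cpu_utilization(512, 640));	/* 320 */
	/* a saturated CPU is flagged as over capacity */
	printf("%lu\n", cpu_utilization(1100, 640));	/* 641 */
	return 0;
}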
On Tue, Aug 26, 2014 at 01:06:53PM +0200, Vincent Guittot wrote:
Monitor the utilization level of each group of each sched_domain level. The utilization is the amount of cpu_capacity that is currently used on a CPU or group of CPUs. We use the usage_load_avg to evaluate this utilization level. In the special use case where the CPU is fully loaded by more than 1 task, the activity level is set above the cpu_capacity in order to reflect the overload of the CPU
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1fd2131..2f95d1c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4126,6 +4126,11 @@ static unsigned long capacity_of(int cpu) return cpu_rq(cpu)->cpu_capacity; }
+static unsigned long capacity_orig_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_capacity_orig;
+}
static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu);
This hunk should probably go into patch 6.
@@ -4514,6 +4519,17 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }
+static int get_cpu_utilization(int cpu)
+{
+	unsigned long usage = cpu_rq(cpu)->cfs.usage_load_avg;
+	unsigned long capacity = capacity_of(cpu);
+
+	if (usage >= SCHED_LOAD_SCALE)
+		return capacity + 1;
+
+	return (usage * capacity) >> SCHED_LOAD_SHIFT;
+}
So if I understood patch 9 correctly, your changelog is iffy. usage_load_avg should never get > 1 (of whatever unit), no matter how many tasks are on the rq. You can only maximally run all the time.
Therefore I can only interpret the if (usage >= SCHED_LOAD_SCALE) as numerical error handling, nothing more.
Also I'm not entirely sure I like the usage, utilization names/metrics. I would suggest to reverse them. Call the pure running number 'utilization' and this scaled with capacity 'usage' or so.
@@ -6188,7 +6206,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd /* Now, start updating sd_lb_stats */ sds->total_load += sgs->group_load; sds->total_capacity += sgs->group_capacity;
- sg = sg->next; } while (sg != env->sd->groups);
I like that extra line of whitespace, it separates the body from the loop itself.
On 11 September 2014 14:34, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Aug 26, 2014 at 01:06:53PM +0200, Vincent Guittot wrote:
Monitor the utilization level of each group of each sched_domain level. The utilization is the amount of cpu_capacity that is currently used on a CPU or group of CPUs. We use the usage_load_avg to evaluate this utilization level. In the special use case where the CPU is fully loaded by more than 1 task, the activity level is set above the cpu_capacity in order to reflect the overload of the CPU
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1fd2131..2f95d1c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4126,6 +4126,11 @@ static unsigned long capacity_of(int cpu) return cpu_rq(cpu)->cpu_capacity; }
+static unsigned long capacity_orig_of(int cpu) +{
return cpu_rq(cpu)->cpu_capacity_orig;
+}
static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu);
This hunk should probably go into patch 6.
@@ -4514,6 +4519,17 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }
+static int get_cpu_utilization(int cpu) +{
unsigned long usage = cpu_rq(cpu)->cfs.usage_load_avg;
unsigned long capacity = capacity_of(cpu);
if (usage >= SCHED_LOAD_SCALE)
return capacity + 1;
return (usage * capacity) >> SCHED_LOAD_SHIFT;
+}
So if I understood patch 9 correct, your changelog is iffy. usage_load_avg should never get > 1 (of whatever unit), no matter how many tasks are on the rq. You can only maximally run all the time.
Therefore I can only interpret the if (usage >= SCHED_LOAD_SCALE) as numerical error handling, nothing more.
yes
Also I'm not entirely sure I like the usage, utilization names/metrics. I would suggest to reverse them. Call the pure running number 'utilization' and this scaled with capacity 'usage' or so.
ok. i can invert 'usage' and 'utilization', which will give
s/get_cpu_utilization/get_cpu_usage/
s/sgs->group_utilization/sgs->group_usage/
s/cfs.usage_load_avg/cfs.utilization_load_avg/
s/se->avg.usage_avg_contrib/se->avg.utilization_avg_contrib
s/__update_task_entity_usage/__update_task_entity_utilization
s/__update_entity_usage_avg_contrib/__update_entity_utilization_avg_contrib
@@ -6188,7 +6206,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd /* Now, start updating sd_lb_stats */ sds->total_load += sgs->group_load; sds->total_capacity += sgs->group_capacity;
sg = sg->next; } while (sg != env->sd->groups);
I like that extra line of whitespace, it separates the body from the loop itself.
Sorry, I don't know why I removed it.
On Thu, Sep 11, 2014 at 03:07:52PM +0200, Vincent Guittot wrote:
Also I'm not entirely sure I like the usage, utilization names/metrics. I would suggest to reverse them. Call the pure running number 'utilization' and this scaled with capacity 'usage' or so.
ok. i can invert 'usage' and 'utilization', which will give
s/get_cpu_utilization/get_cpu_usage/ s/sgs->group_utilization/sgs->group_usage/ s/cfs.usage_load_avg/cfs.utilization_load_avg/ s/se->avg.usage_avg_contrib/se->avg.utilization_avg_contrib s/__update_task_entity_usage/__update_task_entity_utilization s/__update_entity_usage_avg_contrib/__update_entity_utilization_avg_contrib
Any other opinions before Vince goes and applies sed on patches? ;-)
On Thu, 11 Sep 2014, Peter Zijlstra wrote:
On Thu, Sep 11, 2014 at 03:07:52PM +0200, Vincent Guittot wrote:
Also I'm not entirely sure I like the usage, utilization names/metrics. I would suggest to reverse them. Call the pure running number 'utilization' and this scaled with capacity 'usage' or so.
ok. i can invert 'usage' and 'utilization', which will give
s/get_cpu_utilization/get_cpu_usage/ s/sgs->group_utilization/sgs->group_usage/ s/cfs.usage_load_avg/cfs.utilization_load_avg/ s/se->avg.usage_avg_contrib/se->avg.utilization_avg_contrib s/__update_task_entity_usage/__update_task_entity_utilization s/__update_entity_usage_avg_contrib/__update_entity_utilization_avg_contrib
Any other opinions before Vince goes and applies sed on patches? ;-)
I don't mind either way, but for sure someone (possibly me) is going to confuse the two soon enough.
Please include in the code some formal definition in the context of the scheduler. A comment block right before the corresponding get_cpu_* accessors should be good enough.
Nicolas
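One possible shape for the comment block Nicolas is asking for, given as a draft only; the wording below is a suggestion and would have to follow whatever naming the thread finally settles on.

/*
 * Draft definitions (suggested wording only, not the final convention):
 *
 * utilization: the fraction of time a CPU (or group) spends running
 *   cfs tasks, tracked with the per-entity geometric series and kept
 *   in the [0..SCHED_LOAD_SCALE] range, independent of the CPU's
 *   compute capacity.
 *
 * usage: the utilization scaled by the CPU's original capacity, i.e.
 *   the amount of capacity actually consumed, directly comparable
 *   against cpu_capacity/capacity_orig.
 */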
On 11 September 2014 21:17, Nicolas Pitre nicolas.pitre@linaro.org wrote:
Any other opinions before Vince goes and applies sed on patches? ;-)
I don't mind either way, but for sure someone (possibly me) is going to confuse the two soon enough.
Please include in the code some formal definition in the context of the scheduler. A comment block right before the corresponding get_cpu_* accessors should be good enough.
ok
Nicolas
On Thu, Sep 11, 2014 at 03:04:44PM +0100, Peter Zijlstra wrote:
On Thu, Sep 11, 2014 at 03:07:52PM +0200, Vincent Guittot wrote:
Also I'm not entirely sure I like the usage, utilization names/metrics. I would suggest to reverse them. Call the pure running number 'utilization' and this scaled with capacity 'usage' or so.
ok. i can invert 'usage' and 'utilization', which will give
s/get_cpu_utilization/get_cpu_usage/ s/sgs->group_utilization/sgs->group_usage/
The confusion will have new dimensions added when we introduce scale-invariance too. Then the running number is already scaled by the current P-state compute capacity. But I don't have any better suggestions.
s/cfs.usage_load_avg/cfs.utilization_load_avg/
I don't like using "load" for unweighted metrics. I associate load with something that may be weighted by priority, like load_avg_contrib, and utilization with pure cpu utilization, as in how many cycles are spent on a particular task. I called it "usage_util_avg" in my own patches, but "util_avg" might be better if we agree that utilization == usage.
s/se->avg.usage_avg_contrib/se->avg.utilization_avg_contrib
util_avg_contrib maybe to keep it shorter.
s/__update_task_entity_usage/__update_task_entity_utilization s/__update_entity_usage_avg_contrib/__update_entity_utilization_avg_contrib
Maybe use "util" here as well?
Morten
On 15 September 2014 12:45, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Thu, Sep 11, 2014 at 03:04:44PM +0100, Peter Zijlstra wrote:
On Thu, Sep 11, 2014 at 03:07:52PM +0200, Vincent Guittot wrote:
Also I'm not entirely sure I like the usage, utilization names/metrics. I would suggest to reverse them. Call the pure running number 'utilization' and this scaled with capacity 'usage' or so.
ok. i can invert 'usage' and 'utilization', which will give
s/get_cpu_utilization/get_cpu_usage/ s/sgs->group_utilization/sgs->group_usage/
The confusion will have new dimensions added when we introduce scale-invariance too. Then the running number is already scaled by the current P-state compute capacity. But I don't have any better suggestions.
s/cfs.usage_load_avg/cfs.utilization_load_avg/
I don't like using "load" for unweighted metrics. I associate load with something that may be weighted by priority, like load_avg_contrib, and utilization with pure cpu utilization, as in how many cycles are spent on a particular task. I called it "usage_util_avg" in my own patches, but "util_avg" might be better if we agree that utilization == usage.
OK, so I don't have the same definition as you. IMHO, "load" should be used for figures that go through the geometric-series averaging of per-entity load tracking, rather than implying that the figures are weighted by priority.
s/se->avg.usage_avg_contrib/se->avg.utilization_avg_contrib
util_avg_contrib maybe to keep it shorter.
s/__update_task_entity_usage/__update_task_entity_utilization s/__update_entity_usage_avg_contrib/__update_entity_utilization_avg_contrib
Maybe use "util" here as well?
I agree that "utilization" can be a bit too long, but "util" sounds a bit too short and we lose the utilization meaning, so we could use "activity" instead of "utilization".
Nevertheless, the most important thing is that we agree on a common naming convention.
Would the following proposal be ok ?
s/get_cpu_utilization/get_cpu_usage/
s/sgs->group_utilization/sgs->group_usage/
s/cfs.usage_load_avg/cfs.activity_load_avg/
s/se->avg.usage_avg_contrib/se->avg.activity_avg_contrib
s/__update_task_entity_usage/__update_task_entity_activity
s/__update_entity_usage_avg_contrib/__update_entity_activity_avg_contrib
Vincent
Morten
On Thu, Sep 11, 2014 at 01:34:12PM +0100, Peter Zijlstra wrote:
@@ -4514,6 +4519,17 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }
+static int get_cpu_utilization(int cpu)
+{
+	unsigned long usage = cpu_rq(cpu)->cfs.usage_load_avg;
+	unsigned long capacity = capacity_of(cpu);
+
+	if (usage >= SCHED_LOAD_SCALE)
+		return capacity + 1;
+
+	return (usage * capacity) >> SCHED_LOAD_SHIFT;
+}
So if I understood patch 9 correct, your changelog is iffy. usage_load_avg should never get > 1 (of whatever unit), no matter how many tasks are on the rq. You can only maximally run all the time.
Therefore I can only interpret the if (usage >= SCHED_LOAD_SCALE) as numerical error handling, nothing more.
That is not entirely true, unless you classify transient usage spikes due to task migrations as numerical errors as well.
Since each task sched_entity carries around 350ms worth of execution history with it between different cpus, and cpu utilization is based on the sum of task entity usage_avg_contrib on the runqueue, you may get cfs.usage_load_avg > 1 temporarily after task migrations. It will eventually converge to 1.
The same goes for new tasks, which are initialized to have a usage_avg_contrib of 1 and may be queued on a cpu with tasks already running. In that case cfs.usage_load_avg is temporarily unbounded.
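A toy illustration of the transient overshoot described above (the numbers are invented): the per-cpu sum picks up the whole usage_avg_contrib the migrating task carries, so it can exceed SCHED_LOAD_SCALE until the old history decays.

/* Toy example, not kernel code: cfs.usage_load_avg is a plain sum of
 * usage_avg_contrib values, so migration can push it above 1024. */
#include <stdio.h>

#define SCHED_LOAD_SCALE	1024

int main(void)
{
	int rq_usage = 900;		/* destination CPU already ~88% busy */
	int migrated_contrib = 400;	/* history carried by the incoming task */

	rq_usage += migrated_contrib;
	printf("cfs.usage_load_avg right after migration: %d (> %d)\n",
	       rq_usage, SCHED_LOAD_SCALE);
	/* the sum decays back below SCHED_LOAD_SCALE as the carried history
	 * ages out; get_cpu_utilization() clamps it in the meantime */
	return 0;
}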
Also I'm not entirely sure I like the usage, utilization names/metrics. I would suggest to reverse them. Call the pure running number 'utilization' and this scaled with capacity 'usage' or so.
I can agree with calling the running number 'utilization', but I'm not convinced about capacity. What does it exactly cover here? I'm confused and jetlagged.
Morten
The scheduler tries to compute how many tasks a group of CPUs can handle by assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is SCHED_CAPACITY_SCALE. We can now have a better idea of the capacity of a group of CPUs and of the utilization of this group thanks to the rework of group_capacity_orig and the group_utilization. We can now deduce how much capacity is still available.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 121 ++++++++++++++++++++++------------------------------ 1 file changed, 51 insertions(+), 70 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2f95d1c..80bd64e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5673,13 +5673,13 @@ struct sg_lb_stats { unsigned long sum_weighted_load; /* Weighted load of group's tasks */ unsigned long load_per_task; unsigned long group_capacity; + unsigned long group_capacity_orig; unsigned long group_utilization; /* Total utilization of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ - unsigned int group_capacity_factor; unsigned int idle_cpus; unsigned int group_weight; enum group_type group_type; - int group_has_free_capacity; + int group_out_of_capacity; #ifdef CONFIG_NUMA_BALANCING unsigned int nr_numa_running; unsigned int nr_preferred_running; @@ -5901,31 +5901,6 @@ void update_group_capacity(struct sched_domain *sd, int cpu) }
/* - * Try and fix up capacity for tiny siblings, this is needed when - * things like SD_ASYM_PACKING need f_b_g to select another sibling - * which on its own isn't powerful enough. - * - * See update_sd_pick_busiest() and check_asym_packing(). - */ -static inline int -fix_small_capacity(struct sched_domain *sd, struct sched_group *group) -{ - /* - * Only siblings can have significantly less than SCHED_CAPACITY_SCALE - */ - if (!(sd->flags & SD_SHARE_CPUCAPACITY)) - return 0; - - /* - * If ~90% of the cpu_capacity is still there, we're good. - */ - if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29) - return 1; - - return 0; -} - -/* * Group imbalance indicates (and tries to solve) the problem where balancing * groups is inadequate due to tsk_cpus_allowed() constraints. * @@ -5959,38 +5934,37 @@ static inline int sg_imbalanced(struct sched_group *group) return group->sgc->imbalance; }
-/* - * Compute the group capacity factor. - * - * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by - * first dividing out the smt factor and computing the actual number of cores - * and limit unit capacity with that. - */ -static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group) +static inline int group_has_free_capacity(struct sg_lb_stats *sgs, + struct lb_env *env) { - unsigned int capacity_factor, smt, cpus; - unsigned int capacity, capacity_orig; + if ((sgs->group_capacity_orig * 100) > + (sgs->group_utilization * env->sd->imbalance_pct)) + return 1; + + if (sgs->sum_nr_running < sgs->group_weight) + return 1;
- capacity = group->sgc->capacity; - capacity_orig = group->sgc->capacity_orig; - cpus = group->group_weight; + return 0; +}
- /* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */ - smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig); - capacity_factor = cpus / smt; /* cores */ +static inline int group_is_overloaded(struct sg_lb_stats *sgs, + struct lb_env *env) +{ + if (sgs->sum_nr_running <= sgs->group_weight) + return 0;
- capacity_factor = min_t(unsigned, - capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE)); - if (!capacity_factor) - capacity_factor = fix_small_capacity(env->sd, group); + if ((sgs->group_capacity_orig * 100) < + (sgs->group_utilization * env->sd->imbalance_pct)) + return 1;
- return capacity_factor; + return 0; }
static enum group_type -group_classify(struct sched_group *group, struct sg_lb_stats *sgs) +group_classify(struct sched_group *group, struct sg_lb_stats *sgs, + struct lb_env *env) { - if (sgs->sum_nr_running > sgs->group_capacity_factor) + if (group_is_overloaded(sgs, env)) return group_overloaded;
if (sg_imbalanced(group)) @@ -6043,6 +6017,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->idle_cpus++; }
+ sgs->group_capacity_orig = group->sgc->capacity_orig; /* Adjust by relative CPU capacity of the group */ sgs->group_capacity = group->sgc->capacity; sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity; @@ -6051,11 +6026,10 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
sgs->group_weight = group->group_weight; - sgs->group_capacity_factor = sg_capacity_factor(env, group); - sgs->group_type = group_classify(group, sgs);
- if (sgs->group_capacity_factor > sgs->sum_nr_running) - sgs->group_has_free_capacity = 1; + sgs->group_type = group_classify(group, sgs, env); + + sgs->group_out_of_capacity = group_is_overloaded(sgs, env); }
/** @@ -6185,17 +6159,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* * In case the child domain prefers tasks go to siblings - * first, lower the sg capacity factor to one so that we'll try + * first, lower the sg capacity to one so that we'll try * and move all the excess tasks away. We lower the capacity * of a group only if the local group has the capacity to fit - * these excess tasks, i.e. nr_running < group_capacity_factor. The + * these excess tasks, i.e. group_capacity > 0. The * extra check prevents the case where you always pull from the * heaviest group when it is already under-utilized (possible * with a large weight task outweighs the tasks on the system). */ if (prefer_sibling && sds->local && - sds->local_stat.group_has_free_capacity) - sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U); + group_has_free_capacity(&sds->local_stat, env)) { + if (sgs->sum_nr_running > 1) + sgs->group_out_of_capacity = 1; + sgs->group_capacity = min(sgs->group_capacity, + SCHED_CAPACITY_SCALE); + }
if (update_sd_pick_busiest(env, sds, sg, sgs)) { sds->busiest = sg; @@ -6373,11 +6351,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s */ if (busiest->group_type == group_overloaded && local->group_type == group_overloaded) { - load_above_capacity = - (busiest->sum_nr_running - busiest->group_capacity_factor); - - load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE); - load_above_capacity /= busiest->group_capacity; + load_above_capacity = busiest->sum_nr_running * + SCHED_LOAD_SCALE; + if (load_above_capacity > busiest->group_capacity) + load_above_capacity -= busiest->group_capacity; + else + load_above_capacity = ~0UL; }
/* @@ -6440,6 +6419,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env) local = &sds.local_stat; busiest = &sds.busiest_stat;
+ /* ASYM feature bypasses nice load balance check */ if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) && check_asym_packing(env, &sds)) return sds.busiest; @@ -6460,8 +6440,9 @@ static struct sched_group *find_busiest_group(struct lb_env *env) goto force_balance;
/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */ - if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity && - !busiest->group_has_free_capacity) + if (env->idle == CPU_NEWLY_IDLE && + group_has_free_capacity(local, env) && + busiest->group_out_of_capacity) goto force_balance;
/* @@ -6519,7 +6500,7 @@ static struct rq *find_busiest_queue(struct lb_env *env, int i;
for_each_cpu_and(i, sched_group_cpus(group), env->cpus) { - unsigned long capacity, capacity_factor, wl; + unsigned long capacity, wl; enum fbq_type rt;
rq = cpu_rq(i); @@ -6548,9 +6529,6 @@ static struct rq *find_busiest_queue(struct lb_env *env, continue;
capacity = capacity_of(i); - capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE); - if (!capacity_factor) - capacity_factor = fix_small_capacity(env->sd, group);
wl = weighted_cpuload(i);
@@ -6558,7 +6536,10 @@ static struct rq *find_busiest_queue(struct lb_env *env, * When comparing with imbalance, use weighted_cpuload() * which is not scaled with the cpu capacity. */ - if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance) + + if (rq->nr_running == 1 && wl > env->imbalance && + ((capacity * env->sd->imbalance_pct) >= + (rq->cpu_capacity_orig * 100))) continue;
/*
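To put concrete numbers on the overload test introduced above, here is a standalone sketch of group_is_overloaded() with hypothetical values for a two-CPU little-core group; the imbalance_pct of 125 is just an example.

/* Standalone sketch of the overload test from the patch above, with
 * hypothetical numbers for a 2-CPU little-core group. */
#include <stdio.h>

struct sg_stats {
	unsigned long group_capacity_orig;
	unsigned long group_utilization;
	unsigned int sum_nr_running;
	unsigned int group_weight;
};

static int group_is_overloaded(const struct sg_stats *sgs,
			       unsigned int imbalance_pct)
{
	if (sgs->sum_nr_running <= sgs->group_weight)
		return 0;

	return (sgs->group_capacity_orig * 100) <
	       (sgs->group_utilization * imbalance_pct);
}

int main(void)
{
	/* two little CPUs (original capacity 640 each), three runnable
	 * tasks keeping the group ~95% busy */
	struct sg_stats sgs = {
		.group_capacity_orig = 2 * 640,
		.group_utilization   = 1216,
		.sum_nr_running      = 3,
		.group_weight        = 2,
	};

	printf("overloaded: %d\n", group_is_overloaded(&sgs, 125));	/* 1 */
	return 0;
}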
On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
The scheduler tries to compute how many tasks a group of CPUs can handle by assuming that a task's load is SCHED_LOAD_SCALE and a CPU capacity is SCHED_CAPACITY_SCALE. We can now have a better idea of the capacity of a group of CPUs and of the utilization of this group thanks to the rework of group_capacity_orig and the group_utilization. We can now deduct how many capacity is still available.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
A few minor changes I did while going through it.
--- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5669,7 +5669,7 @@ struct sg_lb_stats { unsigned int idle_cpus; unsigned int group_weight; enum group_type group_type; - int group_out_of_capacity; + int group_no_capacity; #ifdef CONFIG_NUMA_BALANCING unsigned int nr_numa_running; unsigned int nr_preferred_running; @@ -5931,37 +5931,37 @@ static inline int sg_imbalanced(struct s return group->sgc->imbalance; }
-static inline int group_has_free_capacity(struct sg_lb_stats *sgs, - struct lb_env *env) +static inline bool +group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs) { if ((sgs->group_capacity_orig * 100) > (sgs->group_utilization * env->sd->imbalance_pct)) - return 1; + return true;
if (sgs->sum_nr_running < sgs->group_weight) - return 1; + return true;
- return 0; + return false; }
-static inline int group_is_overloaded(struct sg_lb_stats *sgs, - struct lb_env *env) +static inline bool +group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs) { if (sgs->sum_nr_running <= sgs->group_weight) - return 0; + return false;
if ((sgs->group_capacity_orig * 100) < (sgs->group_utilization * env->sd->imbalance_pct)) - return 1; + return true;
- return 0; + return false; }
-static enum group_type -group_classify(struct sched_group *group, struct sg_lb_stats *sgs, - struct lb_env *env) +static enum group_type group_classify(struct lb_env *env, + struct sched_group *group, + struct sg_lb_stats *sgs) { - if (group_is_overloaded(sgs, env)) + if (group_is_overloaded(env, sgs)) return group_overloaded;
if (sg_imbalanced(group)) @@ -6024,9 +6024,8 @@ static inline void update_sg_lb_stats(st
sgs->group_weight = group->group_weight;
- sgs->group_type = group_classify(group, sgs, env); - - sgs->group_out_of_capacity = group_is_overloaded(sgs, env); + sgs->group_type = group_classify(env, group, sgs); + sgs->group_no_capacity = group_is_overloaded(env, sgs); }
/** @@ -6157,9 +6156,9 @@ static inline void update_sd_lb_stats(st * with a large weight task outweighs the tasks on the system). */ if (prefer_sibling && sds->local && - group_has_free_capacity(&sds->local_stat, env)) { + group_has_capacity(env, &sds->local_stat)) { if (sgs->sum_nr_running > 1) - sgs->group_out_of_capacity = 1; + sgs->group_no_capacity = 1; sgs->group_capacity = min(sgs->group_capacity, SCHED_CAPACITY_SCALE); } @@ -6430,9 +6429,8 @@ static struct sched_group *find_busiest_ goto force_balance;
/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */ - if (env->idle == CPU_NEWLY_IDLE && - group_has_free_capacity(local, env) && - busiest->group_out_of_capacity) + if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) && + busiest->group_no_capacity) goto force_balance;
/*
On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
+static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
+					   struct lb_env *env)
+{
+	if ((sgs->group_capacity_orig * 100) >
+			(sgs->group_utilization * env->sd->imbalance_pct))
+		return 1;
+
+	if (sgs->sum_nr_running < sgs->group_weight)
+		return 1;
+
+	return 0;
+}
+
+static inline int group_is_overloaded(struct sg_lb_stats *sgs,
+				      struct lb_env *env)
+{
+	if (sgs->sum_nr_running <= sgs->group_weight)
+		return 0;
+
+	if ((sgs->group_capacity_orig * 100) <
+			(sgs->group_utilization * env->sd->imbalance_pct))
+		return 1;
+
+	return 0;
+}
I'm confused about the utilization vs capacity_orig. I see how we should maybe scale things with the capacity when comparing between CPUs/groups, but not on the same CPU/group.
I would have expected something simple like:
static inline bool group_has_capacity()
{
	/* Is there a spare cycle? */
	if (sgs->group_utilization < sgs->group_weight * SCHED_LOAD_SCALE)
		return true;

	/* Are there less tasks than logical CPUs? */
	if (sgs->sum_nr_running < sgs->group_weight)
		return true;

	return false;
}
Where group_utilization is a pure sum of running_avg.
Now this has a hole when there are RT tasks on the system, in that case the utilization will never hit 1, but we could fix that another way. I don't think the capacity_orig thing is right.
On 11 September 2014 18:15, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
+static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
struct lb_env *env)
{
if ((sgs->group_capacity_orig * 100) >
(sgs->group_utilization * env->sd->imbalance_pct))
return 1;
if (sgs->sum_nr_running < sgs->group_weight)
return 1;
return 0;
+}
+static inline int group_is_overloaded(struct sg_lb_stats *sgs,
struct lb_env *env)
+{
if (sgs->sum_nr_running <= sgs->group_weight)
return 0;
if ((sgs->group_capacity_orig * 100) <
(sgs->group_utilization * env->sd->imbalance_pct))
return 1;
return 0;
}
I'm confused about the utilization vs capacity_orig. I see how we should
The 1st point is that I should compare utilization vs capacity, and not capacity_orig. I should have replaced capacity_orig by capacity in the functions above when I moved the utilization statistic from rq->avg.runnable_avg_sum to cfs.usage_load_avg. rq->avg.runnable_avg_sum was measuring all activity on the cpu, whereas cfs.usage_load_avg integrates only cfs tasks.
With this change, we don't need sgs->group_capacity_orig anymore but only sgs->group_capacity. So sgs->group_capacity_orig can be removed, as it's no longer used in the code now that sg_capacity_factor has been removed.
maybe scale things with the capacity when comparing between CPUs/groups, but not on the same CPU/group.
I would have expected something simple like:
static inline bool group_has_capacity()
{
	/* Is there a spare cycle? */
	if (sgs->group_utilization < sgs->group_weight * SCHED_LOAD_SCALE)
		return true;

	/* Are there less tasks than logical CPUs? */
	if (sgs->sum_nr_running < sgs->group_weight)
		return true;

	return false;
}
Where group_utilization is a pure sum of running_avg.
Now this has a hole when there are RT tasks on the system, in that case the utilization will never hit 1, but we could fix that another way. I don't think the capacity_orig thing is right.
On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
On 11 September 2014 18:15, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
+static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
struct lb_env *env)
{
if ((sgs->group_capacity_orig * 100) >
(sgs->group_utilization * env->sd->imbalance_pct))
return 1;
if (sgs->sum_nr_running < sgs->group_weight)
return 1;
return 0;
+}
+static inline int group_is_overloaded(struct sg_lb_stats *sgs,
struct lb_env *env)
+{
if (sgs->sum_nr_running <= sgs->group_weight)
return 0;
if ((sgs->group_capacity_orig * 100) <
(sgs->group_utilization * env->sd->imbalance_pct))
return 1;
return 0;
}
I'm confused about the utilization vs capacity_orig. I see how we should
1st point is that I should compare utilization vs capacity and not capacity_orig. I should have replaced capacity_orig by capacity in the functions above when i move the utilization statistic from rq->avg.runnable_avg_sum to cfs.usage_load_avg. rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas cfs.usage_load_avg integrates only cfs tasks
With this change, we don't need sgs->group_capacity_orig anymore but only sgs->group_capacity. So sgs->group_capacity_orig can be removed as it's no more used in the code as sg_capacity_factor has been removed
Yes, but.. so I suppose we need to add DVFS accounting and remove cpufreq from the capacity thing. Otherwise I don't see it make sense.
On Sun, Sep 14, 2014 at 09:41:56PM +0200, Peter Zijlstra wrote:
On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
On 11 September 2014 18:15, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
+static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
struct lb_env *env)
{
if ((sgs->group_capacity_orig * 100) >
(sgs->group_utilization * env->sd->imbalance_pct))
return 1;
if (sgs->sum_nr_running < sgs->group_weight)
return 1;
return 0;
+}
+static inline int group_is_overloaded(struct sg_lb_stats *sgs,
struct lb_env *env)
+{
if (sgs->sum_nr_running <= sgs->group_weight)
return 0;
if ((sgs->group_capacity_orig * 100) <
(sgs->group_utilization * env->sd->imbalance_pct))
return 1;
return 0;
}
I'm confused about the utilization vs capacity_orig. I see how we should
1st point is that I should compare utilization vs capacity and not capacity_orig. I should have replaced capacity_orig by capacity in the functions above when i move the utilization statistic from rq->avg.runnable_avg_sum to cfs.usage_load_avg. rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas cfs.usage_load_avg integrates only cfs tasks
With this change, we don't need sgs->group_capacity_orig anymore but only sgs->group_capacity. So sgs->group_capacity_orig can be removed as it's no more used in the code as sg_capacity_factor has been removed
Yes, but.. so I suppose we need to add DVFS accounting and remove cpufreq from the capacity thing. Otherwise I don't see it make sense.
That is to say, please also explain these details in the changelog, and preferably also in a XXX/FIXME/TODO comment near wherever so we don't forget about it.
On Sun, Sep 14, 2014 at 09:41:56PM +0200, Peter Zijlstra wrote:
On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
On 11 September 2014 18:15, Peter Zijlstra peterz@infradead.org wrote:
I'm confused about the utilization vs capacity_orig. I see how we should
1st point is that I should compare utilization vs capacity and not capacity_orig. I should have replaced capacity_orig by capacity in the functions above when i move the utilization statistic from rq->avg.runnable_avg_sum to cfs.usage_load_avg. rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas cfs.usage_load_avg integrates only cfs tasks
With this change, we don't need sgs->group_capacity_orig anymore but only sgs->group_capacity. So sgs->group_capacity_orig can be removed as it's no more used in the code as sg_capacity_factor has been removed
Yes, but.. so I suppose we need to add DVFS accounting and remove cpufreq from the capacity thing. Otherwise I don't see it make sense.
OK, I've reconsidered _again_, I still don't get it.
So fundamentally I think it's wrong to scale with the capacity; it just doesn't make any sense. Consider big.little stuff, their CPUs are inherently asymmetric in capacity, but that doesn't matter one whit for utilization numbers. If a core is fully consumed it's fully consumed, no matter how much work it can or can not do.
So the only thing that needs correcting is the fact that these statistics are based on clock_task and some of that time can end up in other scheduling classes, at which point we'll never get 100% even though we're 'saturated'. But correcting for that using capacity doesn't 'work'.
On Mon, 15 Sep 2014, Peter Zijlstra wrote:
On Sun, Sep 14, 2014 at 09:41:56PM +0200, Peter Zijlstra wrote:
On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
On 11 September 2014 18:15, Peter Zijlstra peterz@infradead.org wrote:
I'm confused about the utilization vs capacity_orig. I see how we should
1st point is that I should compare utilization vs capacity and not capacity_orig. I should have replaced capacity_orig by capacity in the functions above when i move the utilization statistic from rq->avg.runnable_avg_sum to cfs.usage_load_avg. rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas cfs.usage_load_avg integrates only cfs tasks
With this change, we don't need sgs->group_capacity_orig anymore but only sgs->group_capacity. So sgs->group_capacity_orig can be removed as it's no more used in the code as sg_capacity_factor has been removed
Yes, but.. so I suppose we need to add DVFS accounting and remove cpufreq from the capacity thing. Otherwise I don't see it make sense.
OK, I've reconsidered _again_, I still don't get it.
So fundamentally I think its wrong to scale with the capacity; it just doesn't make any sense. Consider big.little stuff, their CPUs are inherently asymmetric in capacity, but that doesn't matter one whit for utilization numbers. If a core is fully consumed its fully consumed, no matter how much work it can or can not do.
Let's suppose a task running on a 1GHz CPU producing a load of 100.
The same task on a 100MHz CPU would produce a load of 1000 because that CPU is 10x slower. So to properly evaluate the load of a task when moving it around, we want to normalize its load based on the CPU performance. In this case the correction factor would be 0.1.
Given those normalized loads, we need to scale CPU capacity as well. If the 1GHz CPU can handle 50 of those tasks it has a capacity of 5000.
In theory the 100MHz CPU could handle only 5 of those tasks, meaning it has a normalized capacity of 500, but only if the load metric is already normalized as well.
Or am I completely missing the point here?
Nicolas
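For what it's worth, the arithmetic in Nicolas's example can be written out as a sketch; the 0.1 correction factor and the task counts are his, the code itself is only an illustration.

/* The arithmetic from Nicolas's example, written out. */
#include <stdio.h>

int main(void)
{
	/* raw load of the same task on two CPUs */
	double load_fast = 100.0;	/* on the 1GHz CPU */
	double load_slow = 1000.0;	/* on the 10x slower 100MHz CPU */

	/* normalize by CPU performance so the task looks the same */
	double norm_slow = load_slow * 0.1;	/* == load_fast */

	/* capacities expressed in the same normalized unit */
	double cap_fast = 50 * load_fast;	/* 5000: fits 50 such tasks */
	double cap_slow = 5 * load_fast;	/* 500: fits 5 such tasks */

	printf("normalized slow-CPU load: %.0f\n", norm_slow);
	printf("capacities: fast=%.0f slow=%.0f\n", cap_fast, cap_slow);
	return 0;
}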
So the only thing that needs correcting is the fact that these statistics are based on clock_task and some of that time can end up in other scheduling classes, at which point we'll never get 100% even though we're 'saturated'. But correcting for that using capacity doesn't 'work'.
On Mon, Sep 15, 2014 at 03:07:44PM -0400, Nicolas Pitre wrote:
On Mon, 15 Sep 2014, Peter Zijlstra wrote:
Let's suppose a task running on a 1GHz CPU producing a load of 100.
The same task on a 100MHz CPU would produce a load of 1000 because that CPU is 10x slower. So to properly evaluate the load of a task when moving it around, we want to normalize its load based on the CPU performance. In this case the correction factor would be 0.1.
Given those normalized loads, we need to scale CPU capacity as well. If the 1GHz CPU can handle 50 of those tasks it has a capacity of 5000.
In theory the 100MHz CPU could handle only 5 of those tasks, meaning it has a normalized capacity of 500, but only if the load metric is already normalized as well.
Or am I completely missing the point here?
So I was thinking of the usage as per the next patch; where we decide if a cpu is 'full' or not based on the utilization measure. For this measure we're not interested in inter CPU relations at all, and any use of capacity scaling simply doesn't make sense.
But I think you asking this question shows a 'bigger' problem in that the Changelogs are entirely failing at describing the actual problem and proposed solution. Because if that were clear, I don't think we would be having this particular discussion.
On Mon, Sep 15, 2014 at 09:01:59PM +0100, Peter Zijlstra wrote:
On Mon, Sep 15, 2014 at 03:07:44PM -0400, Nicolas Pitre wrote:
On Mon, 15 Sep 2014, Peter Zijlstra wrote:
Let's suppose a task running on a 1GHz CPU producing a load of 100.
The same task on a 100MHz CPU would produce a load of 1000 because that CPU is 10x slower. So to properly evaluate the load of a task when moving it around, we want to normalize its load based on the CPU performance. In this case the correction factor would be 0.1.
Given those normalized loads, we need to scale CPU capacity as well. If the 1GHz CPU can handle 50 of those tasks it has a capacity of 5000.
In theory the 100MHz CPU could handle only 5 of those tasks, meaning it has a normalized capacity of 500, but only if the load metric is already normalized as well.
Or am I completely missing the point here?
So I was thinking of the usage as per the next patch; where we decide if a cpu is 'full' or not based on the utilization measure. For this measure we're not interested in inter CPU relations at all, and any use of capacity scaling simply doesn't make sense.
Right. You don't need to scale capacity to determine whether a cpu is full or not if you don't have DVFS, but I don't think it hurts if it is done right. We need the scaling to figure out how much capacity is available.
But I think you asking this question shows a 'bigger' problem in that the Changelogs are entirely failing at describing the actual problem and proposed solution. Because if that were clear, I don't think we would be having this particular discussion.
Yes, the bigger problem of scaling things with DVFS and taking big.LITTLE into account is not addressed in this patch set. This is the scale-invariance problem that we discussed at Ksummit.
I will send a few patches shortly that provide the usage scale-invariance. The last bit missing is then to scale capacity too (in the right places), such that we can determine whether a cpu is full or not, and how much spare capacity it has, by comparing scaled usage (utilization) with scaled capacity. Like in Nico's example above.
Morten
On Wed, Sep 17, 2014 at 07:45:27PM +0100, Morten Rasmussen wrote:
On Mon, Sep 15, 2014 at 09:01:59PM +0100, Peter Zijlstra wrote:
On Mon, Sep 15, 2014 at 03:07:44PM -0400, Nicolas Pitre wrote:
On Mon, 15 Sep 2014, Peter Zijlstra wrote:
Let's suppose a task running on a 1GHz CPU producing a load of 100.
The same task on a 100MHz CPU would produce a load of 1000 because that CPU is 10x slower. So to properly evaluate the load of a task when moving it around, we want to normalize its load based on the CPU performance. In this case the correction factor would be 0.1.
Given those normalized loads, we need to scale CPU capacity as well. If the 1GHz CPU can handle 50 of those tasks it has a capacity of 5000.
In theory the 100MHz CPU could handle only 5 of those tasks, meaning it has a normalized capacity of 500, but only if the load metric is already normalized as well.
Or am I completely missing the point here?
So I was thinking of the usage as per the next patch; where we decide if a cpu is 'full' or not based on the utilization measure. For this measure we're not interested in inter CPU relations at all, and any use of capacity scaling simply doesn't make sense.
Right. You don't need to scale capacity to determine whether a cpu is full or not if you don't have DVFS, but I don't think it hurts if it is done right. We need the scaling to figure out how much capacity is available.
But I think you asking this question shows a 'bigger' problem in that the Changelogs are entirely failing at describing the actual problem and proposed solution. Because if that were clear, I don't think we would be having this particular discussion.
Yes, the bigger problem of scaling things with DVFS and taking big.LITTLE into account is not addressed in this patch set. This is the scale-invariance problem that we discussed at Ksummit.
big.LITTLE is factored in this patch set, but DVFS is not. My bad.
Morten
On Wed, Sep 17, 2014 at 07:45:27PM +0100, Morten Rasmussen wrote:
Right. You don't need to scale capacity to determine whether a cpu is full or not if you don't have DVFS, but I don't think it hurts if it is done right. We need the scaling to figure out how much capacity is available.
Maybe I'm particularly dense, but I'm not seeing how the proposed thing is 'right' in any way.
There are no words on what specifically is addressed and how it is achieved.
On 15 September 2014 13:42, Peter Zijlstra peterz@infradead.org wrote:
On Sun, Sep 14, 2014 at 09:41:56PM +0200, Peter Zijlstra wrote:
On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
On 11 September 2014 18:15, Peter Zijlstra peterz@infradead.org wrote:
I'm confused about the utilization vs capacity_orig. I see how we should
1st point is that I should compare utilization vs capacity and not capacity_orig. I should have replaced capacity_orig by capacity in the functions above when i move the utilization statistic from rq->avg.runnable_avg_sum to cfs.usage_load_avg. rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas cfs.usage_load_avg integrates only cfs tasks
With this change, we don't need sgs->group_capacity_orig anymore but only sgs->group_capacity. So sgs->group_capacity_orig can be removed as it's no more used in the code as sg_capacity_factor has been removed
Yes, but.. so I suppose we need to add DVFS accounting and remove cpufreq from the capacity thing. Otherwise I don't see it make sense.
OK, I've reconsidered _again_, I still don't get it.
So fundamentally I think its wrong to scale with the capacity; it just doesn't make any sense. Consider big.little stuff, their CPUs are inherently asymmetric in capacity, but that doesn't matter one whit for utilization numbers. If a core is fully consumed its fully consumed, no matter how much work it can or can not do.
So the only thing that needs correcting is the fact that these statistics are based on clock_task and some of that time can end up in other scheduling classes, at which point we'll never get 100% even though we're 'saturated'. But correcting for that using capacity doesn't 'work'.
I'm not sure I catch your last point, because the capacity is the only figure that takes into account the "time" consumed by other classes. Have you got in mind another way to take the other classes into account?
So we have cpu_capacity, which is the capacity that can currently be used by the cfs class. We have cfs.usage_load_avg, which is the sum of the running time of cfs tasks on the CPU and reflects the % of usage of this CPU by CFS tasks. We have to use the same metric to compare the available capacity for CFS with the current cfs usage.
Now we have to use the same unit, so we can either weight cpu_capacity_orig with cfs.usage_load_avg and compare it with cpu_capacity, or divide cpu_capacity by cpu_capacity_orig and scale it into the SCHED_LOAD_SCALE range. Is that what you are proposing?
Vincent
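A sketch of the two comparisons Vincent describes, with hypothetical per-cpu values; the variable names follow the thread, not an actual API.

/* Sketch of the two equivalent comparisons, with made-up values. */
#include <stdio.h>

#define SCHED_LOAD_SCALE	1024UL
#define SCHED_LOAD_SHIFT	10

int main(void)
{
	unsigned long capacity_orig  = 1024;	/* full capacity of the CPU */
	unsigned long capacity       = 800;	/* what is left for cfs (rt ate some) */
	unsigned long usage_load_avg = 700;	/* cfs usage in [0..1024] */

	/* option 1: scale the usage by capacity_orig, compare with capacity */
	unsigned long used = (usage_load_avg * capacity_orig) >> SCHED_LOAD_SHIFT;
	printf("option 1: used=%lu vs cfs capacity=%lu -> %s\n",
	       used, capacity, used < capacity ? "fits" : "overloaded");

	/* option 2: scale capacity into the SCHED_LOAD_SCALE range instead */
	unsigned long cap_scaled = capacity * SCHED_LOAD_SCALE / capacity_orig;
	printf("option 2: usage=%lu vs scaled capacity=%lu -> %s\n",
	       usage_load_avg, cap_scaled,
	       usage_load_avg < cap_scaled ? "fits" : "overloaded");
	return 0;
}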
On 16 September 2014 00:14, Vincent Guittot vincent.guittot@linaro.org wrote:
On 15 September 2014 13:42, Peter Zijlstra peterz@infradead.org wrote:
On Sun, Sep 14, 2014 at 09:41:56PM +0200, Peter Zijlstra wrote:
On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
On 11 September 2014 18:15, Peter Zijlstra peterz@infradead.org wrote:
I'm confused about the utilization vs capacity_orig. I see how we should
1st point is that I should compare utilization vs capacity and not capacity_orig. I should have replaced capacity_orig by capacity in the functions above when i move the utilization statistic from rq->avg.runnable_avg_sum to cfs.usage_load_avg. rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas cfs.usage_load_avg integrates only cfs tasks
With this change, we don't need sgs->group_capacity_orig anymore but only sgs->group_capacity. So sgs->group_capacity_orig can be removed as it's no more used in the code as sg_capacity_factor has been removed
Yes, but.. so I suppose we need to add DVFS accounting and remove cpufreq from the capacity thing. Otherwise I don't see it make sense.
OK, I've reconsidered _again_, I still don't get it.
So fundamentally I think its wrong to scale with the capacity; it just doesn't make any sense. Consider big.little stuff, their CPUs are inherently asymmetric in capacity, but that doesn't matter one whit for utilization numbers. If a core is fully consumed its fully consumed, no matter how much work it can or can not do.
So the only thing that needs correcting is the fact that these statistics are based on clock_task and some of that time can end up in other scheduling classes, at which point we'll never get 100% even though we're 'saturated'. But correcting for that using capacity doesn't 'work'.
I'm not sure to catch your last point because the capacity is the only figures that take into account the "time" consumed by other classes. Have you got in mind another way to take into account the other classes ?
So we have cpu_capacity that is the capacity that can be currently used by cfs class We have cfs.usage_load_avg that is the sum of running time of cfs tasks on the CPU and reflect the % of usage of this CPU by CFS tasks We have to use the same metrics to compare available capacity for CFS and current cfs usage
Now we have to use the same unit so we can either weight the cpu_capacity_orig with the cfs.usage_load_avg and compare it with cpu_capacity or with divide cpu_capacity by cpu_capacity_orig and scale it into the SCHED_LOAD_SCALE range. Is It what you are proposing ?
For the latter, we need to keep sgs->group_capacity_orig in order to check if a group is overloaded, whereas the 1st solution doesn't need it anymore (with the correction I mentioned previously).
Vincent
Vincent
On Tue, Sep 16, 2014 at 12:14:54AM +0200, Vincent Guittot wrote:
On 15 September 2014 13:42, Peter Zijlstra peterz@infradead.org wrote:
OK, I've reconsidered _again_, I still don't get it.
So fundamentally I think its wrong to scale with the capacity; it just doesn't make any sense. Consider big.little stuff, their CPUs are inherently asymmetric in capacity, but that doesn't matter one whit for utilization numbers. If a core is fully consumed its fully consumed, no matter how much work it can or can not do.
So the only thing that needs correcting is the fact that these statistics are based on clock_task and some of that time can end up in other scheduling classes, at which point we'll never get 100% even though we're 'saturated'. But correcting for that using capacity doesn't 'work'.
I'm not sure to catch your last point because the capacity is the only figures that take into account the "time" consumed by other classes. Have you got in mind another way to take into account the other classes ?
So that was the entire point of stuffing capacity in? Note that that point was not at all clear.
This is very much like 'all we have is a hammer, and therefore everything is a nail'. The rt fraction is a 'small' part of what the capacity is.
So we have cpu_capacity that is the capacity that can be currently used by cfs class We have cfs.usage_load_avg that is the sum of running time of cfs tasks on the CPU and reflect the % of usage of this CPU by CFS tasks We have to use the same metrics to compare available capacity for CFS and current cfs usage
-ENOPARSE
Now we have to use the same unit so we can either weight the cpu_capacity_orig with the cfs.usage_load_avg and compare it with cpu_capacity or with divide cpu_capacity by cpu_capacity_orig and scale it into the SCHED_LOAD_SCALE range. Is It what you are proposing ?
I'm so not getting it; orig vs capacity still includes arch_scale_freq_capacity(), so that is not enough to isolate the rt fraction.
On 17 September 2014 15:25, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Sep 16, 2014 at 12:14:54AM +0200, Vincent Guittot wrote:
On 15 September 2014 13:42, Peter Zijlstra peterz@infradead.org wrote:
OK, I've reconsidered _again_, I still don't get it.
So fundamentally I think its wrong to scale with the capacity; it just doesn't make any sense. Consider big.little stuff, their CPUs are inherently asymmetric in capacity, but that doesn't matter one whit for utilization numbers. If a core is fully consumed its fully consumed, no matter how much work it can or can not do.
So the only thing that needs correcting is the fact that these statistics are based on clock_task and some of that time can end up in other scheduling classes, at which point we'll never get 100% even though we're 'saturated'. But correcting for that using capacity doesn't 'work'.
I'm not sure to catch your last point because the capacity is the only figures that take into account the "time" consumed by other classes. Have you got in mind another way to take into account the other classes ?
So that was the entire point of stuffing capacity in? Note that that point was not at all clear.
This is very much like 'all we have is a hammer, and therefore everything is a nail'. The rt fraction is a 'small' part of what the capacity is.
So we have cpu_capacity that is the capacity that can be currently used by cfs class We have cfs.usage_load_avg that is the sum of running time of cfs tasks on the CPU and reflect the % of usage of this CPU by CFS tasks We have to use the same metrics to compare available capacity for CFS and current cfs usage
-ENOPARSE
Now we have to use the same unit so we can either weight the cpu_capacity_orig with the cfs.usage_load_avg and compare it with cpu_capacity or with divide cpu_capacity by cpu_capacity_orig and scale it into the SCHED_LOAD_SCALE range. Is It what you are proposing ?
I'm so not getting it; orig vs capacity still includes arch_scale_freq_capacity(), so that is not enough to isolate the rt fraction.
This patch does not try to solve any scale invariance issue. This patch removes capacity_factor because it rarely works correctly. capacity_factor tries to compute how many tasks a group of CPUs can handle at the time we are doing the load balance. The capacity_factor hardly works for SMT systems: it sometimes works for big cores but fails to do the right thing for little cores.
Below are two examples to illustrate the problem that this patch solves:
capacity_factor makes the assumption that the max capacity of a CPU is SCHED_CAPACITY_SCALE and the load of a thread is always SCHED_LOAD_SCALE. It compares the output of these figures with the sum of nr_running to decide if a group is overloaded or not.
But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE (640 as an example), a group of 3 CPUs will have a max capacity_factor of 2 (div_round_closest(3x640/1024) = 2), which means that it will be seen as overloaded even if we have only one task per CPU.
Then, if the default capacity of a CPU is greater than SCHED_CAPACITY_SCALE (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4 (at max, thanks to the fix [0] for SMT systems that prevents the apparition of ghost CPUs), but if one CPU is fully used by an rt task (and its capacity is reduced to nearly nothing), the capacity factor of the group will still be 4 (div_round_closest(3*1512/1024) = 4).
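As a sanity check of the two examples above, the rounding can be reproduced with the kernel's DIV_ROUND_CLOSEST macro (copied here; everything else is plain arithmetic).

/* Sanity check of the capacity_factor rounding in the two examples. */
#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024UL
#define DIV_ROUND_CLOSEST(x, d)	(((x) + ((d) / 2)) / (d))

int main(void)
{
	/* 3 CPUs of capacity 640: only 2 "cores" worth of capacity_factor,
	 * so one task per CPU already looks like overload */
	printf("3x640  -> factor %lu\n",
	       DIV_ROUND_CLOSEST(3 * 640UL, SCHED_CAPACITY_SCALE));	/* 2 */

	/* 4 CPUs of capacity 1512 with one CPU eaten by an rt task:
	 * the remaining capacity still rounds to a factor of 4 */
	printf("3x1512 -> factor %lu\n",
	       DIV_ROUND_CLOSEST(3 * 1512UL, SCHED_CAPACITY_SCALE));	/* 4 */
	return 0;
}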
So, this patch tries to solve this issue by removing capacity_factor and replacing it with the 2 following metrics:
-the available CPU capacity for CFS tasks, which is the one currently used by load_balance
-the capacity that is effectively used by CFS tasks on the CPU
For that, I have re-introduced the usage_avg_contrib, which is in the range [0..SCHED_LOAD_SCALE] whatever the capacity of the CPU on which the task is running. This usage_avg_contrib doesn't solve the scale invariance problem, so I have to scale the usage with the original capacity in get_cpu_utilization (which will become get_cpu_usage in the next version) in order to compare it with the available capacity.
Once scale invariance has been added to usage_avg_contrib, we can remove the scaling by cpu_capacity_orig in get_cpu_utilization. But scale invariance will come in another patchset.
Hope that this explanation makes the goal of this patchset clearer. I can add this explanation to the commit log if you find it clear enough.
Vincent
On 14/09/14 12:41, Peter Zijlstra wrote:
On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
On 11 September 2014 18:15, Peter Zijlstra peterz@infradead.org wrote:
On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
+static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
struct lb_env *env)
{
if ((sgs->group_capacity_orig * 100) >
(sgs->group_utilization * env->sd->imbalance_pct))
return 1;
if (sgs->sum_nr_running < sgs->group_weight)
return 1;
return 0;
+}
+static inline int group_is_overloaded(struct sg_lb_stats *sgs,
struct lb_env *env)
+{
if (sgs->sum_nr_running <= sgs->group_weight)
return 0;
if ((sgs->group_capacity_orig * 100) <
(sgs->group_utilization * env->sd->imbalance_pct))
return 1;
return 0;
}
I'm confused about the utilization vs capacity_orig. I see how we should
1st point is that I should compare utilization vs capacity and not capacity_orig. I should have replaced capacity_orig by capacity in the functions above when i move the utilization statistic from rq->avg.runnable_avg_sum to cfs.usage_load_avg. rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas cfs.usage_load_avg integrates only cfs tasks
With this change, we don't need sgs->group_capacity_orig anymore but only sgs->group_capacity. So sgs->group_capacity_orig can be removed as it's no more used in the code as sg_capacity_factor has been removed
Yes, but.. so I suppose we need to add DVFS accounting and remove cpufreq from the capacity thing. Otherwise I don't see it make sense.
My understanding is that uArch scaling of capacity_orig (and therefore of capacity too) is done by calling arch_scale_cpu_capacity, and frequency scaling for capacity is done by calling arch_scale_freq_capacity in update_cpu_capacity. I understand that this patch-set does not provide an implementation of arch_scale_freq_capacity though.
The uArch and frequency scaling of load & utilization will be added later. I know that Morten is currently working on a 'uArch and frequency invariant load & utilization tracking' patch-set.
Although I don't know exactly what you mean by DVFS accounting and removing cpufreq from the capacity here.
-- Dietmar
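A simplified, user-space sketch of how the two hooks Dietmar mentions are expected to compose when a CPU's capacity is refreshed; this is not the actual update_cpu_capacity() code, and the hook return values below are made up.

/* Not the kernel code: only shows the ordering of the two scalings. */
#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024UL
#define SCHED_CAPACITY_SHIFT	10

/* uArch scaling: a little core at ~60% of a big core, say */
static unsigned long scale_cpu_hook(void)  { return 640; }
/* frequency scaling: currently running at 75% of max, say */
static unsigned long scale_freq_hook(void) { return 768; }

int main(void)
{
	unsigned long capacity = SCHED_CAPACITY_SCALE;
	unsigned long capacity_orig;

	/* uArch scaling sets the maximum (original) capacity */
	capacity = (capacity * scale_cpu_hook()) >> SCHED_CAPACITY_SHIFT;
	capacity_orig = capacity;

	/* frequency scaling is applied on top for the current capacity
	 * (no arch implementation is provided in this patch set) */
	capacity = (capacity * scale_freq_hook()) >> SCHED_CAPACITY_SHIFT;

	printf("capacity_orig=%lu capacity=%lu\n", capacity_orig, capacity);
	return 0;
}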
Add the SD_PREFER_SIBLING flag at the SMT level in order to ensure that the scheduler will put at least one task per core.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/core.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 7c3b237..ab5ab60 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6093,6 +6093,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu) */
if (sd->flags & SD_SHARE_CPUCAPACITY) { + sd->flags |= SD_PREFER_SIBLING; sd->imbalance_pct = 110; sd->smt_gain = 1178; /* ~15% */
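The flag set above is consumed during load balance in update_sd_lb_stats(); a minimal sketch of the resulting behaviour, condensed from the prefer_sibling hunk earlier in the thread (values are hypothetical).

/* Condensed illustration of what SD_PREFER_SIBLING buys at SMT level. */
#include <stdio.h>

int main(void)
{
	int prefer_sibling = 1;		/* SD_SHARE_CPUCAPACITY implies SD_PREFER_SIBLING */
	int local_has_capacity = 1;	/* this core still has a free sibling */
	int sibling_nr_running = 2;	/* the other core runs two tasks */
	int group_out_of_capacity = 0;

	/* a sibling group with more than one task is marked out of capacity
	 * when the local group has room, so the excess task gets pulled */
	if (prefer_sibling && local_has_capacity && sibling_nr_running > 1)
		group_out_of_capacity = 1;

	printf("pull a task from the busy core: %s\n",
	       group_out_of_capacity ? "yes" : "no");
	return 0;
}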