Part of this patchset was previously part of the larger tasks packing patchset [1]. I have split the latter into (at least) 3 different patchsets to make things easier:
- configuration of sched_domain topology [2]
- update and consolidation of cpu_capacity (this patchset)
- tasks packing algorithm
SMT systems are no longer the only ones that can have CPUs with an original capacity that is different from the default value. We need to extend the use of cpu_capacity_orig to all kinds of platforms so that the scheduler will have both the maximum capacity (cpu_capacity_orig/capacity_orig) and the current capacity (cpu_capacity/capacity) of CPUs and sched_groups. A new function, arch_scale_cpu_capacity, has been created and replaces arch_scale_smt_capacity, which was SMT-specific, in the computation of the capacity of a CPU.
During load balance, the scheduler evaluates the number of tasks that a group of CPUs can handle. The current method assumes that tasks have a fixed load of SCHED_LOAD_SCALE and that CPUs have a default capacity of SCHED_CAPACITY_SCALE. This assumption generates wrong decisions by creating ghost cores or by removing real ones when the original capacity of CPUs is different from the default SCHED_CAPACITY_SCALE. We no longer try to evaluate the number of available cores based on the group_capacity; instead we detect when the group is fully utilized.
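As an illustration of the problem (a minimal standalone sketch, with made-up capacity values rather than numbers from a real platform), the capacity_factor arithmetic removed later in this series can report fewer cores than a group really has as soon as capacity_orig is below the default:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE    1024
#define DIV_ROUND_UP(n, d)      (((n) + (d) - 1) / (d))
#define DIV_ROUND_CLOSEST(n, d) (((n) + (d) / 2) / (d))

/* Sketch of the old heuristic: how many tasks can this group handle? */
static unsigned int old_capacity_factor(unsigned int cpus,
					unsigned int group_capacity,
					unsigned int group_capacity_orig)
{
	unsigned int smt, factor;

	/* smt := ceil(cpus / capacity), assumes 1 < smt_capacity < 2 */
	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, group_capacity_orig);
	factor = cpus / smt;	/* "cores" */

	if (factor > DIV_ROUND_CLOSEST(group_capacity, SCHED_CAPACITY_SCALE))
		factor = DIV_ROUND_CLOSEST(group_capacity, SCHED_CAPACITY_SCALE);
	return factor;
}

int main(void)
{
	/*
	 * Hypothetical group of 4 CPUs whose capacity_orig is 600 each:
	 * the group really has 4 cores, but the formula reports only 2,
	 * so 2 real cores disappear from the load balancer's view.
	 */
	printf("capacity_factor = %u\n", old_capacity_factor(4, 4 * 600, 4 * 600));
	return 0;
}

The series below drops this estimate entirely and instead compares the group's utilization with its capacity.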
Now that we have the original capacity of CPUs and their activity/utilization, we can more accurately evaluate the capacity and the level of utilization of a group of CPUs.
This patchset mainly replaces the old capacity method with a new one and keeps the policy almost unchanged, whereas we could certainly take advantage of this new statistic in several other places of the load balance.
Test results: below are the results of 4 kinds of tests:
- hackbench -l 500 -s 4096
- perf bench sched pipe -l 400000
- scp of a 100MB file on the platform
- ebizzy with various numbers of threads
on 4 kernels:
- tip = tip/sched/core
- step1 = tip + patches(1-8)
- patchset = tip + whole patchset
- patchset+irq = tip + this patchset + irq accounting
Each test has been run 6 times and the table below shows the stdev and the diff compared to the tip kernel.
Dual A7                      tip        |       +step1         |      +patchset       |    patchset+irq
                            stdev       |  results   stdev     |  results   stdev     |  results   stdev
hackbench (lower is better) (+/-)0.64%  |  -0.19%  (+/-)0.73%  |   0.58%  (+/-)1.29%  |   0.20%  (+/-)1.00%
perf (lower is better)      (+/-)0.28%  |   1.22%  (+/-)0.17%  |   1.29%  (+/-)0.06%  |   2.85%  (+/-)0.33%
scp                         (+/-)4.81%  |   2.61%  (+/-)0.28%  |   2.39%  (+/-)0.22%  |  82.18%  (+/-)3.30%
ebizzy -t 1                 (+/-)2.31%  |  -1.32%  (+/-)1.90%  |  -0.79%  (+/-)2.88%  |   3.10%  (+/-)2.32%
ebizzy -t 2                 (+/-)0.70%  |   8.29%  (+/-)6.66%  |   1.93%  (+/-)5.47%  |   2.72%  (+/-)5.72%
ebizzy -t 4                 (+/-)3.54%  |   5.57%  (+/-)8.00%  |   0.36%  (+/-)9.00%  |   2.53%  (+/-)3.17%
ebizzy -t 6                 (+/-)2.36%  |  -0.43%  (+/-)3.29%  |  -1.93%  (+/-)3.47%  |   0.57%  (+/-)0.75%
ebizzy -t 8                 (+/-)1.65%  |  -0.45%  (+/-)0.93%  |  -1.95%  (+/-)1.52%  |  -1.18%  (+/-)1.61%
ebizzy -t 10                (+/-)2.55%  |  -0.98%  (+/-)3.06%  |  -1.18%  (+/-)6.17%  |  -2.33%  (+/-)3.28%
ebizzy -t 12                (+/-)6.22%  |   0.17%  (+/-)5.63%  |   2.98%  (+/-)7.11%  |   1.19%  (+/-)4.68%
ebizzy -t 14                (+/-)5.38%  |  -0.14%  (+/-)5.33%  |   2.49%  (+/-)4.93%  |   1.43%  (+/-)6.55%

Quad A15                     tip        |     +patchset1       |     +patchset2       |    patchset+irq
                            stdev       |  results   stdev     |  results   stdev     |  results   stdev
hackbench (lower is better) (+/-)0.78%  |   0.87%  (+/-)1.72%  |   0.91%  (+/-)2.02%  |   3.30%  (+/-)2.02%
perf (lower is better)      (+/-)2.03%  |  -0.31%  (+/-)0.76%  |  -2.38%  (+/-)1.37%  |   1.42%  (+/-)3.14%
scp                         (+/-)0.04%  |   0.51%  (+/-)1.37%  |   1.79%  (+/-)0.84%  |   1.77%  (+/-)0.38%
ebizzy -t 1                 (+/-)0.41%  |   2.05%  (+/-)0.38%  |   2.08%  (+/-)0.24%  |   0.17%  (+/-)0.62%
ebizzy -t 2                 (+/-)0.78%  |   0.60%  (+/-)0.63%  |   0.43%  (+/-)0.48%  |   1.61%  (+/-)0.38%
ebizzy -t 4                 (+/-)0.58%  |  -0.10%  (+/-)0.97%  |  -0.65%  (+/-)0.76%  |  -0.75%  (+/-)0.86%
ebizzy -t 6                 (+/-)0.31%  |   1.07%  (+/-)1.12%  |  -0.16%  (+/-)0.87%  |  -0.76%  (+/-)0.22%
ebizzy -t 8                 (+/-)0.95%  |  -0.30%  (+/-)0.85%  |  -0.79%  (+/-)0.28%  |  -1.66%  (+/-)0.21%
ebizzy -t 10                (+/-)0.31%  |   0.04%  (+/-)0.97%  |  -1.44%  (+/-)1.54%  |  -0.55%  (+/-)0.62%
ebizzy -t 12                (+/-)8.35%  |  -1.89%  (+/-)7.64%  |   0.75%  (+/-)5.30%  |  -1.18%  (+/-)8.16%
ebizzy -t 14                (+/-)13.17% |   6.22%  (+/-)4.71%  |   5.25%  (+/-)9.14%  |   5.87%  (+/-)5.77%
I haven't been able to fully test the patchset on an SMT system to check that the regression reported by Preeti has been solved, but the various tests that I have done don't show any regression so far. The correction of the SD_PREFER_SIBLING mode and its use at the SMT level should have fixed the regression.
The usage_avg_contrib is based on the current implementation of the load avg tracking. I also have a version of usage_avg_contrib that is based on the new implementation [3], but I haven't provided the patches and results as [3] is still under review. I can provide changes on top of [3] to adapt the way usage_avg_contrib is computed to the new mechanism.
TODO: manage conflict with the next version of [4]
Changes since V3:
- add the usage_avg_contrib statistic, which sums the running time of the tasks on a rq
- use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization
- fix the replacement of power by capacity
- update some comments
Changes since V2:
- rebase on top of the capacity renaming
- fix the wake_affine statistic update
- rework nohz_kick_needed
- optimize the active migration of a task from a CPU with reduced capacity
- rename group_activity to group_utilization and remove the unused total_utilization
- repair SD_PREFER_SIBLING and use it at the SMT level
- reorder the patchset to gather patches with the same topic
Changes since V1:
- add 3 fixes
- correct some commit messages
- replace the capacity computation by activity
- take into account the current cpu capacity
[1] https://lkml.org/lkml/2013/10/18/121
[2] https://lkml.org/lkml/2014/3/19/377
[3] https://lkml.org/lkml/2014/7/18/110
[4] https://lkml.org/lkml/2014/7/25/589
Vincent Guittot (12):
  sched: fix imbalance flag reset
  sched: remove a wake_affine condition
  sched: fix avg_load computation
  sched: Allow all archs to set the capacity_orig
  ARM: topology: use new cpu_capacity interface
  sched: add per rq cpu_capacity_orig
  sched: test the cpu's capacity in wake affine
  sched: move cfs task on a CPU with higher capacity
  sched: add usage_load_avg
  sched: get CPU's utilization statistic
  sched: replace capacity_factor by utilization
  sched: add SD_PREFER_SIBLING for SMT level
 arch/arm/kernel/topology.c |   4 +-
 include/linux/sched.h      |   4 +-
 kernel/sched/core.c        |   3 +-
 kernel/sched/fair.c        | 350 ++++++++++++++++++++++++++-------------------
 kernel/sched/sched.h       |   3 +-
 5 files changed, 207 insertions(+), 157 deletions(-)
The imbalance flag can stay set even though there is no imbalance.
Let's assume that we have 3 tasks that run on a dual-core/dual-cluster system. We will have some idle load balances which are triggered during the tick. Unfortunately, the tick is also used to queue background work, so we can reach the situation where a short work item has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task onto the idle CPU. The waiting task is a worker thread that is pinned to a CPU, so an imbalance due to a pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU, but the imbalance flag will trigger useless active load balances between the idle CPU and the busy CPU.
We need to reset the imbalance flag as soon as we have reached a balanced state. If all tasks are pinned, we don't consider that as a balanced state and leave the imbalance flag set.
Reviewed-by: Preeti U Murthy preeti@linux.vnet.ibm.com
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 923fe32..7eb9126 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6672,10 +6672,8 @@ static int load_balance(int this_cpu, struct rq *this_rq, if (sd_parent) { int *group_imbalance = &sd_parent->groups->sgc->imbalance;
- if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) { + if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) *group_imbalance = 1; - } else if (*group_imbalance) - *group_imbalance = 0; }
/* All tasks on this runqueue were pinned by CPU affinity */ @@ -6686,7 +6684,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, env.loop_break = sched_nr_migrate_break; goto redo; } - goto out_balanced; + goto out_all_pinned; } }
@@ -6760,6 +6758,23 @@ static int load_balance(int this_cpu, struct rq *this_rq, goto out;
out_balanced: + /* + * We reach balance although we may have faced some affinity + * constraints. Clear the imbalance flag if it was set. + */ + if (sd_parent) { + int *group_imbalance = &sd_parent->groups->sgc->imbalance; + + if (*group_imbalance) + *group_imbalance = 0; + } + +out_all_pinned: + /* + * We reach balance because all tasks are pinned at this level so + * we can't migrate them. Let the imbalance flag set so parent level + * can try to migrate them. + */ schedstat_inc(sd, lb_balanced[idle]);
sd->nr_balance_failed = 0;
Hi Vincent, On 7/29/14, 1:51 AM, Vincent Guittot wrote:
The imbalance flag can stay set whereas there is no imbalance.
Let assume that we have 3 tasks that run on a dual cores /dual cluster system. We will have some idle load balance which are triggered during tick. Unfortunately, the tick is also used to queue background work so we can reach the situation where short work has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task on the idle CPU. The waiting task is a worker thread that is pinned on a CPU so an imbalance due to pinned task is detected and the imbalance flag is set.
Is the waiting task the third task or one of the '2 tasks on 1 CPU'?
Regards, Wanpeng Li
Then, we will not be able to clear the flag because we have at most 1 task on each CPU but the imbalance flag will trig to useless active load balance between the idle CPU and the busy CPU.
We need to reset of the imbalance flag as soon as we have reached a balanced state. If all tasks are pinned, we don't consider that as a balanced state and let the imbalance flag set.
Reviewed-by: Preeti U Murthy preeti@linux.vnet.ibm.com Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 923fe32..7eb9126 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6672,10 +6672,8 @@ static int load_balance(int this_cpu, struct rq *this_rq, if (sd_parent) { int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) *group_imbalance = 1;
} else if (*group_imbalance)
*group_imbalance = 0;
}
/* All tasks on this runqueue were pinned by CPU affinity */ @@ -6686,7 +6684,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, env.loop_break = sched_nr_migrate_break; goto redo; }
goto out_balanced;
goto out_all_pinned;
} }
@@ -6760,6 +6758,23 @@ static int load_balance(int this_cpu, struct rq *this_rq, goto out; out_balanced:
- /*
* We reach balance although we may have faced some affinity
* constraints. Clear the imbalance flag if it was set.
*/
- if (sd_parent) {
int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if (*group_imbalance)
*group_imbalance = 0;
- }
+out_all_pinned:
- /*
* We reach balance because all tasks are pinned at this level so
* we can't migrate them. Let the imbalance flag set so parent level
* can try to migrate them.
*/
schedstat_inc(sd, lb_balanced[idle]);
sd->nr_balance_failed = 0;
On 23 November 2014 at 11:25, Wanpeng Li kernellwp@gmail.com wrote:
Hi Vincent, On 7/29/14, 1:51 AM, Vincent Guittot wrote:
The imbalance flag can stay set whereas there is no imbalance.
Let assume that we have 3 tasks that run on a dual cores /dual cluster system. We will have some idle load balance which are triggered during tick. Unfortunately, the tick is also used to queue background work so we can reach the situation where short work has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task on the idle CPU. The waiting task is a worker thread that is pinned on a CPU so an imbalance due to pinned task is detected and the imbalance flag is set.
The waiting task is the third task or one the '2 tasks on 1 CPU' ?
The waiting task is one of the 2 tasks on 1 CPU (the worker)
Regards, Vincent
Regards, Wanpeng Li
Then, we will not be able to clear the flag because we have at most 1 task on each CPU but the imbalance flag will trig to useless active load balance between the idle CPU and the busy CPU.
We need to reset of the imbalance flag as soon as we have reached a balanced state. If all tasks are pinned, we don't consider that as a balanced state and let the imbalance flag set.
Reviewed-by: Preeti U Murthy preeti@linux.vnet.ibm.com Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 923fe32..7eb9126 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6672,10 +6672,8 @@ static int load_balance(int this_cpu, struct rq *this_rq, if (sd_parent) { int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
*group_imbalance = 1;
} else if (*group_imbalance)
*group_imbalance = 0; } /* All tasks on this runqueue were pinned by CPU affinity
*/ @@ -6686,7 +6684,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, env.loop_break = sched_nr_migrate_break; goto redo; }
goto out_balanced;
goto out_all_pinned;
} }
@@ -6760,6 +6758,23 @@ static int load_balance(int this_cpu, struct rq *this_rq, goto out; out_balanced:
/*
* We reach balance although we may have faced some affinity
* constraints. Clear the imbalance flag if it was set.
*/
if (sd_parent) {
int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if (*group_imbalance)
*group_imbalance = 0;
}
+out_all_pinned:
/*
* We reach balance because all tasks are pinned at this level so
* we can't migrate them. Let the imbalance flag set so parent
level
* can try to migrate them.
*/ schedstat_inc(sd, lb_balanced[idle]); sd->nr_balance_failed = 0;
Hi Vincent, On 7/29/14, 1:51 AM, Vincent Guittot wrote:
The imbalance flag can stay set whereas there is no imbalance.
Let assume that we have 3 tasks that run on a dual cores /dual cluster system. We will have some idle load balance which are triggered during tick. Unfortunately, the tick is also used to queue background work so we can reach the situation where short work has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task on the idle CPU. The waiting task is a worker thread that is pinned on a CPU so an imbalance due to pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU but the imbalance flag will trig to useless active load balance between the idle CPU and the busy CPU.
We need to reset of the imbalance flag as soon as we have reached a balanced state. If all tasks are pinned, we don't consider that as a balanced state and let the imbalance flag set.
Reviewed-by: Preeti U Murthy preeti@linux.vnet.ibm.com Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 923fe32..7eb9126 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6672,10 +6672,8 @@ static int load_balance(int this_cpu, struct rq *this_rq, if (sd_parent) { int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) *group_imbalance = 1;
} else if (*group_imbalance)
*group_imbalance = 0;
As you mentioned above, "We need to reset the imbalance flag as soon as we have reached a balanced state." I think the code before your patch already does this; what did I miss? Great thanks for your patience. ;-)
Regards, Wanpeng Li
}
/* All tasks on this runqueue were pinned by CPU affinity */ @@ -6686,7 +6684,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, env.loop_break = sched_nr_migrate_break; goto redo; }
goto out_balanced;
goto out_all_pinned;
} }
@@ -6760,6 +6758,23 @@ static int load_balance(int this_cpu, struct rq *this_rq, goto out; out_balanced:
- /*
* We reach balance although we may have faced some affinity
* constraints. Clear the imbalance flag if it was set.
*/
- if (sd_parent) {
int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if (*group_imbalance)
*group_imbalance = 0;
- }
+out_all_pinned:
- /*
* We reach balance because all tasks are pinned at this level so
* we can't migrate them. Let the imbalance flag set so parent level
* can try to migrate them.
*/
schedstat_inc(sd, lb_balanced[idle]);
sd->nr_balance_failed = 0;
On 25 November 2014 at 00:47, Wanpeng Li kernellwp@gmail.com wrote:
Hi Vincent, On 7/29/14, 1:51 AM, Vincent Guittot wrote:
The imbalance flag can stay set whereas there is no imbalance.
Let assume that we have 3 tasks that run on a dual cores /dual cluster system. We will have some idle load balance which are triggered during tick. Unfortunately, the tick is also used to queue background work so we can reach the situation where short work has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task on the idle CPU. The waiting task is a worker thread that is pinned on a CPU so an imbalance due to pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU but the imbalance flag will trig to useless active load balance between the idle CPU and the busy CPU.
We need to reset of the imbalance flag as soon as we have reached a balanced state. If all tasks are pinned, we don't consider that as a balanced state and let the imbalance flag set.
Reviewed-by: Preeti U Murthy preeti@linux.vnet.ibm.com Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 923fe32..7eb9126 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6672,10 +6672,8 @@ static int load_balance(int this_cpu, struct rq *this_rq, if (sd_parent) { int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
*group_imbalance = 1;
} else if (*group_imbalance)
*group_imbalance = 0;
As you mentioned above " We need to reset of the imbalance flag as soon as we have reached a balanced state. " I think the codes before your patch have already do this, where I miss? Great thanks for your patient. ;-)
The previous code was called only when busiest->nr_running > 1. The background activity will be on the rq only 1 tick every few seconds, and we will set group_imbalance when the background activity is on the rq. Then, during the next load balances, group_imbalance is still set but we can't clear group_imbalance because we have only 1 task per rq.
Regards, Vincent
Regards, Wanpeng Li
} /* All tasks on this runqueue were pinned by CPU affinity
*/ @@ -6686,7 +6684,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, env.loop_break = sched_nr_migrate_break; goto redo; }
goto out_balanced;
goto out_all_pinned;
} }
@@ -6760,6 +6758,23 @@ static int load_balance(int this_cpu, struct rq *this_rq, goto out; out_balanced:
/*
* We reach balance although we may have faced some affinity
* constraints. Clear the imbalance flag if it was set.
*/
if (sd_parent) {
int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if (*group_imbalance)
*group_imbalance = 0;
}
+out_all_pinned:
/*
* We reach balance because all tasks are pinned at this level so
* we can't migrate them. Let the imbalance flag set so parent
level
* can try to migrate them.
*/ schedstat_inc(sd, lb_balanced[idle]); sd->nr_balance_failed = 0;
Hi Vincent, On 11/25/14, 5:04 PM, Vincent Guittot wrote:
On 25 November 2014 at 00:47, Wanpeng Li kernellwp@gmail.com wrote:
Hi Vincent, On 7/29/14, 1:51 AM, Vincent Guittot wrote:
The imbalance flag can stay set whereas there is no imbalance.
Let assume that we have 3 tasks that run on a dual cores /dual cluster system. We will have some idle load balance which are triggered during tick. Unfortunately, the tick is also used to queue background work so we can reach the situation where short work has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task on the idle CPU. The waiting task is a worker thread that is pinned on a CPU so an imbalance due to pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU but the imbalance flag will trig to useless active load balance between the idle CPU and the busy CPU.
We need to reset of the imbalance flag as soon as we have reached a balanced state. If all tasks are pinned, we don't consider that as a balanced state and let the imbalance flag set.
Reviewed-by: Preeti U Murthy preeti@linux.vnet.ibm.com Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 923fe32..7eb9126 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6672,10 +6672,8 @@ static int load_balance(int this_cpu, struct rq *this_rq, if (sd_parent) { int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
*group_imbalance = 1;
} else if (*group_imbalance)
*group_imbalance = 0;
As you mentioned above " We need to reset of the imbalance flag as soon as we have reached a balanced state. " I think the codes before your patch have already do this, where I miss? Great thanks for your patient. ;-)
The previous code was called only when busiest->nr_running > 1. The background activity will be on the rq only 1 tick every few seconds, and we will set group_imbalance when the background activity is on the rq. Then, during the next load balances, group_imbalance is still set but we can't clear group_imbalance because we have only 1 task per rq.
I think there is no load balance since busiest->nr_running > 1 is not true, even if the patch is not applied.
Regards, Wanpeng Li
Regards, Vincent
Regards, Wanpeng Li
} /* All tasks on this runqueue were pinned by CPU affinity
*/ @@ -6686,7 +6684,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, env.loop_break = sched_nr_migrate_break; goto redo; }
goto out_balanced;
goto out_all_pinned;
} }
@@ -6760,6 +6758,23 @@ static int load_balance(int this_cpu, struct rq *this_rq, goto out; out_balanced:
/*
* We reach balance although we may have faced some affinity
* constraints. Clear the imbalance flag if it was set.
*/
if (sd_parent) {
int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if (*group_imbalance)
*group_imbalance = 0;
}
+out_all_pinned:
/*
* We reach balance because all tasks are pinned at this level so
* we can't migrate them. Let the imbalance flag set so parent
level
* can try to migrate them.
*/ schedstat_inc(sd, lb_balanced[idle]); sd->nr_balance_failed = 0;
I have tried to understand the meaning of the condition:
  (this_load <= load && this_load + target_load(prev_cpu, idx) <= tl_per_task)
but I failed to find a use case that can take advantage of it, and I haven't found a clear description in the previous commits' logs. Furthermore, the comment above the condition refers to the task_hot function that was used before being replaced by the current condition:
/*
 * This domain has SD_WAKE_AFFINE and
 * p is cache cold in this domain, and
 * there is no bad imbalance.
 */
If we look more closely at the condition below:
  this_load + target_load(prev_cpu, idx) <= tl_per_task

When sync is clear, we have:
  tl_per_task = runnable_load_avg / nr_running
  this_load   = max(runnable_load_avg, cpuload[idx])
  target_load = max(runnable_load_avg', cpuload'[idx])

It implies that runnable_load_avg' == 0 and nr_running <= 1 in order to match the condition. This implies that runnable_load_avg == 0 too, because of the condition this_load <= load. But if this_load is null, balanced is already set and the test is redundant.
If sync is set, it's not as straightforward as above (especially if cgroups are involved), but the policy should be similar as we have removed a task that's going to sleep in order to get more accurate load and this_load values.

The current conclusion is that these additional conditions don't give any benefit, so we can remove them.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 30 ++++++------------------
 1 file changed, 6 insertions(+), 24 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7eb9126..57f8d8c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4285,7 +4285,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) { s64 this_load, load; int idx, this_cpu, prev_cpu; - unsigned long tl_per_task; struct task_group *tg; unsigned long weight; int balanced; @@ -4343,32 +4342,15 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) balanced = this_eff_load <= prev_eff_load; } else balanced = true; - - /* - * If the currently running task will sleep within - * a reasonable amount of time then attract this newly - * woken task: - */ - if (sync && balanced) - return 1; - schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts); - tl_per_task = cpu_avg_load_per_task(this_cpu);
- if (balanced || - (this_load <= load && - this_load + target_load(prev_cpu, idx) <= tl_per_task)) { - /* - * This domain has SD_WAKE_AFFINE and - * p is cache cold in this domain, and - * there is no bad imbalance. - */ - schedstat_inc(sd, ttwu_move_affine); - schedstat_inc(p, se.statistics.nr_wakeups_affine); + if (!balanced) + return 0;
- return 1; - } - return 0; + schedstat_inc(sd, ttwu_move_affine); + schedstat_inc(p, se.statistics.nr_wakeups_affine); + + return 1; }
/*
The computation of avg_load and avg_load_per_task should only take into account the number of cfs tasks. The non-cfs tasks are already taken into account by decreasing the cpu's capacity, and they will be tracked in the CPU's utilization (group_utilization) by the next patches.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 57f8d8c..647d0a6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4095,7 +4095,7 @@ static unsigned long capacity_of(int cpu) static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu); - unsigned long nr_running = ACCESS_ONCE(rq->nr_running); + unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running); unsigned long load_avg = rq->cfs.runnable_load_avg;
if (nr_running) @@ -5909,7 +5909,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, load = source_load(i, load_idx);
sgs->group_load += load; - sgs->sum_nr_running += rq->nr_running; + sgs->sum_nr_running += rq->cfs.h_nr_running;
if (rq->nr_running > 1) *overload = true;
capacity_orig is only changed for systems with an SMT sched_domain level in order to reflect the lower capacity of CPUs. Heterogeneous systems also have to reflect an original capacity that is different from the default value.
Create a more generic function arch_scale_cpu_capacity that can also be used by non-SMT platforms to set capacity_orig.
The weak behavior of arch_scale_cpu_capacity is the previous SMT one in order to keep backward compatibility in the use of capacity_orig.
arch_scale_smt_capacity and default_scale_smt_capacity have been removed as they were not used anywhere else.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 25 ++++++-------------------
 1 file changed, 6 insertions(+), 19 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 647d0a6..00ff5f3 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5642,19 +5642,12 @@ unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu) return default_scale_capacity(sd, cpu); }
-static unsigned long default_scale_smt_capacity(struct sched_domain *sd, int cpu) +unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) { - unsigned long weight = sd->span_weight; - unsigned long smt_gain = sd->smt_gain; + if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1)) + return sd->smt_gain / sd->span_weight;
- smt_gain /= weight; - - return smt_gain; -} - -unsigned long __weak arch_scale_smt_capacity(struct sched_domain *sd, int cpu) -{ - return default_scale_smt_capacity(sd, cpu); + return SCHED_CAPACITY_SCALE; }
static unsigned long scale_rt_capacity(int cpu) @@ -5693,18 +5686,12 @@ static unsigned long scale_rt_capacity(int cpu)
static void update_cpu_capacity(struct sched_domain *sd, int cpu) { - unsigned long weight = sd->span_weight; unsigned long capacity = SCHED_CAPACITY_SCALE; struct sched_group *sdg = sd->groups;
- if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) { - if (sched_feat(ARCH_CAPACITY)) - capacity *= arch_scale_smt_capacity(sd, cpu); - else - capacity *= default_scale_smt_capacity(sd, cpu); + capacity *= arch_scale_cpu_capacity(sd, cpu);
- capacity >>= SCHED_CAPACITY_SHIFT; - } + capacity >>= SCHED_CAPACITY_SHIFT;
sdg->sgc->capacity_orig = capacity;
Use the new arch_scale_cpu_capacity in order to reflect the original capacity of a CPU instead of arch_scale_freq_capacity, which is more related to scaling the capacity with the frequency.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 arch/arm/kernel/topology.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index e35d880..89cfdd6 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -42,7 +42,7 @@ */ static DEFINE_PER_CPU(unsigned long, cpu_scale);
-unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu) +unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) { return per_cpu(cpu_scale, cpu); } @@ -166,7 +166,7 @@ static void update_cpu_capacity(unsigned int cpu) set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
printk(KERN_INFO "CPU%u: update cpu_capacity %lu\n", - cpu, arch_scale_freq_capacity(NULL, cpu)); + cpu, arch_scale_cpu_capacity(NULL, cpu)); }
#else
This new field, cpu_capacity_orig, reflects the available capacity of a CPU, unlike cpu_capacity which reflects the current capacity that can be altered by frequency and rt tasks.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/core.c  | 2 +-
 kernel/sched/fair.c  | 1 +
 kernel/sched/sched.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e1a2f31..fc5a275 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6976,7 +6976,7 @@ void __init sched_init(void) #ifdef CONFIG_SMP rq->sd = NULL; rq->rd = NULL; - rq->cpu_capacity = SCHED_CAPACITY_SCALE; + rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE; rq->post_schedule = 0; rq->active_balance = 0; rq->next_balance = jiffies; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 00ff5f3..f6258bc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5693,6 +5693,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
capacity >>= SCHED_CAPACITY_SHIFT;
+ cpu_rq(cpu)->cpu_capacity_orig = capacity; sdg->sgc->capacity_orig = capacity;
if (sched_feat(ARCH_CAPACITY)) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0191ed5..5a8ef50 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -571,6 +571,7 @@ struct rq { struct sched_domain *sd;
unsigned long cpu_capacity; + unsigned long cpu_capacity_orig;
unsigned char idle_balance; /* For active balancing */
Currently the task always wakes affine on this_cpu if the latter is idle. Before waking up the task on this_cpu, we check that this_cpu's capacity is not significantly reduced because of RT tasks or irq activity.
Use cases where the number of irqs and/or the time spent under irq is important will benefit from this, because the task that is woken up by an irq or softirq will not use the same CPU as the irq (and softirq) but an idle one.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f6258bc..6843016 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4284,6 +4284,7 @@ static int wake_wide(struct task_struct *p) static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) { s64 this_load, load; + s64 this_eff_load, prev_eff_load; int idx, this_cpu, prev_cpu; struct task_group *tg; unsigned long weight; @@ -4327,21 +4328,21 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) * Otherwise check if either cpus are near enough in load to allow this * task to be woken on this_cpu. */ - if (this_load > 0) { - s64 this_eff_load, prev_eff_load; + this_eff_load = 100; + this_eff_load *= capacity_of(prev_cpu); + + prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2; + prev_eff_load *= capacity_of(this_cpu);
- this_eff_load = 100; - this_eff_load *= capacity_of(prev_cpu); + if (this_load > 0) { this_eff_load *= this_load + effective_load(tg, this_cpu, weight, weight);
- prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2; - prev_eff_load *= capacity_of(this_cpu); prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight); + } + + balanced = this_eff_load <= prev_eff_load;
- balanced = this_eff_load <= prev_eff_load; - } else - balanced = true; schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
if (!balanced)
If the CPU is used for handling lots of IRQs, trigger a load balance to check whether it's worth moving its tasks to another CPU that has more capacity.

As a side note, this will not generate more spurious ilbs because we already trigger an ilb if there is more than 1 busy cpu. If this cpu is the only one that has a task, we will trigger the ilb once to migrate the task.

The nohz_kick_needed function has been cleaned up a bit while adding the new test.
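For reference, the capacity check added below boils down to a simple percentage comparison; here is a minimal standalone sketch of it (the imbalance_pct value of 125 is just an assumed example value, each sched_domain carries its own):

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of the test used in this patch: the CPU is considered as having a
 * significantly reduced capacity when its current capacity has dropped more
 * than the imbalance_pct margin below the original capacity.
 */
static bool capacity_significantly_reduced(unsigned long capacity_orig,
					   unsigned long capacity,
					   unsigned int imbalance_pct)
{
	return capacity_orig * 100 > capacity * imbalance_pct;
}

int main(void)
{
	/* Hypothetical values: capacity_orig = 1024, imbalance_pct = 125 */
	printf("%d\n", capacity_significantly_reduced(1024, 900, 125)); /* 0: ~12% lost, within margin */
	printf("%d\n", capacity_significantly_reduced(1024, 700, 125)); /* 1: ~32% lost, triggers the check */
	return 0;
}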
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 69 +++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 49 insertions(+), 20 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6843016..1cde8dd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5969,6 +5969,14 @@ static bool update_sd_pick_busiest(struct lb_env *env, return true; }
+ /* + * The group capacity is reduced probably because of activity from other + * sched class or interrupts which use part of the available capacity + */ + if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity * + env->sd->imbalance_pct)) + return true; + return false; }
@@ -6455,13 +6463,23 @@ static int need_active_balance(struct lb_env *env) struct sched_domain *sd = env->sd;
if (env->idle == CPU_NEWLY_IDLE) { + int src_cpu = env->src_cpu;
/* * ASYM_PACKING needs to force migrate tasks from busy but * higher numbered CPUs in order to pack all tasks in the * lowest numbered CPUs. */ - if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu) + if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu) + return 1; + + /* + * If the CPUs share their cache and the src_cpu's capacity is + * reduced because of other sched_class or IRQs, we trig an + * active balance to move the task + */ + if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) * + sd->imbalance_pct)) return 1; }
@@ -6563,6 +6581,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
schedstat_add(sd, lb_imbalance[idle], env.imbalance);
+ env.src_cpu = busiest->cpu; + ld_moved = 0; if (busiest->nr_running > 1) { /* @@ -6572,7 +6592,6 @@ static int load_balance(int this_cpu, struct rq *this_rq, * correctly treated as an imbalance. */ env.flags |= LBF_ALL_PINNED; - env.src_cpu = busiest->cpu; env.src_rq = busiest; env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
@@ -7262,10 +7281,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
/* * Current heuristic for kicking the idle load balancer in the presence - * of an idle cpu is the system. + * of an idle cpu in the system. * - This rq has more than one task. - * - At any scheduler domain level, this cpu's scheduler group has multiple - * busy cpu's exceeding the group's capacity. + * - This rq has at least one CFS task and the capacity of the CPU is + * significantly reduced because of RT tasks or IRQs. + * - At parent of LLC scheduler domain level, this cpu's scheduler group has + * multiple busy cpu. * - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler * domain span are idle. */ @@ -7275,9 +7296,10 @@ static inline int nohz_kick_needed(struct rq *rq) struct sched_domain *sd; struct sched_group_capacity *sgc; int nr_busy, cpu = rq->cpu; + bool kick = false;
if (unlikely(rq->idle_balance)) - return 0; + return false;
/* * We may be recently in ticked or tickless idle mode. At the first @@ -7291,38 +7313,45 @@ static inline int nohz_kick_needed(struct rq *rq) * balancing. */ if (likely(!atomic_read(&nohz.nr_cpus))) - return 0; + return false;
if (time_before(now, nohz.next_balance)) - return 0; + return false;
if (rq->nr_running >= 2) - goto need_kick; + return true;
rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_busy, cpu)); - if (sd) { sgc = sd->groups->sgc; nr_busy = atomic_read(&sgc->nr_busy_cpus);
- if (nr_busy > 1) - goto need_kick_unlock; + if (nr_busy > 1) { + kick = true; + goto unlock; + } + }
- sd = rcu_dereference(per_cpu(sd_asym, cpu)); + sd = rcu_dereference(rq->sd); + if (sd) { + if ((rq->cfs.h_nr_running >= 1) && + ((rq->cpu_capacity * sd->imbalance_pct) < + (rq->cpu_capacity_orig * 100))) { + kick = true; + goto unlock; + } + }
+ sd = rcu_dereference(per_cpu(sd_asym, cpu)); if (sd && (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu)) - goto need_kick_unlock; + kick = true;
+unlock: rcu_read_unlock(); - return 0; - -need_kick_unlock: - rcu_read_unlock(); -need_kick: - return 1; + return kick; } #else static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
On 07/28/2014 01:51 PM, Vincent Guittot wrote:
If the CPU is used for handling lot of IRQs, trig a load balance to check if it's worth moving its tasks on another CPU that has more capacity.
As a sidenote, this will note generate more spurious ilb because we already trig an ilb if there is more than 1 busy cpu. If this cpu is the only one that has a task, we will trig the ilb once for migrating the task.
The nohz_kick_needed function has been cleaned up a bit while adding the new test
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 69 +++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 49 insertions(+), 20 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6843016..1cde8dd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5969,6 +5969,14 @@ static bool update_sd_pick_busiest(struct lb_env *env, return true; }
- /* + * The group capacity is reduced probably because of
activity from other + * sched class or interrupts which use part of the available capacity + */ + if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity * + env->sd->imbalance_pct)) + return true; + return false; }
Isn't this already reflected in avg_load, by having avg_load increase due to the capacity decreasing when a cpu is busy with non-CFS loads?
Also, this part of update_sd_pick_busiest will not be reached in the !SD_ASYM_PACKING case once my patch is applied, so this is a small conflict between our series :)
On 28 July 2014 20:43, Rik van Riel riel@redhat.com wrote:
On 07/28/2014 01:51 PM, Vincent Guittot wrote:
If the CPU is used for handling lot of IRQs, trig a load balance to check if it's worth moving its tasks on another CPU that has more capacity.
As a sidenote, this will note generate more spurious ilb because we already trig an ilb if there is more than 1 busy cpu. If this cpu is the only one that has a task, we will trig the ilb once for migrating the task.
The nohz_kick_needed function has been cleaned up a bit while adding the new test
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 69 +++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 49 insertions(+), 20 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6843016..1cde8dd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5969,6 +5969,14 @@ static bool update_sd_pick_busiest(struct lb_env *env, return true; }
/* + * The group capacity is reduced probably because of
activity from other + * sched class or interrupts which use part of the available capacity + */ + if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity * + env->sd->imbalance_pct)) + return true; + return false; }
Isn't this already reflected in avg_load, by having avg_load increase due to the capacity decreasing when a cpu is busy with non-CFS loads?
Yes, the avg_load should probably increase, but it doesn't mean that it will be selected for load balancing.
Also, this part of update_sd_pick_busiest will not be reached in the !SD_ASYM_PACKING case once my patch is applied, so this is a small conflict between our series :)
Yes, this will conflict with your change, but the conflict should be obvious to solve as it only adds a new condition for picking the group.
Add new statistics which reflect the average time a task is running on the CPU and the sum of the running time of the tasks on a runqueue. The latter is named usage_avg_contrib.
The rq's usage_avg_contrib will be used to check whether a rq is overloaded or not, instead of trying to compute how many tasks a group of CPUs can handle.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 include/linux/sched.h |  4 ++--
 kernel/sched/fair.c   | 47 ++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h  |  2 +-
 3 files changed, 45 insertions(+), 8 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 0376b05..6893d94 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1073,10 +1073,10 @@ struct sched_avg { * above by 1024/(1-y). Thus we only need a u32 to store them for all * choices of y < 1-2^(-32)*1024. */ - u32 runnable_avg_sum, runnable_avg_period; + u32 runnable_avg_sum, runnable_avg_period, running_avg_sum; u64 last_runnable_update; s64 decay_count; - unsigned long load_avg_contrib; + unsigned long load_avg_contrib, usage_avg_contrib; };
#ifdef CONFIG_SCHEDSTATS diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1cde8dd..8bd57df 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -676,7 +676,7 @@ void init_task_runnable_average(struct task_struct *p)
p->se.avg.decay_count = 0; slice = sched_slice(task_cfs_rq(p), &p->se) >> 10; - p->se.avg.runnable_avg_sum = slice; + p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice; p->se.avg.runnable_avg_period = slice; __update_task_entity_contrib(&p->se); } @@ -2292,7 +2292,8 @@ static u32 __compute_runnable_contrib(u64 n) */ static __always_inline int __update_entity_runnable_avg(u64 now, struct sched_avg *sa, - int runnable) + int runnable, + int running) { u64 delta, periods; u32 runnable_contrib; @@ -2331,6 +2332,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now, delta_w = 1024 - delta_w; if (runnable) sa->runnable_avg_sum += delta_w; + if (running) + sa->running_avg_sum += delta_w; sa->runnable_avg_period += delta_w;
delta -= delta_w; @@ -2341,6 +2344,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum, periods + 1); + sa->running_avg_sum = decay_load(sa->running_avg_sum, + periods + 1); sa->runnable_avg_period = decay_load(sa->runnable_avg_period, periods + 1);
@@ -2348,12 +2353,16 @@ static __always_inline int __update_entity_runnable_avg(u64 now, runnable_contrib = __compute_runnable_contrib(periods); if (runnable) sa->runnable_avg_sum += runnable_contrib; + if (running) + sa->running_avg_sum += runnable_contrib; sa->runnable_avg_period += runnable_contrib; }
/* Remainder of delta accrued against u_0` */ if (runnable) sa->runnable_avg_sum += delta; + if (running) + sa->running_avg_sum += delta; sa->runnable_avg_period += delta;
return decayed; @@ -2493,6 +2502,27 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se) return se->avg.load_avg_contrib - old_contrib; }
+ +static inline void __update_task_entity_usage(struct sched_entity *se) +{ + u32 contrib; + + /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */ + contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE); + contrib /= (se->avg.runnable_avg_period + 1); + se->avg.usage_avg_contrib = scale_load(contrib); +} + +static long __update_entity_usage_avg_contrib(struct sched_entity *se) +{ + long old_contrib = se->avg.usage_avg_contrib; + + if (entity_is_task(se)) + __update_task_entity_usage(se); + + return se->avg.usage_avg_contrib - old_contrib; +} + static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq, long load_contrib) { @@ -2509,7 +2539,7 @@ static inline void update_entity_load_avg(struct sched_entity *se, int update_cfs_rq) { struct cfs_rq *cfs_rq = cfs_rq_of(se); - long contrib_delta; + long contrib_delta, usage_delta; u64 now;
/* @@ -2521,16 +2551,20 @@ static inline void update_entity_load_avg(struct sched_entity *se, else now = cfs_rq_clock_task(group_cfs_rq(se));
- if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq)) + if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq, + cfs_rq->curr == se)) return;
contrib_delta = __update_entity_load_avg_contrib(se); + usage_delta = __update_entity_usage_avg_contrib(se);
if (!update_cfs_rq) return;
- if (se->on_rq) + if (se->on_rq) { cfs_rq->runnable_load_avg += contrib_delta; + cfs_rq->usage_load_avg += usage_delta; + } else subtract_blocked_load_contrib(cfs_rq, -contrib_delta); } @@ -2607,6 +2641,7 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq, }
cfs_rq->runnable_load_avg += se->avg.load_avg_contrib; + cfs_rq->usage_load_avg += se->avg.usage_avg_contrib; /* we force update consideration on load-balancer moves */ update_cfs_rq_blocked_load(cfs_rq, !wakeup); } @@ -2625,6 +2660,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq, update_cfs_rq_blocked_load(cfs_rq, !sleep);
cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib; + cfs_rq->usage_load_avg -= se->avg.usage_avg_contrib; if (sleep) { cfs_rq->blocked_load_avg += se->avg.load_avg_contrib; se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter); @@ -2962,6 +2998,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) */ update_stats_wait_end(cfs_rq, se); __dequeue_entity(cfs_rq, se); + update_entity_load_avg(se, 1); }
update_stats_curr_start(cfs_rq, se); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 5a8ef50..e5ab9b1 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -336,7 +336,7 @@ struct cfs_rq { * This allows for the description of both thread and group usage (in * the FAIR_GROUP_SCHED case). */ - unsigned long runnable_load_avg, blocked_load_avg; + unsigned long runnable_load_avg, blocked_load_avg, usage_load_avg; atomic64_t decay_counter; u64 last_decay; atomic_long_t removed_load;
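As a rough sanity check of __update_task_entity_usage() above (assuming SCHED_LOAD_SCALE is 1024, the default, so scale_load()/scale_load_down() are identity operations): a task that has been running for about half of the time it has been tracked ends up with running_avg_sum close to runnable_avg_period / 2, so its usage_avg_contrib is roughly (runnable_avg_period / 2) * 1024 / (runnable_avg_period + 1), i.e. about 512, or half of the scale.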
Monitor the utilization level of each group of each sched_domain level. The utilization is the amount of cpu_capacity that is currently used on a CPU or a group of CPUs. We use usage_load_avg to evaluate this utilization level. In the special use case where the CPU is fully loaded by more than 1 task, the activity level is set above the cpu_capacity in order to reflect the overload of the CPU.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8bd57df..8524760 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4129,6 +4129,11 @@ static unsigned long capacity_of(int cpu) return cpu_rq(cpu)->cpu_capacity; }
+static unsigned long capacity_orig_of(int cpu) +{ + return cpu_rq(cpu)->cpu_capacity_orig; +} + static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu); @@ -4517,6 +4522,17 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }
+static int get_cpu_utilization(int cpu) +{ + unsigned long usage = cpu_rq(cpu)->cfs.usage_load_avg; + unsigned long capacity = capacity_of(cpu); + + if (usage >= SCHED_LOAD_SCALE) + return capacity + 1; + + return (usage * capacity) >> SCHED_LOAD_SHIFT; +} + /* * select_task_rq_fair: Select target runqueue for the waking task in domains * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, @@ -5596,6 +5612,7 @@ struct sg_lb_stats { unsigned long sum_weighted_load; /* Weighted load of group's tasks */ unsigned long load_per_task; unsigned long group_capacity; + unsigned long group_utilization; /* Total utilization of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ unsigned int group_capacity_factor; unsigned int idle_cpus; @@ -5935,6 +5952,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, load = source_load(i, load_idx);
sgs->group_load += load; + sgs->group_utilization += get_cpu_utilization(i); sgs->sum_nr_running += rq->cfs.h_nr_running;
if (rq->nr_running > 1) @@ -6108,7 +6126,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd /* Now, start updating sd_lb_stats */ sds->total_load += sgs->group_load; sds->total_capacity += sgs->group_capacity; - sg = sg->next; } while (sg != env->sd->groups);
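As a quick worked example of get_cpu_utilization() above, with made-up numbers and assuming SCHED_LOAD_SCALE and SCHED_CAPACITY_SCALE are both 1024: a CPU whose cfs.usage_load_avg is 512 and whose current capacity is 800 reports a utilization of (512 * 800) >> 10 = 400, i.e. half of its current capacity; a fully loaded CPU (usage >= SCHED_LOAD_SCALE) reports capacity + 1 to flag the overload.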
The scheduler tries to compute how many tasks a group of CPUs can handle by assuming that a task's load is SCHED_LOAD_SCALE and that a CPU's capacity is SCHED_CAPACITY_SCALE. We can now have a better idea of the capacity of a group of CPUs and of its utilization thanks to the rework of group_capacity_orig and to group_utilization. We can now deduce how much capacity is still available.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 115 +++++++++++++++++++++-------------------------------
 1 file changed, 47 insertions(+), 68 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8524760..292ee7c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5612,13 +5612,13 @@ struct sg_lb_stats { unsigned long sum_weighted_load; /* Weighted load of group's tasks */ unsigned long load_per_task; unsigned long group_capacity; + unsigned long group_capacity_orig; unsigned long group_utilization; /* Total utilization of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ - unsigned int group_capacity_factor; unsigned int idle_cpus; unsigned int group_weight; int group_imb; /* Is there an imbalance in the group ? */ - int group_has_free_capacity; + int group_out_of_capacity; #ifdef CONFIG_NUMA_BALANCING unsigned int nr_numa_running; unsigned int nr_preferred_running; @@ -5838,31 +5838,6 @@ void update_group_capacity(struct sched_domain *sd, int cpu) }
/* - * Try and fix up capacity for tiny siblings, this is needed when - * things like SD_ASYM_PACKING need f_b_g to select another sibling - * which on its own isn't powerful enough. - * - * See update_sd_pick_busiest() and check_asym_packing(). - */ -static inline int -fix_small_capacity(struct sched_domain *sd, struct sched_group *group) -{ - /* - * Only siblings can have significantly less than SCHED_CAPACITY_SCALE - */ - if (!(sd->flags & SD_SHARE_CPUCAPACITY)) - return 0; - - /* - * If ~90% of the cpu_capacity is still there, we're good. - */ - if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29) - return 1; - - return 0; -} - -/* * Group imbalance indicates (and tries to solve) the problem where balancing * groups is inadequate due to tsk_cpus_allowed() constraints. * @@ -5896,32 +5871,30 @@ static inline int sg_imbalanced(struct sched_group *group) return group->sgc->imbalance; }
-/* - * Compute the group capacity factor. - * - * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by - * first dividing out the smt factor and computing the actual number of cores - * and limit unit capacity with that. - */ -static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group) +static inline int group_has_free_capacity(struct sg_lb_stats *sgs, + struct lb_env *env) { - unsigned int capacity_factor, smt, cpus; - unsigned int capacity, capacity_orig; + if ((sgs->group_capacity_orig * 100) > + (sgs->group_utilization * env->sd->imbalance_pct)) + return 1;
- capacity = group->sgc->capacity; - capacity_orig = group->sgc->capacity_orig; - cpus = group->group_weight; + if (sgs->sum_nr_running < sgs->group_weight) + return 1;
- /* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */ - smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig); - capacity_factor = cpus / smt; /* cores */ + return 0; +}
- capacity_factor = min_t(unsigned, - capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE)); - if (!capacity_factor) - capacity_factor = fix_small_capacity(env->sd, group); +static inline int group_is_overloaded(struct sg_lb_stats *sgs, + struct lb_env *env) +{ + if (sgs->sum_nr_running <= sgs->group_weight) + return 0;
- return capacity_factor; + if ((sgs->group_capacity_orig * 100) < + (sgs->group_utilization * env->sd->imbalance_pct)) + return 1; + + return 0; }
/** @@ -5967,6 +5940,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->idle_cpus++; }
+ sgs->group_capacity_orig = group->sgc->capacity_orig; /* Adjust by relative CPU capacity of the group */ sgs->group_capacity = group->sgc->capacity; sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity; @@ -5977,10 +5951,8 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->group_weight = group->group_weight;
sgs->group_imb = sg_imbalanced(group); - sgs->group_capacity_factor = sg_capacity_factor(env, group);
- if (sgs->group_capacity_factor > sgs->sum_nr_running) - sgs->group_has_free_capacity = 1; + sgs->group_out_of_capacity = group_is_overloaded(sgs, env); }
/** @@ -6004,7 +5976,7 @@ static bool update_sd_pick_busiest(struct lb_env *env, if (sgs->avg_load <= sds->busiest_stat.avg_load) return false;
- if (sgs->sum_nr_running > sgs->group_capacity_factor) + if (sgs->group_out_of_capacity) return true;
if (sgs->group_imb) @@ -6105,17 +6077,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* * In case the child domain prefers tasks go to siblings - * first, lower the sg capacity factor to one so that we'll try + * first, lower the sg capacity to one so that we'll try * and move all the excess tasks away. We lower the capacity * of a group only if the local group has the capacity to fit - * these excess tasks, i.e. nr_running < group_capacity_factor. The + * these excess tasks, i.e. group_capacity > 0. The * extra check prevents the case where you always pull from the * heaviest group when it is already under-utilized (possible * with a large weight task outweighs the tasks on the system). */ if (prefer_sibling && sds->local && - sds->local_stat.group_has_free_capacity) - sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U); + group_has_free_capacity(&sds->local_stat, env)) { + if (sgs->sum_nr_running > 1) + sgs->group_out_of_capacity = 1; + sgs->group_capacity = min(sgs->group_capacity, + SCHED_CAPACITY_SCALE); + }
if (update_sd_pick_busiest(env, sds, sg, sgs)) { sds->busiest = sg; @@ -6294,11 +6270,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s * Except of course for the group_imb case, since then we might * have to drop below capacity to reach cpu-load equilibrium. */ - load_above_capacity = - (busiest->sum_nr_running - busiest->group_capacity_factor); - - load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE); - load_above_capacity /= busiest->group_capacity; + load_above_capacity = busiest->sum_nr_running * + SCHED_LOAD_SCALE; + if (load_above_capacity > busiest->group_capacity) + load_above_capacity -= busiest->group_capacity; + else + load_above_capacity = ~0UL; }
/* @@ -6361,6 +6338,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env) local = &sds.local_stat; busiest = &sds.busiest_stat;
+ /* ASYM feature bypasses nice load balance check */ if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) && check_asym_packing(env, &sds)) return sds.busiest; @@ -6381,8 +6359,9 @@ static struct sched_group *find_busiest_group(struct lb_env *env) goto force_balance;
/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */ - if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity && - !busiest->group_has_free_capacity) + if (env->idle == CPU_NEWLY_IDLE && + group_has_free_capacity(local, env) && + busiest->group_out_of_capacity) goto force_balance;
/* @@ -6440,7 +6419,7 @@ static struct rq *find_busiest_queue(struct lb_env *env, int i;
for_each_cpu_and(i, sched_group_cpus(group), env->cpus) { - unsigned long capacity, capacity_factor, wl; + unsigned long capacity, wl; enum fbq_type rt;
rq = cpu_rq(i); @@ -6469,9 +6448,6 @@ static struct rq *find_busiest_queue(struct lb_env *env, continue;
capacity = capacity_of(i); - capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE); - if (!capacity_factor) - capacity_factor = fix_small_capacity(env->sd, group);
wl = weighted_cpuload(i);
@@ -6479,7 +6455,10 @@ static struct rq *find_busiest_queue(struct lb_env *env, * When comparing with imbalance, use weighted_cpuload() * which is not scaled with the cpu capacity. */ - if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance) + + if (rq->nr_running == 1 && wl > env->imbalance && + ((capacity * env->sd->imbalance_pct) >= + (rq->cpu_capacity_orig * 100))) continue;
/*
Add the SD_PREFER_SIBLING flag at the SMT level in order to ensure that the scheduler will put at least 1 task per core.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fc5a275..3e4eea4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6080,6 +6080,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu) */
if (sd->flags & SD_SHARE_CPUCAPACITY) { + sd->flags |= SD_PREFER_SIBLING; sd->imbalance_pct = 110; sd->smt_gain = 1178; /* ~15% */
On 07/28/2014 01:51 PM, Vincent Guittot wrote:
The usage_avg_contrib is based on the current implementation of the load avg tracking. I also have a version of the usage_avg_contrib that is based on the new implementation [3] but haven't provide the patches and results as [3] is still under review. I can provide change above [3] to change how usage_avg_contrib is computed and adapt to new mecanism.
TODO: manage conflict with the next version of [4]
FWIW, your series makes sense to me, and only the first hunk of patch 8/12 looks like it would conflict with my patches, in that your patch changes code that will only be reached in exceptional circumstances.