Part of this patchset was previously part of the larger tasks packing patchset [1]. I have split the latter into (at least) 3 different patchsets to make things easier:
- configuration of sched_domain topology [2]
- update and consolidation of cpu_power (this patchset)
- tasks packing algorithm
SMT systems are no longer the only ones that can have CPUs with an original capacity different from the default value. We need to extend the use of cpu_power_orig to all kinds of platforms so the scheduler has both the maximum capacity (cpu_power_orig/power_orig) and the current capacity (cpu_power/power) of CPUs and sched_groups. A new function, arch_scale_cpu_power, has been created and replaces arch_scale_smt_power, which is SMT-specific, in the computation of the capacity of a CPU.
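As an illustration, overriding the new hook mostly boils down to returning a per-CPU value; the ARM patch later in this series does essentially the following (the function ends up named arch_scale_cpu_capacity after the capacity renaming, and cpu_scale is filled from the topology code at boot):

/* per-CPU original capacity, filled from the DT/topology code at boot */
static DEFINE_PER_CPU(unsigned long, cpu_scale);

unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
	return per_cpu(cpu_scale, cpu);
}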
During load balance, the scheduler evaluates the number of tasks that a group of CPUs can handle. The current method assumes that tasks have a fixed load of SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_POWER_SCALE. This assumption generates wrong decisions by creating ghost cores and by removing real ones when the original capacity of CPUs is different from the default SCHED_POWER_SCALE. We no longer try to evaluate the number of available cores based on the group_capacity; instead, we detect when the group is fully utilized.
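To give a rough idea of the direction (this is only a sketch; the field names follow the later patches of the series and the margin value is illustrative):

/*
 * Sketch only: a group is considered fully utilized once its utilization
 * reaches its capacity (with a small margin), instead of comparing
 * nr_running against a computed number of "cores".
 */
static inline bool group_is_fully_utilized(unsigned long group_utilization,
					   unsigned long group_capacity)
{
	/* the 100/95 margin is illustrative, similar in spirit to imbalance_pct */
	return group_utilization * 100 >= group_capacity * 95;
}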
Now that we have the original capacity of CPUs and their activity/utilization, we can evaluate more accurately the capacity and the level of utilization of a group of CPUs.
This patchset mainly replaces the old capacity method with a new one and keeps the policy almost unchanged, although we could certainly take advantage of this new statistic in several other places of the load balance.
Test results: I have put below the results of 3 kinds of tests, run on 3 kernels:
- hackbench -l 500 -s 4096
- scp of a 100MB file on the platform
- ebizzy with various numbers of threads
tip       = tip/sched/core
patch     = tip + this patchset
patch+irq = tip + this patchset + irq accounting
Each test has been run 6 times; the figures below show the stdev and the diff compared to the tip kernel.
Dual cortex A7     tip         |  patch             |  patch+irq
                   stdev       |  diff    stdev     |  diff    stdev
hackbench      (+/-)1.02%      | +0.42% (+/-)1.29%  | -0.65% (+/-)0.44%
scp            (+/-)0.41%      | +0.18% (+/-)0.10%  | +78.05% (+/-)0.70%
ebizzy -t 1    (+/-)1.72%      | +1.43% (+/-)1.62%  | +2.58% (+/-)2.11%
ebizzy -t 2    (+/-)0.42%      | +0.06% (+/-)0.45%  | +1.45% (+/-)4.05%
ebizzy -t 4    (+/-)0.73%      | +8.39% (+/-)13.25% | +4.25% (+/-)10.08%
ebizzy -t 6    (+/-)10.30%     | +2.19% (+/-)3.59%  | +0.58% (+/-)1.80%
ebizzy -t 8    (+/-)1.45%      | -0.05% (+/-)2.18%  | +2.53% (+/-)3.40%
ebizzy -t 10   (+/-)3.78%      | -2.71% (+/-)2.79%  | -3.16% (+/-)3.06%
ebizzy -t 12   (+/-)3.21%      | +1.13% (+/-)2.02%  | -1.13% (+/-)4.43%
ebizzy -t 14   (+/-)2.05%      | +0.15% (+/-)3.47%  | -2.08% (+/-)1.40%

Quad cortex A15    tip         |  patch             |  patch+irq
                   stdev       |  diff    stdev     |  diff    stdev
hackbench      (+/-)0.55%      | -0.58% (+/-)0.90%  | +0.62% (+/-)0.43%
scp            (+/-)0.21%      | -0.10% (+/-)0.10%  | +5.70% (+/-)0.53%
ebizzy -t 1    (+/-)0.42%      | -0.58% (+/-)0.48%  | -0.29% (+/-)0.18%
ebizzy -t 2    (+/-)0.52%      | -0.83% (+/-)0.20%  | -2.07% (+/-)0.35%
ebizzy -t 4    (+/-)0.22%      | -1.39% (+/-)0.49%  | -1.78% (+/-)0.67%
ebizzy -t 6    (+/-)0.44%      | -0.78% (+/-)0.15%  | -1.79% (+/-)1.10%
ebizzy -t 8    (+/-)0.43%      | +0.13% (+/-)0.92%  | -0.17% (+/-)0.67%
ebizzy -t 10   (+/-)0.71%      | +0.10% (+/-)0.93%  | -0.36% (+/-)0.77%
ebizzy -t 12   (+/-)0.65%      | -1.07% (+/-)1.13%  | -1.13% (+/-)0.70%
ebizzy -t 14   (+/-)0.92%      | -0.28% (+/-)1.25%  | +2.84% (+/-)9.33%
I haven't been able to fully test the patchset on an SMT system to check that the regression reported by Preeti has been solved, but the various tests that I have done don't show any regression so far. The correction of the SD_PREFER_SIBLING mode and its use at SMT level should have fixed the regression.
Changes since V2:
- rebase on top of the capacity renaming
- fix wake_affine statistic update
- rework nohz_kick_needed
- optimize the active migration of a task from a CPU with reduced capacity
- rename group_activity to group_utilization and remove the unused total_utilization
- repair SD_PREFER_SIBLING and use it at SMT level
- reorder the patchset to gather patches with the same topic
Changes since V1:
- add 3 fixes
- correct some commit messages
- replace capacity computation by activity
- take into account the current cpu capacity
[1] https://lkml.org/lkml/2013/10/18/121
[2] https://lkml.org/lkml/2014/3/19/377
Vincent Guittot (12):
  sched: fix imbalance flag reset
  sched: remove a wake_affine condition
  sched: fix avg_load computation
  sched: Allow all archs to set the power_orig
  ARM: topology: use new cpu_power interface
  sched: add per rq cpu_power_orig
  sched: test the cpu's capacity in wake affine
  sched: move cfs task on a CPU with higher capacity
  Revert "sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED"
  sched: get CPU's utilization statistic
  sched: replace capacity_factor by utilization
  sched: add SD_PREFER_SIBLING for SMT level
 arch/arm/kernel/topology.c |   4 +-
 kernel/sched/core.c        |   3 +-
 kernel/sched/fair.c        | 290 +++++++++++++++++++++++----------------------
 kernel/sched/sched.h       |   5 +-
 4 files changed, 158 insertions(+), 144 deletions(-)
The imbalance flag can stay set even though there is no imbalance.

Let's assume that we have 3 tasks that run on a dual core / dual cluster system. We will have some idle load balances which are triggered during ticks. Unfortunately, the tick is also used to queue background work, so we can reach the situation where a short work item has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task onto the idle CPU. The waiting task is a worker thread that is pinned to a CPU, so an imbalance due to a pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU, but the imbalance flag will trigger useless active load balances between the idle CPU and the busy CPU.

We need to reset the imbalance flag as soon as we have reached a balanced state.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3c73122..0c48dff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6615,10 +6615,8 @@ more_balance:
 		if (sd_parent) {
 			int *group_imbalance = &sd_parent->groups->sgc->imbalance;
-			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
+			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
 				*group_imbalance = 1;
-			} else if (*group_imbalance)
-				*group_imbalance = 0;
 		}
 		/* All tasks on this runqueue were pinned by CPU affinity */
@@ -6703,6 +6701,16 @@ more_balance:
 	goto out;

 out_balanced:
+	/*
+	 * We reach balance although we may have faced some affinity
+	 * constraints. Clear the imbalance flag if it was set.
+	 */
+	if (sd_parent) {
+		int *group_imbalance = &sd_parent->groups->sgc->imbalance;
+		if (*group_imbalance)
+			*group_imbalance = 0;
+	}
+
 	schedstat_inc(sd, lb_balanced[idle]);
sd->nr_balance_failed = 0;
On 06/30/2014 09:35 PM, Vincent Guittot wrote:
The imbalance flag can stay set even though there is no imbalance.

Let's assume that we have 3 tasks that run on a dual core / dual cluster system. We will have some idle load balances which are triggered during ticks. Unfortunately, the tick is also used to queue background work, so we can reach the situation where a short work item has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task onto the idle CPU. The waiting task is a worker thread that is pinned to a CPU, so an imbalance due to a pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU, but the imbalance flag will trigger useless active load balances between the idle CPU and the busy CPU.

We need to reset the imbalance flag as soon as we have reached a balanced state.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d3c73122..0c48dff 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6615,10 +6615,8 @@ more_balance: if (sd_parent) { int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) *group_imbalance = 1;
} else if (*group_imbalance)
*group_imbalance = 0;
}
/* All tasks on this runqueue were pinned by CPU affinity */
@@ -6703,6 +6701,16 @@ more_balance: goto out;
out_balanced:
/*
* We reach balance although we may have faced some affinity
* constraints. Clear the imbalance flag if it was set.
*/
if (sd_parent) {
int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if (*group_imbalance)
*group_imbalance = 0;
}
schedstat_inc(sd, lb_balanced[idle]);
sd->nr_balance_failed = 0;
I am not convinced that we can clear the imbalance flag here. Let's take a simple example. Assume that at a particular level of the sched_domain there are two sched_groups with one cpu each. There are 2 tasks on the source cpu, one of which is running (t1), and the other thread (t2) does not have the dst_cpu in its tsk_allowed_mask. Now no task can be migrated to the dst_cpu due to affinity constraints. Note that t2 is *not pinned, it just cannot run on the dst_cpu*. In this scenario also we reach the out_balanced tag, right? If we set the group_imbalance flag to 0, we are ruling out the possibility of migrating t2 to any other cpu in a higher level sched_domain by saying that all is well, there is no imbalance. This is wrong, isn't it?
My point is that by clearing the imbalance flag in the out_balanced case, you might be overlooking the fact that the tsk_cpus_allowed mask of the tasks on the src_cpu may not allow them to run on the dst_cpu at *this* level of the sched_domain, while they can potentially run on a cpu at a higher level of the sched_domain. By clearing the flag, we are not encouraging load balance at that level for t2.
Am I missing something?
Regards Preeti U Murthy
On 8 July 2014 05:13, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
On 06/30/2014 09:35 PM, Vincent Guittot wrote:
The imbalance flag can stay set even though there is no imbalance.

Let's assume that we have 3 tasks that run on a dual core / dual cluster system. We will have some idle load balances which are triggered during ticks. Unfortunately, the tick is also used to queue background work, so we can reach the situation where a short work item has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task onto the idle CPU. The waiting task is a worker thread that is pinned to a CPU, so an imbalance due to a pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU, but the imbalance flag will trigger useless active load balances between the idle CPU and the busy CPU.

We need to reset the imbalance flag as soon as we have reached a balanced state.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d3c73122..0c48dff 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6615,10 +6615,8 @@ more_balance: if (sd_parent) { int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) *group_imbalance = 1;
} else if (*group_imbalance)
*group_imbalance = 0; } /* All tasks on this runqueue were pinned by CPU affinity */
@@ -6703,6 +6701,16 @@ more_balance: goto out;
out_balanced:
/*
* We reach balance although we may have faced some affinity
* constraints. Clear the imbalance flag if it was set.
*/
if (sd_parent) {
int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if (*group_imbalance)
*group_imbalance = 0;
}
schedstat_inc(sd, lb_balanced[idle]); sd->nr_balance_failed = 0;
I am not convinced that we can clear the imbalance flag here. Lets take a simple example. Assume at a particular level of sched_domain, there are two sched_groups with one cpu each. There are 2 tasks on the source cpu, one of which is running(t1) and the other thread(t2) does not have the dst_cpu in the tsk_allowed_mask. Now no task can be migrated to the dst_cpu due to affinity constraints. Note that t2 is *not pinned, it just cannot run on the dst_cpu*. In this scenario also we reach the out_balanced tag right? If we set the group_imbalance flag to 0, we are
No, we will not. If we have 2 tasks on 1 CPU in one sched_group and the other group with an idle CPU, we are not balanced, so we will not go to out_balanced and the group_imbalance will stay set until we reach a balanced state (by migrating t1).
ruling out the possibility of migrating t2 to any other cpu in a higher level sched_domain by saying that all is well, there is no imbalance. This is wrong, isn't it?
My point is that by clearing the imbalance flag in the out_balanced case, you might be overlooking the fact that the tsk_cpus_allowed mask of the tasks on the src_cpu may not be able to run on the dst_cpu in *this* level of sched_domain, but can potentially run on a cpu at any higher level of sched_domain. By clearing the flag, we are not
The imbalance flag is per sched_domain level, so we will not clear the group_imbalance flag of other levels. If the imbalance is also detected at a higher level, it will migrate t2.
Regards, Vincent
encouraging load balance at that level for t2.
Am I missing something?
Regards Preeti U Murthy
Hi Vincent,
On 07/08/2014 03:42 PM, Vincent Guittot wrote:
On 8 July 2014 05:13, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
On 06/30/2014 09:35 PM, Vincent Guittot wrote:
The imbalance flag can stay set even though there is no imbalance.

Let's assume that we have 3 tasks that run on a dual core / dual cluster system. We will have some idle load balances which are triggered during ticks. Unfortunately, the tick is also used to queue background work, so we can reach the situation where a short work item has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task onto the idle CPU. The waiting task is a worker thread that is pinned to a CPU, so an imbalance due to a pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU, but the imbalance flag will trigger useless active load balances between the idle CPU and the busy CPU.

We need to reset the imbalance flag as soon as we have reached a balanced state.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d3c73122..0c48dff 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6615,10 +6615,8 @@ more_balance: if (sd_parent) { int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) *group_imbalance = 1;
} else if (*group_imbalance)
*group_imbalance = 0; } /* All tasks on this runqueue were pinned by CPU affinity */
@@ -6703,6 +6701,16 @@ more_balance: goto out;
out_balanced:
/*
* We reach balance although we may have faced some affinity
* constraints. Clear the imbalance flag if it was set.
*/
if (sd_parent) {
int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if (*group_imbalance)
*group_imbalance = 0;
}
schedstat_inc(sd, lb_balanced[idle]); sd->nr_balance_failed = 0;
I am not convinced that we can clear the imbalance flag here. Lets take a simple example. Assume at a particular level of sched_domain, there are two sched_groups with one cpu each. There are 2 tasks on the source cpu, one of which is running(t1) and the other thread(t2) does not have the dst_cpu in the tsk_allowed_mask. Now no task can be migrated to the dst_cpu due to affinity constraints. Note that t2 is *not pinned, it just cannot run on the dst_cpu*. In this scenario also we reach the out_balanced tag right? If we set the group_imbalance flag to 0, we are
No we will not. If we have 2 tasks on 1 CPU in one sched_group and the other group with an idle CPU, we are not balanced so we will not go to out_balanced and the group_imbalance will staty set until we reach a balanced state (by migrating t1).
In the example that I mention above, t1 and t2 are on the rq of cpu0; while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in its cpus allowed mask. So during load balance, cpu1 tries to pull t2, cannot do so, and hence the LBF_ALL_PINNED flag is set and it jumps to out_balanced. Note that there are only two sched groups at this level of the sched domain: one with cpu0 and the other with cpu1. In this scenario we do not try to do active load balancing, at least that's what the code does now if the LBF_ALL_PINNED flag is set.
ruling out the possibility of migrating t2 to any other cpu in a higher level sched_domain by saying that all is well, there is no imbalance. This is wrong, isn't it?
My point is that by clearing the imbalance flag in the out_balanced case, you might be overlooking the fact that the tsk_cpus_allowed mask of the tasks on the src_cpu may not be able to run on the dst_cpu in *this* level of sched_domain, but can potentially run on a cpu at any higher level of sched_domain. By clearing the flag, we are not
The imbalance flag is per sched_domain level so we will not clear group_imbalance flag of other levels if the imbalance is also detected at a higher level it will migrate t2
Continuing with the above explanation: when the LBF_ALL_PINNED flag is set and we jump to out_balanced, we clear the imbalance flag for the sched_group comprising cpu0 and cpu1, although there is actually an imbalance. t2 could still be migrated to, say, cpu2/cpu3 (t2 has them in its cpus allowed mask) in another sched group when load balancing is done at the next sched domain level.
Elaborating on this: when cpu2, in another socket let's say, begins load balancing and update_sd_pick_busiest() is called, the group with cpu0 and cpu1 may not be picked as a potential imbalanced group. Had we not cleared the imbalance flag for this group, we could have balanced out t2 to cpu2/3.
Is the scenario I am describing clear?
Regards Preeti U Murthy
Regards, Vincent
encouraging load balance at that level for t2.
Am I missing something?
Regards Preeti U Murthy
On 9 July 2014 05:54, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
On 07/08/2014 03:42 PM, Vincent Guittot wrote:
[ snip]
out_balanced:
/*
* We reach balance although we may have faced some affinity
* constraints. Clear the imbalance flag if it was set.
*/
if (sd_parent) {
int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if (*group_imbalance)
*group_imbalance = 0;
}
schedstat_inc(sd, lb_balanced[idle]); sd->nr_balance_failed = 0;
I am not convinced that we can clear the imbalance flag here. Lets take a simple example. Assume at a particular level of sched_domain, there are two sched_groups with one cpu each. There are 2 tasks on the source cpu, one of which is running(t1) and the other thread(t2) does not have the dst_cpu in the tsk_allowed_mask. Now no task can be migrated to the dst_cpu due to affinity constraints. Note that t2 is *not pinned, it just cannot run on the dst_cpu*. In this scenario also we reach the out_balanced tag right? If we set the group_imbalance flag to 0, we are
No we will not. If we have 2 tasks on 1 CPU in one sched_group and the other group with an idle CPU, we are not balanced so we will not go to out_balanced and the group_imbalance will staty set until we reach a balanced state (by migrating t1).
In the example that I mention above, t1 and t2 are on the rq of cpu0; while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in its cpus allowed mask. So during load balance, cpu1 tries to pull t2, cannot do so, and hence LBF_ALL_PINNED flag is set and it jumps to
That's where I disagree: my understanding of can_migrate_task is that LBF_ALL_PINNED will be cleared before returning false when checking t1, because we are testing all tasks, even the running task.
out_balanced. Note that there are only two sched groups at this level of sched domain.one with cpu0 and the other with cpu1. In this scenario we do not try to do active load balancing, atleast thats what the code does now if LBF_ALL_PINNED flag is set.
ruling out the possibility of migrating t2 to any other cpu in a higher level sched_domain by saying that all is well, there is no imbalance. This is wrong, isn't it?
My point is that by clearing the imbalance flag in the out_balanced case, you might be overlooking the fact that the tsk_cpus_allowed mask of the tasks on the src_cpu may not be able to run on the dst_cpu in *this* level of sched_domain, but can potentially run on a cpu at any higher level of sched_domain. By clearing the flag, we are not
The imbalance flag is per sched_domain level so we will not clear group_imbalance flag of other levels if the imbalance is also detected at a higher level it will migrate t2
Continuing with the above explanation; when LBF_ALL_PINNED flag is set,and we jump to out_balanced, we clear the imbalance flag for the sched_group comprising of cpu0 and cpu1,although there is actually an imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in its cpus allowed mask) in another sched group when load balancing is done at the next sched domain level.
The imbalance is per sched_domain level so it will not have any side effect on the next level
Regards, Vincent
Elaborating on this, when cpu2 in another socket,lets say, begins load balancing and update_sd_pick_busiest() is called, the group with cpu0 and cpu1 may not be picked as a potential imbalanced group. Had we not cleared the imbalance flag for this group, we could have balanced out t2 to cpu2/3.
Is the scenario I am describing clear?
Regards Preeti U Murthy
Regards, Vincent
encouraging load balance at that level for t2.
Am I missing something?
Regards Preeti U Murthy
On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote:
In the example that I mention above, t1 and t2 are on the rq of cpu0; while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in its cpus allowed mask. So during load balance, cpu1 tries to pull t2, cannot do so, and hence LBF_ALL_PINNED flag is set and it jumps to out_balanced. Note that there are only two sched groups at this level of sched domain.one with cpu0 and the other with cpu1. In this scenario we do not try to do active load balancing, atleast thats what the code does now if LBF_ALL_PINNED flag is set.
I think Vince is right in saying that in this scenario ALL_PINNED won't be set. move_tasks() will iterate cfs_rq::cfs_tasks, that list will also include the current running task.
And can_migrate_task() only checks for current after the pinning bits.
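For reference, the ordering in can_migrate_task() looks roughly like this (condensed sketch, not the verbatim source; the cache-hotness part is omitted):

static int can_migrate_task(struct task_struct *p, struct lb_env *env)
{
	/* 1) throttled tasks are skipped */
	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
		return 0;

	/* 2) affinity check: dst_cpu not allowed -> SOME_PINNED, ALL_PINNED untouched */
	if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
		env->flags |= LBF_SOME_PINNED;
		return 0;
	}

	/* we found at least one task that could run on dst_cpu */
	env->flags &= ~LBF_ALL_PINNED;

	/* 3) only now is the currently running task rejected */
	if (task_running(env->src_rq, p))
		return 0;

	/* 4) cache-hotness checks follow (omitted here) */
	return 1;
}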
Continuing with the above explanation; when LBF_ALL_PINNED flag is set,and we jump to out_balanced, we clear the imbalance flag for the sched_group comprising of cpu0 and cpu1,although there is actually an imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in its cpus allowed mask) in another sched group when load balancing is done at the next sched domain level.
And this is where Vince is wrong; note how update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly one level up.
So what we can do I suppose is clear 'group->sgc->imbalance' at out_balanced.
In any case, the entirety of this group imbalance crap is just that, crap. It's a terribly difficult situation and the current bits more or less fudge around some of the common cases. Also see the comment near sg_imbalanced(). It's not a solid and 'correct' anything. It's a bunch of hacks trying to deal with hard cases.
A 'good' solution would be prohibitively expensive I fear.
On 07/09/2014 04:13 PM, Peter Zijlstra wrote:
On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote:
In the example that I mention above, t1 and t2 are on the rq of cpu0; while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in its cpus allowed mask. So during load balance, cpu1 tries to pull t2, cannot do so, and hence LBF_ALL_PINNED flag is set and it jumps to out_balanced. Note that there are only two sched groups at this level of sched domain.one with cpu0 and the other with cpu1. In this scenario we do not try to do active load balancing, atleast thats what the code does now if LBF_ALL_PINNED flag is set.
I think Vince is right in saying that in this scenario ALL_PINNED won't be set. move_tasks() will iterate cfs_rq::cfs_tasks, that list will also include the current running task.
Hmm.. really? Because while dequeueing a task from the rq so as to schedule it on a cpu, we delete its entry from the list of cfs_tasks on the rq.
list_del_init(&se->group_node) in account_entity_dequeue() does that.
And can_migrate_task() only checks for current after the pinning bits.
Continuing with the above explanation; when LBF_ALL_PINNED flag is set,and we jump to out_balanced, we clear the imbalance flag for the sched_group comprising of cpu0 and cpu1,although there is actually an imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in its cpus allowed mask) in another sched group when load balancing is done at the next sched domain level.
And this is where Vince is wrong; note how update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly one level up.
One level up? The group->sgc->imbalance flag is checked during update_sg_lb_stats(). This flag is *set during the load balancing at a lower level sched domain*. IOW, when the 'group' formed the sched domain.
So what we can do I suppose is clear 'group->sgc->imbalance' at out_balanced.
You mean 'set'? If we clear it we will have no clue about imbalances at lower level sched domains due to pinning. Specifically in LBF_ALL_PINNED case. This might prevent us from balancing out these tasks to other groups at a higher level domain. update_sd_pick_busiest() specifically relies on this flag to choose the busiest group.
In any case, the entirely of this group imbalance crap is just that, crap. Its a terribly difficult situation and the current bits more or less fudge around some of the common cases. Also see the comment near sg_imbalanced(). Its not a solid and 'correct' anything. Its a bunch of hacks trying to deal with hard cases.
A 'good' solution would be prohibitively expensive I fear.
So the problem that Vincent is trying to bring to the fore with this patchset is that, when the busy group has cpus with only 1 running task but the imbalance flag set due to affinity constraints, we unnecessarily try to do active balancing on this group. Active load balancing will not succeed when there is only one task. To solve this issue, will a simple check on (busiest->nr_running > 1) in addition to !(ld_moved) before doing active load balancing not work?
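Something like the below is what I have in mind, just as a sketch and not a tested patch:

	/* sketch: in load_balance(), after the pull attempts */
	if (!ld_moved) {
		/*
		 * Only fall back to active balancing when the busiest rq
		 * actually has more than one task; with a single (running
		 * or pinned) task, active balancing cannot help anyway.
		 */
		if (busiest->nr_running > 1 && need_active_balance(&env)) {
			/* ... existing active load balance path ... */
		}
	}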
Thanks
Regards Preeti U Murthy
On Wed, Jul 09, 2014 at 05:11:20PM +0530, Preeti U Murthy wrote:
On 07/09/2014 04:13 PM, Peter Zijlstra wrote:
On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote:
In the example that I mention above, t1 and t2 are on the rq of cpu0; while t1 is running on cpu0, t2 is on the rq but does not have cpu1 in its cpus allowed mask. So during load balance, cpu1 tries to pull t2, cannot do so, and hence LBF_ALL_PINNED flag is set and it jumps to out_balanced. Note that there are only two sched groups at this level of sched domain.one with cpu0 and the other with cpu1. In this scenario we do not try to do active load balancing, atleast thats what the code does now if LBF_ALL_PINNED flag is set.
I think Vince is right in saying that in this scenario ALL_PINNED won't be set. move_tasks() will iterate cfs_rq::cfs_tasks, that list will also include the current running task.
Hmm.. really? Because while dequeueing a task from the rq so as to schedule it on a cpu, we delete its entry from the list of cfs_tasks on the rq.
list_del_init(&se->group_node) in account_entity_dequeue() does that.
But set_next_entity() doesn't call account_entity_dequeue(), only __dequeue_entity() to take it out of the rb-tree.
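Condensed, for reference (stats/debug bits left out, not verbatim):

static void set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	if (se->on_rq) {
		update_stats_wait_end(cfs_rq, se);
		__dequeue_entity(cfs_rq, se);	/* rb-tree removal only */
	}

	update_stats_curr_start(cfs_rq, se);
	cfs_rq->curr = se;
	se->prev_sum_exec_runtime = se->sum_exec_runtime;
}

So the entity leaves the rb-tree but stays on cfs_rq->cfs_tasks; account_entity_dequeue(), which does the list_del_init(), is not called here.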
And can_migrate_task() only checks for current after the pinning bits.
Continuing with the above explanation; when LBF_ALL_PINNED flag is set,and we jump to out_balanced, we clear the imbalance flag for the sched_group comprising of cpu0 and cpu1,although there is actually an imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in its cpus allowed mask) in another sched group when load balancing is done at the next sched domain level.
And this is where Vince is wrong; note how update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly one level up.
One level up? The group->sgc->imbalance flag is checked during update_sg_lb_stats(). This flag is *set during the load balancing at a lower level sched domain*.IOW, when the 'group' formed the sched domain.
sd_parent is one level up.
So what we can do I suppose is clear 'group->sgc->imbalance' at out_balanced.
You mean 'set'? If we clear it we will have no clue about imbalances at lower level sched domains due to pinning. Specifically in LBF_ALL_PINNED case. This might prevent us from balancing out these tasks to other groups at a higher level domain. update_sd_pick_busiest() specifically relies on this flag to choose the busiest group.
No, clear, in load_balance. So set one level up, clear the current level.
On 9 July 2014 12:43, Peter Zijlstra peterz@infradead.org wrote:
On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote:
[snip]
Continuing with the above explanation; when LBF_ALL_PINNED flag is set,and we jump to out_balanced, we clear the imbalance flag for the sched_group comprising of cpu0 and cpu1,although there is actually an imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in its cpus allowed mask) in another sched group when load balancing is done at the next sched domain level.
And this is where Vince is wrong; note how update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly one level up.
I forgot this behavior when studying Preeti's use case.
So what we can do I suppose is clear 'group->sgc->imbalance' at out_balanced.
In any case, the entirely of this group imbalance crap is just that, crap. Its a terribly difficult situation and the current bits more or less fudge around some of the common cases. Also see the comment near sg_imbalanced(). Its not a solid and 'correct' anything. Its a bunch of hacks trying to deal with hard cases.
A 'good' solution would be prohibitively expensive I fear.
I have tried to summarize several use cases that have been discussed for this patch.

The 1st use case is the one that I described in the commit message of this patch: if we have a sporadic imbalance that sets the imbalance flag, we don't clear it afterwards and it generates spurious and useless active load balances.

Then Preeti came with the following use case: we have a sched_domain made of CPU0 and CPU1 in 2 different sched_groups. 2 tasks A and B are on CPU0, B can't run on CPU1, and A is the running task. When CPU1's sched_group is doing load balance, the imbalance flag should be set. That still happens with this patchset because the LBF_ALL_PINNED flag will be cleared thanks to task A.

Preeti also explained the following use case to me on irc:

If we have both tasks A and B that can't run on CPU1, LBF_ALL_PINNED will stay set. As we can't do anything, we conclude that we are balanced, we go to out_balanced and we clear the imbalance flag. But we should not consider that a balanced state but an all-tasks-pinned state instead, and we should leave the imbalance flag set. If we now have 2 additional CPUs which are in the cpumask of task A and/or B at the parent sched_domain level, we should migrate one task out of this group, but this will not happen (with this patch) because the sched_group made of CPU0 and CPU1 is not overloaded (2 tasks for 2 CPUs) and the imbalance flag has been cleared as described previously.
I'm going to send a new revision of the patchset with the correction
Vincent
The imbalance flag can stay set even though there is no imbalance.

Let's assume that we have 3 tasks that run on a dual core / dual cluster system. We will have some idle load balances which are triggered during ticks. Unfortunately, the tick is also used to queue background work, so we can reach the situation where a short work item has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task onto the idle CPU. The waiting task is a worker thread that is pinned to a CPU, so an imbalance due to a pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU, but the imbalance flag will trigger useless active load balances between the idle CPU and the busy CPU.

We need to reset the imbalance flag as soon as we have reached a balanced state. If all tasks are pinned, we don't consider that a balanced state and leave the imbalance flag set.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3c73122..a836198 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6615,10 +6615,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		if (sd_parent) {
 			int *group_imbalance = &sd_parent->groups->sgc->imbalance;
-			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
+			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
 				*group_imbalance = 1;
-			} else if (*group_imbalance)
-				*group_imbalance = 0;
 		}
 		/* All tasks on this runqueue were pinned by CPU affinity */
@@ -6629,7 +6627,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 				env.loop_break = sched_nr_migrate_break;
 				goto redo;
 			}
-			goto out_balanced;
+			goto out_all_pinned;
 		}
 	}
@@ -6703,6 +6701,22 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	goto out;

 out_balanced:
+	/*
+	 * We reach balance although we may have faced some affinity
+	 * constraints. Clear the imbalance flag if it was set.
+	 */
+	if (sd_parent) {
+		int *group_imbalance = &sd_parent->groups->sgc->imbalance;
+		if (*group_imbalance)
+			*group_imbalance = 0;
+	}
+
+out_all_pinned:
+	/*
+	 * We reach balance because all tasks are pinned at this level so
+	 * we can't migrate them. Let the imbalance flag set so parent level
+	 * can try to migrate them.
+	 */
 	schedstat_inc(sd, lb_balanced[idle]);
sd->nr_balance_failed = 0;
Hi Vincent,
On 07/10/2014 03:00 PM, Vincent Guittot wrote:
The imbalance flag can stay set even though there is no imbalance.

Let's assume that we have 3 tasks that run on a dual core / dual cluster system. We will have some idle load balances which are triggered during ticks. Unfortunately, the tick is also used to queue background work, so we can reach the situation where a short work item has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task onto the idle CPU. The waiting task is a worker thread that is pinned to a CPU, so an imbalance due to a pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU, but the imbalance flag will trigger useless active load balances between the idle CPU and the busy CPU.

We need to reset the imbalance flag as soon as we have reached a balanced state. If all tasks are pinned, we don't consider that a balanced state and leave the imbalance flag set.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d3c73122..a836198 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6615,10 +6615,8 @@ static int load_balance(int this_cpu, struct rq *this_rq, if (sd_parent) { int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) *group_imbalance = 1;
} else if (*group_imbalance)
*group_imbalance = 0;
}
/* All tasks on this runqueue were pinned by CPU affinity */
@@ -6629,7 +6627,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, env.loop_break = sched_nr_migrate_break; goto redo; }
goto out_balanced;
} }goto out_all_pinned;
@@ -6703,6 +6701,22 @@ static int load_balance(int this_cpu, struct rq *this_rq, goto out;
out_balanced:
- /*
* We reach balance although we may have faced some affinity
* constraints. Clear the imbalance flag if it was set.
*/
- if (sd_parent) {
int *group_imbalance = &sd_parent->groups->sgc->imbalance;
if (*group_imbalance)
*group_imbalance = 0;
- }
+out_all_pinned:
/*
* We reach balance because all tasks are pinned at this level so
* we can't migrate them. Let the imbalance flag set so parent level
* can try to migrate them.
*/
schedstat_inc(sd, lb_balanced[idle]);
sd->nr_balance_failed = 0;
This patch looks good to me.
Reviewed-by: Preeti U Murthy preeti@linux.vnet.ibm.com
Hi Peter, Vincent,
On 07/10/2014 02:44 PM, Vincent Guittot wrote:
On 9 July 2014 12:43, Peter Zijlstra peterz@infradead.org wrote:
On Wed, Jul 09, 2014 at 09:24:54AM +0530, Preeti U Murthy wrote:
[snip]
Continuing with the above explanation; when LBF_ALL_PINNED flag is set,and we jump to out_balanced, we clear the imbalance flag for the sched_group comprising of cpu0 and cpu1,although there is actually an imbalance. t2 could still be migrated to say cpu2/cpu3 (t2 has them in its cpus allowed mask) in another sched group when load balancing is done at the next sched domain level.
And this is where Vince is wrong; note how update_sg_lb_stats()/sg_imbalance() uses group->sgc->imbalance, but load_balance() sets: sd_parent->groups->sgc->imbalance, so explicitly one level up.
I forgot this behavior when studying preeti use case
So what we can do I suppose is clear 'group->sgc->imbalance' at out_balanced.
In any case, the entirely of this group imbalance crap is just that, crap. Its a terribly difficult situation and the current bits more or less fudge around some of the common cases. Also see the comment near sg_imbalanced(). Its not a solid and 'correct' anything. Its a bunch of hacks trying to deal with hard cases.
A 'good' solution would be prohibitively expensive I fear.
I have tried to summarized several use cases that have been discussed for this patch
The 1st use case is the one that i described in the commit message of this patch: If we have a sporadic imbalance that set the imbalance flag, we don't clear it after and it generates spurious and useless active load balance
Then preeti came with the following use case : we have a sched_domain made of CPU0 and CPU1 in 2 different sched_groups 2 tasks A and B are on CPU0, B can't run on CPU1, A is the running task. When CPU1's sched_group is doing load balance, the imbalance should be set. That's still happen with this patchset because the LBF_ALL_PINNED flag will be cleared thanks to task A.
Preeti also explained me the following use cases on irc:
If we have both tasks A and B that can't run on CPU1, the LBF_ALL_PINNED will stay set. As we can't do anything, we conclude that we are balanced, we go to out_balanced and we clear the imbalance flag. But we should not consider that as a balanced state but as a all tasks pinned state instead and we should let the imbalance flag set. If we now have 2 additional CPUs which are in the cpumask of task A and/or B at the parent sched_domain level , we should migrate one task in this group but this will not happen (with this patch) because the sched_group made of CPU0 and CPU1 is not overloaded (2 tasks for 2 CPUs) and the imbalance flag has been cleared as described previously.
Peter,
The above paragraph describes my concern, raised in the conversation below, with regard to clearing the imbalance flag at a given level of the sched domain in case of pinned tasks: https://lkml.org/lkml/2014/7/9/454.
You are right about iterating through all tasks including the current task during load balancing.
Thanks
Regards Preeti U Murthy
I'm going to send a new revision of the patchset with the correction
Vincent
On 06/30/2014 12:05 PM, Vincent Guittot wrote:
The imbalance flag can stay set even though there is no imbalance.

Let's assume that we have 3 tasks that run on a dual core / dual cluster system. We will have some idle load balances which are triggered during ticks. Unfortunately, the tick is also used to queue background work, so we can reach the situation where a short work item has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task onto the idle CPU. The waiting task is a worker thread that is pinned to a CPU, so an imbalance due to a pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU, but the imbalance flag will trigger useless active load balances between the idle CPU and the busy CPU.

We need to reset the imbalance flag as soon as we have reached a balanced state.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Acked-by: Rik van Riel riel@redhat.com
--
All rights reversed
On 07/08/2014 11:05 PM, Rik van Riel wrote:
On 06/30/2014 12:05 PM, Vincent Guittot wrote:
The imbalance flag can stay set even though there is no imbalance.

Let's assume that we have 3 tasks that run on a dual core / dual cluster system. We will have some idle load balances which are triggered during ticks. Unfortunately, the tick is also used to queue background work, so we can reach the situation where a short work item has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task onto the idle CPU. The waiting task is a worker thread that is pinned to a CPU, so an imbalance due to a pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU, but the imbalance flag will trigger useless active load balances between the idle CPU and the busy CPU.

We need to reset the imbalance flag as soon as we have reached a balanced state.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Acked-by: Rik van Riel riel@redhat.com
Never mind that, Preeti explained the failure mode in more detail on irc, and I am no longer convinced this change is a good idea.
--
All rights reversed
On Mon, Jun 30, 2014 at 06:05:32PM +0200, Vincent Guittot wrote:
+++ b/kernel/sched/fair.c @@ -6615,10 +6615,8 @@ more_balance:
In order to avoid labels being used as function names in the hunk headers, you can use:
.quiltrc:
QUILT_DIFF_OPTS="-F ^[[:alpha:]$_].*[^:]$"

.gitconfig:
[diff "default"]
	xfuncname = "^[[:alpha:]$_].*[^:]$"
On 9 July 2014 12:14, Peter Zijlstra peterz@infradead.org wrote:
On Mon, Jun 30, 2014 at 06:05:32PM +0200, Vincent Guittot wrote:
+++ b/kernel/sched/fair.c @@ -6615,10 +6615,8 @@ more_balance:
In order to avoid labels being used as functions; you can use:
.quiltrc: QUILT_DIFF_OPTS="-F ^[[:alpha:]$_].*[^:]$"
.gitconfig: [diff "default"] xfuncname = "^[[:alpha:]$_].*[^:]$"
Thanks, I'm going to add it to my .gitconfig.
I have tried to understand the meaning of the condition:
 (this_load <= load &&
  this_load + target_load(prev_cpu, idx) <= tl_per_task)
but I failed to find a use case that can take advantage of it, and I haven't found a clear description in the previous commits' logs. Furthermore, the comment of the condition refers to the task_hot function that was used before being replaced by the current condition:
/*
 * This domain has SD_WAKE_AFFINE and
 * p is cache cold in this domain, and
 * there is no bad imbalance.
 */

If we look more deeply at the condition below:
this_load + target_load(prev_cpu, idx) <= tl_per_task

When sync is clear, we have:
tl_per_task = runnable_load_avg / nr_running
this_load   = max(runnable_load_avg, cpuload[idx])
target_load = max(runnable_load_avg', cpuload'[idx])

In order to match the condition, it implies that runnable_load_avg' == 0 and nr_running <= 1. It also implies that runnable_load_avg == 0, because of the condition this_load <= load; but if this_load is null, balanced is already set and the test is redundant.

If sync is set, it's not as straightforward as above (especially if cgroups are involved), but the policy should be similar as we have removed a task that's going to sleep in order to get more accurate load and this_load values.

The current conclusion is that these additional conditions don't give any benefit, so we can remove them.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 30 ++++++------------------------
 1 file changed, 6 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c48dff..c6dba48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4241,7 +4241,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
 	int idx, this_cpu, prev_cpu;
-	unsigned long tl_per_task;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
@@ -4299,32 +4298,15 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		balanced = this_eff_load <= prev_eff_load;
 	} else
 		balanced = true;
-
-	/*
-	 * If the currently running task will sleep within
-	 * a reasonable amount of time then attract this newly
-	 * woken task:
-	 */
-	if (sync && balanced)
-		return 1;
-
 	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
-	tl_per_task = cpu_avg_load_per_task(this_cpu);
-	if (balanced ||
-	    (this_load <= load &&
-	     this_load + target_load(prev_cpu, idx) <= tl_per_task)) {
-		/*
-		 * This domain has SD_WAKE_AFFINE and
-		 * p is cache cold in this domain, and
-		 * there is no bad imbalance.
-		 */
-		schedstat_inc(sd, ttwu_move_affine);
-		schedstat_inc(p, se.statistics.nr_wakeups_affine);
+	if (!balanced)
+		return 0;
-		return 1;
-	}
-	return 0;
+	schedstat_inc(sd, ttwu_move_affine);
+	schedstat_inc(p, se.statistics.nr_wakeups_affine);
+
+	return 1;
 }
/*
On 06/30/2014 12:05 PM, Vincent Guittot wrote:
I have tried to understand the meaning of the condition:
 (this_load <= load &&
  this_load + target_load(prev_cpu, idx) <= tl_per_task)
but I failed to find a use case that can take advantage of it, and I haven't found a clear description in the previous commits' logs. Furthermore, the comment of the condition refers to the task_hot function that was used before being replaced by the current condition:
/*
 * This domain has SD_WAKE_AFFINE and
 * p is cache cold in this domain, and
 * there is no bad imbalance.
 */

If we look more deeply at the condition below:
this_load + target_load(prev_cpu, idx) <= tl_per_task

When sync is clear, we have:
tl_per_task = runnable_load_avg / nr_running
this_load   = max(runnable_load_avg, cpuload[idx])
target_load = max(runnable_load_avg', cpuload'[idx])

In order to match the condition, it implies that runnable_load_avg' == 0 and nr_running <= 1. It also implies that runnable_load_avg == 0, because of the condition this_load <= load; but if this_load is null, balanced is already set and the test is redundant.

If sync is set, it's not as straightforward as above (especially if cgroups are involved), but the policy should be similar as we have removed a task that's going to sleep in order to get more accurate load and this_load values.

The current conclusion is that these additional conditions don't give any benefit, so we can remove them.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Acked-by: Rik van Riel riel@redhat.com
--
All rights reversed
The computation of avg_load and avg_load_per_task should only take into account the number of cfs tasks. The non-cfs tasks are already taken into account by decreasing the cpu's capacity (cpu_power), and they will be tracked in the CPU's utilization (group_utilization) added by the next patches.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6dba48..148b277 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4051,7 +4051,7 @@ static unsigned long capacity_of(int cpu)
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
+	unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running);
 	unsigned long load_avg = rq->cfs.runnable_load_avg;

 	if (nr_running)
@@ -5865,7 +5865,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);

 		sgs->group_load += load;
-		sgs->sum_nr_running += rq->nr_running;
+		sgs->sum_nr_running += rq->cfs.h_nr_running;
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
 		sgs->nr_preferred_running += rq->nr_preferred_running;
On 06/30/2014 12:05 PM, Vincent Guittot wrote:
The computation of avg_load and avg_load_per_task should only take into account the number of cfs tasks. The non-cfs tasks are already taken into account by decreasing the cpu's capacity (cpu_power), and they will be tracked in the CPU's utilization (group_utilization) added by the next patches.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Acked-by: Rik van Riel riel@redhat.com
--
All rights reversed
power_orig is only changed for systems with a SMT sched_domain level in order to reflect the lower capacity of CPUs. Heterogeneous systems also have to reflect an original capacity that is different from the default value.

Create a more generic function arch_scale_cpu_power that can also be used by non-SMT platforms to set power_orig.
The weak behavior of arch_scale_cpu_power is the previous SMT one in order to keep backward compatibility in the use of power_orig.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 148b277..f0bba5f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5614,6 +5614,20 @@ unsigned long __weak arch_scale_smt_capacity(struct sched_domain *sd, int cpu)
 	return default_scale_smt_capacity(sd, cpu);
 }

+unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+{
+	unsigned long weight = sd->span_weight;
+
+	if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
+		if (sched_feat(ARCH_CAPACITY))
+			return arch_scale_smt_capacity(sd, cpu);
+		else
+			return default_scale_smt_capacity(sd, cpu);
+	}
+
+	return SCHED_CAPACITY_SCALE;
+}
+
 static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -5650,18 +5664,12 @@ static unsigned long scale_rt_capacity(int cpu)

 static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 {
-	unsigned long weight = sd->span_weight;
 	unsigned long capacity = SCHED_CAPACITY_SCALE;
 	struct sched_group *sdg = sd->groups;

-	if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
-		if (sched_feat(ARCH_CAPACITY))
-			capacity *= arch_scale_smt_capacity(sd, cpu);
-		else
-			capacity *= default_scale_smt_capacity(sd, cpu);
+	capacity *= arch_scale_cpu_capacity(sd, cpu);

-		capacity >>= SCHED_CAPACITY_SHIFT;
-	}
+	capacity >>= SCHED_CAPACITY_SHIFT;
sdg->sgc->capacity_orig = capacity;
On 06/30/2014 12:05 PM, Vincent Guittot wrote:
power_orig is only changed for systems with a SMT sched_domain level in order to reflect the lower capacity of CPUs. Heterogeneous systems also have to reflect an original capacity that is different from the default value.

Create a more generic function arch_scale_cpu_power that can also be used by non-SMT platforms to set power_orig.
The weak behavior of arch_scale_cpu_power is the previous SMT one in order to keep backward compatibility in the use of power_orig.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Acked-by: Rik van Riel riel@redhat.com
--
All rights reversed
On Mon, Jun 30, 2014 at 06:05:35PM +0200, Vincent Guittot wrote:
power_orig is only changed for systems with a SMT sched_domain level in order to reflect the lower capacity of CPUs. Heterogeneous systems also have to reflect an original capacity that is different from the default value.

Create a more generic function arch_scale_cpu_power that can also be used by non-SMT platforms to set power_orig.
The weak behavior of arch_scale_cpu_power is the previous SMT one in order to keep backward compatibility in the use of power_orig.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 148b277..f0bba5f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5614,6 +5614,20 @@ unsigned long __weak arch_scale_smt_capacity(struct sched_domain *sd, int cpu) return default_scale_smt_capacity(sd, cpu); } +unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) +{
- unsigned long weight = sd->span_weight;
- if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
if (sched_feat(ARCH_CAPACITY))
return arch_scale_smt_capacity(sd, cpu);
else
return default_scale_smt_capacity(sd, cpu);
- }
- return SCHED_CAPACITY_SCALE;
+}
The only caller of arch_scale_smt_capacity is now an arch_ function itself; which makes it entirely redundant, no?
Also, sched_feat() and default_scale_smt_capacity() aren't available to arch implementations of this function and can thus not retain semantics.
Seeing how we currently don't have any arch_scale_smt_capacity implementations other than the default we can easily replace it entirely with something like:
unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
		return sd->smt_gain / sd->span_weight;

	return SCHED_CAPACITY_SCALE;
}
Hmm?
On 9 July 2014 12:57, Peter Zijlstra peterz@infradead.org wrote:
On Mon, Jun 30, 2014 at 06:05:35PM +0200, Vincent Guittot wrote:
power_orig is only changed for systems with a SMT sched_domain level in order to reflect the lower capacity of CPUs. Heterogeneous systems also have to reflect an original capacity that is different from the default value.

Create a more generic function arch_scale_cpu_power that can also be used by non-SMT platforms to set power_orig.
The weak behavior of arch_scale_cpu_power is the previous SMT one in order to keep backward compatibility in the use of power_orig.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 148b277..f0bba5f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5614,6 +5614,20 @@ unsigned long __weak arch_scale_smt_capacity(struct sched_domain *sd, int cpu) return default_scale_smt_capacity(sd, cpu); }
+unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) +{
unsigned long weight = sd->span_weight;
if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
if (sched_feat(ARCH_CAPACITY))
return arch_scale_smt_capacity(sd, cpu);
else
return default_scale_smt_capacity(sd, cpu);
}
return SCHED_CAPACITY_SCALE;
+}
The only caller of arch_scale_smt_capacity is now an arch_ function itself; which makes it entirely redundant, no?
Also, sched_feat() and default_scale_smt_capability() aren't available to arch implementations of this function and can thus not retain semantics.
Seeing how we currently don't have any arch_scale_smt_capacity implementations other than the default we can easily replace it entirely with something like:
unsigned long __weak arch_scale_cpu_capacity() { if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1)) return sd->smt_gain / sd->span_weight;
return SCHED_CAPACITY_SCALE;
}
Hmm?
That's fair. I'm going to clean it up and remove the unused function.
Use the new arch_scale_cpu_power in order to reflect the original capacity of a CPU instead of arch_scale_freq_power, which is more about scaling the capacity with the current frequency.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 arch/arm/kernel/topology.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index d42a7db..2310bfb 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
  */
 static DEFINE_PER_CPU(unsigned long, cpu_scale);
 
-unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
 	return per_cpu(cpu_scale, cpu);
 }
@@ -166,7 +166,7 @@ static void update_cpu_capacity(unsigned int cpu)
 	set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
 
 	printk(KERN_INFO "CPU%u: update cpu_capacity %lu\n",
-		cpu, arch_scale_freq_capacity(NULL, cpu));
+		cpu, arch_scale_cpu_capacity(NULL, cpu));
 }
 
 #else
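[Editor's note] For readers less familiar with the ARM code: the per-cpu cpu_scale returned above is what ends up as a CPU's original capacity. A purely illustrative sketch of how an arch could seed it for a big.LITTLE-style system follows; cpu_is_big() and the numeric values are made up for the example, while the real ARM code derives the scale from DT data via set_capacity_scale().

/*
 * Illustrative sketch only: seeding per-cpu cpu_scale for a heterogeneous
 * system. cpu_is_big() and the value 430 are invented for this example;
 * SCHED_CAPACITY_SCALE (1024) is the capacity of the biggest CPU.
 */
static void __init init_cpu_scale_sketch(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		per_cpu(cpu_scale, cpu) = cpu_is_big(cpu) ?
				SCHED_CAPACITY_SCALE :	/* big core: full capacity */
				430;			/* LITTLE core: ~42% of a big core */
}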
On 06/30/2014 12:05 PM, Vincent Guittot wrote:
Use the new arch_scale_cpu_power in order to reflect the original capacity of a CPU instead of arch_scale_freq_power which is more linked to a scaling of the capacity linked to the frequency.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Acked-by: Rik van Riel riel@redhat.com
--
All rights reversed
On Mon, Jun 30, 2014 at 6:05 PM, Vincent Guittot vincent.guittot@linaro.org wrote:
Use the new arch_scale_cpu_power in order to reflect the original capacity of
s/arch_scale_cpu_power/arch_scale_cpu_capacity and similar renames in the commit logs across the entire patchset to take into account all the renames in the code.
a CPU instead of arch_scale_freq_power which is more linked to a scaling of the capacity linked to the frequency.
here too...
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
arch/arm/kernel/topology.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index d42a7db..2310bfb 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -42,7 +42,7 @@ */ static DEFINE_PER_CPU(unsigned long, cpu_scale);
-unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu) +unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu) { return per_cpu(cpu_scale, cpu); } @@ -166,7 +166,7 @@ static void update_cpu_capacity(unsigned int cpu) set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
printk(KERN_INFO "CPU%u: update cpu_capacity %lu\n",
cpu, arch_scale_freq_capacity(NULL, cpu));
cpu, arch_scale_cpu_capacity(NULL, cpu));
}
#else
1.9.1
On 9 July 2014 09:49, Amit Kucheria amit.kucheria@linaro.org wrote:
On Mon, Jun 30, 2014 at 6:05 PM, Vincent Guittot vincent.guittot@linaro.org wrote:
Use the new arch_scale_cpu_power in order to reflect the original capacity of
s/arch_scale_cpu_power/arch_scale_cpu_capacity and similar renames in the commit logs across the entire patchset to take into account all the renames in the code.
Good catch, the commit log slipped through the find and replace. I will change the other commit messages as well.
This new field cpu_power_orig reflects the available capacity of a CPU, unlike cpu_power which reflects the current capacity that can be altered by frequency and rt tasks.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/core.c  | 2 +-
 kernel/sched/fair.c  | 1 +
 kernel/sched/sched.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 54f5722..92dd6d1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6964,7 +6964,7 @@ void __init sched_init(void)
 #ifdef CONFIG_SMP
 		rq->sd = NULL;
 		rq->rd = NULL;
-		rq->cpu_capacity = SCHED_CAPACITY_SCALE;
+		rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
 		rq->post_schedule = 0;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f0bba5f..7dff578 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5671,6 +5671,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 
 	capacity >>= SCHED_CAPACITY_SHIFT;
 
+	cpu_rq(cpu)->cpu_capacity_orig = capacity;
 	sdg->sgc->capacity_orig = capacity;
 
 	if (sched_feat(ARCH_CAPACITY))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2f86361..7b311de 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -568,6 +568,7 @@ struct rq {
 	struct sched_domain *sd;
 
 	unsigned long cpu_capacity;
+	unsigned long cpu_capacity_orig;
 
 	unsigned char idle_balance;
 	/* For active balancing */
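[Editor's note] Later patches in this series read the new field back through an accessor (capacity_orig_of() shows up in the active-balance patch further down). A minimal sketch of such a helper, mirroring the existing capacity_of(), would be:

/* Minimal sketch: accessor for the field added above, mirroring capacity_of(). */
static inline unsigned long capacity_orig_of(int cpu)
{
	return cpu_rq(cpu)->cpu_capacity_orig;
}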
On 06/30/2014 12:05 PM, Vincent Guittot wrote:
This new field cpu_power_orig reflects the available capacity of a CPUs unlike the cpu_power which reflects the current capacity that can be altered by frequency and rt tasks.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Acked-by: Rik van Riel riel@redhat.com
--
All rights reversed
On Mon, Jun 30, 2014 at 6:05 PM, Vincent Guittot vincent.guittot@linaro.org wrote:
This new field cpu_power_orig reflects the available capacity of a CPUs unlike the cpu_power which reflects the current capacity that can be altered by frequency and rt tasks.
s/cpu_power/cpu_capacity
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/core.c | 2 +- kernel/sched/fair.c | 1 + kernel/sched/sched.h | 1 + 3 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 54f5722..92dd6d1 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6964,7 +6964,7 @@ void __init sched_init(void) #ifdef CONFIG_SMP rq->sd = NULL; rq->rd = NULL;
rq->cpu_capacity = SCHED_CAPACITY_SCALE;
rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE; rq->post_schedule = 0; rq->active_balance = 0; rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f0bba5f..7dff578 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5671,6 +5671,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
capacity >>= SCHED_CAPACITY_SHIFT;
cpu_rq(cpu)->cpu_capacity_orig = capacity; sdg->sgc->capacity_orig = capacity; if (sched_feat(ARCH_CAPACITY))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 2f86361..7b311de 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -568,6 +568,7 @@ struct rq { struct sched_domain *sd;
unsigned long cpu_capacity;
unsigned long cpu_capacity_orig; unsigned char idle_balance; /* For active balancing */
-- 1.9.1
Currently the task always wakes affine on this_cpu if the latter is idle. Before waking up the task on this_cpu, we check that this_cpu capacity is not significantly reduced because of RT tasks or irq activity.
Use cases where the number of irqs and/or the time spent under irq is important will benefit from this because the task that is woken up by an irq or softirq will not use the same CPU as the irq (and softirq) but an idle one which shares its cache.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7dff578..a23c938 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4240,6 +4240,7 @@ static int wake_wide(struct task_struct *p)
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
+	s64 this_eff_load, prev_eff_load;
 	int idx, this_cpu, prev_cpu;
 	struct task_group *tg;
 	unsigned long weight;
@@ -4283,21 +4284,23 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	 * Otherwise check if either cpus are near enough in load to allow this
 	 * task to be woken on this_cpu.
 	 */
-	if (this_load > 0) {
-		s64 this_eff_load, prev_eff_load;
+	this_eff_load = 100;
+	this_eff_load *= capacity_of(prev_cpu);
+
+	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+	prev_eff_load *= capacity_of(this_cpu);
 
-		this_eff_load = 100;
-		this_eff_load *= capacity_of(prev_cpu);
+	if (this_load > 0) {
 		this_eff_load *= this_load +
 			effective_load(tg, this_cpu, weight, weight);
 
-		prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
-		prev_eff_load *= capacity_of(this_cpu);
 		prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
+	} else if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
+		this_eff_load = 0;
+	}
+
+	balanced = this_eff_load <= prev_eff_load;
 
-		balanced = this_eff_load <= prev_eff_load;
-	} else
-		balanced = true;
 	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
 
 	if (!balanced)
On 06/30/2014 12:05 PM, Vincent Guittot wrote:
Currently the task always wakes affine on this_cpu if the latter is idle. Before waking up the task on this_cpu, we check that this_cpu capacity is not significantly reduced because of RT tasks or irq activity.
Use case where the number of irq and/or the time spent under irq is important will take benefit of this because the task that is woken up by irq or softirq will not use the same CPU than irq (and softirq) but a idle one which share its cache.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Acked-by: Rik van Riel riel@redhat.com
--
All rights reversed
On Mon, Jun 30, 2014 at 06:05:38PM +0200, Vincent Guittot wrote:
Currently the task always wakes affine on this_cpu if the latter is idle. Before waking up the task on this_cpu, we check that this_cpu capacity is not significantly reduced because of RT tasks or irq activity.
Use case where the number of irq and/or the time spent under irq is important will take benefit of this because the task that is woken up by irq or softirq will not use the same CPU than irq (and softirq) but a idle one which share its cache.
The above doesn't seem to explain:
+	} else if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
+		this_eff_load = 0;
+	}
+
+	balanced = this_eff_load <= prev_eff_load;
this. Why would you unconditionally allow wake_affine across cache domains?
On 10 July 2014 13:06, Peter Zijlstra peterz@infradead.org wrote:
The current policy is to use this_cpu if this_load <= 0. I want to keep the current wake affine policy for all sched_domains that don't share cache, so if the involved CPUs don't share a cache, I clear this_eff_load to force balanced to be true. But when CPUs share their cache, we not only look at idleness but also at the available capacity of the prev and local CPUs.
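[Editor's note] To make the intended policy concrete, here is the resulting decision condensed into a sketch; the wrapper function is hypothetical, but the logic follows the hunk quoted above.

/*
 * Sketch only: condensed form of the wake_affine() decision described above.
 */
static int wake_affine_decision(struct sched_domain *sd, struct task_group *tg,
				int this_cpu, int prev_cpu,
				s64 this_load, s64 load, unsigned long weight)
{
	s64 this_eff_load = 100 * capacity_of(prev_cpu);
	s64 prev_eff_load = (100 + (sd->imbalance_pct - 100) / 2) *
						capacity_of(this_cpu);

	if (this_load > 0) {
		/* this_cpu busy: compare loads scaled by each CPU's capacity */
		this_eff_load *= this_load +
				 effective_load(tg, this_cpu, weight, weight);
		prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
	} else if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
		/* this_cpu idle, no shared cache: keep the old "always balanced" */
		this_eff_load = 0;
	}

	/*
	 * With a shared cache and this_cpu idle, the capacity comparison still
	 * applies, so a CPU whose capacity is eaten by RT/irq is not picked
	 * blindly.
	 */
	return this_eff_load <= prev_eff_load;
}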
If the CPU is used for handling a lot of IRQs, trigger a load balance to check whether it's worth moving its tasks to another CPU that has more capacity.
As a side note, this will not generate more spurious ilb because we already trigger an ilb if there is more than 1 busy cpu. If this cpu is the only one that has a task, we will trigger the ilb once to migrate the task.
The nohz_kick_needed function has been cleaned up a bit while adding the new test.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 58 +++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 41 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a23c938..742ad88 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5944,6 +5944,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 			return true;
 	}
 
+	/*
+	 * The group capacity is reduced probably because of activity from other
+	 * sched class or interrupts which use part of the available capacity
+	 */
+	if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
+				env->sd->imbalance_pct))
+		return true;
+
 	return false;
 }
 
@@ -6421,13 +6429,24 @@ static int need_active_balance(struct lb_env *env)
 	struct sched_domain *sd = env->sd;
 
 	if (env->idle == CPU_NEWLY_IDLE) {
+		int src_cpu = env->src_cpu;
 
 		/*
 		 * ASYM_PACKING needs to force migrate tasks from busy but
 		 * higher numbered CPUs in order to pack all tasks in the
 		 * lowest numbered CPUs.
 		 */
-		if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
+		if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
+			return 1;
+
+		/*
+		 * If the CPUs share their cache and the src_cpu's capacity is
+		 * reduced because of other sched_class or IRQs, we trig an
+		 * active balance to move the task
+		 */
+		if ((sd->flags & SD_SHARE_PKG_RESOURCES)
+			&& ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
+					sd->imbalance_pct)))
 			return 1;
 	}
 
@@ -6529,6 +6548,8 @@ redo:
 
 	schedstat_add(sd, lb_imbalance[idle], env.imbalance);
 
+	env.src_cpu = busiest->cpu;
+
 	ld_moved = 0;
 	if (busiest->nr_running > 1) {
 		/*
@@ -6538,7 +6559,6 @@ redo:
 		 * correctly treated as an imbalance.
 		 */
 		env.flags |= LBF_ALL_PINNED;
-		env.src_cpu = busiest->cpu;
 		env.src_rq = busiest;
 		env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
@@ -7233,9 +7253,10 @@ static inline int nohz_kick_needed(struct rq *rq)
 	struct sched_domain *sd;
 	struct sched_group_capacity *sgc;
 	int nr_busy, cpu = rq->cpu;
+	bool kick = false;
 
 	if (unlikely(rq->idle_balance))
-		return 0;
+		return false;
 
 	/*
 	 * We may be recently in ticked or tickless idle mode. At the first
@@ -7249,38 +7270,41 @@ static inline int nohz_kick_needed(struct rq *rq)
 	 * balancing.
 	 */
 	if (likely(!atomic_read(&nohz.nr_cpus)))
-		return 0;
+		return false;
 
 	if (time_before(now, nohz.next_balance))
-		return 0;
+		return false;
 
 	if (rq->nr_running >= 2)
-		goto need_kick;
+		return true;
 
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_busy, cpu));
-
 	if (sd) {
 		sgc = sd->groups->sgc;
 		nr_busy = atomic_read(&sgc->nr_busy_cpus);
 
-		if (nr_busy > 1)
-			goto need_kick_unlock;
+		if (nr_busy > 1) {
+			kick = true;
+			goto unlock;
+		}
+
+		if ((rq->cfs.h_nr_running >= 1)
+			&& ((rq->cpu_capacity * sd->imbalance_pct) <
+				(rq->cpu_capacity_orig * 100))) {
+			kick = true;
+			goto unlock;
+		}
 	}
 
 	sd = rcu_dereference(per_cpu(sd_asym, cpu));
-
 	if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
 				  sched_domain_span(sd)) < cpu))
-		goto need_kick_unlock;
+		kick = true;
 
+unlock:
 	rcu_read_unlock();
-	return 0;
-
-need_kick_unlock:
-	rcu_read_unlock();
-need_kick:
-	return 1;
+	return kick;
 }
 #else
 static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
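[Editor's note] The same capacity-pressure test appears twice in the hunk above (in need_active_balance() and in nohz_kick_needed()). Written as a helper it would look roughly like the sketch below; the helper itself is hypothetical, and assuming the common busy imbalance_pct of 125 it fires once rt/irq activity has consumed more than about 20% of the CPU's original capacity.

/*
 * Hypothetical helper sketching the test used above: has this CPU's usable
 * capacity dropped noticeably below its original capacity (e.g. because of
 * rt tasks or irq time)? With imbalance_pct == 125 this means more than
 * roughly 20% of the original capacity is gone.
 */
static inline bool cpu_capacity_reduced(int cpu, unsigned int imbalance_pct)
{
	return capacity_of(cpu) * imbalance_pct <
	       capacity_orig_of(cpu) * 100;
}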
On Mon, Jun 30, 2014 at 06:05:39PM +0200, Vincent Guittot wrote:
/*
* If the CPUs share their cache and the src_cpu's capacity is
* reduced because of other sched_class or IRQs, we trig an
* active balance to move the task
*/
if ((sd->flags & SD_SHARE_PKG_RESOURCES)
&& ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
sd->imbalance_pct))) return 1;
Why is this tied to shared caches?
On 10 July 2014 13:18, Peter Zijlstra peterz@infradead.org wrote:
It's just to limit the change of the policy to a level where it can bring benefit without performance regression. I'm not sure that we can do that at any level without risk.
On Thu, Jul 10, 2014 at 04:03:51PM +0200, Vincent Guittot wrote:
Similar to the other change; so both details _should_ have been in the changelogs etc.
In any case, its feels rather arbitrary to me. What about machines where there's no cache sharing at all (the traditional SMP systems). This thing you're trying to do still seems to make sense there.
On 11 July 2014 16:51, Peter Zijlstra peterz@infradead.org wrote:
Similar to the other change; so both details _should_ have been in the changelogs etc..
I'm going to add details in the v4.
In any case, its feels rather arbitrary to me. What about machines where there's no cache sharing at all (the traditional SMP systems). This thing you're trying to do still seems to make sense there.
OK, I thought that traditional SMP systems have this flag set at core level. I mean ARM platforms have the flag for CPUs in the same cluster (which includes current ARM SMP systems) and the corei7 of my laptop has the flag at the core level.
On Fri, Jul 11, 2014 at 05:17:44PM +0200, Vincent Guittot wrote:
In any case, its feels rather arbitrary to me. What about machines where there's no cache sharing at all (the traditional SMP systems). This thing you're trying to do still seems to make sense there.
ok, I thought that traditional SMP systems have this flag set at core level.
Yeah, with 1 core, so it's effectively disabled.
I mean ARM platforms have the flag for CPUs in the same cluster (which include current ARM SMP system) and the corei7 of my laptop has the flag at the cores level.
So I can see 'small' parts reducing shared caches in order to improve idle performance.
The point being that LLC seems a somewhat arbitrary measure for this.
Can we try and see what happens if you remove the limit. Its always best to try the simpler things first and only make it more complex if we have to.
On 14 July 2014 15:51, Peter Zijlstra peterz@infradead.org wrote:
OK. I will remove the condition in the next version.
On Mon, Jun 30, 2014 at 06:05:39PM +0200, Vincent Guittot wrote:
if ((sd->flags & SD_SHARE_PKG_RESOURCES)
&& ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
sd->imbalance_pct)))
if ((rq->cfs.h_nr_running >= 1)
&& ((rq->cpu_capacity * sd->imbalance_pct) <
(rq->cpu_capacity_orig * 100))) {
That's just weird indentation and operators should be at the end not at the beginning of a line-break.
On 10 July 2014 13:24, Peter Zijlstra peterz@infradead.org wrote:
OK, I'm going to fix it.
On Mon, Jun 30, 2014 at 06:05:39PM +0200, Vincent Guittot wrote:
You 'forgot' to update the comment that goes with nohz_kick_needed().
@@ -7233,9 +7253,10 @@ static inline int nohz_kick_needed(struct rq *rq) struct sched_domain *sd; struct sched_group_capacity *sgc; int nr_busy, cpu = rq->cpu;
- bool kick = false;
if (unlikely(rq->idle_balance))
return false;
/*
- We may be recently in ticked or tickless idle mode. At the first
@@ -7249,38 +7270,41 @@ static inline int nohz_kick_needed(struct rq *rq) * balancing. */ if (likely(!atomic_read(&nohz.nr_cpus)))
return false;
if (time_before(now, nohz.next_balance))
return false;
if (rq->nr_running >= 2)
return true;
rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_busy, cpu)); if (sd) { sgc = sd->groups->sgc; nr_busy = atomic_read(&sgc->nr_busy_cpus);
if (nr_busy > 1) {
kick = true;
goto unlock;
}
if ((rq->cfs.h_nr_running >= 1)
&& ((rq->cpu_capacity * sd->imbalance_pct) <
(rq->cpu_capacity_orig * 100))) {
kick = true;
goto unlock;
}
Again, why only for shared caches?
} sd = rcu_dereference(per_cpu(sd_asym, cpu)); if (sd && (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu))
kick = true;
+unlock: rcu_read_unlock();
- return kick;
}
This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
We are going to use runnable_avg_sum and runnable_avg_period in order to get the utilization of the CPU. This statistic includes all tasks that run on the CPU and not only CFS tasks.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/fair.c  | 13 ++++++-------
 kernel/sched/sched.h | 4 ++--
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 742ad88..bdc5cb9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2428,19 +2428,12 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 		se->avg.load_avg_contrib >>= NICE_0_SHIFT;
 	}
 }
-
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
-{
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
-	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
-}
 #else /* CONFIG_FAIR_GROUP_SCHED */
 static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 						 int force_update) {}
 static inline void __update_tg_runnable_avg(struct sched_avg *sa,
 						  struct cfs_rq *cfs_rq) {}
 static inline void __update_group_entity_contrib(struct sched_entity *se) {}
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 static inline void __update_task_entity_contrib(struct sched_entity *se)
@@ -2539,6 +2532,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 		__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
 }
 
+static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
+{
+	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
+	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
+}
+
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 						  struct sched_entity *se,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7b311de..fb53166 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -542,8 +542,6 @@ struct rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
 	struct list_head leaf_cfs_rq_list;
-
-	struct sched_avg avg;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 	/*
@@ -634,6 +632,8 @@ struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+
+	struct sched_avg avg;
 };
 
 static inline int cpu_of(struct rq *rq)
On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
We are going to use runnable_avg_sum and runnable_avg_period in order to get the utilization of the CPU. This statistic includes all tasks that run the CPU and not only CFS tasks.
But this rq->avg is not the one that is migration aware, right? So why use this?
We already compensate cpu_capacity for !fair tasks, so I don't see why we can't use the migration aware one (and kill this one as Yuyang keeps proposing) and compensate with the capacity factor.
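[Editor's note] For reference, the compensation referred to here happens in update_cpu_capacity(); very roughly, and simplified from the code touched earlier in this series, it looks like the sketch below (not the exact kernel code).

/*
 * Rough sketch only: how rq->cpu_capacity is already compensated for
 * frequency scaling and for non-fair activity. scale_rt_capacity() is the
 * existing helper returning the fraction of time left after rt/irq accounting.
 */
static void update_cpu_capacity_sketch(struct sched_domain *sd, int cpu)
{
	unsigned long capacity = arch_scale_cpu_capacity(sd, cpu);	/* original capacity */

	capacity *= arch_scale_freq_capacity(sd, cpu);			/* frequency scaling, when implemented */
	capacity >>= SCHED_CAPACITY_SHIFT;

	capacity *= scale_rt_capacity(cpu);				/* remove rt/irq pressure */
	capacity >>= SCHED_CAPACITY_SHIFT;

	cpu_rq(cpu)->cpu_capacity = capacity;				/* what the load balancer uses */
}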
On 10 July 2014 15:16, Peter Zijlstra peterz@infradead.org wrote:
On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
We are going to use runnable_avg_sum and runnable_avg_period in order to get the utilization of the CPU. This statistic includes all tasks that run the CPU and not only CFS tasks.
But this rq->avg is not the one that is migration aware, right? So why use this?
Yes, it's not the one that is migration aware
We already compensate cpu_capacity for !fair tasks, so I don't see why we can't use the migration aware one (and kill this one as Yuyang keeps proposing) and compensate with the capacity factor.
The 1st point is that cpu_capacity is compensated by both !fair_tasks and frequency scaling, and we should not take frequency scaling into account for detecting overload.
What we have now is the weighted load avg, that is, the sum of the weighted load of entities on the run queue. This is not usable to detect overload because of the weight. An unweighted version of this figure would be more useful, but it's not as accurate as the one I use IMHO. The example that has been discussed during the review of the last version has shown some limitations.
With the following schedule pattern from Morten's example
   | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
A:   run    rq    run   ----------- sleeping -------------   run
B:   rq     run   rq    run   ---- sleeping -------------    rq
The scheduler will see the following values:
- Task A unweighted load value is 47%
- Task B unweighted load is 60%
- The maximum sum of unweighted load is 104%
- rq->avg load is 60%
And the real CPU load is 50%
So we will have opposite decisions depending on which value is used: the rq->avg or the sum of unweighted load.
The sum of unweighted load has the main advantage of showing immediately what the relative impact of adding/removing a task will be. In the example, we can see that removing task A or B will remove around half the CPU load, but it's not so good at giving the current utilization of the CPU.
Vincent
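[Editor's note] The "unweighted" figure Vincent describes above would, very roughly, drop the priority weight from the per-entity contribution; a sketch (hypothetical helper, not existing kernel code):

/*
 * Hypothetical sketch: an unweighted per-entity load contribution, i.e. the
 * runnable fraction scaled by a fixed unit instead of the task's weight.
 */
static inline unsigned long unweighted_load_contrib(struct sched_avg *sa)
{
	return (sa->runnable_avg_sum * SCHED_LOAD_SCALE) /
	       (sa->runnable_avg_period + 1);
}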
On Fri, Jul 11, 2014 at 09:51:06AM +0200, Vincent Guittot wrote:
On 10 July 2014 15:16, Peter Zijlstra peterz@infradead.org wrote:
On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
We are going to use runnable_avg_sum and runnable_avg_period in order to get the utilization of the CPU. This statistic includes all tasks that run the CPU and not only CFS tasks.
But this rq->avg is not the one that is migration aware, right? So why use this?
Yes, it's not the one that is migration aware
We already compensate cpu_capacity for !fair tasks, so I don't see why we can't use the migration aware one (and kill this one as Yuyang keeps proposing) and compensate with the capacity factor.
The 1st point is that cpu_capacity is compensated by both !fair_tasks and frequency scaling and we should not take into account frequency scaling for detecting overload
dvfs could help? Also we should not use arch_scale_freq_capacity() for things like cpufreq-ondemand etc. Because for those the compute capacity is still the max. We should only use it when we hard limit things.
What we have now is the the weighted load avg that is the sum of the weight load of entities on the run queue. This is not usable to detect overload because of the weight. An unweighted version of this figure would be more usefull but it's not as accurate as the one I use IMHO. The example that has been discussed during the review of the last version has shown some limitations
With the following schedule pattern from Morten's example
| 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | A: run rq run ----------- sleeping ------------- run B: rq run rq run ---- sleeping ------------- rq
The scheduler will see the following values: Task A unweighted load value is 47% Task B unweight load is 60% The maximum Sum of unweighted load is 104% rq->avg load is 60%
And the real CPU load is 50%
So we will have opposite decision depending of the used values: the rq->avg or the Sum of unweighted load
The sum of unweighted load has the main advantage of showing immediately what will be the relative impact of adding/removing a task. In the example, we can see that removing task A or B will remove around half the CPU load but it's not so good for giving the current utilization of the CPU
In that same discussion ISTR a suggestion about adding avg_running time, as opposed to the current avg_runnable. The sum of avg_running should be much more accurate, and still react correctly to migrations.
On 11 July 2014 17:13, Peter Zijlstra peterz@infradead.org wrote:
On Fri, Jul 11, 2014 at 09:51:06AM +0200, Vincent Guittot wrote:
On 10 July 2014 15:16, Peter Zijlstra peterz@infradead.org wrote:
On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
We are going to use runnable_avg_sum and runnable_avg_period in order to get the utilization of the CPU. This statistic includes all tasks that run the CPU and not only CFS tasks.
But this rq->avg is not the one that is migration aware, right? So why use this?
Yes, it's not the one that is migration aware
We already compensate cpu_capacity for !fair tasks, so I don't see why we can't use the migration aware one (and kill this one as Yuyang keeps proposing) and compensate with the capacity factor.
The 1st point is that cpu_capacity is compensated by both !fair_tasks and frequency scaling and we should not take into account frequency scaling for detecting overload
dvfs could help? Also we should not use arch_scale_freq_capacity() for things like cpufreq-ondemand etc. Because for those the compute capacity is still the max. We should only use it when we hard limit things.
In my mind, arch_scale_cpu_freq was intended to scale the capacity of the CPU according to the current dvfs operating point. As it's no longer used anywhere now that we have arch_scale_cpu, we could probably remove it... and see when it becomes used.
What we have now is the the weighted load avg that is the sum of the weight load of entities on the run queue. This is not usable to detect overload because of the weight. An unweighted version of this figure would be more usefull but it's not as accurate as the one I use IMHO. The example that has been discussed during the review of the last version has shown some limitations
With the following schedule pattern from Morten's example
| 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | A: run rq run ----------- sleeping ------------- run B: rq run rq run ---- sleeping ------------- rq
The scheduler will see the following values: Task A unweighted load value is 47% Task B unweight load is 60% The maximum Sum of unweighted load is 104% rq->avg load is 60%
And the real CPU load is 50%
So we will have opposite decision depending of the used values: the rq->avg or the Sum of unweighted load
The sum of unweighted load has the main advantage of showing immediately what will be the relative impact of adding/removing a task. In the example, we can see that removing task A or B will remove around half the CPU load but it's not so good for giving the current utilization of the CPU
In that same discussion ISTR a suggestion about adding avg_running time, as opposed to the current avg_runnable. The sum of avg_running should be much more accurate, and still react correctly to migrations.
I haven't looked in detail but I agree that avg_running would be much more accurate than avg_runnable and should probably fit the requirement. Does it mean that we could re-add the avg_running (or something similar) that disappeared during the review of the load avg tracking patchset?
On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
In my mind, arch_scale_cpu_freq was intend to scale the capacity of the CPU according to the current dvfs operating point. As it's no more use anywhere now that we have arch_scale_cpu, we could probably remove it .. and see when it will become used.
I probably should have written comments when I wrote that code, but it was meant to be used only where, as described above, we limit things. Ondemand and such, which will temporarily decrease freq, will ramp it up again at demand, and therefore lowering the capacity will skew things.
You'll put less load on because its run slower, and then you'll run it slower because there's less load on -> cyclic FAIL.
In that same discussion ISTR a suggestion about adding avg_running time, as opposed to the current avg_runnable. The sum of avg_running should be much more accurate, and still react correctly to migrations.
I haven't look in details but I agree that avg_running would be much more accurate than avg_runnable and should probably fit the requirement. Does it means that we could re-add the avg_running (or something similar) that has disappeared during the review of load avg tracking patchset ?
Sure, I think we killed it there because there wasn't an actual use for it and I'm always in favour of stripping everything to their bare bones, esp big and complex things.
And then later, add things back once we have need for it.
On Fri, Jul 11, 2014 at 09:12:38PM +0100, Peter Zijlstra wrote:
On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
In my mind, arch_scale_cpu_freq was intend to scale the capacity of the CPU according to the current dvfs operating point. As it's no more use anywhere now that we have arch_scale_cpu, we could probably remove it .. and see when it will become used.
I probably should have written comments when I wrote that code, but it was meant to be used only where, as described above, we limit things. Ondemand and such, which will temporarily decrease freq, will ramp it up again at demand, and therefore lowering the capacity will skew things.
You'll put less load on because its run slower, and then you'll run it slower because there's less load on -> cyclic FAIL.
Agreed. We can't use a frequency-scaled compute capacity for all load-balancing decisions. However, IMHO, it would be useful to know the current compute capacity in addition to the max compute capacity when considering energy costs. So we would have something like:
* capacity_max: cpu capacity at highest frequency.
* capacity_cur: cpu capacity at current frequency.
* capacity_avail: cpu capacity currently available. Basically capacity_cur taking rt, deadline, and irq accounting into account.
capacity_max should probably include rt, deadline, and irq accounting as well. Or do we need both?
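[Editor's note] To picture how the three proposed levels could relate, here is an illustrative sketch; none of these helpers exist in the kernel, and the frequency-ratio scaling is just one possible way to define capacity_cur.

/* Illustrative only: hypothetical helpers relating the proposed metrics. */
static unsigned long capacity_cur(int cpu)
{
	/* capacity at the current P-state, scaled down from the maximum one */
	return capacity_max(cpu) * freq_cur(cpu) / freq_max(cpu);
}

static unsigned long capacity_avail(int cpu)
{
	/* what is left for CFS after rt, deadline and irq pressure */
	return capacity_cur(cpu) - rt_dl_irq_pressure(cpu);
}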
Based on your description arch_scale_freq_capacity() can't be abused to implement capacity_cur (and capacity_avail) unless it is repurposed. Nobody seems to implement it. Otherwise we would need something similar to update capacity_cur (and capacity_avail).
As a side note, we can potentially get into a similar fail cycle already due to the lack of scale invariance in the entity load tracking.
In that same discussion ISTR a suggestion about adding avg_running time, as opposed to the current avg_runnable. The sum of avg_running should be much more accurate, and still react correctly to migrations.
I haven't look in details but I agree that avg_running would be much more accurate than avg_runnable and should probably fit the requirement. Does it means that we could re-add the avg_running (or something similar) that has disappeared during the review of load avg tracking patchset ?
Sure, I think we killed it there because there wasn't an actual use for it and I'm always in favour of stripping everything to their bare bones, esp big and complex things.
And then later, add things back once we have need for it.
I think it is a useful addition to the set of utilization metrics. I don't think it is universally more accurate than runnable_avg. Actually quite the opposite when the cpu is overloaded. But for partially loaded cpus it is very useful if you don't want to factor in waiting time on the rq.
Morten
On Mon, Jul 14, 2014 at 01:55:29PM +0100, Morten Rasmussen wrote:
On Fri, Jul 11, 2014 at 09:12:38PM +0100, Peter Zijlstra wrote:
On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
In my mind, arch_scale_cpu_freq was intend to scale the capacity of the CPU according to the current dvfs operating point. As it's no more use anywhere now that we have arch_scale_cpu, we could probably remove it .. and see when it will become used.
I probably should have written comments when I wrote that code, but it was meant to be used only where, as described above, we limit things. Ondemand and such, which will temporarily decrease freq, will ramp it up again at demand, and therefore lowering the capacity will skew things.
You'll put less load on because its run slower, and then you'll run it slower because there's less load on -> cyclic FAIL.
Agreed. We can't use a frequency scaled compute capacity for all load-balancing decisions. However, IMHO, it would be useful to have know the current compute capacity in addition to the max compute capacity when considering energy costs. So we would have something like:
capacity_max: cpu capacity at highest frequency.
capacity_cur: cpu capacity at current frequency.
capacity_avail: cpu capacity currently available. Basically capacity_cur taking rt, deadline, and irq accounting into account.
capacity_max should probably include rt, deadline, and irq accounting as well. Or we need both?
I'm struggling to fully grasp your intent. We need DVFS like accounting for sure, and that means a current freq hook, but I'm not entirely sure how that relates to capacity.
Based on your description arch_scale_freq_capacity() can't be abused to implement capacity_cur (and capacity_avail) unless it is repurposed. Nobody seems to implement it. Otherwise we would need something similar to update capacity_cur (and capacity_avail).
Yeah, I never got around to doing so. I started doing an APERF/MPERF SMT capacity thing for x86 but never finished that. The naive implementation suffered the same FAIL loop as above because APERF stops on idle. So when idle your capacity drops to nothing, leading to no new work, leading to more idle etc.
I never got around to fixing that -- adding an idle filter, and ever since things have somewhat bitrotted.
As a side note, we can potentially get into a similar fail cycle already due to the lack of scale invariance in the entity load tracking.
Yah, I think that got mentioned a long while ago.
In that same discussion ISTR a suggestion about adding avg_running time, as opposed to the current avg_runnable. The sum of avg_running should be much more accurate, and still react correctly to migrations.
I haven't look in details but I agree that avg_running would be much more accurate than avg_runnable and should probably fit the requirement. Does it means that we could re-add the avg_running (or something similar) that has disappeared during the review of load avg tracking patchset ?
Sure, I think we killed it there because there wasn't an actual use for it and I'm always in favour of stripping everything to their bare bones, esp big and complex things.
And then later, add things back once we have need for it.
I think it is a useful addition to the set of utilization metrics. I don't think it is universally more accurate than runnable_avg. Actually quite the opposite when the cpu is overloaded. But for partially loaded cpus it is very useful if you don't want to factor in waiting time on the rq.
Well, different things different names. Utilization as per literature is simply the fraction of CPU time actually used. In that sense running_avg is about right for that. Our current runnable_avg is entirely different (as I think we all agree by now).
But yes, for application the tipping point is u == 1, up until that point pure utilization makes sense, after that our runnable_avg makes more sense.
On Mon, Jul 14, 2014 at 02:20:52PM +0100, Peter Zijlstra wrote:
On Mon, Jul 14, 2014 at 01:55:29PM +0100, Morten Rasmussen wrote:
On Fri, Jul 11, 2014 at 09:12:38PM +0100, Peter Zijlstra wrote:
On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
In my mind, arch_scale_cpu_freq was intend to scale the capacity of the CPU according to the current dvfs operating point. As it's no more use anywhere now that we have arch_scale_cpu, we could probably remove it .. and see when it will become used.
I probably should have written comments when I wrote that code, but it was meant to be used only where, as described above, we limit things. Ondemand and such, which will temporarily decrease freq, will ramp it up again at demand, and therefore lowering the capacity will skew things.
You'll put less load on because its run slower, and then you'll run it slower because there's less load on -> cyclic FAIL.
Agreed. We can't use a frequency scaled compute capacity for all load-balancing decisions. However, IMHO, it would be useful to have know the current compute capacity in addition to the max compute capacity when considering energy costs. So we would have something like:
capacity_max: cpu capacity at highest frequency.
capacity_cur: cpu capacity at current frequency.
capacity_avail: cpu capacity currently available. Basically capacity_cur taking rt, deadline, and irq accounting into account.
capacity_max should probably include rt, deadline, and irq accounting as well. Or we need both?
I'm struggling to fully grasp your intent. We need DVFS like accounting for sure, and that means a current freq hook, but I'm not entirely sure how that relates to capacity.
We can abstract all the factors that affect current compute capacity (frequency, P-states, big.LITTLE, ...) in the scheduler by having something like capacity_{cur,avail} to tell us how much capacity a particular cpu has in its current state. Assuming that we implement scale invariance for entity load tracking (we are working on that), we can directly compare task utilization with compute capacity for balancing decisions. For example, we can figure out how much spare capacity a cpu has in its current state by simply:
spare_capacity(cpu) = capacity_avail(cpu) - \sum_{tasks(cpu)}^{t} util(t)
If you put more than spare_capacity(cpu) worth of task utilization on the cpu, you will cause the cpu (and any affected cpus) to change P-state and potentially be less energy-efficient.
Does that make any sense?
Instead of dealing with frequencies directly in the scheduler code, we can abstract it by just having scalable compute capacity.
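[Editor's note] Continuing the sketch above, the spare-capacity computation could be pictured as follows; purely illustrative, with capacity_avail() and task_util() as hypothetical helpers, the latter standing in for the scale-invariant per-task utilization being worked on.

/*
 * Illustrative only: spare capacity of a cpu in its current state, assuming
 * a scale-invariant per-task utilization metric.
 */
static unsigned long spare_capacity(int cpu, struct task_struct **tasks, int nr)
{
	unsigned long avail = capacity_avail(cpu);
	unsigned long util = 0;
	int i;

	for (i = 0; i < nr; i++)
		util += task_util(tasks[i]);

	/* placing more than the returned amount of utilization here is likely
	 * to force a P-state change on this cpu (and possibly related cpus) */
	return util < avail ? avail - util : 0;
}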
Based on your description arch_scale_freq_capacity() can't be abused to implement capacity_cur (and capacity_avail) unless it is repurposed. Nobody seems to implement it. Otherwise we would need something similar to update capacity_cur (and capacity_avail).
Yeah, I never got around to doing so. I started doing a APERF/MPERF SMT capacity thing for x86 but never finished that. The naive implementation suffered the same FAIL loop as above because APERF stops on idle. So when idle your capacity drops to nothing, leading to no new work, leading to more idle etc.
I never got around to fixing that -- adding an idle filter, and ever since things have somewhat bitrotted.
I see.
As a side note, we can potentially get into a similar fail cycle already due to the lack of scale invariance in the entity load tracking.
Yah, I think that got mentioned a long while ago.
It did :-)
In that same discussion ISTR a suggestion about adding avg_running time, as opposed to the current avg_runnable. The sum of avg_running should be much more accurate, and still react correctly to migrations.
I haven't look in details but I agree that avg_running would be much more accurate than avg_runnable and should probably fit the requirement. Does it means that we could re-add the avg_running (or something similar) that has disappeared during the review of load avg tracking patchset ?
Sure, I think we killed it there because there wasn't an actual use for it and I'm always in favour of stripping everything to their bare bones, esp big and complex things.
And then later, add things back once we have need for it.
I think it is a useful addition to the set of utilization metrics. I don't think it is universally more accurate than runnable_avg. Actually quite the opposite when the cpu is overloaded. But for partially loaded cpus it is very useful if you don't want to factor in waiting time on the rq.
Well, different things different names. Utilization as per literature is simply the fraction of CPU time actually used. In that sense running_avg is about right for that. Our current runnable_avg is entirely different (as I think we all agree by now).
But yes, for application the tipping point is u == 1, up until that point pure utilization makes sense, after that our runnable_avg makes more sense.
Agreed.
If you really care about latency/performance you might be interested in comparing running_avg and runnable_avg even for u < 1. If the running_avg/runnable_avg ratio is significantly less than one, tasks are waiting on the rq to be scheduled.
On Mon, Jul 14, 2014 at 03:04:35PM +0100, Morten Rasmussen wrote:
I'm struggling to fully grasp your intent. We need DVFS like accounting for sure, and that means a current freq hook, but I'm not entirely sure how that relates to capacity.
We can abstract all the factors that affect current compute capacity (frequency, P-states, big.LITTLE,...) in the scheduler by having something like capacity_{cur,avail} to tell us how much capacity does a particular cpu have in its current state. Assuming that implement scale invariance for entity load tracking (we are working on that), we can directly compare task utilization with compute capacity for balancing decisions. For example, we can figure out how much spare capacity a cpu has in its current state by simply:
spare_capacity(cpu) = capacity_avail(cpu) - \sum_{tasks(cpu)}^{t} util(t)
If you put more than spare_capacity(cpu) worth of task utilization on the cpu, you will cause the cpu (and any affected cpus) to change P-state and potentially be less energy-efficient.
Does that make any sense?
Instead of dealing with frequencies directly in the scheduler code, we can abstract it by just having scalable compute capacity.
Ah, ok. Same thing then.
But yes, for application the tipping point is u == 1, up until that point pure utilization makes sense, after that our runnable_avg makes more sense.
Agreed.
If you really care about latency/performance you might be interested in comparing running_avg and runnable_avg even for u < 1. If the running_avg/runnable_avg ratio is significantly less than one, tasks are waiting on the rq to be scheduled.
Indeed, that gives a measure of queueing.
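[Editor's note] A sketch of that queueing measure; hypothetical, since running_avg_sum does not exist today and is the 'running' counterpart discussed in this thread.

/*
 * Hypothetical sketch: ratio of time spent running to time spent runnable,
 * scaled to SCHED_CAPACITY_SCALE. Values well below the scale mean tasks
 * spend a lot of time waiting on the runqueue rather than executing.
 * running_avg_sum is an assumed field, not part of struct sched_avg today.
 */
static inline unsigned long queueing_ratio(struct sched_avg *sa)
{
	if (!sa->runnable_avg_sum)
		return SCHED_CAPACITY_SCALE;

	return (sa->running_avg_sum * SCHED_CAPACITY_SCALE) /
	       sa->runnable_avg_sum;
}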
On 11 July 2014 22:12, Peter Zijlstra peterz@infradead.org wrote:
On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
In my mind, arch_scale_cpu_freq was intend to scale the capacity of the CPU according to the current dvfs operating point. As it's no more use anywhere now that we have arch_scale_cpu, we could probably remove it .. and see when it will become used.
I probably should have written comments when I wrote that code, but it was meant to be used only where, as described above, we limit things. Ondemand and such, which will temporarily decrease freq, will ramp it up again at demand, and therefore lowering the capacity will skew things.
You'll put less load on because its run slower, and then you'll run it slower because there's less load on -> cyclic FAIL.
In that same discussion ISTR a suggestion about adding avg_running time, as opposed to the current avg_runnable. The sum of avg_running should be much more accurate, and still react correctly to migrations.
I haven't look in details but I agree that avg_running would be much more accurate than avg_runnable and should probably fit the requirement. Does it means that we could re-add the avg_running (or something similar) that has disappeared during the review of load avg tracking patchset ?
Sure, I think we killed it there because there wasn't an actual use for it and I'm always in favour of stripping everything to their bare bones, esp big and complex things.
And then later, add things back once we have need for it.
OK, I'm going to look at how to add it back, taking into account Yuyang's current rework of load_avg.
[...]
In that same discussion ISTR a suggestion about adding avg_running time, as opposed to the current avg_runnable. The sum of avg_running should be much more accurate, and still react correctly to migrations.
I haven't look in details but I agree that avg_running would be much more accurate than avg_runnable and should probably fit the requirement. Does it means that we could re-add the avg_running (or something similar) that has disappeared during the review of load avg tracking patchset ?
Are you referring to '[RFC PATCH 14/14] sched: implement usage tracking' https://lkml.org/lkml/2012/2/1/769 from Paul Turner?
__update_entity_runnable_avg() has an additional parameter 'running' so that it can be called for
a) sched_entities in update_entity_load_avg():
__update_entity_runnable_avg(..., se->on_rq, cfs_rq->curr == se))
b) rq's in update_rq_runnable_avg():
__update_entity_runnable_avg(..., runnable, runnable);
I can see how it gives us two different signals for a sched_entity but for a rq?
Do I miss something here?
-- Dietmar
[...]
On Mon, Jul 14, 2014 at 06:54:11PM +0100, Dietmar Eggemann wrote:
__update_entity_runnable_avg() has an additional parameter 'running' so that it can be called for
a) sched_entities in update_entity_load_avg():
__update_entity_runnable_avg(..., se->on_rq, cfs_rq->curr == se))
b) rq's in update_rq_runnable_avg():
__update_entity_runnable_avg(..., runnable, runnable);
I can see how it gives us two different signals for a sched_entity but for a rq?
For rq,
__update_entity_runnable_avg(..., runnable, runnable > 0)
Then, the first one would effectively be CPU ConCurrency (fair and !fair) and the second one would be CPU (has task) running (or roughly CPU utilization for fair and !fair). :)
Thanks, Yuyang
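[Editor's note] To make that concrete, here is a sketch of the accumulation step with the extra flag, based on the usage-tracking RFC linked above; running_avg_sum is a hypothetical field standing in for "time actually spent executing", and decay handling is omitted for brevity.

/*
 * Sketch only: accumulation with a 'running' flag next to 'runnable', as in
 * the usage-tracking RFC referenced above. Decay handling is omitted.
 */
static void sketch_update_avg(u64 delta, struct sched_avg *sa,
			      int runnable, int running)
{
	if (runnable)
		sa->runnable_avg_sum += delta;	/* running + waiting on the rq */
	if (running)
		sa->running_avg_sum += delta;	/* hypothetical: actually on the CPU */
	sa->runnable_avg_period += delta;	/* total elapsed time */
}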
On Fri, Jul 11, 2014 at 08:51:06AM +0100, Vincent Guittot wrote:
On 10 July 2014 15:16, Peter Zijlstra peterz@infradead.org wrote:
On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
We are going to use runnable_avg_sum and runnable_avg_period in order to get the utilization of the CPU. This statistic includes all tasks that run the CPU and not only CFS tasks.
But this rq->avg is not the one that is migration aware, right? So why use this?
Yes, it's not the one that is migration aware
We already compensate cpu_capacity for !fair tasks, so I don't see why we can't use the migration aware one (and kill this one as Yuyang keeps proposing) and compensate with the capacity factor.
The 1st point is that cpu_capacity is compensated by both !fair_tasks and frequency scaling and we should not take into account frequency scaling for detecting overload
What we have now is the the weighted load avg that is the sum of the weight load of entities on the run queue. This is not usable to detect overload because of the weight. An unweighted version of this figure would be more usefull but it's not as accurate as the one I use IMHO.
IMHO there is no perfect utilization metric, but I think it is fundamentally wrong to use a metric that is migration unaware to make migration decisions. I mentioned that during the last review as well. It is like having a very fast controller with a really slow (large delay) feedback loop. There is a high risk of getting an unstable balance when your load-balance rate is faster than the feedback delay.
The example that has been discussed during the review of the last version has shown some limitations
With the following schedule pattern from Morten's example
| 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | A: run rq run ----------- sleeping ------------- run B: rq run rq run ---- sleeping ------------- rq
The scheduler will see the following values: Task A unweighted load value is 47% Task B unweight load is 60% The maximum Sum of unweighted load is 104% rq->avg load is 60%
And the real CPU load is 50%
So we will have opposite decision depending of the used values: the rq->avg or the Sum of unweighted load
The sum of unweighted load has the main advantage of showing immediately what will be the relative impact of adding/removing a task. In the example, we can see that removing task A or B will remove around half the CPU load but it's not so good for giving the current utilization of the CPU
You forgot to mention the issues with rq->avg that were brought up last time :-)
Here is a load-balancing example:
Tasks A, B, C, and D are all running/runnable constantly. To avoid decimals we assume the sched tick to have a 9 ms period. We have four cpus in a single sched_domain.
rq == rq->avg
uw == unweighted tracked load
cpu0:	| 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
	A:  run   rq    rq
	B:  rq    run   rq
	C:  rq    rq    run
	D:  rq    rq    rq    run   run   run   run   run   run
	rq: 100%  100%  100%  100%  100%  100%  100%  100%  100%
	uw: 400%  400%  400%  100%  100%  100%  100%  100%  100%

cpu1:	| 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
	A:                    run   rq    run   rq    run   rq
	B:                    rq    run   rq    run   rq    run
	C:
	D:
	rq:   0%    0%    0%    0%    6%   12%   18%   23%   28%
	uw:   0%    0%    0%  200%  200%  200%  200%  200%  200%

cpu2:	| 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
	A:
	B:
	C:                    run   run   run   run   run   run
	D:
	rq:   0%    0%    0%    0%    6%   12%   18%   23%   28%
	uw:   0%    0%    0%  100%  100%  100%  100%  100%  100%

cpu3:	| 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
	A:
	B:
	C:
	D:
	rq:   0%    0%    0%    0%    0%    0%    0%    0%    0%
	uw:   0%    0%    0%    0%    0%    0%    0%    0%    0%
A periodic load-balance occurs on cpu1 after 9 ms. cpu0 rq->avg indicates overload. Consequently cpu1 pulls task A and B.
Shortly after (<1 ms) cpu2 does a periodic load-balance. cpu0 rq->avg hasn't changed so cpu0 still appears overloaded. cpu2 pulls task C.
Shortly after (<1 ms) cpu3 does a periodic load-balance. cpu0 rq->avg still indicates overload so cpu3 tries to pull tasks but fails since there is only task D left.
9 ms later the sched tick causes periodic load-balances on all the cpus. cpu0 rq->avg still indicates that it has the highest load since cpu1 rq->avg has not had time to indicate overload. Consequently cpu1, 2, and 3 will try to pull from cpu0 and fail. The balance will only change once cpu1 rq->avg has increased enough to indicate overload.
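For reference, the slow rq ramp in the tables above is just the geometric decay of the runnable average. A minimal userspace sketch (not kernel code), assuming the usual ~32 ms half-life of the decay and a cpu that goes from long idle to fully busy, reproduces the 6/12/18/23/28% figures:

#include <math.h>
#include <stdio.h>

/*
 * Sketch only: ratio of runnable_avg_sum over runnable_avg_period for a
 * cpu that was idle for a long time and then becomes fully busy, assuming
 * a ~32 ms half-life of the decay: ratio(t) = 1 - 0.5^(t/32).
 */
int main(void)
{
	int t;

	for (t = 3; t <= 15; t += 3)
		printf("busy for %2d ms -> rq ~%2.0f%%\n",
		       t, 100.0 * (1.0 - pow(0.5, t / 32.0)));
	return 0;
}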
Unweighted load, on the other hand, reflects the load changes instantaneously, so cpu3 would observe the overload of cpu1 immediately and pull task A or B.
In this example using rq->avg leads to imbalance whereas unweighted load would not. Correct me if I missed anything.
Coming back to the previous example. I'm not convinced that inflation of the unweighted load sum when tasks overlap in time is a bad thing. I have mentioned this before. The average cpu utilization over the 40ms period is 50%. However the true compute capacity demand is 200% for the first 15ms of the period, 100% for the next 5ms and 0% for the remaining 25ms. The cpu is actually overloaded for 15ms every 40ms. This fact is factored into the unweighted load whereas rq->avg would give you the same utilization no matter if the tasks are overlapped or not. Hence unweighted load would give us an indication that the mix of tasks isn't optimal even if the cpu has spare cycles.
If you don't care about overlap and latency, the unweighted sum of task running time (that Peter has proposed a number of times) is a better metric, IMHO, as long as the cpu isn't fully utilized.
Morten
On 11 July 2014 18:13, Morten Rasmussen morten.rasmussen@arm.com wrote:
[snip]
In this example using rq->avg leads to imbalance whereas unweighted load would not. Correct me if I missed anything.
You just missed taking into account how the imbalance is computed.
On Tue, Jul 15, 2014 at 10:27:19AM +0100, Vincent Guittot wrote:
On 11 July 2014 18:13, Morten Rasmussen morten.rasmussen@arm.com wrote:
[snip]
In this example using rq->avg leads to imbalance whereas unweighted load would not. Correct me if I missed anything.
You just missed taking into account how the imbalance is computed.
I don't think so. I'm aware that the imbalance is calculated based on the runnable_load_avg of the cpus. But if you pick the wrong cpus to compare to begin with, it doesn't matter.
Morten
On 15 July 2014 11:32, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Tue, Jul 15, 2014 at 10:27:19AM +0100, Vincent Guittot wrote:
On 11 July 2014 18:13, Morten Rasmussen morten.rasmussen@arm.com wrote:
[snip]
In this example using rq->avg leads to imbalance whereas unweighted load would not. Correct me if I missed anything.
You just missed taking into account how the imbalance is computed.
I don't think so. I'm aware that the imbalance is calculated based on the runnable_load_avg of the cpus. But if you pick the wrong cpus to compare to begin with, it doesn't matter.
The average load of your sched_domain is 1 task per CPU, so cpu1 will pull only 1 task, not 2.
Monitor the utilization level of each group at each sched_domain level. The utilization is the amount of cpu_power that is currently used on a CPU or group of CPUs. We use the runnable_avg_sum and _period to evaluate this utilization level. In the special use case where the CPU is fully loaded by more than 1 task, the activity level is set above the cpu_power in order to reflect the overload of the CPU.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bdc5cb9..a6b4b25 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4047,6 +4047,11 @@ static unsigned long capacity_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity;
 }

+static unsigned long capacity_orig_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_capacity_orig;
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -4437,6 +4442,18 @@ done:
 	return target;
 }

+static int get_cpu_utilization(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	if (sum >= period)
+		return capacity_orig_of(cpu) + 1;
+
+	return (sum * capacity_orig_of(cpu)) / period;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5517,6 +5534,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_capacity;
+	unsigned long group_utilization; /* Total utilization of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int group_capacity_factor;
 	unsigned int idle_cpus;
@@ -5876,6 +5894,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		load = source_load(i, load_idx);

 		sgs->group_load += load;
+		sgs->group_utilization += get_cpu_utilization(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
@@ -6043,7 +6062,6 @@ next_group:
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
-
 		sg = sg->next;
 	} while (sg != env->sd->groups);
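As a side note, the ratio computed by get_cpu_utilization() can be illustrated with a small userspace sketch (not part of the patch; the sum/period inputs below are made-up values):

#include <stdio.h>

/* Same computation as get_cpu_utilization() above, with made-up inputs. */
static unsigned long cpu_utilization(unsigned int sum, unsigned int period,
				     unsigned long capacity_orig)
{
	if (sum >= period)
		return capacity_orig + 1;	/* flag a fully loaded cpu */

	return (unsigned long)sum * capacity_orig / period;
}

int main(void)
{
	printf("%lu\n", cpu_utilization(200, 400, 1024));	/* half busy   -> 512  */
	printf("%lu\n", cpu_utilization(400, 400, 1024));	/* always busy -> 1025 */
	return 0;
}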
The scheduler tries to compute how many tasks a group of CPUs can handle by assuming that a task's load is SCHED_LOAD_SCALE and that a CPU's capacity is SCHED_POWER_SCALE. Thanks to the rework of group_capacity_orig and the new group_utilization, we now have a better idea of the capacity of a group of CPUs and of its utilization, and we can deduce how much capacity is still available.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 107 +++++++++++++++++++---------------------------------
 1 file changed, 38 insertions(+), 69 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a6b4b25..cf65284 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5534,13 +5534,13 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_capacity;
+	unsigned long group_capacity_orig;
 	unsigned long group_utilization; /* Total utilization of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
-	unsigned int group_capacity_factor;
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
-	int group_has_free_capacity;
+	int group_out_of_capacity;
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -5781,31 +5781,6 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 }

 /*
- * Try and fix up capacity for tiny siblings, this is needed when
- * things like SD_ASYM_PACKING need f_b_g to select another sibling
- * which on its own isn't powerful enough.
- *
- * See update_sd_pick_busiest() and check_asym_packing().
- */
-static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
-{
-	/*
-	 * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
-	 */
-	if (!(sd->flags & SD_SHARE_CPUCAPACITY))
-		return 0;
-
-	/*
-	 * If ~90% of the cpu_capacity is still there, we're good.
-	 */
-	if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
-		return 1;
-
-	return 0;
-}
-
-/*
  * Group imbalance indicates (and tries to solve) the problem where balancing
  * groups is inadequate due to tsk_cpus_allowed() constraints.
  *
@@ -5839,32 +5814,24 @@ static inline int sg_imbalanced(struct sched_group *group)
 	return group->sgc->imbalance;
 }

-/*
- * Compute the group capacity factor.
- *
- * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit unit capacity with that.
- */
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
+static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
+			struct lb_env *env)
 {
-	unsigned int capacity_factor, smt, cpus;
-	unsigned int capacity, capacity_orig;
-
-	capacity = group->sgc->capacity;
-	capacity_orig = group->sgc->capacity_orig;
-	cpus = group->group_weight;
+	if ((sgs->group_capacity_orig * 100) > (sgs->group_utilization * env->sd->imbalance_pct)
+		|| (sgs->sum_nr_running < sgs->group_weight))
+		return 1;

-	/* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
-	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
-	capacity_factor = cpus / smt; /* cores */
+	return 0;
+}

-	capacity_factor = min_t(unsigned,
-		capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
-	if (!capacity_factor)
-		capacity_factor = fix_small_capacity(env->sd, group);
+static inline int group_is_overloaded(struct sg_lb_stats *sgs,
+			struct lb_env *env)
+{
+	if ((sgs->group_capacity_orig * 100) < (sgs->group_utilization * env->sd->imbalance_pct)
+		&& (sgs->sum_nr_running > sgs->group_weight))
+		return 1;

-	return capacity_factor;
+	return 0;
 }

 /**
@@ -5905,6 +5872,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			sgs->idle_cpus++;
 	}

+	sgs->group_capacity_orig = group->sgc->capacity_orig;
 	/* Adjust by relative CPU capacity of the group */
 	sgs->group_capacity = group->sgc->capacity;
 	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
@@ -5915,10 +5883,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	sgs->group_weight = group->group_weight;

 	sgs->group_imb = sg_imbalanced(group);
-	sgs->group_capacity_factor = sg_capacity_factor(env, group);

-	if (sgs->group_capacity_factor > sgs->sum_nr_running)
-		sgs->group_has_free_capacity = 1;
+	sgs->group_out_of_capacity = group_is_overloaded(sgs, env);
 }

 /**
@@ -5942,7 +5908,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	if (sgs->avg_load <= sds->busiest_stat.avg_load)
 		return false;

-	if (sgs->sum_nr_running > sgs->group_capacity_factor)
+	if (sgs->group_out_of_capacity)
 		return true;

 	if (sgs->group_imb)
@@ -6041,17 +6007,20 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

 		/*
 		 * In case the child domain prefers tasks go to siblings
-		 * first, lower the sg capacity factor to one so that we'll try
+		 * first, lower the sg capacity to one so that we'll try
 		 * and move all the excess tasks away. We lower the capacity
 		 * of a group only if the local group has the capacity to fit
-		 * these excess tasks, i.e. nr_running < group_capacity_factor. The
+		 * these excess tasks, i.e. group_capacity > 0. The
 		 * extra check prevents the case where you always pull from the
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
 		 */
 		if (prefer_sibling && sds->local &&
-		    sds->local_stat.group_has_free_capacity)
-			sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
+		    group_has_free_capacity(&sds->local_stat, env)) {
+			if (sgs->sum_nr_running > 1)
+				sgs->group_out_of_capacity = 1;
+			sgs->group_capacity = min(sgs->group_capacity, SCHED_CAPACITY_SCALE);
+		}

 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
@@ -6223,11 +6192,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		 * Except of course for the group_imb case, since then we might
 		 * have to drop below capacity to reach cpu-load equilibrium.
 		 */
-		load_above_capacity =
-			(busiest->sum_nr_running - busiest->group_capacity_factor);
-
-		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
-		load_above_capacity /= busiest->group_capacity;
+		load_above_capacity = busiest->sum_nr_running * SCHED_LOAD_SCALE;
+		if (load_above_capacity > busiest->group_capacity)
+			load_above_capacity -= busiest->group_capacity;
+		else
+			load_above_capacity = ~0UL;
 	}

 	/*
@@ -6290,6 +6259,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;

+	/* ASYM feature bypasses nice load balance check */
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
 		return sds.busiest;
@@ -6310,8 +6280,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;

 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
-	    !busiest->group_has_free_capacity)
+	if (env->idle == CPU_NEWLY_IDLE && group_has_free_capacity(local, env) &&
+	    busiest->group_out_of_capacity)
 		goto force_balance;

 	/*
@@ -6369,7 +6339,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 	int i;

 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
-		unsigned long capacity, capacity_factor, wl;
+		unsigned long capacity, wl;
 		enum fbq_type rt;

 		rq = cpu_rq(i);
@@ -6398,9 +6368,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			continue;

 		capacity = capacity_of(i);
-		capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
-		if (!capacity_factor)
-			capacity_factor = fix_small_capacity(env->sd, group);

 		wl = weighted_cpuload(i);

@@ -6408,7 +6375,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu capacity.
 		 */
-		if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
+
+		if (rq->nr_running == 1 && wl > env->imbalance &&
+			((capacity * env->sd->imbalance_pct) >= (rq->cpu_capacity_orig * 100)))
 			continue;

 		/*
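As an illustration of the new criterion: a group is flagged out of capacity when its utilization exceeds capacity_orig scaled down by imbalance_pct and it runs more tasks than it has CPUs. A standalone userspace sketch with made-up numbers (assuming the common imbalance_pct value of 125, i.e. an ~80% threshold):

#include <stdio.h>

/* Standalone sketch of the group_is_overloaded() test introduced above. */
static int overloaded(unsigned long capacity_orig, unsigned long utilization,
		      unsigned int nr_running, unsigned int weight,
		      unsigned int imbalance_pct)
{
	return (capacity_orig * 100) < (utilization * imbalance_pct) &&
		nr_running > weight;
}

int main(void)
{
	/* 2 cpus of capacity 1024 each, 3 runnable tasks */
	printf("%d\n", overloaded(2048, 1700, 3, 2, 125));	/* ~83% utilized -> 1 */
	printf("%d\n", overloaded(2048, 1500, 3, 2, 125));	/* ~73% utilized -> 0 */
	return 0;
}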
Add the SD_PREFER_SIBLING flag at the SMT level in order to ensure that the scheduler will put at least 1 task per core.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 92dd6d1..9103669 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6068,6 +6068,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 	 */

 	if (sd->flags & SD_SHARE_CPUCAPACITY) {
+		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */