Part of this patchset was previously part of the larger tasks packing patchset [1]. I have split the latter into (at least) 3 different patchsets to make the review easier:
-configuration of sched_domain topology [2]
-update and consolidation of cpu_power (this patchset)
-tasks packing algorithm
SMT systems are no longer the only systems that can have CPUs with an original capacity that is different from the default value. We need to extend the use of cpu_power_orig to all kinds of platforms so the scheduler has both the maximum capacity (cpu_power_orig/power_orig) and the current capacity (cpu_power/power) of CPUs and sched_groups. A new function, arch_scale_cpu_power, has been created and replaces arch_scale_smt_power, which is SMT-specific, in the computation of the capacity of a CPU.
During load balance, the scheduler evaluates the number of tasks that a group of CPUs can handle. The current method assumes that tasks have a fixed load of SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_POWER_SCALE. This assumption generates wrong decisions by creating ghost cores or by removing real ones when the original capacity of the CPUs differs from the default SCHED_POWER_SCALE.
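As a rough illustration (figures are made up; this assumes the group capacity is derived by rounding the group's cumulated cpu_power to the nearest multiple of SCHED_POWER_SCALE, as the current code does): a group of 2 CPUs with an original capacity of 1441 each gives 2882 / 1024 ~= 3, i.e. one ghost core, whereas a group of 4 CPUs with an original capacity of 606 each gives 2424 / 1024 ~= 2, i.e. two real cores disappear.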
Now that we have the original capacity of a CPU and its activity/utilization, we can evaluate the capacity of a group of CPUs more accurately.
This patchset mainly replaces the old capacity method with a new one and keeps the policy almost unchanged, although we could certainly take advantage of this new statistic in several other places of the load balance.
TODO:
- align variable and field names with the renaming [3]
Test results: below are the results of 2 tests:
- hackbench -l 500 -s 4096
- scp of a 100MB file on the platform
on a dual cortex-A7

                   hackbench         scp
 tip/master        25.75s(+/-0.25)   5.16MB/s(+/-1.49)
 + patches 1,2     25.89s(+/-0.31)   5.18MB/s(+/-1.45)
 + patches 3-10    25.68s(+/-0.22)   7.00MB/s(+/-1.88)
 + irq accounting  25.80s(+/-0.25)   8.06MB/s(+/-0.05)
on a quad cortex-A15

                   hackbench         scp
 tip/master        15.69s(+/-0.16)    9.70MB/s(+/-0.04)
 + patches 1,2     15.53s(+/-0.13)    9.72MB/s(+/-0.05)
 + patches 3-10    15.56s(+/-0.22)    9.88MB/s(+/-0.05)
 + irq accounting  15.99s(+/-0.08)   10.37MB/s(+/-0.03)
The improvement of the scp bandwidth happens when tasks and irqs are using different CPUs, which is a bit random without the irq accounting config.
Changes since V1:
- add 3 fixes
- correct some commit messages
- replace capacity computation with activity
- take into account the current cpu capacity
[1] https://lkml.org/lkml/2013/10/18/121
[2] https://lkml.org/lkml/2014/3/19/377
[3] https://lkml.org/lkml/2014/5/14/622
Vincent Guittot (11):
  sched: fix imbalance flag reset
  sched: remove a wake_affine condition
  sched: fix avg_load computation
  sched: Allow all archs to set the power_orig
  ARM: topology: use new cpu_power interface
  sched: add per rq cpu_power_orig
  Revert "sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED"
  sched: get CPU's activity statistic
  sched: test the cpu's capacity in wake affine
  sched: move cfs task on a CPU with higher capacity
  sched: replace capacity by activity
 arch/arm/kernel/topology.c |   4 +-
 kernel/sched/core.c        |   2 +-
 kernel/sched/fair.c        | 229 ++++++++++++++++++++++-----------------------
 kernel/sched/sched.h       |   5 +-
 4 files changed, 118 insertions(+), 122 deletions(-)
The imbalance flag can stay set even though there is no imbalance.
Let's assume that we have 3 tasks that run on a dual-core / dual-cluster system. Some idle load balances will be triggered during the tick. Unfortunately, the tick is also used to queue background work, so we can reach a situation where a short-lived work item has been queued on a CPU that already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task onto the idle CPU. The waiting task is a worker thread that is pinned to its CPU, so an imbalance due to a pinned task is detected and the imbalance flag is set. Then we will never be able to clear the flag, because we have at most 1 task on each CPU, yet the imbalance flag will trigger useless active load balances between the idle CPU and the busy CPU.
We need to reset the imbalance flag as soon as we have reached a balanced state.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9617b7..9587ed1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6610,10 +6610,8 @@ more_balance:
 		if (sd_parent) {
 			int *group_imbalance = &sd_parent->groups->sgp->imbalance;

-			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
+			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
 				*group_imbalance = 1;
-			} else if (*group_imbalance)
-				*group_imbalance = 0;
 		}

 		/* All tasks on this runqueue were pinned by CPU affinity */
@@ -6698,6 +6696,16 @@ more_balance:
 		goto out;

 out_balanced:
+	/*
+	 * We reach balance although we may have faced some affinity
+	 * constraints. Clear the imbalance flag if it was set.
+	 */
+	if (sd_parent) {
+		int *group_imbalance = &sd_parent->groups->sgp->imbalance;
+		if (*group_imbalance)
+			*group_imbalance = 0;
+	}
+
 	schedstat_inc(sd, lb_balanced[idle]);

 	sd->nr_balance_failed = 0;
Hi Vincent,
On 05/23/2014 09:22 PM, Vincent Guittot wrote:
The imbalance flag can stay set whereas there is no imbalance.
Let assume that we have 3 tasks that run on a dual cores /dual cluster system. We will have some idle load balance which are triggered during tick. Unfortunately, the tick is also used to queue background work so we can reach the situation where short work has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task on the idle CPU. The waiting task is a worker thread that is pinned on a CPU so an imbalance due to pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU but the imbalance flag will trig to useless active load balance between the idle CPU and the busy CPU.
Why do we do active balancing today when there is at-most 1 task on the busiest cpu? Shouldn't we be skipping load balancing altogether? If we do active balancing when the number of tasks = 1, it will lead to a ping pong right?
Regards Preeti U Murthy
On 25 May 2014 12:33, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
On 05/23/2014 09:22 PM, Vincent Guittot wrote:
The imbalance flag can stay set whereas there is no imbalance.
Let assume that we have 3 tasks that run on a dual cores /dual cluster system. We will have some idle load balance which are triggered during tick. Unfortunately, the tick is also used to queue background work so we can reach the situation where short work has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task on the idle CPU. The waiting task is a worker thread that is pinned on a CPU so an imbalance due to pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU but the imbalance flag will trig to useless active load balance between the idle CPU and the busy CPU.
Why do we do active balancing today when there is at-most 1 task on the busiest cpu? Shouldn't we be skipping load balancing altogether? If we do active balancing when the number of tasks = 1, it will lead to a ping pong right?
That's precisely what this patch prevents: the useless active load balance. When the imbalance flag is set, an active load balance is triggered whatever the result of the load balance is, because the pinned tasks prevent us from reaching a balanced state.
Vincent
Regards Preeti U Murthy
On 05/26/2014 01:19 PM, Vincent Guittot wrote:
On 25 May 2014 12:33, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
On 05/23/2014 09:22 PM, Vincent Guittot wrote:
The imbalance flag can stay set whereas there is no imbalance.
Let assume that we have 3 tasks that run on a dual cores /dual cluster system. We will have some idle load balance which are triggered during tick. Unfortunately, the tick is also used to queue background work so we can reach the situation where short work has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task on the idle CPU. The waiting task is a worker thread that is pinned on a CPU so an imbalance due to pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU but the imbalance flag will trig to useless active load balance between the idle CPU and the busy CPU.
Why do we do active balancing today when there is at-most 1 task on the busiest cpu? Shouldn't we be skipping load balancing altogether? If we do active balancing when the number of tasks = 1, it will lead to a ping pong right?
That's the purpose of the patch to prevent this useless active load balance. When the imbalance flag is set, an active load balance is triggered whatever the load balance is because of pinned tasks that prevents a balance state.
No I mean this:
sched: Do not continue load balancing when the busiest cpu has one running task
From: Preeti U Murthy preeti@linux.vnet.ibm.com
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9617b7..b175333 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6626,6 +6626,8 @@ more_balance:
 			}
 			goto out_balanced;
 		}
+	} else {
+		goto out;
 	}

 	if (!ld_moved) {
}
Regards Preeti U Murthy
Vincent
Regards Preeti U Murthy
On 26 May 2014 11:16, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
On 05/26/2014 01:19 PM, Vincent Guittot wrote:
On 25 May 2014 12:33, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
On 05/23/2014 09:22 PM, Vincent Guittot wrote:
The imbalance flag can stay set whereas there is no imbalance.
Let assume that we have 3 tasks that run on a dual cores /dual cluster system. We will have some idle load balance which are triggered during tick. Unfortunately, the tick is also used to queue background work so we can reach the situation where short work has been queued on a CPU which already runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle CPU) and will try to pull the waiting task on the idle CPU. The waiting task is a worker thread that is pinned on a CPU so an imbalance due to pinned task is detected and the imbalance flag is set. Then, we will not be able to clear the flag because we have at most 1 task on each CPU but the imbalance flag will trig to useless active load balance between the idle CPU and the busy CPU.
Why do we do active balancing today when there is at-most 1 task on the busiest cpu? Shouldn't we be skipping load balancing altogether? If we do active balancing when the number of tasks = 1, it will lead to a ping pong right?
That's the purpose of the patch to prevent this useless active load balance. When the imbalance flag is set, an active load balance is triggered whatever the load balance is because of pinned tasks that prevents a balance state.
No I mean this:
sched:Do not continue load balancing when the busiest cpu has one running task
But you can have situations where you have to migrate the task even if the busiest CPU has only 1 task. The use of the imbalance flag is one example. You can also be in a situation where the busiest group has too much load compared to the local group, and the busiest CPU, even with 1 task (at this instant), has been selected to move tasks away from the busiest group.
Vincent
From: Preeti U Murthy preeti@linux.vnet.ibm.com
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9617b7..b175333 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6626,6 +6626,8 @@ more_balance:
 			}
 			goto out_balanced;
 		}
+	} else {
+		goto out;
 	}

 	if (!ld_moved) {
}
Regards Preeti U Murthy
Vincent
Regards Preeti U Murthy
I have tried to understand the meaning of the condition:
 (this_load <= load &&
  this_load + target_load(prev_cpu, idx) <= tl_per_task)
but I failed to find a use case that can take advantage of it, and I haven't found a description of it in the previous commits' logs. Furthermore, the comment above the condition refers to the task_hot function that was used before being replaced by the current condition:
/*
 * This domain has SD_WAKE_AFFINE and
 * p is cache cold in this domain, and
 * there is no bad imbalance.
 */
If we look more closely at the condition below:
 this_load + target_load(prev_cpu, idx) <= tl_per_task
When sync is clear, we have:
 tl_per_task = runnable_load_avg / nr_running
 this_load   = max(runnable_load_avg, cpuload[idx])
 target_load = max(runnable_load_avg', cpuload'[idx])
Matching the condition implies that runnable_load_avg' == 0 and nr_running <= 1. This in turn implies that runnable_load_avg == 0 too, because of the condition this_load <= load; but if this_load is null, balanced is already set and the test is redundant.
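Spelled out (with sync clear and nr_running >= 1 on this cpu; when nr_running == 0, cpu_avg_load_per_task() returns 0 and the same conclusion holds):

 tl_per_task = runnable_load_avg / nr_running
            <= runnable_load_avg
            <= max(runnable_load_avg, cpuload[idx]) = this_load

so this_load + target_load(prev_cpu, idx) <= tl_per_task can only hold if target_load(prev_cpu, idx) == 0, which forces load (= source_load(prev_cpu, idx)) == 0, and then this_load <= load forces this_load == 0, a case that has already set balanced = true.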
If sync is set, it's not as straightforward as above (especially if cgroups are involved), but the policy should be similar, as we have removed a task that's going to sleep in order to get more accurate load and this_load values.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 25 +++++++------------------
 1 file changed, 7 insertions(+), 18 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9587ed1..30240ab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4238,7 +4238,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
 	int idx, this_cpu, prev_cpu;
-	unsigned long tl_per_task;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
@@ -4296,32 +4295,22 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		balanced = this_eff_load <= prev_eff_load;
 	} else
 		balanced = true;
+	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
+
+	if (!balanced)
+		return 0;

 	/*
 	 * If the currently running task will sleep within
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
-	if (sync && balanced)
+	if (sync)
 		return 1;

-	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
-	tl_per_task = cpu_avg_load_per_task(this_cpu);
-
-	if (balanced ||
-	    (this_load <= load &&
-	     this_load + target_load(prev_cpu, idx) <= tl_per_task)) {
-		/*
-		 * This domain has SD_WAKE_AFFINE and
-		 * p is cache cold in this domain, and
-		 * there is no bad imbalance.
-		 */
-		schedstat_inc(sd, ttwu_move_affine);
-		schedstat_inc(p, se.statistics.nr_wakeups_affine);
+	schedstat_inc(sd, ttwu_move_affine);
+	schedstat_inc(p, se.statistics.nr_wakeups_affine);

-		return 1;
-	}
-	return 0;
+	return 1;
 }
/*
On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
I have tried to understand the meaning of the condition : (this_load <= load && this_load + target_load(prev_cpu, idx) <= tl_per_task) but i failed to find a use case that can take advantage of it and i haven't found description of it in the previous commits' log.
commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48
int try_to_wake_up():
in this function the value SCHED_LOAD_BALANCE is used to represent the load contribution of a single task in various calculations in the code that decides which CPU to put the waking task on. While this would be a valid on a system where the nice values for the runnable tasks were distributed evenly around zero it will lead to anomalous load balancing if the distribution is skewed in either direction. To overcome this problem SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task or by the average load_weight per task for the queue in question (as appropriate).
 	if ((tl <= load &&
-		tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
-		100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
+		tl + target_load(cpu, idx) <= tl_per_task) ||
+		100*(tl + p->load_weight) <= imbalance*load) {
commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
+	if ((tl <= load &&
+		tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+		100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
So back when the code got introduced, it read:
  target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
  target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
So while the first line makes some sense, the second line is still somewhat challenging.
I read the second line something like: if there's less than one full task running on the combined cpus.
Now for idx==0 this is hard, because even when sync=1 you can only make it true if both cpus are completely idle, in which case you really want to move to the waking cpu I suppose.
One task running will have it == SCHED_LOAD_SCALE.
But for idx>0 this can trigger in all kinds of situations of light load.
On 27 May 2014 14:48, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
I have tried to understand the meaning of the condition : (this_load <= load && this_load + target_load(prev_cpu, idx) <= tl_per_task) but i failed to find a use case that can take advantage of it and i haven't found description of it in the previous commits' log.
commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48
int try_to_wake_up(): in this function the value SCHED_LOAD_BALANCE is used to represent the load contribution of a single task in various calculations in the code that decides which CPU to put the waking task on. While this would be a valid on a system where the nice values for the runnable tasks were distributed evenly around zero it will lead to anomalous load balancing if the distribution is skewed in either direction. To overcome this problem SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task or by the average load_weight per task for the queue in question (as appropriate).

 	if ((tl <= load &&
-		tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
-		100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
+		tl + target_load(cpu, idx) <= tl_per_task) ||
+		100*(tl + p->load_weight) <= imbalance*load) {
The oldest patch I had found was https://lkml.org/lkml/2005/2/24/34, where task_hot had been replaced by:

+	if ((tl <= load &&
+		tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+		100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
but as explained, I haven't found a clear explanation of this condition.
commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
if ((tl <= load &&
tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
So back when the code got introduced, it read:
target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) && target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
So while the first line makes some sense, the second line is still somewhat challenging.
I read the second line something like: if there's less than one full task running on the combined cpus.
ok. your explanation makes sense
Now for idx==0 this is hard, because even when sync=1 you can only make it true if both cpus are completely idle, in which case you really want to move to the waking cpu I suppose.
This use case is already taken into account by
if (this_load > 0) .. else balance = true
One task running will have it == SCHED_LOAD_SCALE.
But for idx>0 this can trigger in all kinds of situations of light load.
target_load is the max between the load for idx == 0 and the load for the selected idx, so we have even less chance to match the condition that both cpus are completely idle.
On Tue, May 27, 2014 at 05:19:02PM +0200, Vincent Guittot wrote:
On 27 May 2014 14:48, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
I have tried to understand the meaning of the condition : (this_load <= load && this_load + target_load(prev_cpu, idx) <= tl_per_task) but i failed to find a use case that can take advantage of it and i haven't found description of it in the previous commits' log.
commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48
int try_to_wake_up(): in this function the value SCHED_LOAD_BALANCE is used to represent the load contribution of a single task in various calculations in the code that decides which CPU to put the waking task on. While this would be a valid on a system where the nice values for the runnable tasks were distributed evenly around zero it will lead to anomalous load balancing if the distribution is skewed in either direction. To overcome this problem SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task or by the average load_weight per task for the queue in question (as appropriate).

 	if ((tl <= load &&
-		tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
-		100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
+		tl + target_load(cpu, idx) <= tl_per_task) ||
+		100*(tl + p->load_weight) <= imbalance*load) {
The oldest patch i had found was: https://lkml.org/lkml/2005/2/24/34 where task_hot had been replaced by
+	if ((tl <= load &&
+		tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+		100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
but as explained, i haven't found a clear explanation of this condition
Yeah, that's the commit I had below; but I suppose we could ask Nick if we really want, I've heard he still replies to email, even though he's locked up in a basement somewhere :-)
commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
if ((tl <= load &&
tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
So back when the code got introduced, it read:
target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) && target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
So while the first line makes some sense, the second line is still somewhat challenging.
I read the second line something like: if there's less than one full task running on the combined cpus.
ok. your explanation makes sense
Maybe, its still slightly weird :-)
Now for idx==0 this is hard, because even when sync=1 you can only make it true if both cpus are completely idle, in which case you really want to move to the waking cpu I suppose.
This use case is already taken into account by
if (this_load > 0) .. else balance = true
Agreed.
One task running will have it == SCHED_LOAD_SCALE.
But for idx>0 this can trigger in all kinds of situations of light load.
target_load is the max between load for idx == 0 and load for the selected idx so we have even less chance to match the condition : both cpu are completely idle
Ah, yes, I forgot to look at the target_load() thing and missed the max, yes that all makes it entirely less likely.
On 27 May 2014 17:39, Peter Zijlstra peterz@infradead.org wrote:
On Tue, May 27, 2014 at 05:19:02PM +0200, Vincent Guittot wrote:
On 27 May 2014 14:48, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
I have tried to understand the meaning of the condition : (this_load <= load && this_load + target_load(prev_cpu, idx) <= tl_per_task) but i failed to find a use case that can take advantage of it and i haven't found description of it in the previous commits' log.
commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48
int try_to_wake_up(): in this function the value SCHED_LOAD_BALANCE is used to represent the load contribution of a single task in various calculations in the code that decides which CPU to put the waking task on. While this would be a valid on a system where the nice values for the runnable tasks were distributed evenly around zero it will lead to anomalous load balancing if the distribution is skewed in either direction. To overcome this problem SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task or by the average load_weight per task for the queue in question (as appropriate).

 	if ((tl <= load &&
-		tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
-		100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
+		tl + target_load(cpu, idx) <= tl_per_task) ||
+		100*(tl + p->load_weight) <= imbalance*load) {
The oldest patch i had found was: https://lkml.org/lkml/2005/2/24/34 where task_hot had been replaced by
+	if ((tl <= load &&
+		tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+		100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
but as explained, i haven't found a clear explanation of this condition
Yeah, that's the commit I had below; but I suppose we could ask Nick if we really want, I've heard he still replies to email, even though he's locked up in a basement somewhere :-)
ok, I have added him in the list
Nick,
While doing some rework on the wake affine part of the scheduler, I failed to catch the use case that takes advantage of a condition that you added a while ago with commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f.
Could you help us clarify the first 2 lines of the test that you added?
+	if ((tl <= load &&
+		tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+		100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
Regards, Vincent
commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
if ((tl <= load &&
tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
So back when the code got introduced, it read:
target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) && target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
So while the first line makes some sense, the second line is still somewhat challenging.
I read the second line something like: if there's less than one full task running on the combined cpus.
ok. your explanation makes sense
Maybe, its still slightly weird :-)
Now for idx==0 this is hard, because even when sync=1 you can only make it true if both cpus are completely idle, in which case you really want to move to the waking cpu I suppose.
This use case is already taken into account by
if (this_load > 0) .. else balance = true
Agreed.
One task running will have it == SCHED_LOAD_SCALE.
But for idx>0 this can trigger in all kinds of situations of light load.
target_load is the max between load for idx == 0 and load for the selected idx so we have even less chance to match the condition : both cpu are completely idle
Ah, yes, I forgot to look at the target_load() thing and missed the max, yes that all makes it entirely less likely.
Using another email address for Nick
On 27 May 2014 18:14, Vincent Guittot vincent.guittot@linaro.org wrote:
On 27 May 2014 17:39, Peter Zijlstra peterz@infradead.org wrote:
On Tue, May 27, 2014 at 05:19:02PM +0200, Vincent Guittot wrote:
On 27 May 2014 14:48, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
I have tried to understand the meaning of the condition : (this_load <= load && this_load + target_load(prev_cpu, idx) <= tl_per_task) but i failed to find a use case that can take advantage of it and i haven't found description of it in the previous commits' log.
commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48
int try_to_wake_up(): in this function the value SCHED_LOAD_BALANCE is used to represent the load contribution of a single task in various calculations in the code that decides which CPU to put the waking task on. While this would be a valid on a system where the nice values for the runnable tasks were distributed evenly around zero it will lead to anomalous load balancing if the distribution is skewed in either direction. To overcome this problem SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task or by the average load_weight per task for the queue in question (as appropriate).

 	if ((tl <= load &&
-		tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
-		100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
+		tl + target_load(cpu, idx) <= tl_per_task) ||
+		100*(tl + p->load_weight) <= imbalance*load) {
The oldest patch i had found was: https://lkml.org/lkml/2005/2/24/34 where task_hot had been replaced by
+	if ((tl <= load &&
+		tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+		100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
but as explained, i haven't found a clear explanation of this condition
Yeah, that's the commit I had below; but I suppose we could ask Nick if we really want, I've heard he still replies to email, even though he's locked up in a basement somewhere :-)
ok, I have added him in the list
Nick,
While doing some rework on the wake affine part of the scheduler, i failed to catch the use case that takes advantage of a condition that you added some while ago with the commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
Could you help us to clarify the 2 first lines of the test that you added ? + if ((tl <= load && + tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) || + 100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
Regards, Vincent
commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
if ((tl <= load &&
tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
So back when the code got introduced, it read:
target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) && target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
So while the first line makes some sense, the second line is still somewhat challenging.
I read the second line something like: if there's less than one full task running on the combined cpus.
ok. your explanation makes sense
Maybe, its still slightly weird :-)
Now for idx==0 this is hard, because even when sync=1 you can only make it true if both cpus are completely idle, in which case you really want to move to the waking cpu I suppose.
This use case is already taken into account by
if (this_load > 0) .. else balance = true
Agreed.
One task running will have it == SCHED_LOAD_SCALE.
But for idx>0 this can trigger in all kinds of situations of light load.
target_load is the max between load for idx == 0 and load for the selected idx so we have even less chance to match the condition : both cpu are completely idle
Ah, yes, I forgot to look at the target_load() thing and missed the max, yes that all makes it entirely less likely.
Hi Vincent & Peter,
On 28/05/14 07:49, Vincent Guittot wrote: [...]
Nick,
While doing some rework on the wake affine part of the scheduler, i failed to catch the use case that takes advantage of a condition that you added some while ago with the commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
Could you help us to clarify the 2 first lines of the test that you added ?
if ((tl <= load &&
tl + target_load(cpu, idx) <=
SCHED_LOAD_SCALE) ||
100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
Regards, Vincent
commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
if ((tl <= load &&
tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
So back when the code got introduced, it read:
  target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
  target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
Shouldn't this be
  target_load(this_cpu, idx) - sync*SCHED_LOAD_SCALE <= source_load(prev_cpu, idx) &&
  target_load(this_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(prev_cpu, idx) <= SCHED_LOAD_SCALE
"[PATCH] sched: implement smpnice" (2dd73a4f09beacadde827a032cf15fd8b1fa3d48) mentions that SCHED_LOAD_BALANCE (IMHO, should be SCHED_LOAD_SCALE) represents the load contribution of a single task. So I read the second part as if the sum of the load of this_cpu and prev_cpu is smaller or equal to the (maximal) load contribution (maximal possible effect) of a single task.
There is even a comment in "[PATCH] sched: tweak affine wakeups" (a3f21bce1fefdf92a4d1705e888d390b10f3ac6f) in try_to_wake_up() when SCHED_LOAD_SCALE gets subtracted from tl = this_load = target_load(this_cpu, idx):
+	 * If sync wakeup then subtract the (maximum possible)
+	 * effect of the currently running task from the load
+	 * of the current CPU:
"[PATCH] sched: implement smpnice" then replaces SCHED_LOAD_SCALE w/
+static inline unsigned long cpu_avg_load_per_task(int cpu)
+{
+	runqueue_t *rq = cpu_rq(cpu);
+	unsigned long n = rq->nr_running;
+
+	return n ? rq->raw_weighted_load / n : SCHED_LOAD_SCALE;
-- Dietmar
So while the first line makes some sense, the second line is still somewhat challenging.
I read the second line something like: if there's less than one full task running on the combined cpus.
ok. your explanation makes sense
Maybe, its still slightly weird :-)
[...]
On 28 May 2014 17:09, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
Hi Vincent & Peter,
On 28/05/14 07:49, Vincent Guittot wrote: [...]
Nick,
While doing some rework on the wake affine part of the scheduler, i failed to catch the use case that takes advantage of a condition that you added some while ago with the commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
Could you help us to clarify the 2 first lines of the test that you added ?
if ((tl <= load &&
tl + target_load(cpu, idx) <=
SCHED_LOAD_SCALE) ||
100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
Regards, Vincent
commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
if ((tl <= load &&
tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
So back when the code got introduced, it read:
target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) && target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
Shouldn't this be
target_load(this_cpu, idx) - sync*SCHED_LOAD_SCALE <= source_load(prev_cpu, idx) && target_load(this_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(prev_cpu, idx) <= SCHED_LOAD_SCALE
yes, there was a typo in Peter's explanation
"[PATCH] sched: implement smpnice" (2dd73a4f09beacadde827a032cf15fd8b1fa3d48) mentions that SCHED_LOAD_BALANCE (IMHO, should be SCHED_LOAD_SCALE) represents the load contribution of a single task. So I read the second part as if the sum of the load of this_cpu and prev_cpu is smaller or equal to the (maximal) load contribution (maximal possible effect) of a single task.
There is even a comment in "[PATCH] sched: tweak affine wakeups" (a3f21bce1fefdf92a4d1705e888d390b10f3ac6f) in try_to_wake_up() when SCHED_LOAD_SCALE gets subtracted from tl = this_load = target_load(this_cpu, idx):
+	 * If sync wakeup then subtract the (maximum possible)
+	 * effect of the currently running task from the load
+	 * of the current CPU:
"[PATCH] sched: implement smpnice" then replaces SCHED_LOAD_SCALE w/
+static inline unsigned long cpu_avg_load_per_task(int cpu)
+{
+	runqueue_t *rq = cpu_rq(cpu);
+	unsigned long n = rq->nr_running;
+
+	return n ? rq->raw_weighted_load / n : SCHED_LOAD_SCALE;
-- Dietmar
So while the first line makes some sense, the second line is still somewhat challenging.
I read the second line something like: if there's less than one full task running on the combined cpus.
ok. your explanation makes sense
Maybe, its still slightly weird :-)
[...]
On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
If sync is set, it's not as straight forward as above (especially if cgroup are involved)
avg load with cgroups is 'interesting' alright.
On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9587ed1..30240ab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4238,7 +4238,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
 	int idx, this_cpu, prev_cpu;
-	unsigned long tl_per_task;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
@@ -4296,32 +4295,22 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		balanced = this_eff_load <= prev_eff_load;
 	} else
 		balanced = true;
+	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
+
+	if (!balanced)
+		return 0;
 	/*
 	 * If the currently running task will sleep within
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
+	if (sync)
 		return 1;
+	schedstat_inc(sd, ttwu_move_affine);
+	schedstat_inc(p, se.statistics.nr_wakeups_affine);
+	return 1;
 }
So I'm not usually one for schedstat nitpicking, but should we fix it in the sync case? That is, for sync we return 1 but do not inc nr_wakeups_affine, even though it's going to be an affine wakeup.
On 27 May 2014 15:45, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9587ed1..30240ab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4238,7 +4238,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
 	int idx, this_cpu, prev_cpu;
-	unsigned long tl_per_task;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
@@ -4296,32 +4295,22 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		balanced = this_eff_load <= prev_eff_load;
 	} else
 		balanced = true;
+	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
+
+	if (!balanced)
+		return 0;
 	/*
 	 * If the currently running task will sleep within
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
+	if (sync)
 		return 1;
+	schedstat_inc(sd, ttwu_move_affine);
+	schedstat_inc(p, se.statistics.nr_wakeups_affine);
+	return 1;
 }
So I'm not usually one for schedstat nitpicking, but should we fix it in the sync case? That is, for sync we return 1 but do no inc nr_wakeups_affine, even though its going to be an affine wakeup.
OK, I'm going to move the schedstat update to the right place.
The computation of avg_load and avg_load_per_task should only take into account the number of cfs tasks. The non-cfs tasks are already taken into account by decreasing the cpu's capacity (cpu_power), and they will be tracked in the CPU's utilization (group_activity) added by the next patches.
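As an illustration (made-up numbers): a CPU running one cfs task with a runnable_load_avg of 1024 plus one rt task has rq->nr_running == 2 but cfs.h_nr_running == 1; the current code therefore reports an avg_load_per_task of 512 instead of 1024 and counts the rt task in sum_nr_running, even though the time consumed by the rt task is already subtracted from cpu_power.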
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 30240ab..6a84114 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4048,7 +4048,7 @@ static unsigned long power_of(int cpu)
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
+	unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running);
 	unsigned long load_avg = rq->cfs.runnable_load_avg;

 	if (nr_running)
@@ -5868,7 +5868,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		load = source_load(i, load_idx);

 		sgs->group_load += load;
-		sgs->sum_nr_running += rq->nr_running;
+		sgs->sum_nr_running += rq->cfs.h_nr_running;
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
 		sgs->nr_preferred_running += rq->nr_preferred_running;
On Fri, May 23, 2014 at 05:52:57PM +0200, Vincent Guittot wrote:
The computation of avg_load and avg_load_per_task should only takes into account the number of cfs tasks. The non cfs task are already taken into account by decreasing the cpu's capacity (cpu_power) and they will be tracked in the CPU's utilization (group_activity) of the next patches
Ah yes, very good. I don't think we had h_nr_running back when I frobbed this last.
power_orig is only changed for systems with an SMT sched_domain level, in order to reflect the lower capacity of their CPUs. Heterogeneous systems also have to reflect an original capacity that is different from the default value.
Create a more generic function, arch_scale_cpu_power, that can also be used by non-SMT platforms to set power_orig.
The weak implementation of arch_scale_cpu_power keeps the previous SMT behavior in order to preserve backward compatibility in the use of power_orig.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a84114..6dc05f7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5618,6 +5618,20 @@ unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
 	return default_scale_smt_power(sd, cpu);
 }

+unsigned long __weak arch_scale_cpu_power(struct sched_domain *sd, int cpu)
+{
+	unsigned long weight = sd->span_weight;
+
+	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
+		if (sched_feat(ARCH_POWER))
+			return arch_scale_smt_power(sd, cpu);
+		else
+			return default_scale_smt_power(sd, cpu);
+	}
+
+	return SCHED_POWER_SCALE;
+}
+
 static unsigned long scale_rt_power(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -5654,18 +5668,12 @@ static unsigned long scale_rt_power(int cpu)

 static void update_cpu_power(struct sched_domain *sd, int cpu)
 {
-	unsigned long weight = sd->span_weight;
 	unsigned long power = SCHED_POWER_SCALE;
 	struct sched_group *sdg = sd->groups;

-	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
-		if (sched_feat(ARCH_POWER))
-			power *= arch_scale_smt_power(sd, cpu);
-		else
-			power *= default_scale_smt_power(sd, cpu);
+	power *= arch_scale_cpu_power(sd, cpu);

-		power >>= SCHED_POWER_SHIFT;
-	}
+	power >>= SCHED_POWER_SHIFT;

 	sdg->sgp->power_orig = power;
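(The ARM patch of this series is not included in this excerpt. Purely as an illustration of how the new hook can be used, an architecture that knows the relative capacity of each CPU, for instance from device-tree efficiency data, could override the weak default along these lines; the per-cpu cpu_scale variable below is an assumption of this sketch, not code from the series:)

static DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_POWER_SCALE;

unsigned long arch_scale_cpu_power(struct sched_domain *sd, int cpu)
{
	/* return the original (max) capacity of this CPU, set at boot */
	return per_cpu(cpu_scale, cpu);
}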
On 23/05/14 16:52, Vincent Guittot wrote:
power_orig is only changed for system with a SMT sched_domain level in order to reflect the lower capacity of CPUs. Heterogenous system also have to reflect an original capacity that is different from the default value.
Create a more generic function arch_scale_cpu_power that can be also used by non SMT platform to set power_orig.
The weak behavior of arch_scale_cpu_power is the previous SMT one in order to keep backward compatibility in the use of power_orig.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
As you know, besides uarch scaled cpu power for HMP, freq scaled cpu power is important for energy-aware scheduling to achieve freq scale invariance for task load.
I know that your patch-set is not about introducing freq scaled cpu power, but we were discussing how this can be achieved w/ your patch-set in place, so maybe you can share your opinion regarding the easiest way to achieve freq scale invariance with us?
(1) We assume that the current way (update_cpu_power() calls arch_scale_freq_power() to get the avg power(freq) over the time period since the last call to arch_scale_freq_power()) is suitable for us. Do you have another opinion here?
(2) Is the current layout of update_cpu_power() adequate for this, where we scale power_orig related to freq and then related to rt/(irq):
power_orig = scale_cpu(SCHED_POWER_SCALE)
power      = scale_rt(scale_freq(power_orig))
or do we need an extra power_freq data member on the rq and do:
power_orig = scale_cpu(SCHED_POWER_SCALE)
power_freq = scale_freq(power_orig)
power      = scale_rt(power_orig)
In other words, do we consider rt/(irq) pressure when calculating freq scale invariant task load or not?
Thanks,
-- Dietmar [...]
On Fri, May 30, 2014 at 03:04:32PM +0100, Dietmar Eggemann wrote:
On 23/05/14 16:52, Vincent Guittot wrote:
power_orig is only changed for system with a SMT sched_domain level in order to reflect the lower capacity of CPUs. Heterogenous system also have to reflect an original capacity that is different from the default value.
Create a more generic function arch_scale_cpu_power that can be also used by non SMT platform to set power_orig.
The weak behavior of arch_scale_cpu_power is the previous SMT one in order to keep backward compatibility in the use of power_orig.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
As you know, besides uarch scaled cpu power for HMP, freq scaled cpu power is important for energy-aware scheduling to achieve freq scale invariance for task load.
I know that your patch-set is not about introducing freq scaled cpu power, but we were discussing how this can be achieved w/ your patch-set in place, so maybe you can share your opinion regarding the easiest way to achieve freq scale invariance with us?
(1) We assume that the current way (update_cpu_power() calls arch_scale_freq_power() to get the avg power(freq) over the time period since the last call to arch_scale_freq_power()) is suitable for us. Do you have another opinion here?
(2) Is the current layout of update_cpu_power() adequate for this, where we scale power_orig related to freq and then related to rt/(irq):
power_orig = scale_cpu(SCHED_POWER_SCALE) power = scale_rt(scale_freq(power_orig))
or do we need an extra power_freq data member on the rq and do:
power_orig = scale_cpu(SCHED_POWER_SCALE) power_freq = scale_freq(power_orig)) power = scale_rt(power_orig))
In other words, do we consider rt/(irq) pressure when calculating freq scale invariant task load or not?
I don't think you should. The work done depends on the frequency, not on the other tasks present on the cpu. The same is true for an over-utilized cpu: a task will run less than the desired amount of time, which is no different from RT/irq preempting the task and taking its time.
On 30 May 2014 16:04, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 23/05/14 16:52, Vincent Guittot wrote:
power_orig is only changed for system with a SMT sched_domain level in order to reflect the lower capacity of CPUs. Heterogenous system also have to reflect an original capacity that is different from the default value.
Create a more generic function arch_scale_cpu_power that can be also used by non SMT platform to set power_orig.
The weak behavior of arch_scale_cpu_power is the previous SMT one in order to keep backward compatibility in the use of power_orig.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
As you know, besides uarch scaled cpu power for HMP, freq scaled cpu power is important for energy-aware scheduling to achieve freq scale invariance for task load.
I know that your patch-set is not about introducing freq scaled cpu power, but we were discussing how this can be achieved w/ your patch-set in place, so maybe you can share your opinion regarding the easiest way to achieve freq scale invariance with us?
(1) We assume that the current way (update_cpu_power() calls arch_scale_freq_power() to get the avg power(freq) over the time period since the last call to arch_scale_freq_power()) is suitable for us. Do you have another opinion here?
Using power (or power_freq as you mentioned below) is probably the easiest and most straightforward solution. You can use it to scale each element when updating the entity's runnable statistics. Nevertheless, I see 2 potential issues:
- is power updated often enough to correctly follow the frequency scaling? We need to compare the power update frequency with the runnable_avg_sum variation speed and the rate at which we will change the CPU's frequency.
- the max value of runnable_avg_sum will also be scaled, so a task running on a CPU with less capacity could be seen as a "low" load even if it's an always-running task. So we need to find a way to reach the max value in such a situation.
(2) Is the current layout of update_cpu_power() adequate for this, where we scale power_orig related to freq and then related to rt/(irq):
power_orig = scale_cpu(SCHED_POWER_SCALE) power = scale_rt(scale_freq(power_orig))
or do we need an extra power_freq data member on the rq and do:
power_orig = scale_cpu(SCHED_POWER_SCALE) power_freq = scale_freq(power_orig)) power = scale_rt(power_orig))
do you really mean power = scale_rt(power_orig) or power=scale_rt(power_freq) ?
In other words, do we consider rt/(irq) pressure when calculating freq scale invariant task load or not?
we should take power_freq, which implies a new field
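To make that concrete, here is a purely illustrative sketch of what update_cpu_power() could look like with such a new field (the power_freq member below is the assumed new field, not existing code):

static void update_cpu_power(struct sched_domain *sd, int cpu)
{
	unsigned long power = SCHED_POWER_SCALE;
	struct sched_group *sdg = sd->groups;

	power *= arch_scale_cpu_power(sd, cpu);		/* uArch capacity */
	power >>= SCHED_POWER_SHIFT;
	sdg->sgp->power_orig = power;

	power *= arch_scale_freq_power(sd, cpu);	/* current frequency */
	power >>= SCHED_POWER_SHIFT;
	sdg->sgp->power_freq = power;			/* assumed new field */

	power *= scale_rt_power(cpu);			/* rt/irq pressure */
	power >>= SCHED_POWER_SHIFT;
	if (!power)
		power = 1;

	cpu_rq(cpu)->cpu_power = power;
	sdg->sgp->power = power;
}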
Thanks, Vincent
Thanks,
-- Dietmar [...]
[...]
(1) We assume that the current way (update_cpu_power() calls arch_scale_freq_power() to get the avg power(freq) over the time period since the last call to arch_scale_freq_power()) is suitable for us. Do you have another opinion here?
Using power (or power_freq as you mentioned below) is probably the easiest and more straight forward solution. You can use it to scale each element when updating entity runnable. Nevertheless, I see to 2 potential issues:
- is power updated often enough to correctly follow the frequency
scaling ? we need to compare power update frequency with runnable_avg_sum variation speed and the rate at which we will change the CPU's frequency.
- the max value of runnable_avg_sum will be also scaled so a task
running on a CPU with less capacity could be seen as a "low" load even if it's an always running tasks. So we need to find a way to reach the max value for such situation
I think I mixed two problems together here:
Firstly, we need to scale cpu power in update_cpu_power() regarding uArch, frequency and rt/irq pressure. Here the freq related value we get back from arch_scale_freq_power(..., cpu) could be an instantaneous value (curr_freq(cpu)/max_freq(cpu)).
Secondly, to be able to scale the runnable avg sum of a sched entity (se->avg->runnable_avg_sum), we would preferably have a coefficient representing the uArch diffs (cpu_power_orig(cpu) / cpu_power_orig(most powerful cpu in the system)) and another coefficient representing the frequency (avg freq over 'now - sa->last_runnable_update' on this cpu / max_freq(cpu)). This value would have to be retrieved from the arch in __update_entity_runnable_avg().
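A minimal sketch of how that second coefficient could be applied in __update_entity_runnable_avg() (arch_scale_freq_capacity() is an assumed helper returning the average frequency over the elapsed window expressed against SCHED_POWER_SCALE; it does not exist today):

static inline u64 scale_freq_delta(int cpu, u64 delta)
{
	unsigned long scale = arch_scale_freq_capacity(cpu);	/* assumed helper */

	/* scale the elapsed running time by avg_freq/max_freq */
	return (delta * scale) >> SCHED_POWER_SHIFT;
}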
(2) Is the current layout of update_cpu_power() adequate for this, where we scale power_orig related to freq and then related to rt/(irq):
power_orig = scale_cpu(SCHED_POWER_SCALE) power = scale_rt(scale_freq(power_orig))
or do we need an extra power_freq data member on the rq and do:
power_orig = scale_cpu(SCHED_POWER_SCALE) power_freq = scale_freq(power_orig)) power = scale_rt(power_orig))
do you really mean power = scale_rt(power_orig) or power=scale_rt(power_freq) ?
No, I also think that power=scale_rt(power_freq) is correct.
In other words, do we consider rt/(irq) pressure when calculating freq scale invariant task load or not?
we should take power_freq which implies a new field
[...]
On 4 June 2014 11:42, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
[...]
(1) We assume that the current way (update_cpu_power() calls arch_scale_freq_power() to get the avg power(freq) over the time period since the last call to arch_scale_freq_power()) is suitable for us. Do you have another opinion here?
Using power (or power_freq as you mentioned below) is probably the easiest and more straight forward solution. You can use it to scale each element when updating entity runnable. Nevertheless, I see to 2 potential issues:
- is power updated often enough to correctly follow the frequency
scaling ? we need to compare power update frequency with runnable_avg_sum variation speed and the rate at which we will change the CPU's frequency.
- the max value of runnable_avg_sum will be also scaled so a task
running on a CPU with less capacity could be seen as a "low" load even if it's an always running tasks. So we need to find a way to reach the max value for such situation
I think I mixed two problems together here:
Firstly, we need to scale cpu power in update_cpu_power() regarding uArch, frequency and rt/irq pressure. Here the freq related value we get back from arch_scale_freq_power(..., cpu) could be an instantaneous value (curr_freq(cpu)/max_freq(cpu)).
Secondly, to be able to scale the runnable avg sum of a sched entity (se->avg->runnable_avg_sum), we preferable have a coefficient representing uArch diffs (cpu_power_orig(cpu)/cpu_power_orig(most powerful cpu in the system) and another coefficient (avg freq over 'now
AFAICT, the coefficient representing the uArch diffs is already taken into account in power_freq thanks to scale_cpu, isn't it?
- sa->last_runnable_update'(cpu)/max_freq(cpu). This value would have to
be retrieved from the arch in __update_entity_runnable_avg().
(2) Is the current layout of update_cpu_power() adequate for this, where we scale power_orig related to freq and then related to rt/(irq):
power_orig = scale_cpu(SCHED_POWER_SCALE) power = scale_rt(scale_freq(power_orig))
or do we need an extra power_freq data member on the rq and do:
power_orig = scale_cpu(SCHED_POWER_SCALE) power_freq = scale_freq(power_orig)) power = scale_rt(power_orig))
do you really mean power = scale_rt(power_orig) or power=scale_rt(power_freq) ?
No, I also think that power=scale_rt(power_freq) is correct.
In other words, do we consider rt/(irq) pressure when calculating freq scale invariant task load or not?
we should take power_freq which implies a new field
[...]
[...]
Firstly, we need to scale cpu power in update_cpu_power() regarding uArch, frequency and rt/irq pressure. Here the freq related value we get back from arch_scale_freq_power(..., cpu) could be an instantaneous value (curr_freq(cpu)/max_freq(cpu)).
Secondly, to be able to scale the runnable avg sum of a sched entity (se->avg->runnable_avg_sum), we preferable have a coefficient representing uArch diffs (cpu_power_orig(cpu)/cpu_power_orig(most powerful cpu in the system) and another coefficient (avg freq over 'now
AFAICT, the coefficient representing uArch diffs is already taken into account into power_freq thanks to scale_cpu, isn't it ?
True, but I can't see how the current signature of arch_scale_cpu_power() and arch_scale_freq_power() fit into this uArch and freq invariant updating of se->avg->runnable_avg_sum business.
- sa->last_runnable_update'(cpu)/max_freq(cpu). This value would have to
be retrieved from the arch in __update_entity_runnable_avg().
[...]
On 5 June 2014 10:59, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
[...]
Firstly, we need to scale cpu power in update_cpu_power() regarding uArch, frequency and rt/irq pressure. Here the freq related value we get back from arch_scale_freq_power(..., cpu) could be an instantaneous value (curr_freq(cpu)/max_freq(cpu)).
Secondly, to be able to scale the runnable avg sum of a sched entity (se->avg->runnable_avg_sum), we preferable have a coefficient representing uArch diffs (cpu_power_orig(cpu)/cpu_power_orig(most powerful cpu in the system) and another coefficient (avg freq over 'now
AFAICT, the coefficient representing uArch diffs is already taken into account into power_freq thanks to scale_cpu, isn't it ?
True, but I can't see how the current signature of arch_scale_cpu_power() and arch_scale_freq_power() fit into this uArch and freq invariant updating of se->avg->runnable_avg_sum business.
My 1st assumption is that the update of rq->cpu_power (or rq->cpu_power_freq as we discussed earlier) is fast enough compared to the frequency scaling, so that it can be used to scale the load without major error. In this case, you can use the cpu_power_orig and cpu_power fields. The update period of cpu_power is equal to balance_interval, which is initialized with the weight of the sched_domain (less than or equal to 4ms for most ARM platforms). Most ARM platforms use 100 HZ jiffies, so the update period will be aligned to 10ms (1 tick).
If this update is not fast enough, the second possibility could be to directly access the current frequency (or something that represents it). This might be possible once cpufreq has been consolidated into the scheduler, similarly to what is going on with cpuidle.
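As a very rough sketch of that first option (cpu_power_orig is the per-rq field added by this series; cpu_power_freq stands for the hypothetical freq-scaled field discussed earlier in this thread):

static inline u64 scale_load_by_freq(struct rq *rq, u64 delta)
{
	/* ratio of the freq-scaled capacity to the original (uArch) capacity */
	return div_u64(delta * rq->cpu_power_freq, rq->cpu_power_orig);
}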
Vincent
- sa->last_runnable_update'(cpu)/max_freq(cpu). This value would have to
be retrieved from the arch in __update_entity_runnable_avg().
[...]
On Fri, May 23, 2014 at 04:52:58PM +0100, Vincent Guittot wrote:
power_orig is only changed for system with a SMT sched_domain level in order to reflect the lower capacity of CPUs. Heterogenous system also have to reflect an original capacity that is different from the default value.
Create a more generic function arch_scale_cpu_power that can be also used by non SMT platform to set power_orig.
I did a quick test of the patch set with adjusting cpu_power on big.LITTLE (ARM TC2) to reflect the different compute capacities of the A15s and A7s. I ran the sysbench cpu benchmark with 5 threads with and without the patches applied, but with non-default cpu_powers.
I didn't see any difference in the load-balance. Three tasks ended up on the two A15s and two tasks ended up on two of the three A7s leaving one unused in both cases.
Using default cpu_power I get one task on each of the five cpus (best throughput). Unless I messed something up, it seems that setting cpu_power doesn't give me the best throughput with these patches applied.
Have you done any tests on big.LITTLE?
Morten
On 3 June 2014 15:22, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Fri, May 23, 2014 at 04:52:58PM +0100, Vincent Guittot wrote:
power_orig is only changed for systems with an SMT sched_domain level, in order to reflect the lower capacity of their CPUs. Heterogeneous systems also have to reflect an original capacity that is different from the default value.
Create a more generic function, arch_scale_cpu_power, that can also be used by non-SMT platforms to set power_orig.
I did a quick test of the patch set with adjusting cpu_power on big.LITTLE (ARM TC2) to reflect the different compute capacities of the A15s and A7s. I ran the sysbench cpu benchmark with 5 threads with and without the patches applied, but with non-default cpu_powers.
I didn't see any difference in the load-balance. Three tasks ended up on the two A15s and two tasks ended up on two of the three A7s leaving one unused in both cases.
Using default cpu_power I get one task on each of the five cpus (best throughput). Unless I messed something up, it seems that setting cpu_power doesn't give me the best throughput with these patches applied.
That's normal: this patchset is necessary but not sufficient to solve the issue you mention. We also need to fix the way the imbalance is calculated in such a situation. I plan to push that in another patchset, in order not to mix too many things together.
Vincent
Have you done any tests on big.LITTLE?
Morten
On Tue, Jun 03, 2014 at 03:02:18PM +0100, Vincent Guittot wrote:
On 3 June 2014 15:22, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Fri, May 23, 2014 at 04:52:58PM +0100, Vincent Guittot wrote:
power_orig is only changed for systems with an SMT sched_domain level, in order to reflect the lower capacity of their CPUs. Heterogeneous systems also have to reflect an original capacity that is different from the default value.
Create a more generic function, arch_scale_cpu_power, that can also be used by non-SMT platforms to set power_orig.
I did a quick test of the patch set with adjusting cpu_power on big.LITTLE (ARM TC2) to reflect the different compute capacities of the A15s and A7s. I ran the sysbench cpu benchmark with 5 threads with and without the patches applied, but with non-default cpu_powers.
I didn't see any difference in the load-balance. Three tasks ended up on the two A15s and two tasks ended up on two of the three A7s leaving one unused in both cases.
Using default cpu_power I get one task on each of the five cpus (best throughput). Unless I messed something up, it seems that setting cpu_power doesn't give me the best throughput with these patches applied.
That's normal: this patchset is necessary but not sufficient to solve the issue you mention. We also need to fix the way the imbalance is calculated in such a situation. I plan to push that in another patchset, in order not to mix too many things together.
Based on the commit messages I was led to believe that this was a self-contained patch set that also addressed issues related to handling heterogeneous systems. Maybe it would be worth mentioning somewhere that this set is only part of the solution?
It is a bit unclear to me how these changes, which appear to mainly improve factoring rt and irq time into cpu_power, will solve the cpu_power issues related to heterogeneous systems. Can you share your plans for the follow-up patch set? I think it would be better to review the solution as a whole.
I absolutely agree that the imbalance calculation needs to be fixed, but I don't think the current rq runnable_avg_sum is the right choice for that purpose, for the reasons I pointed out in the other thread.
Morten
On 4 June 2014 13:17, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Tue, Jun 03, 2014 at 03:02:18PM +0100, Vincent Guittot wrote:
On 3 June 2014 15:22, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Fri, May 23, 2014 at 04:52:58PM +0100, Vincent Guittot wrote:
power_orig is only changed for systems with an SMT sched_domain level, in order to reflect the lower capacity of their CPUs. Heterogeneous systems also have to reflect an original capacity that is different from the default value.
Create a more generic function, arch_scale_cpu_power, that can also be used by non-SMT platforms to set power_orig.
I did a quick test of the patch set with adjusting cpu_power on big.LITTLE (ARM TC2) to reflect the different compute capacities of the A15s and A7s. I ran the sysbench cpu benchmark with 5 threads with and without the patches applied, but with non-default cpu_powers.
I didn't see any difference in the load-balance. Three tasks ended up on the two A15s and two tasks ended up on two of the three A7s leaving one unused in both cases.
Using default cpu_power I get one task on each of the five cpus (best throughput). Unless I messed something up, it seems that setting cpu_power doesn't give me the best throughput with these patches applied.
That's normal: this patchset is necessary but not sufficient to solve the issue you mention. We also need to fix the way the imbalance is calculated in such a situation. I plan to push that in another patchset, in order not to mix too many things together.
Based on the commit messages I was led to believe that this was a self-contained patch set that also addressed issues related to handling heterogeneous systems. Maybe it would be worth mentioning somewhere that this set is only part of the solution?
So this patch addresses the issue of describing the CPU capacity of an HMP system, but doesn't solve the one-task-per-CPU issue alone.
It is a bit unclear to me how these changes, which appear to mainly improve factoring rt and irq time into cpu_power, will solve the cpu_power issues related to heterogeneous systems. Can you share your plans for the follow-up patch set? I think it would be better to review the solution as a whole.
I plan to use load_per_task to calculate the imbalance to pull.
I absolutely agree that the imbalance calculation needs to be fixed, but I don't think the current rq runnable_avg_sum is the right choice for that purpose, for the reasons I pointed out in the other thread.
Morten
Use the new arch_scale_cpu_power in order to reflect the original capacity of a CPU, instead of arch_scale_freq_power which is more related to scaling the capacity with the frequency.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 arch/arm/kernel/topology.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 71e1fec..6cc25a8 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
  */
 static DEFINE_PER_CPU(unsigned long, cpu_scale);
 
-unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
+unsigned long arch_scale_cpu_power(struct sched_domain *sd, int cpu)
 {
 	return per_cpu(cpu_scale, cpu);
 }
@@ -166,7 +166,7 @@ static void update_cpu_power(unsigned int cpu)
 	set_power_scale(cpu, cpu_capacity(cpu) / middle_capacity);
 
 	printk(KERN_INFO "CPU%u: update cpu_power %lu\n",
-		cpu, arch_scale_freq_power(NULL, cpu));
+		cpu, arch_scale_cpu_power(NULL, cpu));
 }
 
 #else
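For readers not familiar with the ARM code above: set_power_scale() normalizes a raw per-cpu capacity (roughly core efficiency times clock frequency, taken from DT) against a "middle" capacity, so the value later returned by arch_scale_cpu_power() ends up centred around SCHED_POWER_SCALE. Below is a rough standalone sketch of that normalization; the efficiency and frequency numbers are made up for illustration and are not the kernel's tables.

#include <stdio.h>

#define SCHED_POWER_SCALE 1024UL

int main(void)
{
        /* e.g. a big.LITTLE pair: {efficiency, clock in MHz}, illustrative values */
        unsigned long eff[2] = { 3891, 2048 };  /* A15-like, A7-like */
        unsigned long mhz[2] = { 1200, 1000 };
        unsigned long cap[2], middle, i;

        for (i = 0; i < 2; i++)
                cap[i] = eff[i] * mhz[i];

        /* normalize around the midpoint of the raw capacities */
        middle = (cap[0] + cap[1]) / 2 / SCHED_POWER_SCALE;

        for (i = 0; i < 2; i++)
                printf("cpu%lu: cpu_scale = %lu\n", i, cap[i] / middle);

        return 0;
}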
Hi Vincent,
Why do we have two interfaces arch_scale_freq_power() and arch_scale_cpu_power()? Does it make sense to consolidate them now ?
Regards Preeti U Murthy
On 05/23/2014 09:22 PM, Vincent Guittot wrote:
Use the new arch_scale_cpu_power in order to reflect the original capacity of a CPU, instead of arch_scale_freq_power which is more related to scaling the capacity with the frequency.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
arch/arm/kernel/topology.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 71e1fec..6cc25a8 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
  */
 static DEFINE_PER_CPU(unsigned long, cpu_scale);
 
-unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
+unsigned long arch_scale_cpu_power(struct sched_domain *sd, int cpu)
 {
 	return per_cpu(cpu_scale, cpu);
 }
@@ -166,7 +166,7 @@ static void update_cpu_power(unsigned int cpu)
 	set_power_scale(cpu, cpu_capacity(cpu) / middle_capacity);
 
 	printk(KERN_INFO "CPU%u: update cpu_power %lu\n",
-		cpu, arch_scale_freq_power(NULL, cpu));
+		cpu, arch_scale_cpu_power(NULL, cpu));
 }
 
 #else
On 25 May 2014 15:22, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
Why do we have two interfaces arch_scale_freq_power() and arch_scale_cpu_power()? Does it make sense to consolidate them now ?
Hi Preeti,
They don't have the same purpose. arch_scale_cpu_power sets the max capacity of your CPU whereas arch_scale_freq_power can be used to give the current capacity. ARM platforms were using arch_scale_freq_power because it was the only one available for non-SMT systems, but this induced some misunderstanding and some limitations in the characterization of a CPU. This consolidation is a necessary step so we can now have the max capacity of a CPU and leave arch_scale_freq_power for other purposes (or even remove it if useless).
Regards, Vincent
Regards Preeti U Murthy
On 05/23/2014 09:22 PM, Vincent Guittot wrote:
Use the new arch_scale_cpu_power in order to reflect the original capacity of a CPU, instead of arch_scale_freq_power which is more related to scaling the capacity with the frequency.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
arch/arm/kernel/topology.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 71e1fec..6cc25a8 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
  */
 static DEFINE_PER_CPU(unsigned long, cpu_scale);
 
-unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
+unsigned long arch_scale_cpu_power(struct sched_domain *sd, int cpu)
 {
 	return per_cpu(cpu_scale, cpu);
 }
@@ -166,7 +166,7 @@ static void update_cpu_power(unsigned int cpu)
 	set_power_scale(cpu, cpu_capacity(cpu) / middle_capacity);
 
 	printk(KERN_INFO "CPU%u: update cpu_power %lu\n",
-		cpu, arch_scale_freq_power(NULL, cpu));
+		cpu, arch_scale_cpu_power(NULL, cpu));
 }
 
 #else
On 05/26/2014 01:55 PM, Vincent Guittot wrote:
On 25 May 2014 15:22, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
Why do we have two interfaces arch_scale_freq_power() and arch_scale_cpu_power()? Does it make sense to consolidate them now ?
Hi Preeti,
They don't have the same purpose. arch_scale_cpu_power sets the max capacity of your CPU whereas arch_scale_freq_power can be used to give the current capacity. ARM platforms were using arch_scale_freq_power because it was the only one available for non-SMT systems, but this induced some misunderstanding and some limitations in the characterization of a CPU. This consolidation is a necessary step so we can now have the max capacity of a CPU and leave arch_scale_freq_power for other purposes (or even remove it if useless).
Ah ok! Thanks :) This was insightful :)
Regards Preeti U Murthy
Regards, Vincent
Regards Preeti U Murthy
On 05/23/2014 09:22 PM, Vincent Guittot wrote:
Use the new arch_scale_cpu_power in order to reflect the original capacity of a CPU, instead of arch_scale_freq_power which is more related to scaling the capacity with the frequency.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
arch/arm/kernel/topology.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 71e1fec..6cc25a8 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
  */
 static DEFINE_PER_CPU(unsigned long, cpu_scale);
 
-unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
+unsigned long arch_scale_cpu_power(struct sched_domain *sd, int cpu)
 {
 	return per_cpu(cpu_scale, cpu);
 }
@@ -166,7 +166,7 @@ static void update_cpu_power(unsigned int cpu)
 	set_power_scale(cpu, cpu_capacity(cpu) / middle_capacity);
 
 	printk(KERN_INFO "CPU%u: update cpu_power %lu\n",
-		cpu, arch_scale_freq_power(NULL, cpu));
+		cpu, arch_scale_cpu_power(NULL, cpu));
 }
 
 #else
This new field, cpu_power_orig, reflects the available capacity of a CPU, unlike cpu_power which reflects the current capacity that can be altered by frequency and rt tasks.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/core.c  | 2 +-
 kernel/sched/fair.c  | 1 +
 kernel/sched/sched.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 240aa83..686c56f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6946,7 +6946,7 @@ void __init sched_init(void)
 #ifdef CONFIG_SMP
 		rq->sd = NULL;
 		rq->rd = NULL;
-		rq->cpu_power = SCHED_POWER_SCALE;
+		rq->cpu_power = rq->cpu_power_orig = SCHED_POWER_SCALE;
 		rq->post_schedule = 0;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6dc05f7..b3dc2b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5675,6 +5675,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
 
 	power >>= SCHED_POWER_SHIFT;
 
+	cpu_rq(cpu)->cpu_power_orig = power;
 	sdg->sgp->power_orig = power;
 
 	if (sched_feat(ARCH_POWER))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 600e229..4c74014 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -568,6 +568,7 @@ struct rq {
 	struct sched_domain *sd;
 
 	unsigned long cpu_power;
+	unsigned long cpu_power_orig;
 
 	unsigned char idle_balance;	/* For active balancing */
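To make the difference between the two fields concrete, here is a small standalone calculation in the spirit of the scheduler's rt scaling: cpu_power_orig stays at the uArch maximum while cpu_power shrinks with the time spent in rt/irq. The helper and all numbers are made up for illustration; this is not the kernel implementation.

#include <stdio.h>

#define SCHED_POWER_SCALE 1024UL

/* Fraction of the window left after rt/irq time, applied to the
 * original capacity. Illustrative only. */
static unsigned long scale_by_rt(unsigned long power_orig,
                                 unsigned long rt_time, unsigned long window)
{
        unsigned long available = rt_time >= window ? 0 : window - rt_time;

        return power_orig * available / window;
}

int main(void)
{
        unsigned long power_orig = SCHED_POWER_SCALE;           /* e.g. a big core */
        unsigned long cpu_power = scale_by_rt(power_orig, 250, 1000);

        /* prints cpu_power_orig=1024 cpu_power=768: 25% went to rt/irq */
        printf("cpu_power_orig=%lu cpu_power=%lu\n", power_orig, cpu_power);
        return 0;
}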
This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
We are going to use runnable_avg_sum and runnable_avg_period in order to get the utilization of the CPU. This statistic includes all tasks that run on the CPU, not only CFS tasks.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/fair.c  | 13 ++++++-------
 kernel/sched/sched.h |  4 ++--
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3dc2b3..b7c51be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2427,19 +2427,12 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 		se->avg.load_avg_contrib >>= NICE_0_SHIFT;
 	}
 }
-
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
-{
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
-	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
-}
 #else /* CONFIG_FAIR_GROUP_SCHED */
 static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq, int force_update) {}
 static inline void __update_tg_runnable_avg(struct sched_avg *sa, struct cfs_rq *cfs_rq) {}
 static inline void __update_group_entity_contrib(struct sched_entity *se) {}
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 static inline void __update_task_entity_contrib(struct sched_entity *se)
@@ -2538,6 +2531,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 	__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
 }
 
+static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
+{
+	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
+	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
+}
+
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 					   struct sched_entity *se,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4c74014..929ee6c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -542,8 +542,6 @@ struct rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
 	struct list_head leaf_cfs_rq_list;
-
-	struct sched_avg avg;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 	/*
@@ -634,6 +632,8 @@ struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+
+	struct sched_avg avg;
 };
 
 static inline int cpu_of(struct rq *rq)
Monitor the activity level of each group of each sched_domain level. The activity is the amount of cpu_power that is currently used on a CPU or group of CPUs. We use the runnable_avg_sum and _period to evaluate this activity level. In the special use case where the CPU is fully loaded by more than 1 task, the activity level is set above the cpu_power in order to reflect the overload of the CPU
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7c51be..c01d8b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4044,6 +4044,11 @@ static unsigned long power_of(int cpu)
 	return cpu_rq(cpu)->cpu_power;
 }
 
+static unsigned long power_orig_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_power_orig;
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -4438,6 +4443,18 @@ done:
 	return target;
 }
 
+static int get_cpu_activity(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	if (sum >= period)
+		return power_orig_of(cpu) + rq->nr_running - 1;
+
+	return (sum * power_orig_of(cpu)) / period;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5518,6 +5535,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_power;
+	unsigned long group_activity; /* Total activity of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int group_capacity;
 	unsigned int idle_cpus;
@@ -5538,6 +5556,7 @@ struct sd_lb_stats {
 	struct sched_group *busiest;	/* Busiest group in this sd */
 	struct sched_group *local;	/* Local group in this sd */
 	unsigned long total_load;	/* Total load of all groups in sd */
+	unsigned long total_activity;	/* Total activity of all groups in sd */
 	unsigned long total_pwr;	/* Total power of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
 
@@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.busiest = NULL,
 		.local = NULL,
 		.total_load = 0UL,
+		.total_activity = 0UL,
 		.total_pwr = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
@@ -5876,6 +5896,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		load = source_load(i, load_idx);
 
 		sgs->group_load += load;
+		sgs->group_activity += get_cpu_activity(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
@@ -6034,6 +6055,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 next_group:
 	/* Now, start updating sd_lb_stats */
 	sds->total_load += sgs->group_load;
+	sds->total_activity += sgs->group_activity;
 	sds->total_pwr += sgs->group_power;
 
 	sg = sg->next;
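A standalone walk-through of the arithmetic in get_cpu_activity() above may help: below the saturation point the activity is simply the runnable ratio scaled by the CPU's original capacity, and once the CPU is saturated the value is pushed above that capacity by the number of extra runnable tasks. The program below just mirrors the patch's formula with hand-picked inputs.

#include <stdio.h>
#include <stdint.h>

/* Same formula as get_cpu_activity() in the patch, lifted out of the
 * kernel so it can be run with arbitrary inputs. */
static unsigned long activity(uint32_t sum, uint32_t period,
                              unsigned long power_orig, unsigned int nr_running)
{
        if (sum >= period)
                return power_orig + nr_running - 1;

        return ((unsigned long)sum * power_orig) / period;
}

int main(void)
{
        /* CPU busy half the time, power_orig 1024 -> activity 512 */
        printf("%lu\n", activity(512, 1024, 1024, 1));
        /* saturated CPU with 3 runnable tasks -> 1024 + 2 = 1026 */
        printf("%lu\n", activity(1024, 1024, 1024, 3));
        /* same saturation on a little CPU with power_orig 512 -> 514 */
        printf("%lu\n", activity(1024, 1024, 512, 3));
        return 0;
}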
On Fri, May 23, 2014 at 05:53:02PM +0200, Vincent Guittot wrote:
Monitor the activity level of each group of each sched_domain level. The activity is the amount of cpu_power that is currently used on a CPU or group of CPUs. We use the runnable_avg_sum and _period to evaluate this activity level. In the special use case where the CPU is fully loaded by more than 1 task, the activity level is set above the cpu_power in order to reflect the overload of the CPU
+static int get_cpu_activity(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	if (sum >= period)
+		return power_orig_of(cpu) + rq->nr_running - 1;
+
+	return (sum * power_orig_of(cpu)) / period;
+}
While I appreciate the need to signify the overload situation, I don't think adding nr_running makes sense. The amount of tasks has no bearing on the amount of overload.
Also, and note I've not yet seen the use, it strikes me as odd to use the orig power here. I would've thought the current capacity (not the max capacity) is relevant to balancing.
On 27 May 2014 19:32, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:53:02PM +0200, Vincent Guittot wrote:
Monitor the activity level of each group of each sched_domain level. The activity is the amount of cpu_power that is currently used on a CPU or group of CPUs. We use the runnable_avg_sum and _period to evaluate this activity level. In the special use case where the CPU is fully loaded by more than 1 task, the activity level is set above the cpu_power in order to reflect the overload of the CPU
+static int get_cpu_activity(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	if (sum >= period)
+		return power_orig_of(cpu) + rq->nr_running - 1;
+
+	return (sum * power_orig_of(cpu)) / period;
+}
While I appreciate the need to signify the overload situation, I don't think adding nr_running makes sense. The amount of tasks has no bearing on the amount of overload.
I agree that it's not the best way to evaluate overload, but it's the only simple one that is available without additional computation. I'm going to try to find a better metric, or come back to the solution which only adds +1 and computes the overload somewhere else.
Also, and note I've not yet seen the use, it strikes me as odd to use the orig power here. I would've thought the current capacity (not the max capacity) is relevant to balancing.
activity also measures the activity of non-cfs tasks, whereas current capacity removes the capacity used by non-cfs tasks. So as long as arch_scale_cpu_freq is not used (which is the case for now), original capacity is ok, but we might need a 3rd metric so we would have:
- original capacity (the max capacity that a CPU can provide)
- usable capacity (a new value that reflects the current capacity available for any kind of processing: rt tasks, irq, cfs tasks)
- current capacity (the capacity available for cfs tasks)
Vincent
On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:
Monitor the activity level of each group of each sched_domain level. The activity is the amount of cpu_power that is currently used on a CPU or group of CPUs. We use the runnable_avg_sum and _period to evaluate this activity level. In the special use case where the CPU is fully loaded by more than 1 task, the activity level is set above the cpu_power in order to reflect the overload of the CPU
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7c51be..c01d8b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4044,6 +4044,11 @@ static unsigned long power_of(int cpu)
 	return cpu_rq(cpu)->cpu_power;
 }
 
+static unsigned long power_orig_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_power_orig;
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -4438,6 +4443,18 @@ done:
 	return target;
 }
 
+static int get_cpu_activity(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	if (sum >= period)
+		return power_orig_of(cpu) + rq->nr_running - 1;
+
+	return (sum * power_orig_of(cpu)) / period;
+}
The rq runnable_avg_{sum, period} give a very long term view of the cpu utilization (I will use the term utilization instead of activity as I think that is what we are talking about here). IMHO, it is too slow to be used as basis for load balancing decisions. I think that was also agreed upon in the last discussion related to this topic [1].
The basic problem is that worst case: sum starting from 0 and period already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods (ms) for sum to reach 47742. In other words, the cpu might have been fully utilized for 345 ms before it is considered fully utilized. Periodic load-balancing happens much more frequently than that.
Also, if load-balancing actually moves tasks around it may take quite a while before runnable_avg_sum actually reflects this change. The next periodic load-balance is likely to happen before runnable_avg_sum has reflected the result of the previous periodic load-balance.
To avoid these problems, we need to base utilization on a metric which is updated instantaneously when we add/remove tasks to a cpu (or at least fast enough that we don't see the above problems). In the previous discussion [1] it was suggested to use a sum of unweighted task runnable_avg_{sum,period} ratios instead. That is, an unweighted equivalent to weighted_cpuload(). That isn't a perfect solution either. It is fine as long as the cpus are not fully utilized, but when they are we need to use weighted_cpuload() to preserve smp_nice. What to do around the tipping point needs more thought, but I think that is currently the best proposal for a solution for task and cpu utilization.
rq runnable_avg_sum is useful for decisions where we need a longer term view of the cpu utilization, but I don't see how we can use as cpu utilization metric for load-balancing decisions at wakeup or periodically.
Morten
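The numbers quoted above follow directly from the geometric decay used by the load tracking code: the per-millisecond decay factor y is chosen so that y^32 = 1/2, the series 1024 * (1 + y + y^2 + ...) converges close to the kernel's LOAD_AVG_MAX of 47742 (the kernel's fixed-point tables truncate it slightly lower), and a sum starting from zero crosses half of that limit after about 32 ms but only approaches full saturation around the 345 ms mark. A small standalone check, not kernel code:

#include <stdio.h>
#include <math.h>

int main(void)
{
        double y = pow(0.5, 1.0 / 32.0);        /* per-ms decay: y^32 == 0.5 */
        double limit = 1024.0 / (1.0 - y);      /* limit of the geometric series */
        double sum = 0.0;
        int ms = 0;

        printf("y = %.6f, series limit ~= %.0f\n", y, limit);

        /* fully busy CPU starting from an idle history */
        while (sum < limit / 2.0) {
                sum = sum * y + 1024.0;
                ms++;
        }
        printf("half of the limit after %d ms\n", ms);

        printf("after 345 ms: %.2f%% of the limit\n", 100.0 * (1.0 - pow(y, 345)));
        return 0;
}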
On 28 May 2014 14:10, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:
Monitor the activity level of each group of each sched_domain level. The activity is the amount of cpu_power that is currently used on a CPU or group of CPUs. We use the runnable_avg_sum and _period to evaluate this activity level. In the special use case where the CPU is fully loaded by more than 1 task, the activity level is set above the cpu_power in order to reflect the overload of the CPU
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7c51be..c01d8b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4044,6 +4044,11 @@ static unsigned long power_of(int cpu)
 	return cpu_rq(cpu)->cpu_power;
 }
 
+static unsigned long power_orig_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_power_orig;
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -4438,6 +4443,18 @@ done:
 	return target;
 }
 
+static int get_cpu_activity(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	if (sum >= period)
+		return power_orig_of(cpu) + rq->nr_running - 1;
+
+	return (sum * power_orig_of(cpu)) / period;
+}
The rq runnable_avg_{sum, period} give a very long term view of the cpu utilization (I will use the term utilization instead of activity as I think that is what we are talking about here). IMHO, it is too slow to be used as basis for load balancing decisions. I think that was also agreed upon in the last discussion related to this topic [1].
The basic problem is that worst case: sum starting from 0 and period already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods (ms) for sum to reach 47742. In other words, the cpu might have been fully utilized for 345 ms before it is considered fully utilized. Periodic load-balancing happens much more frequently than that.
I agree that it's not really responsive, but several statistics of the scheduler use the same kind of metrics and have the same kind of responsiveness. I agree that it's not enough, and that's why I'm not using only this metric, but it gives information that the unweighted load_avg_contrib (that you are speaking about below) can't give. So I would be less categorical than you and would say that we probably need additional metrics.
Also, if load-balancing actually moves tasks around it may take quite a while before runnable_avg_sum actually reflects this change. The next periodic load-balance is likely to happen before runnable_avg_sum has reflected the result of the previous periodic load-balance.
runnable_avg_sum uses a 1 ms unit step, so I tend to disagree with your point above.
To avoid these problems, we need to base utilization on a metric which is updated instantaneously when we add/remove tasks to a cpu (or a least fast enough that we don't see the above problems). In the previous discussion [1] it was suggested that a sum of unweighted task runnable_avg_{sum,period} ratio instead. That is, an unweighted equivalent to weighted_cpuload(). That isn't a perfect solution either.
Regarding the unweighted load_avg_contrib, you will have a similar issue because of the slowness in the variation of each sched_entity load that will be added/removed in the unweighted load_avg_contrib.
The update of the runnable_avg_{sum,period} of a sched_entity is quite similar to cpu utilization. This value is linked to the CPU on which it has run previously because of the time sharing with other tasks, so the unweighted load of a freshly migrated task will reflect its load on the previous CPU (with the time sharing with other tasks on the prev CPU).
I'm not saying that such a metric is useless, but it's not perfect either.
Vincent
It is fine as long as the cpus are not fully utilized, but when they are we need to use weighted_cpuload() to preserve smp_nice. What to do around the tipping point needs more thought, but I think that is currently the best proposal for a solution for task and cpu utilization.
rq runnable_avg_sum is useful for decisions where we need a longer term view of the cpu utilization, but I don't see how we can use as cpu utilization metric for load-balancing decisions at wakeup or periodically.
Morten
On Wed, May 28, 2014 at 02:15:03PM +0100, Vincent Guittot wrote:
On 28 May 2014 14:10, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:
Monitor the activity level of each group of each sched_domain level. The activity is the amount of cpu_power that is currently used on a CPU or group of CPUs. We use the runnable_avg_sum and _period to evaluate this activity level. In the special use case where the CPU is fully loaded by more than 1 task, the activity level is set above the cpu_power in order to reflect the overload of the CPU
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7c51be..c01d8b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4044,6 +4044,11 @@ static unsigned long power_of(int cpu)
 	return cpu_rq(cpu)->cpu_power;
 }
 
+static unsigned long power_orig_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_power_orig;
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -4438,6 +4443,18 @@ done:
 	return target;
 }
 
+static int get_cpu_activity(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	if (sum >= period)
+		return power_orig_of(cpu) + rq->nr_running - 1;
+
+	return (sum * power_orig_of(cpu)) / period;
+}
The rq runnable_avg_{sum, period} give a very long term view of the cpu utilization (I will use the term utilization instead of activity as I think that is what we are talking about here). IMHO, it is too slow to be used as basis for load balancing decisions. I think that was also agreed upon in the last discussion related to this topic [1].
The basic problem is that worst case: sum starting from 0 and period already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods (ms) for sum to reach 47742. In other words, the cpu might have been fully utilized for 345 ms before it is considered fully utilized. Periodic load-balancing happens much more frequently than that.
I agree that it's not really responsive but several statistics of the scheduler use the same kind of metrics and have the same kind of responsiveness.
I might be wrong, but I don't think we use anything similar to this to estimate cpu load/utilization for load-balancing purposes except for {source, target}_load() where it is used to bias the decisions whether or not to balance if the difference is small. That is what the discussion was about last time.
I agree that it's not enough and that's why i'm not using only this metric but it gives information that the unweighted load_avg_contrib (that you are speaking about below) can't give. So i would be less contrasted than you and would say that we probably need additional metrics
I'm not saying that we shouldn't use this metric at all, I'm just saying that I don't think it is suitable for estimating the short-term cpu utilization, which is what you need to make load-balancing decisions. We can't observe the effect of recent load-balancing decisions if the metric is too slow to react.
I realize that what I mean by 'slow' might be unclear. Load tracking (both task and rq) takes a certain amount of history into account in runnable_avg_{sum, period}. This amount is determined by the 'y'-weight, which has been chosen such that we consider the load in the past 345 time units, where the time unit is ~1 ms. The contribution is smaller the further you go back due to y^n, which diminishes to 0 for n > 345. So, if a task or cpu goes from having been idle for >345 ms to being constantly busy, it will take 345 ms until the entire history that we care about will reflect this change. Only then runnable_avg_sum will reach 47742. The rate of change is faster to begin with since the weight of the most recent history is higher. runnable_avg_sum will get to 47742/2 in just 32 ms.
Since we may do periodic load-balance every 10 ms or so, we will perform a number of load-balances where runnable_avg_sum will mostly be reflecting the state of the world before a change (new task queued or moved a task to a different cpu). If you have two tasks continuously on one cpu and your other cpu is idle, and you move one of the tasks to the other cpu, runnable_avg_sum will remain unchanged, 47742, on the first cpu while it starts from 0 on the other one. 10 ms later it will have increased a bit, 32 ms later it will be 47742/2, and 345 ms later it reaches 47742. In the meantime the cpu doesn't appear fully utilized and we might decide to put more tasks on it because we don't know if runnable_avg_sum represents a partially utilized cpu (for example a 50% task) or if it will continue to rise and eventually get to 47742.
IMO, we need cpu utilization to clearly represent the current utilization of the cpu such that any changes since the last wakeup or load-balance are clearly visible.
Also, if load-balancing actually moves tasks around it may take quite a while before runnable_avg_sum actually reflects this change. The next periodic load-balance is likely to happen before runnable_avg_sum has reflected the result of the previous periodic load-balance.
runnable_avg_sum uses a 1ms unit step so i tend to disagree with your point above
See explanation above. The important thing is how much history we take into account. That is 345 x 1 ms time units. The rate at which the sum is updated doesn't change anything. 1 ms after a change (wakeup, load-balance, ...) runnable_avg_sum can only change by 1024. The remaining ~98% of your weighted history still reflects the world before the change.
To avoid these problems, we need to base utilization on a metric which is updated instantaneously when we add/remove tasks to a cpu (or a least fast enough that we don't see the above problems). In the previous discussion [1] it was suggested that a sum of unweighted task runnable_avg_{sum,period} ratio instead. That is, an unweighted equivalent to weighted_cpuload(). That isn't a perfect solution either.
Regarding the unweighted load_avg_contrib, you will have similar issue because of the slowness in the variation of each sched_entity load that will be added/removed in the unweighted load_avg_contrib.
The update of the runnable_avg_{sum,period} of an sched_entity is quite similar to cpu utilization.
Yes, runnable_avg_{sum, period} for tasks and rqs are exactly the same. No difference there :)
This value is linked to the CPU on which it has run previously because of the time sharing with others tasks, so the unweighted load of a freshly migrated task will reflect its load on the previous CPU (with the time sharing with other tasks on prev CPU).
I agree that the task runnable_avg_sum is always affected by the circumstances on the cpu where it is running, and that it takes this history with it. However, I think cfs.runnable_load_avg leads to fewer problems than using the rq runnable_avg_sum. It would work nicely for the two tasks on two cpus example I mentioned earlier. We don't need to add something on top when the cpu is fully utilized by more than one task. It comes more naturally with cfs.runnable_load_avg. If it is much larger than 47742, it should be fairly safe to assume that you shouldn't stick more tasks on that cpu.
I'm not saying that such metric is useless but it's not perfect as well.
It comes with its own set of problems, agreed. Based on my current understanding (or lack thereof) they just seem smaller :)
Morten
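The responsiveness argument can be made concrete with a toy model. Suppose two always-running tasks share cpu0 and one of them is migrated to an idle cpu1 at t=0: a metric built from per-task contributions (in the spirit of cfs.runnable_load_avg) can be updated at the moment of the migration, while cpu1's rq runnable_avg_sum has to ramp up with the same geometric decay as above. The model below deliberately ignores everything else and is not kernel code.

#include <stdio.h>
#include <math.h>

int main(void)
{
        const double y = pow(0.5, 1.0 / 32.0);
        const double limit = 1024.0 / (1.0 - y);
        double rq_sum = 0.0;            /* cpu1 was idle before the migration */
        double task_based = limit;      /* migrated 100% task, moved at t=0 */
        int t;

        for (t = 1; t <= 345; t++) {
                rq_sum = rq_sum * y + 1024.0;   /* cpu1 is now always busy */
                if (t == 1 || t == 10 || t == 32 || t == 100 || t == 345)
                        printf("t=%3d ms: rq runnable_avg_sum=%5.0f/%.0f  task-based=%5.0f\n",
                               t, rq_sum, limit, task_based);
        }
        return 0;
}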
On 28 May 2014 17:47, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Wed, May 28, 2014 at 02:15:03PM +0100, Vincent Guittot wrote:
On 28 May 2014 14:10, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:
[snip]
This value is linked to the CPU on which it has run previously because of the time sharing with others tasks, so the unweighted load of a freshly migrated task will reflect its load on the previous CPU (with the time sharing with other tasks on prev CPU).
I agree that the task runnable_avg_sum is always affected by the circumstances on the cpu where it is running, and that it takes this history with it. However, I think cfs.runnable_load_avg leads to less problems than using the rq runnable_avg_sum. It would work nicely for the two tasks on two cpus example I mentioned earlier. We don't need add
I would say that nr_running is an even better metric for such a situation, as the load doesn't give any additional information. Just to point out that we can spend a lot of time listing which use cases are better covered by which metrics :-)
something on top when the cpu is fully utilized by more than one task. It comes more naturally with cfs.runnable_load_avg. If it is much larger than 47742, it should be fairly safe to assume that you shouldn't stick more tasks on that cpu.
I'm not saying that such metric is useless but it's not perfect as well.
It comes with its own set of problems, agreed. Based on my current understanding (or lack thereof) they just seem smaller :)
I think it's worth using the cpu utilization for some cases because it carries information that is not available elsewhere. And the replacement of the current capacity computation is one example. As explained previously, I'm not against adding other metrics, and I'm not sure I understand why you oppose these 2 metrics when they could be complementary.
Vincent
Morten
On Wed, May 28, 2014 at 05:39:10PM +0100, Vincent Guittot wrote:
On 28 May 2014 17:47, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Wed, May 28, 2014 at 02:15:03PM +0100, Vincent Guittot wrote:
On 28 May 2014 14:10, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:
[snip]
This value is linked to the CPU on which it has run previously because of the time sharing with others tasks, so the unweighted load of a freshly migrated task will reflect its load on the previous CPU (with the time sharing with other tasks on prev CPU).
I agree that the task runnable_avg_sum is always affected by the circumstances on the cpu where it is running, and that it takes this history with it. However, I think cfs.runnable_load_avg leads to less problems than using the rq runnable_avg_sum. It would work nicely for the two tasks on two cpus example I mentioned earlier. We don't need add
i would say that nr_running is an even better metrics for such situation as the load doesn't give any additional information.
I fail to understand how nr_running can be used. nr_running doesn't tell you anything about the utilization of the cpu, just the number of tasks that happen to be runnable at a point in time on a specific cpu. It might be two small tasks that just happened to be running while you read nr_running.
An unweighted version of cfs.runnable_load_avg gives you a metric that captures cpu utilization to some extent, but not the number of tasks. And it reflects task migrations immediately, unlike the rq runnable_avg_sum.
Just to point that we can spent a lot of time listing which use case are better covered by which metrics :-)
Agreed, but I think it is quite important to discuss what we understand by cpu utilization. It seems to be different depending on what you want to use it for. I think it is also clear that none of the metrics that have been proposed are perfect. We therefore have to be careful to only use metrics in scenarios where they make sense. IMHO, both rq runnable_avg_sum and unweighted cfs.runnable_load_avg capture cpu utilization, but in different ways.
We have done experiments internally with rq runnable_avg_sum for load-balancing decisions in the past and found it unsuitable due to its slow response to task migrations. That is why I brought it up here. AFAICT, you use rq runnable_avg_sum more like a flag than a quantitative measure of cpu utilization. Viewing things from an energy-awareness point of view I'm more interested in the latter for estimating the implications of moving tasks around. I don't have any problems with using rq runnable_avg_sum for other things as long we are fully aware of how this metric works.
something on top when the cpu is fully utilized by more than one task. It comes more naturally with cfs.runnable_load_avg. If it is much larger than 47742, it should be fairly safe to assume that you shouldn't stick more tasks on that cpu.
I'm not saying that such metric is useless but it's not perfect as well.
It comes with its own set of problems, agreed. Based on my current understanding (or lack thereof) they just seem smaller :)
I think it's worth using the cpu utilization for some cases because it has got some information that are not available elsewhere. And the replacement of the current capacity computation is one example. As explained previously, I'm not against adding other metrics and i'm not sure to understand why you oppose these 2 metrics whereas they could be complementary
I think we more or less agree :) I'm fine with both metrics and I agree that they complement each other. My concern is using the right metric for the right job. If you choose to use rq runnable_avg_sum you have to keep its slow reaction time in mind. I think that might be difficult/not possible for some load-balancing decisions. That is basically my point :)
Morten
On Tue, Jun 03, 2014 at 01:03:54PM +0100, Morten Rasmussen wrote:
On Wed, May 28, 2014 at 05:39:10PM +0100, Vincent Guittot wrote:
On 28 May 2014 17:47, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Wed, May 28, 2014 at 02:15:03PM +0100, Vincent Guittot wrote:
On 28 May 2014 14:10, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:
I agree that the task runnable_avg_sum is always affected by the circumstances on the cpu where it is running, and that it takes this history with it. However, I think cfs.runnable_load_avg leads to less problems than using the rq runnable_avg_sum. It would work nicely for the two tasks on two cpus example I mentioned earlier. We don't need add
i would say that nr_running is an even better metrics for such situation as the load doesn't give any additional information.
I fail to understand how nr_running can be used. nr_running doesn't tell you anything about the utilization of the cpu, just the number tasks that happen to be runnable at a point in time on a specific cpu. It might be two small tasks that just happened to be running while you read nr_running.
Agreed, I'm not at all seeing how nr_running is useful here.
An unweighted version of cfs.runnable_load_avg gives you a metric that captures cpu utilization to some extend, but not the number of tasks. And it reflects task migrations immediately unlike the rq runnable_avg_sum.
So runnable_avg would be equal to the utilization as long as there's idle time; as soon as we're over-loaded, the metric shows how much extra cpu is required.
That is, runnable_avg - running_avg >= 0 and the amount is the exact amount of extra cpu required to make all tasks run but not have idle time.
Agreed, but I think it is quite important to discuss what we understand by cpu utilization. It seems to be different depending on what you want to use it for.
I understand utilization to be however much cpu is actually used, so I would, per the existing naming, call running_avg to be the avg utilization of a task/group/cpu whatever.
We have done experiments internally with rq runnable_avg_sum for load-balancing decisions in the past and found it unsuitable due to its slow response to task migrations. That is why I brought it up here.
So I'm not entirely seeing that from the code (I've not traced this), afaict we actually update the per-cpu values on migration based on the task values.
old_rq->sum -= p->val;
new_rq->sum += p->val;
like,.. except of course totally obscured.
On Tue, Jun 03, 2014 at 04:59:39PM +0100, Peter Zijlstra wrote:
On Tue, Jun 03, 2014 at 01:03:54PM +0100, Morten Rasmussen wrote:
An unweighted version of cfs.runnable_load_avg gives you a metric that captures cpu utilization to some extend, but not the number of tasks. And it reflects task migrations immediately unlike the rq runnable_avg_sum.
So runnable_avg would be equal to the utilization as long as there's idle time, as soon as we're over-loaded the metric shows how much extra cpu is required.
That is, runnable_avg - running_avg >= 0 and the amount is the exact amount of extra cpu required to make all tasks run but not have idle time.
Yes, roughly. runnable_avg goes up quite steeply if you have many tasks on a fully utilized cpu, so the actual amount of extra cpu required might be somewhat lower. I can't come up with something better, so I agree.
Agreed, but I think it is quite important to discuss what we understand by cpu utilization. It seems to be different depending on what you want to use it for.
I understand utilization to be however much cpu is actually used, so I would, per the existing naming, call running_avg to be the avg utilization of a task/group/cpu whatever.
I see your point, but for load balancing purposes we are more interested in the runnable_avg as it tells us about the cpu capacity requirements. I don't like to throw more terms into the mix, but you could call runnable_avg the potential task/group/cpu utilization. This is an estimate of how much utilization a task would cause if we moved it to an idle cpu. That might be quite different from running_avg on an over-utilized cpu.
We have done experiments internally with rq runnable_avg_sum for load-balancing decisions in the past and found it unsuitable due to its slow response to task migrations. That is why I brought it up here.
So I'm not entirely seeing that from the code (I've not traced this), afaict we actually update the per-cpu values on migration based on the task values.
old_rq->sum -= p->val; new_rq->sum += p->val;
like,.. except of course totally obscured.
Yes, for cfs.runnable_load_avg; rq->avg.runnable_avg_sum is different. See the other reply.
On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
Since we may do periodic load-balance every 10 ms or so, we will perform a number of load-balances where runnable_avg_sum will mostly be reflecting the state of the world before a change (new task queued or moved a task to a different cpu). If you had have two tasks continuously on one cpu and your other cpu is idle, and you move one of the tasks to the other cpu, runnable_avg_sum will remain unchanged, 47742, on the first cpu while it starts from 0 on the other one. 10 ms later it will have increased a bit, 32 ms later it will be 47742/2, and 345 ms later it reaches 47742. In the mean time the cpu doesn't appear fully utilized and we might decide to put more tasks on it because we don't know if runnable_avg_sum represents a partially utilized cpu (for example a 50% task) or if it will continue to rise and eventually get to 47742.
Ah, no, since we track per task, and update the per-cpu ones when we migrate tasks, the per-cpu values should be instantly updated.
If we were to increase per task storage, we might as well also track running_avg not only runnable_avg.
On Tue, Jun 03, 2014 at 04:50:07PM +0100, Peter Zijlstra wrote:
On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
Since we may do periodic load-balance every 10 ms or so, we will perform a number of load-balances where runnable_avg_sum will mostly be reflecting the state of the world before a change (new task queued or moved a task to a different cpu). If you had have two tasks continuously on one cpu and your other cpu is idle, and you move one of the tasks to the other cpu, runnable_avg_sum will remain unchanged, 47742, on the first cpu while it starts from 0 on the other one. 10 ms later it will have increased a bit, 32 ms later it will be 47742/2, and 345 ms later it reaches 47742. In the mean time the cpu doesn't appear fully utilized and we might decide to put more tasks on it because we don't know if runnable_avg_sum represents a partially utilized cpu (for example a 50% task) or if it will continue to rise and eventually get to 47742.
Ah, no, since we track per task, and update the per-cpu ones when we migrate tasks, the per-cpu values should be instantly updated.
No, not for this per-cpu tracking metric :) For cfs.runnable_load_avg you are right, but this patch set is using rq->avg.runnable_avg_sum which is different. See my other reply.
If we were to increase per task storage, we might as well also track running_avg not only runnable_avg.
That could probably make sense. We had that in pjt's first proposal.
On 3 June 2014 17:50, Peter Zijlstra peterz@infradead.org wrote:
On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
Since we may do periodic load-balance every 10 ms or so, we will perform a number of load-balances where runnable_avg_sum will mostly be reflecting the state of the world before a change (new task queued or moved a task to a different cpu). If you had have two tasks continuously on one cpu and your other cpu is idle, and you move one of the tasks to the other cpu, runnable_avg_sum will remain unchanged, 47742, on the first cpu while it starts from 0 on the other one. 10 ms later it will have increased a bit, 32 ms later it will be 47742/2, and 345 ms later it reaches 47742. In the mean time the cpu doesn't appear fully utilized and we might decide to put more tasks on it because we don't know if runnable_avg_sum represents a partially utilized cpu (for example a 50% task) or if it will continue to rise and eventually get to 47742.
Ah, no, since we track per task, and update the per-cpu ones when we migrate tasks, the per-cpu values should be instantly updated.
If we were to increase per task storage, we might as well also track running_avg not only runnable_avg.
I agree that the removed running_avg should give more useful information about the load of a CPU.
The main issue with running_avg is that it's disturbed by other tasks (as pointed out previously). As a typical example, if we have 2 tasks with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be in the range of [100% - 50%] depending on the parallelism of the tasks' runtime, whereas the reality is 50%, and the use of running_avg will return this value.
Vincent
On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
On 3 June 2014 17:50, Peter Zijlstra peterz@infradead.org wrote:
On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
Since we may do periodic load-balance every 10 ms or so, we will perform a number of load-balances where runnable_avg_sum will mostly be reflecting the state of the world before a change (new task queued or moved a task to a different cpu). If you had have two tasks continuously on one cpu and your other cpu is idle, and you move one of the tasks to the other cpu, runnable_avg_sum will remain unchanged, 47742, on the first cpu while it starts from 0 on the other one. 10 ms later it will have increased a bit, 32 ms later it will be 47742/2, and 345 ms later it reaches 47742. In the mean time the cpu doesn't appear fully utilized and we might decide to put more tasks on it because we don't know if runnable_avg_sum represents a partially utilized cpu (for example a 50% task) or if it will continue to rise and eventually get to 47742.
Ah, no, since we track per task, and update the per-cpu ones when we migrate tasks, the per-cpu values should be instantly updated.
If we were to increase per task storage, we might as well also track running_avg not only runnable_avg.
I agree that the removed running_avg should give more useful information about the the load of a CPU.
The main issue with running_avg is that it's disturbed by other tasks (as point out previously). As a typical example, if we have 2 tasks with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be in the range of [100% - 50%] depending of the parallelism of the runtime of the tasks whereas the reality is 50% and the use of running_avg will return this value
I'm not sure I see how 100% is possible, but yes I agree that runnable can indeed be inflated due to this queueing effect.
On Wed, Jun 04, 2014 at 09:08:09AM +0100, Peter Zijlstra wrote:
On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
On 3 June 2014 17:50, Peter Zijlstra peterz@infradead.org wrote:
On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
Since we may do periodic load-balance every 10 ms or so, we will perform a number of load-balances where runnable_avg_sum will mostly be reflecting the state of the world before a change (new task queued or moved a task to a different cpu). If you had have two tasks continuously on one cpu and your other cpu is idle, and you move one of the tasks to the other cpu, runnable_avg_sum will remain unchanged, 47742, on the first cpu while it starts from 0 on the other one. 10 ms later it will have increased a bit, 32 ms later it will be 47742/2, and 345 ms later it reaches 47742. In the mean time the cpu doesn't appear fully utilized and we might decide to put more tasks on it because we don't know if runnable_avg_sum represents a partially utilized cpu (for example a 50% task) or if it will continue to rise and eventually get to 47742.
Ah, no, since we track per task, and update the per-cpu ones when we migrate tasks, the per-cpu values should be instantly updated.
If we were to increase per task storage, we might as well also track running_avg not only runnable_avg.
I agree that the removed running_avg should give more useful information about the the load of a CPU.
The main issue with running_avg is that it's disturbed by other tasks (as point out previously). As a typical example, if we have 2 tasks with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be in the range of [100% - 50%] depending of the parallelism of the runtime of the tasks whereas the reality is 50% and the use of running_avg will return this value
Both running_avg and runnable_avg are affected by other tasks on the same cpus, but in different ways. They are equal if you only have one task on a cpu. If you have more, running_avg will give you the true requirement of the tasks until the cpu is fully utilized. At which point the task running_avg will drop if you add more tasks (the unweighted sum of task running_avgs remains constant).
runnable_avg, on the other hand, might be affected as soon as you have two tasks running on the same cpu if they are runnable at the same time. That isn't necessarily a bad thing for load-balancing purposes, because tasks that are runnable at the same time are likely to be run more efficiently by placing them on different cpus. You might view it as a sort of built-in concurrency factor, somewhat similar to what Yuyang is proposing. runnable_avg increases rapidly when the cpu is over-utilized.
I'm not sure I see how 100% is possible, but yes I agree that runnable can indeed be inflated due to this queueing effect.
You should only be able to get to 75% worst case for runnable_avg for that example. The total running_avg is 50% no matter if the tasks overlaps or not.
If you had five tasks on one cpu that each have a 25% requirement, you can get individual task runnable_avgs of up to 100% (the cpu's unweighted runnable_load_avg can get up to 500%, I think), but the task running_avgs would be 20% each (total of 100%).
On Wed, Jun 04, 2014 at 09:55:42AM +0100, Morten Rasmussen wrote:
Both running_avg and runnable_avg are affected by other tasks on the same cpus, but in different ways. They are equal if you only have one task on a cpu. If you have more, running_avg will give you the true requirement of the tasks until the cpu is fully utilized. At which point the task running_avg will drop if you add more tasks (the unweighted sum of task running_avgs remains constant).
runnable_avg on the other hand, might be affected as soon as you have two task running on the same cpu if they are runnable at the same time. That isn't necessarily a bad thing for load-balancing purposes, because tasks that are runnable at the same time are likely to be run more efficiently by placing them on different cpus. You might view as at sort of built in concurrency factor, somewhat similar to what Yuyang is proposing. runnable_avg increases rapidly when the cpu is over-utilized.
Agreed.
I'm not sure I see how 100% is possible, but yes I agree that runnable can indeed be inflated due to this queueing effect.
You should only be able to get to 75% worst case for runnable_avg for that example. The total running_avg is 50% no matter if the tasks overlaps or not.
Yes, 75% is what I ended up with.
f you had five tasks on one cpu that each have a 25% requirement you can get individual task runnable_avgs of up to 100% (cpu unweighted runnable_load_avg can get up 500%, I think), but the task running_avgs would be 20% each (total of 100%).
Yeah, more or less so indeed. I had not considered the queueing effects on runnable_avg yesterday, so good that that got raised.
That does indeed invalidate my: runnable - running := extra cpu required thing. It ends up being the extra cpu required for 0 latency but gobs of idle time, which is something else entirely.
On 4 June 2014 11:23, Peter Zijlstra peterz@infradead.org wrote:
On Wed, Jun 04, 2014 at 09:55:42AM +0100, Morten Rasmussen wrote:
Both running_avg and runnable_avg are affected by other tasks on the same cpus, but in different ways. They are equal if you only have one task on a cpu. If you have more, running_avg will give you the true requirement of the tasks until the cpu is fully utilized. At which point the task running_avg will drop if you add more tasks (the unweighted sum of task running_avgs remains constant).
runnable_avg on the other hand, might be affected as soon as you have two task running on the same cpu if they are runnable at the same time. That isn't necessarily a bad thing for load-balancing purposes, because tasks that are runnable at the same time are likely to be run more efficiently by placing them on different cpus. You might view as at sort of built in concurrency factor, somewhat similar to what Yuyang is proposing. runnable_avg increases rapidly when the cpu is over-utilized.
Agreed.
I'm not sure I see how 100% is possible, but yes I agree that runnable can indeed be inflated due to this queueing effect.
You should only be able to get to 75% worst case for runnable_avg for that example. The total running_avg is 50% no matter if the tasks overlaps or not.
Yes, 75% is what I ended up with.
Can you explain how you reach 75%? It depends on the runtime: a runtime longer than 345 ms will end up at a 100% load whatever the idle time was previously.
f you had five tasks on one cpu that each have a 25% requirement you can get individual task runnable_avgs of up to 100% (cpu unweighted runnable_load_avg can get up 500%, I think), but the task running_avgs would be 20% each (total of 100%).
Yeah, more or less so indeed. I had not considered the queueing effects on runnable_avg yesterday, so good that that got raised.
That does indeed invalidate my: runnable - running := extra cpu required thing. It ends up being the extra cpu required for 0 latency but gobs of idle time, which is something else entirely.
On Wed, Jun 04, 2014 at 10:35:15AM +0100, Vincent Guittot wrote:
On 4 June 2014 11:23, Peter Zijlstra peterz@infradead.org wrote:
On Wed, Jun 04, 2014 at 09:55:42AM +0100, Morten Rasmussen wrote:
Both running_avg and runnable_avg are affected by other tasks on the same cpus, but in different ways. They are equal if you only have one task on a cpu. If you have more, running_avg will give you the true requirement of the tasks until the cpu is fully utilized. At which point the task running_avg will drop if you add more tasks (the unweighted sum of task running_avgs remains constant).
runnable_avg on the other hand, might be affected as soon as you have two task running on the same cpu if they are runnable at the same time. That isn't necessarily a bad thing for load-balancing purposes, because tasks that are runnable at the same time are likely to be run more efficiently by placing them on different cpus. You might view as at sort of built in concurrency factor, somewhat similar to what Yuyang is proposing. runnable_avg increases rapidly when the cpu is over-utilized.
Agreed.
I'm not sure I see how 100% is possible, but yes I agree that runnable can indeed be inflated due to this queueing effect.
You should only be able to get to 75% worst case for runnable_avg for that example. The total running_avg is 50% no matter if the tasks overlaps or not.
Yes, 75% is what I ended up with.
Can you explain how you reach 75% as it depends on the runtime and a runtime longer than 345ms will end to a 100% load whatever the idletime was previously ?
If the busy period of each task is short enough that the first one to run runs to completion before the other task is scheduled, you get 25% and 50%. But as I said in my other reply, you can get higher task runnable figures if the tasks' busy periods are long enough that you switch between them before the first one completes.
On Wed, Jun 04, 2014 at 10:23:13AM +0100, Peter Zijlstra wrote:
f you had five tasks on one cpu that each have a 25% requirement you can get individual task runnable_avgs of up to 100% (cpu unweighted runnable_load_avg can get up 500%, I think), but the task running_avgs would be 20% each (total of 100%).
Yeah, more or less so indeed. I had not considered the queueing effects on runnable_avg yesterday, so good that that got raised.
That does indeed invalidate my: runnable - running := extra cpu required thing. It ends up being the extra cpu required for 0 latency but gobs of idle time, which is something else entirely.
Agreed, but I think it is still a useful estimate of the required compute capacity. If there is a significant difference between runnable and running on a cpu, the current mix of tasks is not good for latency. However, we need to treat it as a worst case estimate and not necessarily try to move exactly runnable-running worth of tasks to another cpu.
So far I haven't been able to come up with something better.
On 4 June 2014 10:08, Peter Zijlstra peterz@infradead.org wrote:
On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
On 3 June 2014 17:50, Peter Zijlstra peterz@infradead.org wrote:
On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
Since we may do periodic load-balance every 10 ms or so, we will perform a number of load-balances where runnable_avg_sum will mostly be reflecting the state of the world before a change (new task queued or moved a task to a different cpu). If you had have two tasks continuously on one cpu and your other cpu is idle, and you move one of the tasks to the other cpu, runnable_avg_sum will remain unchanged, 47742, on the first cpu while it starts from 0 on the other one. 10 ms later it will have increased a bit, 32 ms later it will be 47742/2, and 345 ms later it reaches 47742. In the mean time the cpu doesn't appear fully utilized and we might decide to put more tasks on it because we don't know if runnable_avg_sum represents a partially utilized cpu (for example a 50% task) or if it will continue to rise and eventually get to 47742.
Ah, no, since we track per task, and update the per-cpu ones when we migrate tasks, the per-cpu values should be instantly updated.
If we were to increase per task storage, we might as well also track running_avg not only runnable_avg.
I agree that the removed running_avg should give more useful information about the the load of a CPU.
The main issue with running_avg is that it's disturbed by other tasks (as point out previously). As a typical example, if we have 2 tasks with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be in the range of [100% - 50%] depending of the parallelism of the runtime of the tasks whereas the reality is 50% and the use of running_avg will return this value
I'm not sure I see how 100% is possible, but yes I agree that runnable can indeed be inflated due to this queueing effect.
In fact, it can be even worse than that, because I forgot to take into account the geometric series effect, which implies that it depends on the runtime (and idle time) of the tasks.
Take 3 examples:
2 tasks that need to run 10ms simultaneously every 40ms. If they share the same CPU, they will be on the runqueue for 20ms (in fact a bit less for one of them). Their load (runnable_avg_sum/runnable_avg_period) will be 33% each, so the unweighted runnable_load_avg of the CPU will be 66%.
2 tasks that need to run 25ms simultaneously every 100ms. If they share the same CPU, they will be on the runqueue for 50ms (in fact a bit less for one of them). Their load (runnable_avg_sum/runnable_avg_period) will be 74% each, so the unweighted runnable_load_avg of the CPU will be 148%.
2 tasks that need to run 50ms simultaneously every 200ms. If they share the same CPU, they will be on the runqueue for 100ms (in fact a bit less for one of them). Their load (runnable_avg_sum/runnable_avg_period) will be 89% each, so the unweighted runnable_load_avg of the CPU will be 180%.
Vincent
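To make the geometric-series effect described above more tangible, here is a small stand-alone toy model (plain userspace C, not kernel code) that iterates a decaying sum with a ~32 ms half-life, the same shape as the per-entity load-tracking series. The busy/idle patterns and the choice of sampling point are assumptions made purely for illustration, so the printed percentages will not match the figures quoted above exactly, but they show the trend: the longer the busy and idle stretches get compared to the half-life, the higher the steady-state runnable figure for the same duty cycle.

#include <stdio.h>
#include <math.h>

int main(void)
{
	/* per-millisecond decay factor chosen so that y^32 = 0.5 (~32 ms half-life) */
	const double y = pow(0.5, 1.0 / 32.0);
	/* {runnable_ms, idle_ms} patterns to compare */
	const int patterns[][2] = { {10, 30}, {20, 20}, {50, 50}, {100, 100} };

	for (unsigned int p = 0; p < sizeof(patterns) / sizeof(patterns[0]); p++) {
		const int runnable = patterns[p][0], idle = patterns[p][1];
		double sum = 0.0, period = 0.0, peak = 0.0;

		/* iterate enough cycles of the runnable/idle pattern to reach steady state */
		for (int cycle = 0; cycle < 50; cycle++) {
			for (int ms = 0; ms < runnable + idle; ms++) {
				sum = sum * y + (ms < runnable ? 1.0 : 0.0);
				period = period * y + 1.0;
				if (ms == runnable - 1)
					peak = sum / period; /* sample at the end of the runnable stretch */
			}
		}
		printf("%3d ms runnable / %3d ms idle: steady-state ratio ~%2.0f%%\n",
		       runnable, idle, 100.0 * peak);
	}
	return 0;
}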
On Wed, Jun 04, 2014 at 10:32:10AM +0100, Vincent Guittot wrote:
On 4 June 2014 10:08, Peter Zijlstra peterz@infradead.org wrote:
On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
On 3 June 2014 17:50, Peter Zijlstra peterz@infradead.org wrote:
On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
Since we may do periodic load-balance every 10 ms or so, we will perform a number of load-balances where runnable_avg_sum will mostly be reflecting the state of the world before a change (new task queued or moved a task to a different cpu). If you had have two tasks continuously on one cpu and your other cpu is idle, and you move one of the tasks to the other cpu, runnable_avg_sum will remain unchanged, 47742, on the first cpu while it starts from 0 on the other one. 10 ms later it will have increased a bit, 32 ms later it will be 47742/2, and 345 ms later it reaches 47742. In the mean time the cpu doesn't appear fully utilized and we might decide to put more tasks on it because we don't know if runnable_avg_sum represents a partially utilized cpu (for example a 50% task) or if it will continue to rise and eventually get to 47742.
Ah, no, since we track per task, and update the per-cpu ones when we migrate tasks, the per-cpu values should be instantly updated.
If we were to increase per task storage, we might as well also track running_avg not only runnable_avg.
I agree that the removed running_avg should give more useful information about the the load of a CPU.
The main issue with running_avg is that it's disturbed by other tasks (as point out previously). As a typical example, if we have 2 tasks with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be in the range of [100% - 50%] depending of the parallelism of the runtime of the tasks whereas the reality is 50% and the use of running_avg will return this value
I'm not sure I see how 100% is possible, but yes I agree that runnable can indeed be inflated due to this queueing effect.
In fact, it can be even worse than that because i forgot to take into account the geometric series effect which implies that it depends of the runtime (idletime) of the task
Take 3 examples:
2 tasks that need to run 10ms simultaneously each 40ms. If they share the same CPU, they will be on the runqueue 20ms (in fact a bit less for one of them), Their load (runnable_avg_sum/runnable_avg_period) will be 33% each so the unweighted runnable_load_avg of the CPU will be 66%
Right, it actually depends on how often you switch between the tasks. If your sched_tick happens every 10 ms, then in this example one task will run for 10 ms and be done, while the other one waits for 10 ms and then runs to completion in 10 ms. The result is that one task is runnable for 10 ms and the other one is runnable for 20 ms. That gives you 25% and 50% for a total of 75%.
2 tasks that need to run 25ms simultaneously each 100ms. If they share the same CPU, they will be on the runqueue 50ms (in fact a bit less for one of them), Their load (runnable_avg_sum/runnable_avg_period) will be 74% each so the unweighted runnable_load_avg of the CPU will be 148%
2 tasks that need to run 50ms simultaneously each 200ms. If they share the same CPU, they will be on the runqueue 100ms (in fact a bit less for one of them), Their load (runnable_avg_sum/runnable_avg_period) will be 89% each so the unweighted runnable_load_avg of the CPU will be 180%
In this case you switch between the tasks before they complete, so both tasks are waiting and runnable will get much higher than 75%. You are right. I was thinking about tasks whose busy period is short enough that they run to completion when they are scheduled.
On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote:
On 4 June 2014 10:08, Peter Zijlstra peterz@infradead.org wrote:
On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
On 3 June 2014 17:50, Peter Zijlstra peterz@infradead.org wrote:
On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
Since we may do periodic load-balance every 10 ms or so, we will perform a number of load-balances where runnable_avg_sum will mostly be reflecting the state of the world before a change (new task queued or moved a task to a different cpu). If you had have two tasks continuously on one cpu and your other cpu is idle, and you move one of the tasks to the other cpu, runnable_avg_sum will remain unchanged, 47742, on the first cpu while it starts from 0 on the other one. 10 ms later it will have increased a bit, 32 ms later it will be 47742/2, and 345 ms later it reaches 47742. In the mean time the cpu doesn't appear fully utilized and we might decide to put more tasks on it because we don't know if runnable_avg_sum represents a partially utilized cpu (for example a 50% task) or if it will continue to rise and eventually get to 47742.
Ah, no, since we track per task, and update the per-cpu ones when we migrate tasks, the per-cpu values should be instantly updated.
If we were to increase per task storage, we might as well also track running_avg not only runnable_avg.
I agree that the removed running_avg should give more useful information about the the load of a CPU.
The main issue with running_avg is that it's disturbed by other tasks (as point out previously). As a typical example, if we have 2 tasks with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be in the range of [100% - 50%] depending of the parallelism of the runtime of the tasks whereas the reality is 50% and the use of running_avg will return this value
I'm not sure I see how 100% is possible, but yes I agree that runnable can indeed be inflated due to this queueing effect.
Let me explain the 75%. Take any one of the above scenarios, call the two tasks A and B, and for a moment assume A always wins and runs first, and then B.
So A will be runnable for 25%, B otoh will be runnable the entire time A is actually running plus its own running time, giving 50%. Together that makes 75%.
If you release the assumption that A runs first, but instead assume they equally win the first execution, you get them averaging at 37.5% each, which combined will still give 75%.
In fact, it can be even worse than that because i forgot to take into account the geometric series effect which implies that it depends of the runtime (idletime) of the task
Take 3 examples:
2 tasks that need to run 10ms simultaneously each 40ms. If they share the same CPU, they will be on the runqueue 20ms (in fact a bit less for one of them), Their load (runnable_avg_sum/runnable_avg_period) will be 33% each so the unweighted runnable_load_avg of the CPU will be 66%
2 tasks that need to run 25ms simultaneously each 100ms. If they share the same CPU, they will be on the runqueue 50ms (in fact a bit less for one of them), Their load (runnable_avg_sum/runnable_avg_period) will be 74% each so the unweighted runnable_load_avg of the CPU will be 148%
2 tasks that need to run 50ms simultaneously each 200ms. If they share the same CPU, they will be on the runqueue 100ms (in fact a bit less for one of them), Their load (runnable_avg_sum/runnable_avg_period) will be 89% each so the unweighted runnable_load_avg of the CPU will be 180%
And this is because the running time is 'large' compared to the decay and we get hit by the weight of the recent state? Yes, I can see that, the avg will fluctuate due to the nature of this thing.
On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote:
On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote:
On 4 June 2014 10:08, Peter Zijlstra peterz@infradead.org wrote:
On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
On 3 June 2014 17:50, Peter Zijlstra peterz@infradead.org wrote:
On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
Since we may do periodic load-balance every 10 ms or so, we will perform a number of load-balances where runnable_avg_sum will mostly be reflecting the state of the world before a change (new task queued or moved a task to a different cpu). If you had have two tasks continuously on one cpu and your other cpu is idle, and you move one of the tasks to the other cpu, runnable_avg_sum will remain unchanged, 47742, on the first cpu while it starts from 0 on the other one. 10 ms later it will have increased a bit, 32 ms later it will be 47742/2, and 345 ms later it reaches 47742. In the mean time the cpu doesn't appear fully utilized and we might decide to put more tasks on it because we don't know if runnable_avg_sum represents a partially utilized cpu (for example a 50% task) or if it will continue to rise and eventually get to 47742.
Ah, no, since we track per task, and update the per-cpu ones when we migrate tasks, the per-cpu values should be instantly updated.
If we were to increase per task storage, we might as well also track running_avg not only runnable_avg.
I agree that the removed running_avg should give more useful information about the the load of a CPU.
The main issue with running_avg is that it's disturbed by other tasks (as point out previously). As a typical example, if we have 2 tasks with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be in the range of [100% - 50%] depending of the parallelism of the runtime of the tasks whereas the reality is 50% and the use of running_avg will return this value
I'm not sure I see how 100% is possible, but yes I agree that runnable can indeed be inflated due to this queueing effect.
Let me explain the 75%, take any one of the above scenarios. Lets call the two tasks A and B, and let for a moment assume A always wins and runs first, and then B.
So A will be runnable for 25%, B otoh will be runnable the entire time A is actually running plus its own running time, giving 50%. Together that makes 75%.
If you release the assumption that A runs first, but instead assume they equally win the first execution, you get them averaging at 37.5% each, which combined will still give 75%.
But that is assuming that the first task gets to run to completion of its busy period. If it uses up its sched_slice and we switch to the other task, they both get to wait.
For example, if the sched_slice is 5 ms and the busy period is 10 ms, the execution pattern would be: A, B, A, B, idle, ... In that case A is runnable for 15 ms and B is for 20 ms. Assuming that the overall period is 40 ms, the A runnable is 37.5% and B is 50%.
On Wed, Jun 04, 2014 at 11:36:19AM +0100, Morten Rasmussen wrote:
On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote:
Let me explain the 75%, take any one of the above scenarios. Lets call the two tasks A and B, and let for a moment assume A always wins and runs first, and then B.
So A will be runnable for 25%, B otoh will be runnable the entire time A is actually running plus its own running time, giving 50%. Together that makes 75%.
If you release the assumption that A runs first, but instead assume they equally win the first execution, you get them averaging at 37.5% each, which combined will still give 75%.
But that is assuming that the first task gets to run to completion of it busy period. If it uses up its sched_slice and we switch to the other tasks, they both get to wait.
For example, if the sched_slice is 5 ms and the busy period is 10 ms, the execution pattern would be: A, B, A, B, idle, ... In that case A is runnable for 15 ms and B is for 20 ms. Assuming that the overall period is 40 ms, the A runnable is 37.5% and B is 50%.
Indeed, with preemption added you can pull this out further. You can then indeed get infinitesimally close to 100% with this.
On 4 June 2014 12:36, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote:
On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote:
On 4 June 2014 10:08, Peter Zijlstra peterz@infradead.org wrote:
On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
On 3 June 2014 17:50, Peter Zijlstra peterz@infradead.org wrote:
On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
Since we may do periodic load-balance every 10 ms or so, we will perform a number of load-balances where runnable_avg_sum will mostly be reflecting the state of the world before a change (new task queued or moved a task to a different cpu). If you have two tasks continuously on one cpu and your other cpu is idle, and you move one of the tasks to the other cpu, runnable_avg_sum will remain unchanged, 47742, on the first cpu while it starts from 0 on the other one. 10 ms later it will have increased a bit, 32 ms later it will be 47742/2, and 345 ms later it reaches 47742. In the mean time the cpu doesn't appear fully utilized and we might decide to put more tasks on it because we don't know if runnable_avg_sum represents a partially utilized cpu (for example a 50% task) or if it will continue to rise and eventually get to 47742.
Ah, no, since we track per task, and update the per-cpu ones when we migrate tasks, the per-cpu values should be instantly updated.
If we were to increase per task storage, we might as well also track running_avg not only runnable_avg.
I agree that the removed running_avg should give more useful information about the the load of a CPU.
The main issue with running_avg is that it's disturbed by other tasks (as point out previously). As a typical example, if we have 2 tasks with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be in the range of [100% - 50%] depending of the parallelism of the runtime of the tasks whereas the reality is 50% and the use of running_avg will return this value
I'm not sure I see how 100% is possible, but yes I agree that runnable can indeed be inflated due to this queueing effect.
Let me explain the 75%, take any one of the above scenarios. Lets call the two tasks A and B, and let for a moment assume A always wins and runs first, and then B.
So A will be runnable for 25%, B otoh will be runnable the entire time A is actually running plus its own running time, giving 50%. Together that makes 75%.
If you release the assumption that A runs first, but instead assume they equally win the first execution, you get them averaging at 37.5% each, which combined will still give 75%.
But that is assuming that the first task gets to run to completion of it busy period. If it uses up its sched_slice and we switch to the other tasks, they both get to wait.
For example, if the sched_slice is 5 ms and the busy period is 10 ms, the execution pattern would be: A, B, A, B, idle, ... In that case A is runnable for 15 ms and B is for 20 ms. Assuming that the overall period is 40 ms, the A runnable is 37.5% and B is 50%.
The exact values for your scheduling example above are: A's runnable will be 47% and B's runnable will be 60% (unless I made a mistake in my computation), and the CPU's runnable will be 60% too.
Vincent
On Wed, Jun 04, 2014 at 12:07:29PM +0100, Vincent Guittot wrote:
On 4 June 2014 12:36, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote:
On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote:
On 4 June 2014 10:08, Peter Zijlstra peterz@infradead.org wrote:
On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
On 3 June 2014 17:50, Peter Zijlstra peterz@infradead.org wrote:
On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
Since we may do periodic load-balance every 10 ms or so, we will perform a number of load-balances where runnable_avg_sum will mostly be reflecting the state of the world before a change (new task queued or moved a task to a different cpu). If you have two tasks continuously on one cpu and your other cpu is idle, and you move one of the tasks to the other cpu, runnable_avg_sum will remain unchanged, 47742, on the first cpu while it starts from 0 on the other one. 10 ms later it will have increased a bit, 32 ms later it will be 47742/2, and 345 ms later it reaches 47742. In the mean time the cpu doesn't appear fully utilized and we might decide to put more tasks on it because we don't know if runnable_avg_sum represents a partially utilized cpu (for example a 50% task) or if it will continue to rise and eventually get to 47742.
Ah, no, since we track per task, and update the per-cpu ones when we migrate tasks, the per-cpu values should be instantly updated.
If we were to increase per task storage, we might as well also track running_avg not only runnable_avg.
I agree that the removed running_avg should give more useful information about the the load of a CPU.
The main issue with running_avg is that it's disturbed by other tasks (as point out previously). As a typical example, if we have 2 tasks with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be in the range of [100% - 50%] depending of the parallelism of the runtime of the tasks whereas the reality is 50% and the use of running_avg will return this value
I'm not sure I see how 100% is possible, but yes I agree that runnable can indeed be inflated due to this queueing effect.
Let me explain the 75%, take any one of the above scenarios. Lets call the two tasks A and B, and let for a moment assume A always wins and runs first, and then B.
So A will be runnable for 25%, B otoh will be runnable the entire time A is actually running plus its own running time, giving 50%. Together that makes 75%.
If you release the assumption that A runs first, but instead assume they equally win the first execution, you get them averaging at 37.5% each, which combined will still give 75%.
But that is assuming that the first task gets to run to completion of it busy period. If it uses up its sched_slice and we switch to the other tasks, they both get to wait.
For example, if the sched_slice is 5 ms and the busy period is 10 ms, the execution pattern would be: A, B, A, B, idle, ... In that case A is runnable for 15 ms and B is for 20 ms. Assuming that the overall period is 40 ms, the A runnable is 37.5% and B is 50%.
The exact value for your scheduling example above is: A runnable will be 47% and B runnable will be 60% (unless i make a mistake in my computation)
I get:
A: 15/40 ms = 37.5%
B: 20/40 ms = 50%
Schedule:
   | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
A:    run     rq    run   ----------- sleeping -------------   run
B:    rq     run     rq    run  ---- sleeping -------------     rq
and CPU runnable will be 60% too
rq->avg.runnable_avg_sum should be 50%. You have two tasks running for 20 ms every 40 ms.
Right?
Morten
On 4 June 2014 13:23, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Wed, Jun 04, 2014 at 12:07:29PM +0100, Vincent Guittot wrote:
On 4 June 2014 12:36, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote:
On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote:
On 4 June 2014 10:08, Peter Zijlstra peterz@infradead.org wrote:
On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
On 3 June 2014 17:50, Peter Zijlstra peterz@infradead.org wrote:
On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
Since we may do periodic load-balance every 10 ms or so, we will perform a number of load-balances where runnable_avg_sum will mostly be reflecting the state of the world before a change (new task queued or moved a task to a different cpu). If you have two tasks continuously on one cpu and your other cpu is idle, and you move one of the tasks to the other cpu, runnable_avg_sum will remain unchanged, 47742, on the first cpu while it starts from 0 on the other one. 10 ms later it will have increased a bit, 32 ms later it will be 47742/2, and 345 ms later it reaches 47742. In the mean time the cpu doesn't appear fully utilized and we might decide to put more tasks on it because we don't know if runnable_avg_sum represents a partially utilized cpu (for example a 50% task) or if it will continue to rise and eventually get to 47742.
Ah, no, since we track per task, and update the per-cpu ones when we migrate tasks, the per-cpu values should be instantly updated.
If we were to increase per task storage, we might as well also track running_avg not only runnable_avg.
I agree that the removed running_avg should give more useful information about the load of a CPU.
The main issue with running_avg is that it's disturbed by other tasks (as pointed out previously). As a typical example, if we have 2 tasks with a load of 25% each on 1 CPU, the unweighted runnable_load_avg will be in the range [50% - 100%] depending on how much the runtimes of the tasks overlap, whereas the reality is 50%, and using running_avg will return this value.
I'm not sure I see how 100% is possible, but yes I agree that runnable can indeed be inflated due to this queueing effect.
Let me explain the 75%, take any one of the above scenarios. Lets call the two tasks A and B, and let for a moment assume A always wins and runs first, and then B.
So A will be runnable for 25%, B otoh will be runnable the entire time A is actually running plus its own running time, giving 50%. Together that makes 75%.
If you release the assumption that A runs first, but instead assume they equally win the first execution, you get them averaging at 37.5% each, which combined will still give 75%.
But that is assuming that the first task gets to run to completion of it busy period. If it uses up its sched_slice and we switch to the other tasks, they both get to wait.
For example, if the sched_slice is 5 ms and the busy period is 10 ms, the execution pattern would be: A, B, A, B, idle, ... In that case A is runnable for 15 ms and B is for 20 ms. Assuming that the overall period is 40 ms, the A runnable is 37.5% and B is 50%.
The exact value for your scheduling example above is: A runnable will be 47% and B runnable will be 60% (unless i make a mistake in my computation)
I get:
A: 15/40 ms = 37.5% B: 20/40 ms = 50%
Schedule:
   | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
A:    run     rq    run   ----------- sleeping -------------   run
B:    rq     run     rq    run  ---- sleeping -------------     rq
and CPU runnable will be 60% too
rq->avg.runnable_avg_sum should be 50%. You have two tasks running for 20 ms every 40 ms.
Right?
OK, I see the misunderstanding. It depends on what we mean by runnable: you take the % of time, whereas I take the runnable_avg_sum/period ratio.
So A is on_rq 15/40 ms = 37.5% of the time, which gives a runnable_avg_sum/runnable_avg_period of 47%.
B is on_rq 20/40 ms = 50% of the time, which gives a runnable_avg_sum/runnable_avg_period of 60%.
And the CPU has a task on its rq 20/40 ms = 50% of the time, which gives a runnable_avg_sum/runnable_avg_period of 60%.
Vincent
Morten
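To illustrate the distinction just resolved (time spent on the runqueue vs. the decayed runnable_avg_sum/runnable_avg_period ratio), here is a small stand-alone toy model of the 40 ms schedule above (userspace C, not kernel code; the 1 ms step and the choice to sample each signal at the end of its runnable stretch are assumptions for illustration). It lands in the same ballpark as the 37.5%/50% time fractions and the 47%/60% ratios discussed above.

#include <stdio.h>
#include <math.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* ~32 ms half-life, one step per ms */
	const char *name[3] = { "A", "B", "CPU" };
	int on_rq[3][40] = { { 0 } };
	double sum[3] = { 0.0 }, period = 0.0, sample[3] = { 0.0 };
	int i, ms, cycle;

	for (ms = 0; ms < 15; ms++)
		on_rq[0][ms] = 1;	/* A: run, rq, run, then sleeps until 40 ms */
	for (ms = 0; ms < 20; ms++) {
		on_rq[1][ms] = 1;	/* B: rq, run, rq, run, then sleeps until 40 ms */
		on_rq[2][ms] = 1;	/* the cpu has something queued for the first 20 ms */
	}

	for (cycle = 0; cycle < 50; cycle++) {
		for (ms = 0; ms < 40; ms++) {
			period = period * y + 1.0;
			for (i = 0; i < 3; i++) {
				sum[i] = sum[i] * y + on_rq[i][ms];
				/* sample each signal at the end of its runnable stretch */
				if (on_rq[i][ms] && (ms == 39 || !on_rq[i][ms + 1]))
					sample[i] = sum[i] / period;
			}
		}
	}
	for (i = 0; i < 3; i++)
		printf("%-3s: decayed runnable ratio ~%2.0f%%\n", name[i], 100.0 * sample[i]);
	return 0;
}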
On Wed, Jun 04, 2014 at 12:52:46PM +0100, Vincent Guittot wrote:
On 4 June 2014 13:23, Morten Rasmussen morten.rasmussen@arm.com wrote:
I get:
A: 15/40 ms = 37.5% B: 20/40 ms = 50%
Schedule:
   | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
A:    run     rq    run   ----------- sleeping -------------   run
B:    rq     run     rq    run  ---- sleeping -------------     rq
and CPU runnable will be 60% too
rq->avg.runnable_avg_sum should be 50%. You have two tasks running for 20 ms every 40 ms.
Right?
ok, i see the misunderstood. it's depends of what we mean by runnable. You take the % of time whereas i take the runnable_avg_sum/period
Right. There is a difference.
so A is on_rq 15/40 ms = 37.5% of the time which gives a runnable_avg_sum/runnable_avg_period of 47% B is on_rq 20/40 ms = 50% of the time which gives a runnable_avg_sum/runnable_avg_period of 60% and CPU has a task on its rq 20/40ms = 50% of the time which gives a runnable_avg_sum/runnable_avg_period of 60%
Yes, that seems about right.
On Wed, Jun 04, 2014 at 02:09:18PM +0100, Morten Rasmussen wrote:
On Wed, Jun 04, 2014 at 12:52:46PM +0100, Vincent Guittot wrote:
On 4 June 2014 13:23, Morten Rasmussen morten.rasmussen@arm.com wrote:
I get:
A: 15/40 ms = 37.5% B: 20/40 ms = 50%
Schedule:
   | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
A:    run     rq    run   ----------- sleeping -------------   run
B:    rq     run     rq    run  ---- sleeping -------------     rq
and CPU runnable will be 60% too
rq->avg.runnable_avg_sum should be 50%. You have two tasks running for 20 ms every 40 ms.
Right?
ok, i see the misunderstood. it's depends of what we mean by runnable. You take the % of time whereas i take the runnable_avg_sum/period
Right. There is a difference.
so A is on_rq 15/40 ms = 37.5% of the time which gives a runnable_avg_sum/runnable_avg_period of 47% B is on_rq 20/40 ms = 50% of the time which gives a runnable_avg_sum/runnable_avg_period of 60% and CPU has a task on its rq 20/40ms = 50% of the time which gives a runnable_avg_sum/runnable_avg_period of 60%
Yes, that seems about right.
If my calculations are right, cfs.runnable_load_avg would peak 15 ms into the period at about 104%. After 20 ms A has decayed more than B has gained so the sum is slightly lower.
On Wed, May 28, 2014 at 01:10:01PM +0100, Morten Rasmussen wrote:
The basic problem is that worst case: sum starting from 0 and period already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods (ms) for sum to reach 47742.
But it takes only 4 periods (4*32ms) to get ~94% of the way there,
So our half-time is 32 * 1024 us (~32 ms), now look at \Sum_i=1,4 1/(2^i) = 0.9375. So 4 periods (4*32*1024 us ~= 131 ms) gets us ~94% of the way there.
Now, granted, 4 periods of 32-ish ms is still very long, but nowhere near as horrid as the full convergence.
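For reference, the convergence Peter describes can be computed directly: starting from an empty sum, after t ms the series has reached 1 - 0.5^(t/32) of its maximum. A trivial stand-alone sketch (not kernel code, just the arithmetic):

#include <stdio.h>
#include <math.h>

int main(void)
{
	/* fraction of the maximum reached after t ms, with a 32 ms half-life */
	for (int t = 32; t <= 160; t += 32) {
		double frac = 1.0 - pow(0.5, t / 32.0);
		printf("after %3d ms: %5.2f%% of the maximum (~%5.0f out of 47742)\n",
		       t, 100.0 * frac, 47742.0 * frac);
	}
	return 0;
}

This prints 50% at 32 ms, 75% at 64 ms, 87.5% at 96 ms and 93.75% at 128 ms, matching the figures quoted in this thread.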
On Wed, May 28, 2014 at 01:10:01PM +0100, Morten Rasmussen wrote:
The rq runnable_avg_{sum, period} give a very long term view of the cpu utilization (I will use the term utilization instead of activity as I think that is what we are talking about here). IMHO, it is too slow to be used as basis for load balancing decisions. I think that was also agreed upon in the last discussion related to this topic [1].
The basic problem is that worst case: sum starting from 0 and period already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods (ms) for sum to reach 47742. In other words, the cpu might have been fully utilized for 345 ms before it is considered fully utilized. Periodic load-balancing happens much more frequently than that.
Like said earlier the 94% mark is actually hit much sooner, but yes, likely still too slow.
50% at 32 ms, 75% at 64 ms, 87.5% at 96 ms, etc..
Also, if load-balancing actually moves tasks around it may take quite a while before runnable_avg_sum actually reflects this change. The next periodic load-balance is likely to happen before runnable_avg_sum has reflected the result of the previous periodic load-balance.
To avoid these problems, we need to base utilization on a metric which is updated instantaneously when we add/remove tasks to a cpu (or a least fast enough that we don't see the above problems).
So the per-task-load-tracking stuff already does that. It updates the per-cpu load metrics on migration. See {de,en}queue_entity_load_avg().
And keeping an unweighted per-cpu variant isn't that much more work.
In the previous discussion [1] it was suggested that a sum of unweighted task runnable_avg_{sum,period} ratio instead. That is, an unweighted equivalent to weighted_cpuload(). That isn't a perfect solution either. It is fine as long as the cpus are not fully utilized, but when they are we need to use weighted_cpuload() to preserve smp_nice. What to do around the tipping point needs more thought, but I think that is currently the best proposal for a solution for task and cpu utilization.
I'm not too worried about the tipping point, per task runnable figures of an overloaded cpu are higher, so migration between an overloaded cpu and an underloaded cpu are going to be tricky no matter what we do.
rq runnable_avg_sum is useful for decisions where we need a longer term view of the cpu utilization, but I don't see how we can use as cpu utilization metric for load-balancing decisions at wakeup or periodically.
So keeping one with a faster decay would add extra per-task storage. But would be possible..
On Tue, Jun 03, 2014 at 04:40:58PM +0100, Peter Zijlstra wrote:
On Wed, May 28, 2014 at 01:10:01PM +0100, Morten Rasmussen wrote:
The rq runnable_avg_{sum, period} give a very long term view of the cpu utilization (I will use the term utilization instead of activity as I think that is what we are talking about here). IMHO, it is too slow to be used as basis for load balancing decisions. I think that was also agreed upon in the last discussion related to this topic [1].
The basic problem is that worst case: sum starting from 0 and period already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods (ms) for sum to reach 47742. In other words, the cpu might have been fully utilized for 345 ms before it is considered fully utilized. Periodic load-balancing happens much more frequently than that.
Like said earlier the 94% mark is actually hit much sooner, but yes, likely still too slow.
50% at 32 ms, 75% at 64 ms, 87.5% at 96 ms, etc..
Agreed.
Also, if load-balancing actually moves tasks around it may take quite a while before runnable_avg_sum actually reflects this change. The next periodic load-balance is likely to happen before runnable_avg_sum has reflected the result of the previous periodic load-balance.
To avoid these problems, we need to base utilization on a metric which is updated instantaneously when we add/remove tasks to a cpu (or a least fast enough that we don't see the above problems).
So the per-task-load-tracking stuff already does that. It updates the per-cpu load metrics on migration. See {de,en}queue_entity_load_avg().
I think there is some confusion here. There are two per-cpu load metrics that track differently.
The cfs.runnable_load_avg is basically the sum of the load contributions of the tasks on the cfs rq. The sum gets updated whenever tasks are {en,de}queued by adding/subtracting the load contribution of the task being added/removed. That is the one you are referring to.
The rq runnable_avg_sum (actually rq->avg.runnable_avg_{sum, period}) is tracking whether the cpu has something to do or not. It doesn't matter how many tasks are runnable or what their load is. It is updated in update_rq_runnable_avg(). It increases when rq->nr_running > 0 and decays if not. It also takes time spent running rt tasks into account in idle_{enter, exit}_fair(). So if you remove tasks from the rq, this metric will start decaying and eventually get to 0, unlike cfs.runnable_load_avg, where the task's load contribution is subtracted every time a task is removed. The rq runnable_avg_sum is the one being used in this patch set.
Ben, pjt, please correct me if I'm wrong.
And keeping an unweighted per-cpu variant isn't that much more work.
Agreed.
In the previous discussion [1] it was suggested that a sum of unweighted task runnable_avg_{sum,period} ratio instead. That is, an unweighted equivalent to weighted_cpuload(). That isn't a perfect solution either. It is fine as long as the cpus are not fully utilized, but when they are we need to use weighted_cpuload() to preserve smp_nice. What to do around the tipping point needs more thought, but I think that is currently the best proposal for a solution for task and cpu utilization.
I'm not too worried about the tipping point, per task runnable figures of an overloaded cpu are higher, so migration between an overloaded cpu and an underloaded cpu are going to be tricky no matter what we do.
Yes, agreed. I just got the impression that you were concerned about smp_nice last time we discussed this.
rq runnable_avg_sum is useful for decisions where we need a longer term view of the cpu utilization, but I don't see how we can use as cpu utilization metric for load-balancing decisions at wakeup or periodically.
So keeping one with a faster decay would add extra per-task storage. But would be possible..
I have had that thought when we discussed potential replacements for cpu_load[]. It will require some messing around with the nicely optimized load tracking maths if we want to have load tracking with a different y-coefficient.
On Tue, Jun 03, 2014 at 06:16:28PM +0100, Morten Rasmussen wrote:
I'm not too worried about the tipping point, per task runnable figures of an overloaded cpu are higher, so migration between an overloaded cpu and an underloaded cpu are going to be tricky no matter what we do.
Yes, agreed. I just got the impression that you were concerned about smp_nice last time we discussed this.
Well, yes, we need to keep that working, but the exact detail around the tipping point are near impossible to get right, so I'm not too bothered there.
rq runnable_avg_sum is useful for decisions where we need a longer term view of the cpu utilization, but I don't see how we can use as cpu utilization metric for load-balancing decisions at wakeup or periodically.
So keeping one with a faster decay would add extra per-task storage. But would be possible..
I have had that thought when we discussed potential replacements for cpu_load[]. It will require some messing around with the nicely optimized load tracking maths if we want to have load tracking with a different y-coefficient.
My initial thought was a y=0.5, which is really >>=1. But yes, if we want something else that'll get messy real fast methinks.
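As a purely illustrative sketch of the y=0.5 / ">>= 1" remark (an assumption about how such a faster-decaying average might look, not a proposal or existing kernel code):

#include <stdio.h>

/*
 * Minimal sketch of the ">>= 1" idea: a decaying average with y = 0.5 per
 * period needs no lookup tables, because decaying is just a shift.
 */
static unsigned long fast_decay_update(unsigned long avg, unsigned long sample)
{
	avg >>= 1;		/* y = 0.5: halve the history every period */
	avg += sample >> 1;	/* and blend in half of the new sample */
	return avg;
}

int main(void)
{
	unsigned long avg = 0;

	/* a cpu that suddenly becomes 100% busy (sample = 1024 per period) */
	for (int period = 1; period <= 6; period++) {
		avg = fast_decay_update(avg, 1024);
		printf("period %d: avg = %lu\n", period, avg);
	}
	return 0;
}

With y = 0.5 the average already reaches ~94% of its final value after 4 periods, which is the trade-off being discussed: much faster reaction, much shorter memory.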
On Tue, Jun 03, 2014 at 06:16:28PM +0100, Morten Rasmussen wrote:
So the per-task-load-tracking stuff already does that. It updates the per-cpu load metrics on migration. See {de,en}queue_entity_load_avg().
I think there is some confusion here. There are two per-cpu load metrics that tracks differently.
The cfs.runnable_load_avg is basically the sum of the load contributions of the tasks on the cfs rq. The sum gets updated whenever tasks are {en,de}queued by adding/subtracting the load contribution of the task being added/removed. That is the one you are referring to.
The rq runnable_avg_sum (actually rq->avg.runnable_avg_{sum, period}) is tracking whether the cpu has something to do or not. It doesn't matter many tasks are runnable or what their load is. It is updated in update_rq_runnable_avg(). It increases when rq->nr_running > 0 and decays if not. It also takes time spent running rt tasks into account in idle_{enter, exit}_fair(). So if you remove tasks from the rq, this metric will start decaying and eventually get to 0, unlike the cfs.runnable_load_avg where the task load contribution subtracted every time a task is removed. The rq runnable_avg_sum is the one being used in this patch set.
Ben, pjt, please correct me if I'm wrong.
Argh, ok I completely missed that. I think cfs.runnable_load_avg is the sane number; I'm not entirely sure what rq->avg.runnable_avg is good for, it seems a weird metric on first consideration.
Will have to ponder that a bit more.
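For readers following the thread, here is a toy illustration (made-up numbers, crude decay, userspace C rather than kernel code) of the difference Morten describes between the two per-cpu signals: the task contribution leaves cfs.runnable_load_avg immediately on dequeue, while rq->avg.runnable_avg_sum only starts decaying from that point on.

#include <stdio.h>

struct toy_rq {
	unsigned long runnable_load_avg;	/* sum of the queued tasks' load contributions */
	unsigned long runnable_avg_sum;		/* "the cpu had something to do" history */
};

/* a blocked or migrated task's contribution leaves runnable_load_avg at once */
static void toy_dequeue_task(struct toy_rq *rq, unsigned long task_contrib)
{
	rq->runnable_load_avg -= task_contrib;
	/* runnable_avg_sum is untouched here; it only decays as idle time passes */
}

/* crude stand-in for one millisecond of decay (y^32 = 1/2 is roughly 2.2% per ms) */
static void toy_tick_1ms_idle(struct toy_rq *rq)
{
	rq->runnable_avg_sum -= rq->runnable_avg_sum / 46;
}

int main(void)
{
	/* one always-running task on the cpu: both signals are saturated */
	struct toy_rq rq = { .runnable_load_avg = 1024, .runnable_avg_sum = 47742 };

	toy_dequeue_task(&rq, 1024);	/* the task migrates away, the cpu goes idle */
	printf("right after migration: runnable_load_avg=%lu runnable_avg_sum=%lu\n",
	       rq.runnable_load_avg, rq.runnable_avg_sum);

	for (int ms = 0; ms < 32; ms++)
		toy_tick_1ms_idle(&rq);
	printf("32 ms of idle later:   runnable_load_avg=%lu runnable_avg_sum=%lu\n",
	       rq.runnable_load_avg, rq.runnable_avg_sum);
	return 0;
}

The first signal drops to 0 instantly; the second has only roughly halved after ~32 ms, which is exactly the sluggishness being debated.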
The basic problem is that worst case: sum starting from 0 and period already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods (ms) for sum to reach 47742. In other words, the cpu might have been fully utilized for 345 ms before it is considered fully utilized. Periodic load-balancing happens much more frequently than that.
Like said earlier the 94% mark is actually hit much sooner, but yes, likely still too slow.
50% at 32 ms, 75% at 64 ms, 87.5% at 96 ms, etc..
In the previous discussion [1] it was suggested that a sum of unweighted task runnable_avg_{sum,period} ratio instead. That is, an unweighted equivalent to weighted_cpuload(). That isn't a perfect solution either. It is fine as long as the cpus are not fully utilized, but when they are we need to use weighted_cpuload() to preserve smp_nice. What to do around the tipping point needs more thought, but I think that is currently the best proposal for a solution for task and cpu utilization.
I'm not too worried about the tipping point, per task runnable figures of an overloaded cpu are higher, so migration between an overloaded cpu and an underloaded cpu are going to be tricky no matter what we do.
Hi,
Can I join this dicussion late?
As I understand it, you are talking about a metric for cpu activity. And the issues with runnable_avg_sum are its sluggishness in reflecting the latest changes, and the need for unweighted load averages.
You might be aware of my recent proposal of CPU ConCurrency (CC). It is 1) an average of nr_running, or 2) nr_running-weighted CPU utilization. So it is a combination of CPU utilization and run queue length (both factored in naturally). It meets the needs you talked about, I think. You can take it as a candidate, or at least we can talk about it?
Thanks, Yuyang
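For context, a minimal sketch of the general idea of a decayed nr_running average as described in the paragraph above (this is only an illustration of the concept; it is not Yuyang's actual CC implementation, and the sampling period, fixed-point scale and decay factor are arbitrary assumptions):

#include <stdio.h>

struct cc_avg {
	unsigned long avg;	/* fixed point, 1024 == one task on average */
};

static void cc_update(struct cc_avg *cc, unsigned int nr_running)
{
	/* y = 0.5 per sample period, same shift trick as discussed earlier */
	cc->avg = (cc->avg >> 1) + ((nr_running * 1024UL) >> 1);
}

int main(void)
{
	struct cc_avg cc = { 0 };
	/* nr_running seen at each sample: two tasks for a while, then one, then idle */
	unsigned int samples[] = { 2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0 };

	for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		cc_update(&cc, samples[i]);
		printf("sample %2u: nr_running=%u cc=%lu/1024\n", i, samples[i], cc.avg);
	}
	return 0;
}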
On 23/05/14 16:53, Vincent Guittot wrote:
Monitor the activity level of each group of each sched_domain level. The activity is the amount of cpu_power that is currently used on a CPU or group of CPUs. We use the runnable_avg_sum and _period to evaluate this activity level. In the special use case where the CPU is fully loaded by more than 1 task, the activity level is set above the cpu_power in order to reflect the overload of the CPU
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
[...]
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5518,6 +5535,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_power;
+	unsigned long group_activity; /* Total activity of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int group_capacity;
 	unsigned int idle_cpus;
@@ -5538,6 +5556,7 @@ struct sd_lb_stats {
 	struct sched_group *busiest;	/* Busiest group in this sd */
 	struct sched_group *local;	/* Local group in this sd */
 	unsigned long total_load;	/* Total load of all groups in sd */
+	unsigned long total_activity;	/* Total activity of all groups in sd */
 	unsigned long total_pwr;	/* Total power of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
@@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.busiest = NULL,
 		.local = NULL,
 		.total_load = 0UL,
+		.total_activity = 0UL,
AFAICS, total_activity is not used right now. Do you intend to use it to calculate something like avg_activity later (like total_load/avg_load)?
 		.total_pwr = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
[...]
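To make the commit message above easier to follow, here is a hypothetical sketch of how such an activity value could be derived from runnable_avg_{sum,period}. It is NOT the elided hunk of the patch, just an illustration, and it assumes the existing power_of() helper and the rq->avg fields mentioned elsewhere in this thread.

/*
 * Hypothetical sketch only, not the patch itself.
 */
static unsigned long sketch_cpu_activity(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	u32 sum = rq->avg.runnable_avg_sum;
	u32 period = rq->avg.runnable_avg_period;

	if (!period)
		return 0;

	/*
	 * Fully busy with more than one runnable task: report something
	 * above the cpu_power so the overload stays visible.
	 */
	if (sum >= period && rq->nr_running > 1)
		return power_of(cpu) + rq->nr_running;

	/* otherwise scale the cpu_power by the fraction of time the cpu was busy */
	return (sum * power_of(cpu)) / period;
}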
On 30 May 2014 11:50, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 23/05/14 16:53, Vincent Guittot wrote:
Monitor the activity level of each group of each sched_domain level. The activity is the amount of cpu_power that is currently used on a CPU or group of CPUs. We use the runnable_avg_sum and _period to evaluate this activity level. In the special use case where the CPU is fully loaded by more than 1 task, the activity level is set above the cpu_power in order to reflect the overload of the CPU
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
[...]
/*
- select_task_rq_fair: Select target runqueue for the waking task in domains
- that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5518,6 +5535,7 @@ struct sg_lb_stats { unsigned long sum_weighted_load; /* Weighted load of group's tasks */ unsigned long load_per_task; unsigned long group_power;
unsigned long group_activity; /* Total activity of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ unsigned int group_capacity; unsigned int idle_cpus;
@@ -5538,6 +5556,7 @@ struct sd_lb_stats { struct sched_group *busiest; /* Busiest group in this sd */ struct sched_group *local; /* Local group in this sd */ unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_activity; /* Total activity of all groups in sd */ unsigned long total_pwr; /* Total power of all groups in sd */ unsigned long avg_load; /* Average load across all groups in sd */
@@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) .busiest = NULL, .local = NULL, .total_load = 0UL,
.total_activity = 0UL,
AFAICS, total_activity is not used right now. Do you intend to use it to calculate something like avg_activity later (like total_load/avg_load)?
This patch only creates group_activity and total_activity. The use of these statistics appears in the next patches. I have not planned to compute such an avg_activity for the moment, mainly because I haven't needed it yet.
Do you have in mind any use of such avg_activity ?
Vincent
.total_pwr = 0UL, .busiest_stat = { .avg_load = 0UL,
[...]
On 30/05/14 20:20, Vincent Guittot wrote:
On 30 May 2014 11:50, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 23/05/14 16:53, Vincent Guittot wrote:
Monitor the activity level of each group of each sched_domain level. The activity is the amount of cpu_power that is currently used on a CPU or group of CPUs. We use the runnable_avg_sum and _period to evaluate this activity level. In the special use case where the CPU is fully loaded by more than 1 task, the activity level is set above the cpu_power in order to reflect the overload of the CPU
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
[...]
/*
- select_task_rq_fair: Select target runqueue for the waking task in domains
- that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5518,6 +5535,7 @@ struct sg_lb_stats { unsigned long sum_weighted_load; /* Weighted load of group's tasks */ unsigned long load_per_task; unsigned long group_power;
unsigned long group_activity; /* Total activity of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ unsigned int group_capacity; unsigned int idle_cpus;
@@ -5538,6 +5556,7 @@ struct sd_lb_stats { struct sched_group *busiest; /* Busiest group in this sd */ struct sched_group *local; /* Local group in this sd */ unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_activity; /* Total activity of all groups in sd */ unsigned long total_pwr; /* Total power of all groups in sd */ unsigned long avg_load; /* Average load across all groups in sd */
@@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) .busiest = NULL, .local = NULL, .total_load = 0UL,
.total_activity = 0UL,
AFAICS, total_activity is not used right now. Do you intend to use it to calculate something like avg_activity later (like total_load/avg_load)?
this patch only creates group_activity and total_activity.The use of these statistics appears on the next patches. I have not planned to compute such avg_activity for the moment mainly because i have need it yet.
Do you have in mind any use of such avg_activity ?
No, not really, but I still don't see (taking all your patches) where sds->total_activity is used.
$ find . -name "*.[ch]" | xargs grep total_activity
./kernel/sched/fair.c:	unsigned long total_activity;	/* Total activity of all groups in sd */
./kernel/sched/fair.c:		.total_activity = 0UL,
./kernel/sched/fair.c:	sds->total_activity += sgs->group_activity;
OTOH, sgs->group_activity is used to set sgs->group_capacity, which is then used in if conditions in update_sd_pick_busiest(), update_sd_lb_stats() and find_busiest_group().
[...]
On 1 June 2014 13:33, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 30/05/14 20:20, Vincent Guittot wrote:
On 30 May 2014 11:50, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 23/05/14 16:53, Vincent Guittot wrote:
Monitor the activity level of each group of each sched_domain level. The activity is the amount of cpu_power that is currently used on a CPU or group of CPUs. We use the runnable_avg_sum and _period to evaluate this activity level. In the special use case where the CPU is fully loaded by more than 1 task, the activity level is set above the cpu_power in order to reflect the overload of the CPU
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
[...]
/*
- select_task_rq_fair: Select target runqueue for the waking task in
domains
- that have the 'sd_flag' flag set. In practice, this is
SD_BALANCE_WAKE, @@ -5518,6 +5535,7 @@ struct sg_lb_stats { unsigned long sum_weighted_load; /* Weighted load of group's tasks */ unsigned long load_per_task; unsigned long group_power;
unsigned long group_activity; /* Total activity of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ unsigned int group_capacity; unsigned int idle_cpus;
@@ -5538,6 +5556,7 @@ struct sd_lb_stats { struct sched_group *busiest; /* Busiest group in this sd */ struct sched_group *local; /* Local group in this sd */ unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_activity; /* Total activity of all groups in
sd */ unsigned long total_pwr; /* Total power of all groups in sd */ unsigned long avg_load; /* Average load across all groups in sd */
@@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) .busiest = NULL, .local = NULL, .total_load = 0UL,
.total_activity = 0UL,
AFAICS, total_activity is not used right now. Do you intend to use it to calculate something like avg_activity later (like total_load/avg_load)?
this patch only creates group_activity and total_activity.The use of these statistics appears on the next patches. I have not planned to compute such avg_activity for the moment mainly because i have need it yet.
Do you have in mind any use of such avg_activity ?
No, not really but I still don't see (taken all your patches) where sds->total_activity is used.
$ find . -name "*.[ch]" | xargs grep total_activity ./kernel/sched/fair.c: unsigned long total_activity; /* Total activity of all groups in sd */ ./kernel/sched/fair.c: .total_activity = 0UL, ./kernel/sched/fair.c: sds->total_activity += sgs->group_activity;
OOTH, sgs->group_activity is used to set sgs->group_capacity which is then used in if conditions in update_sd_pick_busiest(), update_sd_lb_stats() and find_busiest_group().
Sorry, I misread your 1st email and hadn't understood that you were specifically speaking about total_activity. When I created the activity statistic, I added both the sg and sd fields, but I finally only use the sg one. If I don't need the sd field after updating my patches, I will remove it.
Thanks Vincent
[...]
Currently the task always wakes affine on this_cpu if the latter is idle. Before waking up the task on this_cpu, we check that this_cpu capacity is not significantly reduced because of RT tasks or irq activity.
Use cases where the number of irqs and the time spent under irq are significant will benefit from this, because the task that is woken up by an irq or softirq will not use the same CPU as the irq (and softirq) but an idle one which shares its LLC.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c01d8b6..e8a30f9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4241,6 +4241,7 @@ static int wake_wide(struct task_struct *p) static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) { s64 this_load, load; + s64 this_eff_load, prev_eff_load; int idx, this_cpu, prev_cpu; struct task_group *tg; unsigned long weight; @@ -4284,21 +4285,21 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) * Otherwise check if either cpus are near enough in load to allow this * task to be woken on this_cpu. */ - if (this_load > 0) { - s64 this_eff_load, prev_eff_load; + this_eff_load = 100; + this_eff_load *= power_of(prev_cpu); + + prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2; + prev_eff_load *= power_of(this_cpu);
- this_eff_load = 100; - this_eff_load *= power_of(prev_cpu); + if (this_load > 0) { this_eff_load *= this_load + effective_load(tg, this_cpu, weight, weight);
- prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2; - prev_eff_load *= power_of(this_cpu); prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight); + } + + balanced = this_eff_load <= prev_eff_load;
- balanced = this_eff_load <= prev_eff_load; - } else - balanced = true; schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
if (!balanced)
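For reference, this is roughly what the comparison looks like once the hunks above are applied (a consolidated view of the patched code, not an additional change): the power_of() terms are always applied, so even when this_load == 0 the decision depends on the relative capacity of prev_cpu and this_cpu.

	this_eff_load = 100;
	this_eff_load *= power_of(prev_cpu);

	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
	prev_eff_load *= power_of(this_cpu);

	if (this_load > 0) {
		this_eff_load *= this_load +
			effective_load(tg, this_cpu, weight, weight);

		prev_eff_load *= load +
			effective_load(tg, prev_cpu, 0, weight);
	}

	balanced = this_eff_load <= prev_eff_load;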
On Fri, May 23, 2014 at 05:53:03PM +0200, Vincent Guittot wrote:
Currently the task always wakes affine on this_cpu if the latter is idle. Before waking up the task on this_cpu, we now check that this_cpu's capacity is not significantly reduced because of RT tasks or irq activity.
Use cases where the number of irqs and the time spent under irq is significant will benefit from this, because the task that is woken up by an irq or softirq will not use the same CPU as the irq (and softirq) but an idle one that shares its LLC.
OK, so I'm having a terrible time parsing the above.
So looking at the patch, you make balanced false even when this_load == 0, when the effective power/capacity (nico's patches haven't fully sunk in yet) of this cpu is less than that of the previous cpu.
Is that right?
Now I'm only struggling to understand the rationale for this, its got LLC in there somewhere, but I'm failing to comprehend.
On 28 May 2014 12:58, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:53:03PM +0200, Vincent Guittot wrote:
Currently the task always wakes affine on this_cpu if the latter is idle. Before waking up the task on this_cpu, we now check that this_cpu's capacity is not significantly reduced because of RT tasks or irq activity.
Use cases where the number of irqs and the time spent under irq is significant will benefit from this, because the task that is woken up by an irq or softirq will not use the same CPU as the irq (and softirq) but an idle one that shares its LLC.
OK, so I'm having a terrible time parsing the above.
So looking at the patch, you make balanced false even when this_load == 0, when the effective power/capacity (nico's patches haven't fully sunk in yet) of this cpu is less than that of the previous cpu.
Is that right?
yes,
Now I'm only struggling to understand the rationale for this, its got LLC in there somewhere, but I'm failing to comprehend.
Ah.. I had probably wrongly assumed that wake_affine was only done at the MC or SMT level, but after reading more deeply the flags configuration of all sched_domain levels, my assumption is not true. So I need to test the sd flags to make sure that the CPUs share their cache at this level.
Hi Vincent, On 5/28/14, 7:15 PM, Vincent Guittot wrote:
On 28 May 2014 12:58, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:53:03PM +0200, Vincent Guittot wrote:
[snip]
Now I'm only struggling to understand the rationale for this, its got LLC in there somewhere, but I'm failing to comprehend.
Ah.. I had probably wrongly assumed that wake_affine was only done at the MC or SMT level, but after reading more deeply the flags configuration of all sched_domain levels, my assumption is not true.
At what level should the wake_affine happen?
Regards, Wanpeng Li
So I need to test the sd flags to make sure that the CPUs share their cache at this level.
On 24 November 2014 at 01:34, Wanpeng Li kernellwp@gmail.com wrote:
Hi Vincent, On 5/28/14, 7:15 PM, Vincent Guittot wrote:
On 28 May 2014 12:58, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:53:03PM +0200, Vincent Guittot wrote:
[snip]
Now I'm only struggling to understand the rationale for this, its got LLC in there somewhere, but I'm failing to comprehend.
Ah.. I had probably wrongly assumed that wake_affine was only done at the MC or SMT level, but after reading more deeply the flags configuration of all sched_domain levels, my assumption is not true.
At what level should the wake_affine happen?
In the 1st version of this patch, it was proposed to test the cpu_capacity of the idle CPU only if the CPUs were sharing their cache.
Regards, Vincent
Regards, Wanpeng Li
So I need to test the sd flags to make sure that the CPUs share their cache at this level.
If the CPU is used for handling a lot of IRQs, trigger a load balance to check whether it's worth moving its tasks to another CPU that has more capacity.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e8a30f9..2501e49 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5948,6 +5948,13 @@ static bool update_sd_pick_busiest(struct lb_env *env, if (sgs->sum_nr_running > sgs->group_capacity) return true;
+ /* + * The group capacity is reduced probably because of activity from other + * sched class or interrupts which use part of the available capacity + */ + if ((sg->sgp->power_orig * 100) > (sgs->group_power * env->sd->imbalance_pct)) + return true; + if (sgs->group_imb) return true;
@@ -7282,6 +7289,12 @@ static inline int nohz_kick_needed(struct rq *rq)
if (nr_busy > 1) goto need_kick_unlock; + + if ((rq->cfs.h_nr_running >= 1) + && ((rq->cpu_power * sd->imbalance_pct) < + (rq->cpu_power_orig * 100))) + goto need_kick_unlock; + }
sd = rcu_dereference(per_cpu(sd_asym, cpu));
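Both new checks above use the same pattern: act when power_orig * 100 > power * imbalance_pct, i.e. when more than (imbalance_pct - 100) / imbalance_pct of the original capacity is consumed by rt/irq activity. A small standalone illustration, assuming imbalance_pct = 125 (a common sched_domain default; the real value depends on the topology level):

	#include <stdio.h>

	int main(void)
	{
		unsigned long power_orig = 1024;	/* SCHED_POWER_SCALE */
		unsigned long imbalance_pct = 125;	/* assumed value */
		unsigned long power;

		/* the check fires once power drops below 100/125 = 80% of power_orig */
		for (power = 1024; power >= 768; power -= 64)
			printf("power=%4lu kick=%d\n", power,
			       power_orig * 100 > power * imbalance_pct);
		return 0;
	}

With these numbers the kick only fires once cpu_power has dropped below roughly 80% of cpu_power_orig.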
On Fri, May 23, 2014 at 05:53:04PM +0200, Vincent Guittot wrote:
If the CPU is used for handling a lot of IRQs, trigger a load balance to check whether it's worth moving its tasks to another CPU that has more capacity.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e8a30f9..2501e49 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5948,6 +5948,13 @@ static bool update_sd_pick_busiest(struct lb_env *env, if (sgs->sum_nr_running > sgs->group_capacity) return true;
- /*
* The group capacity is reduced probably because of activity from other
* sched class or interrupts which use part of the available capacity
*/
- if ((sg->sgp->power_orig * 100) > (sgs->group_power * env->sd->imbalance_pct))
return true;
- if (sgs->group_imb) return true;
But we should already do this because the load numbers are scaled with the power/capacity figures. If one CPU gets significantly less time to run fair tasks, its effective load would spike and it'd get to be selected here anyway.
Or am I missing something?
@@ -7282,6 +7289,12 @@ static inline int nohz_kick_needed(struct rq *rq) if (nr_busy > 1) goto need_kick_unlock;
if ((rq->cfs.h_nr_running >= 1)
&& ((rq->cpu_power * sd->imbalance_pct) <
(rq->cpu_power_orig * 100)))
goto need_kick_unlock;
- }
sd = rcu_dereference(per_cpu(sd_asym, cpu));
OK, so there you're kicking the idle balancer to try and get another CPU to pull some load? That makes sense I suppose.
That function is pretty horrible though; how about something like this first?
--- kernel/sched/fair.c | 29 +++++++++++++---------------- 1 file changed, 13 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c9617b73bcc0..47fb96e6fa83 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7215,15 +7215,16 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) * - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler * domain span are idle. */ -static inline int nohz_kick_needed(struct rq *rq) +static inline bool nohz_kick_needed(struct rq *rq) { unsigned long now = jiffies; struct sched_domain *sd; struct sched_group_power *sgp; int nr_busy, cpu = rq->cpu; + bool kick = false;
if (unlikely(rq->idle_balance)) - return 0; + return false;
/* * We may be recently in ticked or tickless idle mode. At the first @@ -7237,38 +7238,34 @@ static inline int nohz_kick_needed(struct rq *rq) * balancing. */ if (likely(!atomic_read(&nohz.nr_cpus))) - return 0; + return false;
if (time_before(now, nohz.next_balance)) - return 0; + return false;
if (rq->nr_running >= 2) - goto need_kick; + return true;
rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_busy, cpu)); - if (sd) { sgp = sd->groups->sgp; nr_busy = atomic_read(&sgp->nr_busy_cpus);
- if (nr_busy > 1) - goto need_kick_unlock; + if (nr_busy > 1) { + kick = true; + goto unlock; + } }
sd = rcu_dereference(per_cpu(sd_asym, cpu)); - if (sd && (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu)) - goto need_kick_unlock; - - rcu_read_unlock(); - return 0; + kick = true;
-need_kick_unlock: +unlock: rcu_read_unlock(); -need_kick: - return 1; + return kick; } #else static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
On 29 May 2014 11:50, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:53:04PM +0200, Vincent Guittot wrote:
If the CPU is used for handling a lot of IRQs, trigger a load balance to check whether it's worth moving its tasks to another CPU that has more capacity.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e8a30f9..2501e49 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5948,6 +5948,13 @@ static bool update_sd_pick_busiest(struct lb_env *env, if (sgs->sum_nr_running > sgs->group_capacity) return true;
/*
* The group capacity is reduced probably because of activity from other
* sched class or interrupts which use part of the available capacity
*/
if ((sg->sgp->power_orig * 100) > (sgs->group_power * env->sd->imbalance_pct))
return true;
if (sgs->group_imb) return true;
But we should already do this because the load numbers are scaled with the power/capacity figures. If one CPU gets significant less time to run fair tasks, its effective load would spike and it'd get to be selected here anyway.
Or am I missing something?
The CPU could have been picked only when the capacity becomes null (which occurs when the cpu_power goes below half the default SCHED_POWER_SCALE). And even after that, there were some conditions in find_busiest_group that were bypassing this busiest group.
@@ -7282,6 +7289,12 @@ static inline int nohz_kick_needed(struct rq *rq)
if (nr_busy > 1) goto need_kick_unlock;
if ((rq->cfs.h_nr_running >= 1)
&& ((rq->cpu_power * sd->imbalance_pct) <
(rq->cpu_power_orig * 100)))
goto need_kick_unlock;
} sd = rcu_dereference(per_cpu(sd_asym, cpu));
OK, so there you're kicking the idle balancer to try and get another CPU to pull some load? That makes sense I suppose.
and especially if we have idle CPUs and one task on the CPU with reduced capacity
That function is pretty horrible though; how about something like this first?
ok, i will integrate this modification in next version
kernel/sched/fair.c | 29 +++++++++++++---------------- 1 file changed, 13 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c9617b73bcc0..47fb96e6fa83 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7215,15 +7215,16 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
- For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
domain span are idle.
*/ -static inline int nohz_kick_needed(struct rq *rq) +static inline bool nohz_kick_needed(struct rq *rq) { unsigned long now = jiffies; struct sched_domain *sd; struct sched_group_power *sgp; int nr_busy, cpu = rq->cpu;
bool kick = false; if (unlikely(rq->idle_balance))
return 0;
return false; /* * We may be recently in ticked or tickless idle mode. At the first
@@ -7237,38 +7238,34 @@ static inline int nohz_kick_needed(struct rq *rq) * balancing. */ if (likely(!atomic_read(&nohz.nr_cpus)))
return 0;
return false; if (time_before(now, nohz.next_balance))
return 0;
return false; if (rq->nr_running >= 2)
goto need_kick;
return true; rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_busy, cpu));
if (sd) { sgp = sd->groups->sgp; nr_busy = atomic_read(&sgp->nr_busy_cpus);
if (nr_busy > 1)
goto need_kick_unlock;
if (nr_busy > 1) {
kick = true;
goto unlock;
} } sd = rcu_dereference(per_cpu(sd_asym, cpu));
if (sd && (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu))
goto need_kick_unlock;
rcu_read_unlock();
return 0;
kick = true;
-need_kick_unlock: +unlock: rcu_read_unlock(); -need_kick:
return 1;
return kick;
} #else static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
On Thu, May 29, 2014 at 09:37:39PM +0200, Vincent Guittot wrote:
On 29 May 2014 11:50, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:53:04PM +0200, Vincent Guittot wrote:
If the CPU is used for handling a lot of IRQs, trigger a load balance to check whether it's worth moving its tasks to another CPU that has more capacity.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e8a30f9..2501e49 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5948,6 +5948,13 @@ static bool update_sd_pick_busiest(struct lb_env *env, if (sgs->sum_nr_running > sgs->group_capacity) return true;
/*
* The group capacity is reduced probably because of activity from other
* sched class or interrupts which use part of the available capacity
*/
if ((sg->sgp->power_orig * 100) > (sgs->group_power * env->sd->imbalance_pct))
return true;
if (sgs->group_imb) return true;
But we should already do this because the load numbers are scaled with the power/capacity figures. If one CPU gets significant less time to run fair tasks, its effective load would spike and it'd get to be selected here anyway.
Or am I missing something?
The CPU could have been picked when the capacity becomes null (which occurred when the cpu_power goes below half the default SCHED_POWER_SCALE). And even after that, there were some conditions in find_busiest_group that was bypassing this busiest group
Could you detail those conditions? FWIW those make excellent Changelog material.
On 30 May 2014 08:29, Peter Zijlstra peterz@infradead.org wrote:
On Thu, May 29, 2014 at 09:37:39PM +0200, Vincent Guittot wrote:
On 29 May 2014 11:50, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:53:04PM +0200, Vincent Guittot wrote:
If the CPU is used for handling a lot of IRQs, trigger a load balance to check whether it's worth moving its tasks to another CPU that has more capacity.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e8a30f9..2501e49 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5948,6 +5948,13 @@ static bool update_sd_pick_busiest(struct lb_env *env, if (sgs->sum_nr_running > sgs->group_capacity) return true;
/*
* The group capacity is reduced probably because of activity from other
* sched class or interrupts which use part of the available capacity
*/
if ((sg->sgp->power_orig * 100) > (sgs->group_power * env->sd->imbalance_pct))
return true;
if (sgs->group_imb) return true;
But we should already do this because the load numbers are scaled with the power/capacity figures. If one CPU gets significant less time to run fair tasks, its effective load would spike and it'd get to be selected here anyway.
Or am I missing something?
The CPU could have been picked when the capacity becomes null (which occurred when the cpu_power goes below half the default SCHED_POWER_SCALE). And even after that, there were some conditions in find_busiest_group that was bypassing this busiest group
Could you detail those conditions? FWIW those make excellent Changelog material.
I need to go back to my traces and use case to be sure that I point at the right culprit, but IIRC the imbalance was null. I will come back with more details once I'm back in front of the boards.
On 30 May 2014 08:29, Peter Zijlstra peterz@infradead.org wrote:
On Thu, May 29, 2014 at 09:37:39PM +0200, Vincent Guittot wrote:
On 29 May 2014 11:50, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:53:04PM +0200, Vincent Guittot wrote:
If the CPU is used for handling a lot of IRQs, trigger a load balance to check whether it's worth moving its tasks to another CPU that has more capacity.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e8a30f9..2501e49 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5948,6 +5948,13 @@ static bool update_sd_pick_busiest(struct lb_env *env, if (sgs->sum_nr_running > sgs->group_capacity) return true;
/*
* The group capacity is reduced probably because of activity from other
* sched class or interrupts which use part of the available capacity
*/
if ((sg->sgp->power_orig * 100) > (sgs->group_power * env->sd->imbalance_pct))
return true;
if (sgs->group_imb) return true;
But we should already do this because the load numbers are scaled with the power/capacity figures. If one CPU gets significant less time to run fair tasks, its effective load would spike and it'd get to be selected here anyway.
Or am I missing something?
The CPU could have been picked when the capacity becomes null (which occurred when the cpu_power goes below half the default SCHED_POWER_SCALE). And even after that, there were some conditions in find_busiest_group that was bypassing this busiest group
Could you detail those conditions? FWIW those make excellent Changelog material.
I have looked back into my tests and traces:
In a 1st test, the capacity of the CPU was still above half the default value (power=538), unlike what I remembered. So it's somewhat "normal" to keep the task on CPU0, which also handles the IRQs, because sg_capacity still returns 1.
In a 2nd test, the main task runs (most of the time) on CPU0, whereas the max power of the latter is only 623 and the cpu_power goes below 512 (power=330) during the use case. So the sg_capacity of CPU0 is null but the main task still stays on CPU0. The use case (scp transfer) is made of a long running task (ssh) and a periodic short task (scp). ssh runs on CPU0 and scp runs every 6ms on CPU1. The newly idle load balance on CPU1 doesn't pull the long running task, although sg_capacity is null, because sd->nr_balance_failed is never incremented and load_balance doesn't trigger an active load_balance. When an idle balance occurs in the middle of the newly idle balance, the ssh long task migrates to CPU1, but as soon as it sleeps and wakes up it goes back to CPU0 because of wake affine, which migrates it back to CPU0 (issue solved by patch 09).
Vincent
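To make these two traces concrete, here is a small standalone illustration of the rounding used by the current sg_capacity()/find_busiest_queue() code (DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE), see the hunks removed in the last patch of the series), with the power values quoted above:

	#include <stdio.h>

	#define SCHED_POWER_SCALE	1024UL
	/* same result as the kernel's DIV_ROUND_CLOSEST() for these operands */
	#define DIV_ROUND_CLOSEST(x, d)	(((x) + (d) / 2) / (d))

	int main(void)
	{
		/* power values taken from the two traces above */
		printf("%lu\n", DIV_ROUND_CLOSEST(538UL, SCHED_POWER_SCALE));	/* 1 -> capacity kept */
		printf("%lu\n", DIV_ROUND_CLOSEST(330UL, SCHED_POWER_SCALE));	/* 0 -> capacity null */
		return 0;
	}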
On Mon, Jun 02, 2014 at 07:06:44PM +0200, Vincent Guittot wrote:
Could you detail those conditions? FWIW those make excellent Changelog material.
I have looked back into my tests and traces:
In a 1st test, the capacity of the CPU was still above half default value (power=538) unlike what i remembered. So it's some what "normal" to keep the task on CPU0 which also handles IRQ because sg_capacity still returns 1.
OK, so I suspect that once we move to utilization based capacity stuff we'll do the migration IF the task indeed requires more cpu than can be provided by the reduced one, right?
In a 2nd test,the main task runs (most of the time) on CPU0 whereas the max power of the latter is only 623 and the cpu_power goes below 512 (power=330) during the use case. So the sg_capacity of CPU0 is null but the main task still stays on CPU0. The use case (scp transfer) is made of a long running task (ssh) and a periodic short task (scp). ssh runs on CPU0 and scp runs each 6ms on CPU1. The newly idle load balance on CPU1 doesn't pull the long running task although sg_capacity is null because of sd->nr_balance_failed is never incremented and load_balance doesn't trig an active load_balance. When an idle balance occurs in the middle of the newly idle balance, the ssh long task migrates on CPU1 but as soon as it sleeps and wakes up, it goes back on CPU0 because of the wake affine which migrates it back on CPU0 (issue solved by patch 09).
OK, so there's two problems here, right? 1) we don't migrate away from cpu0 2) if we do, we get pulled back.
And patch 9 solves 2, so maybe enhance its changelog to mention this slightly more explicit.
Which leaves us with 1.. interesting problem. I'm just not sure endlessly kicking a low capacity cpu is the right fix for that.
On 3 June 2014 13:15, Peter Zijlstra peterz@infradead.org wrote:
On Mon, Jun 02, 2014 at 07:06:44PM +0200, Vincent Guittot wrote:
Could you detail those conditions? FWIW those make excellent Changelog material.
I have looked back into my tests and traces:
In a 1st test, the capacity of the CPU was still above half default value (power=538) unlike what i remembered. So it's some what "normal" to keep the task on CPU0 which also handles IRQ because sg_capacity still returns 1.
OK, so I suspect that once we move to utilization based capacity stuff we'll do the migration IF the task indeed requires more cpu than can be provided by the reduced, one, right?
The current version of the patchset only checks whether the capacity of a CPU has been significantly reduced, in which case we should look for another CPU. But we could effectively also compare the remaining capacity with the task load.
In a 2nd test,the main task runs (most of the time) on CPU0 whereas the max power of the latter is only 623 and the cpu_power goes below 512 (power=330) during the use case. So the sg_capacity of CPU0 is null but the main task still stays on CPU0. The use case (scp transfer) is made of a long running task (ssh) and a periodic short task (scp). ssh runs on CPU0 and scp runs each 6ms on CPU1. The newly idle load balance on CPU1 doesn't pull the long running task although sg_capacity is null because of sd->nr_balance_failed is never incremented and load_balance doesn't trig an active load_balance. When an idle balance occurs in the middle of the newly idle balance, the ssh long task migrates on CPU1 but as soon as it sleeps and wakes up, it goes back on CPU0 because of the wake affine which migrates it back on CPU0 (issue solved by patch 09).
OK, so there's two problems here, right?
- we don't migrate away from cpu0
- if we do, we get pulled back.
And patch 9 solves 2, so maybe enhance its changelog to mention this slightly more explicit.
Which leaves us with 1.. interesting problem. I'm just not sure endlessly kicking a low capacity cpu is the right fix for that.
What prevents us from migrating the task directly is the fact that nr_balance_failed is not incremented for newly idle balance, and it's the only condition for active migration (except the asym feature).
We could add an additional test in need_active_balance for the newly_idle load balance, something like:

	if ((sd->flags & SD_SHARE_PKG_RESOURCES) &&
	    ((env->src_rq->cpu_power_orig * 100) >
	     (env->src_rq->cpu_power * env->sd->imbalance_pct)))
		return 1;
On Fri, May 23, 2014 at 05:53:04PM +0200, Vincent Guittot wrote:
@@ -7282,6 +7289,12 @@ static inline int nohz_kick_needed(struct rq *rq) if (nr_busy > 1) goto need_kick_unlock;
if ((rq->cfs.h_nr_running >= 1)
&& ((rq->cpu_power * sd->imbalance_pct) <
(rq->cpu_power_orig * 100)))
goto need_kick_unlock;
- }
sd = rcu_dereference(per_cpu(sd_asym, cpu));
So what happens when a cpu is consistently low on power (say due to a pinned RT task)? The balancer would quickly adjust the load level, but this would endlessly kick things into action, even though we're balanced just fine.
On 29 May 2014 16:04, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:53:04PM +0200, Vincent Guittot wrote:
@@ -7282,6 +7289,12 @@ static inline int nohz_kick_needed(struct rq *rq)
if (nr_busy > 1) goto need_kick_unlock;
if ((rq->cfs.h_nr_running >= 1)
&& ((rq->cpu_power * sd->imbalance_pct) <
(rq->cpu_power_orig * 100)))
goto need_kick_unlock;
} sd = rcu_dereference(per_cpu(sd_asym, cpu));
So what happens when a cpu is consistently low on power (say due to a pinned RT task)? The balancer would quickly adjust the load level, but this would endlessly kick things into action, even though we're balanced just fine.
If there is more than 1 running task or more than 1 busy CPU, we will kick the ilb because of the former conditions. Then, if there is only 1 task and no other busy CPU, we should trigger the ILB. Nevertheless, I can add a test to check that there is an idle CPU in the sched_domain.
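A sketch of what such an additional test could look like (illustrative only, not part of the posted series; it reuses nohz.idle_cpus_mask and sched_domain_span(), which are already available in nohz_kick_needed()):

	if ((rq->cfs.h_nr_running >= 1) &&
	    ((rq->cpu_power * sd->imbalance_pct) < (rq->cpu_power_orig * 100)) &&
	    cpumask_intersects(nohz.idle_cpus_mask, sched_domain_span(sd)))
		goto need_kick_unlock;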
On 23/05/14 16:53, Vincent Guittot wrote:
If the CPU is used for handling a lot of IRQs, trigger a load balance to check whether it's worth moving its tasks to another CPU that has more capacity.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e8a30f9..2501e49 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5948,6 +5948,13 @@ static bool update_sd_pick_busiest(struct lb_env *env, if (sgs->sum_nr_running > sgs->group_capacity) return true;
- /*
* The group capacity is reduced probably because of activity from other
Here 'group capacity' refers to sgs->group_power and not to sgs->group_capacity, right?
* sched class or interrupts which use part of the available capacity
*/
... 'interrupts' only w/ CONFIG_IRQ_TIME_ACCOUNTING=y, right ?
- if ((sg->sgp->power_orig * 100) > (sgs->group_power * env->sd->imbalance_pct))
return true;
[...]
On 30 May 2014 15:26, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 23/05/14 16:53, Vincent Guittot wrote:
If the CPU is used for handling a lot of IRQs, trigger a load balance to check whether it's worth moving its tasks to another CPU that has more capacity.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
kernel/sched/fair.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e8a30f9..2501e49 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5948,6 +5948,13 @@ static bool update_sd_pick_busiest(struct lb_env *env, if (sgs->sum_nr_running > sgs->group_capacity) return true;
/*
* The group capacity is reduced probably because of activity from other
Here 'group capacity' refers to sgs->group_power and not to sgs->group_capacity, right?
Yes, you're right, it's cpu_power; I will correct the comment.
* sched class or interrupts which use part of the available capacity
*/
... 'interrupts' only w/ CONFIG_IRQ_TIME_ACCOUNTING=y, right ?
Yes, we need CONFIG_IRQ_TIME_ACCOUNTING in order to scale the cpu_power with the time spent under irq. I have made tests with and without this config to show the impact (in the cover letter).
Thanks Vincent
if ((sg->sgp->power_orig * 100) > (sgs->group_power * env->sd->imbalance_pct))
return true;
[...]
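As an illustration of that scaling, with assumed numbers (the real derating is done by scale_rt_power() in kernel/sched/fair.c from rq->rt_avg, which includes irq time only when CONFIG_IRQ_TIME_ACCOUNTING is set):

	#include <stdio.h>

	int main(void)
	{
		unsigned long power_orig = 1024;	/* SCHED_POWER_SCALE */
		unsigned long period = 1000000;		/* averaging period, assumed */
		unsigned long rt_irq = 300000;		/* rt + irq time in that period, assumed */
		unsigned long power;

		/* the fraction of the period left for CFS scales power_orig down */
		power = power_orig * (period - rt_irq) / period;
		printf("power_orig=%lu -> power=%lu\n", power_orig, power);
		return 0;
	}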
On Fri, 30 May 2014, Vincent Guittot wrote:
On 30 May 2014 15:26, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
/*
* The group capacity is reduced probably because of activity from other
Here 'group capacity' refers to sgs->group_power and not to sgs->group_capacity, right?
yes, you're right it's cpu_power, i will correct the comment
Note that peterz applied my rename patches so sgs->group_power has become sgs->group_capacity.
Nicolas
On 30 May 2014 21:45, Nicolas Pitre nicolas.pitre@linaro.org wrote:
On Fri, 30 May 2014, Vincent Guittot wrote:
On 30 May 2014 15:26, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
/*
* The group capacity is reduced probably because of activity from other
Here 'group capacity' refers to sgs->group_power and not to sgs->group_capacity, right?
yes, you're right it's cpu_power, i will correct the comment
Note that peterz applied my rename patches so sgs->group_power has become sgs->group_capacity.
I will update the naming in the next version when I rebase.
Vincent
Nicolas
The scheduler tries to compute how many tasks a group of CPUs can handle by assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is SCHED_POWER_SCALE. We can now have a better idea of the utilization of a group of CPUs thanks to group_activity, and deduce how much capacity is still available.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 94 ++++++++++++++--------------------------------------- 1 file changed, 24 insertions(+), 70 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2501e49..05b9502 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5538,11 +5538,10 @@ struct sg_lb_stats { unsigned long group_power; unsigned long group_activity; /* Total activity of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ - unsigned int group_capacity; + long group_capacity; unsigned int idle_cpus; unsigned int group_weight; int group_imb; /* Is there an imbalance in the group ? */ - int group_has_capacity; /* Is there extra capacity in the group? */ #ifdef CONFIG_NUMA_BALANCING unsigned int nr_numa_running; unsigned int nr_preferred_running; @@ -5785,31 +5784,6 @@ void update_group_power(struct sched_domain *sd, int cpu) }
/* - * Try and fix up capacity for tiny siblings, this is needed when - * things like SD_ASYM_PACKING need f_b_g to select another sibling - * which on its own isn't powerful enough. - * - * See update_sd_pick_busiest() and check_asym_packing(). - */ -static inline int -fix_small_capacity(struct sched_domain *sd, struct sched_group *group) -{ - /* - * Only siblings can have significantly less than SCHED_POWER_SCALE - */ - if (!(sd->flags & SD_SHARE_CPUPOWER)) - return 0; - - /* - * If ~90% of the cpu_power is still there, we're good. - */ - if (group->sgp->power * 32 > group->sgp->power_orig * 29) - return 1; - - return 0; -} - -/* * Group imbalance indicates (and tries to solve) the problem where balancing * groups is inadequate due to tsk_cpus_allowed() constraints. * @@ -5843,33 +5817,6 @@ static inline int sg_imbalanced(struct sched_group *group) return group->sgp->imbalance; }
-/* - * Compute the group capacity. - * - * Avoid the issue where N*frac(smt_power) >= 1 creates 'phantom' cores by - * first dividing out the smt factor and computing the actual number of cores - * and limit power unit capacity with that. - */ -static inline int sg_capacity(struct lb_env *env, struct sched_group *group) -{ - unsigned int capacity, smt, cpus; - unsigned int power, power_orig; - - power = group->sgp->power; - power_orig = group->sgp->power_orig; - cpus = group->group_weight; - - /* smt := ceil(cpus / power), assumes: 1 < smt_power < 2 */ - smt = DIV_ROUND_UP(SCHED_POWER_SCALE * cpus, power_orig); - capacity = cpus / smt; /* cores */ - - capacity = min_t(unsigned, capacity, DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE)); - if (!capacity) - capacity = fix_small_capacity(env->sd, group); - - return capacity; -} - /** * update_sg_lb_stats - Update sched_group's statistics for load balancing. * @env: The load balancing environment. @@ -5918,10 +5865,9 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->group_weight = group->group_weight;
sgs->group_imb = sg_imbalanced(group); - sgs->group_capacity = sg_capacity(env, group); + sgs->group_capacity = group->sgp->power_orig - sgs->group_activity; +
- if (sgs->group_capacity > sgs->sum_nr_running) - sgs->group_has_capacity = 1; }
/** @@ -5945,7 +5891,15 @@ static bool update_sd_pick_busiest(struct lb_env *env, if (sgs->avg_load <= sds->busiest_stat.avg_load) return false;
- if (sgs->sum_nr_running > sgs->group_capacity) + /* The group has an obvious long run overload */ + if (sgs->group_capacity < 0) + return true; + + /* + * The group has a short run overload because more tasks than available + * CPUs are running + */ + if (sgs->sum_nr_running > sgs->group_weight) return true;
/* @@ -6052,8 +6006,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd * with a large weight task outweighs the tasks on the system). */ if (prefer_sibling && sds->local && - sds->local_stat.group_has_capacity) - sgs->group_capacity = min(sgs->group_capacity, 1U); + sds->local_stat.group_capacity > 0) + sgs->group_capacity = min(sgs->group_capacity, 1L);
if (update_sd_pick_busiest(env, sds, sg, sgs)) { sds->busiest = sg; @@ -6228,7 +6182,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s * have to drop below capacity to reach cpu-load equilibrium. */ load_above_capacity = - (busiest->sum_nr_running - busiest->group_capacity); + (busiest->sum_nr_running - busiest->group_weight);
load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE); load_above_capacity /= busiest->group_power; @@ -6294,6 +6248,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env) local = &sds.local_stat; busiest = &sds.busiest_stat;
+ /* ASYM feature bypasses nice load balance check */ if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) && check_asym_packing(env, &sds)) return sds.busiest; @@ -6313,8 +6268,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env) goto force_balance;
/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */ - if (env->idle == CPU_NEWLY_IDLE && local->group_has_capacity && - !busiest->group_has_capacity) + if (env->idle == CPU_NEWLY_IDLE && (local->group_capacity > 0) + && (busiest->group_capacity < 0)) goto force_balance;
/* @@ -6372,7 +6327,7 @@ static struct rq *find_busiest_queue(struct lb_env *env, int i;
for_each_cpu_and(i, sched_group_cpus(group), env->cpus) { - unsigned long power, capacity, wl; + unsigned long power, wl; enum fbq_type rt;
rq = cpu_rq(i); @@ -6400,18 +6355,17 @@ static struct rq *find_busiest_queue(struct lb_env *env, if (rt > env->fbq_type) continue;
- power = power_of(i); - capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE); - if (!capacity) - capacity = fix_small_capacity(env->sd, group); - wl = weighted_cpuload(i);
/* * When comparing with imbalance, use weighted_cpuload() * which is not scaled with the cpu power. */ - if (capacity && rq->nr_running == 1 && wl > env->imbalance) + + power = power_of(i); + + if (rq->nr_running == 1 && wl > env->imbalance && + ((power * env->sd->imbalance_pct) >= (rq->cpu_power_orig * 100))) continue;
/*
On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
The scheduler tries to compute how many tasks a group of CPUs can handle by assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is SCHED_POWER_SCALE. We can now have a better idea of the utilization of a group of CPUs thanks to group_activity, and deduce how much capacity is still available.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Right, so as Preeti already mentioned, this wrecks SMT. It also seems to lose the aggressive spread, where we want to run 1 task on each 'core' before we start 'balancing'.
So I think we should be able to fix this by setting PREFER_SIBLING on the SMT domain, that way we'll get single tasks running on each SMT domain before filling them up until capacity.
Now, its been a while since I looked at PREFER_SIBLING, and I've not yet looked at what your patch does to it, but it seems to me that that is the first direction we should look for an answer to this.
On 29 May 2014 15:55, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
The scheduler tries to compute how many tasks a group of CPUs can handle by assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is SCHED_POWER_SCALE. We can now have a better idea of the utilization of a group of CPUs thanks to group_activity, and deduce how much capacity is still available.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Right, so as Preeti already mentioned, this wrecks SMT. It also seems to lose the aggressive spread, where we want to run 1 task on each 'core' before we start 'balancing'.
So I think we should be able to fix this by setting PREFER_SIBLING on the SMT domain, that way we'll get single tasks running on each SMT domain before filling them up until capacity.
Now, its been a while since I looked at PREFER_SIBLING, and I've not yet looked at what your patch does to it, but it seems to me that that is the first direction we should look for an answer to this.
OK, I'm going to look more deeply into PREFER_SIBLING too.
On 05/29/2014 07:25 PM, Peter Zijlstra wrote:
On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
The scheduler tries to compute how many tasks a group of CPUs can handle by assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is SCHED_POWER_SCALE. We can now have a better idea of the utilization of a group of CPUs thanks to group_activity, and deduce how much capacity is still available.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Right, so as Preeti already mentioned, this wrecks SMT. It also seems to lose the aggressive spread, where we want to run 1 task on each 'core' before we start 'balancing'.
True. I just profiled the ebizzy runs and found that ebizzy threads were being packed onto a single core, which is SMT-8 capable, before spreading. This was a 6-core, SMT-8 machine. So for instance, if I run 8 threads of ebizzy, the load balancing as recorded by perf sched record showed that two cores were packed with up to 3 ebizzy threads each and one core ran two ebizzy threads, while the remaining 3 cores were idle.
I am unable to understand which part of this patch is aiding packing onto a core. There is this check in this patch, right?
if (sgs->group_capacity < 0) return true;
which should ideally prevent such packing? Because irrespective of the number of SMT threads, the capacity of a core is unchanged. And in the above scenario, we have 6 tasks on 3 cores. So shouldn't the above check have caught it?
Regards Preeti U Murthy
So I think we should be able to fix this by setting PREFER_SIBLING on the SMT domain, that way we'll get single tasks running on each SMT domain before filling them up until capacity.
Now, its been a while since I looked at PREFER_SIBLING, and I've not yet looked at what your patch does to it, but it seems to me that that is the first direction we should look for an answer to this.
On 2 June 2014 08:21, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
On 05/29/2014 07:25 PM, Peter Zijlstra wrote:
On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
The scheduler tries to compute how many tasks a group of CPUs can handle by assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is SCHED_POWER_SCALE. We can now have a better idea of the utilization of a group of CPUs thanks to group_activity, and deduce how much capacity is still available.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
Right, so as Preeti already mentioned, this wrecks SMT. It also seems to loose the aggressive spread, where we want to run 1 task on each 'core' before we start 'balancing'.
True. I just profiled the ebizzy runs and found that ebizzy threads were being packed onto a single core which is SMT-8 capable before spreading. This was a 6 core, SMT-8 machine. So for instance if I run 8 threads of ebizzy. the load balancing as record by perf sched record showed that two cores were packed upto 3 ebizzy threads and one core ran two ebizzy threads while the rest of the 3 cores were idle.
I am unable to understand which part of this patch is aiding packing to a core. There is this check in this patch right?
if (sgs->group_capacity < 0) return true;
which should ideally prevent such packing? Because irrespective of the number of SMT threads, the capacity of a core is unchanged. And in the above scenario, we have 6 tasks on 3 cores. So shouldn't the above check have caught it?
Yes, it should. The group_capacity should become < 0 because the CPUs are fully loaded and the activity reaches the max capacity value + nr_running.
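A small standalone illustration of that sign test, with assumed numbers (each CPU's activity saturating at its original capacity plus one per running task, as described in the reply above):

	#include <stdio.h>

	int main(void)
	{
		long power_orig = 2 * 1024;		/* 2-CPU group at full original capacity */
		long group_activity = 2 * (1024 + 1);	/* fully loaded: activity = power_orig + nr_running */
		long group_capacity = power_orig - group_activity;

		printf("group_capacity=%ld overloaded=%d\n",
		       group_capacity, group_capacity < 0);
		return 0;
	}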
Regards Preeti U Murthy
So I think we should be able to fix this by setting PREFER_SIBLING on the SMT domain, that way we'll get single tasks running on each SMT domain before filling them up until capacity.
Now, its been a while since I looked at PREFER_SIBLING, and I've not yet looked at what your patch does to it, but it seems to me that that is the first direction we should look for an answer to this.
On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
@@ -6052,8 +6006,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd * with a large weight task outweighs the tasks on the system). */ if (prefer_sibling && sds->local &&
sds->local_stat.group_has_capacity)
sgs->group_capacity = min(sgs->group_capacity, 1U);
sds->local_stat.group_capacity > 0)
sgs->group_capacity = min(sgs->group_capacity, 1L);
if (update_sd_pick_busiest(env, sds, sg, sgs)) { sds->busiest = sg; @@ -6228,7 +6182,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s * have to drop below capacity to reach cpu-load equilibrium. */ load_above_capacity =
(busiest->sum_nr_running - busiest->group_capacity);
(busiest->sum_nr_running - busiest->group_weight);
load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE); load_above_capacity /= busiest->group_power;
I think you just broke PREFER_SIBLING here..
So we want PREFER_SIBLING to work on nr_running, not utilization because we want to spread single tasks around, regardless of their utilization.
On 29 May 2014 16:02, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
@@ -6052,8 +6006,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd * with a large weight task outweighs the tasks on the system). */ if (prefer_sibling && sds->local &&
sds->local_stat.group_has_capacity)
sgs->group_capacity = min(sgs->group_capacity, 1U);
sds->local_stat.group_capacity > 0)
sgs->group_capacity = min(sgs->group_capacity, 1L); if (update_sd_pick_busiest(env, sds, sg, sgs)) { sds->busiest = sg;
@@ -6228,7 +6182,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s * have to drop below capacity to reach cpu-load equilibrium. */ load_above_capacity =
(busiest->sum_nr_running - busiest->group_capacity);
(busiest->sum_nr_running - busiest->group_weight); load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE); load_above_capacity /= busiest->group_power;
I think you just broke PREFER_SIBLING here..
You mean by replacing the capacity, which was reflecting the number of cores for SMT, with the group_weight?
So we want PREFER_SIBLING to work on nr_running, not utilization because we want to spread single tasks around, regardless of their utilization.
On Thu, May 29, 2014 at 09:56:24PM +0200, Vincent Guittot wrote:
On 29 May 2014 16:02, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
@@ -6052,8 +6006,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd * with a large weight task outweighs the tasks on the system). */ if (prefer_sibling && sds->local &&
sds->local_stat.group_has_capacity)
sgs->group_capacity = min(sgs->group_capacity, 1U);
sds->local_stat.group_capacity > 0)
sgs->group_capacity = min(sgs->group_capacity, 1L); if (update_sd_pick_busiest(env, sds, sg, sgs)) { sds->busiest = sg;
@@ -6228,7 +6182,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s * have to drop below capacity to reach cpu-load equilibrium. */ load_above_capacity =
(busiest->sum_nr_running - busiest->group_capacity);
(busiest->sum_nr_running - busiest->group_weight); load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE); load_above_capacity /= busiest->group_power;
I think you just broke PREFER_SIBLING here..
You mean by replacing the capacity, which was reflecting the number of cores for SMT, with the group_weight?
Right, so in the first hunk we lower group_capacity to 1 when prefer_sibling is set, then in the second hunk you replace that group_capacity usage with group_weight.
With the end result that prefer_sibling is now ineffective.
That said, I fudged the prefer_sibling usage into the capacity logic, mostly because I could and it was already how the SMT stuff was working. But there is no reason we should continue to intertwine these two things.
So I think it would be good to have a patch that implements prefer_sibling on nr_running separate from the existing capacity bits, and then convert the remaining capacity bits to utilization (or activity or whatever you did call it, see Morton's comments etc.).
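An illustrative sketch (not a posted patch) of what keying prefer_sibling on nr_running could look like in update_sd_lb_stats(); the field names follow the structures used in this series:

	/*
	 * Sketch only: express the prefer_sibling constraint directly on
	 * nr_running. A sibling group running more than one task while the
	 * local group still has idle CPUs is flagged so that the existing
	 * group_imb path in update_sd_pick_busiest() can consider it busiest.
	 */
	if (prefer_sibling && sds->local &&
	    sds->local_stat.sum_nr_running < sds->local_stat.group_weight &&
	    sgs->sum_nr_running > 1)
		sgs->group_imb = 1;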
On 30 May 2014 08:34, Peter Zijlstra peterz@infradead.org wrote:
On Thu, May 29, 2014 at 09:56:24PM +0200, Vincent Guittot wrote:
On 29 May 2014 16:02, Peter Zijlstra peterz@infradead.org wrote:
On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
@@ -6052,8 +6006,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd * with a large weight task outweighs the tasks on the system). */ if (prefer_sibling && sds->local &&
sds->local_stat.group_has_capacity)
sgs->group_capacity = min(sgs->group_capacity, 1U);
sds->local_stat.group_capacity > 0)
sgs->group_capacity = min(sgs->group_capacity, 1L); if (update_sd_pick_busiest(env, sds, sg, sgs)) { sds->busiest = sg;
@@ -6228,7 +6182,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s * have to drop below capacity to reach cpu-load equilibrium. */ load_above_capacity =
(busiest->sum_nr_running - busiest->group_capacity);
(busiest->sum_nr_running - busiest->group_weight); load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE); load_above_capacity /= busiest->group_power;
I think you just broke PREFER_SIBLING here..
you mean by replacing the capacity which was reflecting the number of core for SMT by the group_weight ?
Right, so in the first hunk we lower group_capacity to 1 when prefer_sibling is set, then in the second hunk you replace that group_capacity usage with group_weight.
With the end result that prefer_sibling is now ineffective.
ok
That said, I fudged the prefer_sibling usage into the capacity logic, mostly because I could and it was already how the SMT stuff was working. But there is no reason we should continue to intertwine these two things.
So I think it would be good to have a patch that implements prefer_sibling on nr_running separate from the existing capacity bits, and then convert the remaining capacity bits to utilization (or activity or whatever you did call it, see Morton's comments etc.).
OK, I'm going to prepare such a change.
Hi Vincent,
I conducted test runs of ebizzy on a Power8 box which had 48 cpus: 6 cores with SMT-8 to be precise. It's a single socket box. The results are as below.
On 05/23/2014 09:22 PM, Vincent Guittot wrote:
Part of this patchset was previously part of the larger tasks packing patchset [1]. I have splitted the latter in 3 different patchsets (at least) to make the thing easier. -configuration of sched_domain topology [2] -update and consolidation of cpu_power (this patchset) -tasks packing algorithm
SMT system is no more the only system that can have a CPUs with an original capacity that is different from the default value. We need to extend the use of cpu_power_orig to all kind of platform so the scheduler will have both the maximum capacity (cpu_power_orig/power_orig) and the current capacity (cpu_power/power) of CPUs and sched_groups. A new function arch_scale_cpu_power has been created and replace arch_scale_smt_power, which is SMT specifc in the computation of the capapcity of a CPU.
During load balance, the scheduler evaluates the number of tasks that a group of CPUs can handle. The current method assumes that tasks have a fix load of SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_POWER_SCALE. This assumption generates wrong decision by creating ghost cores and by removing real ones when the original capacity of CPUs is different from the default SCHED_POWER_SCALE.
Now that we have the original capacity of a CPUS and its activity/utilization, we can evaluate more accuratly the capacity of a group of CPUs.
This patchset mainly replaces the old capacity method by a new one and has kept the policy almost unchanged whereas we can certainly take advantage of this new statistic in several other places of the load balance.
TODO:
- align variable's and field's name with the renaming [3]
Tests results: I have put below results of 2 tests:
- hackbench -l 500 -s 4096
- scp of 100MB file on the platform
on a dual cortex-A7 hackbench scp tip/master 25.75s(+/-0.25) 5.16MB/s(+/-1.49)
- patches 1,2 25.89s(+/-0.31) 5.18MB/s(+/-1.45)
- patches 3-10 25.68s(+/-0.22) 7.00MB/s(+/-1.88)
- irq accounting 25.80s(+/-0.25) 8.06MB/s(+/-0.05)
on a quad cortex-A15 hackbench scp tip/master 15.69s(+/-0.16) 9.70MB/s(+/-0.04)
- patches 1,2 15.53s(+/-0.13) 9.72MB/s(+/-0.05)
- patches 3-10 15.56s(+/-0.22) 9.88MB/s(+/-0.05)
- irq accounting 15.99s(+/-0.08) 10.37MB/s(+/-0.03)
The improvement of scp bandwidth happens when tasks and irq are using different CPU which is a bit random without irq accounting config
N -> Number of threads of ebizzy
Each 'N' run was for 30 seconds with multiple iterations and averaging them.
N     %change in number of records read after patching
-------------------------------------------------------
1     + 0.0038
4     -17.6429
8     -26.3989
12    -29.5070
16    -38.4842
20    -44.5747
24    -51.9792
28    -34.1863
32    -38.4029
38    -22.2490
42    -7.4843
47    -0.69676
Let me profile it and check where the cause of this degradation is.
Regards Preeti U Murthy
On 26 May 2014 11:44, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
I conducted test runs of ebizzy on a Power8 box which had 48 cpus. 6 cores with SMT-8 to be precise. Its a single socket box. The results are as below.
On 05/23/2014 09:22 PM, Vincent Guittot wrote:
Part of this patchset was previously part of the larger tasks packing patchset [1]. I have splitted the latter in 3 different patchsets (at least) to make the thing easier. -configuration of sched_domain topology [2] -update and consolidation of cpu_power (this patchset) -tasks packing algorithm
SMT system is no more the only system that can have a CPUs with an original capacity that is different from the default value. We need to extend the use of cpu_power_orig to all kind of platform so the scheduler will have both the maximum capacity (cpu_power_orig/power_orig) and the current capacity (cpu_power/power) of CPUs and sched_groups. A new function arch_scale_cpu_power has been created and replace arch_scale_smt_power, which is SMT specifc in the computation of the capapcity of a CPU.
During load balance, the scheduler evaluates the number of tasks that a group of CPUs can handle. The current method assumes that tasks have a fix load of SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_POWER_SCALE. This assumption generates wrong decision by creating ghost cores and by removing real ones when the original capacity of CPUs is different from the default SCHED_POWER_SCALE.
Now that we have the original capacity of a CPUS and its activity/utilization, we can evaluate more accuratly the capacity of a group of CPUs.
This patchset mainly replaces the old capacity method by a new one and has kept the policy almost unchanged whereas we can certainly take advantage of this new statistic in several other places of the load balance.
TODO:
- align variable's and field's name with the renaming [3]
Tests results: I have put below results of 2 tests:
- hackbench -l 500 -s 4096
- scp of 100MB file on the platform
on a dual cortex-A7 hackbench scp tip/master 25.75s(+/-0.25) 5.16MB/s(+/-1.49)
- patches 1,2 25.89s(+/-0.31) 5.18MB/s(+/-1.45)
- patches 3-10 25.68s(+/-0.22) 7.00MB/s(+/-1.88)
- irq accounting 25.80s(+/-0.25) 8.06MB/s(+/-0.05)
on a quad cortex-A15 hackbench scp tip/master 15.69s(+/-0.16) 9.70MB/s(+/-0.04)
- patches 1,2 15.53s(+/-0.13) 9.72MB/s(+/-0.05)
- patches 3-10 15.56s(+/-0.22) 9.88MB/s(+/-0.05)
- irq accounting 15.99s(+/-0.08) 10.37MB/s(+/-0.03)
The improvement of scp bandwidth happens when tasks and irq are using different CPU which is a bit random without irq accounting config
N -> Number of threads of ebizzy
Each 'N' run was for 30 seconds with multiple iterations and averaging them.
N     %change in number of records read after patching
1     + 0.0038
4     -17.6429
8     -26.3989
12    -29.5070
16    -38.4842
20    -44.5747
24    -51.9792
28    -34.1863
32    -38.4029
38    -22.2490
42    -7.4843
47    -0.69676
Let me profile it and check where the cause of this degradation is.
Hi Preeti,
Thanks for the test and the help in finding the root cause of the degradation. I'm going to run the test on my platforms too and see if I have similar results.
Regards Vincent
Regards Preeti U Murthy
Hi Preeti,
I have done ebizzy tests on my platforms but I don't have similar results to yours (my results below). It seems to be linked to SMT. I'm going to look at that part more deeply and try to find more suitable HW for tests.
ebizzy -t N -S 20

Quad cores
N    tip                    +patchset
1    100.00% (+/- 0.30%)    97.00%  (+/- 0.42%)
2    100.00% (+/- 0.80%)    100.48% (+/- 0.88%)
4    100.00% (+/- 1.18%)    99.32%  (+/- 1.05%)
6    100.00% (+/- 8.54%)    98.84%  (+/- 1.39%)
8    100.00% (+/- 0.45%)    98.89%  (+/- 0.91%)
10   100.00% (+/- 0.32%)    99.25%  (+/- 0.31%)
12   100.00% (+/- 0.15%)    99.20%  (+/- 0.86%)
14   100.00% (+/- 0.58%)    99.44%  (+/- 0.55%)
Dual cores
N    tip                    +patchset
1    100.00% (+/- 1.70%)    99.35%  (+/- 2.82%)
2    100.00% (+/- 2.75%)    100.48% (+/- 1.51%)
4    100.00% (+/- 2.37%)    102.63% (+/- 2.35%)
6    100.00% (+/- 3.11%)    97.65%  (+/- 1.02%)
8    100.00% (+/- 0.26%)    103.68% (+/- 5.90%)
10   100.00% (+/- 0.30%)    106.71% (+/- 10.85%)
12   100.00% (+/- 1.18%)    98.95%  (+/- 0.75%)
14   100.00% (+/- 1.82%)    102.89% (+/- 2.32%)
Regards, Vincent
On 26 May 2014 12:04, Vincent Guittot vincent.guittot@linaro.org wrote:
On 26 May 2014 11:44, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
I conducted test runs of ebizzy on a Power8 box which had 48 cpus. 6 cores with SMT-8 to be precise. Its a single socket box. The results are as below.
N -> Number of threads of ebizzy
Each run at a given 'N' lasted 30 seconds; multiple iterations were run and the results averaged.
 N    %change in number of records read after patching
 1     +0.0038
 4    -17.6429
 8    -26.3989
12    -29.5070
16    -38.4842
20    -44.5747
24    -51.9792
28    -34.1863
32    -38.4029
38    -22.2490
42     -7.4843
47     -0.69676
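As a minimal sketch only (not from the original thread, and assuming ebizzy prints its throughput as a "<count> records/s" line), a 30-second run at a given thread count could be repeated and averaged like this:

  #!/bin/sh
  # Hypothetical wrapper around the runs described above.
  N=${1:-8}              # ebizzy thread count
  ITER=${2:-5}           # number of iterations to average
  i=0
  while [ "$i" -lt "$ITER" ]; do
          ebizzy -t "$N" -S 30
          i=$((i + 1))
  done | awk -v n="$N" '
          /records\/s/ { sum += $1; runs++ }
          END { printf "N=%d: %.0f records/s (avg of %d runs)\n", n, sum/runs, runs }'

Running the same loop on the tip and the patched kernels and comparing the two averages gives %change figures like the ones above.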
Let me profile it and find out what is causing this degradation.
Hi Preeti,
Thanks for the test and for the help in finding the root cause of the degradation. I'm going to run the test on my platforms too and see whether I get similar results.
Regards Vincent
Regards Preeti U Murthy
On 05/26/2014 09:24 PM, Vincent Guittot wrote:
Hi Preeti,
I have run ebizzy tests on my platforms but I don't see results similar to yours (my results are below). It seems to be linked to SMT. I'm going to look at that part more deeply and try to find more suitable HW for the tests.
You are right Vincent. I tested this in SMT-off mode and the regression was not seen. But the regression was of the order of 27% with a higher number of threads in SMT-on mode. What is interesting is that the regression increases in the range N=1 to N=24 and then dips to 0 at N=48 on a 6-core, SMT-8 machine. Let me dig into this further.
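A way to dig in (just a sketch of what could be done, not something from this exchange) is to compare SMT-on profiles of one of the regressing thread counts, e.g. N=24, on the tip and patched kernels:

  # overall counters and hot symbols for a regressing case
  perf stat -a -- ebizzy -t 24 -S 30
  perf record -a -g -- ebizzy -t 24 -S 30
  perf report --sort=symbol
  # scheduler-centric view: wakeup latencies and migrations
  perf sched record ebizzy -t 24 -S 30
  perf sched latency

Diffing the reports from the two kernels can help show whether the extra time goes to migrations, idle balancing or something else.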
Regards Preeti U Murthy