This patch series optimizes EAS task placement performance.
Patch 0001 optimizes the CPU selection flow so that a task has more chance to stay on its previous CPU. Patch 0002 is actually a big change to EAS's CPU selection policy: it tries to select an idle CPU whenever possible. Profiling results show that 0002 effectively spreads tasks out when many tasks are running at the same time.
Patches 0003~0004 optimize the single-thread scenario. In this case the thread has a relatively high utilization value, but that value cannot easily cross the tipping point. So patch 0004 adds criteria which, in some conditions, switch to load_avg rather than util_avg to boost the single thread; a rough sketch of this criterion is shown below.
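For readers who want the gist before the diffs, the criterion patch 0004 introduces can be sketched roughly as follows (illustrative only, not the kernel code itself; it assumes capacity_margin = 1280, i.e. the ~20% margin used throughout the series):

#define CAPACITY_MARGIN	1280	/* ~20% margin, mirrors capacity_margin in fair.c */

/*
 * Sketch: switch from util_avg to load_avg when the task spends
 * noticeably more time runnable (waiting) than running, and its load
 * alone is already past the CPU's tipping point.
 */
static inline int prefer_load_signal(unsigned long util, unsigned long load,
				     unsigned long cpu_capacity)
{
	return (load * 1024 > util * CAPACITY_MARGIN) &&
	       (load * CAPACITY_MARGIN > cpu_capacity * 1024);
}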
Patch 0005 optimizes the flow for spreading tasks within the big cluster.
Patches 0006~0007 fix the avg_load signal.
Leo Yan (8):
  sched/fair: optimize to more chance to select previous CPU
  sched/fair: select idle CPU for waken up task
  sched/fair: add path to migrate to higher capacity CPU
  sched/fair: use load to replace util when have big difference
  sched/fair: spread tasks in cluster when over tipping point
  sched/fair: correct avg_load as CPU average load
  sched/fair: fix to calculate average load cross cluster
  sched/fair: set imbn to 1 for too many tasks on rq
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 93 +++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 84 insertions(+), 10 deletions(-)
-- 1.9.1
In the previous EAS wake-up path, any possible CPU with higher capacity that meets the task's requirement could be selected. This patch prefers to fall back to the previous CPU whenever possible, which avoids unnecessary task migrations between clusters.
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 kernel/sched/fair.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e5ffe8..c4bc7ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5569,7 +5569,7 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg, *sg_target;
 	int target_max_cap = INT_MAX;
-	int target_cpu = task_cpu(p);
+	int target_cpu = -1;
 	int i;

 	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));

@@ -5612,7 +5612,7 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)
 		 */
 		int new_util = cpu_util(i) + boosted_task_util(p);

-		if (new_util > capacity_orig_of(i))
+		if (new_util > capacity_orig_of(i) && target_cpu != -1)
 			continue;

 		if (new_util < capacity_curr_of(i)) {
@@ -5621,11 +5621,18 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)
 			break;
 		}

-		/* cpu has capacity at higher OPP, keep it as fallback */
-		if (target_cpu == task_cpu(p))
+		/*
+		 * cpu has capacity at higher OPP, keep it as fallback;
+		 * give the previous cpu more chance to run
+		 */
+		if (task_cpu(p) == i || target_cpu == -1)
 			target_cpu = i;
 	}

+	/* If have not select any CPU, then to use previous CPU */
+	if (target_cpu == -1)
+		target_cpu = task_cpu(p);
+
 	if (target_cpu != task_cpu(p)) {
 		struct energy_env eenv = {
 			.util_delta	= task_util(p),
--
1.9.1
On Thu, Jun 23, 2016 at 09:43:03PM +0800, Leo Yan wrote:
In the previous EAS wake-up path, any possible CPU with higher capacity that meets the task's requirement could be selected. This patch prefers to fall back to the previous CPU whenever possible, which avoids unnecessary task migrations between clusters.
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e5ffe8..c4bc7ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5569,7 +5569,7 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg, *sg_target;
 	int target_max_cap = INT_MAX;
-	int target_cpu = task_cpu(p);
+	int target_cpu = -1;
 	int i;

 	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));

@@ -5612,7 +5612,7 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)
 		 */
 		int new_util = cpu_util(i) + boosted_task_util(p);

-		if (new_util > capacity_orig_of(i))
+		if (new_util > capacity_orig_of(i) && target_cpu != -1)
 			continue;

 		if (new_util < capacity_curr_of(i)) {
@@ -5621,11 +5621,18 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)
 			break;
 		}

-		/* cpu has capacity at higher OPP, keep it as fallback */
-		if (target_cpu == task_cpu(p))
+		/*
+		 * cpu has capacity at higher OPP, keep it as fallback;
+		 * give the previous cpu more chance to run
+		 */
+		if (task_cpu(p) == i || target_cpu == -1)
 			target_cpu = i;
 	}
Don't you end up potentially using an over-utilized cpu as target_cpu?
The new_util > capacity_orig_of(i) condition used to reject all cpus that are over-utilized. With the additional target_cpu != -1 check this is no longer the case. In the first iteration target_cpu is still -1, so if the cpu is over-utilized the first criterion is now false, the second (new_util < capacity_curr_of(i)) is false as well, and the last one (above) is true, so we end up setting target_cpu = i, despite the cpu having no spare cycles.
Did I miss something?
+	/* If have not select any CPU, then to use previous CPU */
+	if (target_cpu == -1)
+		target_cpu = task_cpu(p);
You could just return task_cpu(p) here.
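In other words, a sketch of the suggested simplification (an assumption about the next revision, not a posted patch):

	/* No suitable CPU was found in the loop: stay on the previous CPU */
	if (target_cpu == -1)
		return task_cpu(p);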
On Fri, Jul 01, 2016 at 05:10:00PM +0100, Morten Rasmussen wrote:
On Thu, Jun 23, 2016 at 09:43:03PM +0800, Leo Yan wrote:
In the previous EAS wake-up path, any possible CPU with higher capacity that meets the task's requirement could be selected. This patch prefers to fall back to the previous CPU whenever possible, which avoids unnecessary task migrations between clusters.
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e5ffe8..c4bc7ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5569,7 +5569,7 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg, *sg_target;
 	int target_max_cap = INT_MAX;
-	int target_cpu = task_cpu(p);
+	int target_cpu = -1;
 	int i;

 	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));

@@ -5612,7 +5612,7 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)
 		 */
 		int new_util = cpu_util(i) + boosted_task_util(p);

-		if (new_util > capacity_orig_of(i))
+		if (new_util > capacity_orig_of(i) && target_cpu != -1)
 			continue;

 		if (new_util < capacity_curr_of(i)) {
@@ -5621,11 +5621,18 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)
 			break;
 		}

-		/* cpu has capacity at higher OPP, keep it as fallback */
-		if (target_cpu == task_cpu(p))
+		/*
+		 * cpu has capacity at higher OPP, keep it as fallback;
+		 * give the previous cpu more chance to run
+		 */
+		if (task_cpu(p) == i || target_cpu == -1)
 			target_cpu = i;
 	}
Don't you end up potentially using an over-utilized cpu as target_cpu?
The new_util > capacity_orig_of(i) condition used to reject all cpus that are over-utilized. With the additional target_cpu != -1 check this is no longer the case. In the first iteration target_cpu is still -1, so if the cpu is over-utilized the first criterion is now false, the second (new_util < capacity_curr_of(i)) is false as well, and the last one (above) is true, so we end up setting target_cpu = i, despite the cpu having no spare cycles.
Did I miss something?
Yes. It's possible to select an over-utilized cpu. My original purpose was to force a fallback to one target cpu if target_cpu has _NOT_ been initialized. Will revert back to:
if (new_util > capacity_orig_of(i))
+	/* If have not select any CPU, then to use previous CPU */
+	if (target_cpu == -1)
+		target_cpu = task_cpu(p);
You could just return task_cpu(p) here.
Will fix.
Thanks, Leo Yan
Prefer to select an idle CPU to place the woken-up task on. This way the task spreads more consistently across the CPUs in the cluster, which will hopefully allow a lower OPP for power saving.
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c4bc7ae..7bd1c5b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5617,7 +5617,7 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)

 		if (new_util < capacity_curr_of(i)) {
 			target_cpu = i;
-			if (cpu_rq(i)->nr_running)
+			if (!cpu_rq(i)->nr_running)
 				break;
 		}
-- 1.9.1
On 23-Jun 21:43, Leo Yan wrote:
Prefer to select an idle CPU to place the woken-up task on. This way the task spreads more consistently across the CPUs in the cluster, which will hopefully allow a lower OPP for power saving.
In general I think it's worth having these comments in the code as well, right over the IF condition you are modifying...
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c4bc7ae..7bd1c5b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5617,7 +5617,7 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)

 		if (new_util < capacity_curr_of(i)) {
 			target_cpu = i;
-			if (cpu_rq(i)->nr_running)
+			if (!cpu_rq(i)->nr_running)
Maybe we should link this change with a check on boosted value? It could make sense to use a spread strategy only for boosted tasks, where waking up a task on an idle CPU can have a positive effect on its time to completion.
 				break;
 		}
-- 1.9.1
-- #include <best/regards.h>
Patrick Bellasi
On Tue, Jun 28, 2016 at 11:46:14AM +0100, Patrick Bellasi wrote:
On 23-Jun 21:43, Leo Yan wrote:
Prefer to select an idle CPU to place the woken-up task on. This way the task spreads more consistently across the CPUs in the cluster, which will hopefully allow a lower OPP for power saving.
It can potentially harm energy consumption a lot, as you opt for more wake-ups instead of trying to use the spare cycles available on cpus already awake. However, if you are after latency improvements, this makes a lot of sense.
@@ -5617,7 +5617,7 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)

 		if (new_util < capacity_curr_of(i)) {
 			target_cpu = i;
-			if (cpu_rq(i)->nr_running)
+			if (!cpu_rq(i)->nr_running)
Maybe we should link this change with a check on boosted value? It could make sense to use a spread strategy only for boosted tasks, where waking up a task on an idle CPU can have a positive effect on its time to completion.
As long as a non-boosted task with lower vruntime doesn't wake up just after and decides to pack on the same cpu preempting the boosted task ;-)
I'm not sure if it would lead to a stable task distribution if some tasks try to stick together while others try to escape.
On 01-Jul 17:44, Morten Rasmussen wrote:
On Tue, Jun 28, 2016 at 11:46:14AM +0100, Patrick Bellasi wrote:
On 23-Jun 21:43, Leo Yan wrote:
Prefer to select an idle CPU to place the woken-up task on. This way the task spreads more consistently across the CPUs in the cluster, which will hopefully allow a lower OPP for power saving.
It can potentially harm energy consumption a lot, as you opt for more wake-ups instead of trying to use the spare cycles available on cpus already awake. However, if you are after latency improvements, this makes a lot of sense.
@@ -5617,7 +5617,7 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)

 		if (new_util < capacity_curr_of(i)) {
 			target_cpu = i;
-			if (cpu_rq(i)->nr_running)
+			if (!cpu_rq(i)->nr_running)
Maybe we should link this change with a check on boosted value? It could make sense to use a spread strategy only for boosted tasks, where waking up a task on an idle CPU can have a positive effect on its time to completion.
As long as a non-boosted task with lower vruntime doesn't wake up just after and decides to pack on the same cpu preempting the boosted task ;-)
Well, this "a posteriori" analysis is valid for many EAS code paths... we cannot (yet) forecast the future. :-/
I'm not sure if it would lead to a stable task distribution if some tasks try to stick together while others try to escape.
That's a good point. If we go for a mixed packing-vs-spreading strategy there should be some metric/mechanism that allows us to monitor and control oscillations...
However, since we know that EAS is looking for "local minimums", I would expect more likely scenarios in which:
a) boosted tasks wake up and spread, thus moving the system away from a potential local minimum
b) non-boosted tasks wake up and pack, thus moving the system toward a new (maybe even better) local minimum
-- #include <best/regards.h>
Patrick Bellasi
On Mon, Jul 04, 2016 at 09:51:15AM +0100, Patrick Bellasi wrote:
On 01-Jul 17:44, Morten Rasmussen wrote:
On Tue, Jun 28, 2016 at 11:46:14AM +0100, Patrick Bellasi wrote:
On 23-Jun 21:43, Leo Yan wrote:
Prefer to select an idle CPU to place the woken-up task on. This way the task spreads more consistently across the CPUs in the cluster, which will hopefully allow a lower OPP for power saving.
It can potentially harm energy consumption a lot, as you opt for more wake-ups instead of trying to use the spare cycles available on cpus already awake. However, if you are after latency improvements, this makes a lot of sense.
@@ -5617,7 +5617,7 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)

 		if (new_util < capacity_curr_of(i)) {
 			target_cpu = i;
-			if (cpu_rq(i)->nr_running)
+			if (!cpu_rq(i)->nr_running)
Maybe we should link this change with a check on boosted value? It could make sense to use a spread strategy only for boosted tasks, where waking up a task on an idle CPU can have a positive effect on its time to completion.
As long as a non-boosted task with lower vruntime doesn't wake up just after and decides to pack on the same cpu preempting the boosted task ;-)
Well, this "a posteriori" analysis is valid for many EAS code paths... we cannot (yet) forecast the future. :-/
I'm not sure if it would lead to a stable task distribution if some tasks try to stick together while others try to escape.
That's a good point. If we go for a mixed packing-vs-spreading strategy there should be some metric/mechanism that allows us to monitor and control oscillations...
Adding more noise and then adding yet more features to try to counter it seems a bit counter-productive to me.
However, since we know that EAS is looking for "local minimums", I would expect more likely scenarios in which:
a) boosted tasks wake up and spread, thus moving the system away from a potential local minimum
b) non-boosted tasks wake up and pack, thus moving the system toward a new (maybe even better) local minimum
Or you keep moving around causing unnecessary task migrations.
The perturbation argument can be used for every single scheduling decision, good or bad :-)
If you really want to add perturbations to the task placement, I would suggest making them explicit and much better controlled. It shouldn't be a random side effect somewhere. I'm not convinced myself about the idea. Perturbations are not free, and can easily cause more harm than good, so it has to be designed carefully.
I would suggest aiming for a stable scheduling policy by default, and then thinking about adding perturbations later if necessary.
Morten
In the old code there is only one case in which a task is directly migrated from a lower-capacity CPU to a higher-capacity CPU: the previous CPU's capacity cannot meet the bandwidth requirement and the CPU is not over-utilized. This is reasonable if we only use the task's util_avg signal, compared against CPU capacity, to judge the task's performance requirement.
But in some scenarios the task spends a lot of time in the runnable state, so its load_avg value is considerably higher than its util_avg. For this case we can use load_avg to help with the migration decision.
If so, we need to add a path that migrates the task to the higher-capacity CPU when the destination CPU has higher capacity than the previous one.
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 kernel/sched/fair.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7bd1c5b..185efe1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5633,6 +5633,18 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)
 	if (target_cpu == -1)
 		target_cpu = task_cpu(p);

+	/*
+	 * Destination CPU has higher capacity than previous CPU, so from
+	 * previous path that means pervious CPU has no enough capacity to meet
+	 * the waken up task. Therefore directly return back and select
+	 * destination CPU.
+	 */
+	if (capacity_of(target_cpu) != capacity_of(task_cpu(p))) {
+		if (capacity_of(target_cpu) ==
+		    cpu_rq(target_cpu)->rd->max_cpu_capacity.val)
+			return target_cpu;
+	}
+
 	if (target_cpu != task_cpu(p)) {
 		struct energy_env eenv = {
 			.util_delta	= task_util(p),
--
1.9.1
On Thu, Jun 23, 2016 at 09:43:05PM +0800, Leo Yan wrote:
In the old code there is only one case in which a task is directly migrated from a lower-capacity CPU to a higher-capacity CPU: the previous CPU's capacity cannot meet the bandwidth requirement and the CPU is not over-utilized. This is reasonable if we only use the task's util_avg signal, compared against CPU capacity, to judge the task's performance requirement.
But in some scenarios the task spends a lot of time in the runnable state, so its load_avg value is considerably higher than its util_avg. For this case we can use load_avg to help with the migration decision.
If so, we need to add a path that migrates the task to the higher-capacity CPU when the destination CPU has higher capacity than the previous one.
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7bd1c5b..185efe1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5633,6 +5633,18 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)
 	if (target_cpu == -1)
 		target_cpu = task_cpu(p);

+	/*
+	 * Destination CPU has higher capacity than previous CPU, so from
+	 * previous path that means pervious CPU has no enough capacity to meet
+	 * the waken up task. Therefore directly return back and select
+	 * destination CPU.
+	 */
+	if (capacity_of(target_cpu) != capacity_of(task_cpu(p))) {
+		if (capacity_of(target_cpu) ==
+		    cpu_rq(target_cpu)->rd->max_cpu_capacity.val)
+			return target_cpu;
+	}
It is not straightforward to infer that if target_cpu has the highest cpu capacity available in the system, then task_cpu(p) (previous cpu) must be overutilized. IIUC, it is only the case because task_fits_max() and cpu_overutilized() both use capacity_margin as the margin and we only consider cpus accepted by task_fits_max() when finding target_cpu.
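For context, the two helpers referred to here look roughly like this in the EAS tree of that period (paraphrased, so treat the exact form as an assumption):

static bool cpu_overutilized(int cpu)
{
	/* utilization within ~20% of capacity counts as over-utilized */
	return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
}

static inline bool __task_fits(struct task_struct *p, int cpu, int util)
{
	unsigned long capacity = capacity_of(cpu);

	util += boosted_task_util(p);

	/* the same ~20% margin, applied on the task side */
	return (capacity * 1024) > (util * capacity_margin);
}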
Anyway, the scenario you seem to cover is the case where task_cpu() is over-utilized. This scenario should already be covered a little further down in energy_aware_wake_cpu():
	/* Not enough spare capacity on previous cpu */
	if (cpu_overutilized(task_cpu(p)))
		return target_cpu;
Are there any cases where this isn't sufficient? I thought we were already covered for this scenario?
The commit message mentions load_avg. I don't see it used anywhere in this patch or how it relates to the code, so I'm a bit puzzled.
Also 'return target_cpu;' seems to need more indentation.
Morten
On Mon, Jul 04, 2016 at 09:41:45AM +0100, Morten Rasmussen wrote:
On Thu, Jun 23, 2016 at 09:43:05PM +0800, Leo Yan wrote:
In the old code there is only one case in which a task is directly migrated from a lower-capacity CPU to a higher-capacity CPU: the previous CPU's capacity cannot meet the bandwidth requirement and the CPU is not over-utilized. This is reasonable if we only use the task's util_avg signal, compared against CPU capacity, to judge the task's performance requirement.
But in some scenarios the task spends a lot of time in the runnable state, so its load_avg value is considerably higher than its util_avg. For this case we can use load_avg to help with the migration decision.
If so, we need to add a path that migrates the task to the higher-capacity CPU when the destination CPU has higher capacity than the previous one.
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7bd1c5b..185efe1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5633,6 +5633,18 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target) if (target_cpu == -1) target_cpu = task_cpu(p);
- /*
* Destination CPU has higher capacity than previous CPU, so from
* previous path that means pervious CPU has no enough capacity to meet
* the waken up task. Therefore directly return back and select
* destination CPU.
*/
- if (capacity_of(target_cpu) != capacity_of(task_cpu(p))) {
if (capacity_of(target_cpu) ==
cpu_rq(target_cpu)->rd->max_cpu_capacity.val)
return target_cpu;
- }
It is not straightforward to infer that if target_cpu has the highest cpu capacity available in the system, then task_cpu(p) (previous cpu) must be overutilized. IIUC, it is only the case because task_fits_max() and cpu_overutilized() both use capacity_margin as the margin and we only consider cpus accepted by task_fits_max() when finding target_cpu.
Anyway, the scenario you seem to cover is the case where task_cpu() is over-utilized. This scenario should already be covered a little further down in energy_aware_wake_cpu():
	/* Not enough spare capacity on previous cpu */
	if (cpu_overutilized(task_cpu(p)))
		return target_cpu;
Are there any cases where this isn't sufficient? I thought we were already covered for this scenario?
The commit message mentions load_avg. I don't see it used anywhere in this patch or how it relates to the code, so I'm a bit puzzled.
Sorry, my patch ordering introduced the confusion. This patch works together with patch 4: after patch 4 is applied there is an additional case where, even though the system is not "over-utilized", the task might still be migrated to the big cluster.
So we cannot rely on the "over-utilized" flag anymore, but need to directly migrate the task to a big core.
Also 'return target_cpu;' seems to need more indentation.
Will fix. Thanks for pointing out.
Morten
On Mon, Jul 04, 2016 at 05:46:18PM +0800, Leo Yan wrote:
On Mon, Jul 04, 2016 at 09:41:45AM +0100, Morten Rasmussen wrote:
On Thu, Jun 23, 2016 at 09:43:05PM +0800, Leo Yan wrote:
In the old code there is only one case in which a task is directly migrated from a lower-capacity CPU to a higher-capacity CPU: the previous CPU's capacity cannot meet the bandwidth requirement and the CPU is not over-utilized. This is reasonable if we only use the task's util_avg signal, compared against CPU capacity, to judge the task's performance requirement.
But in some scenarios the task spends a lot of time in the runnable state, so its load_avg value is considerably higher than its util_avg. For this case we can use load_avg to help with the migration decision.
If so, we need to add a path that migrates the task to the higher-capacity CPU when the destination CPU has higher capacity than the previous one.
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7bd1c5b..185efe1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5633,6 +5633,18 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target) if (target_cpu == -1) target_cpu = task_cpu(p);
- /*
* Destination CPU has higher capacity than previous CPU, so from
* previous path that means pervious CPU has no enough capacity to meet
* the waken up task. Therefore directly return back and select
* destination CPU.
*/
- if (capacity_of(target_cpu) != capacity_of(task_cpu(p))) {
if (capacity_of(target_cpu) ==
cpu_rq(target_cpu)->rd->max_cpu_capacity.val)
return target_cpu;
- }
It is not straightforward to infer that if target_cpu has the highest cpu capacity available in the system, then task_cpu(p) (previous cpu) must be overutilized. IIUC, it is only the case because task_fits_max() and cpu_overutilized() both use capacity_margin as the margin and we only consider cpus accepted by task_fits_max() when finding target_cpu.
Anyway, the scenario you seem to cover is the case where task_cpu() is over-utilized. This scenario should already be covered a little further down in energy_aware_wake_cpu():
	/* Not enough spare capacity on previous cpu */
	if (cpu_overutilized(task_cpu(p)))
		return target_cpu;
Are there any cases where this isn't sufficient? I thought we were already covered for this scenario?
The commit message mentions load_avg. I don't see it used anywhere in this patch or how it relates to the code, so I'm a bit puzzled.
Sorry, my patch ordering introduced the confusion. This patch works together with patch 4: after patch 4 is applied there is an additional case where, even though the system is not "over-utilized", the task might still be migrated to the big cluster.
So we cannot rely on the "over-utilized" flag anymore, but need to directly migrate the task to a big core.
I'm guessing this is due to the boosting of the task utilization?
Wouldn't it be better to just modify the existing over-utilization check to handle boosted tasks as well, so we don't have two conditions that do almost the same thing?
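One possible shape of that suggestion (an illustrative sketch, not something posted in this thread) would be to fold the boosted/load signal into the existing previous-cpu check instead of adding a separate capacity comparison:

	/*
	 * Not enough spare capacity on previous cpu, either because it is
	 * over-utilized or because the boosted task no longer fits there.
	 */
	if (cpu_overutilized(task_cpu(p)) ||
	    boosted_task_util(p) * capacity_margin >
	    capacity_of(task_cpu(p)) * 1024)
		return target_cpu;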
Morten
On Mon, Jul 04, 2016 at 11:03:41AM +0100, Morten Rasmussen wrote:
On Mon, Jul 04, 2016 at 05:46:18PM +0800, Leo Yan wrote:
On Mon, Jul 04, 2016 at 09:41:45AM +0100, Morten Rasmussen wrote:
On Thu, Jun 23, 2016 at 09:43:05PM +0800, Leo Yan wrote:
In the old code there is only one case in which a task is directly migrated from a lower-capacity CPU to a higher-capacity CPU: the previous CPU's capacity cannot meet the bandwidth requirement and the CPU is not over-utilized. This is reasonable if we only use the task's util_avg signal, compared against CPU capacity, to judge the task's performance requirement.
But in some scenarios the task spends a lot of time in the runnable state, so its load_avg value is considerably higher than its util_avg. For this case we can use load_avg to help with the migration decision.
If so, we need to add a path that migrates the task to the higher-capacity CPU when the destination CPU has higher capacity than the previous one.
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7bd1c5b..185efe1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5633,6 +5633,18 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target) if (target_cpu == -1) target_cpu = task_cpu(p);
- /*
* Destination CPU has higher capacity than previous CPU, so from
* previous path that means pervious CPU has no enough capacity to meet
* the waken up task. Therefore directly return back and select
* destination CPU.
*/
- if (capacity_of(target_cpu) != capacity_of(task_cpu(p))) {
if (capacity_of(target_cpu) ==
cpu_rq(target_cpu)->rd->max_cpu_capacity.val)
return target_cpu;
- }
It is not straightforward to infer that if target_cpu has the highest cpu capacity available in the system, then task_cpu(p) (previous cpu) must be overutilized. IIUC, it is only the case because task_fits_max() and cpu_overutilized() both use capacity_margin as the margin and we only consider cpus accepted by task_fits_max() when finding target_cpu.
Anyway, the scenario you seem to cover is the case where task_cpu() is over-utilized. This scenario should already be covered a little further down in energy_aware_wake_cpu():
	/* Not enough spare capacity on previous cpu */
	if (cpu_overutilized(task_cpu(p)))
		return target_cpu;
Are there any cases where this isn't sufficient? I thought we were already covered for this scenario?
The commit message mentions load_avg. I don't see it used anywhere in this patch or how it relates to the code, so I'm a bit puzzled.
Sorry, my patch ordering introduced the confusion. This patch works together with patch 4: after patch 4 is applied there is an additional case where, even though the system is not "over-utilized", the task might still be migrated to the big cluster.
So we cannot rely on the "over-utilized" flag anymore, but need to directly migrate the task to a big core.
I'm guessing this is due to the boosting of the task utilization?
Yes.
Wouldn't it be better to just modify the existing over-utilization check to handle boosted tasks as well, so we don't have two conditions that do almost the same thing?
I have two different ideas: one method is to use this patch to migrate the task without triggering the tipping point; the other is to set a lower bar for the tipping point and go back to using traditional SMP load balancing.
Though both can improve performance, I prefer the first method for scenarios where there are many tasks but only one or two big tasks need to be migrated to the big cluster.
What do you think about this case?
Thanks, Leo Yan
When load_avg is much higher than util_avg, it indicates either that the task has a higher priority (and hence a bigger weight in load_avg) or that the task spends much more time in the runnable state.
For both of these cases, replace the util_avg value with load_avg. This inflates the utilization signal and finally gives a single big task more chance to migrate to a big CPU.
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 644c39a..5d6bb25 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1166,6 +1166,7 @@ struct load_weight {
  * 2) for entity, support any load.weight always runnable
  */
 struct sched_avg {
+	u64 last_migrate_time;
 	u64 last_update_time, load_sum;
 	u32 util_sum, period_contrib;
 	unsigned long load_avg, util_avg;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 185efe1..7fbfd41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -674,6 +674,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 {
 	struct sched_avg *sa = &se->avg;

+	sa->last_migrate_time = 0;
 	sa->last_update_time = 0;
 	/*
 	 * sched_avg's period_contrib should be strictly less then 1024, so
@@ -2771,6 +2772,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s

 skip_aging:
 	se->avg.last_update_time = cfs_rq->avg.last_update_time;
+	se->avg.last_migrate_time = cfs_rq->avg.last_update_time;
 	cfs_rq->avg.load_avg += se->avg.load_avg;
 	cfs_rq->avg.load_sum += se->avg.load_sum;

@@ -5228,6 +5230,11 @@ static inline unsigned long task_util(struct task_struct *p)
 	return p->se.avg.util_avg;
 }

+static inline unsigned long task_load(struct task_struct *p)
+{
+	return p->se.avg.load_avg;
+}
+
 unsigned int capacity_margin = 1280; /* ~20% margin */

 static inline unsigned long boosted_task_util(struct task_struct *task);
@@ -5369,8 +5376,35 @@ static inline unsigned long boosted_task_util(struct task_struct *task)
 {
 	unsigned long util = task_util(task);
+	unsigned long load = task_load(task);
 	unsigned long margin = schedtune_task_margin(task);
+	int cpu = task_cpu(task);
+	struct sched_entity *se = &task->se;
+	u64 delta;
+
+	/*
+	 * change to use load metrics if can meet two conditions:
+	 * - load is 20% higher than util, so that means task have extra
+	 *   20% time for runnable state and waiting to run; Or the task has
+	 *   higher prioirty than nice 0; then consider to use load signal
+	 *   rather than util signal;
+	 * - load reach CPU "over-utilized" criteria.
+	 */
+	if ((load * capacity_margin > capacity_of(cpu) * 1024) &&
+	    (load * 1024 > util * capacity_margin))
+		util = load;
+	else {
+		/*
+		 * Avoid ping-pong issue, so make sure the task can run at
+		 * least once in higher capacity CPU
+		 */
+		delta = se->avg.last_update_time - se->avg.last_migrate_time;
+		if (delta < sysctl_sched_latency &&
+		    capacity_of(cpu) == cpu_rq(cpu)->rd->max_cpu_capacity.val)
+			util = load;
+	}
+
 	trace_sched_boost_task(task, util, margin);

 	return util + margin;
@@ -9100,6 +9134,7 @@ static void task_move_group_fair(struct task_struct *p)
 #ifdef CONFIG_SMP
 	/* Tell se's cfs_rq has been changed -- migrated */
 	p->se.avg.last_update_time = 0;
+	p->se.avg.last_migrate_time = 0;
 #endif
 	attach_task_cfs_rq(p);
 }
--
1.9.1
On 23-Jun 21:43, Leo Yan wrote:
When load_avg is much higher than util_avg, it indicates either that the task has a higher priority (and hence a bigger weight in load_avg) or that the task spends much more time in the runnable state.
For both of these cases, replace the util_avg value with load_avg. This inflates the utilization signal and finally gives a single big task more chance to migrate to a big CPU.
Signed-off-by: Leo Yan leo.yan@linaro.org
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 644c39a..5d6bb25 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1166,6 +1166,7 @@ struct load_weight {
  * 2) for entity, support any load.weight always runnable
  */
 struct sched_avg {
+	u64 last_migrate_time;
 	u64 last_update_time, load_sum;
 	u32 util_sum, period_contrib;
 	unsigned long load_avg, util_avg;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 185efe1..7fbfd41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -674,6 +674,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 {
 	struct sched_avg *sa = &se->avg;

+	sa->last_migrate_time = 0;
 	sa->last_update_time = 0;
 	/*
 	 * sched_avg's period_contrib should be strictly less then 1024, so
@@ -2771,6 +2772,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s

 skip_aging:
 	se->avg.last_update_time = cfs_rq->avg.last_update_time;
+	se->avg.last_migrate_time = cfs_rq->avg.last_update_time;
 	cfs_rq->avg.load_avg += se->avg.load_avg;
 	cfs_rq->avg.load_sum += se->avg.load_sum;

@@ -5228,6 +5230,11 @@ static inline unsigned long task_util(struct task_struct *p)
 	return p->se.avg.util_avg;
 }

+static inline unsigned long task_load(struct task_struct *p)
+{
+	return p->se.avg.load_avg;
+}
+
 unsigned int capacity_margin = 1280; /* ~20% margin */

 static inline unsigned long boosted_task_util(struct task_struct *task);
@@ -5369,8 +5376,35 @@ static inline unsigned long boosted_task_util(struct task_struct *task)
 {
 	unsigned long util = task_util(task);
+	unsigned long load = task_load(task);
Should not be:
unsigned long load = clamp(task_load(task), 0, SCHED_LOAD_SCALE);
unsigned long margin = schedtune_task_margin(task);
Again, as a general comment, should we enable these mods only for boosted tasks? In this function we can use the condition "margin != 0" to identify when it's worth enabling the switch from utilization to load... thus increasing the chances to migrate to big CPUs only for boosted tasks.
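A minimal sketch of that gating (an assumption about how it could look, reusing the condition from the patch):

	/* only let boosted tasks (margin != 0) switch from util to load */
	if (margin &&
	    (load * capacity_margin > capacity_of(cpu) * 1024) &&
	    (load * 1024 > util * capacity_margin))
		util = load;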
+	int cpu = task_cpu(task);
+	struct sched_entity *se = &task->se;
+	u64 delta;
+
+	/*
+	 * change to use load metrics if can meet two conditions:
+	 * - load is 20% higher than util, so that means task have extra
+	 *   20% time for runnable state and waiting to run; Or the task has
+	 *   higher prioirty than nice 0; then consider to use load signal
+	 *   rather than util signal;
+	 * - load reach CPU "over-utilized" criteria.
Maybe add that the 20% margin is defined by the value:
capacity_margin / SCHED_LOAD_SCALE
+	 */
+	if ((load * capacity_margin > capacity_of(cpu) * 1024) &&
+	    (load * 1024 > util * capacity_margin))
Just for readability and to match the previous comment:
if ((util * capacity_margin < load * 1024) && (load * capacity_margin > capacity_of(cpu) * 1024))
+		util = load;
+	else {
+		/*
+		 * Avoid ping-pong issue, so make sure the task can run at
+		 * least once in higher capacity CPU
+		 */
+		delta = se->avg.last_update_time - se->avg.last_migrate_time;
+		if (delta < sysctl_sched_latency &&
+		    capacity_of(cpu) == cpu_rq(cpu)->rd->max_cpu_capacity.val)
+			util = load;
+	}
trace_sched_boost_task(task, util, margin);
return util + margin;
@@ -9100,6 +9134,7 @@ static void task_move_group_fair(struct task_struct *p)
 #ifdef CONFIG_SMP
 	/* Tell se's cfs_rq has been changed -- migrated */
 	p->se.avg.last_update_time = 0;
+	p->se.avg.last_migrate_time = 0;
 #endif
 	attach_task_cfs_rq(p);
 }
--
1.9.1
-- #include <best/regards.h>
Patrick Bellasi
Hi Patrick,
On Tue, Jun 28, 2016 at 12:12:14PM +0100, Patrick Bellasi wrote:
On 23-Jun 21:43, Leo Yan wrote:
When load_avg is much higher than util_avg, it indicates either that the task has a higher priority (and hence a bigger weight in load_avg) or that the task spends much more time in the runnable state.
For both of these cases, replace the util_avg value with load_avg. This inflates the utilization signal and finally gives a single big task more chance to migrate to a big CPU.
Signed-off-by: Leo Yan leo.yan@linaro.org
include/linux/sched.h | 1 + kernel/sched/fair.c | 35 +++++++++++++++++++++++++++++++++++ 2 files changed, 36 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 644c39a..5d6bb25 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1166,6 +1166,7 @@ struct load_weight {
- for entity, support any load.weight always runnable
*/ struct sched_avg {
- u64 last_migrate_time; u64 last_update_time, load_sum; u32 util_sum, period_contrib; unsigned long load_avg, util_avg;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 185efe1..7fbfd41 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -674,6 +674,7 @@ void init_entity_runnable_average(struct sched_entity *se) { struct sched_avg *sa = &se->avg;
- sa->last_migrate_time = 0; sa->last_update_time = 0; /*
- sched_avg's period_contrib should be strictly less then 1024, so
@@ -2771,6 +2772,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
skip_aging: se->avg.last_update_time = cfs_rq->avg.last_update_time;
- se->avg.last_migrate_time = cfs_rq->avg.last_update_time; cfs_rq->avg.load_avg += se->avg.load_avg; cfs_rq->avg.load_sum += se->avg.load_sum;
@@ -5228,6 +5230,11 @@ static inline unsigned long task_util(struct task_struct *p) return p->se.avg.util_avg; }
+static inline unsigned long task_load(struct task_struct *p) +{
- return p->se.avg.load_avg;
+}
unsigned int capacity_margin = 1280; /* ~20% margin */
static inline unsigned long boosted_task_util(struct task_struct *task); @@ -5369,8 +5376,35 @@ static inline unsigned long boosted_task_util(struct task_struct *task) { unsigned long util = task_util(task);
- unsigned long load = task_load(task);
Should not be:
unsigned long load = clamp(task_load(task), 0, SCHED_LOAD_SCALE);
Should we do this after finishing the condition checks below, so we can keep the original value?
unsigned long margin = schedtune_task_margin(task);
Again, as a general comment, should we enable these mods only for boosted tasks? In this function we can use the condition "margin != 0" to identify when it's worth enabling the switch from utilization to load... thus increasing the chances to migrate to big CPUs only for boosted tasks.
This patch is not aimed at boosted tasks, but at the case where a single big task has spent sufficient runnable time on the rq yet cannot migrate to a big core. If the task has been given a boost margin from a cgroup or sysfs, then I prefer to directly use the specified value.
What do you think about this?
- int cpu = task_cpu(task);
- struct sched_entity *se = &task->se;
- u64 delta;
- /*
* change to use load metrics if can meet two conditions:
* - load is 20% higher than util, so that means task have extra
* 20% time for runnable state and waiting to run; Or the task has
* higher prioirty than nice 0; then consider to use load signal
* rather than util signal;
* - load reach CPU "over-utilized" criteria.
Maybe add that the 20% margin is defined by the value:
capacity_margin / SCHED_LOAD_SCALE
Yes, this is more consistent. Will fix.
*/
- if ((load * capacity_margin > capacity_of(cpu) * 1024) &&
(load * 1024 > util * capacity_margin))
Just for readability and to match the previous comment:
if ((util * capacity_margin < load * 1024) && (load * capacity_margin > capacity_of(cpu) * 1024))
Will fix.
util = load;
else {
/*
* Avoid ping-pong issue, so make sure the task can run at
* least once in higher capacity CPU
*/
delta = se->avg.last_update_time - se->avg.last_migrate_time;
if (delta < sysctl_sched_latency &&
capacity_of(cpu) == cpu_rq(cpu)->rd->max_cpu_capacity.val)
util = load;
}
trace_sched_boost_task(task, util, margin);
return util + margin;
@@ -9100,6 +9134,7 @@ static void task_move_group_fair(struct task_struct *p) #ifdef CONFIG_SMP /* Tell se's cfs_rq has been changed -- migrated */ p->se.avg.last_update_time = 0;
- p->se.avg.last_migrate_time = 0;
#endif attach_task_cfs_rq(p); } -- 1.9.1
-- #include <best/regards.h>
Patrick Bellasi
On 28-Jun 20:45, Leo Yan wrote:
Hi Patrick,
On Tue, Jun 28, 2016 at 12:12:14PM +0100, Patrick Bellasi wrote:
On 23-Jun 21:43, Leo Yan wrote:
When load_avg is much higher than util_avg, it indicates either that the task has a higher priority (and hence a bigger weight in load_avg) or that the task spends much more time in the runnable state.
For both of these cases, replace the util_avg value with load_avg. This inflates the utilization signal and finally gives a single big task more chance to migrate to a big CPU.
Signed-off-by: Leo Yan leo.yan@linaro.org
include/linux/sched.h | 1 + kernel/sched/fair.c | 35 +++++++++++++++++++++++++++++++++++ 2 files changed, 36 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 644c39a..5d6bb25 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1166,6 +1166,7 @@ struct load_weight {
- for entity, support any load.weight always runnable
*/ struct sched_avg {
- u64 last_migrate_time; u64 last_update_time, load_sum; u32 util_sum, period_contrib; unsigned long load_avg, util_avg;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 185efe1..7fbfd41 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -674,6 +674,7 @@ void init_entity_runnable_average(struct sched_entity *se) { struct sched_avg *sa = &se->avg;
- sa->last_migrate_time = 0; sa->last_update_time = 0; /*
- sched_avg's period_contrib should be strictly less then 1024, so
@@ -2771,6 +2772,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
skip_aging: se->avg.last_update_time = cfs_rq->avg.last_update_time;
- se->avg.last_migrate_time = cfs_rq->avg.last_update_time; cfs_rq->avg.load_avg += se->avg.load_avg; cfs_rq->avg.load_sum += se->avg.load_sum;
@@ -5228,6 +5230,11 @@ static inline unsigned long task_util(struct task_struct *p) return p->se.avg.util_avg; }
+static inline unsigned long task_load(struct task_struct *p) +{
- return p->se.avg.load_avg;
+}
unsigned int capacity_margin = 1280; /* ~20% margin */
static inline unsigned long boosted_task_util(struct task_struct *task); @@ -5369,8 +5376,35 @@ static inline unsigned long boosted_task_util(struct task_struct *task) { unsigned long util = task_util(task);
- unsigned long load = task_load(task);
Should not be:
unsigned long load = clamp(task_load(task), 0, SCHED_LOAD_SCALE);
Should we do this after finishing the condition checks below, so we can keep the original value?
unsigned long margin = schedtune_task_margin(task);
Again, as a general comment, should we enable these mods only for boosted tasks? In this function we can use the condition "margin != 0" to identify when it's worth enabling the switch from utilization to load... thus increasing the chances to migrate to big CPUs only for boosted tasks.
This patch is not aimed at boosted tasks, but at the case where a single big task has spent sufficient runnable time on the rq yet cannot migrate to a big core. If the task has been given a boost margin from a cgroup or sysfs, then I prefer to directly use the specified value.
What do you think about this?
Agree; however it seems to me that you always use the load (also for boosted tasks) if its utilization matches the following criteria, don't you?
In that case we end up summing a margin to a load, which is bigger than the utilization... and this can end up in boosted_task_util() returning something >1024.
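A simple way to express that cap (a sketch, assuming the existing return statement is kept) would be:

	trace_sched_boost_task(task, util, margin);

	/* never report more than the full scale */
	return min_t(unsigned long, util + margin, SCHED_LOAD_SCALE);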
- int cpu = task_cpu(task);
- struct sched_entity *se = &task->se;
- u64 delta;
- /*
* change to use load metrics if can meet two conditions:
* - load is 20% higher than util, so that means task have extra
* 20% time for runnable state and waiting to run; Or the task has
* higher prioirty than nice 0; then consider to use load signal
* rather than util signal;
* - load reach CPU "over-utilized" criteria.
Maybe add that the 20% margin is defined by the value:
capacity_margin / SCHED_LOAD_SCALE
Yes, this is more consistent. Will fix.
*/
- if ((load * capacity_margin > capacity_of(cpu) * 1024) &&
(load * 1024 > util * capacity_margin))
Just for readability and to match the previous comment:
if ((util * capacity_margin < load * 1024) && (load * capacity_margin > capacity_of(cpu) * 1024))
Will fix.
util = load;
- else {
/*
* Avoid ping-pong issue, so make sure the task can run at
* least once in higher capacity CPU
*/
delta = se->avg.last_update_time - se->avg.last_migrate_time;
if (delta < sysctl_sched_latency &&
capacity_of(cpu) == cpu_rq(cpu)->rd->max_cpu_capacity.val)
util = load;
- }
Maybe we can cap "util+margin" here, right before tracing and returning the value?
trace_sched_boost_task(task, util, margin);
return util + margin; @@ -9100,6 +9134,7 @@ static void task_move_group_fair(struct task_struct *p) #ifdef CONFIG_SMP /* Tell se's cfs_rq has been changed -- migrated */ p->se.avg.last_update_time = 0;
- p->se.avg.last_migrate_time = 0;
#endif attach_task_cfs_rq(p); } -- 1.9.1
-- #include <best/regards.h>
Patrick Bellasi
-- #include <best/regards.h>
Patrick Bellasi
On Tue, Jul 05, 2016 at 09:43:06AM +0100, Patrick Bellasi wrote:
On 28-Jun 20:45, Leo Yan wrote:
Hi Patrick,
On Tue, Jun 28, 2016 at 12:12:14PM +0100, Patrick Bellasi wrote:
On 23-Jun 21:43, Leo Yan wrote:
When load_avg is much higher than util_avg, it indicates either that the task has a higher priority (and hence a bigger weight in load_avg) or that the task spends much more time in the runnable state.
For both of these cases, replace the util_avg value with load_avg. This inflates the utilization signal and finally gives a single big task more chance to migrate to a big CPU.
Signed-off-by: Leo Yan leo.yan@linaro.org
include/linux/sched.h | 1 + kernel/sched/fair.c | 35 +++++++++++++++++++++++++++++++++++ 2 files changed, 36 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 644c39a..5d6bb25 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1166,6 +1166,7 @@ struct load_weight {
- for entity, support any load.weight always runnable
*/ struct sched_avg {
- u64 last_migrate_time; u64 last_update_time, load_sum; u32 util_sum, period_contrib; unsigned long load_avg, util_avg;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 185efe1..7fbfd41 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -674,6 +674,7 @@ void init_entity_runnable_average(struct sched_entity *se) { struct sched_avg *sa = &se->avg;
- sa->last_migrate_time = 0; sa->last_update_time = 0; /*
- sched_avg's period_contrib should be strictly less then 1024, so
@@ -2771,6 +2772,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
skip_aging: se->avg.last_update_time = cfs_rq->avg.last_update_time;
- se->avg.last_migrate_time = cfs_rq->avg.last_update_time; cfs_rq->avg.load_avg += se->avg.load_avg; cfs_rq->avg.load_sum += se->avg.load_sum;
@@ -5228,6 +5230,11 @@ static inline unsigned long task_util(struct task_struct *p) return p->se.avg.util_avg; }
+static inline unsigned long task_load(struct task_struct *p) +{
- return p->se.avg.load_avg;
+}
unsigned int capacity_margin = 1280; /* ~20% margin */
static inline unsigned long boosted_task_util(struct task_struct *task); @@ -5369,8 +5376,35 @@ static inline unsigned long boosted_task_util(struct task_struct *task) { unsigned long util = task_util(task);
- unsigned long load = task_load(task);
Should not be:
unsigned long load = clamp(task_load(task), 0, SCHED_LOAD_SCALE);
Should we do this after finishing the condition checks below, so we can keep the original value?
unsigned long margin = schedtune_task_margin(task);
Again, as a general comment, should we enable these mods only for boosted tasks? In this function we can use the condition "margin != 0" to identify when it's worth enabling the switch from utilization to load... thus increasing the chances to migrate to big CPUs only for boosted tasks.
This patch is not aimed at boosted tasks, but at the case where a single big task has spent sufficient runnable time on the rq yet cannot migrate to a big core. If the task has been given a boost margin from a cgroup or sysfs, then I prefer to directly use the specified value.
What do you think about this?
Agree; however it seems to me that you always use the load (also for boosted tasks) if its utilization matches the following criteria, don't you?
My fault. Will fix it.
In that case we end up summing a margin to a load, which is bigger than the utilization... and this can end up in boosted_task_util() returning something >1024.
Yes.
- int cpu = task_cpu(task);
- struct sched_entity *se = &task->se;
- u64 delta;
- /*
* change to use load metrics if can meet two conditions:
* - load is 20% higher than util, so that means task have extra
* 20% time for runnable state and waiting to run; Or the task has
* higher prioirty than nice 0; then consider to use load signal
* rather than util signal;
* - load reach CPU "over-utilized" criteria.
Maybe add that the 20% margin is defined by the value:
capacity_margin / SCHED_LOAD_SCALE
Yes, this is more consistent. Will fix.
*/
- if ((load * capacity_margin > capacity_of(cpu) * 1024) &&
(load * 1024 > util * capacity_margin))
Just for readability and to match the previous comment:
if ((util * capacity_margin < load * 1024) && (load * capacity_margin > capacity_of(cpu) * 1024))
Will fix.
util = load;
- else {
/*
* Avoid ping-pong issue, so make sure the task can run at
* least once in higher capacity CPU
*/
delta = se->avg.last_update_time - se->avg.last_migrate_time;
if (delta < sysctl_sched_latency &&
capacity_of(cpu) == cpu_rq(cpu)->rd->max_cpu_capacity.val)
util = load;
- }
Maybe we can cap "util+margin" here, right before tracing and returning the value?
Will add this.
trace_sched_boost_task(task, util, margin);
return util + margin; @@ -9100,6 +9134,7 @@ static void task_move_group_fair(struct task_struct *p) #ifdef CONFIG_SMP /* Tell se's cfs_rq has been changed -- migrated */ p->se.avg.last_update_time = 0;
- p->se.avg.last_migrate_time = 0;
#endif attach_task_cfs_rq(p); } -- 1.9.1
-- #include <best/regards.h>
Patrick Bellasi
-- #include <best/regards.h>
Patrick Bellasi
On Thu, Jun 23, 2016 at 09:43:06PM +0800, Leo Yan wrote:
When load_avg is much higher than util_avg, it indicates either that the task has a higher priority (and hence a bigger weight in load_avg) or that the task spends much more time in the runnable state.
For both of these cases, replace the util_avg value with load_avg. This inflates the utilization signal and finally gives a single big task more chance to migrate to a big CPU.
Signed-off-by: Leo Yan leo.yan@linaro.org
include/linux/sched.h | 1 + kernel/sched/fair.c | 35 +++++++++++++++++++++++++++++++++++ 2 files changed, 36 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 644c39a..5d6bb25 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1166,6 +1166,7 @@ struct load_weight {
  * 2) for entity, support any load.weight always runnable
  */
 struct sched_avg {
+	u64 last_migrate_time;
 	u64 last_update_time, load_sum;
 	u32 util_sum, period_contrib;
 	unsigned long load_avg, util_avg;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 185efe1..7fbfd41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -674,6 +674,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 {
 	struct sched_avg *sa = &se->avg;

+	sa->last_migrate_time = 0;
 	sa->last_update_time = 0;
 	/*
 	 * sched_avg's period_contrib should be strictly less then 1024, so
@@ -2771,6 +2772,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s

 skip_aging:
 	se->avg.last_update_time = cfs_rq->avg.last_update_time;
+	se->avg.last_migrate_time = cfs_rq->avg.last_update_time;
 	cfs_rq->avg.load_avg += se->avg.load_avg;
 	cfs_rq->avg.load_sum += se->avg.load_sum;

@@ -5228,6 +5230,11 @@ static inline unsigned long task_util(struct task_struct *p)
 	return p->se.avg.util_avg;
 }

+static inline unsigned long task_load(struct task_struct *p)
+{
+	return p->se.avg.load_avg;
+}
+
 unsigned int capacity_margin = 1280; /* ~20% margin */

 static inline unsigned long boosted_task_util(struct task_struct *task);
@@ -5369,8 +5376,35 @@ static inline unsigned long boosted_task_util(struct task_struct *task)
 {
 	unsigned long util = task_util(task);
+	unsigned long load = task_load(task);
 	unsigned long margin = schedtune_task_margin(task);
+	int cpu = task_cpu(task);
+	struct sched_entity *se = &task->se;
+	u64 delta;
+
+	/*
+	 * change to use load metrics if can meet two conditions:
+	 * - load is 20% higher than util, so that means task have extra
+	 *   20% time for runnable state and waiting to run; Or the task has
+	 *   higher prioirty than nice 0; then consider to use load signal
+	 *   rather than util signal;
+	 * - load reach CPU "over-utilized" criteria.
+	 */
+	if ((load * capacity_margin > capacity_of(cpu) * 1024) &&
+	    (load * 1024 > util * capacity_margin))
+		util = load;
+	else {
+		/*
+		 * Avoid ping-pong issue, so make sure the task can run at
+		 * least once in higher capacity CPU
+		 */
+		delta = se->avg.last_update_time - se->avg.last_migrate_time;
+		if (delta < sysctl_sched_latency &&
+		    capacity_of(cpu) == cpu_rq(cpu)->rd->max_cpu_capacity.val)
+			util = load;
+	}
This extra boost for tasks that have recently migrated isn't mentioned in the cover letter but seems to be a significant part of the actual patch.
IIUC, you boost utilization of tasks that have recently migrated. Could you explain a little more about why it is needed?
The task will appear bigger each time it migrates, regardless of whether it has migrated little->big or big->little. Doesn't that mean you are likely to send tasks that have recently migrated big->little back to big immediately because of the boost?
Morten
On Mon, Jul 04, 2016 at 11:13:50AM +0100, Morten Rasmussen wrote:
On Thu, Jun 23, 2016 at 09:43:06PM +0800, Leo Yan wrote:
When load_avg is much higher than util_avg, it indicates either that the task has a higher priority (and hence a bigger weight in load_avg) or that the task spends much more time in the runnable state.
For both of these cases, replace the util_avg value with load_avg. This inflates the utilization signal and finally gives a single big task more chance to migrate to a big CPU.
Signed-off-by: Leo Yan leo.yan@linaro.org
include/linux/sched.h | 1 + kernel/sched/fair.c | 35 +++++++++++++++++++++++++++++++++++ 2 files changed, 36 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 644c39a..5d6bb25 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1166,6 +1166,7 @@ struct load_weight {
- for entity, support any load.weight always runnable
*/ struct sched_avg {
- u64 last_migrate_time; u64 last_update_time, load_sum; u32 util_sum, period_contrib; unsigned long load_avg, util_avg;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 185efe1..7fbfd41 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -674,6 +674,7 @@ void init_entity_runnable_average(struct sched_entity *se) { struct sched_avg *sa = &se->avg;
- sa->last_migrate_time = 0; sa->last_update_time = 0; /*
- sched_avg's period_contrib should be strictly less then 1024, so
@@ -2771,6 +2772,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
skip_aging: se->avg.last_update_time = cfs_rq->avg.last_update_time;
- se->avg.last_migrate_time = cfs_rq->avg.last_update_time; cfs_rq->avg.load_avg += se->avg.load_avg; cfs_rq->avg.load_sum += se->avg.load_sum;
@@ -5228,6 +5230,11 @@ static inline unsigned long task_util(struct task_struct *p) return p->se.avg.util_avg; }
+static inline unsigned long task_load(struct task_struct *p) +{
- return p->se.avg.load_avg;
+}
unsigned int capacity_margin = 1280; /* ~20% margin */
static inline unsigned long boosted_task_util(struct task_struct *task); @@ -5369,8 +5376,35 @@ static inline unsigned long boosted_task_util(struct task_struct *task) { unsigned long util = task_util(task);
unsigned long load = task_load(task); unsigned long margin = schedtune_task_margin(task);
int cpu = task_cpu(task);
struct sched_entity *se = &task->se;
u64 delta;
/*
* change to use load metrics if can meet two conditions:
* - load is 20% higher than util, so that means task have extra
* 20% time for runnable state and waiting to run; Or the task has
* higher prioirty than nice 0; then consider to use load signal
* rather than util signal;
* - load reach CPU "over-utilized" criteria.
*/
if ((load * capacity_margin > capacity_of(cpu) * 1024) &&
(load * 1024 > util * capacity_margin))
util = load;
else {
/*
* Avoid ping-pong issue, so make sure the task can run at
* least once in higher capacity CPU
*/
delta = se->avg.last_update_time - se->avg.last_migrate_time;
if (delta < sysctl_sched_latency &&
capacity_of(cpu) == cpu_rq(cpu)->rd->max_cpu_capacity.val)
util = load;
}
This extra boost for tasks that have recently migrated isn't mentioned in the cover letter but seems to be a significant part of the actual patch.
Yes.
IIUC, you boost utilization of tasks that have recently migrated. Could you explain a little more about why it is needed?
The patch first wants to boost utilization if the task has stayed in the runnable state for long enough; then, after the task has migrated from a little core to a big core, it ensures the task keeps running on the big core for a while (at least long enough for the task to run on the big core once). That is why utilization is replaced by the load signal in these two scenarios.
The task will appear bigger each time it migrates regardless of whether it has migrate little->big or big->little. Doesn't it mean that you are likely to send tasks that have recently migrated big->little back to big immediately because of the boost?
Yes. The intent is also to avoid a ping-pong issue once we have boosted the utilization signal, so the task can run in the big cluster for a while.
Actually this patch wants to achieve a similar effect to HMP's up_threshold and down_threshold: once a task goes over up_threshold it can stay on a big core for a while, and it is only migrated back to a little core once its load drops below down_threshold.
So especially in the single-thread scenario where the task has a big load but does _NOT_ cross the EAS tipping point, the task stays in the little cluster and has much less chance to migrate to a big core. With HMP, the same task's load_avg value only needs to occasionally cross up_threshold for it to get a chance to stay on a big core. So HMP can achieve a much better performance result than EAS here.
Thanks, Leo Yan
On Tue, Jul 05, 2016 at 04:17:03PM +0800, Leo Yan wrote:
On Mon, Jul 04, 2016 at 11:13:50AM +0100, Morten Rasmussen wrote:
On Thu, Jun 23, 2016 at 09:43:06PM +0800, Leo Yan wrote:
When load_avg is much higher than util_avg, it indicates that either the task has a higher priority (and hence a larger weight contribution to load_avg) or the task spends much more time in the runnable state.
In both cases, replace the util_avg value with load_avg. This inflates the utilization signal and finally gives a single big task a better chance to migrate to a big CPU.
Signed-off-by: Leo Yan leo.yan@linaro.org
include/linux/sched.h | 1 + kernel/sched/fair.c | 35 +++++++++++++++++++++++++++++++++++ 2 files changed, 36 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h index 644c39a..5d6bb25 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1166,6 +1166,7 @@ struct load_weight {
- for entity, support any load.weight always runnable
*/ struct sched_avg {
- u64 last_migrate_time; u64 last_update_time, load_sum; u32 util_sum, period_contrib; unsigned long load_avg, util_avg;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 185efe1..7fbfd41 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -674,6 +674,7 @@ void init_entity_runnable_average(struct sched_entity *se) { struct sched_avg *sa = &se->avg;
- sa->last_migrate_time = 0; sa->last_update_time = 0; /*
- sched_avg's period_contrib should be strictly less then 1024, so
@@ -2771,6 +2772,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
skip_aging: se->avg.last_update_time = cfs_rq->avg.last_update_time;
- se->avg.last_migrate_time = cfs_rq->avg.last_update_time; cfs_rq->avg.load_avg += se->avg.load_avg; cfs_rq->avg.load_sum += se->avg.load_sum;
@@ -5228,6 +5230,11 @@ static inline unsigned long task_util(struct task_struct *p) return p->se.avg.util_avg; }
+static inline unsigned long task_load(struct task_struct *p) +{
- return p->se.avg.load_avg;
+}
unsigned int capacity_margin = 1280; /* ~20% margin */
static inline unsigned long boosted_task_util(struct task_struct *task); @@ -5369,8 +5376,35 @@ static inline unsigned long boosted_task_util(struct task_struct *task) { unsigned long util = task_util(task);
unsigned long load = task_load(task); unsigned long margin = schedtune_task_margin(task);
int cpu = task_cpu(task);
struct sched_entity *se = &task->se;
u64 delta;
/*
* switch to the load metric if two conditions are met:
* - load is 20% higher than util, which means the task has spent an
* extra 20% of its time in the runnable state waiting to run, or the
* task has a higher priority than nice 0; in that case prefer the
* load signal over the util signal;
* - load reaches the CPU "over-utilized" criteria.
*/
if ((load * capacity_margin > capacity_of(cpu) * 1024) &&
(load * 1024 > util * capacity_margin))
util = load;
else {
/*
* Avoid ping-pong issue, so make sure the task can run at
* least once in higher capacity CPU
*/
delta = se->avg.last_update_time - se->avg.last_migrate_time;
if (delta < sysctl_sched_latency &&
capacity_of(cpu) == cpu_rq(cpu)->rd->max_cpu_capacity.val)
util = load;
}
This extra boost for tasks that have recently migrated isn't mentioned in the cover letter but seems to be a significant part of the actual patch.
Yes.
IIUC, you boost utilization of tasks that have recently migrated. Could you explain a little more about why it is needed?
The patch first wants to boost utilization if the task has stayed in the runnable state for long enough; then, after the task has migrated from a little core to a big core, it ensures the task keeps running on the big core for a while (at least long enough for the task to run on the big core once). That is why utilization is replaced by the load signal in these two scenarios.
I don't see why a task that has recently migrated little->big should not get to run at least once on the big cpu if the system is not above the tipping point (over-utilized).
The task was enqueued on a big rq recently, and nobody should have pulled it away before it had a chance to run at least once. We don't do load_balance() when below the tipping point. AFAICT, your recently migrated condition only has effect after the first wake-up on a big cpu (i.e. second wake, third wake, and so forth until sched_latency time has passed since the migration). The first wake-up was when the migration happened.
So to me, it looks like a mechanism to make the task keep waking up on a big cpu until sched_latency time has passed after a migration. The first wake-up should already be covered.
The task will appear bigger each time it migrates regardless of whether it has migrate little->big or big->little. Doesn't it mean that you are likely to send tasks that have recently migrated big->little back to big immediately because of the boost?
Yes. The intent is also to avoid a ping-pong issue once we have boosted the utilization signal, so the task can run in the big cluster for a while.
Actually this patch wants to achieve a similar effect to HMP's up_threshold and down_threshold: once a task goes over up_threshold it can stay on a big core for a while, and it is only migrated back to a little core once its load drops below down_threshold.
I think I get what you want to achieve, but isn't it more a kind of one-way bias rather than a hysteresis like HMP has? You only try to keep task on big cpus.
So especially in the single-thread scenario where the task has a big load but does _NOT_ cross the EAS tipping point, the task stays in the little cluster and has much less chance to migrate to a big core. With HMP, the same task's load_avg value only needs to occasionally cross up_threshold for it to get a chance to stay on a big core. So HMP can achieve a much better performance result than EAS here.
This sounds like a scenario where you want to boost utilization of the task to get it out of the tipping point grey zone to improve latency.
On Tue, Jul 05, 2016 at 10:25:31AM +0100, Morten Rasmussen wrote:
[...]
	/*
	 * switch to the load metric if two conditions are met:
	 * - load is 20% higher than util, which means the task has spent an
	 *   extra 20% of its time in the runnable state waiting to run, or
	 *   the task has a higher priority than nice 0; in that case prefer
	 *   the load signal over the util signal;
	 * - load reaches the CPU "over-utilized" criteria.
	 */
	if ((load * capacity_margin > capacity_of(cpu) * 1024) &&
	    (load * 1024 > util * capacity_margin))
		util = load;
	else {
		/*
		 * Avoid ping-pong issue, so make sure the task can run at
		 * least once in higher capacity CPU
		 */
		delta = se->avg.last_update_time - se->avg.last_migrate_time;
		if (delta < sysctl_sched_latency &&
		    capacity_of(cpu) == cpu_rq(cpu)->rd->max_cpu_capacity.val)
			util = load;
	}
This extra boost for tasks that have recently migrated isn't mentioned in the cover letter but seems to be a significant part of the actual patch.
Yes.
IIUC, you boost utilization of tasks that have recently migrated. Could you explain a little more about why it is needed?
The patch first wants to boost utilization if the task has stayed in the runnable state for long enough; then, after the task has migrated from a little core to a big core, it ensures the task keeps running on the big core for a while (at least long enough for the task to run on the big core once). That is why utilization is replaced by the load signal in these two scenarios.
I don't see why a task that has recently migrated little->big should not get to run at least once on the big cpu if the system is not above the tipping point (over-utilized).
Sorry I introduced confusion and agree with you.
The task was enqueued on a big rq recently, and nobody should have pulled it away before it had a chance to run at least once. We don't do load_balance() when below the tipping point. AFAICT, your recently migrated condition only has effect after the first wake-up on a big cpu (i.e. second wake, third wake, and so forth until sched_latency time has passed since the migration). The first wake-up was when the migration happened.
So to me, it looks like a mechanism to make the task keep waking up on a big cpu until sched_latency time has passed after a migration. The first wake-up should already be covered.
If so, I still think we need to set some barrier to avoid the task being migrated back to the little core too easily. But I'm not sure whether sched_latency is a reasonable value or not, especially since, as you said, we cannot use sched_latency as the justification for the first wake-up anymore.
The task will appear bigger each time it migrates regardless of whether it has migrate little->big or big->little. Doesn't it mean that you are likely to send tasks that have recently migrated big->little back to big immediately because of the boost?
Yes. The intent is also to avoid a ping-pong issue once we have boosted the utilization signal, so the task can run in the big cluster for a while.
Actually this patch wants to achieve a similar effect to HMP's up_threshold and down_threshold: once a task goes over up_threshold it can stay on a big core for a while, and it is only migrated back to a little core once its load drops below down_threshold.
I think I get what you want to achieve, but isn't it more a kind of one-way bias rather than a hysteresis like HMP has? You only try to keep task on big cpus.
After the task has migrated to a big core, it is hard for this condition to become true: (load * capacity_margin > capacity_of(cpu) * 1024).
So it is easy to fall back to the original util signal rather than the load signal if we don't add the sched_latency time condition.
This is why I don't worry much about task migration from a big core to a little core, and hope to rely on the decayed util signal: after the task's load becomes smaller it can finally migrate to a little core.
So especially in the single-thread scenario where the task has a big load but does _NOT_ cross the EAS tipping point, the task stays in the little cluster and has much less chance to migrate to a big core. With HMP, the same task's load_avg value only needs to occasionally cross up_threshold for it to get a chance to stay on a big core. So HMP can achieve a much better performance result than EAS here.
This sounds like a scenario where you want to boost utilization of the task to get it out of the tipping point grey zone to improve latency.
Yes, exactly. For example, the task has a util_avg of ~300 but the little core has a fairly high capacity value (~600), so in the end the task has no chance at all to migrate to a big core.
Thanks, Leo Yan
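To make that scenario concrete with assumed numbers (the load value below is hypothetical, picked only for illustration): take util ~= 300, little-CPU capacity ~= 600, capacity_margin = 1280 and, say, load ~= 500. Then load * capacity_margin = 640000 > capacity_of(cpu) * 1024 = 614400, and load * 1024 = 512000 > util * capacity_margin = 384000, so both conditions in the patch hold and the load value (500) is used in place of util (300) when computing the boosted utilization.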
After the tipping point is crossed, if two big tasks are packed on a single CPU they can easily meet the condition in task_hot(). As a result can_migrate_task() returns false and tasks will _NOT_ be spread within the cluster.
This patch checks an extra condition: if the source CPU has two or more runnable tasks and the destination CPU is idle, then consider migrating tasks more aggressively.
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 kernel/sched/fair.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7fbfd41..a6eef88 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6541,6 +6541,17 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 1;
 	}

+	/*
+	 * After over tipping point then aggressively to spread task
+	 * to CPUs if destination CPU is idle and source CPU has more than
+	 * one task is runnable.
+	 */
+	if (energy_aware() && env->dst_rq->rd->overutilized) {
+		if (env->dst_rq->nr_running == 0 &&
+		    env->src_rq->nr_running >= 2)
+			return 1;
+	}
+
 	schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
 	return 0;
 }
--
1.9.1
Hi Leo,
On 23 June 2016 at 15:43, Leo Yan leo.yan@linaro.org wrote:
After the tipping point is crossed, if two big tasks are packed on a single CPU they can easily meet the condition in task_hot(). As a result can_migrate_task() returns false and tasks will _NOT_ be spread within the cluster.
This patch checks an extra condition: if the source CPU has two or more runnable tasks and the destination CPU is idle, then consider migrating tasks more aggressively.
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 11 +++++++++++ 1 file changed, 11 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7fbfd41..a6eef88 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6541,6 +6541,17 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) return 1; }
/*
* After over tipping point then aggressively to spread task
* to CPUs if destination CPU is idle and source CPU has more than
* one task is runnable.
*/
if (energy_aware() && env->dst_rq->rd->overutilized) {
if (env->dst_rq->nr_running == 0 &&
env->src_rq->nr_running >= 2)
your condition seems to say that there are 2 tasks on src_rq, and its place in the function indicates that the task is not the running one. If the scheduler is not allowed to migrate the task, it's just because the task has run on this cpu during the last 500us (default value). So I still consider that clearing sysctl_sched_migration_cost is a better alternative if the root cause is that the task has just started to run on this cfs_rq and you want to favor task placement more than cache hotness. sysctl_sched_migration_cost is used elsewhere in the scheduler and can prevent other opportunities to balance tasks between CPUs.
return 1;
}
schedstat_inc(p, se.statistics.nr_failed_migrations_hot); return 0;
}
1.9.1
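For reference, the cache-hotness check Vincent refers to lives in task_hot(); roughly paraphrased (this is a paraphrase of the mainline logic of that era, not the exact code of this tree), the relevant part is:

	/* paraphrase of the migration-cost check in task_hot() */
	if (sysctl_sched_migration_cost == -1)
		return 1;	/* always treat the task as cache hot */
	if (sysctl_sched_migration_cost == 0)
		return 0;	/* never treat the task as cache hot */

	delta = rq_clock_task(env->src_rq) - p->se.exec_start;
	return delta < (s64)sysctl_sched_migration_cost;

So with the default 500us value, a task that has run on the source CPU within the last 500us is reported as cache hot and can_migrate_task() refuses to move it.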
Hi Vincent,
Thanks for reviewing.
On Fri, Jun 24, 2016 at 04:44:37PM +0200, Vincent Guittot wrote:
Hi Leo,
On 23 June 2016 at 15:43, Leo Yan leo.yan@linaro.org wrote:
After over tipping point if there have two big tasks are packing on single CPU, these two tasks can be easily meet the condition for task_hot(). In result can_migrate_task() returns false and will _NOT_ spread task within cluster.
This patch check extra condition if source CPU has more than two runnable tasks and destination CPU is idle, then consider tasks can be more aggressively to migrate.
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 11 +++++++++++ 1 file changed, 11 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7fbfd41..a6eef88 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6541,6 +6541,17 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) return 1; }
/*
* After over tipping point then aggressively to spread task
* to CPUs if destination CPU is idle and source CPU has more than
* one task is runnable.
*/
if (energy_aware() && env->dst_rq->rd->overutilized) {
if (env->dst_rq->nr_running == 0 &&
env->src_rq->nr_running >= 2)
your condition seems to say that there are 2 tasks on src_rq, and its place in the function indicates that the task is not the running one. If the scheduler is not allowed to migrate the task, it's just because the task has run on this cpu during the last 500us (default value). So
I totally agree with your analysis, and thanks for helping to point out the root cause.
I still consider that clearing sysctl_sched_migration_cost is a better alternative if the root cause is that the task has just started to run on this cfs_rq and you want to favor task placement more than cache hotness. sysctl_sched_migration_cost is used elsewhere in the scheduler and can prevent other opportunities to balance tasks between CPUs.
I don't feel strongly about using this patch instead of setting sysctl_sched_migration_cost. The main benefit of the patch is that, once applied, it keeps a record of this issue in the code; otherwise there is no single place to track the issue anymore.
Or, following your suggestion, I could write one patch to define a special macro like CONFIG_ENERGY_AWARE_AGGRESSIVE_TASK_SPREAD; when this configuration is enabled, we set sysctl_sched_migration_cost to 0.
If you have any other ideas, please feel free to let me know. Thanks.
return 1;
}
schedstat_inc(p, se.statistics.nr_failed_migrations_hot); return 0;
}
1.9.1
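A minimal sketch of that Kconfig idea, only to make it concrete (the symbol name is Leo's suggestion; the way the default is overridden here is hypothetical, assuming the existing default declaration of this knob in kernel/sched/fair.c):

	/* kernel/sched/fair.c (sketch, not a posted patch) */
	#ifdef CONFIG_ENERGY_AWARE_AGGRESSIVE_TASK_SPREAD
	const_debug unsigned int sysctl_sched_migration_cost = 0;
	#else
	const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
	#endif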
On 24 June 2016 at 5:12 PM, "Leo Yan" leo.yan@linaro.org wrote:
Hi Vincent,
Thanks for reviewing.
On Fri, Jun 24, 2016 at 04:44:37PM +0200, Vincent Guittot wrote:
Hi Leo,
On 23 June 2016 at 15:43, Leo Yan leo.yan@linaro.org wrote:
After over tipping point if there have two big tasks are packing on single CPU, these two tasks can be easily meet the condition for task_hot(). In result can_migrate_task() returns false and will _NOT_ spread task within cluster.
This patch check extra condition if source CPU has more than two runnable tasks and destination CPU is idle, then consider tasks can be more aggressively to migrate.
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 11 +++++++++++ 1 file changed, 11 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7fbfd41..a6eef88 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6541,6 +6541,17 @@ int can_migrate_task(struct task_struct *p,
struct lb_env *env)
return 1; }
/*
* After over tipping point then aggressively to spread task
* to CPUs if destination CPU is idle and source CPU has more
than
* one task is runnable.
*/
if (energy_aware() && env->dst_rq->rd->overutilized) {
if (env->dst_rq->nr_running == 0 &&
env->src_rq->nr_running >= 2)
your condition seems to say that there are 2 tasks on src_rq, and its place in the function indicates that the task is not the running one. If the scheduler is not allowed to migrate the task, it's just because the task has run on this cpu during the last 500us (default value). So
I totally agree with your analysis, and thanks for helping to point out the root cause.
I still consider that clearing sysctl_sched_migration_cost is a better alternative if the root cause is that the task has just started to run on this cfs_rq and you want to favor task placement more than cache hotness. sysctl_sched_migration_cost is used elsewhere in the scheduler and can prevent other opportunities to balance tasks between CPUs.
I don't feel strongly about using this patch instead of setting sysctl_sched_migration_cost. The main benefit of the patch is that, once applied, it keeps a record of this issue in the code; otherwise there is no single place to track the issue anymore.
Or, following your suggestion, I could write one patch to define a special macro like CONFIG_ENERGY_AWARE_AGGRESSIVE_TASK_SPREAD; when this configuration is enabled, we set sysctl_sched_migration_cost to 0.
If you have any other ideas, please feel free to let me know. Thanks.
Such settings are system tuning and don't have to be put in the kernel. They have to be considered like other tunings: cpufreq sampling rate, interrupt affinity ...
I agree that we have to track these tunings which have a major impact on EAS, but the kernel doesn't seem to be the best place IMO. Can we create a document like a wiki page in which we can list all settings that impact EAS behavior and what the preferred value is?
I'm pretty sure that other tunings (scheduler but not only) impact EAS, and the default values might not be the best ones for EAS.
So we could track all of them in one single place.
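(In practice this particular knob can be set purely from userspace as part of such a tuning set, e.g. by writing 0 to /proc/sys/kernel/sched_migration_cost_ns, assuming the usual procfs name for it.)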
Regards,
return 1;
}
schedstat_inc(p, se.statistics.nr_failed_migrations_hot); return 0;
}
1.9.1
Hi Vincent,
On Fri, Jun 24, 2016 at 05:22:42PM +0200, Vincent Guittot wrote:
[...]
I'm not strong to stick to use this patch but not to set sysctl_sched_migration_cost. The mainly benefit for this patch is after applying this patch, then it will be useful to keep it in the code. Otherwise there have no one place to track this issue anymore.
Or follow your suggestion, I may write one patch to define a special macor like: CONFIG_ENERGY_AWARE_AGGRESSIVE_TASK_SPREAD, after enable this configuration then we set sysctl_sched_migration_cost to 0.
If have any other idea, just free let me know. Thanks.
Such settings are system tuning and don't have to be put in the kernel. They have to be considered like other tunings: cpufreq sampling rate, interrupt affinity ...
I agree that we have to track these tunings which have a major impact on EAS, but the kernel doesn't seem to be the best place IMO. Can we create a document like a wiki page in which we can list all settings that impact EAS behavior and what the preferred value is?
It's a good suggestion. I'm not the best person to do this since I don't have a global view of the current work items, but it will be very useful for us to discuss questions on the same page. So I drafted a page for the EAS checklist and known-issue tracking: https://wiki.linaro.org/eas-checklist-and-known-issues-tracking
If I've missed anything, feel free to let me know, or you're welcome to update the wiki page directly.
I'm pretty sure that other tunings (scheduler but not only) impact EAS, and the default values might not be the best ones for EAS.
So we could track all of them in one single place.
Regards,
return 1;
}
schedstat_inc(p, se.statistics.nr_failed_migrations_hot); return 0;
}
1.9.1
Hi leo,
On 29 June 2016 at 08:18, Leo Yan leo.yan@linaro.org wrote:
Hi Vincent,
On Fri, Jun 24, 2016 at 05:22:42PM +0200, Vincent Guittot wrote:
[...]
I'm not strong to stick to use this patch but not to set sysctl_sched_migration_cost. The mainly benefit for this patch is after applying this patch, then it will be useful to keep it in the code. Otherwise there have no one place to track this issue anymore.
Or follow your suggestion, I may write one patch to define a special macor like: CONFIG_ENERGY_AWARE_AGGRESSIVE_TASK_SPREAD, after enable this configuration then we set sysctl_sched_migration_cost to 0.
If have any other idea, just free let me know. Thanks.
Such settings are system tuning and don't have to be put in the kernel. They have to be considered like other tunings: cpufreq sampling rate, interrupt affinity ...
I agree that we have to track these tunings which have a major impact on EAS, but the kernel doesn't seem to be the best place IMO. Can we create a document like a wiki page in which we can list all settings that impact EAS behavior and what the preferred value is?
It's a good suggestion. I'm not the best person to do this since I don't have a global view of the current work items, but it will be very useful for us to discuss questions on the same page. So I drafted a page for the EAS checklist and known-issue tracking: https://wiki.linaro.org/eas-checklist-and-known-issues-tracking
If I've missed anything, feel free to let me know, or you're welcome to update the wiki page directly.
Thanks Leo,
I'm going to have a look
Regards, Vincent
I'm pretty sure that other tunings (scheduler but not only) impact EAS, and the default values might not be the best ones for EAS.
So we could track all of them in one single place.
Regards,
return 1;
}
schedstat_inc(p, se.statistics.nr_failed_migrations_hot); return 0;
}
1.9.1
Hi Leo, Vincent.
On 29 Jun 2016, at 07:18, Leo Yan leo.yan@linaro.org wrote:
It's a good suggestion. I'm not the best person to do this since I don't have a global view of the current work items, but it will be very useful for us to discuss questions on the same page. So I drafted a page for the EAS checklist and known-issue tracking: https://wiki.linaro.org/eas-checklist-and-known-issues-tracking
If I've missed anything, feel free to let me know, or you're welcome to update the wiki page directly.
This is a good suggestion and will help us review and land anything useful in the LSK EAS topic branches as things progress.
Thanks! Robin
On Wed, Jun 29, 2016 at 10:56:10AM +0100, Robin Randhawa wrote:
Hi Leo, Vincent.
On 29 Jun 2016, at 07:18, Leo Yan leo.yan@linaro.org wrote:
It's a good suggestion. I'm not the best person to do this since I don't have a global view of the current work items, but it will be very useful for us to discuss questions on the same page. So I drafted a page for the EAS checklist and known-issue tracking: https://wiki.linaro.org/eas-checklist-and-known-issues-tracking
If I've missed anything, feel free to let me know, or you're welcome to update the wiki page directly.
This is a good suggestion and will help us review and land anything useful in the LSK EAS topic branches as things progress.
Thanks, Robin. If you have anything to supplement, I'm glad to help add it to the page :)
Thanks! Robin
Current code calculates avg_load as below:

	sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) / sgs->group_capacity

Let's look at the following scenario for cluster-level average load calculation:

The little cluster has 4 CPUs and 2 tasks running at 100% utilization (nice = 0); if the little cluster capacity is 400, then the little cluster avg_load = (2 * 1024 * 1024) / (400 * 4) = 1310.

On the other hand, if the big cluster has 4 CPUs and 4 tasks running at 100% utilization (nice = 0) and the big cluster capacity is 1024, then the big cluster avg_load = (4 * 1024 * 1024) / (1024 * 4) = 1024.

So in the end the scheduler considers the little cluster to have higher load, which obviously doesn't make sense: the big cluster has all 4 CPUs busy while the little cluster actually has 2 idle CPUs.

This is caused by the load value not being scaled by capacity, so dividing by the capacity value gives a misleading average load. Instead, simply divide by the number of CPUs (group->group_weight). For the case above this gives little cluster avg_load = (2 * 1024) / 4 = 512 and big cluster avg_load = (4 * 1024) / 4 = 1024.
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a6eef88..353520d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5475,7 +5475,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		}

 		/* Adjust by relative CPU capacity of the group */
-		avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
+		avg_load = avg_load / group->group_weight;

 		if (local_group) {
 			this_load = avg_load;
@@ -7250,7 +7250,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,

 	/* Adjust by relative CPU capacity of the group */
 	sgs->group_capacity = group->sgc->capacity;
-	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
+	sgs->avg_load = sgs->group_load / group->group_weight;

 	if (sgs->sum_nr_running)
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
--
1.9.1
On 23 June 2016 at 15:43, Leo Yan leo.yan@linaro.org wrote:
Current code calculates avg_load as below: sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity
Let's see below scenario for cluster level average load calculation:
The little cluster have 4 CPUs and 2 tasks running with 100% utilization (nice = 0); if little cluster capacity is 400, then get little cluster avg_load = (2 * 1024 * 1024) / (400 * 4) = 1310.
On the other hand, if big cluster have 4 CPUs and 4 tasks running with 100% utilization (nice = 0); if big cluster capacity is 1024, then get big cluster avg_load = (4 * 1024 * 1024) / (1024 * 4) = 1024.
So finally scheduler considers little cluster has higher load and this obviously doesn't make sense due big cluster has 4 CPUs been running but little cluster actually have 2 CPUs are idle.
This perfectly makes sense: it's all about how much compute capacity has to be shared by the load, so even if 2 little CPUs are idle, the tasks on the big cluster have more compute capacity than those on the little cluster.
We can't use this avg_load into account only when big cluster becomes overloaded whereas little cpu has some idle cpus.
So this is caused by the load value is not scaled by capacity, so it will get wrong average load value by divide capacity value. So just need simply divide CPU number (group->group_weight). So for upper case we can get little cluster avg_load = (2 * 1024) / 4 = 512; big cluster avg_load = (4 * 1024) / 4 = 1024.
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a6eef88..353520d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5475,7 +5475,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, }
/* Adjust by relative CPU capacity of the group */
avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
avg_load = avg_load / group->group_weight; if (local_group) { this_load = avg_load;
@@ -7250,7 +7250,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
/* Adjust by relative CPU capacity of the group */ sgs->group_capacity = group->sgc->capacity;
sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
sgs->avg_load = sgs->group_load / group->group_weight; if (sgs->sum_nr_running) sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
-- 1.9.1
On Mon, Jun 27, 2016 at 10:56:57PM +0200, Vincent Guittot wrote:
On 23 June 2016 at 15:43, Leo Yan leo.yan@linaro.org wrote:
Current code calculates avg_load as below: sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity
Let's see below scenario for cluster level average load calculation:
The little cluster have 4 CPUs and 2 tasks running with 100% utilization (nice = 0); if little cluster capacity is 400, then get little cluster avg_load = (2 * 1024 * 1024) / (400 * 4) = 1310.
On the other hand, if big cluster have 4 CPUs and 4 tasks running with 100% utilization (nice = 0); if big cluster capacity is 1024, then get big cluster avg_load = (4 * 1024 * 1024) / (1024 * 4) = 1024.
So finally scheduler considers little cluster has higher load and this obviously doesn't make sense due big cluster has 4 CPUs been running but little cluster actually have 2 CPUs are idle.
This perfectly makes sense, it's all about how much computation capacity has to be shared between load so even if 2 little cpu are idle, the tasks on big cluster have more compute capacity than those on little cluster.
I'm not sure if I have done enough homework on this :) My understanding is that the load value represents the task's requirement for CPU compute capacity, with a range of [0..1024], so reaching 1024 means all CPU capacity has been consumed. avg_load is the average value over all of the sched_group's CPUs.
So avg_load is a signal for CPU capacity consumption rather than a per-task signal. Please feel free to correct me if I have misunderstood this.
The example I gave was not quite typical. Let's look at a more practical example:
Big cluster has 4 CPUs with 6 runnable tasks, group_load = 3547; so big cluster avg_load = (3547 * 1024) / 4096 = 886;
Little cluster has 4 CPUs with 2 runnable tasks, group_load = 1639;, so little cluster avg_load = (1639 * 1024 / 1603 = 1046;
So how do we calculate the imbalance load between these two clusters? In the current code, even though the big cluster is overloaded, it will not migrate tasks to the little cluster:
	/*
	 * If the local group is busier than the selected busiest group
	 * don't try and pull any tasks.
	 */
	if (local->avg_load >= busiest->avg_load)
		goto out_balanced;
We can't use this avg_load into account only when big cluster becomes overloaded whereas little cpu has some idle cpus.
Even if we add an extra check for whether the big cluster is overloaded, the question remains: how much imbalance load should be migrated from the big cluster to the little cluster? For the case above the avg_load value is of no help. This is why I tried to write this patch, changing avg_load to represent the concept of a "CPU capacity consuming ratio".
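(For comparison, with the definition proposed in patch 6 — group_load / group_weight — the same numbers give big cluster avg_load = 3547 / 4 ~= 886 and little cluster avg_load = 1639 / 4 ~= 410, so the big cluster is then seen as the busier one and the local->avg_load >= busiest->avg_load check above no longer bails out.)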
So this is caused by the load value is not scaled by capacity, so it will get wrong average load value by divide capacity value. So just need simply divide CPU number (group->group_weight). So for upper case we can get little cluster avg_load = (2 * 1024) / 4 = 512; big cluster avg_load = (4 * 1024) / 4 = 1024.
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a6eef88..353520d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5475,7 +5475,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, }
/* Adjust by relative CPU capacity of the group */
avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
avg_load = avg_load / group->group_weight; if (local_group) { this_load = avg_load;
@@ -7250,7 +7250,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
/* Adjust by relative CPU capacity of the group */ sgs->group_capacity = group->sgc->capacity;
sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
sgs->avg_load = sgs->group_load / group->group_weight; if (sgs->sum_nr_running) sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
-- 1.9.1
On 28 June 2016 at 15:58, Leo Yan leo.yan@linaro.org wrote:
On Mon, Jun 27, 2016 at 10:56:57PM +0200, Vincent Guittot wrote:
On 23 June 2016 at 15:43, Leo Yan leo.yan@linaro.org wrote:
Current code calculates avg_load as below: sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity
Let's see below scenario for cluster level average load calculation:
The little cluster have 4 CPUs and 2 tasks running with 100% utilization (nice = 0); if little cluster capacity is 400, then get little cluster avg_load = (2 * 1024 * 1024) / (400 * 4) = 1310.
On the other hand, if big cluster have 4 CPUs and 4 tasks running with 100% utilization (nice = 0); if big cluster capacity is 1024, then get big cluster avg_load = (4 * 1024 * 1024) / (1024 * 4) = 1024.
So finally scheduler considers little cluster has higher load and this obviously doesn't make sense due big cluster has 4 CPUs been running but little cluster actually have 2 CPUs are idle.
This perfectly makes sense, it's all about how much computation capacity has to be shared between load so even if 2 little cpu are idle, the tasks on big cluster have more compute capacity than those on little cluster.
I'm not sure if I have done enough homework for this :) My understanding for load value is the task requirement for CPU computation capacity, the range is [0..1024]. So when reach 1024
The load value is an enhancement of the task's weight that the scheduler used before. The task's weight is now weighted by the task's runnable time, so a task with a short runnable time will not make a CPU look heavily loaded just because of this short high-priority task. This load is then used to ensure a fair share of the CPUs' compute capacity between the tasks. The range is not [0..1024] but [0..88761].
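(88761 is the weight of a nice -20 task in the nice-to-weight table, with nice 0 mapping to 1024 and nice 19 to 15; load_avg is built from these weights, which is why its range is so much wider than util_avg's.)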
meaning all CPU capacity has been consumed. avg_load is the average
Only the utilization of the CPU indicates that all CPU capacity has been consumed, not the load, which is used for fair time sharing between tasks.
value for sched_group's all CPUs.
And avg_load is a singal for CPU capacity consuming but not for task. Please free correct me if I misunderstand for this.
I gave the example which is not quite typical. Let's see more practical example:
Big cluster has 4 CPUs with 6 runnable tasks, group_load = 3547; so big cluster avg_load = (3547 * 1024) / 4096 = 886;
Little cluster has 4 CPUs with 2 runnable tasks, group_load = 1639;, so little cluster avg_load = (1639 * 1024 / 1603 = 1046;
So how we calculate imbalance load between these two clusters? In current code, even big cluster is overloaded but it will not migrate tasks to little cluser:
In this case we use the fact that the big cluster is overloaded but the little cluster is not, together with the load_per_task value. In the mainline kernel this is not handled correctly, but I thought this was handled by the EAS code; there is this code below in calculate_imbalance that matches your condition:
	/*
	 * Busiest group is overloaded, local is not, use the spare
	 * cycles to maximize throughput
	 */
	if (busiest->group_type == group_overloaded &&
	    local->group_type <= group_misfit_task) {
		env->imbalance = busiest->load_per_task;
		return;
	}
	/*
	 * If the local group is busier than the selected busiest group
	 * don't try and pull any tasks.
	 */
	if (local->avg_load >= busiest->avg_load)
		goto out_balanced;
We can't use this avg_load into account only when big cluster becomes overloaded whereas little cpu has some idle cpus.
Even we can add one extra checking for if big cluster is overloaded, then question is how many imbalance load should be migrated from big cluster to little cluster? For upper case, we can see avg_load value is pointless. So this is why I tried to write this patch to change avg_load to represent the concept for "CPU capacity consuming ratio".
So this is caused by the load value is not scaled by capacity, so it will get wrong average load value by divide capacity value. So just need simply divide CPU number (group->group_weight). So for upper case we can get little cluster avg_load = (2 * 1024) / 4 = 512; big cluster avg_load = (4 * 1024) / 4 = 1024.
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a6eef88..353520d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5475,7 +5475,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, }
/* Adjust by relative CPU capacity of the group */
avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
avg_load = avg_load / group->group_weight; if (local_group) { this_load = avg_load;
@@ -7250,7 +7250,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
/* Adjust by relative CPU capacity of the group */ sgs->group_capacity = group->sgc->capacity;
sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
sgs->avg_load = sgs->group_load / group->group_weight; if (sgs->sum_nr_running) sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
-- 1.9.1
On Tue, Jun 28, 2016 at 04:49:29PM +0200, Vincent Guittot wrote:
On 28 June 2016 at 15:58, Leo Yan leo.yan@linaro.org wrote:
On Mon, Jun 27, 2016 at 10:56:57PM +0200, Vincent Guittot wrote:
On 23 June 2016 at 15:43, Leo Yan leo.yan@linaro.org wrote:
Current code calculates avg_load as below: sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity
Let's see below scenario for cluster level average load calculation:
The little cluster have 4 CPUs and 2 tasks running with 100% utilization (nice = 0); if little cluster capacity is 400, then get little cluster avg_load = (2 * 1024 * 1024) / (400 * 4) = 1310.
On the other hand, if big cluster have 4 CPUs and 4 tasks running with 100% utilization (nice = 0); if big cluster capacity is 1024, then get big cluster avg_load = (4 * 1024 * 1024) / (1024 * 4) = 1024.
So finally scheduler considers little cluster has higher load and this obviously doesn't make sense due big cluster has 4 CPUs been running but little cluster actually have 2 CPUs are idle.
This perfectly makes sense, it's all about how much computation capacity has to be shared between load so even if 2 little cpu are idle, the tasks on big cluster have more compute capacity than those on little cluster.
I'm not sure if I have done enough homework for this :) My understanding for load value is the task requirement for CPU computation capacity, the range is [0..1024]. So when reach 1024
load value is as an enhancement of task's weight that was used before in the scheduler. The task's weight is now pondered by the runnable time of the task so a task with a short runnable time will not make a CPU seen as heavily loaded just because of thsi short high prio task. This load is then used to ensure a fair share of the CPUs compute capacity between the tasks The range is not [0..1024] but [0..88761]
I will digest this information by reading the code offline. But I recognize that I misunderstood "load" when I connected your description with an earlier email from Yuyang [1].
[1] http://article.gmane.org/gmane.linux.kernel/2036154
meaning all CPU capacity has been consumed. avg_load is the average
Only the utilization of the cpu means that all CPU capacity has been consumed but not the load which is used for fair time sharing between tasks
value for sched_group's all CPUs.
And avg_load is a singal for CPU capacity consuming but not for task. Please free correct me if I misunderstand for this.
I gave the example which is not quite typical. Let's see more practical example:
Big cluster has 4 CPUs with 6 runnable tasks, group_load = 3547; so big cluster avg_load = (3547 * 1024) / 4096 = 886;
Little cluster has 4 CPUs with 2 runnable tasks, group_load = 1639;, so little cluster avg_load = (1639 * 1024 / 1603 = 1046;
So how we calculate imbalance load between these two clusters? In current code, even big cluster is overloaded but it will not migrate tasks to little cluser:
In this case we use the fact the big cluster is overloaded but not little cluster and the load_per_task value In mainline kernel, this is not handle correctly but I thought that this was handled by EAS code; There is this code below in calculate_imbalance that matches with your condition
	/*
	 * Busiest group is overloaded, local is not, use the spare
	 * cycles to maximize throughput
	 */
	if (busiest->group_type == group_overloaded &&
	    local->group_type <= group_misfit_task) {
		env->imbalance = busiest->load_per_task;
		return;
	}
But this code will _NOT_ be called in below flow:
find_busiest_group()
{
	[...]

	if (local->avg_load >= busiest->avg_load)	/* --> will directly return */
		goto out_balanced;

	[...]

force_balance:
	env->busiest_group_type = busiest->group_type;
	calculate_imbalance(env, &sds);
	return sds.busiest;

out_balanced:
	env->imbalance = 0;
	return NULL;
}
So I think you are suggesting code like below:
find_busiest_group()
{
	[...]

	/*
	 * Force balance when busiest group is overloaded and
	 * local group is not imbalanced or overloaded.
	 */
	if (energy_aware() &&
	    (busiest->group_type == group_overloaded &&
	     local->group_type <= group_misfit_task))
		goto force_balance;

	if (local->avg_load >= busiest->avg_load)
		goto out_balanced;

	[...]

force_balance:
	env->busiest_group_type = busiest->group_type;
	calculate_imbalance(env, &sds);
	return sds.busiest;

out_balanced:
	env->imbalance = 0;
	return NULL;
}
Thanks, Leo Yan
On 28 June 2016 at 17:52, Leo Yan leo.yan@linaro.org wrote:
On Tue, Jun 28, 2016 at 04:49:29PM +0200, Vincent Guittot wrote:
On 28 June 2016 at 15:58, Leo Yan leo.yan@linaro.org wrote:
On Mon, Jun 27, 2016 at 10:56:57PM +0200, Vincent Guittot wrote:
On 23 June 2016 at 15:43, Leo Yan leo.yan@linaro.org wrote:
Current code calculates avg_load as below: sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity
Let's see below scenario for cluster level average load calculation:
The little cluster have 4 CPUs and 2 tasks running with 100% utilization (nice = 0); if little cluster capacity is 400, then get little cluster avg_load = (2 * 1024 * 1024) / (400 * 4) = 1310.
On the other hand, if big cluster have 4 CPUs and 4 tasks running with 100% utilization (nice = 0); if big cluster capacity is 1024, then get big cluster avg_load = (4 * 1024 * 1024) / (1024 * 4) = 1024.
So finally scheduler considers little cluster has higher load and this obviously doesn't make sense due big cluster has 4 CPUs been running but little cluster actually have 2 CPUs are idle.
This perfectly makes sense, it's all about how much computation capacity has to be shared between load so even if 2 little cpu are idle, the tasks on big cluster have more compute capacity than those on little cluster.
I'm not sure if I have done enough homework for this :) My understanding for load value is the task requirement for CPU computation capacity, the range is [0..1024]. So when reach 1024
load value is as an enhancement of task's weight that was used before in the scheduler. The task's weight is now pondered by the runnable time of the task so a task with a short runnable time will not make a CPU seen as heavily loaded just because of thsi short high prio task. This load is then used to ensure a fair share of the CPUs compute capacity between the tasks The range is not [0..1024] but [0..88761]
I will digest these info with reading code offline. But I recognize I wrongly understood "load" when I connect your description with Yuyang's an early email [1].
[1] http://article.gmane.org/gmane.linux.kernel/2036154
meaning all CPU capacity has been consumed. avg_load is the average
Only the utilization of the cpu means that all CPU capacity has been consumed but not the load which is used for fair time sharing between tasks
value for sched_group's all CPUs.
And avg_load is a singal for CPU capacity consuming but not for task. Please free correct me if I misunderstand for this.
I gave the example which is not quite typical. Let's see more practical example:
Big cluster has 4 CPUs with 6 runnable tasks, group_load = 3547; so big cluster avg_load = (3547 * 1024) / 4096 = 886;
Little cluster has 4 CPUs with 2 runnable tasks, group_load = 1639;, so little cluster avg_load = (1639 * 1024 / 1603 = 1046;
So how we calculate imbalance load between these two clusters? In current code, even big cluster is overloaded but it will not migrate tasks to little cluser:
In this case we use the fact the big cluster is overloaded but not little cluster and the load_per_task value In mainline kernel, this is not handle correctly but I thought that this was handled by EAS code; There is this code below in calculate_imbalance that matches with your condition
	/*
	 * Busiest group is overloaded, local is not, use the spare
	 * cycles to maximize throughput
	 */
	if (busiest->group_type == group_overloaded &&
	    local->group_type <= group_misfit_task) {
		env->imbalance = busiest->load_per_task;
		return;
	}
But this code will _NOT_ be called in below flow:
find_busiest_group() {
[...] if (local->avg_load >= busiest->avg_load) --> will directly return goto out_balanced; [...]
force_balance: env->busiest_group_type = busiest->group_type; calculate_imbalance(env, &sds); return sds.busiest;
out_balanced: env->imbalance = 0; return NULL; }
So I think you are suggesting code like below:
find_busiest_group() {
[...] /* * Force balance when busiest group is overloaded and * local group is not imbalance or overloaded. */ if (energy_aware() && (busiest->group_type == group_overloaded && local->group_type <= group_misfit_task)) goto force_balance;
There is already something quite similar in find_busiest_group:

	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
	    busiest->group_no_capacity)
		goto force_balance;
replacing env->idle == CPU_NEWLY_IDLE by env->idle == CPU_IDLE should do the job
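Taken literally, that suggestion would be a one-line change on top of the code quoted above (a sketch only, untested; whether CPU_NEWLY_IDLE should remain covered as well is a separate question):

	-	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
	+	if (env->idle == CPU_IDLE && group_has_capacity(env, local) &&
	 	    busiest->group_no_capacity)
	 		goto force_balance;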
if (local->avg_load >= busiest->avg_load) goto out_balanced; [...]
force_balance: env->busiest_group_type = busiest->group_type; calculate_imbalance(env, &sds); return sds.busiest;
out_balanced: env->imbalance = 0; return NULL; }
Thanks, Leo Yan
When calculating the imbalance between two clusters, we need to calculate the average load across the clusters so we can easily work out how much load should be migrated.
Consider the example of two clusters where cluster_A load_avg = 600 and cluster_B load_avg = 200; the average load across the two clusters is 400 if both have the same capacity.
If cluster_A capacity is 512 and cluster_B capacity is 1024, then the following amount of load X needs to be migrated so that cluster_A and cluster_B reach the same load level:
  avg_load(A) - X = avg_load(B) + X * capacity(A) / capacity(B)

           avg_load(A) * capacity(B) - avg_load(B) * capacity(B)
  =>   X = -----------------------------------------------------
                        capacity(A) + capacity(B)

The middle level load is:

       avg_load(A) * capacity(A) + avg_load(B) * capacity(B)
       -----------------------------------------------------
                    capacity(A) + capacity(B)
So finally total_load = avg_load(A) * capacity(A) + avg_load(B) * capacity(B).
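(Plugging the example numbers above into the formula as a sanity check on the algebra: X = (600 * 1024 - 200 * 1024) / (512 + 1024) ~= 267, which leaves avg_load(A) - X ~= 333 and avg_load(B) + X * 512 / 1024 ~= 333, matching the middle level load (600 * 512 + 200 * 1024) / 1536 ~= 333.)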
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 kernel/sched/fair.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 353520d..16e1fe2b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7441,7 +7441,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

 next_group:
 	/* Now, start updating sd_lb_stats */
-	sds->total_load += sgs->group_load;
+	sds->total_load += sgs->group_load * sgs->group_capacity /
+			   SCHED_CAPACITY_SCALE / sgs->group_weight;
 	sds->total_capacity += sgs->group_capacity;

 	sg = sg->next;
@@ -7526,7 +7527,7 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 {
 	unsigned long tmp, capa_now = 0, capa_move = 0;
 	unsigned int imbn = 2;
-	unsigned long scaled_busy_load_per_task;
+	unsigned long scaled_busy_load_per_task, scaled_local_load_per_task;
 	struct sg_lb_stats *local, *busiest;

 	local = &sds->local_stat;
@@ -7541,8 +7542,12 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 		(busiest->load_per_task * SCHED_CAPACITY_SCALE) /
 		busiest->group_capacity;

+	scaled_local_load_per_task =
+		(busiest->load_per_task * SCHED_CAPACITY_SCALE) /
+		local->group_capacity;
+
 	if (busiest->avg_load + scaled_busy_load_per_task >=
-	    local->avg_load + (scaled_busy_load_per_task * imbn)) {
+	    local->avg_load + (scaled_local_load_per_task * imbn)) {
 		env->imbalance = busiest->load_per_task;
 		return;
 	}
--
1.9.1
Hi Leo,
On 23 June 2016 at 15:43, Leo Yan leo.yan@linaro.org wrote:
When calculate imbalance between two clusters, then need calculate average load between clusters so finally can easily get to know how much load should be migrated.
Let's see the example for there have two clusters, cluster_A load_avg = 600 and cluster_B load_avg = 200; so the average load cross these two clusters = 400 if both of them have same capacity value.
If cluster_A capacity is 512 and cluster_B capacity is 1024, then that means cluster_B need migrate below load so cluster_A and cluster_B can get same load level:
It's not clear to me which issue you are trying to fix. AFAICT, calculate_imbalance returns the right value (333) for your use case:
cluster A average load will become = (600 - 333) / 512 = 0.52
cluster B average load will become = (200 + 333) / 1024 = 0.52
system average load is (600 + 200) / (512 + 1024) = 0.52
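(That 333 comes from solving (600 - X) / 512 = (200 + X) / 1024, i.e. X = (600 * 1024 - 200 * 512) / (512 + 1024) = 512000 / 1536 ~= 333.)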
  avg_load(A) - X = avg_load(B) + X * capacity(A) / capacity(B)

           avg_load(A) * capacity(B) - avg_load(B) * capacity(B)
  =>   X = -----------------------------------------------------
                        capacity(A) + capacity(B)

The middle level load is:

       avg_load(A) * capacity(A) + avg_load(B) * capacity(B)
       -----------------------------------------------------
                    capacity(A) + capacity(B)
So finally total_load = avg_load(A) * capacity(A) + avg_load(B) * capacity(B).
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 353520d..16e1fe2b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7441,7 +7441,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
next_group: /* Now, start updating sd_lb_stats */
sds->total_load += sgs->group_load;
sds->total_load += sgs->group_load * sgs->group_capacity /
SCHED_CAPACITY_SCALE / sgs->group_weight;
What are you trying to compute here? And why have you used sgs->group_weight?
We already have sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load) / sds.total_capacity; in the find_busiest_group function.
so the computation above seems redundant
sds->total_capacity += sgs->group_capacity; sg = sg->next;
@@ -7526,7 +7527,7 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds) { unsigned long tmp, capa_now = 0, capa_move = 0; unsigned int imbn = 2;
unsigned long scaled_busy_load_per_task;
unsigned long scaled_busy_load_per_task, scaled_local_load_per_task; struct sg_lb_stats *local, *busiest; local = &sds->local_stat;
@@ -7541,8 +7542,12 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds) (busiest->load_per_task * SCHED_CAPACITY_SCALE) / busiest->group_capacity;
scaled_local_load_per_task =
(busiest->load_per_task * SCHED_CAPACITY_SCALE) /
local->group_capacity;
if (busiest->avg_load + scaled_busy_load_per_task >=
local->avg_load + (scaled_busy_load_per_task * imbn)) {
local->avg_load + (scaled_local_load_per_task * imbn)) { env->imbalance = busiest->load_per_task; return; }
-- 1.9.1
On 27 June 2016 at 19:50, Vincent Guittot vincent.guittot@linaro.org wrote:
Hi Leo,
On 23 June 2016 at 15:43, Leo Yan leo.yan@linaro.org wrote:
When calculate imbalance between two clusters, then need calculate average load between clusters so finally can easily get to know how much load should be migrated.
Let's see the example for there have two clusters, cluster_A load_avg = 600 and cluster_B load_avg = 200; so the average load cross these two clusters = 400 if both of them have same capacity value.
If cluster_A capacity is 512 and cluster_B capacity is 1024, then that means cluster_B need migrate below load so cluster_A and cluster_B can get same load level:
It's not clear to me which issue you are trying to fix. AFAICT, calculate_imbalance returns the right value (333) for your use case:
cluster A average load will become = (600 - 333) / 512 = 0.52
cluster B average load will become = (200 + 333) / 1024 = 0.52
system average load is (600 + 200) / (512 + 1024) = 0.52
  avg_load(A) - X = avg_load(B) + X * capacity(A) / capacity(B)

           avg_load(A) * capacity(B) - avg_load(B) * capacity(B)
  =>   X = -----------------------------------------------------
                        capacity(A) + capacity(B)

The middle level load is:

       avg_load(A) * capacity(A) + avg_load(B) * capacity(B)
       -----------------------------------------------------
                    capacity(A) + capacity(B)
So finally total_load = avg_load(A) * capacity(A) + avg_load(B) * capacity(B).
Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 353520d..16e1fe2b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7441,7 +7441,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
next_group: /* Now, start updating sd_lb_stats */
sds->total_load += sgs->group_load;
sds->total_load += sgs->group_load * sgs->group_capacity /
SCHED_CAPACITY_SCALE / sgs->group_weight;
What are you trying to compute here? And why have you used sgs->group_weight?
OK, so you have done this because you have changed the content of sgs->avg_load. You should say that in the commit message.
We already have sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load) / sds.total_capacity; in the find_busiest_group function.
so the computation above seems redundant
On Mon, Jun 27, 2016 at 10:42:05PM +0200, Vincent Guittot wrote:
It's not clear to me which issue you are trying to fix. AFAICT, calculate_imbalance() returns the right value (333) for your use case:

cluster A average load will become = (600 - 333) / 512  = 0.52
cluster B average load will become = (200 + 333) / 1024 = 0.52
system average load is (600 + 200) / (512 + 1024) = 0.52
You are right. I also recalculated on my side.
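For reference, here is a minimal standalone recalculation in plain userspace C. It only replays the example numbers from the changelog; it is illustrative and not kernel code:

#include <stdio.h>

int main(void)
{
	/* Example numbers from the changelog, not kernel data */
	double load_a = 600.0, cap_a = 512.0;
	double load_b = 200.0, cap_b = 1024.0;

	/* Load to move so that (load_a - x) / cap_a == (load_b + x) / cap_b */
	double x = (load_a * cap_b - load_b * cap_a) / (cap_a + cap_b);

	printf("load to move: %.0f\n", x);                  /* ~333 */
	printf("cluster A: %.2f\n", (load_a - x) / cap_a);  /* ~0.52 */
	printf("cluster B: %.2f\n", (load_b + x) / cap_b);  /* ~0.52 */
	printf("system:    %.2f\n", (load_a + load_b) / (cap_a + cap_b));
	return 0;
}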
+		sds->total_load += sgs->group_load * sgs->group_capacity /
+				   SCHED_CAPACITY_SCALE / sgs->group_weight;

What are you trying to compute here? And why have you used sgs->group_weight?
Ok, so you have done this because you have changed the content of sgs->avg_load. You should state that in the commit message.
Yes. This patch depends on patch 6, so let's first settle whether patch 6 is necessary or not.
If the number of runnable tasks in one sched_group is bigger than its number of CPUs, that sched_group's load_avg signal will be underestimated, because a CPU's load_avg cannot accumulate the load of all the tasks running on it.

On the other hand, another sched_group may have fewer tasks than CPUs. As a result, the first sched_group's load_per_task will be much lower than the second group's value.

This patch considers this situation and sets imbn to 1 to lower the imbalance bar.
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 16e1fe2b..7d18c7d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7537,6 +7537,9 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 		local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
 	else if (busiest->load_per_task > local->load_per_task)
 		imbn = 1;
+	else if (busiest->sum_nr_running >= busiest->group_weight &&
+		 local->sum_nr_running < local->group_weight)
+		imbn = 1;
 
 	scaled_busy_load_per_task =
 		(busiest->load_per_task * SCHED_CAPACITY_SCALE) /
-- 1.9.1
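To make the described under-estimation concrete, here is a minimal userspace sketch. It takes the changelog's premise at face value, namely that an over-subscribed group's load signal is bounded by its CPU count rather than by its task count; the numbers are made up for illustration and are not kernel data:

#include <stdio.h>

int main(void)
{
	/*
	 * Illustrative numbers only, following the changelog's premise:
	 * an over-subscribed group's load saturates around nr_cpus * 1024
	 * instead of growing with the number of runnable tasks.
	 */
	long busiest_tasks = 8, busiest_cpus = 4;	/* over-subscribed */
	long local_tasks = 2;				/* group with idle CPUs */

	long busiest_group_load = busiest_cpus * 1024;	/* bounded by CPUs */
	long local_group_load = local_tasks * 1024;	/* one task per busy CPU */

	/*
	 * busiest's load_per_task looks smaller than local's, so the
	 * existing "busiest->load_per_task > local->load_per_task" check
	 * never relaxes imbn for this case.
	 */
	printf("busiest load_per_task ~ %ld\n", busiest_group_load / busiest_tasks);
	printf("local   load_per_task ~ %ld\n", local_group_load / local_tasks);
	return 0;
}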
On Thu, Jun 23, 2016 at 09:43:10PM +0800, Leo Yan wrote:
+	else if (busiest->sum_nr_running >= busiest->group_weight &&
+		 local->sum_nr_running < local->group_weight)
+		imbn = 1;
Have you seen any actual effect of this patch?
fix_small_imbalance() is generally a mess of heuristics that, to the best of my knowledge, nobody can explain in detail anymore.

If you want to balance tasks/cpu, why not just set:
env->imbalance = busiest->load_per_task;
and just return?
On Mon, Jul 04, 2016 at 12:54:41PM +0100, Morten Rasmussen wrote:
Have you seen any actual effect of this patch?
To be honest, this patch has no obvious performance improvement effect :)
fix_small_imbalance() is generally a mess of heuristics that, to the best of my knowledge, nobody can explain in detail anymore.

If you want to balance tasks/cpu, why not just set:
env->imbalance = busiest->load_per_task;
and just return?
This is based on patches 6/7, which change the semantics of avg_load. So here I want to loosen the condition for imbalance when the busiest group has more tasks than CPUs, and eventually hit the code below, which returns env->imbalance = busiest->load_per_task:
	if (busiest->avg_load + scaled_busy_load_per_task >=
	    local->avg_load + (scaled_busy_load_per_task * imbn)) {
		env->imbalance = busiest->load_per_task;
		return;
	}
But based on Vincent's review of patches 6/7, I should drop them, so I think I can take your suggestion.
Thanks, Leo Yan
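For illustration, a rough sketch of what taking that suggestion could look like inside fix_small_imbalance(). This is hypothetical and not part of the posted series; it just combines the condition from this patch with the early return Morten suggested:

	else if (busiest->sum_nr_running >= busiest->group_weight &&
		 local->sum_nr_running < local->group_weight) {
		/*
		 * The busiest group has more tasks than CPUs while the
		 * local group still has idle CPUs: move one task's worth
		 * of load instead of tweaking imbn.
		 */
		env->imbalance = busiest->load_per_task;
		return;
	}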