Part of this patchset was previously part of the larger tasks packing patchset [1]. I have split the latter into (at least) 3 different patchsets to make things easier to review:
-configuration of sched_domain topology [2]
-update and consolidation of cpu_power (this patchset)
-tasks packing algorithm
SMT systems are no longer the only ones whose CPUs can have a capacity different from the default value. We need to extend the use of cpu_power_orig to all kinds of platforms so that the scheduler has both the maximum capacity (cpu_power_orig/power_orig) and the current capacity (cpu_power/power) of the CPUs and groups of the sched_domains.

During load balancing, the scheduler evaluates the number of tasks that a group of CPUs can handle. The current method ensures that we never return more capacity than the number of real cores, but it returns a wrong value for a group of LITTLE cores and, in some situations, for SMT systems. The proposed solution computes the ratio between CPUs and cores for a group during the init sequence and uses it together with power and power_orig to return the current capacity of the group.
[1] https://lkml.org/lkml/2013/10/18/121 [2] https://lkml.org/lkml/2014/3/19/377
Vincent Guittot (4):
  sched: extend the usage of cpu_power_orig
  ARM: topology: use new cpu_power interface
  sched: fix computed capacity for HMP
  sched: add per group cpu_power_orig

 arch/arm/kernel/topology.c |  4 ++--
 kernel/sched/core.c        |  9 ++++++++-
 kernel/sched/fair.c        | 31 +++++++++++++++++++------------
 kernel/sched/sched.h       |  3 ++-
 4 files changed, 31 insertions(+), 16 deletions(-)
cpu_power_orig is only changed for SMT systems, in order to reflect the lower capacity of their CPUs. Heterogeneous systems also have to reflect an original capacity that is different from the default value.
Create a more generic function, arch_scale_cpu_power, that can also be used by non-SMT platforms to set cpu_power_orig.
The weak default of arch_scale_cpu_power keeps the previous SMT behavior, in order to preserve backward compatibility in the use of cpu_power_orig.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e9bd0b..ed42061 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5559,6 +5559,20 @@ unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
 	return default_scale_smt_power(sd, cpu);
 }

+unsigned long __weak arch_scale_cpu_power(struct sched_domain *sd, int cpu)
+{
+	unsigned long weight = sd->span_weight;
+
+	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
+		if (sched_feat(ARCH_POWER))
+			return arch_scale_smt_power(sd, cpu);
+		else
+			return default_scale_smt_power(sd, cpu);
+	}
+
+	return SCHED_POWER_SCALE;
+}
+
 static unsigned long scale_rt_power(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -5590,18 +5604,12 @@ static unsigned long scale_rt_power(int cpu)
 static void update_cpu_power(struct sched_domain *sd, int cpu)
 {
-	unsigned long weight = sd->span_weight;
 	unsigned long power = SCHED_POWER_SCALE;
 	struct sched_group *sdg = sd->groups;

-	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
-		if (sched_feat(ARCH_POWER))
-			power *= arch_scale_smt_power(sd, cpu);
-		else
-			power *= default_scale_smt_power(sd, cpu);
+	power *= arch_scale_cpu_power(sd, cpu);

-		power >>= SCHED_POWER_SHIFT;
-	}
+	power >>= SCHED_POWER_SHIFT;

 	sdg->sgp->power_orig = power;
Hi Vincent,
On 03/28/2014 06:52 PM, Vincent Guittot wrote:
cpu_power_orig is only changed for SMT systems, in order to reflect the lower capacity of their CPUs. Heterogeneous systems also have to reflect an original capacity that is different from the default value.
There is no parameter 'cpu_power_orig' till your fourth patch right? Why is this term being used in this patch?
Besides, both parameters power and power_orig are changed for SMT systems to reflect the lower capacity of the CPUs. Why is there a mention of only power_orig?
IMO, the subject of the patch is not clearly reflecting the main intention of the patch. There is nothing done to change the way cpu_power is used; rather you are changing the way the cpu_power is being set to be flexible, thus allowing for the right power value to be set on heterogeneous systems.
'Allow archs to set the cpu_power instead of falling to default value' or something similar would be more appropriate. What do you think?
Regards Preeti U Murthy
Create a more generic function, arch_scale_cpu_power, that can also be used by non-SMT platforms to set cpu_power_orig.

The weak default of arch_scale_cpu_power keeps the previous SMT behavior, in order to preserve backward compatibility in the use of cpu_power_orig.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
 kernel/sched/fair.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e9bd0b..ed42061 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5559,6 +5559,20 @@ unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
 	return default_scale_smt_power(sd, cpu);
 }

+unsigned long __weak arch_scale_cpu_power(struct sched_domain *sd, int cpu)
+{
+	unsigned long weight = sd->span_weight;
+
+	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
+		if (sched_feat(ARCH_POWER))
+			return arch_scale_smt_power(sd, cpu);
+		else
+			return default_scale_smt_power(sd, cpu);
+	}
+
+	return SCHED_POWER_SCALE;
+}
+
 static unsigned long scale_rt_power(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -5590,18 +5604,12 @@ static unsigned long scale_rt_power(int cpu)
 static void update_cpu_power(struct sched_domain *sd, int cpu)
 {
-	unsigned long weight = sd->span_weight;
 	unsigned long power = SCHED_POWER_SCALE;
 	struct sched_group *sdg = sd->groups;

-	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
-		if (sched_feat(ARCH_POWER))
-			power *= arch_scale_smt_power(sd, cpu);
-		else
-			power *= default_scale_smt_power(sd, cpu);
+	power *= arch_scale_cpu_power(sd, cpu);

-		power >>= SCHED_POWER_SHIFT;
-	}
+	power >>= SCHED_POWER_SHIFT;

 	sdg->sgp->power_orig = power;
On 1 April 2014 12:39, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
On 03/28/2014 06:52 PM, Vincent Guittot wrote:
cpu_power_orig is only changed for SMT systems, in order to reflect the lower capacity of their CPUs. Heterogeneous systems also have to reflect an original capacity that is different from the default value.
There is no parameter 'cpu_power_orig' till your fourth patch right? Why is this term being used in this patch?
It looks like I have mixed up power_orig and cpu_power_orig in my commit messages.
Besides, both parameters power and power_orig are changed for SMT systems to reflect the lower capacity of the CPUs. Why is there a mention of only power_orig?
Only SMT systems are able to change power_orig from the default value, whereas all systems can already change the power field with the arch_scale_freq_power function. The goal of this patch is to change the name of the function that is used to set the power_orig value to a more generic one, and to extend the possibility of setting power_orig to any kind of system. The behavior of the power field is not changed by this patchset.
IMO, the subject of the patch is not clearly reflecting the main intention of the patch. There is nothing done to change the way cpu_power is used; rather you are changing the way the cpu_power is being set to be flexible, thus allowing for the right power value to be set on heterogeneous systems.
'Allow archs to set the cpu_power instead of falling to default value' or something similar would be more appropriate. What do you think?
I can change it to: "Allow all archs to set the power_orig"
Thanks
Regards Preeti U Murthy
Create a more generic function, arch_scale_cpu_power, that can also be used by non-SMT platforms to set cpu_power_orig.

The weak default of arch_scale_cpu_power keeps the previous SMT behavior, in order to preserve backward compatibility in the use of cpu_power_orig.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
 kernel/sched/fair.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e9bd0b..ed42061 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5559,6 +5559,20 @@ unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
 	return default_scale_smt_power(sd, cpu);
 }

+unsigned long __weak arch_scale_cpu_power(struct sched_domain *sd, int cpu)
+{
+	unsigned long weight = sd->span_weight;
+
+	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
+		if (sched_feat(ARCH_POWER))
+			return arch_scale_smt_power(sd, cpu);
+		else
+			return default_scale_smt_power(sd, cpu);
+	}
+
+	return SCHED_POWER_SCALE;
+}
+
 static unsigned long scale_rt_power(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -5590,18 +5604,12 @@ static unsigned long scale_rt_power(int cpu)
 static void update_cpu_power(struct sched_domain *sd, int cpu)
 {
-	unsigned long weight = sd->span_weight;
 	unsigned long power = SCHED_POWER_SCALE;
 	struct sched_group *sdg = sd->groups;

-	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
-		if (sched_feat(ARCH_POWER))
-			power *= arch_scale_smt_power(sd, cpu);
-		else
-			power *= default_scale_smt_power(sd, cpu);
+	power *= arch_scale_cpu_power(sd, cpu);

-		power >>= SCHED_POWER_SHIFT;
-	}
+	power >>= SCHED_POWER_SHIFT;

 	sdg->sgp->power_orig = power;
On 04/01/2014 04:50 PM, Vincent Guittot wrote:
On 1 April 2014 12:39, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
On 03/28/2014 06:52 PM, Vincent Guittot wrote:
cpu_power_orig is only changed for SMT systems, in order to reflect the lower capacity of their CPUs. Heterogeneous systems also have to reflect an original capacity that is different from the default value.
There is no parameter 'cpu_power_orig' till your fourth patch right? Why is this term being used in this patch?
It looks like I have mixed up power_orig and cpu_power_orig in my commit messages.
Besides, both parameters power and power_orig are changed for SMT systems to reflect the lower capacity of the CPUs. Why is there a mention of only power_orig?
Only SMT system is able to change power_orig from default value whereas all systems can already change power field with arch_scale_freq_power function. The goal of this patch is to change
I am unable to understand this. arch_scale_freq_power() is not doing the same job as arch_scale_smt_power() right? While the former allows the arch to set a power value per cpu different from the default value, the latter allows arch to adjust the power for hyper threads.
For example in big.LITTLE, the big cores would be required to have a higher cpu power than what is currently being used as default. This is done by arch_scale_freq_power(). Whereas for an SMT system, the power of a core could perhaps be the same as default value, but needs to be divided among the hyper threads. This is done by arch_scale_smt_power(). They have different purposes right?
So I do not understand why you mention "all systems can already change power field with arch_scale_freq_power()."
Actually I was assuming this patch would introduce arch_scale_smt_power() and arch_scale_freq_power(), two functions that would *enable all archs and not just ARCH_POWER* as is found today, to adjust their smt power and non-smt power respectively.
For the same reasons, I am also unclear as to why power_orig is not initialized *after* default/arch_scale_freq_power(). For big cores for instance you would want the power_orig to be set not to a default value but to what the arch says as the frequency of a core scaled over the default value.
the function name that is used to set power_orig value to a more generic one and to extend the possibility of setting power_orig for any kind of system. The behavior of the power field is not changed with this patchset.
IMO, the subject of the patch is not clearly reflecting the main intention of the patch. There is nothing done to change the way cpu_power is used; rather you are changing the way the cpu_power is being set to be flexible, thus allowing for the right power value to be set on heterogeneous systems.
'Allow archs to set the cpu_power instead of falling to default value' or something similar would be more appropriate. What do you think?
I can change with : "Allow all archs to set the power_orig"
Ok, but for the reasons mentioned above I was thinking the power_orig should be avoided since archs should be able to scale both power and power_orig per thread and not just ARCH_POWER.
Regards Preeti U Murthy
Thanks
Regards Preeti U Murthy
Use the new arch_scale_cpu_power in order to reflect the original capacity of a CPU, instead of arch_scale_freq_power, which is more closely linked to scaling the capacity with respect to frequency.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/arm/kernel/topology.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 0bc94b1..55a7d90 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
  */
 static DEFINE_PER_CPU(unsigned long, cpu_scale);

-unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
+unsigned long arch_scale_cpu_power(struct sched_domain *sd, int cpu)
 {
 	return per_cpu(cpu_scale, cpu);
 }
@@ -166,7 +166,7 @@ static void update_cpu_power(unsigned int cpu)
 	set_power_scale(cpu, cpu_capacity(cpu) / middle_capacity);

 	printk(KERN_INFO "CPU%u: update cpu_power %lu\n",
-	       cpu, arch_scale_freq_power(NULL, cpu));
+	       cpu, arch_scale_cpu_power(NULL, cpu));
 }

 #else
The current sg_capacity solves the ghost-cores issue for SMT systems and for clusters made of big cores, which have a cpu_power above SCHED_POWER_SCALE at the core level. But it still removes some real cores from a cluster made of LITTLE cores, which have a cpu_power below SCHED_POWER_SCALE.
Instead of using power_orig to detect an SMT system and compute an smt factor that is then used to calculate the real number of cores, we set a core_fct field when building the sched_domain topology. We can detect an SMT system thanks to the SD_SHARE_CPUPOWER flag, and set core_fct to the number of CPUs per core. core_fct ensures that sg_capacity will return the cores' capacity of an SMT system, and will not remove any real core of a LITTLE cluster.
This method also fixes a use case where the capacity of an SMT system was overrated. Let's take the example of a 16-CPU system made of 8 HT cores. At the CPU level, sg_capacity is capped to a maximum capacity of 8, whereas DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE) returns 9: ((589*16) / 1024) = 9.2. Now if 2 CPUs (1 core) are fully loaded by rt tasks, sg_capacity still returns a capacity of 8 whereas it should return a capacity of 7. This happens because DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE) is still above 7.5: ((589*14) / 1024) = 8.05.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  | 7 +++++++
 kernel/sched/fair.c  | 6 ++----
 kernel/sched/sched.h | 2 +-
 3 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9d9776..5b20b27 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5844,6 +5844,13 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)

 	WARN_ON(!sg);

+	if (!sd->child)
+		sg->core_fct = 1;
+	else if (sd->child->flags & SD_SHARE_CPUPOWER)
+		sg->core_fct = cpumask_weight(sched_group_cpus(sg));
+	else
+		sg->core_fct = sd->child->groups->core_fct;
+
 	do {
 		sg->group_weight = cpumask_weight(sched_group_cpus(sg));
 		sg = sg->next;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ed42061..7387c05 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5773,12 +5773,10 @@ static inline int sg_capacity(struct lb_env *env, struct sched_group *group)
 	power = group->sgp->power;
 	power_orig = group->sgp->power_orig;
 	cpus = group->group_weight;
+	smt = group->core_fct;

-	/* smt := ceil(cpus / power), assumes: 1 < smt_power < 2 */
-	smt = DIV_ROUND_UP(SCHED_POWER_SCALE * cpus, power_orig);
-	capacity = cpus / smt; /* cores */
+	capacity = DIV_ROUND_CLOSEST(power * cpus, power_orig * smt);

-	capacity = min_t(unsigned, capacity, DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE));
 	if (!capacity)
 		capacity = fix_small_capacity(env->sd, group);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c9007f2..46c3784 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -759,7 +759,7 @@ struct sched_group {
 	struct sched_group *next;	/* Must be a circular list */
 	atomic_t ref;

-	unsigned int group_weight;
+	unsigned int group_weight, core_fct;
 	struct sched_group_power *sgp;

 	/*
On Fri, Mar 28, 2014 at 02:22:28PM +0100, Vincent Guittot wrote:
The current sg_capacity solves the ghost-cores issue for SMT systems and for clusters made of big cores, which have a cpu_power above SCHED_POWER_SCALE at the core level. But it still removes some real cores from a cluster made of LITTLE cores, which have a cpu_power below SCHED_POWER_SCALE.

Instead of using power_orig to detect an SMT system and compute an smt factor that is then used to calculate the real number of cores, we set a core_fct field when building the sched_domain topology. We can detect an SMT system thanks to the SD_SHARE_CPUPOWER flag, and set core_fct to the number of CPUs per core. core_fct ensures that sg_capacity will return the cores' capacity of an SMT system, and will not remove any real core of a LITTLE cluster.

This method also fixes a use case where the capacity of an SMT system was overrated. Let's take the example of a 16-CPU system made of 8 HT cores. At the CPU level, sg_capacity is capped to a maximum capacity of 8, whereas DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE) returns 9: ((589*16) / 1024) = 9.2. Now if 2 CPUs (1 core) are fully loaded by rt tasks, sg_capacity still returns a capacity of 8 whereas it should return a capacity of 7. This happens because DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE) is still above 7.5: ((589*14) / 1024) = 8.05.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
 kernel/sched/core.c  | 7 +++++++
 kernel/sched/fair.c  | 6 ++----
 kernel/sched/sched.h | 2 +-
 3 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9d9776..5b20b27 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5844,6 +5844,13 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)

 	WARN_ON(!sg);

+	if (!sd->child)
+		sg->core_fct = 1;
+	else if (sd->child->flags & SD_SHARE_CPUPOWER)
+		sg->core_fct = cpumask_weight(sched_group_cpus(sg));
+	else
+		sg->core_fct = sd->child->groups->core_fct;
+
 	do {
 		sg->group_weight = cpumask_weight(sched_group_cpus(sg));
 		sg = sg->next;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ed42061..7387c05 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5773,12 +5773,10 @@ static inline int sg_capacity(struct lb_env *env, struct sched_group *group)
 	power = group->sgp->power;
 	power_orig = group->sgp->power_orig;
 	cpus = group->group_weight;
+	smt = group->core_fct;

-	/* smt := ceil(cpus / power), assumes: 1 < smt_power < 2 */
-	smt = DIV_ROUND_UP(SCHED_POWER_SCALE * cpus, power_orig);
-	capacity = cpus / smt; /* cores */
+	capacity = DIV_ROUND_CLOSEST(power * cpus, power_orig * smt);

-	capacity = min_t(unsigned, capacity, DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE));
 	if (!capacity)
 		capacity = fix_small_capacity(env->sd, group);
So this patch only cures a little problem; the much bigger problem is that capacity as exists is completely wrong.
We really should be using utilization here. Until a CPU is fully utilized we shouldn't be moving tasks around (unless packing, but we're not there yet, and in that case you want to stop where this starts, namely full utilization).
So while I appreciate what you're trying to 'fix' here, it's really just trying to dress a monkey.
On 15 April 2014 19:22, Peter Zijlstra peterz@infradead.org wrote:
On Fri, Mar 28, 2014 at 02:22:28PM +0100, Vincent Guittot wrote:
The current sg_capacity solves the ghost-cores issue for SMT systems and for clusters made of big cores, which have a cpu_power above SCHED_POWER_SCALE at the core level. But it still removes some real cores from a cluster made of LITTLE cores, which have a cpu_power below SCHED_POWER_SCALE.

Instead of using power_orig to detect an SMT system and compute an smt factor that is then used to calculate the real number of cores, we set a core_fct field when building the sched_domain topology. We can detect an SMT system thanks to the SD_SHARE_CPUPOWER flag, and set core_fct to the number of CPUs per core. core_fct ensures that sg_capacity will return the cores' capacity of an SMT system, and will not remove any real core of a LITTLE cluster.

This method also fixes a use case where the capacity of an SMT system was overrated. Let's take the example of a 16-CPU system made of 8 HT cores. At the CPU level, sg_capacity is capped to a maximum capacity of 8, whereas DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE) returns 9: ((589*16) / 1024) = 9.2. Now if 2 CPUs (1 core) are fully loaded by rt tasks, sg_capacity still returns a capacity of 8 whereas it should return a capacity of 7. This happens because DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE) is still above 7.5: ((589*14) / 1024) = 8.05.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
 kernel/sched/core.c  | 7 +++++++
 kernel/sched/fair.c  | 6 ++----
 kernel/sched/sched.h | 2 +-
 3 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9d9776..5b20b27 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5844,6 +5844,13 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)

 	WARN_ON(!sg);

+	if (!sd->child)
+		sg->core_fct = 1;
+	else if (sd->child->flags & SD_SHARE_CPUPOWER)
+		sg->core_fct = cpumask_weight(sched_group_cpus(sg));
+	else
+		sg->core_fct = sd->child->groups->core_fct;
+
 	do {
 		sg->group_weight = cpumask_weight(sched_group_cpus(sg));
 		sg = sg->next;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ed42061..7387c05 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5773,12 +5773,10 @@ static inline int sg_capacity(struct lb_env *env, struct sched_group *group)
 	power = group->sgp->power;
 	power_orig = group->sgp->power_orig;
 	cpus = group->group_weight;
+	smt = group->core_fct;

-	/* smt := ceil(cpus / power), assumes: 1 < smt_power < 2 */
-	smt = DIV_ROUND_UP(SCHED_POWER_SCALE * cpus, power_orig);
-	capacity = cpus / smt; /* cores */
+	capacity = DIV_ROUND_CLOSEST(power * cpus, power_orig * smt);

-	capacity = min_t(unsigned, capacity, DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE));
 	if (!capacity)
 		capacity = fix_small_capacity(env->sd, group);
So this patch only cures a little problem; the much bigger problem is that capacity as exists is completely wrong.
We really should be using utilization here. Until a CPU is fully
ok, I'm going to see how to replace capacity with utilization
Thanks
utilized we shouldn't be moving tasks around (unless packing, but we're not there yet, and in that case you want to stop where this starts, namely full utilization).
So while I appreciate what you're trying to 'fix' here, it's really just trying to dress a monkey.
On Fri, 28 Mar 2014, Vincent Guittot wrote:
Part of this patchset was previously part of the larger tasks packing patchset [1]. I have split the latter into (at least) 3 different patchsets to make things easier to review:
-configuration of sched_domain topology [2]
-update and consolidation of cpu_power (this patchset)
-tasks packing algorithm
SMT systems are no longer the only ones whose CPUs can have a capacity different from the default value. We need to extend the use of cpu_power_orig to all kinds of platforms so that the scheduler has both the maximum capacity (cpu_power_orig/power_orig) and the current capacity (cpu_power/power) of the CPUs and groups of the sched_domains.
[...]
Wouldn't it make sense to have a patch in this series that performs some s/power/capacity/ throughout the scheduler code before the other patches are applied? That would help get consistent terminology and avoid confusion before real power-related metrics are eventually brought into the scheduler.
Nicolas
On 29 March 2014 03:27, Nicolas Pitre nicolas.pitre@linaro.org wrote:
On Fri, 28 Mar 2014, Vincent Guittot wrote:
Part of this patchset was previously part of the larger tasks packing patchset [1]. I have split the latter into (at least) 3 different patchsets to make things easier to review:
-configuration of sched_domain topology [2]
-update and consolidation of cpu_power (this patchset)
-tasks packing algorithm

SMT systems are no longer the only ones whose CPUs can have a capacity different from the default value. We need to extend the use of cpu_power_orig to all kinds of platforms so that the scheduler has both the maximum capacity (cpu_power_orig/power_orig) and the current capacity (cpu_power/power) of the CPUs and groups of the sched_domains.
[...]
Wouldn't it make sense to have a patch in this series that performs some s/power/capacity/ throughout the scheduler code before the other patches are applied? That would help get consistent terminology and avoid confusion before real power-related metrics are eventually brought into the scheduler.
Fair point. I will add the changes in the next version.
Thanks Vincent
Nicolas