HMP patches v2

List overview All Threads
Download

newer

older

booting an exynos

int64_t definition conflict on...

Liviu Dudau

16 Nov 2012 16 Nov '12

6:32 p.m.

Hi Viresh,

Here are the latest patches for HMP tunables to be included in the MP branch for the 12.11 release. They depend on Vincent Guittot's patches that you have on -exp-v12 branch which we need included in the PULL request. Testing shows that they improve power consumption for HMP.

Thanks, Liviu

Show replies by date

Liviu Dudau

16 Nov 16 Nov

6:32 p.m.

New subject: [HMP tunables v2][PATCH 1/7] Revert "sched: secure access to other CPU statistics"

This reverts commit 2aa14d0379cc54bc0ec44adb7a2e0ad02ae293d0.

The way this functionality is implemented is under review and the current implementation is considered not safe.

Signed-of-by: Liviu Dudau Liviu.Dudau@arm.com --- kernel/sched/fair.c | 19 ++----------------- 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 12f7d29..0ee9834 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3382,28 +3382,13 @@ static struct attribute_group hmp_attr_group = { static inline bool is_buddy_busy(int cpu) { struct rq *rq = cpu_rq(cpu); - volatile u32 *psum = &rq->avg.runnable_avg_sum; - volatile u32 *pperiod = &rq->avg.runnable_avg_period; - u32 sum, new_sum, period, new_period; - int timeout = 10; - - while (timeout) { - sum = *psum; - period = *pperiod; - new_sum = *psum; - new_period = *pperiod; - - if ((sum == new_sum) && (period == new_period)) - break; - - timeout--; - }

/* * A busy buddy is a CPU with a high load or a small load with a lot of * running tasks. */ - return ((new_sum << rq->nr_running) > new_period); + return ((rq->avg.usage_avg_sum << rq->nr_running) > + rq->avg.runnable_avg_period); }

static inline bool is_light_task(struct task_struct *p)

-- 1.7.9.5

Liviu Dudau

6:32 p.m.

New subject: [HMP tunables v2][PATCH 2/7] sched: SD_SHARE_POWERLINE buddy selection fix

From: Morten Rasmussen morten.rasmussen@arm.com

Fixes update_packing_domain() to behave better for topologies where SD_SHARE_POWERLINE is disabled at highest sched domain level.

Signed-of-by: Morten Rasmussen morten.rasmussen@arm.com --- kernel/sched/fair.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0ee9834..d758086 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -184,7 +184,8 @@ void update_packing_domain(int cpu) if (!sd) sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); else - sd = sd->parent; + if (cpumask_first(sched_domain_span(sd)) == cpu || !sd->parent) + sd = sd->parent;

while (sd) { struct sched_group *sg = sd->groups; @@ -195,6 +196,18 @@ void update_packing_domain(int cpu) if (id == -1) id = cpumask_first(sched_domain_span(sd));

+ /* Find sched group of candidate */ + tmp = sd->groups; + do { + if (cpumask_test_cpu(id, sched_group_cpus(tmp))) { + sg = tmp; + break; + } + } while (tmp = tmp->next, tmp != sd->groups); + + pack = sg; + tmp = sg->next; + /* loop the sched groups to find the best one */ while (tmp != sg) { if (tmp->sgp->power * sg->group_weight <

-- 1.7.9.5

Vincent Guittot

19 Nov 19 Nov

9:03 a.m.

New subject: [HMP tunables v2][PATCH 2/7] sched: SD_SHARE_POWERLINE buddy selection fix

Hi,

I would prefer that you use the branch in the git tree below instead which is the final correction http://git.linaro.org/gitweb?p=people/vingu/kernel.git%3Ba=shortlog%3Bh=refs...

Regards Vincent

On 16 November 2012 19:32, Liviu Dudau Liviu.Dudau@arm.com wrote:

...

From: Morten Rasmussen morten.rasmussen@arm.com

Fixes update_packing_domain() to behave better for topologies where SD_SHARE_POWERLINE is disabled at highest sched domain level.

Signed-of-by: Morten Rasmussen morten.rasmussen@arm.com

kernel/sched/fair.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0ee9834..d758086 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -184,7 +184,8 @@ void update_packing_domain(int cpu) if (!sd) sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); else
          sd = sd->parent;
          if (cpumask_first(sched_domain_span(sd)) == cpu || !sd->parent)
                  sd = sd->parent;

  while (sd) {
          struct sched_group *sg = sd->groups;
@@ -195,6 +196,18 @@ void update_packing_domain(int cpu) if (id == -1) id = cpumask_first(sched_domain_span(sd));
          /* Find sched group of candidate */
          tmp = sd->groups;
          do {
                  if (cpumask_test_cpu(id, sched_group_cpus(tmp))) {
                          sg = tmp;
                          break;
                  }
          } while (tmp = tmp->next, tmp != sd->groups);
          pack = sg;
          tmp = sg->next;
          /* loop the sched groups to find the best one */
          while (tmp != sg) {
                  if (tmp->sgp->power * sg->group_weight <
-- 1.7.9.5

linaro-dev mailing list linaro-dev@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-dev

Viresh Kumar

9:10 a.m.

New subject: [HMP tunables v2][PATCH 2/7] sched: SD_SHARE_POWERLINE buddy selection fix

On 19 November 2012 14:33, Vincent Guittot vincent.guittot@linaro.org wrote:

...

I would prefer that you use the branch in the git tree below instead which is the final correction http://git.linaro.org/gitweb?p=people/vingu/kernel.git%3Ba=shortlog%3Bh=refs...

Hi Vingu,

I have applied 3 patches on top of your branch in my current PULL request. Can you please check them and let me know which ones should i keep out of 5 patches (2 from you, 3 from ARM) ?

-- viresh

Vincent Guittot

9:34 a.m.

New subject: [HMP tunables v2][PATCH 2/7] sched: SD_SHARE_POWERLINE buddy selection fix

On 19 November 2012 10:10, Viresh Kumar viresh.kumar@linaro.org wrote:

...

On 19 November 2012 14:33, Vincent Guittot vincent.guittot@linaro.org wrote:

...
I would prefer that you use the branch in the git tree below instead which is the final correction http://git.linaro.org/gitweb?p=people/vingu/kernel.git%3Ba=shortlog%3Bh=refs...

Hi Vingu,

I have applied 3 patches on top of your branch in my current PULL request. Can you please check them and let me know which ones should i keep out of 5 patches (2 from you, 3 from ARM) ?

sched: pack small tasks: fix update packing domain sched-pack-small-task-v1-fixed - fix the buddy selection loop. it will be part of the next version sched: pack small tasks: fix printk formating - fix a display issue that has been reported by Tixy but it's not related to current discussion. it will be part of the next version ARM: TC2: Re-enable SD_SHARE_POWERLINE - depend of which configuration we want for TC2 sched: SD_SHARE_POWERLINE buddy selection fix - Remove it and use the fix above Revert "sched: secure access to other CPU statistics" - this patch removes a patch that doesn't fulfill its goal but doesn't introduce any regression so reverting it doesn't change anything in the behaviour. You should keep the original patch and remove the revert as long as there is no regression

Regards, Vincent

...

-- viresh

Liviu Dudau

16 Nov 16 Nov

6:32 p.m.

New subject: [HMP tunables v2][PATCH 3/7] ARM: TC2: Re-enable SD_SHARE_POWERLINE

From: Morten Rasmussen morten.rasmussen@arm.com

Re-enable SD_SHARE_POWERLINE to reflect the power domains of TC2. --- arch/arm/kernel/topology.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 317dac6..4d34e0e 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -228,7 +228,7 @@ struct cputopo_arm cpu_topology[NR_CPUS];

int arch_sd_share_power_line(void) { - return 0*SD_SHARE_POWERLINE; + return 1*SD_SHARE_POWERLINE; }

const struct cpumask *cpu_coregroup_mask(int cpu)

-- 1.7.9.5

Vincent Guittot

19 Nov 19 Nov

9:20 a.m.

New subject: [HMP tunables v2][PATCH 3/7] ARM: TC2: Re-enable SD_SHARE_POWERLINE

Hi,

On 16 November 2012 19:32, Liviu Dudau Liviu.Dudau@arm.com wrote:

...

From: Morten Rasmussen morten.rasmussen@arm.com

Re-enable SD_SHARE_POWERLINE to reflect the power domains of TC2.

arch/arm/kernel/topology.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 317dac6..4d34e0e 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -228,7 +228,7 @@ struct cputopo_arm cpu_topology[NR_CPUS];

int arch_sd_share_power_line(void) {
  return 0*SD_SHARE_POWERLINE;
  return 1*SD_SHARE_POWERLINE;

I'm not sure to catch your goal. With this modification, the power line (or power domain) is shared at all level which should disable the packing mechanism. But in a previous patch you fix the update packing loop so I assume that you want to use it. Which kind of configuration you would like to have among the proposal below ?

cpu : CPU0 | CPU1 | CPU2 | CPU3 | CPU4 buddy conf 1 : CPU2 | CPU0 | CPU2 | CPU2 | CPU2 buddy conf 2 : CPU2 | CPU2 | CPU2 | CPU2 | CPU2 buddy conf 3 : -1 | -1 | -1 | -1 | -1

When we look at the git://git.linaro.org/arm/big.LITTLE/mp.git big-LITTLE-MP-master-v12, we can see that you have defined a custom sched_domain which hasn't been updated with SD_SHARE_POWERLINE flag so the flag is cleared at CPU level. Based on this, I would say that you want buddy conf 2 ? but I would say that buddy conf 1 should give better result. Have you tried both ?

Regards, Vincent

...

}

const struct cpumask *cpu_coregroup_mask(int cpu)

1.7.9.5

linaro-dev mailing list linaro-dev@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-dev

Morten Rasmussen

12:08 p.m.

New subject: [HMP tunables v2][PATCH 3/7] ARM: TC2: Re-enable SD_SHARE_POWERLINE

Hi Vincent,

On 19/11/12 09:20, Vincent Guittot wrote:

...

Hi,

On 16 November 2012 19:32, Liviu Dudau Liviu.Dudau@arm.com wrote:

...
From: Morten Rasmussen morten.rasmussen@arm.com

Re-enable SD_SHARE_POWERLINE to reflect the power domains of TC2.

arch/arm/kernel/topology.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 317dac6..4d34e0e 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -228,7 +228,7 @@ struct cputopo_arm cpu_topology[NR_CPUS];

int arch_sd_share_power_line(void) {
  return 0*SD_SHARE_POWERLINE;
  return 1*SD_SHARE_POWERLINE;
I'm not sure to catch your goal. With this modification, the power line (or power domain) is shared at all level which should disable the packing mechanism. But in a previous patch you fix the update packing loop so I assume that you want to use it. Which kind of configuration you would like to have among the proposal below ?

cpu : CPU0 | CPU1 | CPU2 | CPU3 | CPU4 buddy conf 1 : CPU2 | CPU0 | CPU2 | CPU2 | CPU2 buddy conf 2 : CPU2 | CPU2 | CPU2 | CPU2 | CPU2 buddy conf 3 : -1 | -1 | -1 | -1 | -1

When we look at the git://git.linaro.org/arm/big.LITTLE/mp.git big-LITTLE-MP-master-v12, we can see that you have defined a custom sched_domain which hasn't been updated with SD_SHARE_POWERLINE flag so the flag is cleared at CPU level. Based on this, I would say that you want buddy conf 2 ? but I would say that buddy conf 1 should give better result. Have you tried both ?

My goal with this fix is to set up the SD_SHARE_POWERLINE flags as they really are on TC2. It could have been done more elegantly. Since the HMP patches overrides the sched_domain flags at CPU level the SD_SHARE_POWERLINE is not being set by arch_sd_share_power_line(). With this fix we will get SD_SHARE_POWERLINE at MC level and no SD_SHARE_POWERLINE at CPU level, which I believe is the correct set up for TC2.

For the buddy configuration the goal is to get configuration 1 in your list above. You should get that when using the other patch to fix the buddy selection algorithm. I'm not sure if conf 1 or 2 is best. I think it depends on the power/performance trade-off of the specific platform. conf 1 may lead to CPU1->CPU0->CPU2 migrations which may be undesirable. If your cpus are very leaky it might make sense to not do packing at all inside a high performance cluster and always do packing directly on a another low power cluster like conf 2. I think this needs further investigation.

I have only tested with conf 1 on TC2.

Regards, Morten

...

Regards, Vincent

...
}

const struct cpumask *cpu_coregroup_mask(int cpu)

1.7.9.5

linaro-dev mailing list linaro-dev@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-dev

Vincent Guittot

12:23 p.m.

New subject: [HMP tunables v2][PATCH 3/7] ARM: TC2: Re-enable SD_SHARE_POWERLINE

On 19 November 2012 13:08, Morten Rasmussen morten.rasmussen@arm.com wrote:

...

Hi Vincent,

On 19/11/12 09:20, Vincent Guittot wrote:

...
Hi,

On 16 November 2012 19:32, Liviu Dudau Liviu.Dudau@arm.com wrote:

...
From: Morten Rasmussen morten.rasmussen@arm.com

Re-enable SD_SHARE_POWERLINE to reflect the power domains of TC2.

arch/arm/kernel/topology.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 317dac6..4d34e0e 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -228,7 +228,7 @@ struct cputopo_arm cpu_topology[NR_CPUS];

int arch_sd_share_power_line(void) {
  return 0*SD_SHARE_POWERLINE;
  return 1*SD_SHARE_POWERLINE;
I'm not sure to catch your goal. With this modification, the power line (or power domain) is shared at all level which should disable the packing mechanism. But in a previous patch you fix the update packing loop so I assume that you want to use it. Which kind of configuration you would like to have among the proposal below ?

cpu : CPU0 | CPU1 | CPU2 | CPU3 | CPU4 buddy conf 1 : CPU2 | CPU0 | CPU2 | CPU2 | CPU2 buddy conf 2 : CPU2 | CPU2 | CPU2 | CPU2 | CPU2 buddy conf 3 : -1 | -1 | -1 | -1 | -1

When we look at the git://git.linaro.org/arm/big.LITTLE/mp.git big-LITTLE-MP-master-v12, we can see that you have defined a custom sched_domain which hasn't been updated with SD_SHARE_POWERLINE flag so the flag is cleared at CPU level. Based on this, I would say that you want buddy conf 2 ? but I would say that buddy conf 1 should give better result. Have you tried both ?
My goal with this fix is to set up the SD_SHARE_POWERLINE flags as they really are on TC2. It could have been done more elegantly. Since the HMP patches overrides the sched_domain flags at CPU level the SD_SHARE_POWERLINE is not being set by arch_sd_share_power_line(). With this fix we will get SD_SHARE_POWERLINE at MC level and no SD_SHARE_POWERLINE at CPU level, which I believe is the correct set up for TC2.

For the buddy configuration the goal is to get configuration 1 in your list above. You should get that when using the other patch to fix the buddy selection algorithm. I'm not sure if conf 1 or 2 is best. I think it depends on the power/performance trade-off of the specific platform. conf 1 may lead to CPU1->CPU0->CPU2 migrations which may be undesirable. If your cpus are very leaky it might make sense to not do packing at all inside a high performance cluster and always do packing directly on a another low power cluster like conf 2. I think this needs further investigation.

I have only tested with conf 1 on TC2.

Hi Morten,

Conf1 is the default configuration for ARM platform because SD_SHARE_POWERLINE is cleared at all levels for this architecture.

Conf2 should be used if you can't powergate the core independently but several tests have demonstrated that even if you can't powergate each core independently, it worth packing small task on few CPUs in a core so it's worth using conf1 on TC2 as well.

Based on your explanation, we should use the original configuration of SD_SHARE_POWERLINE (cleared at all level for ARM platform)

Regards Vincent

...

Regards, Morten

...
Regards, Vincent

...
}

const struct cpumask *cpu_coregroup_mask(int cpu)

1.7.9.5

linaro-dev mailing list linaro-dev@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-dev

Morten Rasmussen

1:36 p.m.

New subject: [HMP tunables v2][PATCH 3/7] ARM: TC2: Re-enable SD_SHARE_POWERLINE

On 19/11/12 12:23, Vincent Guittot wrote:

...

On 19 November 2012 13:08, Morten Rasmussen morten.rasmussen@arm.com wrote:

...
Hi Vincent,

On 19/11/12 09:20, Vincent Guittot wrote:

...
Hi,

On 16 November 2012 19:32, Liviu Dudau Liviu.Dudau@arm.com wrote:

...
From: Morten Rasmussen morten.rasmussen@arm.com

Re-enable SD_SHARE_POWERLINE to reflect the power domains of TC2.

arch/arm/kernel/topology.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 317dac6..4d34e0e 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -228,7 +228,7 @@ struct cputopo_arm cpu_topology[NR_CPUS];

int arch_sd_share_power_line(void) {
  return 0*SD_SHARE_POWERLINE;
  return 1*SD_SHARE_POWERLINE;
I'm not sure to catch your goal. With this modification, the power line (or power domain) is shared at all level which should disable the packing mechanism. But in a previous patch you fix the update packing loop so I assume that you want to use it. Which kind of configuration you would like to have among the proposal below ?

cpu : CPU0 | CPU1 | CPU2 | CPU3 | CPU4 buddy conf 1 : CPU2 | CPU0 | CPU2 | CPU2 | CPU2 buddy conf 2 : CPU2 | CPU2 | CPU2 | CPU2 | CPU2 buddy conf 3 : -1 | -1 | -1 | -1 | -1

When we look at the git://git.linaro.org/arm/big.LITTLE/mp.git big-LITTLE-MP-master-v12, we can see that you have defined a custom sched_domain which hasn't been updated with SD_SHARE_POWERLINE flag so the flag is cleared at CPU level. Based on this, I would say that you want buddy conf 2 ? but I would say that buddy conf 1 should give better result. Have you tried both ?
My goal with this fix is to set up the SD_SHARE_POWERLINE flags as they really are on TC2. It could have been done more elegantly. Since the HMP patches overrides the sched_domain flags at CPU level the SD_SHARE_POWERLINE is not being set by arch_sd_share_power_line(). With this fix we will get SD_SHARE_POWERLINE at MC level and no SD_SHARE_POWERLINE at CPU level, which I believe is the correct set up for TC2.

For the buddy configuration the goal is to get configuration 1 in your list above. You should get that when using the other patch to fix the buddy selection algorithm. I'm not sure if conf 1 or 2 is best. I think it depends on the power/performance trade-off of the specific platform. conf 1 may lead to CPU1->CPU0->CPU2 migrations which may be undesirable. If your cpus are very leaky it might make sense to not do packing at all inside a high performance cluster and always do packing directly on a another low power cluster like conf 2. I think this needs further investigation.

I have only tested with conf 1 on TC2.
Hi Morten,

Conf1 is the default configuration for ARM platform because SD_SHARE_POWERLINE is cleared at all levels for this architecture.

Conf2 should be used if you can't powergate the core independently but several tests have demonstrated that even if you can't powergate each core independently, it worth packing small task on few CPUs in a core so it's worth using conf1 on TC2 as well.

Based on your explanation, we should use the original configuration of SD_SHARE_POWERLINE (cleared at all level for ARM platform)

I agree that the result is the same, but I don't like disabling SD_SHARE_POWERLINE for all level when the cpus in each cluster actually are in the same power domain as it is the case on TC2. The name SHARE_POWERLINE implies a clear relation to the actual hardware design, thus setting the flags differently than the actual hardware design is misleading in my opinion. If the buddy selection algorithm doesn't select appropriate buddies when flags are set to reflect the actual hardware design I would suggest changing the buddy selection algorithm instead of changing the sched_domain flags.

If it is chosen to not have a direct relation between the flags and the hardware design, I think that the flag should be renamed so it doesn't give the wrong impression.

Morten

...

Regards Vincent

...
Regards, Morten

...
Regards, Vincent

...
}

const struct cpumask *cpu_coregroup_mask(int cpu)

1.7.9.5

linaro-dev mailing list linaro-dev@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-dev

Vincent Guittot

2:09 p.m.

New subject: [HMP tunables v2][PATCH 3/7] ARM: TC2: Re-enable SD_SHARE_POWERLINE

On 19 November 2012 14:36, Morten Rasmussen morten.rasmussen@arm.com wrote:

...

On 19/11/12 12:23, Vincent Guittot wrote:

...
On 19 November 2012 13:08, Morten Rasmussen morten.rasmussen@arm.com wrote:

...
Hi Vincent,

On 19/11/12 09:20, Vincent Guittot wrote:

...
Hi,

On 16 November 2012 19:32, Liviu Dudau Liviu.Dudau@arm.com wrote:

...
From: Morten Rasmussen morten.rasmussen@arm.com

Re-enable SD_SHARE_POWERLINE to reflect the power domains of TC2.

arch/arm/kernel/topology.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 317dac6..4d34e0e 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -228,7 +228,7 @@ struct cputopo_arm cpu_topology[NR_CPUS];

int arch_sd_share_power_line(void) {
  return 0*SD_SHARE_POWERLINE;
  return 1*SD_SHARE_POWERLINE;
I'm not sure to catch your goal. With this modification, the power line (or power domain) is shared at all level which should disable the packing mechanism. But in a previous patch you fix the update packing loop so I assume that you want to use it. Which kind of configuration you would like to have among the proposal below ?

cpu : CPU0 | CPU1 | CPU2 | CPU3 | CPU4 buddy conf 1 : CPU2 | CPU0 | CPU2 | CPU2 | CPU2 buddy conf 2 : CPU2 | CPU2 | CPU2 | CPU2 | CPU2 buddy conf 3 : -1 | -1 | -1 | -1 | -1

When we look at the git://git.linaro.org/arm/big.LITTLE/mp.git big-LITTLE-MP-master-v12, we can see that you have defined a custom sched_domain which hasn't been updated with SD_SHARE_POWERLINE flag so the flag is cleared at CPU level. Based on this, I would say that you want buddy conf 2 ? but I would say that buddy conf 1 should give better result. Have you tried both ?
My goal with this fix is to set up the SD_SHARE_POWERLINE flags as they really are on TC2. It could have been done more elegantly. Since the HMP patches overrides the sched_domain flags at CPU level the SD_SHARE_POWERLINE is not being set by arch_sd_share_power_line(). With this fix we will get SD_SHARE_POWERLINE at MC level and no SD_SHARE_POWERLINE at CPU level, which I believe is the correct set up for TC2.

For the buddy configuration the goal is to get configuration 1 in your list above. You should get that when using the other patch to fix the buddy selection algorithm. I'm not sure if conf 1 or 2 is best. I think it depends on the power/performance trade-off of the specific platform. conf 1 may lead to CPU1->CPU0->CPU2 migrations which may be undesirable. If your cpus are very leaky it might make sense to not do packing at all inside a high performance cluster and always do packing directly on a another low power cluster like conf 2. I think this needs further investigation.

I have only tested with conf 1 on TC2.
Hi Morten,

Conf1 is the default configuration for ARM platform because SD_SHARE_POWERLINE is cleared at all levels for this architecture.

Conf2 should be used if you can't powergate the core independently but several tests have demonstrated that even if you can't powergate each core independently, it worth packing small task on few CPUs in a core so it's worth using conf1 on TC2 as well.

Based on your explanation, we should use the original configuration of SD_SHARE_POWERLINE (cleared at all level for ARM platform)
I agree that the result is the same, but I don't like disabling SD_SHARE_POWERLINE for all level when the cpus in each cluster actually are in the same power domain as it is the case on TC2. The name SHARE_POWERLINE implies a clear relation to the actual hardware design, thus setting the flags differently than the actual hardware design is misleading in my opinion. If the buddy selection algorithm doesn't select appropriate buddies when flags are set to reflect the actual hardware design I would suggest changing the buddy selection algorithm instead of changing the sched_domain flags.

If it is chosen to not have a direct relation between the flags and the hardware design, I think that the flag should be renamed so it doesn't give the wrong impression.

There is a direct link between the powergating and the SHARE_POWERLINE and if you want that the buddy selection strictly reflects your HW configuration, you must use conf2 and not conf1. Now, beside the packing small task patch and the TC2 configuration, it has been proven that packing small tasks on an ARM platform (dual cortex-A9) which can only powergate the cluster, improves the power consumption of some low cpu load use cases like the MP3 playback (we had used cpu hotplug at that time). This assumption has been proven only for ARM platform and that's why the SHARE_POWERLINE is cleared at all level for ARM platform even if you can only do some WFI inside the cluster.

But i agree that if you want to strictly reflect the TC2 HW config, you should use the conf2 and set the SHARE_POWERLINE at MC level for the TC2 HMP configuration

Regards, Vincent

...

Morten

...
Regards Vincent

...
Regards, Morten

...
Regards, Vincent

...
}

const struct cpumask *cpu_coregroup_mask(int cpu)

1.7.9.5

linaro-dev mailing list linaro-dev@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-dev

Morten Rasmussen

2:27 p.m.

New subject: [HMP tunables v2][PATCH 3/7] ARM: TC2: Re-enable SD_SHARE_POWERLINE

On 19/11/12 14:09, Vincent Guittot wrote:

...

On 19 November 2012 14:36, Morten Rasmussen morten.rasmussen@arm.com wrote:

...
On 19/11/12 12:23, Vincent Guittot wrote:

...
On 19 November 2012 13:08, Morten Rasmussen morten.rasmussen@arm.com wrote:

...
Hi Vincent,

On 19/11/12 09:20, Vincent Guittot wrote:

...
Hi,

On 16 November 2012 19:32, Liviu Dudau Liviu.Dudau@arm.com wrote:

...
From: Morten Rasmussen morten.rasmussen@arm.com

Re-enable SD_SHARE_POWERLINE to reflect the power domains of TC2.
arch/arm/kernel/topology.c |    2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 317dac6..4d34e0e 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -228,7 +228,7 @@ struct cputopo_arm cpu_topology[NR_CPUS];
int arch_sd_share_power_line(void)
{
  return 0*SD_SHARE_POWERLINE;
  return 1*SD_SHARE_POWERLINE;
I'm not sure to catch your goal. With this modification, the power line (or power domain) is shared at all level which should disable the packing mechanism. But in a previous patch you fix the update packing loop so I assume that you want to use it. Which kind of configuration you would like to have among the proposal below ?

cpu : CPU0 | CPU1 | CPU2 | CPU3 | CPU4 buddy conf 1 : CPU2 | CPU0 | CPU2 | CPU2 | CPU2 buddy conf 2 : CPU2 | CPU2 | CPU2 | CPU2 | CPU2 buddy conf 3 : -1 | -1 | -1 | -1 | -1

When we look at the git://git.linaro.org/arm/big.LITTLE/mp.git big-LITTLE-MP-master-v12, we can see that you have defined a custom sched_domain which hasn't been updated with SD_SHARE_POWERLINE flag so the flag is cleared at CPU level. Based on this, I would say that you want buddy conf 2 ? but I would say that buddy conf 1 should give better result. Have you tried both ?
My goal with this fix is to set up the SD_SHARE_POWERLINE flags as they really are on TC2. It could have been done more elegantly. Since the HMP patches overrides the sched_domain flags at CPU level the SD_SHARE_POWERLINE is not being set by arch_sd_share_power_line(). With this fix we will get SD_SHARE_POWERLINE at MC level and no SD_SHARE_POWERLINE at CPU level, which I believe is the correct set up for TC2.

For the buddy configuration the goal is to get configuration 1 in your list above. You should get that when using the other patch to fix the buddy selection algorithm. I'm not sure if conf 1 or 2 is best. I think it depends on the power/performance trade-off of the specific platform. conf 1 may lead to CPU1->CPU0->CPU2 migrations which may be undesirable. If your cpus are very leaky it might make sense to not do packing at all inside a high performance cluster and always do packing directly on a another low power cluster like conf 2. I think this needs further investigation.

I have only tested with conf 1 on TC2.
Hi Morten,

Conf1 is the default configuration for ARM platform because SD_SHARE_POWERLINE is cleared at all levels for this architecture.

Conf2 should be used if you can't powergate the core independently but several tests have demonstrated that even if you can't powergate each core independently, it worth packing small task on few CPUs in a core so it's worth using conf1 on TC2 as well.

Based on your explanation, we should use the original configuration of SD_SHARE_POWERLINE (cleared at all level for ARM platform)
I agree that the result is the same, but I don't like disabling SD_SHARE_POWERLINE for all level when the cpus in each cluster actually are in the same power domain as it is the case on TC2. The name SHARE_POWERLINE implies a clear relation to the actual hardware design, thus setting the flags differently than the actual hardware design is misleading in my opinion. If the buddy selection algorithm doesn't select appropriate buddies when flags are set to reflect the actual hardware design I would suggest changing the buddy selection algorithm instead of changing the sched_domain flags.

If it is chosen to not have a direct relation between the flags and the hardware design, I think that the flag should be renamed so it doesn't give the wrong impression.
There is a direct link between the powergating and the SHARE_POWERLINE and if you want that the buddy selection strictly reflects your HW configuration, you must use conf2 and not conf1.

I just want the buddy selection to be reasonable when the SHARE_POWERLINE flags are reflecting the true hardware configuration. I haven't tested whether conf 1 or 2 is best yet. As long as I am getting one them it is definitely an improvement over not having task packing at all :)

...

Now, beside the packing small task patch and the TC2 configuration, it has been proven that packing small tasks on an ARM platform (dual cortex-A9) which can only powergate the cluster, improves the power consumption of some low cpu load use cases like the MP3 playback (we had used cpu hotplug at that time). This assumption has been proven only for ARM platform and that's why the SHARE_POWERLINE is cleared at all level for ARM platform even if you can only do some WFI inside the cluster.

Yes, but other cpus may be significantly more expensive to have in WFI so the power difference between WFI and running some lightweight task might be small. In that case it could make sense to spread tasks instead and hope for more opportunities for cluster shutdown. I haven't got any proof of this being correct yet, but I think it is worth considering.

Morten

...

But i agree that if you want to strictly reflect the TC2 HW config, you should use the conf2 and set the SHARE_POWERLINE at MC level for the TC2 HMP configuration

Regards, Vincent

...
Morten

...
Regards Vincent

...
Regards, Morten

...
Regards, Vincent

...
}

const struct cpumask *cpu_coregroup_mask(int cpu)
-- 1.7.9.5

linaro-dev mailing list linaro-dev@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-dev

Liviu Dudau

16 Nov 16 Nov

6:32 p.m.

New subject: [HMP tunables v2][PATCH 4/7] Revert "HMP: Sysfs support for switching hmp on/off at runtime"

This reverts commit ad363efa74968250033ce2d9de00488d0cba9134.

Conflicts:

kernel/sched/core.c

--- arch/arm/Kconfig | 9 ----- arch/arm/include/asm/topology.h | 4 +- kernel/sched/core.c | 4 -- kernel/sched/fair.c | 81 --------------------------------------- 4 files changed, 2 insertions(+), 96 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index c957bb1..f15c657 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1603,15 +1603,6 @@ config HMP_SLOW_CPU_MASK Specify the cpuids of the slow CPUs in the system as a list string, e.g. cpuid 0+1 should be specified as 0-1.

-config SCHED_HMP_SYSFS - bool "HMP Scheduling tuning options" - depends on SCHED_HMP - help - Export sysfs tunables for experimental scheduler optimizations for - SCHED HMP. Allows modification migration thresholds and minimum - priorities. These tuning parameters are intended for engineering - tuning and are not enabled by default. - config HAVE_ARM_SCU bool help diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h index b76a898..983fa7c 100644 --- a/arch/arm/include/asm/topology.h +++ b/arch/arm/include/asm/topology.h @@ -42,8 +42,8 @@ int cluster_to_logical_mask(unsigned int socket_id, cpumask_t *cluster_mask); .newidle_idx = 0, \ .wake_idx = 0, \ .forkexec_idx = 0, \ - .flags = (!hmp_enbld)*SD_LOAD_BALANCE \ - | (!hmp_enbld)*SD_PREFER_SIBLING \ + \ + .flags = 0*SD_LOAD_BALANCE \ | 1*SD_BALANCE_NEWIDLE \ | 1*SD_BALANCE_EXEC \ | 1*SD_BALANCE_FORK \ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 9bada60..e94e274 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5839,10 +5839,6 @@ int __weak arch_sd_sibling_asym_packing(void) return 0*SD_ASYM_PACKING; }

-#ifdef CONFIG_SCHED_HMP -extern int hmp_enbld; -#endif - int __weak arch_sd_share_power_line(void) { return 1*SD_SHARE_POWERLINE; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d758086..e24f2e3 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -31,11 +31,6 @@

#include "sched.h"

-#ifdef CONFIG_SCHED_HMP_SYSFS -#include <linux/cpuset.h> -#include <linux/sysfs.h> -#endif - /* * Targeted preemption latency for CPU-bound tasks: * (default: 6ms * (1 + ilog(ncpus)), units: nanoseconds) @@ -3330,66 +3325,6 @@ static inline void hmp_next_down_delay(struct sched_entity *se, int cpu) se->avg.hmp_last_down_migration = cfs_rq_clock_task(cfs_rq); se->avg.hmp_last_up_migration = 0; } - -/* Tracks Task migration enabled or not */ -int hmp_enbld = 1; - -#ifdef CONFIG_SCHED_HMP_SYSFS -struct hmp_attr { - struct attribute attr; - ssize_t (*show)(struct kobject *kobj, - struct attribute *attr, char *buf); - ssize_t (*store)(struct kobject *a, struct attribute *b, - const char *c, size_t count); -}; - -#define nops() ; - -#define SCHED_HMP_SYSFS_INFO(_name, _lower_limit, _upper_limit, _fn) \ -static ssize_t sched_##_name##_show(struct kobject *kobj, \ - struct attribute *attr, char *buf) \ -{ \ - return sprintf(buf, "%u\n", _name); \ -} \ - \ -static ssize_t sched_##_name##_store(struct kobject *a, \ - struct attribute *b, const char *buf, size_t count) \ -{ \ - unsigned int level = 0; \ - if (sscanf(buf, "%u", &level) != 1) \ - return -EINVAL; \ - if (level < _lower_limit || level > _upper_limit) \ - return -EINVAL; \ - _name = level; \ - _fn(); \ - return count; \ -} \ - \ -static struct hmp_attr attr_##_name = \ -__ATTR(_name, 0644, sched_##_name##_show, sched_##_name##_store) - -SCHED_HMP_SYSFS_INFO(hmp_enbld, 0, 1, rebuild_sched_domains); -SCHED_HMP_SYSFS_INFO(hmp_up_threshold, 0, 1023, nops); -SCHED_HMP_SYSFS_INFO(hmp_down_threshold, 0, 1023, nops); -#ifdef CONFIG_SCHED_HMP_PRIO_FILTER -SCHED_HMP_SYSFS_INFO(hmp_up_prio, 100, 140, nops); -#endif - -static struct attribute *hmp_attributes[] = { - &attr_hmp_enbld.attr, - &attr_hmp_up_threshold.attr, - &attr_hmp_down_threshold.attr, -#ifdef CONFIG_SCHED_HMP_PRIO_FILTER - &attr_hmp_up_prio.attr, -#endif - NULL -}; - -static struct attribute_group hmp_attr_group = { - .attrs = hmp_attributes, - .name = "hmp_tuning", -}; -#endif /* CONFIG_SCHED_HMP_SYSFS */ #endif /* CONFIG_SCHED_HMP */

static inline bool is_buddy_busy(int cpu) @@ -3543,9 +3478,6 @@ unlock: rcu_read_unlock();

#ifdef CONFIG_SCHED_HMP - if (!hmp_enbld) - goto hmp_end; - if (hmp_up_migration(prev_cpu, &p->se)) { new_cpu = hmp_select_faster_cpu(p, prev_cpu); hmp_next_up_delay(&p->se, new_cpu); @@ -3561,7 +3493,6 @@ unlock: /* Make sure that the task stays in its previous hmp domain */ if (!cpumask_test_cpu(new_cpu, &hmp_cpu_domain(prev_cpu)->cpus)) return prev_cpu; -hmp_end: #endif

return new_cpu; @@ -6170,9 +6101,6 @@ static void hmp_force_up_migration(int this_cpu) unsigned int force; struct task_struct *p;

- if (!hmp_enbld) - return; - if (!spin_trylock(&hmp_force_migration)) return; for_each_online_cpu(cpu) { @@ -6725,12 +6653,3 @@ __init void init_sched_fair_class(void) #endif /* SMP */

} - -#ifdef CONFIG_SCHED_HMP_SYSFS -static int __init hmp_sched(void) -{ - BUG_ON(sysfs_create_group(&cpu_subsys.dev_root->kobj, &hmp_attr_group)); - return 0; -} -late_initcall(hmp_sched); -#endif

-- 1.7.9.5

Liviu Dudau

6:32 p.m.

New subject: [HMP tunables v2][PATCH 5/7] ARM: Change load tracking scale using sysfs

From: Olivier Cozette olivier.cozette@arm.com

These functions allow to change the load average period used in the task load average computation through /sys/kernel/hmp/load_avg_period_ms. This period is the time in ms to go from 0 to 0.5 load average while running or the time from 1 to 0.5 while sleeping.

The default one used is 32 and gives the same load_avg_ratio computation than without this patch. These functions also allow to change the up and down threshold of HMP using /sys/kernel/hmp/{up,down}_threshold. Both must be between 0 and 1024. The thresholds are divided by 1024 before being compared to the load_avg_ratio.

If /sys/kernel/hmp/load_avg_period_ms is 128 and /sys/kernel/hmp/up_threshold is 512, a task will be migrated to a bigger cluster after running for 128ms. Because after load_avg_period_ms the load average is 0.5 and real up_threshold us 512 / 1024 = 0.5.

Signed-off-by: Olivier Cozette olivier.cozette@arm.com Signed-off-by: Chris Redpath chris.redpath@arm.com --- arch/arm/Kconfig | 23 +++++++ kernel/sched/fair.c | 183 ++++++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 204 insertions(+), 2 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index f15c657..92ef121 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1603,6 +1603,29 @@ config HMP_SLOW_CPU_MASK Specify the cpuids of the slow CPUs in the system as a list string, e.g. cpuid 0+1 should be specified as 0-1.

+config HMP_VARIABLE_SCALE + bool "Allows changing the load tracking scale through sysfs" + depends on SCHED_HMP + help + When turned on, this option exports the thresholds and load average + period value for the load tracking patches through sysfs. + The values can be modified to change the rate of load accumulation + and the thresholds used for HMP migration. + The load_avg_period_ms is the time in ms to reach a load average of + 0.5 for an idle task of 0 load average ratio that start a busy loop. + The up_threshold and down_threshold is the value to go to a faster + CPU or to go back to a slower cpu. + The {up,down}_threshold are devided by 1024 before being compared + to the load average. + For examples, with load_avg_period_ms = 128 and up_threshold = 512, + a running task with a load of 0 will be migrated to a bigger CPU after + 128ms, because after 128ms its load_avg_ratio is 0.5 and the real + up_threshold is 0.5. + This patch has the same behavior as changing the Y of the load + average computation to + (1002/1024)^(LOAD_AVG_PERIOD/load_avg_period_ms) + but it remove intermadiate overflows in computation. + config HAVE_ARM_SCU bool help diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e24f2e3..0108da6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -28,6 +28,10 @@ #include <linux/interrupt.h>

#include <trace/events/sched.h> +#ifdef CONFIG_HMP_VARIABLE_SCALE +#include <linux/sysfs.h> +#include <linux/vmalloc.h> +#endif

#include "sched.h"

@@ -1043,8 +1047,10 @@ static u32 __compute_runnable_contrib(u64 n) return contrib + runnable_avg_yN_sum[n]; }

-/* - * We can represent the historical contribution to runnable average as the +#ifdef CONFIG_HMP_VARIABLE_SCALE +static u64 hmp_variable_scale_convert(u64 delta); +#endif +/* We can represent the historical contribution to runnable average as the * coefficients of a geometric series. To do this we sub-divide our runnable * history into segments of approximately 1ms (1024us); label the segment that * occurred N-ms ago p_N, with p_0 corresponding to the current period, e.g. @@ -1081,6 +1087,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int delta_w, decayed = 0;

delta = now - sa->last_runnable_update; +#ifdef CONFIG_HMP_VARIABLE_SCALE + delta = hmp_variable_scale_convert(delta); +#endif /* * This should only happen when time goes backwards, which it * unfortunately does during sched clock init when we swap over to TSC. @@ -3325,6 +3334,176 @@ static inline void hmp_next_down_delay(struct sched_entity *se, int cpu) se->avg.hmp_last_down_migration = cfs_rq_clock_task(cfs_rq); se->avg.hmp_last_up_migration = 0; } + +#ifdef CONFIG_HMP_VARIABLE_SCALE +/* + * Heterogenous multiprocessor (HMP) optimizations + * + * These functions allow to change the growing speed of the load_avg_ratio + * by default it goes from 0 to 0.5 in LOAD_AVG_PERIOD = 32ms + * This can now be changed with /sys/kernel/hmp/load_avg_period_ms. + * + * These functions also allow to change the up and down threshold of HMP + * using /sys/kernel/hmp/{up,down}_threshold. + * Both must be between 0 and 1023. The threshold that is compared + * to the load_avg_ratio is up_threshold/1024 and down_threshold/1024. + * + * For instance, if load_avg_period = 64 and up_threshold = 512, an idle + * task with a load of 0 will reach the threshold after 64ms of busy loop. + * + * Changing load_avg_periods_ms has the same effect than changing the + * default scaling factor Y=1002/1024 in the load_avg_ratio computation to + * (1002/1024.0)^(LOAD_AVG_PERIOD/load_avg_period_ms), but the last one + * could trigger overflows. + * For instance, with Y = 1023/1024 in __update_task_entity_contrib() + * "contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);" + * could be overflowed for a weight > 2^12 even is the load_avg_contrib + * should still be a 32bits result. This would not happen by multiplicating + * delta time by 1/22 and setting load_avg_period_ms = 706. + */ + +#define HMP_VARIABLE_SCALE_SHIFT 16ULL +struct hmp_global_attr { + struct attribute attr; + ssize_t (*show)(struct kobject *kobj, + struct attribute *attr, char *buf); + ssize_t (*store)(struct kobject *a, struct attribute *b, + const char *c, size_t count); + int *value; + int (*to_sysfs)(int); + int (*from_sysfs)(int); +}; + +#define HMP_DATA_SYSFS_MAX 3 + +struct hmp_data_struct { + int multiplier; /* used to scale the time delta */ + struct attribute_group attr_group; + struct attribute *attributes[HMP_DATA_SYSFS_MAX + 1]; + struct hmp_global_attr attr[HMP_DATA_SYSFS_MAX]; +} hmp_data; + +/* + * By scaling the delta time it end-up increasing or decrease the + * growing speed of the per entity load_avg_ratio + * The scale factor hmp_data.multiplier is a fixed point + * number: (32-HMP_VARIABLE_SCALE_SHIFT).HMP_VARIABLE_SCALE_SHIFT + */ +static u64 hmp_variable_scale_convert(u64 delta) +{ + u64 high = delta >> 32ULL; + u64 low = delta & 0xffffffffULL; + low *= hmp_data.multiplier; + high *= hmp_data.multiplier; + return (low >> HMP_VARIABLE_SCALE_SHIFT) + + (high << (32ULL - HMP_VARIABLE_SCALE_SHIFT)); +} + +static ssize_t hmp_show(struct kobject *kobj, + struct attribute *attr, char *buf) +{ + ssize_t ret = 0; + struct hmp_global_attr *hmp_attr = + container_of(attr, struct hmp_global_attr, attr); + int temp = *(hmp_attr->value); + if (hmp_attr->to_sysfs != NULL) + temp = hmp_attr->to_sysfs(temp); + ret = sprintf(buf, "%d\n", temp); + return ret; +} + +static ssize_t hmp_store(struct kobject *a, struct attribute *attr, + const char *buf, size_t count) +{ + int temp; + ssize_t ret = count; + struct hmp_global_attr *hmp_attr = + container_of(attr, struct hmp_global_attr, attr); + char *str = vmalloc(count + 1); + if (str == NULL) + return -ENOMEM; + memcpy(str, buf, count); + str[count] = 0; + if (sscanf(str, "%d", &temp) < 1) + ret = -EINVAL; + else { + if (hmp_attr->from_sysfs != NULL) + temp = hmp_attr->from_sysfs(temp); + if (temp < 0) + ret = -EINVAL; + else + *(hmp_attr->value) = temp; + } + vfree(str); + return ret; +} + +static int hmp_period_tofrom_sysfs(int value) +{ + return (LOAD_AVG_PERIOD << HMP_VARIABLE_SCALE_SHIFT) / value; +} + +/* max value for threshold is 1024 */ +static int hmp_theshold_from_sysfs(int value) +{ + if (value > 1024) + return -1; + return value; +} + +static void hmp_attr_add( + const char *name, + int *value, + int (*to_sysfs)(int), + int (*from_sysfs)(int)) +{ + int i = 0; + while (hmp_data.attributes[i] != NULL) { + i++; + if (i >= HMP_DATA_SYSFS_MAX) + return; + } + hmp_data.attr[i].attr.mode = 0644; + hmp_data.attr[i].show = hmp_show; + hmp_data.attr[i].store = hmp_store; + hmp_data.attr[i].attr.name = name; + hmp_data.attr[i].value = value; + hmp_data.attr[i].to_sysfs = to_sysfs; + hmp_data.attr[i].from_sysfs = from_sysfs; + hmp_data.attributes[i] = &hmp_data.attr[i].attr; + hmp_data.attributes[i + 1] = NULL; +} + +static int hmp_attr_init(void) +{ + int ret; + memset(&hmp_data, sizeof(hmp_data), 0); + /* by default load_avg_period_ms == LOAD_AVG_PERIOD + * meaning no change + */ + hmp_data.multiplier = hmp_period_tofrom_sysfs(LOAD_AVG_PERIOD); + + hmp_attr_add("load_avg_period_ms", + &hmp_data.multiplier, + hmp_period_tofrom_sysfs, + hmp_period_tofrom_sysfs); + hmp_attr_add("up_threshold", + &hmp_up_threshold, + NULL, + hmp_theshold_from_sysfs); + hmp_attr_add("down_threshold", + &hmp_down_threshold, + NULL, + hmp_theshold_from_sysfs); + + hmp_data.attr_group.name = "hmp"; + hmp_data.attr_group.attrs = hmp_data.attributes; + ret = sysfs_create_group(kernel_kobj, + &hmp_data.attr_group); + return 0; +} +late_initcall(hmp_attr_init); +#endif /* CONFIG_HMP_VARIABLE_SCALE */ #endif /* CONFIG_SCHED_HMP */

static inline bool is_buddy_busy(int cpu)

-- 1.7.9.5

Liviu Dudau

6:32 p.m.

New subject: [HMP tunables v2][PATCH 6/7] ARM: Experimental Frequency-Invariant Load Scaling Patch

From: Chris Redpath chris.redpath@arm.com

Evaluation Patch to investigate using load as a representation of the amount of POTENTIAL cpu compute capacity used rather than a representation of the CURRENT cpu compute capacity.

If CPUFreq is enabled, scales load in accordance with frequency.

Powersave/performance CPUFreq governors are detected and scaling is disabled while these governors are in use. This is because when a single-frequency governor is in use, potential CPU capacity is static.

So long as the governors and CPUFreq subsystem correctly report the frequencies available, the scaling should self tune.

Adds an additional file to sysfs to allow this feature to be disabled for experimentation.

/sys/kernel/hmp/frequency_invariant_load_scale

write 0 to disable, 1 to enable.

Signed-off-by: Chris Redpath chris.redpath@arm.com --- arch/arm/Kconfig | 15 +++ kernel/sched/fair.c | 320 ++++++++++++++++++++++++++++++++++++++++++++++----- 2 files changed, 305 insertions(+), 30 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 92ef121..22c3adc 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1626,6 +1626,21 @@ config HMP_VARIABLE_SCALE (1002/1024)^(LOAD_AVG_PERIOD/load_avg_period_ms) but it remove intermadiate overflows in computation.

+config HMP_FREQUENCY_INVARIANT_SCALE + bool "(EXPERIMENTAL) Frequency-Invariant Tracked Load for HMP" + depends on HMP_VARIABLE_SCALE && CPU_FREQ + help + Scales the current load contribution in line with the frequency + of the CPU that the task was executed on. + In this version, we use a simple linear scale derived from the + maximum frequency reported by CPUFreq. + Restricting tracked load to be scaled by the CPU's frequency + represents the consumption of possible compute capacity + (rather than consumption of actual instantaneous capacity as + normal) and allows the HMP migration's simple threshold + migration strategy to interact more predictably with CPUFreq's + asynchronous compute capacity changes. + config HAVE_ARM_SCU bool help diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0108da6..18969b8 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -31,10 +31,17 @@ #ifdef CONFIG_HMP_VARIABLE_SCALE #include <linux/sysfs.h> #include <linux/vmalloc.h> -#endif +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE +/* Include cpufreq header to add a notifier so that cpu frequency + * scaling can track the current CPU frequency + */ +#include <linux/cpufreq.h> +#endif /* CONFIG_HMP_FREQUENCY_INVARIANT_SCALE */ +#endif /* CONFIG_HMP_VARIABLE_SCALE */

#include "sched.h"

+ /* * Targeted preemption latency for CPU-bound tasks: * (default: 6ms * (1 + ilog(ncpus)), units: nanoseconds) @@ -1048,8 +1055,93 @@ static u32 __compute_runnable_contrib(u64 n) }

#ifdef CONFIG_HMP_VARIABLE_SCALE -static u64 hmp_variable_scale_convert(u64 delta); + +#define HMP_VARIABLE_SCALE_SHIFT 16ULL +struct hmp_global_attr { + struct attribute attr; + ssize_t (*show)(struct kobject *kobj, + struct attribute *attr, char *buf); + ssize_t (*store)(struct kobject *a, struct attribute *b, + const char *c, size_t count); + int *value; + int (*to_sysfs)(int); + int (*from_sysfs)(int); +}; + +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE +#define HMP_DATA_SYSFS_MAX 4 +#else +#define HMP_DATA_SYSFS_MAX 3 +#endif + +struct hmp_data_struct { +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE + int freqinvar_load_scale_enabled; #endif + int multiplier; /* used to scale the time delta */ + struct attribute_group attr_group; + struct attribute *attributes[HMP_DATA_SYSFS_MAX + 1]; + struct hmp_global_attr attr[HMP_DATA_SYSFS_MAX]; +} hmp_data; + +static u64 hmp_variable_scale_convert(u64 delta); +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE +/* Frequency-Invariant Load Modification: + * Loads are calculated as in PJT's patch however we also scale the current + * contribution in line with the frequency of the CPU that the task was + * executed on. + * In this version, we use a simple linear scale derived from the maximum + * frequency reported by CPUFreq. As an example: + * + * Consider that we ran a task for 100% of the previous interval. + * + * Our CPU was under asynchronous frequency control through one of the + * CPUFreq governors. + * + * The CPUFreq governor reports that it is able to scale the CPU between + * 500MHz and 1GHz. + * + * During the period, the CPU was running at 1GHz. + * + * In this case, our load contribution for that period is calculated as + * 1 * (number_of_active_microseconds) + * + * This results in our task being able to accumulate maximum load as normal. + * + * + * Consider now that our CPU was executing at 500MHz. + * + * We now scale the load contribution such that it is calculated as + * 0.5 * (number_of_active_microseconds) + * + * Our task can only record 50% maximum load during this period. + * + * This represents the task consuming 50% of the CPU's *possible* compute + * capacity. However the task did consume 100% of the CPU's *available* + * compute capacity which is the value seen by the CPUFreq governor and + * user-side CPU Utilization tools. + * + * Restricting tracked load to be scaled by the CPU's frequency accurately + * represents the consumption of possible compute capacity and allows the + * HMP migration's simple threshold migration strategy to interact more + * predictably with CPUFreq's asynchronous compute capacity changes. + */ +#define SCHED_FREQSCALE_SHIFT 10 +struct cpufreq_extents { + u32 curr_scale; + u32 min; + u32 max; + u32 flags; +}; +/* Flag set when the governor in use only allows one frequency. + * Disables scaling. + */ +#define SCHED_LOAD_FREQINVAR_SINGLEFREQ 0x01 + +static struct cpufreq_extents freq_scale[CONFIG_NR_CPUS]; +#endif /* CONFIG_HMP_FREQUENCY_INVARIANT_SCALE */ +#endif /* CONFIG_HMP_VARIABLE_SCALE */ + /* We can represent the historical contribution to runnable average as the * coefficients of a geometric series. To do this we sub-divide our runnable * history into segments of approximately 1ms (1024us); label the segment that @@ -1080,11 +1172,18 @@ static u64 hmp_variable_scale_convert(u64 delta); static __always_inline int __update_entity_runnable_avg(u64 now, struct sched_avg *sa, int runnable, - int running) + int running, + int cpu) { u64 delta, periods; u32 runnable_contrib; int delta_w, decayed = 0; +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE + u64 scaled_delta; + u32 scaled_runnable_contrib; + int scaled_delta_w; + u32 curr_scale = 1024; +#endif /* CONFIG_HMP_FREQUENCY_INVARIANT_SCALE */

delta = now - sa->last_runnable_update; #ifdef CONFIG_HMP_VARIABLE_SCALE @@ -1108,6 +1207,12 @@ static __always_inline int __update_entity_runnable_avg(u64 now, return 0; sa->last_runnable_update = now;

+#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE + /* retrieve scale factor for load */ + if (hmp_data.freqinvar_load_scale_enabled) + curr_scale = freq_scale[cpu].curr_scale; +#endif /* CONFIG_HMP_FREQUENCY_INVARIANT_SCALE */ + /* delta_w is the amount already accumulated against our next period */ delta_w = sa->runnable_avg_period % 1024; if (delta + delta_w >= 1024) { @@ -1120,10 +1225,20 @@ static __always_inline int __update_entity_runnable_avg(u64 now, * period and accrue it. */ delta_w = 1024 - delta_w; + /* scale runnable time if necessary */ +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE + scaled_delta_w = (delta_w * curr_scale) + >> SCHED_FREQSCALE_SHIFT; + if (runnable) + sa->runnable_avg_sum += scaled_delta_w; + if (running) + sa->usage_avg_sum += scaled_delta_w; +#else if (runnable) sa->runnable_avg_sum += delta_w; if (running) sa->usage_avg_sum += delta_w; +#endif /* #ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE */ sa->runnable_avg_period += delta_w;

delta -= delta_w; @@ -1131,27 +1246,49 @@ static __always_inline int __update_entity_runnable_avg(u64 now, /* Figure out how many additional periods this update spans */ periods = delta / 1024; delta %= 1024; - + /* decay the load we have accumulated so far */ sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum, periods + 1); sa->runnable_avg_period = decay_load(sa->runnable_avg_period, periods + 1); sa->usage_avg_sum = decay_load(sa->usage_avg_sum, periods + 1); - + /* add the contribution from this period */ /* Efficiently calculate \sum (1..n_period) 1024*y^i */ runnable_contrib = __compute_runnable_contrib(periods); + /* Apply load scaling if necessary. + * Note that multiplying the whole series is same as + * multiplying all terms + */ +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE + scaled_runnable_contrib = (runnable_contrib * curr_scale) + >> SCHED_FREQSCALE_SHIFT; + if (runnable) + sa->runnable_avg_sum += scaled_runnable_contrib; + if (running) + sa->usage_avg_sum += scaled_runnable_contrib; +#else if (runnable) sa->runnable_avg_sum += runnable_contrib; if (running) sa->usage_avg_sum += runnable_contrib; +#endif /* CONFIG_HMP_FREQUENCY_INVARIANT_SCALE */ sa->runnable_avg_period += runnable_contrib; }

/* Remainder of delta accrued against u_0` */ + /* scale if necessary */ +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE + scaled_delta = ((delta * curr_scale) >> SCHED_FREQSCALE_SHIFT); + if (runnable) + sa->runnable_avg_sum += scaled_delta; + if (running) + sa->usage_avg_sum += scaled_delta; +#else if (runnable) sa->runnable_avg_sum += delta; if (running) sa->usage_avg_sum += delta; +#endif /* CONFIG_HMP_FREQUENCY_INVARIANT_SCALE */ sa->runnable_avg_period += delta;

return decayed; @@ -1330,7 +1467,7 @@ static inline void update_entity_load_avg(struct sched_entity *se, now = cfs_rq_clock_task(group_cfs_rq(se));

if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq, - cfs_rq->curr == se)) + cfs_rq->curr == se, se->cfs_rq->rq->cpu)) return;

contrib_delta = __update_entity_load_avg_contrib(se); @@ -1377,7 +1514,7 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable) { u32 contrib; __update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable, - runnable); + runnable, rq->cpu); __update_tg_runnable_avg(&rq->avg, &rq->cfs); contrib = rq->avg.runnable_avg_sum * scale_load_down(1024); contrib /= (rq->avg.runnable_avg_period + 1); @@ -3362,27 +3499,6 @@ static inline void hmp_next_down_delay(struct sched_entity *se, int cpu) * delta time by 1/22 and setting load_avg_period_ms = 706. */

-#define HMP_VARIABLE_SCALE_SHIFT 16ULL -struct hmp_global_attr { - struct attribute attr; - ssize_t (*show)(struct kobject *kobj, - struct attribute *attr, char *buf); - ssize_t (*store)(struct kobject *a, struct attribute *b, - const char *c, size_t count); - int *value; - int (*to_sysfs)(int); - int (*from_sysfs)(int); -}; - -#define HMP_DATA_SYSFS_MAX 3 - -struct hmp_data_struct { - int multiplier; /* used to scale the time delta */ - struct attribute_group attr_group; - struct attribute *attributes[HMP_DATA_SYSFS_MAX + 1]; - struct hmp_global_attr attr[HMP_DATA_SYSFS_MAX]; -} hmp_data; - /* * By scaling the delta time it end-up increasing or decrease the * growing speed of the per entity load_avg_ratio @@ -3450,7 +3566,15 @@ static int hmp_theshold_from_sysfs(int value) return -1; return value; } - +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE +/* freqinvar control is only 0,1 off/on */ +static int hmp_freqinvar_from_sysfs(int value) +{ + if (value < 0 || value > 1) + return -1; + return value; +} +#endif static void hmp_attr_add( const char *name, int *value, @@ -3495,7 +3619,14 @@ static int hmp_attr_init(void) &hmp_down_threshold, NULL, hmp_theshold_from_sysfs); - +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE + /* default frequency-invariant scaling ON */ + hmp_data.freqinvar_load_scale_enabled = 1; + hmp_attr_add("frequency_invariant_load_scale", + &hmp_data.freqinvar_load_scale_enabled, + NULL, + hmp_freqinvar_from_sysfs); +#endif hmp_data.attr_group.name = "hmp"; hmp_data.attr_group.attrs = hmp_data.attributes; ret = sysfs_create_group(kernel_kobj, @@ -6832,3 +6963,132 @@ __init void init_sched_fair_class(void) #endif /* SMP */

} + +#ifdef CONFIG_HMP_FREQUENCY_INVARIANT_SCALE +static u32 cpufreq_calc_scale(u32 min, u32 max, u32 curr) +{ + u32 result = curr / max; + return result; +} + +/* Called when the CPU Frequency is changed. + * Once for each CPU. + */ +static int cpufreq_callback(struct notifier_block *nb, + unsigned long val, void *data) +{ + struct cpufreq_freqs *freq = data; + int cpu = freq->cpu; + struct cpufreq_extents *extents; + + if (freq->flags & CPUFREQ_CONST_LOOPS) + return NOTIFY_OK; + + if (val != CPUFREQ_POSTCHANGE) + return NOTIFY_OK; + + /* if dynamic load scale is disabled, set the load scale to 1.0 */ + if (!hmp_data.freqinvar_load_scale_enabled) { + freq_scale[cpu].curr_scale = 1024; + return NOTIFY_OK; + } + + extents = &freq_scale[cpu]; + if (extents->flags & SCHED_LOAD_FREQINVAR_SINGLEFREQ) { + /* If our governor was recognised as a single-freq governor, + * use 1.0 + */ + extents->curr_scale = 1024; + } else { + extents->curr_scale = cpufreq_calc_scale(extents->min, + extents->max, freq->new); + } + + return NOTIFY_OK; +} + +/* Called when the CPUFreq governor is changed. + * Only called for the CPUs which are actually changed by the + * userspace. + */ +static int cpufreq_policy_callback(struct notifier_block *nb, + unsigned long event, void *data) +{ + struct cpufreq_policy *policy = data; + struct cpufreq_extents *extents; + int cpu, singleFreq = 0; + static const char performance_governor[] = "performance"; + static const char powersave_governor[] = "powersave"; + + if (event == CPUFREQ_START) + return 0; + + if (event != CPUFREQ_INCOMPATIBLE) + return 0; + + /* CPUFreq governors do not accurately report the range of + * CPU Frequencies they will choose from. + * We recognise performance and powersave governors as + * single-frequency only. + */ + if (!strncmp(policy->governor->name, performance_governor, + strlen(performance_governor)) || + !strncmp(policy->governor->name, powersave_governor, + strlen(powersave_governor))) + singleFreq = 1; + + /* Make sure that all CPUs impacted by this policy are + * updated since we will only get a notification when the + * user explicitly changes the policy on a CPU. + */ + for_each_cpu(cpu, policy->cpus) { + extents = &freq_scale[cpu]; + extents->max = policy->max >> SCHED_FREQSCALE_SHIFT; + extents->min = policy->min >> SCHED_FREQSCALE_SHIFT; + if (!hmp_data.freqinvar_load_scale_enabled) { + extents->curr_scale = 1024; + } else if (singleFreq) { + extents->flags |= SCHED_LOAD_FREQINVAR_SINGLEFREQ; + extents->curr_scale = 1024; + } else { + extents->flags &= ~SCHED_LOAD_FREQINVAR_SINGLEFREQ; + extents->curr_scale = cpufreq_calc_scale(extents->min, + extents->max, policy->cur); + } + } + + return 0; +} + +static struct notifier_block cpufreq_notifier = { + .notifier_call = cpufreq_callback, +}; +static struct notifier_block cpufreq_policy_notifier = { + .notifier_call = cpufreq_policy_callback, +}; + +static int __init register_sched_cpufreq_notifier(void) +{ + int ret = 0; + + /* init safe defaults since there are no policies at registration */ + for (ret = 0; ret < CONFIG_NR_CPUS; ret++) { + /* safe defaults */ + freq_scale[ret].max = 1024; + freq_scale[ret].min = 1024; + freq_scale[ret].curr_scale = 1024; + } + + pr_info("sched: registering cpufreq notifiers for scale-invariant loads\n"); + ret = cpufreq_register_notifier(&cpufreq_policy_notifier, + CPUFREQ_POLICY_NOTIFIER); + + if (ret != -EINVAL) + ret = cpufreq_register_notifier(&cpufreq_notifier, + CPUFREQ_TRANSITION_NOTIFIER); + + return ret; +} + +core_initcall(register_sched_cpufreq_notifier); +#endif /* CONFIG_HMP_FREQUENCY_INVARIANT_SCALE */

-- 1.7.9.5

Liviu Dudau

6:32 p.m.

New subject: [HMP tunables v2][PATCH 7/7] linaro/configs: big-LITTLE-MP: Enable the new tunable sysfs interface by default.

Enable the new tunable sysfs interface for HMP scaling invariants.

Signed-of-by: Liviu Dudau Liviu.Dudau@arm.com --- linaro/configs/big-LITTLE-MP.conf | 2 ++ 1 file changed, 2 insertions(+)

diff --git a/linaro/configs/big-LITTLE-MP.conf b/linaro/configs/big-LITTLE-MP.conf index 9ee51e4..b88ee53 100644 --- a/linaro/configs/big-LITTLE-MP.conf +++ b/linaro/configs/big-LITTLE-MP.conf @@ -8,5 +8,7 @@ CONFIG_SCHED_HMP=y CONFIG_SCHED_HMP_SYSFS=y CONFIG_HMP_FAST_CPU_MASK="" CONFIG_HMP_SLOW_CPU_MASK="" +CONFIG_HMP_VARIABLE_SCALE=y +CONFIG_HMP_FREQUENCY_INVARIANT_SCALE=y CONFIG_SCHED_HMP_PRIO_FILTER=y CONFIG_SCHED_HMP_PRIO_FILTER_VAL=5

-- 1.7.9.5

Viresh Kumar

17 Nov 17 Nov

7:19 a.m.

On 17 November 2012 00:02, Liviu Dudau Liviu.Dudau@arm.com wrote:

...

Here are the latest patches for HMP tunables to be included in the MP branch for the 12.11 release. They depend on Vincent Guittot's patches that you have on -exp-v12 branch which we need included in the PULL request. Testing shows that they improve power consumption for HMP.

Hi Liviu,

I don't apply patches over a merge, but over topic branches. I have placed your patches over vincent's, morten's and config-fragment branches separately.

-- viresh

Viresh Kumar

5 Dec 5 Dec

5:27 a.m.

On 17 November 2012 00:02, Liviu Dudau Liviu.Dudau@arm.com wrote:

...

Here are the latest patches for HMP tunables to be included in the MP branch for the 12.11 release. They depend on Vincent Guittot's patches that you have on -exp-v12 branch which we need included in the PULL request. Testing shows that they improve power consumption for HMP.

Hi Liviu,

There are two branches present currently for sched-pack-small-task: - sched-pack-small-tasks-v1-fixed: Original one from Vincent + 2 fixes - sched-pack-small-tasks-v1-arm: One fix reverted for now + patches from this mail thread.

I don't really want to keep the fix from Vincent reverted. Can we please test again, the -fixed branch, and check which patches are still required ?

-- viresh

Liviu Dudau

10:58 a.m.

On Wed, Dec 05, 2012 at 05:27:53AM +0000, Viresh Kumar wrote:

...

On 17 November 2012 00:02, Liviu Dudau Liviu.Dudau@arm.com wrote:

...
Here are the latest patches for HMP tunables to be included in the MP branch for the 12.11 release. They depend on Vincent Guittot's patches that you have on -exp-v12 branch which we need included in the PULL request. Testing shows that they improve power consumption for HMP.

Hi Liviu,

There are two branches present currently for sched-pack-small-task:

sched-pack-small-tasks-v1-fixed: Original one from Vincent + 2 fixes

sched-pack-small-tasks-v1-arm: One fix reverted for now + patches from this mail thread.

I don't really want to keep the fix from Vincent reverted. Can we please test again, the -fixed branch, and check which patches are still required ?

-- viresh

Hi Viresh,

The revert request came at Morten's suggestion. He has comments on the code and technical reasons why he believes that the approach is not the best one as well as some scenarios where possible race conditions can occur.

Morten, what is the latest update in this area. I'm not sure I have followed your discussion with Vincent on the subject.

Regards, Liviu

Viresh Kumar

11:01 a.m.

On 5 December 2012 16:28, Liviu Dudau Liviu.Dudau@arm.com wrote:

...

The revert request came at Morten's suggestion. He has comments on the code and technical reasons why he believes that the approach is not the best one as well as some scenarios where possible race conditions can occur.

Morten, what is the latest update in this area. I'm not sure I have followed your discussion with Vincent on the subject.

Just to make it more clear.. There are two reverts now. Please look at the latest tree/branches. Vincent has provided another fixup patch after which he commented we no longer need Mortens fix.

I have reverted that too, for the moment to keep things same as the last release. Can Morten test with latest patches from Vincent (from his branch) ? And provide fixups again ?

Morten Rasmussen

11:28 a.m.

On 05/12/12 11:01, Viresh Kumar wrote:

...

On 5 December 2012 16:28, Liviu Dudau Liviu.Dudau@arm.com wrote:

...
The revert request came at Morten's suggestion. He has comments on the code and technical reasons why he believes that the approach is not the best one as well as some scenarios where possible race conditions can occur.

Morten, what is the latest update in this area. I'm not sure I have followed your discussion with Vincent on the subject.

Just to make it more clear.. There are two reverts now. Please look at the latest tree/branches. Vincent has provided another fixup patch after which he commented we no longer need Mortens fix.

I have reverted that too, for the moment to keep things same as the last release. Can Morten test with latest patches from Vincent (from his branch) ? And provide fixups again ?

Hi,

I tested Vincent's fix ("sched: pack small tasks: fix update packing domain") for the buddy selection some weeks ago and confirmed that it works. So my quick fixes are no longer necessary.

The issues around the reverted "sched: secure access to other CPU statistics" have not yet been resolved. I don't think that we should re-enable it until we are clear about what it is doing.

Morten

-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

Viresh Kumar

11:35 a.m.

On 5 December 2012 16:58, Morten Rasmussen morten.rasmussen@arm.com wrote:

...

I tested Vincent's fix ("sched: pack small tasks: fix update packing domain") for the buddy selection some weeks ago and confirmed that it works. So my quick fixes are no longer necessary.

The issues around the reverted "sched: secure access to other CPU statistics" have not yet been resolved. I don't think that we should re-enable it until we are clear about what it is doing.

There are four patches i am carrying from ARM

4a29297 ARM: TC2: Re-enable SD_SHARE_POWERLINE a1924a4 sched: SD_SHARE_POWERLINE buddy selection fix 39b0e77 Revert "sched: secure access to other CPU statistics" eed72c8 Revert "sched: pack small tasks: fix update packing domain"

You want me to drop eed72c8 and a1924a4 ? Correct.

-- viresh

Morten Rasmussen

11:40 a.m.

On 05/12/12 11:35, Viresh Kumar wrote:

...

On 5 December 2012 16:58, Morten Rasmussen morten.rasmussen@arm.com wrote:

...
I tested Vincent's fix ("sched: pack small tasks: fix update packing domain") for the buddy selection some weeks ago and confirmed that it works. So my quick fixes are no longer necessary.

The issues around the reverted "sched: secure access to other CPU statistics" have not yet been resolved. I don't think that we should re-enable it until we are clear about what it is doing.

There are four patches i am carrying from ARM

4a29297 ARM: TC2: Re-enable SD_SHARE_POWERLINE a1924a4 sched: SD_SHARE_POWERLINE buddy selection fix 39b0e77 Revert "sched: secure access to other CPU statistics" eed72c8 Revert "sched: pack small tasks: fix update packing domain"

You want me to drop eed72c8 and a1924a4 ? Correct.

Yes.

Morten

...

-- viresh

Viresh Kumar

11:51 a.m.

On 5 December 2012 17:10, Morten Rasmussen morten.rasmussen@arm.com wrote:

...

...
You want me to drop eed72c8 and a1924a4 ? Correct.

Yes.

Done. Updated my repo with v13.

Vincent Guittot

6 Dec 6 Dec

9:20 a.m.

On 5 December 2012 12:51, Viresh Kumar viresh.kumar@linaro.org wrote:

...

On 5 December 2012 17:10, Morten Rasmussen morten.rasmussen@arm.com wrote:

...
...
You want me to drop eed72c8 and a1924a4 ? Correct.

Yes.

Done. Updated my repo with v13.

Viresh,

It's look like you have disabled packing small task in your v13 with 4a29297

In addition, I don't fully understand why you absolutely want to revert 39b0e77. In the worst case, it will not fully solve what it should have solved but it doesn't add any regression.

According to my test, sched-pack-small-tasks-v1-arm is enough if you want to use packing small task on tc2

Regards, Vincent

Viresh Kumar

9:32 a.m.

Hi Vincent,

On 6 December 2012 14:50, Vincent Guittot vincent.guittot@linaro.org wrote:

...

It's look like you have disabled packing small task in your v13 with 4a29297

It looks like you are referring to an older version of this branch as the commit id's don't match. Can you please check that again?

...

In addition, I don't fully understand why you absolutely want to revert 39b0e77. In the worst case, it will not fully solve what it should have solved but it doesn't add any regression.

I don't want to, but Liviu and Morten do :) @Liviu/Morten: Your comments

...

According to my test, sched-pack-small-tasks-v1-arm is enough if you want to use packing small task on tc2

I didn't get this one. This is what i have included in my tree.

-- viresh

Vincent Guittot

9:40 a.m.

On 6 December 2012 10:32, Viresh Kumar viresh.kumar@linaro.org wrote:

...

Hi Vincent,

On 6 December 2012 14:50, Vincent Guittot vincent.guittot@linaro.org wrote:

...
It's look like you have disabled packing small task in your v13 with 4a29297

It looks like you are referring to an older version of this branch as the commit id's don't match. Can you please check that again?

I have looked into http://git.linaro.org/gitweb?p=people/vireshk/linux-linaro-big-LITTLE-MP.git...

and more precisely the file http://git.linaro.org/gitweb?p=people/vireshk/linux-linaro-big-LITTLE-MP.git...

...

...
In addition, I don't fully understand why you absolutely want to revert 39b0e77. In the worst case, it will not fully solve what it should have solved but it doesn't add any regression.

I don't want to, but Liviu and Morten do :) @Liviu/Morten: Your comments

...
According to my test, sched-pack-small-tasks-v1-arm is enough if you want to use packing small task on tc2

s/sched-pack-small-tasks-v1-arm/sched-pack-small-tasks-v1-fixed. wrong copy paste sched-pack-small-tasks-v1-arm ius the one that is merged

...

I didn't get this one. This is what i have included in my tree.

-- viresh

Viresh Kumar

9:46 a.m.

On 6 December 2012 15:10, Vincent Guittot vincent.guittot@linaro.org wrote:

...

On 6 December 2012 10:32, Viresh Kumar viresh.kumar@linaro.org wrote:

...
Hi Vincent,

On 6 December 2012 14:50, Vincent Guittot vincent.guittot@linaro.org wrote:

...
It's look like you have disabled packing small task in your v13 with 4a29297

It looks like you are referring to an older version of this branch as the commit id's don't match. Can you please check that again?

I have looked into http://git.linaro.org/gitweb?p=people/vireshk/linux-linaro-big-LITTLE-MP.git...

So, sha-id is 3f1dff11 not 4a29297 :)

http://git.linaro.org/gitweb?p=people/vireshk/linux-linaro-big-LITTLE-MP.git...

...

...
...
According to my test, sched-pack-small-tasks-v1-arm is enough if you want to use packing small task on tc2

s/sched-pack-small-tasks-v1-arm/sched-pack-small-tasks-v1-fixed. wrong copy paste sched-pack-small-tasks-v1-arm ius the one that is merged

Looks sensible now :)

-- viresh

Liviu Dudau

10:27 a.m.

On Thu, Dec 06, 2012 at 09:32:11AM +0000, Viresh Kumar wrote:

...

Hi Vincent,

On 6 December 2012 14:50, Vincent Guittot vincent.guittot@linaro.org wrote:

...
It's look like you have disabled packing small task in your v13 with 4a29297

It looks like you are referring to an older version of this branch as the commit id's don't match. Can you please check that again?

...
In addition, I don't fully understand why you absolutely want to revert 39b0e77. In the worst case, it will not fully solve what it should have solved but it doesn't add any regression.

I don't want to, but Liviu and Morten do :) @Liviu/Morten: Your comments

Hi Vincent,

I was just a messenger here, but this is my understanding of where things are:

- Morten has reviewed your patch and had provided comments, mainly pointing towards the fact that the guarantees that the code is trying to provide are not met. - You have replied/promised to review and update the patch (is that correct?) - Nothing has happened since in this area (is that true?)

To us, looking at the code (mainly the while loop in the is_buddy_busy() function), is seem that you are trying to guarantee that the runnable_avg_sum and the runnable_avg_period that you read from the runqueue have not been updated before deciding if that cpu is busy. But there is no reason* why another thread could not update the runnable_avg_sum and then get preempted before updating the runnable_avg_period, which means you will still use the wrong values when doing the "greater than" check in the return statement.

The other point is that even if your code is "broken" (i.e. you read stale values that still get you out of the while loop) the code works but your estimate of a CPU being busy is off by .... 10% ? In other words, having that while loop in is_buddy_busy() does not change the behaviour of the rest of the code. So the objection is on having the while loop and trying to claim guarantees that are not met. Why not removing the loop?

Hope I'm making sense.

* my understanding of the scheduler code is rather incomplete, so I'm expecting to be proven wrong here.

Best regards, Liviu

...

...
According to my test, sched-pack-small-tasks-v1-arm is enough if you want to use packing small task on tc2

I didn't get this one. This is what i have included in my tree.

-- viresh

Vincent Guittot

11:50 a.m.

On 6 December 2012 11:27, Liviu Dudau Liviu.Dudau@arm.com wrote:

...

On Thu, Dec 06, 2012 at 09:32:11AM +0000, Viresh Kumar wrote:

...
Hi Vincent,

On 6 December 2012 14:50, Vincent Guittot vincent.guittot@linaro.org wrote:

...
It's look like you have disabled packing small task in your v13 with 4a29297

It looks like you are referring to an older version of this branch as the commit id's don't match. Can you please check that again?

...
In addition, I don't fully understand why you absolutely want to revert 39b0e77. In the worst case, it will not fully solve what it should have solved but it doesn't add any regression.

I don't want to, but Liviu and Morten do :) @Liviu/Morten: Your comments

Hi Vincent,

I was just a messenger here, but this is my understanding of where things are:

Morten has reviewed your patch and had provided comments, mainly pointing towards the fact that the guarantees that the code is trying to provide are not met.

You have replied/promised to review and update the patch (is that correct?)

Nothing has happened since in this area (is that true?)

both are true

...

To us, looking at the code (mainly the while loop in the is_buddy_busy() function), is seem that you are trying to guarantee that the runnable_avg_sum and the runnable_avg_period that you read from the runqueue have not been updated before deciding if that cpu is busy. But there is no reason* why another thread could not update the runnable_avg_sum and then get preempted before updating the runnable_avg_period, which means you will still use the wrong values when doing the "greater than" check in the return statement.

No, I don't want to ensure that values have not been updated before I use them but that both values are coherent

...

The other point is that even if your code is "broken" (i.e. you read stale values that still get you out of the while loop) the code works but your estimate of a CPU being busy is off by .... 10% ? In other words, having that while loop in is_buddy_busy() does not change the behaviour of the rest of the code. So the objection is on having the while loop and trying to claim guarantees that are not met. Why not removing the loop?

So if it doesn't change the behavior why removing it ? It make the maintenance of patches series more complex.

...

Hope I'm making sense.

my understanding of the scheduler code is rather incomplete, so I'm

expecting to be proven wrong here.

Best regards, Liviu

...
...
According to my test, sched-pack-small-tasks-v1-arm is enough if you want to use packing small task on tc2

I didn't get this one. This is what i have included in my tree.

-- viresh

--

| I would like to | | fix the world, | | but they're not | | giving me the | \ source code! /
¯\_(ツ)_/¯

Liviu Dudau

12:06 p.m.

On Thu, Dec 06, 2012 at 11:50:15AM +0000, Vincent Guittot wrote:

...

On 6 December 2012 11:27, Liviu Dudau Liviu.Dudau@arm.com wrote:

...
On Thu, Dec 06, 2012 at 09:32:11AM +0000, Viresh Kumar wrote:

...
Hi Vincent,

On 6 December 2012 14:50, Vincent Guittot vincent.guittot@linaro.org wrote:

...
It's look like you have disabled packing small task in your v13 with 4a29297

It looks like you are referring to an older version of this branch as the commit id's don't match. Can you please check that again?

...
In addition, I don't fully understand why you absolutely want to revert 39b0e77. In the worst case, it will not fully solve what it should have solved but it doesn't add any regression.

I don't want to, but Liviu and Morten do :) @Liviu/Morten: Your comments

Hi Vincent,

I was just a messenger here, but this is my understanding of where things are:

Morten has reviewed your patch and had provided comments, mainly pointing towards the fact that the guarantees that the code is trying to provide are not met.

You have replied/promised to review and update the patch (is that correct?)

Nothing has happened since in this area (is that true?)

both are true

...
To us, looking at the code (mainly the while loop in the is_buddy_busy() function), is seem that you are trying to guarantee that the runnable_avg_sum and the runnable_avg_period that you read from the runqueue have not been updated before deciding if that cpu is busy. But there is no reason* why another thread could not update the runnable_avg_sum and then get preempted before updating the runnable_avg_period, which means you will still use the wrong values when doing the "greater than" check in the return statement.

No, I don't want to ensure that values have not been updated before I use them but that both values are coherent.

But that to me and others seems to be false. I've suggested a way that the values might not be coherent and yet your while loop will exit.

...

...
The other point is that even if your code is "broken" (i.e. you read stale values that still get you out of the while loop) the code works but your estimate of a CPU being busy is off by .... 10% ? In other words, having that while loop in is_buddy_busy() does not change the behaviour of the rest of the code. So the objection is on having the while loop and trying to claim guarantees that are not met. Why not removing the loop?

So if it doesn't change the behavior why removing it ? It make the maintenance of patches series more complex.

Because we don't like adding code that doesn't work as claimed and then justify its presence by arguing that even if it's broken it doesn't change behaviour.

Why having that code in the first place?

All we are suggesting here is that the while loop is broken in its assumptions and that you get the same behaviour if you remove it. If you agree then push a patch that removes the while loop and we are going to be likely in favor of the new patch being included. If you really want to have the values coherent then you need a better mechanism for ensuring that.

Best regards, Liviu

...

...
Hope I'm making sense.

my understanding of the scheduler code is rather incomplete, so I'm

expecting to be proven wrong here.

Best regards, Liviu

...
...
According to my test, sched-pack-small-tasks-v1-arm is enough if you want to use packing small task on tc2

I didn't get this one. This is what i have included in my tree.

-- viresh

Vincent Guittot

12:25 p.m.

On 6 December 2012 13:06, Liviu Dudau Liviu.Dudau@arm.com wrote:

...

On Thu, Dec 06, 2012 at 11:50:15AM +0000, Vincent Guittot wrote:

...
On 6 December 2012 11:27, Liviu Dudau Liviu.Dudau@arm.com wrote:

...
On Thu, Dec 06, 2012 at 09:32:11AM +0000, Viresh Kumar wrote:

...
Hi Vincent,

On 6 December 2012 14:50, Vincent Guittot vincent.guittot@linaro.org wrote:

...
It's look like you have disabled packing small task in your v13 with 4a29297

It looks like you are referring to an older version of this branch as the commit id's don't match. Can you please check that again?

...
In addition, I don't fully understand why you absolutely want to revert 39b0e77. In the worst case, it will not fully solve what it should have solved but it doesn't add any regression.

I don't want to, but Liviu and Morten do :) @Liviu/Morten: Your comments

Hi Vincent,

I was just a messenger here, but this is my understanding of where things are:

Morten has reviewed your patch and had provided comments, mainly pointing towards the fact that the guarantees that the code is trying to provide are not met.

You have replied/promised to review and update the patch (is that correct?)

Nothing has happened since in this area (is that true?)

both are true

...
To us, looking at the code (mainly the while loop in the is_buddy_busy() function), is seem that you are trying to guarantee that the runnable_avg_sum and the runnable_avg_period that you read from the runqueue have not been updated before deciding if that cpu is busy. But there is no reason* why another thread could not update the runnable_avg_sum and then get preempted before updating the runnable_avg_period, which means you will still use the wrong values when doing the "greater than" check in the return statement.

No, I don't want to ensure that values have not been updated before I use them but that both values are coherent.

But that to me and others seems to be false. I've suggested a way that the values might not be coherent and yet your while loop will exit.

Your example is not possible because these values are only updated by the cpu which owns the rq and no thread can preempt this sequence.

...

...
...
The other point is that even if your code is "broken" (i.e. you read stale values that still get you out of the while loop) the code works but your estimate of a CPU being busy is off by .... 10% ? In other words, having that while loop in is_buddy_busy() does not change the behaviour of the rest of the code. So the objection is on having the while loop and trying to claim guarantees that are not met. Why not removing the loop?

So if it doesn't change the behavior why removing it ? It make the maintenance of patches series more complex.

Because we don't like adding code that doesn't work as claimed and then justify its presence by arguing that even if it's broken it doesn't change behaviour.

Why having that code in the first place?

All we are suggesting here is that the while loop is broken in its assumptions and that you get the same behaviour if you remove it. If you agree then push a patch that removes the while loop and we are going to be likely in favor of the new patch being included. If you really want to have the values coherent then you need a better mechanism for ensuring that.

Best regards, Liviu

...
...
Hope I'm making sense.

my understanding of the scheduler code is rather incomplete, so I'm

expecting to be proven wrong here.

Best regards, Liviu

...
...
According to my test, sched-pack-small-tasks-v1-arm is enough if you want to use packing small task on tc2

I didn't get this one. This is what i have included in my tree.

-- viresh

--

| I would like to | | fix the world, | | but they're not | | giving me the | \ source code! /
¯\_(ツ)_/¯

Liviu Dudau

1:46 p.m.

On Thu, Dec 06, 2012 at 12:25:21PM +0000, Vincent Guittot wrote:

...

On 6 December 2012 13:06, Liviu Dudau Liviu.Dudau@arm.com wrote:

...
On Thu, Dec 06, 2012 at 11:50:15AM +0000, Vincent Guittot wrote:

...
On 6 December 2012 11:27, Liviu Dudau Liviu.Dudau@arm.com wrote:

...
On Thu, Dec 06, 2012 at 09:32:11AM +0000, Viresh Kumar wrote:

...
Hi Vincent,

On 6 December 2012 14:50, Vincent Guittot vincent.guittot@linaro.org wrote:

...
It's look like you have disabled packing small task in your v13 with 4a29297

It looks like you are referring to an older version of this branch as the commit id's don't match. Can you please check that again?

...
In addition, I don't fully understand why you absolutely want to revert 39b0e77. In the worst case, it will not fully solve what it should have solved but it doesn't add any regression.

I don't want to, but Liviu and Morten do :) @Liviu/Morten: Your comments

Hi Vincent,

I was just a messenger here, but this is my understanding of where things are:

Morten has reviewed your patch and had provided comments, mainly pointing towards the fact that the guarantees that the code is trying to provide are not met.

You have replied/promised to review and update the patch (is that correct?)

Nothing has happened since in this area (is that true?)

both are true

...
To us, looking at the code (mainly the while loop in the is_buddy_busy() function), is seem that you are trying to guarantee that the runnable_avg_sum and the runnable_avg_period that you read from the runqueue have not been updated before deciding if that cpu is busy. But there is no reason* why another thread could not update the runnable_avg_sum and then get preempted before updating the runnable_avg_period, which means you will still use the wrong values when doing the "greater than" check in the return statement.

No, I don't want to ensure that values have not been updated before I use them but that both values are coherent.

But that to me and others seems to be false. I've suggested a way that the values might not be coherent and yet your while loop will exit.

Your example is not possible because these values are only updated by the cpu which owns the rq and no thread can preempt this sequence.

So, you're saying that they are going to be coherent anyway? Because the sequence that updates the values cannot be preempted, that's how I read your statement. In that case, why do you need the while loop (and the whole patch) ?

...

...
...
...
The other point is that even if your code is "broken" (i.e. you read stale values that still get you out of the while loop) the code works but your estimate of a CPU being busy is off by .... 10% ? In other words, having that while loop in is_buddy_busy() does not change the behaviour of the rest of the code. So the objection is on having the while loop and trying to claim guarantees that are not met. Why not removing the loop?

So if it doesn't change the behavior why removing it ? It make the maintenance of patches series more complex.

Because we don't like adding code that doesn't work as claimed and then justify its presence by arguing that even if it's broken it doesn't change behaviour.

Why having that code in the first place?

All we are suggesting here is that the while loop is broken in its assumptions and that you get the same behaviour if you remove it. If you agree then push a patch that removes the while loop and we are going to be likely in favor of the new patch being included. If you really want to have the values coherent then you need a better mechanism for ensuring that.

Best regards, Liviu

...
...
Hope I'm making sense.

my understanding of the scheduler code is rather incomplete, so I'm

expecting to be proven wrong here.

Best regards, Liviu

...
...
According to my test, sched-pack-small-tasks-v1-arm is enough if you want to use packing small task on tc2

I didn't get this one. This is what i have included in my tree.

-- viresh

Vincent Guittot

4:09 p.m.

On 6 December 2012 14:46, Liviu Dudau Liviu.Dudau@arm.com wrote:

...

On Thu, Dec 06, 2012 at 12:25:21PM +0000, Vincent Guittot wrote:

...
On 6 December 2012 13:06, Liviu Dudau Liviu.Dudau@arm.com wrote:

...
On Thu, Dec 06, 2012 at 11:50:15AM +0000, Vincent Guittot wrote:

...
On 6 December 2012 11:27, Liviu Dudau Liviu.Dudau@arm.com wrote:

...
On Thu, Dec 06, 2012 at 09:32:11AM +0000, Viresh Kumar wrote:

...
Hi Vincent,

On 6 December 2012 14:50, Vincent Guittot vincent.guittot@linaro.org wrote: > It's look like you have disabled packing small task in your v13 with 4a29297

It looks like you are referring to an older version of this branch as the commit id's don't match. Can you please check that again?

> In addition, I don't fully understand why you absolutely want to > revert 39b0e77. In the worst case, it will not fully solve what it > should have solved but it doesn't add any regression.

I don't want to, but Liviu and Morten do :) @Liviu/Morten: Your comments

Hi Vincent,

I was just a messenger here, but this is my understanding of where things are:

Morten has reviewed your patch and had provided comments, mainly pointing towards the fact that the guarantees that the code is trying to provide are not met.

You have replied/promised to review and update the patch (is that correct?)

Nothing has happened since in this area (is that true?)

both are true

...
To us, looking at the code (mainly the while loop in the is_buddy_busy() function), is seem that you are trying to guarantee that the runnable_avg_sum and the runnable_avg_period that you read from the runqueue have not been updated before deciding if that cpu is busy. But there is no reason* why another thread could not update the runnable_avg_sum and then get preempted before updating the runnable_avg_period, which means you will still use the wrong values when doing the "greater than" check in the return statement.

No, I don't want to ensure that values have not been updated before I use them but that both values are coherent.

But that to me and others seems to be false. I've suggested a way that the values might not be coherent and yet your while loop will exit.

Your example is not possible because these values are only updated by the cpu which owns the rq and no thread can preempt this sequence.

So, you're saying that they are going to be coherent anyway? Because the sequence that updates the values cannot be preempted, that's how I read your statement. In that case, why do you need the while loop (and the whole patch) ?

Liviu,

I'm just saying that the use case above can't happen because the fields are not updated by thread but by scheduler and it can't be pre-empted by a thread while updating them. We can have the situation where cpu A updates its fields and cpu B reads them simultaneously. This is the use case that i'm trying to address with this patch.

Regards, Vincent

...

...
...
...
...
The other point is that even if your code is "broken" (i.e. you read stale values that still get you out of the while loop) the code works but your estimate of a CPU being busy is off by .... 10% ? In other words, having that while loop in is_buddy_busy() does not change the behaviour of the rest of the code. So the objection is on having the while loop and trying to claim guarantees that are not met. Why not removing the loop?

So if it doesn't change the behavior why removing it ? It make the maintenance of patches series more complex.

Because we don't like adding code that doesn't work as claimed and then justify its presence by arguing that even if it's broken it doesn't change behaviour.

Why having that code in the first place?

All we are suggesting here is that the while loop is broken in its assumptions and that you get the same behaviour if you remove it. If you agree then push a patch that removes the while loop and we are going to be likely in favor of the new patch being included. If you really want to have the values coherent then you need a better mechanism for ensuring that.

Best regards, Liviu

...
...
Hope I'm making sense.

my understanding of the scheduler code is rather incomplete, so I'm

expecting to be proven wrong here.

Best regards, Liviu

...
> According to my test, sched-pack-small-tasks-v1-arm is enough if you > want to use packing small task on tc2

I didn't get this one. This is what i have included in my tree.

-- viresh

--

| I would like to | | fix the world, | | but they're not | | giving me the | \ source code! /
¯\_(ツ)_/¯

Liviu Dudau

4:31 p.m.

[discussion not relevant to this reply removed]

...

Liviu,

I'm just saying that the use case above can't happen because the fields are not updated by thread but by scheduler and it can't be pre-empted by a thread while updating them. We can have the situation where cpu A updates its fields and cpu B reads them simultaneously. This is the use case that i'm trying to address with this patch.

And me and Morten are saying that the compiler and/or system can reorder the reads on CPU B and you have

CPU A (running scheduler) CPU B (running is_buddy_busy) update rq->avg.runnable_avg_sum period = rq->avg.runnable_avg_period new_period = rq->avg.runnable_avg_period update rq->avg.runnable_avg_period sum = rq->avg.runnable_avg_sum new_sum = rq->avg.runnable_avg_sum

while () condition is met and you break out of the loop, using stale value for one of the variables that you then use to decide if the buddy is busy.

Regards, Liviu

...

Regards, Vincent

[rest of the thread discussion removed for clarity]

Vincent Guittot

5:11 p.m.

On 6 December 2012 17:31, Liviu Dudau Liviu.Dudau@arm.com wrote:

...

[discussion not relevant to this reply removed]

...
Liviu,

I'm just saying that the use case above can't happen because the fields are not updated by thread but by scheduler and it can't be pre-empted by a thread while updating them. We can have the situation where cpu A updates its fields and cpu B reads them simultaneously. This is the use case that i'm trying to address with this patch.

And me and Morten are saying that the compiler and/or system can reorder the reads on CPU B and you have

That's a new use case ;-) and the result of the compilation had shown that it was not reordered. But I agree that the sequence need to be more deeply studied to ensure that all use case are solved and that's why I had not pushed any improvement/modification yet.

...

CPU A (running scheduler)            CPU B (running is_buddy_busy)
update rq->avg.runnable_avg_sum      period = rq->avg.runnable_avg_period
                                     new_period = rq->avg.runnable_avg_period
update rq->avg.runnable_avg_period   sum = rq->avg.runnable_avg_sum
                                     new_sum = rq->avg.runnable_avg_sum
while () condition is met and you break out of the loop, using stale value for one of the variables that you then use to decide if the buddy is busy.

Regards, Liviu

...
Regards, Vincent

[rest of the thread discussion removed for clarity]

--

| I would like to | | fix the world, | | but they're not | | giving me the | \ source code! /
¯\_(ツ)_/¯

Amit Kucheria

2:06 p.m.

On Thu, Dec 6, 2012 at 5:36 PM, Liviu Dudau Liviu.Dudau@arm.com wrote:

...

On Thu, Dec 06, 2012 at 11:50:15AM +0000, Vincent Guittot wrote:

...
On 6 December 2012 11:27, Liviu Dudau Liviu.Dudau@arm.com wrote:

...
On Thu, Dec 06, 2012 at 09:32:11AM +0000, Viresh Kumar wrote:

...
Hi Vincent,

On 6 December 2012 14:50, Vincent Guittot vincent.guittot@linaro.org

wrote:

...
...
...
...
It's look like you have disabled packing small task in your v13

with 4a29297

...
...
...
It looks like you are referring to an older version of this branch as the commit id's don't match. Can you please check that again?

...
In addition, I don't fully understand why you absolutely want to revert 39b0e77. In the worst case, it will not fully solve what it should have solved but it doesn't add any regression.

I don't want to, but Liviu and Morten do :) @Liviu/Morten: Your comments

Hi Vincent,

I was just a messenger here, but this is my understanding of where

things are:

...
...

Morten has reviewed your patch and had provided comments, mainly

pointing

...
...
towards the fact that the guarantees that the code is trying to

provide are

...
...
not met.

You have replied/promised to review and update the patch (is that

correct?)

...
...

Nothing has happened since in this area (is that true?)

both are true

...
To us, looking at the code (mainly the while loop in the

is_buddy_busy()

...
...
function), is seem that you are trying to guarantee that the

runnable_avg_sum

...
...
and the runnable_avg_period that you read from the runqueue have not

been

...
...
updated before deciding if that cpu is busy. But there is no reason*

why

...
...
another thread could not update the runnable_avg_sum and then get

preempted

...
...
before updating the runnable_avg_period, which means you will still

use the

...
...
wrong values when doing the "greater than" check in the return

statement.

...
...
No, I don't want to ensure that values have not been updated before I use them but that both values are coherent.

But that to me and others seems to be false. I've suggested a way that the values might not be coherent and yet your while loop will exit.

...
...
The other point is that even if your code is "broken" (i.e. you read

stale

...
...
values that still get you out of the while loop) the code works but your estimate of a CPU being busy is off by .... 10% ? In other words, having that while loop in is_buddy_busy() does not change the

behaviour of

...
...
the rest of the code. So the objection is on having the while loop and trying to claim guarantees that are not met. Why not removing the loop?

So if it doesn't change the behavior why removing it ? It make the maintenance of patches series more complex.

Because we don't like adding code that doesn't work as claimed and then justify its presence by arguing that even if it's broken it doesn't change behaviour.

Why having that code in the first place?

All we are suggesting here is that the while loop is broken in its assumptions and that you get the same behaviour if you remove it. If you agree then push a patch that removes the while loop and we are going to be likely in favor of the new patch being included. If you really want to have the values coherent then you need a better mechanism for ensuring that.

Even if that is the case, I'm afraid I don't quite like the way this was done. IMHO, you shouldn't just revert bits of another author's patches that you don't agree with.

If there are issues regarding the the patches from Vincent, I'd do the following things in order of priority: 1. Prove to him that the race exists, preferably with a reproducible test case 2. Give him a chance to convince you otherwise 3. Share test results that show bad things happen as a result of some code 4. Ask _him_ to separate out that bit from the original patch so you can only pick the bits you like

I haven't seen this happen. All I've seen is one side claim it can happen and the other claim that it can't. *shrug*

Viresh, as an experienced maintainer, I hope you see the value of this approach rather than just pull in the tree.

I realise we're all under pressure here. So let's take a deep breath, step back and do it the right way.

/Amit

Viresh Kumar

2:36 p.m.

On 6 December 2012 19:36, Amit Kucheria amit.kucheria@linaro.org wrote:

...

Even if that is the case, I'm afraid I don't quite like the way this was done. IMHO, you shouldn't just revert bits of another author's patches that you don't agree with.

If there are issues regarding the the patches from Vincent, I'd do the following things in order of priority:

Prove to him that the race exists, preferably with a reproducible test

case 2. Give him a chance to convince you otherwise 3. Share test results that show bad things happen as a result of some code 4. Ask _him_ to separate out that bit from the original patch so you can only pick the bits you like

I haven't seen this happen. All I've seen is one side claim it can happen and the other claim that it can't. *shrug*

Viresh, as an experienced maintainer, I hope you see the value of this approach rather than just pull in the tree.

I realise we're all under pressure here. So let's take a deep breath, step back and do it the right way.

First of all i must admit, i haven't followed the discussion closely, as this part of kernel is still rocket science for me :)

Secondly, what you said is correct Amit. But, i must say there has been a long time since the last time release happened and all this must have been sorted out in that time both from Linaro and ARM side. And i didn't saw any effort on that. Only when i came back to sort out issues in my tree, this issue is still highlighted.

Now, getting so close to release and making a big change, that will eventually affect the core part we are working on is not a great idea.

But we still have some time for a meaningful discussion to happen and one party to agree. Both can't be correct. I need to send the pull request by Monday and so whatever is required to be done, must be done by tomorrow evening.

-- viresh

Vincent Guittot

3:38 p.m.

On 6 December 2012 15:36, Viresh Kumar viresh.kumar@linaro.org wrote:

...

On 6 December 2012 19:36, Amit Kucheria amit.kucheria@linaro.org wrote:

...
Even if that is the case, I'm afraid I don't quite like the way this was done. IMHO, you shouldn't just revert bits of another author's patches that you don't agree with.

If there are issues regarding the the patches from Vincent, I'd do the following things in order of priority:

Prove to him that the race exists, preferably with a reproducible test

case 2. Give him a chance to convince you otherwise 3. Share test results that show bad things happen as a result of some code 4. Ask _him_ to separate out that bit from the original patch so you can only pick the bits you like

I haven't seen this happen. All I've seen is one side claim it can happen and the other claim that it can't. *shrug*

Viresh, as an experienced maintainer, I hope you see the value of this approach rather than just pull in the tree.

I realise we're all under pressure here. So let's take a deep breath, step back and do it the right way.

First of all i must admit, i haven't followed the discussion closely, as this part of kernel is still rocket science for me :)

Secondly, what you said is correct Amit. But, i must say there has been a long time since the last time release happened and all this must have been sorted out in that time both from Linaro and ARM side. And i didn't saw any effort on that. Only when i came back to sort out issues in my tree, this issue is still highlighted.

Now, getting so close to release and making a big change, that will eventually affect the core part we are working on is not a great idea.

IMHO, we should keep sched-pack-small-tasks-v1-fixed instead of sched-pack-small-tasks-v1-arm - packing small task is disable with sched-pack-small-tasks-v1-arm. - sched: secure access to other CPU statistics ensure value coherency. This patch solves perhaps not all use cases and I have to study this point but it solves most of racing accesses for sure.We will just enlarge the breach by removing it

...

But we still have some time for a meaningful discussion to happen and one party to agree. Both can't be correct. I need to send the pull request by Monday and so whatever is required to be done, must be done by tomorrow evening.

-- viresh

Viresh Kumar

19 Dec 19 Dec

6:34 a.m.

Morten/Vincent,

On 6 December 2012 20:06, Viresh Kumar viresh.kumar@linaro.org wrote:

...

First of all i must admit, i haven't followed the discussion closely, as this part of kernel is still rocket science for me :)

Secondly, what you said is correct Amit. But, i must say there has been a long time since the last time release happened and all this must have been sorted out in that time both from Linaro and ARM side. And i didn't saw any effort on that. Only when i came back to sort out issues in my tree, this issue is still highlighted.

Now, getting so close to release and making a big change, that will eventually affect the core part we are working on is not a great idea.

But we still have some time for a meaningful discussion to happen and one party to agree. Both can't be correct. I need to send the pull request by Monday and so whatever is required to be done, must be done by tomorrow

Can we resolve this issue now? I don't want anything during the release period this time. :)

-- viresh

Vincent Guittot

9:23 a.m.

Le 19 déc. 2012 07:34, "Viresh Kumar" viresh.kumar@linaro.org a écrit :

...

Morten/Vincent,

On 6 December 2012 20:06, Viresh Kumar viresh.kumar@linaro.org wrote:

...
First of all i must admit, i haven't followed the discussion closely,

as this

...

...
part of kernel is still rocket science for me :)

Secondly, what you said is correct Amit. But, i must say there has been

...

...
long time since the last time release happened and all this must have been sorted out in that time both from Linaro and ARM side. And i didn't saw any effort on that. Only when i came back to sort out issues in my tree, this issue is still highlighted.

Now, getting so close to release and making a big change, that will

eventually

...

...
affect the core part we are working on is not a great idea.

But we still have some time for a meaningful discussion to happen and one party to agree. Both can't be correct. I need to send the pull

request

...

...
by Monday and so whatever is required to be done, must be done by

tomorrow

...

Can we resolve this issue now? I don't want anything during the release

period

...

this time.

The new version of the patchset should solve the concerns of everybody

...

-- viresh

Viresh Kumar

9:34 a.m.

On 19 December 2012 14:53, Vincent Guittot vincent.guittot@linaro.org wrote:

...

Le 19 déc. 2012 07:34, "Viresh Kumar" viresh.kumar@linaro.org a écrit :

...
Can we resolve this issue now? I don't want anything during the release period this time.

The new version of the patchset should solve the concerns of everybody

Morten,

Can you confirm or cross-check that? Branch is: sched-pack-small-tasks-v2

Morten Rasmussen

10:57 a.m.

On 19/12/12 09:34, Viresh Kumar wrote:

...

On 19 December 2012 14:53, Vincent Guittot vincent.guittot@linaro.org wrote:

...
Le 19 déc. 2012 07:34, "Viresh Kumar" viresh.kumar@linaro.org a écrit :

...
Can we resolve this issue now? I don't want anything during the release period this time.

The new version of the patchset should solve the concerns of everybody

Morten,

Can you confirm or cross-check that? Branch is: sched-pack-small-tasks-v2

If I understand the new version of "sched: secure access to other CPU statistics" correctly, the effect of the patch is:

Without the patch the cpu will appear to be busy if sum/period are not coherent (sum>period). The same is true with the patch except in the case where nr_running is 0. In this particular case the cpu will appear not to be busy. I assume there is good reason why this particular case is important?

In any case the patch is fine by me.

Morten

Vincent Guittot

20 Dec 20 Dec

8:11 a.m.

On 19 December 2012 11:57, Morten Rasmussen morten.rasmussen@arm.com wrote:

...

On 19/12/12 09:34, Viresh Kumar wrote:

...
On 19 December 2012 14:53, Vincent Guittot vincent.guittot@linaro.org wrote:

...
Le 19 déc. 2012 07:34, "Viresh Kumar" viresh.kumar@linaro.org a écrit :

...
Can we resolve this issue now? I don't want anything during the release period this time.

The new version of the patchset should solve the concerns of everybody

Morten,

Can you confirm or cross-check that? Branch is: sched-pack-small-tasks-v2

If I understand the new version of "sched: secure access to other CPU statistics" correctly, the effect of the patch is:

Without the patch the cpu will appear to be busy if sum/period are not coherent (sum>period). The same is true with the patch except in the case where nr_running is 0. In this particular case the cpu will appear not to be busy. I assume there is good reason why this particular case is important?

Sorry for this late reply.

It's not really more important than other but it's one case we can safely detect to prevent spurious spread of tasks. In addition, The incoherency occurs if both value are close so nr_running == 0 was the only condition that left to be tested

...

In any case the patch is fine by me.

Morten

-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

Viresh Kumar

2 Jan 2 Jan

5:28 a.m.

On 20 December 2012 13:41, Vincent Guittot vincent.guittot@linaro.org wrote:

...

On 19 December 2012 11:57, Morten Rasmussen morten.rasmussen@arm.com wrote:

...
If I understand the new version of "sched: secure access to other CPU statistics" correctly, the effect of the patch is:

Without the patch the cpu will appear to be busy if sum/period are not coherent (sum>period). The same is true with the patch except in the case where nr_running is 0. In this particular case the cpu will appear not to be busy. I assume there is good reason why this particular case is important?

Sorry for this late reply.

It's not really more important than other but it's one case we can safely detect to prevent spurious spread of tasks. In addition, The incoherency occurs if both value are close so nr_running == 0 was the only condition that left to be tested

...
In any case the patch is fine by me.

Hmm... I am still confused :(

We have two patches from ARM, do let me know if i can drop these:

commit 3f1dff11ac95eda2772bef577e368bc124bfe087 Author: Morten Rasmussen morten.rasmussen@arm.com Date: Fri Nov 16 18:32:40 2012 +0000

ARM: TC2: Re-enable SD_SHARE_POWERLINE

Re-enable SD_SHARE_POWERLINE to reflect the power domains of TC2.

arch/arm/kernel/topology.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

commit e8cceacd3913e3a3e955614bacc1bc81866bc243 Author: Liviu Dudau Liviu.Dudau@arm.com Date: Fri Nov 16 18:32:38 2012 +0000

Revert "sched: secure access to other CPU statistics"

This reverts commit 2aa14d0379cc54bc0ec44adb7a2e0ad02ae293d0.

The way this functionality is implemented is under review and the current implementation is considered not safe.

Signed-of-by: Liviu Dudau Liviu.Dudau@arm.com

kernel/sched/fair.c | 19 ++----------------- 1 file changed, 2 insertions(+), 17 deletions(-)

Vincent Guittot

10:29 a.m.

On 2 January 2013 06:28, Viresh Kumar viresh.kumar@linaro.org wrote:

...

On 20 December 2012 13:41, Vincent Guittot vincent.guittot@linaro.org wrote:

...
On 19 December 2012 11:57, Morten Rasmussen morten.rasmussen@arm.com wrote:

...
If I understand the new version of "sched: secure access to other CPU statistics" correctly, the effect of the patch is:

Without the patch the cpu will appear to be busy if sum/period are not coherent (sum>period). The same is true with the patch except in the case where nr_running is 0. In this particular case the cpu will appear not to be busy. I assume there is good reason why this particular case is important?

Sorry for this late reply.

It's not really more important than other but it's one case we can safely detect to prevent spurious spread of tasks. In addition, The incoherency occurs if both value are close so nr_running == 0 was the only condition that left to be tested

...
In any case the patch is fine by me.

Hmm... I am still confused :(

We have two patches from ARM, do let me know if i can drop these:

I think you can drop them as they don't apply anymore for V2. Morten, do you confirm ?

Vincent

...

commit 3f1dff11ac95eda2772bef577e368bc124bfe087 Author: Morten Rasmussen morten.rasmussen@arm.com Date: Fri Nov 16 18:32:40 2012 +0000
ARM: TC2: Re-enable SD_SHARE_POWERLINE

Re-enable SD_SHARE_POWERLINE to reflect the power domains of TC2.
arch/arm/kernel/topology.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

commit e8cceacd3913e3a3e955614bacc1bc81866bc243 Author: Liviu Dudau Liviu.Dudau@arm.com Date: Fri Nov 16 18:32:38 2012 +0000
Revert "sched: secure access to other CPU statistics"

This reverts commit 2aa14d0379cc54bc0ec44adb7a2e0ad02ae293d0.

The way this functionality is implemented is under review and the
current implementation is considered not safe.
Signed-of-by: Liviu Dudau <Liviu.Dudau@arm.com>
kernel/sched/fair.c | 19 ++----------------- 1 file changed, 2 insertions(+), 17 deletions(-)

Morten Rasmussen

5:10 p.m.

On 02/01/13 10:29, Vincent Guittot wrote:

...

On 2 January 2013 06:28, Viresh Kumar viresh.kumar@linaro.org wrote:

...
On 20 December 2012 13:41, Vincent Guittot vincent.guittot@linaro.org wrote:

...
On 19 December 2012 11:57, Morten Rasmussen morten.rasmussen@arm.com wrote:

...
If I understand the new version of "sched: secure access to other CPU statistics" correctly, the effect of the patch is:

Without the patch the cpu will appear to be busy if sum/period are not coherent (sum>period). The same is true with the patch except in the case where nr_running is 0. In this particular case the cpu will appear not to be busy. I assume there is good reason why this particular case is important?

Sorry for this late reply.

It's not really more important than other but it's one case we can safely detect to prevent spurious spread of tasks. In addition, The incoherency occurs if both value are close so nr_running == 0 was the only condition that left to be tested

...
In any case the patch is fine by me.

Hmm... I am still confused :(

We have two patches from ARM, do let me know if i can drop these:

I think you can drop them as they don't apply anymore for V2. Morten, do you confirm ?

Confirmed. I don't see any problems with the v2 patch. The overhead of the check should be minimal.

Morten

...

Vincent

...
commit 3f1dff11ac95eda2772bef577e368bc124bfe087 Author: Morten Rasmussen morten.rasmussen@arm.com Date: Fri Nov 16 18:32:40 2012 +0000
 ARM: TC2: Re-enable SD_SHARE_POWERLINE

 Re-enable SD_SHARE_POWERLINE to reflect the power domains of TC2.
arch/arm/kernel/topology.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

commit e8cceacd3913e3a3e955614bacc1bc81866bc243 Author: Liviu Dudau Liviu.Dudau@arm.com Date: Fri Nov 16 18:32:38 2012 +0000
 Revert "sched: secure access to other CPU statistics"

 This reverts commit 2aa14d0379cc54bc0ec44adb7a2e0ad02ae293d0.

 The way this functionality is implemented is under review and the
current implementation is considered not safe.
 Signed-of-by: Liviu Dudau <Liviu.Dudau@arm.com>
kernel/sched/fair.c | 19 ++----------------- 1 file changed, 2 insertions(+), 17 deletions(-)

Liviu Dudau

6 Dec 6 Dec

3:21 p.m.

On Thu, Dec 06, 2012 at 02:06:30PM +0000, Amit Kucheria wrote:

...

On Thu, Dec 6, 2012 at 5:36 PM, Liviu Dudau <Liviu.Dudau@arm.commailto:Liviu.Dudau@arm.com> wrote: On Thu, Dec 06, 2012 at 11:50:15AM +0000, Vincent Guittot wrote:

...
On 6 December 2012 11:27, Liviu Dudau <Liviu.Dudau@arm.commailto:Liviu.Dudau@arm.com> wrote:

...
On Thu, Dec 06, 2012 at 09:32:11AM +0000, Viresh Kumar wrote:

...
Hi Vincent,

On 6 December 2012 14:50, Vincent Guittot <vincent.guittot@linaro.orgmailto:vincent.guittot@linaro.org> wrote:

...
It's look like you have disabled packing small task in your v13 with 4a29297

It looks like you are referring to an older version of this branch as the commit id's don't match. Can you please check that again?

...
In addition, I don't fully understand why you absolutely want to revert 39b0e77. In the worst case, it will not fully solve what it should have solved but it doesn't add any regression.

I don't want to, but Liviu and Morten do :) @Liviu/Morten: Your comments

Hi Vincent,

I was just a messenger here, but this is my understanding of where things are:

Morten has reviewed your patch and had provided comments, mainly pointing towards the fact that the guarantees that the code is trying to provide are not met.

You have replied/promised to review and update the patch (is that correct?)

Nothing has happened since in this area (is that true?)

both are true

...
To us, looking at the code (mainly the while loop in the is_buddy_busy() function), is seem that you are trying to guarantee that the runnable_avg_sum and the runnable_avg_period that you read from the runqueue have not been updated before deciding if that cpu is busy. But there is no reason* why another thread could not update the runnable_avg_sum and then get preempted before updating the runnable_avg_period, which means you will still use the wrong values when doing the "greater than" check in the return statement.

No, I don't want to ensure that values have not been updated before I use them but that both values are coherent.

But that to me and others seems to be false. I've suggested a way that the values might not be coherent and yet your while loop will exit.

...
...
The other point is that even if your code is "broken" (i.e. you read stale values that still get you out of the while loop) the code works but your estimate of a CPU being busy is off by .... 10% ? In other words, having that while loop in is_buddy_busy() does not change the behaviour of the rest of the code. So the objection is on having the while loop and trying to claim guarantees that are not met. Why not removing the loop?

So if it doesn't change the behavior why removing it ? It make the maintenance of patches series more complex.

Because we don't like adding code that doesn't work as claimed and then justify its presence by arguing that even if it's broken it doesn't change behaviour.

Why having that code in the first place?

All we are suggesting here is that the while loop is broken in its assumptions and that you get the same behaviour if you remove it. If you agree then push a patch that removes the while loop and we are going to be likely in favor of the new patch being included. If you really want to have the values coherent then you need a better mechanism for ensuring that.

Even if that is the case, I'm afraid I don't quite like the way this was done. IMHO, you shouldn't just revert bits of another author's patches that you don't agree with.

Hi Amit,

We didn't reverted bits, we have just asked that one of the patches from the series not to be included into the final pull request as it was controversial. It did not change the functionality of the series.

...

If there are issues regarding the the patches from Vincent, I'd do the following things in order of priority:

Prove to him that the race exists, preferably with a reproducible test case

We did. Morten and Olivier sent emails to Vincent.

...

Give him a chance to convince you otherwise

We did. Initial comments were sent about a month ago.

...

Share test results that show bad things happen as a result of some code

N/A. We are claiming that the code works to the same outcome on both cases. It's just that the guarantees that the code is trying to make don't hold water.

...

Ask _him_ to separate out that bit from the original patch so you can only pick the bits you like

We are not the ones picking up the bits. It was Viresh. And we have told Viresh that we have comments on the one patch in the series that make it unsuitable for that release. One suggestion that we made was to revert that particular patch. Now Viresh was free to choose a different path to reach the outcome, like asking Vincent to split the patch out of his series. But I guess it was too late in the game for that and it was more expedient to do the revert.

...

I haven't seen this happen. All I've seen is one side claim it can happen and the other claim that it can't. *shrug*

Not seeing the whole picture then?

...

Viresh, as an experienced maintainer, I hope you see the value of this approach rather than just pull in the tree.

I realise we're all under pressure here. So let's take a deep breath, step back and do it the right way.

I'm really deploring the fact that we are turning this into a big issue when it is not. We are only nitpicking. If you want to go ahead and include the code just because it make patch management easier, go ahead. Our comments were on the technical side.

Regards, Liviu

...

/Amit

4593

days inactive

4640

days old

linaro-dev@lists.linaro.org

48 comments

participants

tags (0)

participants (5)

Amit Kucheria
Liviu Dudau
Morten Rasmussen
Vincent Guittot
Viresh Kumar