[RFC 0/6] sched: packing small tasks

List overview All Threads
Download

newer

older

[PATCH] dt: add helper function to...

[GIT PULL]; big LITTLE MP master...

Vincent Guittot

7 Oct 2012 7 Oct '12

7:43 a.m.

Hi,

This patch-set takes advantage of the new statistics that are going to be available in the kernel thanks to the per-entity load-tracking: http://thread.gmane.org/gmane.linux.kernel/1348522. It packs the small tasks in as few as possible CPU/Cluster/Core. The main goal of packing small tasks is to reduce the power consumption by minimizing the number of power domain that are used. The packing is done in 2 steps:

The 1st step looks for the best place to pack tasks on a system according to its topology and it defines a pack buddy CPU for each CPU if there is one available. The policy for setting a pack buddy CPU is that we pack at all levels where the power line is not shared by groups of CPUs. For describing this capability, a new flag has been introduced SD_SHARE_POWERLINE that is used to describe where CPUs of a scheduling domain are sharing their power rails. This flag has been set in all sched_domain in order to keep unchanged the default behaviour of the scheduler.

In a 2nd step, the scheduler checks the load level of the task which wakes up and the business of the buddy CPU. Then, It can decide to migrate the task on the buddy.

The patch-set has been tested on ARM platforms: quad CA-9 SMP and TC2 HMP (dual CA-15 and 3xCA-7 cluster). For ARM platform, the results have demonstrated that it's worth packing small tasks at all topology levels.

The performance tests have been done on both platforms with sysbench. The results don't show any performance regressions. These results are aligned with the policy which uses the normal behavior with heavy use cases.

test: sysbench --test=cpu --num-threads=N --max-requests=R run

Results below is the average duration of 3 tests on the quad CA-9. default is the current scheduler behavior (pack buddy CPU is -1) pack is the scheduler with the pack mecanism

| default | pack | ----------------------------------- N=8; R=200 | 3.1999 | 3.1921 | N=8; R=2000 | 31.4939 | 31.4844 | N=12; R=200 | 3.2043 | 3.2084 | N=12; R=2000 | 31.4897 | 31.4831 | N=16; R=200 | 3.1774 | 3.1824 | N=16; R=2000 | 31.4899 | 31.4897 | -----------------------------------

The power consumption tests have been done only on TC2 platform which has got accessible power lines and I have used cyclictest to simulate small tasks. The tests show some power consumption improvements.

test: cyclictest -t 8 -q -e 1000000 -D 20 & cyclictest -t 8 -q -e 1000000 -D 20

The measurements have been done during 16 seconds and the result has been normalized to 100

| CA15 | CA7 | total | ------------------------------------- default | 100 | 40 | 140 | pack | <1 | 45 | <46 | -------------------------------------

The A15 cluster is less power efficient than the A7 cluster but if we assume that the tasks is well spread on both clusters, we can guest estimate that the power consumption on a dual cluster of CA7 would have been for a default kernel:

| CA7 | CA7 | total | ------------------------------------- default | 40 | 40 | 80 | -------------------------------------

Vincent Guittot (6): Revert "sched: introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" sched: add a new SD SHARE_POWERLINE flag for sched_domain sched: pack small task at wakeup sched: secure access to other CPU statistics sched: pack the idle load balance ARM: sched: clear SD_SHARE_POWERLINE

-- 1.7.9.5

Show replies by date

Vincent Guittot

7 Oct 7 Oct

7:43 a.m.

New subject: [RFC 1/6] Revert "sched: introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"

This reverts commit f0494c69d387db9496b6f9aa71b3bc49f1dcf474.

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- include/linux/sched.h | 8 +------- kernel/sched/core.c | 7 +------ kernel/sched/fair.c | 13 ++----------- kernel/sched/sched.h | 9 +-------- 4 files changed, 5 insertions(+), 32 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h index f45da5f..4786b20 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1214,13 +1214,7 @@ struct sched_entity { /* rq "owned" by this entity/group: */ struct cfs_rq *my_q; #endif -/* - * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be - * removed when useful for applications beyond shares distribution (e.g. - * load-balance). - */ -#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED) - /* Per-entity load-tracking */ +#ifdef CONFIG_SMP struct sched_avg avg; #endif }; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index b81915c..d50fbac 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1715,12 +1715,7 @@ static void __sched_fork(struct task_struct *p) p->se.vruntime = 0; INIT_LIST_HEAD(&p->se.group_node);

-/* - * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be - * removed when useful for applications beyond shares distribution (e.g. - * load-balance). - */ -#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED) +#ifdef CONFIG_SMP p->se.avg.runnable_avg_period = 0; p->se.avg.runnable_avg_sum = 0; #endif diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 095d86c..4f4a4f6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -877,8 +877,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq) } #endif /* CONFIG_FAIR_GROUP_SCHED */

-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */ -#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED) +#ifdef CONFIG_SMP /* * We choose a half-life close to 1 scheduling period. * Note: The tables below are dependent on this value. @@ -3204,12 +3203,6 @@ unlock: }

/* - * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be - * removed when useful for applications beyond shares distribution (e.g. - * load-balance). - */ -#ifdef CONFIG_FAIR_GROUP_SCHED -/* * Called immediately before a task is migrated to a new cpu; task_cpu(p) and * cfs_rq_of(p) references at time of call are still valid and identify the * previous cpu. However, the caller only guarantees p->pi_lock is held; no @@ -3232,7 +3225,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu) atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load); } } -#endif #endif /* CONFIG_SMP */

static unsigned long @@ -5812,9 +5804,8 @@ const struct sched_class fair_sched_class = {

#ifdef CONFIG_SMP .select_task_rq = select_task_rq_fair, -#ifdef CONFIG_FAIR_GROUP_SCHED .migrate_task_rq = migrate_task_rq_fair, -#endif + .rq_online = rq_online_fair, .rq_offline = rq_offline_fair,

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 81135f9..a95d5c1 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -225,12 +225,6 @@ struct cfs_rq { #endif

#ifdef CONFIG_SMP -/* - * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be - * removed when useful for applications beyond shares distribution (e.g. - * load-balance). - */ -#ifdef CONFIG_FAIR_GROUP_SCHED /* * CFS Load tracking * Under CFS, load is tracked on a per-entity basis and aggregated up. @@ -240,8 +234,7 @@ struct cfs_rq { u64 runnable_load_avg, blocked_load_avg; atomic64_t decay_counter, removed_load; u64 last_decay; -#endif /* CONFIG_FAIR_GROUP_SCHED */ -/* These always depend on CONFIG_FAIR_GROUP_SCHED */ + #ifdef CONFIG_FAIR_GROUP_SCHED u32 tg_runnable_contrib, tg_usage_contrib; u64 tg_load_contrib;

-- 1.7.9.5

Vincent Guittot

7:43 a.m.

New subject: [RFC 2/6] sched: add a new SD SHARE_POWERLINE flag for sched_domain

This new flag SD SHARE_POWERLINE reflects the sharing of the power rail between the members of a domain. As this is the current assumption of the scheduler, the flag is added to all sched_domain

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- arch/ia64/include/asm/topology.h | 1 + arch/tile/include/asm/topology.h | 1 + include/linux/sched.h | 1 + include/linux/topology.h | 3 +++ kernel/sched/core.c | 5 +++++ 5 files changed, 11 insertions(+)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e4..065c720 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -65,6 +65,7 @@ void build_cpu_to_node_map(void); | SD_BALANCE_EXEC \ | SD_BALANCE_FORK \ | SD_WAKE_AFFINE, \ + | arch_sd_share_power_line() \ .last_balance = jiffies, \ .balance_interval = 1, \ .nr_balance_failed = 0, \ diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index 7a7ce39..d39ed0b 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -72,6 +72,7 @@ static inline const struct cpumask *cpumask_of_node(int node) | 0*SD_PREFER_LOCAL \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \ + | arch_sd_share_power_line() \ | 0*SD_SERIALIZE \ , \ .last_balance = jiffies, \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 4786b20..74f2daf 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -862,6 +862,7 @@ enum cpu_idle_type { #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ #define SD_PREFER_LOCAL 0x0040 /* Prefer to keep tasks local to this domain */ #define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */ +#define SD_SHARE_POWERLINE 0x0100 /* Domain members share power domain */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ #define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */ #define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */ diff --git a/include/linux/topology.h b/include/linux/topology.h index fec12d6..20964ab 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -99,6 +99,7 @@ int arch_update_cpu_topology(void); | 1*SD_WAKE_AFFINE \ | 1*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \ + | arch_sd_share_power_line() \ | 0*SD_SERIALIZE \ | 0*SD_PREFER_SIBLING \ | arch_sd_sibling_asym_packing() \ @@ -132,6 +133,7 @@ int arch_update_cpu_topology(void); | 0*SD_PREFER_LOCAL \ | 0*SD_SHARE_CPUPOWER \ | 1*SD_SHARE_PKG_RESOURCES \ + | arch_sd_share_power_line() \ | 0*SD_SERIALIZE \ , \ .last_balance = jiffies, \ @@ -163,6 +165,7 @@ int arch_update_cpu_topology(void); | 0*SD_PREFER_LOCAL \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \ + | arch_sd_share_power_line() \ | 0*SD_SERIALIZE \ | 1*SD_PREFER_SIBLING \ , \ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d50fbac..dab7908 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6407,6 +6407,11 @@ int __weak arch_sd_sibling_asym_packing(void) return 0*SD_ASYM_PACKING; }

+int __weak arch_sd_share_power_line(void) +{ + return 1*SD_SHARE_POWERLINE; +} + /* * Initializers for schedule domains * Non-inlined to reduce accumulated stack pressure in build_sched_domains()

-- 1.7.9.5

Santosh Shilimkar

24 Oct 24 Oct

3:17 p.m.

New subject: [RFC 2/6] sched: add a new SD SHARE_POWERLINE flag for sched_domain

Vincent,

Few comments/questions.

On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...

This new flag SD SHARE_POWERLINE reflects the sharing of the power rail between the members of a domain. As this is the current assumption of the scheduler, the flag is added to all sched_domain

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

arch/ia64/include/asm/topology.h | 1 + arch/tile/include/asm/topology.h | 1 + include/linux/sched.h | 1 + include/linux/topology.h | 3 +++ kernel/sched/core.c | 5 +++++ 5 files changed, 11 insertions(+)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e4..065c720 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -65,6 +65,7 @@ void build_cpu_to_node_map(void); | SD_BALANCE_EXEC \ | SD_BALANCE_FORK \ | SD_WAKE_AFFINE, \
		| arch_sd_share_power_line()		\
.last_balance = jiffies, \ .balance_interval = 1, \ .nr_balance_failed = 0, \
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index 7a7ce39..d39ed0b 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -72,6 +72,7 @@ static inline const struct cpumask *cpumask_of_node(int node) | 0*SD_PREFER_LOCAL \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \
		| arch_sd_share_power_line()		\
	| 0*SD_SERIALIZE			\
	,					\
.last_balance = jiffies, \
diff --git a/include/linux/sched.h b/include/linux/sched.h index 4786b20..74f2daf 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -862,6 +862,7 @@ enum cpu_idle_type { #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ #define SD_PREFER_LOCAL 0x0040 /* Prefer to keep tasks local to this domain */ #define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */ +#define SD_SHARE_POWERLINE 0x0100 /* Domain members share power domain */

If you ignore the current use of SD_SHARE_CPUPOWER, isn't the meaning of CPUPOWER and POWERLINE is same here. Just trying to understand the clear meaning of this new flag. Have you not considered SD_SHARE_CPUPOWER because it is being used for cpu_power and needs at least minimum two domains ? SD_PACKING would have been probably more appropriate based on the way it is being used in further series.

Regards Santosh

Vincent Guittot

29 Oct 29 Oct

9:40 a.m.

New subject: [RFC 2/6] sched: add a new SD SHARE_POWERLINE flag for sched_domain

On 24 October 2012 17:17, Santosh Shilimkar santosh.shilimkar@ti.com wrote:

...

Vincent,

Few comments/questions.

On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...
This new flag SD SHARE_POWERLINE reflects the sharing of the power rail between the members of a domain. As this is the current assumption of the scheduler, the flag is added to all sched_domain

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

arch/ia64/include/asm/topology.h | 1 + arch/tile/include/asm/topology.h | 1 + include/linux/sched.h | 1 + include/linux/topology.h | 3 +++ kernel/sched/core.c | 5 +++++ 5 files changed, 11 insertions(+)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e4..065c720 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -65,6 +65,7 @@ void build_cpu_to_node_map(void); | SD_BALANCE_EXEC \ | SD_BALANCE_FORK \ | SD_WAKE_AFFINE, \
                          | arch_sd_share_power_line()            \
  .last_balance           = jiffies,              \
  .balance_interval       = 1,                    \
  .nr_balance_failed      = 0,                    \
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index 7a7ce39..d39ed0b 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -72,6 +72,7 @@ static inline const struct cpumask *cpumask_of_node(int node) | 0*SD_PREFER_LOCAL \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \
                          | arch_sd_share_power_line()            \
                          | 0*SD_SERIALIZE                        \
                          ,                                       \
  .last_balance           = jiffies,                              \
diff --git a/include/linux/sched.h b/include/linux/sched.h index 4786b20..74f2daf 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -862,6 +862,7 @@ enum cpu_idle_type { #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ #define SD_PREFER_LOCAL 0x0040 /* Prefer to keep tasks local to this domain */ #define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */ +#define SD_SHARE_POWERLINE 0x0100 /* Domain members share power domain */
If you ignore the current use of SD_SHARE_CPUPOWER, isn't the meaning of CPUPOWER and POWERLINE is same here. Just trying to understand the clear meaning of this new flag. Have you not considered SD_SHARE_CPUPOWER because it is being used for cpu_power and needs at least minimum two domains ? SD_PACKING would have been probably more appropriate based on the way it is being used in further series.

CPUPOWER reflects the share of hw ressources between cores like for hyper threading. POWERLINE describes the fact that cores are sharing the same power line amore precisely the powergate.

...

Regards Santosh

Vincent Guittot

9:50 a.m.

New subject: [RFC 2/6] sched: add a new SD SHARE_POWERLINE flag for sched_domain

It looks like i need to describe more what

On 29 October 2012 10:40, Vincent Guittot vincent.guittot@linaro.org wrote:

...

On 24 October 2012 17:17, Santosh Shilimkar santosh.shilimkar@ti.com wrote:

...
Vincent,

Few comments/questions.

On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...
This new flag SD SHARE_POWERLINE reflects the sharing of the power rail between the members of a domain. As this is the current assumption of the scheduler, the flag is added to all sched_domain

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

arch/ia64/include/asm/topology.h | 1 + arch/tile/include/asm/topology.h | 1 + include/linux/sched.h | 1 + include/linux/topology.h | 3 +++ kernel/sched/core.c | 5 +++++ 5 files changed, 11 insertions(+)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e4..065c720 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -65,6 +65,7 @@ void build_cpu_to_node_map(void); | SD_BALANCE_EXEC \ | SD_BALANCE_FORK \ | SD_WAKE_AFFINE, \
                          | arch_sd_share_power_line()            \
  .last_balance           = jiffies,              \
  .balance_interval       = 1,                    \
  .nr_balance_failed      = 0,                    \
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index 7a7ce39..d39ed0b 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -72,6 +72,7 @@ static inline const struct cpumask *cpumask_of_node(int node) | 0*SD_PREFER_LOCAL \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \
                          | arch_sd_share_power_line()            \
                          | 0*SD_SERIALIZE                        \
                          ,                                       \
  .last_balance           = jiffies,                              \
diff --git a/include/linux/sched.h b/include/linux/sched.h index 4786b20..74f2daf 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -862,6 +862,7 @@ enum cpu_idle_type { #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ #define SD_PREFER_LOCAL 0x0040 /* Prefer to keep tasks local to this domain */ #define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */ +#define SD_SHARE_POWERLINE 0x0100 /* Domain members share power domain */
If you ignore the current use of SD_SHARE_CPUPOWER, isn't the meaning of CPUPOWER and POWERLINE is same here. Just trying to understand the clear meaning of this new flag. Have you not considered SD_SHARE_CPUPOWER because it is being used for cpu_power and needs at least minimum two domains ? SD_PACKING would have been probably more appropriate based on the way it is being used in further series.
CPUPOWER reflects the share of hw ressources between cores like for hyper threading. POWERLINE describes the fact that cores are sharing the same power line amore precisely the powergate.

Sorry, the mail has been sent too early while I was writing it

CPUPOWER reflects the share of hw ressources between cores like for hyper threading. POWERLINE describes the fact that cores are sharing the same power line and more precisely the same power gating. It looks like I need to describe more precisely what i would mean with SHARE_POWERLINE.

I don't want to use PACKING because it's more a behavior than a feature. If cores can power gate independently (!SD_SHARE_POWERLINE), packing small tasks is one interesting behavior but it may be not the only one. I want to make a difference between the HW configuration and the behavior we want to have above it

Vincent

...

...
Regards Santosh

Santosh Shilimkar

2 Nov 2 Nov

10:27 a.m.

New subject: [RFC 2/6] sched: add a new SD SHARE_POWERLINE flag for sched_domain

On Monday 29 October 2012 03:20 PM, Vincent Guittot wrote:

...

It looks like i need to describe more what

On 29 October 2012 10:40, Vincent Guittot vincent.guittot@linaro.org wrote:

...
On 24 October 2012 17:17, Santosh Shilimkar santosh.shilimkar@ti.com wrote:

...
Vincent,

Few comments/questions.

On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...
This new flag SD SHARE_POWERLINE reflects the sharing of the power rail between the members of a domain. As this is the current assumption of the scheduler, the flag is added to all sched_domain

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

arch/ia64/include/asm/topology.h | 1 + arch/tile/include/asm/topology.h | 1 + include/linux/sched.h | 1 + include/linux/topology.h | 3 +++ kernel/sched/core.c | 5 +++++ 5 files changed, 11 insertions(+)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h index a2496e4..065c720 100644 --- a/arch/ia64/include/asm/topology.h +++ b/arch/ia64/include/asm/topology.h @@ -65,6 +65,7 @@ void build_cpu_to_node_map(void); | SD_BALANCE_EXEC \ | SD_BALANCE_FORK \ | SD_WAKE_AFFINE, \
                          | arch_sd_share_power_line()            \
   .last_balance           = jiffies,              \
   .balance_interval       = 1,                    \
   .nr_balance_failed      = 0,                    \
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h index 7a7ce39..d39ed0b 100644 --- a/arch/tile/include/asm/topology.h +++ b/arch/tile/include/asm/topology.h @@ -72,6 +72,7 @@ static inline const struct cpumask *cpumask_of_node(int node) | 0*SD_PREFER_LOCAL \ | 0*SD_SHARE_CPUPOWER \ | 0*SD_SHARE_PKG_RESOURCES \
                          | arch_sd_share_power_line()            \
                           | 0*SD_SERIALIZE                        \
                           ,                                       \
   .last_balance           = jiffies,                              \
diff --git a/include/linux/sched.h b/include/linux/sched.h index 4786b20..74f2daf 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -862,6 +862,7 @@ enum cpu_idle_type { #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ #define SD_PREFER_LOCAL 0x0040 /* Prefer to keep tasks local to this domain */ #define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */ +#define SD_SHARE_POWERLINE 0x0100 /* Domain members share power domain */
If you ignore the current use of SD_SHARE_CPUPOWER, isn't the meaning of CPUPOWER and POWERLINE is same here. Just trying to understand the clear meaning of this new flag. Have you not considered SD_SHARE_CPUPOWER because it is being used for cpu_power and needs at least minimum two domains ? SD_PACKING would have been probably more appropriate based on the way it is being used in further series.
CPUPOWER reflects the share of hw ressources between cores like for hyper threading. POWERLINE describes the fact that cores are sharing the same power line amore precisely the powergate.
Sorry, the mail has been sent too early while I was writing it

CPUPOWER reflects the share of hw ressources between cores like for hyper threading. POWERLINE describes the fact that cores are sharing the same power line and more precisely the same power gating. It looks like I need to describe more precisely what i would mean with SHARE_POWERLINE.

Yes. More description will help. I see bit of overlap POWERLINE flag with SD_SHARE_CPUPOWER and SD_SHARE_PKG_RESOURCES and hence the questions.

...

I don't want to use PACKING because it's more a behavior than a feature. If cores can power gate independently (!SD_SHARE_POWERLINE), packing small tasks is one interesting behavior but it may be not the only one. I want to make a difference between the HW configuration and the behavior we want to have above it

Fair enough. Thanks for clarification.

Regards, Santosh

Vincent Guittot

7 Oct 7 Oct

7:43 a.m.

New subject: [RFC 3/6] sched: pack small tasks

During sched_domain creation, we define a pack buddy CPU if available.

On a system that share the powerline at all level, the buddy is set to -1

On a dual clusters / dual cores system which can powergate each core and cluster independantly, the buddy configuration will be : | CPU0 | CPU1 | CPU2 | CPU3 | ----------------------------------- buddy | CPU0 | CPU0 | CPU0 | CPU2 |

Small tasks tend to slip out of the periodic load balance. The best place to choose to migrate them is at their wake up.

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/core.c | 1 + kernel/sched/fair.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 1 + 3 files changed, 111 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dab7908..70cadbe 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) rcu_assign_pointer(rq->sd, sd); destroy_sched_domains(tmp, cpu);

+ update_packing_domain(cpu); update_top_cache_domain(cpu); }

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4f4a4f6..8c9d3ed 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -157,6 +157,63 @@ void sched_init_granularity(void) update_sysctl(); }

+ +/* + * Save the id of the optimal CPU that should be used to pack small tasks + * The value -1 is used when no buddy has been found + */ +DEFINE_PER_CPU(int, sd_pack_buddy); + +/* Look for the best buddy CPU that can be used to pack small tasks + * We make the assumption that it doesn't wort to pack on CPU that share the + * same powerline. We looks for the 1st sched_domain without the + * SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the lowest + * power per core based on the assumption that their power efficiency is + * better */ +void update_packing_domain(int cpu) +{ + struct sched_domain *sd; + int id = -1; + + sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE); + if (!sd) + sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); + else + sd = sd->parent; + + while (sd) { + struct sched_group *sg = sd->groups; + struct sched_group *pack = sg; + struct sched_group *tmp = sg->next; + + /* 1st CPU of the sched domain is a good candidate */ + if (id == -1) + id = cpumask_first(sched_domain_span(sd)); + + /* loop the sched groups to find the best one */ + while (tmp != sg) { + if (tmp->sgp->power * sg->group_weight < + sg->sgp->power * tmp->group_weight) + pack = tmp; + tmp = tmp->next; + } + + /* we have found a better group */ + if (pack != sg) + id = cpumask_first(sched_group_cpus(pack)); + + /* Look for another CPU than itself */ + if ((id != cpu) + || ((sd->parent) && !(sd->parent->flags && SD_LOAD_BALANCE))) + break; + + sd = sd->parent; + } + + pr_info(KERN_INFO "CPU%d packing on CPU%d\n", cpu, id); + per_cpu(sd_pack_buddy, cpu) = id; +} + #if BITS_PER_LONG == 32 # define WMULT_CONST (~0UL) #else @@ -3073,6 +3130,55 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }

+static inline bool is_buddy_busy(int cpu) +{ + struct rq *rq = cpu_rq(cpu); + + /* + * A busy buddy is a CPU with a high load or a small load with a lot of + * running tasks. + */ + return ((rq->avg.usage_avg_sum << rq->nr_running) > + rq->avg.runnable_avg_period); +} + +static inline bool is_light_task(struct task_struct *p) +{ + /* A light task runs less than 25% in average */ + return ((p->se.avg.usage_avg_sum << 2) < p->se.avg.runnable_avg_period); +} + +static int check_pack_buddy(int cpu, struct task_struct *p) +{ + int buddy = per_cpu(sd_pack_buddy, cpu); + + /* No pack buddy for this CPU */ + if (buddy == -1) + return false; + + /* + * If a task is waiting for running on the CPU which is its own buddy, + * let the default behavior to look for a better CPU if available + * The threshold has been set to 37.5% + */ + if ((buddy == cpu) + && ((p->se.avg.usage_avg_sum << 3) < (p->se.avg.runnable_avg_sum * 5))) + return false; + + /* buddy is not an allowed CPU */ + if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p))) + return false; + + /* + * If the task is a small one and the buddy is not overloaded, + * we use buddy cpu + */ + if (!is_light_task(p) || is_buddy_busy(buddy)) + return false; + + return true; +} + /* * sched_balance_self: balance the current task (running on cpu) in domains * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and @@ -3098,6 +3204,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) if (p->nr_cpus_allowed == 1) return prev_cpu;

+ if (check_pack_buddy(cpu, p)) + return per_cpu(sd_pack_buddy, cpu); + if (sd_flag & SD_BALANCE_WAKE) { if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) want_affine = 1; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index a95d5c1..086d8bf 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -875,6 +875,7 @@ static inline void idle_balance(int cpu, struct rq *rq)

extern void sysrq_sched_debug_show(void); extern void sched_init_granularity(void); +extern void update_packing_domain(int cpu); extern void update_max_interval(void); extern void update_group_power(struct sched_domain *sd, int cpu); extern int update_runtime(struct notifier_block *nfb, unsigned long action, void *hcpu);

-- 1.7.9.5

Santosh Shilimkar

24 Oct 24 Oct

3:20 p.m.

New subject: [RFC 3/6] sched: pack small tasks

Vincent,

Few comments/questions.

On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...

During sched_domain creation, we define a pack buddy CPU if available.

On a system that share the powerline at all level, the buddy is set to -1

On a dual clusters / dual cores system which can powergate each core and cluster independantly, the buddy configuration will be : | CPU0 | CPU1 | CPU2 | CPU3 |

buddy | CPU0 | CPU0 | CPU0 | CPU2 |

^ Is that a typo ? Should it be CPU2 instead of CPU0 ?

...

Small tasks tend to slip out of the periodic load balance. The best place to choose to migrate them is at their wake up.

I have tried this series since I was looking at some of these packing bits. On Mobile workloads like OSIdle with Screen ON, MP3, gallary, I did see some additional filtering of threads with this series but its not making much difference in power. More on this below.

...

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

kernel/sched/core.c | 1 + kernel/sched/fair.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 1 + 3 files changed, 111 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dab7908..70cadbe 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) rcu_assign_pointer(rq->sd, sd); destroy_sched_domains(tmp, cpu);

update_packing_domain(cpu); update_top_cache_domain(cpu); }

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4f4a4f6..8c9d3ed 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -157,6 +157,63 @@ void sched_init_granularity(void) update_sysctl(); }

+/*

Save the id of the optimal CPU that should be used to pack small tasks

The value -1 is used when no buddy has been found

*/

+DEFINE_PER_CPU(int, sd_pack_buddy);

+/* Look for the best buddy CPU that can be used to pack small tasks

We make the assumption that it doesn't wort to pack on CPU that share the

s/wort/worth

...

same powerline. We looks for the 1st sched_domain without the

SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the lowest

power per core based on the assumption that their power efficiency is

better */

Commenting style.. /* * */

Can you please expand the why the assumption is right ? "it doesn't wort to pack on CPU that share the same powerline"

Think about a scenario where you have quad core, ducal cluster system

|Cluster1| |cluster 2| | CPU0 | CPU1 | CPU2 | CPU3 | | CPU0 | CPU1 | CPU2 | CPU3 |

Both clusters run from same voltage rail and have same PLL clocking them. But the cluster have their own power domain and all CPU's can power gate them-self to low power states. Clusters also have their own level2 caches.

In this case, you will still save power if you try to pack load on one cluster. No ?

...

+void update_packing_domain(int cpu) +{

struct sched_domain *sd;
int id = -1;
sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
if (!sd)

sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);

else
```
sd = sd->parent;
```
while (sd) {
```
struct sched_group *sg = sd->groups;
```
```
struct sched_group *pack = sg;
```
```
struct sched_group *tmp = sg->next;
```

/* 1st CPU of the sched domain is a good candidate */

```
if (id == -1)
```

	id = cpumask_first(sched_domain_span(sd));

/* loop the sched groups to find the best one */

```
while (tmp != sg) {
```

	if (tmp->sgp->power * sg->group_weight <

```
			sg->sgp->power * tmp->group_weight)
```
```
		pack = tmp;
```
```
	tmp = tmp->next;
```
```
}
```
```
/* we have found a better group */
```
```
if (pack != sg)
```

	id = cpumask_first(sched_group_cpus(pack));

```
/* Look for another CPU than itself */
```
```
if ((id != cpu)
```

 || ((sd->parent) && !(sd->parent->flags && SD_LOAD_BALANCE)))

Is the condition "!(sd->parent->flags && SD_LOAD_BALANCE)" for big.LITTLE kind of system where SD_LOAD_BALANCE may not be used ?

...

	break;
sd = sd->parent;
}

pr_info(KERN_INFO "CPU%d packing on CPU%d\n", cpu, id);

per_cpu(sd_pack_buddy, cpu) = id;
+}

#if BITS_PER_LONG == 32 # define WMULT_CONST (~0UL) #else

@@ -3073,6 +3130,55 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }

+static inline bool is_buddy_busy(int cpu) +{
struct rq *rq = cpu_rq(cpu);

/*
* A busy buddy is a CPU with a high load or a small load with a lot of
* running tasks.
*/
return ((rq->avg.usage_avg_sum << rq->nr_running) >
	rq->avg.runnable_avg_period);

I agree busy CPU is the one with high load, but many small threads may not make CPU fully busy, right ? Should we just stick to the load parameter alone here ?

...

+}

+static inline bool is_light_task(struct task_struct *p) +{

/* A light task runs less than 25% in average */

return ((p->se.avg.usage_avg_sum << 2) < p->se.avg.runnable_avg_period);

+}

Since the whole packing logic relies on the light threads only, the overall effectiveness is not significant. Infact with multiple tries on Dual core system, I didn't see any major improvement in power. I think we need to be more aggressive here. From the cover letter, I noticed that, you were concerned about any performance drop due to packing and may be that is the reason you chose the conservative threshold. But the fact is, if we want to save meaningful power, there will be slight performance drop which is expected.

...

+static int check_pack_buddy(int cpu, struct task_struct *p) +{
int buddy = per_cpu(sd_pack_buddy, cpu);

/* No pack buddy for this CPU */

if (buddy == -1)
return false;
/*
* If a task is waiting for running on the CPU which is its own buddy,
* let the default behavior to look for a better CPU if available
* The threshold has been set to 37.5%
*/
if ((buddy == cpu)
&& ((p->se.avg.usage_avg_sum << 3) < (p->se.avg.runnable_avg_sum * 5)))
return false;

I lost you here on better CPU , 37.5 % and last two conditions. Isn't the first condition 'buddy==cpu' enough to return since nothing really needs to be done in that case. Can you please expand this a bit?

...

/* buddy is not an allowed CPU */

if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
return false;
/*
* If the task is a small one and the buddy is not overloaded,
* we use buddy cpu
*/
if (!is_light_task(p) || is_buddy_busy(buddy))
return false;

This is right but both the evaluation needs update to be effective.

Regards Santosh

Vincent Guittot

29 Oct 29 Oct

1:12 p.m.

New subject: [RFC 3/6] sched: pack small tasks

On 24 October 2012 17:20, Santosh Shilimkar santosh.shilimkar@ti.com wrote:

...

Vincent,

Few comments/questions.

On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...
During sched_domain creation, we define a pack buddy CPU if available.

On a system that share the powerline at all level, the buddy is set to -1

On a dual clusters / dual cores system which can powergate each core and cluster independantly, the buddy configuration will be : | CPU0 | CPU1 | CPU2 | CPU3 |

buddy | CPU0 | CPU0 | CPU0 | CPU2 |
                    ^
Is that a typo ? Should it be CPU2 instead of CPU0 ?

No it's not a typo. The system packs at each scheduling level. It starts to pack in cluster because each core can power gate independently so CPU1 tries to pack its tasks in CPU0 and CPU3 in CPU2. Then, it packs at CPU level so CPU2 tries to pack in the cluster of CPU0 and CPU0 packs in itself

...

...
Small tasks tend to slip out of the periodic load balance. The best place to choose to migrate them is at their wake up.

I have tried this series since I was looking at some of these packing bits. On Mobile workloads like OSIdle with Screen ON, MP3, gallary, I did see some additional filtering of threads with this series but its not making much difference in power. More on this below.

Can I ask you which configuration you have used ? how many cores and cluster ? Can they be power gated independently ?

...

...
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

kernel/sched/core.c | 1 + kernel/sched/fair.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 1 + 3 files changed, 111 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dab7908..70cadbe 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) rcu_assign_pointer(rq->sd, sd); destroy_sched_domains(tmp, cpu);
  update_packing_domain(cpu);
  update_top_cache_domain(cpu);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4f4a4f6..8c9d3ed 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -157,6 +157,63 @@ void sched_init_granularity(void) update_sysctl(); }

+/*

Save the id of the optimal CPU that should be used to pack small tasks

The value -1 is used when no buddy has been found

*/

+DEFINE_PER_CPU(int, sd_pack_buddy);

+/* Look for the best buddy CPU that can be used to pack small tasks

We make the assumption that it doesn't wort to pack on CPU that share

the
s/wort/worth

yes

...

...

same powerline. We looks for the 1st sched_domain without the

SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the

lowest

power per core based on the assumption that their power efficiency is

better */

Commenting style.. /*

*/

yes

...

Can you please expand the why the assumption is right ? "it doesn't wort to pack on CPU that share the same powerline"

By "share the same power-line", I mean that the CPUs can't power off independently. So if some CPUs can't power off independently, it's worth to try to use most of them to race to idle.

...

Think about a scenario where you have quad core, ducal cluster system
    |Cluster1|                      |cluster 2|
| CPU0 | CPU1 | CPU2 | CPU3 | | CPU0 | CPU1 | CPU2 | CPU3 |

Both clusters run from same voltage rail and have same PLL clocking them. But the cluster have their own power domain and all CPU's can power gate them-self to low power states. Clusters also have their own level2 caches.

In this case, you will still save power if you try to pack load on one cluster. No ?

yes, I need to update the description of SD_SHARE_POWERLINE because I'm afraid I was not clear enough. SD_SHARE_POWERLINE includes the power gating capacity of each core. For your example above, the SD_SHARE_POWERLINE shoud be cleared at both MC and CPU level.

...

+void update_packing_domain(int cpu) +{

```
  struct sched_domain *sd;
```
```
  int id = -1;
```

  sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);

```
  if (!sd)
```

          sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);

```
  else
```
```
          sd = sd->parent;
```
```
  while (sd) {
```

          struct sched_group *sg = sd->groups;

          struct sched_group *pack = sg;

          struct sched_group *tmp = sg->next;

          /* 1st CPU of the sched domain is a good candidate */

```
          if (id == -1)
```

                  id = cpumask_first(sched_domain_span(sd));

          /* loop the sched groups to find the best one */

```
          while (tmp != sg) {
```

                  if (tmp->sgp->power * sg->group_weight <

                                  sg->sgp->power *

tmp->group_weight)

```
                          pack = tmp;
```
```
                  tmp = tmp->next;
```
```
          }
```

          /* we have found a better group */

```
          if (pack != sg)
```

                  id = cpumask_first(sched_group_cpus(pack));

          /* Look for another CPU than itself */

```
          if ((id != cpu)
```

           || ((sd->parent) && !(sd->parent->flags &&

SD_LOAD_BALANCE)))

Is the condition "!(sd->parent->flags && SD_LOAD_BALANCE)" for big.LITTLE kind of system where SD_LOAD_BALANCE may not be used ?

No, packing small tasks is part of the load balance so if the LOAD_BALANCE flag is cleared, we will not try to pack which is a kind of load balance. There is no link with big.LITTLE

...

...
                  break;
          sd = sd->parent;
  }
  pr_info(KERN_INFO "CPU%d packing on CPU%d\n", cpu, id);
  per_cpu(sd_pack_buddy, cpu) = id;
+}

#if BITS_PER_LONG == 32 # define WMULT_CONST (~0UL) #else

@@ -3073,6 +3130,55 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }

+static inline bool is_buddy_busy(int cpu) +{
  struct rq *rq = cpu_rq(cpu);
  /*
   * A busy buddy is a CPU with a high load or a small load with a
lot of
   * running tasks.
   */
  return ((rq->avg.usage_avg_sum << rq->nr_running) >
                  rq->avg.runnable_avg_period);
I agree busy CPU is the one with high load, but many small threads may not make CPU fully busy, right ? Should we just stick to the load parameter alone here ?

IMO, the busy state of a CPU isn't only the load but also how many threads are waiting for running on it. This formula tries to take into account both inputs. If you have dozen of small tasks on a CPU, the latency can be large even if the tasks are small.

...

...
+}

+static inline bool is_light_task(struct task_struct *p) +{
  /* A light task runs less than 25% in average */
  return ((p->se.avg.usage_avg_sum << 2) <
p->se.avg.runnable_avg_period); +}
Since the whole packing logic relies on the light threads only, the overall effectiveness is not significant. Infact with multiple tries on Dual core system, I didn't see any major improvement in power. I think we need to be more aggressive here. From the cover letter, I noticed that, you were concerned about any performance drop due to packing and may be that is the reason you chose the conservative threshold. But the fact is, if we want to save meaningful power, there will be slight performance drop which is expected.

I think that everybody agrees that packing small tasks will save power whereas it seems to be not so obvious for heavy task. But I may have set the threshold a bit too low

Up to which load, you would like to pack on 1 core of your dual core system ? Can you provide more details of your load ? Have you got a trace that you can share ?

...

...
+static int check_pack_buddy(int cpu, struct task_struct *p) +{
  int buddy = per_cpu(sd_pack_buddy, cpu);
  /* No pack buddy for this CPU */
  if (buddy == -1)
          return false;
  /*
   * If a task is waiting for running on the CPU which is its own
buddy,
   * let the default behavior to look for a better CPU if available
   * The threshold has been set to 37.5%
   */
  if ((buddy == cpu)
   && ((p->se.avg.usage_avg_sum << 3) < (p->se.avg.runnable_avg_sum
5)))
          return false;
I lost you here on better CPU , 37.5 % and last two conditions. Isn't the first condition 'buddy==cpu' enough to return since nothing really needs to be done in that case. Can you please expand this a bit?

If you have a lot of small tasks waking up and running simultaneously, Some tasks will wait for runnning and we could short the running time by parallelizing tasks if possible (at MC level for example)

...

```
  /* buddy is not an allowed CPU */
```

  if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))

```
          return false;
```
```
  /*
```

   * If the task is a small one and the buddy is not overloaded,

```
   * we use buddy cpu
```
```
   */
```

   if (!is_light_task(p) || is_buddy_busy(buddy))

```
          return false;
```

This is right but both the evaluation needs update to be effective.

Regards Santosh

Santosh Shilimkar

2 Nov 2 Nov

10:53 a.m.

New subject: [RFC 3/6] sched: pack small tasks

On Monday 29 October 2012 06:42 PM, Vincent Guittot wrote:

...

On 24 October 2012 17:20, Santosh Shilimkar santosh.shilimkar@ti.com wrote:

...
Vincent,

Few comments/questions.

On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...
During sched_domain creation, we define a pack buddy CPU if available.

On a system that share the powerline at all level, the buddy is set to -1

On a dual clusters / dual cores system which can powergate each core and cluster independantly, the buddy configuration will be : | CPU0 | CPU1 | CPU2 | CPU3 |

buddy | CPU0 | CPU0 | CPU0 | CPU2 |
                     ^
Is that a typo ? Should it be CPU2 instead of CPU0 ?
No it's not a typo. The system packs at each scheduling level. It starts to pack in cluster because each core can power gate independently so CPU1 tries to pack its tasks in CPU0 and CPU3 in CPU2. Then, it packs at CPU level so CPU2 tries to pack in the cluster of CPU0 and CPU0 packs in itself

I get it. Though in above example a task may migrate from say CPU3->CPU2->CPU0 as part of packing. I was just thinking whether moving such task from say CPU3 to CPU0 might be best instead.

...

...
...
Small tasks tend to slip out of the periodic load balance. The best place to choose to migrate them is at their wake up.

I have tried this series since I was looking at some of these packing bits. On Mobile workloads like OSIdle with Screen ON, MP3, gallary, I did see some additional filtering of threads with this series but its not making much difference in power. More on this below.

Can I ask you which configuration you have used ? how many cores and cluster ? Can they be power gated independently ?

I have been trying with couple of setups. Dual Core ARM machine and Quad core X86 box with single package thought most of the mobile workload analysis I was doing on ARM machine. On both setups CPUs can be gated independently.

...

...
...
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

kernel/sched/core.c | 1 + kernel/sched/fair.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 1 + 3 files changed, 111 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dab7908..70cadbe 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) rcu_assign_pointer(rq->sd, sd); destroy_sched_domains(tmp, cpu);
  update_packing_domain(cpu);
   update_top_cache_domain(cpu);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4f4a4f6..8c9d3ed 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -157,6 +157,63 @@ void sched_init_granularity(void) update_sysctl(); }

+/*

Save the id of the optimal CPU that should be used to pack small tasks

The value -1 is used when no buddy has been found

*/

+DEFINE_PER_CPU(int, sd_pack_buddy);

+/* Look for the best buddy CPU that can be used to pack small tasks

We make the assumption that it doesn't wort to pack on CPU that share

the
s/wort/worth
yes

...
...

same powerline. We looks for the 1st sched_domain without the

SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the

lowest

power per core based on the assumption that their power efficiency is

better */

Commenting style.. /*

*/

yes

...
Can you please expand the why the assumption is right ? "it doesn't wort to pack on CPU that share the same powerline"

By "share the same power-line", I mean that the CPUs can't power off independently. So if some CPUs can't power off independently, it's worth to try to use most of them to race to idle.

In that case I suggest we use different word here. Power line can be treated as voltage line, power domain. May be SD_SHARE_CPU_POWERDOMAIN ?

...

...
Think about a scenario where you have quad core, ducal cluster system
     |Cluster1|                      |cluster 2|
| CPU0 | CPU1 | CPU2 | CPU3 | | CPU0 | CPU1 | CPU2 | CPU3 |

Both clusters run from same voltage rail and have same PLL clocking them. But the cluster have their own power domain and all CPU's can power gate them-self to low power states. Clusters also have their own level2 caches.

In this case, you will still save power if you try to pack load on one cluster. No ?
yes, I need to update the description of SD_SHARE_POWERLINE because I'm afraid I was not clear enough. SD_SHARE_POWERLINE includes the power gating capacity of each core. For your example above, the SD_SHARE_POWERLINE shoud be cleared at both MC and CPU level.

Thanks for clarification.

...

+void update_packing_domain(int cpu) +{

```
  struct sched_domain *sd;
```
```
  int id = -1;
```

  sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);

```
  if (!sd)
```

          sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);

```
  else
```
```
          sd = sd->parent;
```
```
  while (sd) {
```

          struct sched_group *sg = sd->groups;

          struct sched_group *pack = sg;

          struct sched_group *tmp = sg->next;

          /* 1st CPU of the sched domain is a good candidate */

```
          if (id == -1)
```

                  id = cpumask_first(sched_domain_span(sd));

          /* loop the sched groups to find the best one */

```
          while (tmp != sg) {
```

                  if (tmp->sgp->power * sg->group_weight <

                                  sg->sgp->power *

tmp->group_weight)

```
                          pack = tmp;
```
```
                  tmp = tmp->next;
```
```
          }
```

          /* we have found a better group */

```
          if (pack != sg)
```

                  id = cpumask_first(sched_group_cpus(pack));

          /* Look for another CPU than itself */

```
          if ((id != cpu)
```

           || ((sd->parent) && !(sd->parent->flags &&

SD_LOAD_BALANCE)))

Is the condition "!(sd->parent->flags && SD_LOAD_BALANCE)" for big.LITTLE kind of system where SD_LOAD_BALANCE may not be used ?

No, packing small tasks is part of the load balance so if the LOAD_BALANCE flag is cleared, we will not try to pack which is a kind of load balance. There is no link with big.LITTLE

Now it make sense to me.

...

...
...
                  break;
          sd = sd->parent;
  }
  pr_info(KERN_INFO "CPU%d packing on CPU%d\n", cpu, id);
  per_cpu(sd_pack_buddy, cpu) = id;
+}

#if BITS_PER_LONG == 32 # define WMULT_CONST (~0UL) #else

@@ -3073,6 +3130,55 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }

+static inline bool is_buddy_busy(int cpu) +{
  struct rq *rq = cpu_rq(cpu);
  /*
   * A busy buddy is a CPU with a high load or a small load with a
lot of
   * running tasks.
   */
  return ((rq->avg.usage_avg_sum << rq->nr_running) >
                  rq->avg.runnable_avg_period);
I agree busy CPU is the one with high load, but many small threads may not make CPU fully busy, right ? Should we just stick to the load parameter alone here ?
IMO, the busy state of a CPU isn't only the load but also how many threads are waiting for running on it. This formula tries to take into account both inputs. If you have dozen of small tasks on a CPU, the latency can be large even if the tasks are small.

Sure. Your point is to avoid throttling and probably use race for idle.

...

...
...
+}

+static inline bool is_light_task(struct task_struct *p) +{
  /* A light task runs less than 25% in average */
  return ((p->se.avg.usage_avg_sum << 2) <
p->se.avg.runnable_avg_period); +}
Since the whole packing logic relies on the light threads only, the overall effectiveness is not significant. Infact with multiple tries on Dual core system, I didn't see any major improvement in power. I think we need to be more aggressive here. From the cover letter, I noticed that, you were concerned about any performance drop due to packing and may be that is the reason you chose the conservative threshold. But the fact is, if we want to save meaningful power, there will be slight performance drop which is expected.
I think that everybody agrees that packing small tasks will save power whereas it seems to be not so obvious for heavy task. But I may have set the threshold a bit too low

I agree on packing saves power part for sure.

...

Up to which load, you would like to pack on 1 core of your dual core system ? Can you provide more details of your load ? Have you got a trace that you can share ?

More than how much load to pack, I was more looking from the power savings delta we can achieve by doing it. Some of the usecases like osidle, mp3, gallary are already very low power and that might be the reason I didn't notice major mA delta. Though the perf traces did show some filtering even at 25 % load. i tried upto 50 % threshold to see the effectiveness and there was more improvement and hence the suggestion about aggressiveness.

May be you can try some of these use-cases on your setup instead of synthetic workload and see the results.

Regards Santosh

Morten Rasmussen

9 Nov 9 Nov

4:46 p.m.

New subject: [RFC 3/6] sched: pack small tasks

On Fri, Nov 02, 2012 at 10:53:47AM +0000, Santosh Shilimkar wrote:

...

On Monday 29 October 2012 06:42 PM, Vincent Guittot wrote:

...
On 24 October 2012 17:20, Santosh Shilimkar santosh.shilimkar@ti.com wrote:

...
Vincent,

Few comments/questions.

On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...
During sched_domain creation, we define a pack buddy CPU if available.

On a system that share the powerline at all level, the buddy is set to -1

On a dual clusters / dual cores system which can powergate each core and cluster independantly, the buddy configuration will be : | CPU0 | CPU1 | CPU2 | CPU3 |

buddy | CPU0 | CPU0 | CPU0 | CPU2 |
                     ^
Is that a typo ? Should it be CPU2 instead of CPU0 ?
No it's not a typo. The system packs at each scheduling level. It starts to pack in cluster because each core can power gate independently so CPU1 tries to pack its tasks in CPU0 and CPU3 in CPU2. Then, it packs at CPU level so CPU2 tries to pack in the cluster of CPU0 and CPU0 packs in itself
I get it. Though in above example a task may migrate from say CPU3->CPU2->CPU0 as part of packing. I was just thinking whether moving such task from say CPU3 to CPU0 might be best instead.

To me it seems suboptimal to pack the task twice, but the alternative is not good either. If you try to move the task directly to CPU0 you may miss packing opportunities if CPU0 is already busy, while CPU2 might have enough capacity to take it. It would probably be better to check the business of CPU0 and then back off and try CPU2 if CP0 is busy. This would require a buddy list for each CPU rather just a single buddy and thus might become expensive.

...

...
...
...
Small tasks tend to slip out of the periodic load balance. The best place to choose to migrate them is at their wake up.

I have tried this series since I was looking at some of these packing bits. On Mobile workloads like OSIdle with Screen ON, MP3, gallary, I did see some additional filtering of threads with this series but its not making much difference in power. More on this below.

Can I ask you which configuration you have used ? how many cores and cluster ? Can they be power gated independently ?

I have been trying with couple of setups. Dual Core ARM machine and Quad core X86 box with single package thought most of the mobile workload analysis I was doing on ARM machine. On both setups CPUs can be gated independently.

...
...
...
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

kernel/sched/core.c | 1 + kernel/sched/fair.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 1 + 3 files changed, 111 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dab7908..70cadbe 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) rcu_assign_pointer(rq->sd, sd); destroy_sched_domains(tmp, cpu);
  update_packing_domain(cpu);
   update_top_cache_domain(cpu);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4f4a4f6..8c9d3ed 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -157,6 +157,63 @@ void sched_init_granularity(void) update_sysctl(); }

+/*

Save the id of the optimal CPU that should be used to pack small tasks

The value -1 is used when no buddy has been found

*/

+DEFINE_PER_CPU(int, sd_pack_buddy);

+/* Look for the best buddy CPU that can be used to pack small tasks

We make the assumption that it doesn't wort to pack on CPU that share

the
s/wort/worth
yes

...
...

same powerline. We looks for the 1st sched_domain without the

SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the

lowest

power per core based on the assumption that their power efficiency is

better */

Commenting style.. /*

*/

yes

...
Can you please expand the why the assumption is right ? "it doesn't wort to pack on CPU that share the same powerline"

By "share the same power-line", I mean that the CPUs can't power off independently. So if some CPUs can't power off independently, it's worth to try to use most of them to race to idle.
In that case I suggest we use different word here. Power line can be treated as voltage line, power domain. May be SD_SHARE_CPU_POWERDOMAIN ?

How about just SD_SHARE_POWERDOMAIN ?

...

...
...
Think about a scenario where you have quad core, ducal cluster system
     |Cluster1|                      |cluster 2|
| CPU0 | CPU1 | CPU2 | CPU3 | | CPU0 | CPU1 | CPU2 | CPU3 |

Both clusters run from same voltage rail and have same PLL clocking them. But the cluster have their own power domain and all CPU's can power gate them-self to low power states. Clusters also have their own level2 caches.

In this case, you will still save power if you try to pack load on one cluster. No ?
yes, I need to update the description of SD_SHARE_POWERLINE because I'm afraid I was not clear enough. SD_SHARE_POWERLINE includes the power gating capacity of each core. For your example above, the SD_SHARE_POWERLINE shoud be cleared at both MC and CPU level.
Thanks for clarification.

...
...
...
+void update_packing_domain(int cpu) +{
  struct sched_domain *sd;
  int id = -1;
  sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
  if (!sd)
          sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
  else
          sd = sd->parent;
  while (sd) {
          struct sched_group *sg = sd->groups;
          struct sched_group *pack = sg;
          struct sched_group *tmp = sg->next;
          /* 1st CPU of the sched domain is a good candidate */
          if (id == -1)
                  id = cpumask_first(sched_domain_span(sd));
          /* loop the sched groups to find the best one */
          while (tmp != sg) {
                  if (tmp->sgp->power * sg->group_weight <
                                  sg->sgp->power *
tmp->group_weight)
                          pack = tmp;
                  tmp = tmp->next;
          }
          /* we have found a better group */
          if (pack != sg)
                  id = cpumask_first(sched_group_cpus(pack));
          /* Look for another CPU than itself */
          if ((id != cpu)
           || ((sd->parent) && !(sd->parent->flags &&
SD_LOAD_BALANCE)))
Is the condition "!(sd->parent->flags && SD_LOAD_BALANCE)" for big.LITTLE kind of system where SD_LOAD_BALANCE may not be used ?
No, packing small tasks is part of the load balance so if the LOAD_BALANCE flag is cleared, we will not try to pack which is a kind of load balance. There is no link with big.LITTLE
Now it make sense to me.

...
...
...
                  break;
          sd = sd->parent;
  }
  pr_info(KERN_INFO "CPU%d packing on CPU%d\n", cpu, id);
  per_cpu(sd_pack_buddy, cpu) = id;
+}

#if BITS_PER_LONG == 32 # define WMULT_CONST (~0UL) #else

@@ -3073,6 +3130,55 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }

+static inline bool is_buddy_busy(int cpu) +{
  struct rq *rq = cpu_rq(cpu);
  /*
   * A busy buddy is a CPU with a high load or a small load with a
lot of
   * running tasks.
   */
  return ((rq->avg.usage_avg_sum << rq->nr_running) >
                  rq->avg.runnable_avg_period);
I agree busy CPU is the one with high load, but many small threads may not make CPU fully busy, right ? Should we just stick to the load parameter alone here ?
IMO, the busy state of a CPU isn't only the load but also how many threads are waiting for running on it. This formula tries to take into account both inputs. If you have dozen of small tasks on a CPU, the latency can be large even if the tasks are small.
Sure. Your point is to avoid throttling and probably use race for idle.

...
...
...
+}

+static inline bool is_light_task(struct task_struct *p) +{
  /* A light task runs less than 25% in average */
  return ((p->se.avg.usage_avg_sum << 2) <
p->se.avg.runnable_avg_period); +}
Since the whole packing logic relies on the light threads only, the overall effectiveness is not significant. Infact with multiple tries on Dual core system, I didn't see any major improvement in power. I think we need to be more aggressive here. From the cover letter, I noticed that, you were concerned about any performance drop due to packing and may be that is the reason you chose the conservative threshold. But the fact is, if we want to save meaningful power, there will be slight performance drop which is expected.
I think that everybody agrees that packing small tasks will save power whereas it seems to be not so obvious for heavy task. But I may have set the threshold a bit too low
I agree on packing saves power part for sure.

I'm not fully convinced that packing always saves power. For systems with multiple cpu clusters where each cluster is a power domain and the cpus have no individual power saving states it would probably be more power efficient to spread the tasks and hope for more opportunities for hitting cluster shutdown. If all tasks are packed on one cpu it will keep the whole cluster up, while the remaining cpus are idling without possibility for entering efficient power states.

Regards, Morten

...

...
Up to which load, you would like to pack on 1 core of your dual core system ? Can you provide more details of your load ? Have you got a trace that you can share ?

More than how much load to pack, I was more looking from the power savings delta we can achieve by doing it. Some of the usecases like osidle, mp3, gallary are already very low power and that might be the reason I didn't notice major mA delta. Though the perf traces did show some filtering even at 25 % load. i tried upto 50 % threshold to see the effectiveness and there was more improvement and hence the suggestion about aggressiveness.

May be you can try some of these use-cases on your setup instead of synthetic workload and see the results.

Regards Santosh

linaro-dev mailing list linaro-dev@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-dev

Vincent Guittot

12 Nov 12 Nov

1:13 p.m.

New subject: [RFC 3/6] sched: pack small tasks

On 9 November 2012 17:46, Morten Rasmussen Morten.Rasmussen@arm.com wrote:

...

On Fri, Nov 02, 2012 at 10:53:47AM +0000, Santosh Shilimkar wrote:

...
On Monday 29 October 2012 06:42 PM, Vincent Guittot wrote:

...
On 24 October 2012 17:20, Santosh Shilimkar santosh.shilimkar@ti.com wrote:

...
Vincent,

Few comments/questions.

On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...
During sched_domain creation, we define a pack buddy CPU if available.

On a system that share the powerline at all level, the buddy is set to -1

On a dual clusters / dual cores system which can powergate each core and cluster independantly, the buddy configuration will be : | CPU0 | CPU1 | CPU2 | CPU3 |

buddy | CPU0 | CPU0 | CPU0 | CPU2 |
                     ^
Is that a typo ? Should it be CPU2 instead of CPU0 ?
No it's not a typo. The system packs at each scheduling level. It starts to pack in cluster because each core can power gate independently so CPU1 tries to pack its tasks in CPU0 and CPU3 in CPU2. Then, it packs at CPU level so CPU2 tries to pack in the cluster of CPU0 and CPU0 packs in itself
I get it. Though in above example a task may migrate from say CPU3->CPU2->CPU0 as part of packing. I was just thinking whether moving such task from say CPU3 to CPU0 might be best instead.
To me it seems suboptimal to pack the task twice, but the alternative is not good either. If you try to move the task directly to CPU0 you may miss packing opportunities if CPU0 is already busy, while CPU2 might have enough capacity to take it. It would probably be better to check the business of CPU0 and then back off and try CPU2 if CP0 is busy. This would require a buddy list for each CPU rather just a single buddy and thus might become expensive.

...
...
...
...
Small tasks tend to slip out of the periodic load balance. The best place to choose to migrate them is at their wake up.

I have tried this series since I was looking at some of these packing bits. On Mobile workloads like OSIdle with Screen ON, MP3, gallary, I did see some additional filtering of threads with this series but its not making much difference in power. More on this below.

Can I ask you which configuration you have used ? how many cores and cluster ? Can they be power gated independently ?

I have been trying with couple of setups. Dual Core ARM machine and Quad core X86 box with single package thought most of the mobile workload analysis I was doing on ARM machine. On both setups CPUs can be gated independently.

...
...
...
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

kernel/sched/core.c | 1 + kernel/sched/fair.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 1 + 3 files changed, 111 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dab7908..70cadbe 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) rcu_assign_pointer(rq->sd, sd); destroy_sched_domains(tmp, cpu);
  update_packing_domain(cpu);
   update_top_cache_domain(cpu);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4f4a4f6..8c9d3ed 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -157,6 +157,63 @@ void sched_init_granularity(void) update_sysctl(); }

+/*

Save the id of the optimal CPU that should be used to pack small tasks

The value -1 is used when no buddy has been found

*/

+DEFINE_PER_CPU(int, sd_pack_buddy);

+/* Look for the best buddy CPU that can be used to pack small tasks

We make the assumption that it doesn't wort to pack on CPU that share

the
s/wort/worth
yes

...
...

same powerline. We looks for the 1st sched_domain without the

SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the

lowest

power per core based on the assumption that their power efficiency is

better */

Commenting style.. /*

*/

yes

...
Can you please expand the why the assumption is right ? "it doesn't wort to pack on CPU that share the same powerline"

By "share the same power-line", I mean that the CPUs can't power off independently. So if some CPUs can't power off independently, it's worth to try to use most of them to race to idle.
In that case I suggest we use different word here. Power line can be treated as voltage line, power domain. May be SD_SHARE_CPU_POWERDOMAIN ?
How about just SD_SHARE_POWERDOMAIN ?

It looks better than SD_SHARE_POWERLINE. I will replace the name

...

...
...
...
Think about a scenario where you have quad core, ducal cluster system
     |Cluster1|                      |cluster 2|
| CPU0 | CPU1 | CPU2 | CPU3 | | CPU0 | CPU1 | CPU2 | CPU3 |

Both clusters run from same voltage rail and have same PLL clocking them. But the cluster have their own power domain and all CPU's can power gate them-self to low power states. Clusters also have their own level2 caches.

In this case, you will still save power if you try to pack load on one cluster. No ?
yes, I need to update the description of SD_SHARE_POWERLINE because I'm afraid I was not clear enough. SD_SHARE_POWERLINE includes the power gating capacity of each core. For your example above, the SD_SHARE_POWERLINE shoud be cleared at both MC and CPU level.
Thanks for clarification.

...
...
...
+void update_packing_domain(int cpu) +{
  struct sched_domain *sd;
  int id = -1;
  sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
  if (!sd)
          sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
  else
          sd = sd->parent;
  while (sd) {
          struct sched_group *sg = sd->groups;
          struct sched_group *pack = sg;
          struct sched_group *tmp = sg->next;
          /* 1st CPU of the sched domain is a good candidate */
          if (id == -1)
                  id = cpumask_first(sched_domain_span(sd));
          /* loop the sched groups to find the best one */
          while (tmp != sg) {
                  if (tmp->sgp->power * sg->group_weight <
                                  sg->sgp->power *
tmp->group_weight)
                          pack = tmp;
                  tmp = tmp->next;
          }
          /* we have found a better group */
          if (pack != sg)
                  id = cpumask_first(sched_group_cpus(pack));
          /* Look for another CPU than itself */
          if ((id != cpu)
           || ((sd->parent) && !(sd->parent->flags &&
SD_LOAD_BALANCE)))
Is the condition "!(sd->parent->flags && SD_LOAD_BALANCE)" for big.LITTLE kind of system where SD_LOAD_BALANCE may not be used ?
No, packing small tasks is part of the load balance so if the LOAD_BALANCE flag is cleared, we will not try to pack which is a kind of load balance. There is no link with big.LITTLE
Now it make sense to me.

...
...
...
                  break;
          sd = sd->parent;
  }
  pr_info(KERN_INFO "CPU%d packing on CPU%d\n", cpu, id);
  per_cpu(sd_pack_buddy, cpu) = id;
+}

#if BITS_PER_LONG == 32 # define WMULT_CONST (~0UL) #else

@@ -3073,6 +3130,55 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }

+static inline bool is_buddy_busy(int cpu) +{
  struct rq *rq = cpu_rq(cpu);
  /*
   * A busy buddy is a CPU with a high load or a small load with a
lot of
   * running tasks.
   */
  return ((rq->avg.usage_avg_sum << rq->nr_running) >
                  rq->avg.runnable_avg_period);
I agree busy CPU is the one with high load, but many small threads may not make CPU fully busy, right ? Should we just stick to the load parameter alone here ?
IMO, the busy state of a CPU isn't only the load but also how many threads are waiting for running on it. This formula tries to take into account both inputs. If you have dozen of small tasks on a CPU, the latency can be large even if the tasks are small.
Sure. Your point is to avoid throttling and probably use race for idle.

...
...
...
+}

+static inline bool is_light_task(struct task_struct *p) +{
  /* A light task runs less than 25% in average */
  return ((p->se.avg.usage_avg_sum << 2) <
p->se.avg.runnable_avg_period); +}
Since the whole packing logic relies on the light threads only, the overall effectiveness is not significant. Infact with multiple tries on Dual core system, I didn't see any major improvement in power. I think we need to be more aggressive here. From the cover letter, I noticed that, you were concerned about any performance drop due to packing and may be that is the reason you chose the conservative threshold. But the fact is, if we want to save meaningful power, there will be slight performance drop which is expected.
I think that everybody agrees that packing small tasks will save power whereas it seems to be not so obvious for heavy task. But I may have set the threshold a bit too low
I agree on packing saves power part for sure.
I'm not fully convinced that packing always saves power. For systems with multiple cpu clusters where each cluster is a power domain and the cpus have no individual power saving states it would probably be more power efficient to spread the tasks and hope for more opportunities for hitting cluster shutdown. If all tasks are packed on one cpu it will keep the whole cluster up, while the remaining cpus are idling without possibility for entering efficient power states.

Regards, Morten

...
...
Up to which load, you would like to pack on 1 core of your dual core system ? Can you provide more details of your load ? Have you got a trace that you can share ?

More than how much load to pack, I was more looking from the power savings delta we can achieve by doing it. Some of the usecases like osidle, mp3, gallary are already very low power and that might be the reason I didn't notice major mA delta. Though the perf traces did show some filtering even at 25 % load. i tried upto 50 % threshold to see the effectiveness and there was more improvement and hence the suggestion about aggressiveness.

May be you can try some of these use-cases on your setup instead of synthetic workload and see the results.

Regards Santosh

linaro-dev mailing list linaro-dev@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-dev

Vincent Guittot

9:30 a.m.

New subject: [RFC 3/6] sched: pack small tasks

On 2 November 2012 11:53, Santosh Shilimkar santosh.shilimkar@ti.com wrote:

...

On Monday 29 October 2012 06:42 PM, Vincent Guittot wrote:

...
On 24 October 2012 17:20, Santosh Shilimkar santosh.shilimkar@ti.com wrote:

...
Vincent,

Few comments/questions.

On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...
During sched_domain creation, we define a pack buddy CPU if available.

On a system that share the powerline at all level, the buddy is set to -1

On a dual clusters / dual cores system which can powergate each core and cluster independantly, the buddy configuration will be : | CPU0 | CPU1 | CPU2 | CPU3 |

buddy | CPU0 | CPU0 | CPU0 | CPU2 |
                     ^
Is that a typo ? Should it be CPU2 instead of CPU0 ?
No it's not a typo. The system packs at each scheduling level. It starts to pack in cluster because each core can power gate independently so CPU1 tries to pack its tasks in CPU0 and CPU3 in CPU2. Then, it packs at CPU level so CPU2 tries to pack in the cluster of CPU0 and CPU0 packs in itself
I get it. Though in above example a task may migrate from say CPU3->CPU2->CPU0 as part of packing. I was just thinking whether moving such task from say CPU3 to CPU0 might be best instead.

We pack in the cluster then at CPU level. Tasks could sometimes migrate directly to CPU0 but we would miss the case where CPU0 is busy but CPU2 is not

Vincent

...

...
...
...
Small tasks tend to slip out of the periodic load balance. The best place to choose to migrate them is at their wake up.

I have tried this series since I was looking at some of these packing bits. On Mobile workloads like OSIdle with Screen ON, MP3, gallary, I did see some additional filtering of threads with this series but its not making much difference in power. More on this below.

Can I ask you which configuration you have used ? how many cores and cluster ? Can they be power gated independently ?

I have been trying with couple of setups. Dual Core ARM machine and Quad core X86 box with single package thought most of the mobile workload analysis I was doing on ARM machine. On both setups CPUs can be gated independently.

...
...
...
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

kernel/sched/core.c | 1 + kernel/sched/fair.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 1 + 3 files changed, 111 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dab7908..70cadbe 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) rcu_assign_pointer(rq->sd, sd); destroy_sched_domains(tmp, cpu);
  update_packing_domain(cpu);
   update_top_cache_domain(cpu);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4f4a4f6..8c9d3ed 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -157,6 +157,63 @@ void sched_init_granularity(void) update_sysctl(); }

+/*

Save the id of the optimal CPU that should be used to pack small

tasks

The value -1 is used when no buddy has been found

*/

+DEFINE_PER_CPU(int, sd_pack_buddy);

+/* Look for the best buddy CPU that can be used to pack small tasks

We make the assumption that it doesn't wort to pack on CPU that

share the
s/wort/worth
yes

...
...

same powerline. We looks for the 1st sched_domain without the

SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the

lowest

power per core based on the assumption that their power efficiency

is

better */

Commenting style.. /*

*/

yes

...
Can you please expand the why the assumption is right ? "it doesn't wort to pack on CPU that share the same powerline"

By "share the same power-line", I mean that the CPUs can't power off independently. So if some CPUs can't power off independently, it's worth to try to use most of them to race to idle.
In that case I suggest we use different word here. Power line can be treated as voltage line, power domain. May be SD_SHARE_CPU_POWERDOMAIN ?

...
...
Think about a scenario where you have quad core, ducal cluster system
     |Cluster1|                      |cluster 2|
| CPU0 | CPU1 | CPU2 | CPU3 | | CPU0 | CPU1 | CPU2 | CPU3 |

Both clusters run from same voltage rail and have same PLL clocking them. But the cluster have their own power domain and all CPU's can power gate them-self to low power states. Clusters also have their own level2 caches.

In this case, you will still save power if you try to pack load on one cluster. No ?
yes, I need to update the description of SD_SHARE_POWERLINE because I'm afraid I was not clear enough. SD_SHARE_POWERLINE includes the power gating capacity of each core. For your example above, the SD_SHARE_POWERLINE shoud be cleared at both MC and CPU level.
Thanks for clarification.

...
...
...
+void update_packing_domain(int cpu) +{
  struct sched_domain *sd;
  int id = -1;
  sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
  if (!sd)
          sd =
rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
  else
          sd = sd->parent;
  while (sd) {
          struct sched_group *sg = sd->groups;
          struct sched_group *pack = sg;
          struct sched_group *tmp = sg->next;
          /* 1st CPU of the sched domain is a good candidate */
          if (id == -1)
                  id = cpumask_first(sched_domain_span(sd));
          /* loop the sched groups to find the best one */
          while (tmp != sg) {
                  if (tmp->sgp->power * sg->group_weight <
                                  sg->sgp->power *
tmp->group_weight)
                          pack = tmp;
                  tmp = tmp->next;
          }
          /* we have found a better group */
          if (pack != sg)
                  id = cpumask_first(sched_group_cpus(pack));
          /* Look for another CPU than itself */
          if ((id != cpu)
           || ((sd->parent) && !(sd->parent->flags &&
SD_LOAD_BALANCE)))
Is the condition "!(sd->parent->flags && SD_LOAD_BALANCE)" for big.LITTLE kind of system where SD_LOAD_BALANCE may not be used ?
No, packing small tasks is part of the load balance so if the LOAD_BALANCE flag is cleared, we will not try to pack which is a kind of load balance. There is no link with big.LITTLE
Now it make sense to me.

...
...
...
                  break;
          sd = sd->parent;
  }
  pr_info(KERN_INFO "CPU%d packing on CPU%d\n", cpu, id);
  per_cpu(sd_pack_buddy, cpu) = id;
+}

#if BITS_PER_LONG == 32 # define WMULT_CONST (~0UL) #else

@@ -3073,6 +3130,55 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }

+static inline bool is_buddy_busy(int cpu) +{
  struct rq *rq = cpu_rq(cpu);
  /*
   * A busy buddy is a CPU with a high load or a small load with a
lot of
   * running tasks.
   */
  return ((rq->avg.usage_avg_sum << rq->nr_running) >
                  rq->avg.runnable_avg_period);
I agree busy CPU is the one with high load, but many small threads may not make CPU fully busy, right ? Should we just stick to the load parameter alone here ?
IMO, the busy state of a CPU isn't only the load but also how many threads are waiting for running on it. This formula tries to take into account both inputs. If you have dozen of small tasks on a CPU, the latency can be large even if the tasks are small.
Sure. Your point is to avoid throttling and probably use race for idle.

...
...
...
+}

+static inline bool is_light_task(struct task_struct *p) +{
  /* A light task runs less than 25% in average */
  return ((p->se.avg.usage_avg_sum << 2) <
p->se.avg.runnable_avg_period); +}
Since the whole packing logic relies on the light threads only, the overall effectiveness is not significant. Infact with multiple tries on Dual core system, I didn't see any major improvement in power. I think we need to be more aggressive here. From the cover letter, I noticed that, you were concerned about any performance drop due to packing and may be that is the reason you chose the conservative threshold. But the fact is, if we want to save meaningful power, there will be slight performance drop which is expected.
I think that everybody agrees that packing small tasks will save power whereas it seems to be not so obvious for heavy task. But I may have set the threshold a bit too low
I agree on packing saves power part for sure.

...
Up to which load, you would like to pack on 1 core of your dual core system ? Can you provide more details of your load ? Have you got a trace that you can share ?

More than how much load to pack, I was more looking from the power savings delta we can achieve by doing it. Some of the usecases like osidle, mp3, gallary are already very low power and that might be the reason I didn't notice major mA delta. Though the perf traces did show some filtering even at 25 % load. i tried upto 50 % threshold to see the effectiveness and there was more improvement and hence the suggestion about aggressiveness.

May be you can try some of these use-cases on your setup instead of synthetic workload and see the results.

Yes, I will do that. The main advantage of synthetic workload is the reproducibility and the interdependency against the framework that is used. But I'm going to use more realistic use-cases.

Regards, Vincent

...

Regards Santosh

Morten Rasmussen

9 Nov 9 Nov

5:13 p.m.

New subject: [RFC 3/6] sched: pack small tasks

Hi Vincent,

I have experienced suboptimal buddy selection on a dual cluster setup (ARM TC2) if SD_SHARE_POWERLINE is enabled at MC level and disabled at CPU level. This seems to be the correct flag settings for a system with only cluster level power gating.

To me it looks like update_packing_domain() is not doing the right thing. See inline comments below.

On Sun, Oct 07, 2012 at 08:43:55AM +0100, Vincent Guittot wrote:

...

During sched_domain creation, we define a pack buddy CPU if available.

On a system that share the powerline at all level, the buddy is set to -1

On a dual clusters / dual cores system which can powergate each core and cluster independantly, the buddy configuration will be : | CPU0 | CPU1 | CPU2 | CPU3 |

buddy | CPU0 | CPU0 | CPU0 | CPU2 |

Small tasks tend to slip out of the periodic load balance. The best place to choose to migrate them is at their wake up.

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

kernel/sched/core.c | 1 + kernel/sched/fair.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 1 + 3 files changed, 111 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dab7908..70cadbe 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) rcu_assign_pointer(rq->sd, sd); destroy_sched_domains(tmp, cpu);

update_packing_domain(cpu); update_top_cache_domain(cpu);

} diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4f4a4f6..8c9d3ed 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -157,6 +157,63 @@ void sched_init_granularity(void) update_sysctl(); }

+/*

Save the id of the optimal CPU that should be used to pack small tasks

The value -1 is used when no buddy has been found

*/

+DEFINE_PER_CPU(int, sd_pack_buddy);

+/* Look for the best buddy CPU that can be used to pack small tasks

We make the assumption that it doesn't wort to pack on CPU that share the

same powerline. We looks for the 1st sched_domain without the

SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the lowest

power per core based on the assumption that their power efficiency is

better */

+void update_packing_domain(int cpu) +{
struct sched_domain *sd;

int id = -1;

sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);

if (!sd)
sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
else
sd = sd->parent;

sd is the highest level where SD_SHARE_POWERLINE is enabled so the sched groups of the parent level would represent the power domains. If get it right, we want to pack inside the cluster first and only let first cpu of the cluster do packing on another cluster. So all cpus - except the first one - in the current sched domain should find its buddy within the domain and only the first one should go to the parent sched domain to find its buddy.

I propose the following fix:

- sd = sd->parent; + if (cpumask_first(sched_domain_span(sd)) == cpu + || !sd->parent) + sd = sd->parent;

...

while (sd) {
```
struct sched_group *sg = sd->groups;
```
```
struct sched_group *pack = sg;
```
```
struct sched_group *tmp = sg->next;
```

/* 1st CPU of the sched domain is a good candidate */

```
if (id == -1)
```

	id = cpumask_first(sched_domain_span(sd));

There is no guarantee that id is in the sched group pointed to by sd->groups, which is implicitly assumed later in the search loop. We need to find the sched group that contains id and point sg to that instead. I haven't found an elegant way to find that group, but the fix below should at least give the right result.

+ /* Find sched group of candidate */ + tmp = sd->groups; + do { + if (cpumask_test_cpu(id, sched_group_cpus(tmp))) + { + sg = tmp; + break; + } + } while (tmp = tmp->next, tmp != sd->groups); + + pack = sg; + tmp = sg->next;

Regards, Morten

...

/* loop the sched groups to find the best one */
while (tmp != sg) {
	if (tmp->sgp->power * sg->group_weight <
			sg->sgp->power * tmp->group_weight)
		pack = tmp;
	tmp = tmp->next;
}
/* we have found a better group */
if (pack != sg)
	id = cpumask_first(sched_group_cpus(pack));
/* Look for another CPU than itself */
if ((id != cpu)
 || ((sd->parent) && !(sd->parent->flags && SD_LOAD_BALANCE)))
	break;
sd = sd->parent;
}

pr_info(KERN_INFO "CPU%d packing on CPU%d\n", cpu, id);

per_cpu(sd_pack_buddy, cpu) = id;
+}

#if BITS_PER_LONG == 32 # define WMULT_CONST (~0UL) #else @@ -3073,6 +3130,55 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; } +static inline bool is_buddy_busy(int cpu) +{
struct rq *rq = cpu_rq(cpu);

/*
* A busy buddy is a CPU with a high load or a small load with a lot of
* running tasks.
*/
return ((rq->avg.usage_avg_sum << rq->nr_running) >
	rq->avg.runnable_avg_period);
+}

+static inline bool is_light_task(struct task_struct *p) +{

/* A light task runs less than 25% in average */

return ((p->se.avg.usage_avg_sum << 2) < p->se.avg.runnable_avg_period);

+}

+static int check_pack_buddy(int cpu, struct task_struct *p) +{
int buddy = per_cpu(sd_pack_buddy, cpu);

/* No pack buddy for this CPU */

if (buddy == -1)
return false;
/*
* If a task is waiting for running on the CPU which is its own buddy,
* let the default behavior to look for a better CPU if available
* The threshold has been set to 37.5%
*/
if ((buddy == cpu)
&& ((p->se.avg.usage_avg_sum << 3) < (p->se.avg.runnable_avg_sum * 5)))
return false;
/* buddy is not an allowed CPU */

if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
return false;
/*
* If the task is a small one and the buddy is not overloaded,
* we use buddy cpu
*/
if (!is_light_task(p) || is_buddy_busy(buddy))
return false;
return true;
+}

/*

sched_balance_self: balance the current task (running on cpu) in domains

that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and

@@ -3098,6 +3204,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) if (p->nr_cpus_allowed == 1) return prev_cpu;
if (check_pack_buddy(cpu, p))
return per_cpu(sd_pack_buddy, cpu);
if (sd_flag & SD_BALANCE_WAKE) { if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) want_affine = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index a95d5c1..086d8bf 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -875,6 +875,7 @@ static inline void idle_balance(int cpu, struct rq *rq) extern void sysrq_sched_debug_show(void); extern void sched_init_granularity(void); +extern void update_packing_domain(int cpu); extern void update_max_interval(void); extern void update_group_power(struct sched_domain *sd, int cpu); extern int update_runtime(struct notifier_block *nfb, unsigned long action, void *hcpu); -- 1.7.9.5

linaro-dev mailing list linaro-dev@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-dev

Vincent Guittot

12 Nov 12 Nov

1:51 p.m.

New subject: [RFC 3/6] sched: pack small tasks

On 9 November 2012 18:13, Morten Rasmussen Morten.Rasmussen@arm.com wrote:

...

Hi Vincent,

I have experienced suboptimal buddy selection on a dual cluster setup (ARM TC2) if SD_SHARE_POWERLINE is enabled at MC level and disabled at CPU level. This seems to be the correct flag settings for a system with only cluster level power gating.

To me it looks like update_packing_domain() is not doing the right thing. See inline comments below.

Hi Morten,

Thanks for testing the patches.

It seems that I have too optimized the loop and remove some use cases.

...

On Sun, Oct 07, 2012 at 08:43:55AM +0100, Vincent Guittot wrote:

...
During sched_domain creation, we define a pack buddy CPU if available.

On a system that share the powerline at all level, the buddy is set to -1

On a dual clusters / dual cores system which can powergate each core and cluster independantly, the buddy configuration will be : | CPU0 | CPU1 | CPU2 | CPU3 |

buddy | CPU0 | CPU0 | CPU0 | CPU2 |

Small tasks tend to slip out of the periodic load balance. The best place to choose to migrate them is at their wake up.

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

kernel/sched/core.c | 1 + kernel/sched/fair.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 1 + 3 files changed, 111 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dab7908..70cadbe 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) rcu_assign_pointer(rq->sd, sd); destroy_sched_domains(tmp, cpu);
update_packing_domain(cpu);
update_top_cache_domain(cpu);
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4f4a4f6..8c9d3ed 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -157,6 +157,63 @@ void sched_init_granularity(void) update_sysctl(); }

+/*

Save the id of the optimal CPU that should be used to pack small tasks

The value -1 is used when no buddy has been found

*/

+DEFINE_PER_CPU(int, sd_pack_buddy);

+/* Look for the best buddy CPU that can be used to pack small tasks

We make the assumption that it doesn't wort to pack on CPU that share the

same powerline. We looks for the 1st sched_domain without the

SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the lowest

power per core based on the assumption that their power efficiency is

better */

+void update_packing_domain(int cpu) +{
struct sched_domain *sd;
int id = -1;
sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
if (!sd)
        sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
else
        sd = sd->parent;
sd is the highest level where SD_SHARE_POWERLINE is enabled so the sched groups of the parent level would represent the power domains. If get it right, we want to pack inside the cluster first and only let first cpu

You probably wanted to use sched_group instead of cluster because cluster is only a special use case, didn't you ?

...

of the cluster do packing on another cluster. So all cpus - except the first one - in the current sched domain should find its buddy within the domain and only the first one should go to the parent sched domain to find its buddy.

We don't want to pack in the current sched_domain because it shares power domain. We want to pack at the parent level

...

I propose the following fix:

```
          sd = sd->parent;
```

          if (cpumask_first(sched_domain_span(sd)) == cpu

```
                  || !sd->parent)
```
```
                  sd = sd->parent;
```

We always look for the buddy in the parent level whatever the cpu position in the mask is.

...

```
while (sd) {
```

        struct sched_group *sg = sd->groups;

```
        struct sched_group *pack = sg;
```

        struct sched_group *tmp = sg->next;

        /* 1st CPU of the sched domain is a good candidate */

```
        if (id == -1)
```

                id = cpumask_first(sched_domain_span(sd));

          /* Find sched group of candidate */

```
          tmp = sd->groups;
```
```
          do {
```

                  if (cpumask_test_cpu(id, sched_group_cpus(tmp)))

```
                  {
```
```
                          sg = tmp;
```
```
                          break;
```
```
                  }
```

          } while (tmp = tmp->next, tmp != sd->groups);

```
          pack = sg;
```
```
          tmp = sg->next;
```

I have a new loop which solves your issue and others. I will use it for the next version

+ while (sd) { + struct sched_group *sg = sd->groups; + struct sched_group *pack = sg; + struct sched_group *tmp; + + /* The 1st CPU of the local group is a good candidate */ + id = cpumask_first(sched_group_cpus(pack)); + + /* loop the sched groups to find the best one */ + for (tmp = sg->next; tmp != sg; tmp = tmp->next) { + if (tmp->sgp->power * pack->group_weight > + pack->sgp->power * tmp->group_weight) + continue; + + if ((tmp->sgp->power * pack->group_weight == + pack->sgp->power * tmp->group_weight) + && (cpumask_first(sched_group_cpus(tmp)) >= id)) + continue; + + /* we have found a better group */ + pack = tmp; + + /* Take the 1st CPU of the new group */ + id = cpumask_first(sched_group_cpus(pack)); + } + + /* Look for another CPU than itself */ + if ((id != cpu) + || ((sd->parent) && !(sd->parent->flags && SD_LOAD_BALANCE))) + break; + + sd = sd->parent; + }

Regards, Vincent

...

Regards, Morten

...
        /* loop the sched groups to find the best one */
        while (tmp != sg) {
                if (tmp->sgp->power * sg->group_weight <
                                sg->sgp->power * tmp->group_weight)
                        pack = tmp;
                tmp = tmp->next;
        }
        /* we have found a better group */
        if (pack != sg)
                id = cpumask_first(sched_group_cpus(pack));
        /* Look for another CPU than itself */
        if ((id != cpu)
         || ((sd->parent) && !(sd->parent->flags && SD_LOAD_BALANCE)))
                break;
        sd = sd->parent;
}
pr_info(KERN_INFO "CPU%d packing on CPU%d\n", cpu, id);
per_cpu(sd_pack_buddy, cpu) = id;
+}

#if BITS_PER_LONG == 32 # define WMULT_CONST (~0UL) #else @@ -3073,6 +3130,55 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }

+static inline bool is_buddy_busy(int cpu) +{
struct rq *rq = cpu_rq(cpu);
/*
 * A busy buddy is a CPU with a high load or a small load with a lot of
 * running tasks.
 */
return ((rq->avg.usage_avg_sum << rq->nr_running) >
                rq->avg.runnable_avg_period);
+}

+static inline bool is_light_task(struct task_struct *p) +{
/* A light task runs less than 25% in average */
return ((p->se.avg.usage_avg_sum << 2) < p->se.avg.runnable_avg_period);
+}

+static int check_pack_buddy(int cpu, struct task_struct *p) +{
int buddy = per_cpu(sd_pack_buddy, cpu);
/* No pack buddy for this CPU */
if (buddy == -1)
        return false;
/*
 * If a task is waiting for running on the CPU which is its own buddy,
 * let the default behavior to look for a better CPU if available
 * The threshold has been set to 37.5%
 */
if ((buddy == cpu)
 && ((p->se.avg.usage_avg_sum << 3) < (p->se.avg.runnable_avg_sum * 5)))
        return false;
/* buddy is not an allowed CPU */
if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
        return false;
/*
 * If the task is a small one and the buddy is not overloaded,
 * we use buddy cpu
 */
 if (!is_light_task(p) || is_buddy_busy(buddy))
        return false;
return true;
+}

/*

sched_balance_self: balance the current task (running on cpu) in domains

that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and

@@ -3098,6 +3204,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) if (p->nr_cpus_allowed == 1) return prev_cpu;
if (check_pack_buddy(cpu, p))
        return per_cpu(sd_pack_buddy, cpu);
if (sd_flag & SD_BALANCE_WAKE) {
        if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
                want_affine = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index a95d5c1..086d8bf 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -875,6 +875,7 @@ static inline void idle_balance(int cpu, struct rq *rq)

extern void sysrq_sched_debug_show(void); extern void sched_init_granularity(void); +extern void update_packing_domain(int cpu); extern void update_max_interval(void); extern void update_group_power(struct sched_domain *sd, int cpu); extern int update_runtime(struct notifier_block *nfb, unsigned long action, void *hcpu); -- 1.7.9.5

linaro-dev mailing list linaro-dev@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-dev

Morten Rasmussen

20 Nov 20 Nov

2:28 p.m.

New subject: [RFC 3/6] sched: pack small tasks

Hi Vincent,

On Mon, Nov 12, 2012 at 01:51:00PM +0000, Vincent Guittot wrote:

...

On 9 November 2012 18:13, Morten Rasmussen Morten.Rasmussen@arm.com wrote:

...
Hi Vincent,

I have experienced suboptimal buddy selection on a dual cluster setup (ARM TC2) if SD_SHARE_POWERLINE is enabled at MC level and disabled at CPU level. This seems to be the correct flag settings for a system with only cluster level power gating.

To me it looks like update_packing_domain() is not doing the right thing. See inline comments below.

Hi Morten,

Thanks for testing the patches.

It seems that I have too optimized the loop and remove some use cases.

...
On Sun, Oct 07, 2012 at 08:43:55AM +0100, Vincent Guittot wrote:

...
During sched_domain creation, we define a pack buddy CPU if available.

On a system that share the powerline at all level, the buddy is set to -1

On a dual clusters / dual cores system which can powergate each core and cluster independantly, the buddy configuration will be : | CPU0 | CPU1 | CPU2 | CPU3 |

buddy | CPU0 | CPU0 | CPU0 | CPU2 |

Small tasks tend to slip out of the periodic load balance. The best place to choose to migrate them is at their wake up.

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

kernel/sched/core.c | 1 + kernel/sched/fair.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 1 + 3 files changed, 111 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dab7908..70cadbe 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) rcu_assign_pointer(rq->sd, sd); destroy_sched_domains(tmp, cpu);
update_packing_domain(cpu);
update_top_cache_domain(cpu);
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4f4a4f6..8c9d3ed 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -157,6 +157,63 @@ void sched_init_granularity(void) update_sysctl(); }

+/*

Save the id of the optimal CPU that should be used to pack small tasks

The value -1 is used when no buddy has been found

*/

+DEFINE_PER_CPU(int, sd_pack_buddy);

+/* Look for the best buddy CPU that can be used to pack small tasks

We make the assumption that it doesn't wort to pack on CPU that share the

same powerline. We looks for the 1st sched_domain without the

SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the lowest

power per core based on the assumption that their power efficiency is

better */

+void update_packing_domain(int cpu) +{
struct sched_domain *sd;
int id = -1;
sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
if (!sd)
        sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
else
        sd = sd->parent;
sd is the highest level where SD_SHARE_POWERLINE is enabled so the sched groups of the parent level would represent the power domains. If get it right, we want to pack inside the cluster first and only let first cpu
You probably wanted to use sched_group instead of cluster because cluster is only a special use case, didn't you ?

...
of the cluster do packing on another cluster. So all cpus - except the first one - in the current sched domain should find its buddy within the domain and only the first one should go to the parent sched domain to find its buddy.

We don't want to pack in the current sched_domain because it shares power domain. We want to pack at the parent level

Yes. I think we mean the same thing. The packing takes place at the parent sched_domain but the sched_group that we are looking at only contains the cpus of the level below.

...

I propose the following fix:

```
          sd = sd->parent;
```

          if (cpumask_first(sched_domain_span(sd)) == cpu

```
                  || !sd->parent)
```
```
                  sd = sd->parent;
```

We always look for the buddy in the parent level whatever the cpu position in the mask is.

...

```
while (sd) {
```

        struct sched_group *sg = sd->groups;

```
        struct sched_group *pack = sg;
```

        struct sched_group *tmp = sg->next;

        /* 1st CPU of the sched domain is a good candidate */

```
        if (id == -1)
```

                id = cpumask_first(sched_domain_span(sd));

          /* Find sched group of candidate */

```
          tmp = sd->groups;
```
```
          do {
```

                  if (cpumask_test_cpu(id, sched_group_cpus(tmp)))

```
                  {
```
```
                          sg = tmp;
```
```
                          break;
```
```
                  }
```

          } while (tmp = tmp->next, tmp != sd->groups);

```
          pack = sg;
```
```
          tmp = sg->next;
```

I have a new loop which solves your issue and others. I will use it for the next version

while (sd) {
```
struct sched_group *sg = sd->groups;
```
```
struct sched_group *pack = sg;
```
```
struct sched_group *tmp;
```

/* The 1st CPU of the local group is a good candidate */

id = cpumask_first(sched_group_cpus(pack));

You make the assumption that the first sched_group in the list always contains the current cpu. I think that is always the case, but I haven't verified it. Maybe a comment about this would help people to understand the code easier.

...

/* loop the sched groups to find the best one */

for (tmp = sg->next; tmp != sg; tmp = tmp->next) {

	if (tmp->sgp->power * pack->group_weight >

			pack->sgp->power * tmp->group_weight)

```
		continue;
```

	if ((tmp->sgp->power * pack->group_weight ==

			pack->sgp->power * tmp->group_weight)

	 && (cpumask_first(sched_group_cpus(tmp)) >= id))

```
		continue;
```
```
	/* we have found a better group */
```
```
	pack = tmp;
```

	/* Take the 1st CPU of the new group */

	id = cpumask_first(sched_group_cpus(pack));

```
}
```

Works great on my setup.

...

```
/* Look for another CPU than itself */
```
```
if ((id != cpu)
```

 || ((sd->parent) && !(sd->parent->flags && SD_LOAD_BALANCE)))

```
	break;
```

If I understand correctly the last part of this check should avoid selecting a buddy in a sched_group that is not load balanced with the current one. In that case, I think that this check (or a similar check) should be done before the loop as well. As it is, the first iteration of the loop will always search all the groups of the first domain where SD_SHARE_POWERLINE is disabled regardless of the state of SD_LOAD_BALANCE flag. So if they are both disabled at the same level packing will happen across groups that are not supposed to be load-balanced.

Regards, Morten

...

sd = sd->parent;
}
Regards, Vincent

...
Regards, Morten

...
        /* loop the sched groups to find the best one */
        while (tmp != sg) {
                if (tmp->sgp->power * sg->group_weight <
                                sg->sgp->power * tmp->group_weight)
                        pack = tmp;
                tmp = tmp->next;
        }
        /* we have found a better group */
        if (pack != sg)
                id = cpumask_first(sched_group_cpus(pack));
        /* Look for another CPU than itself */
        if ((id != cpu)
         || ((sd->parent) && !(sd->parent->flags && SD_LOAD_BALANCE)))
                break;
        sd = sd->parent;
}
pr_info(KERN_INFO "CPU%d packing on CPU%d\n", cpu, id);
per_cpu(sd_pack_buddy, cpu) = id;
+}

#if BITS_PER_LONG == 32 # define WMULT_CONST (~0UL) #else @@ -3073,6 +3130,55 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }

+static inline bool is_buddy_busy(int cpu) +{
struct rq *rq = cpu_rq(cpu);
/*
 * A busy buddy is a CPU with a high load or a small load with a lot of
 * running tasks.
 */
return ((rq->avg.usage_avg_sum << rq->nr_running) >
                rq->avg.runnable_avg_period);
+}

+static inline bool is_light_task(struct task_struct *p) +{
/* A light task runs less than 25% in average */
return ((p->se.avg.usage_avg_sum << 2) < p->se.avg.runnable_avg_period);
+}

+static int check_pack_buddy(int cpu, struct task_struct *p) +{
int buddy = per_cpu(sd_pack_buddy, cpu);
/* No pack buddy for this CPU */
if (buddy == -1)
        return false;
/*
 * If a task is waiting for running on the CPU which is its own buddy,
 * let the default behavior to look for a better CPU if available
 * The threshold has been set to 37.5%
 */
if ((buddy == cpu)
 && ((p->se.avg.usage_avg_sum << 3) < (p->se.avg.runnable_avg_sum * 5)))
        return false;
/* buddy is not an allowed CPU */
if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
        return false;
/*
 * If the task is a small one and the buddy is not overloaded,
 * we use buddy cpu
 */
 if (!is_light_task(p) || is_buddy_busy(buddy))
        return false;
return true;
+}

/*

sched_balance_self: balance the current task (running on cpu) in domains

that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and

@@ -3098,6 +3204,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) if (p->nr_cpus_allowed == 1) return prev_cpu;
if (check_pack_buddy(cpu, p))
        return per_cpu(sd_pack_buddy, cpu);
if (sd_flag & SD_BALANCE_WAKE) {
        if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
                want_affine = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index a95d5c1..086d8bf 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -875,6 +875,7 @@ static inline void idle_balance(int cpu, struct rq *rq)

extern void sysrq_sched_debug_show(void); extern void sched_init_granularity(void); +extern void update_packing_domain(int cpu); extern void update_max_interval(void); extern void update_group_power(struct sched_domain *sd, int cpu); extern int update_runtime(struct notifier_block *nfb, unsigned long action, void *hcpu); -- 1.7.9.5

linaro-dev mailing list linaro-dev@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-dev

Vincent Guittot

4:59 p.m.

New subject: [RFC 3/6] sched: pack small tasks

On 20 November 2012 15:28, Morten Rasmussen Morten.Rasmussen@arm.com wrote:

...

Hi Vincent,

On Mon, Nov 12, 2012 at 01:51:00PM +0000, Vincent Guittot wrote:

...
On 9 November 2012 18:13, Morten Rasmussen Morten.Rasmussen@arm.com wrote:

...
Hi Vincent,

I have experienced suboptimal buddy selection on a dual cluster setup (ARM TC2) if SD_SHARE_POWERLINE is enabled at MC level and disabled at CPU level. This seems to be the correct flag settings for a system with only cluster level power gating.

To me it looks like update_packing_domain() is not doing the right thing. See inline comments below.

Hi Morten,

Thanks for testing the patches.

It seems that I have too optimized the loop and remove some use cases.

...
On Sun, Oct 07, 2012 at 08:43:55AM +0100, Vincent Guittot wrote:

...
During sched_domain creation, we define a pack buddy CPU if available.

On a system that share the powerline at all level, the buddy is set to -1

On a dual clusters / dual cores system which can powergate each core and cluster independantly, the buddy configuration will be : | CPU0 | CPU1 | CPU2 | CPU3 |

buddy | CPU0 | CPU0 | CPU0 | CPU2 |

Small tasks tend to slip out of the periodic load balance. The best place to choose to migrate them is at their wake up.

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

kernel/sched/core.c | 1 + kernel/sched/fair.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 1 + 3 files changed, 111 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c index dab7908..70cadbe 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6131,6 +6131,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) rcu_assign_pointer(rq->sd, sd); destroy_sched_domains(tmp, cpu);
update_packing_domain(cpu);
update_top_cache_domain(cpu);
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4f4a4f6..8c9d3ed 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -157,6 +157,63 @@ void sched_init_granularity(void) update_sysctl(); }

+/*

Save the id of the optimal CPU that should be used to pack small tasks

The value -1 is used when no buddy has been found

*/

+DEFINE_PER_CPU(int, sd_pack_buddy);

+/* Look for the best buddy CPU that can be used to pack small tasks

We make the assumption that it doesn't wort to pack on CPU that share the

same powerline. We looks for the 1st sched_domain without the

SD_SHARE_POWERLINE flag. Then We look for the sched_group witht the lowest

power per core based on the assumption that their power efficiency is

better */

+void update_packing_domain(int cpu) +{
struct sched_domain *sd;
int id = -1;
sd = highest_flag_domain(cpu, SD_SHARE_POWERLINE);
if (!sd)
        sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
else
        sd = sd->parent;
sd is the highest level where SD_SHARE_POWERLINE is enabled so the sched groups of the parent level would represent the power domains. If get it right, we want to pack inside the cluster first and only let first cpu
You probably wanted to use sched_group instead of cluster because cluster is only a special use case, didn't you ?

...
of the cluster do packing on another cluster. So all cpus - except the first one - in the current sched domain should find its buddy within the domain and only the first one should go to the parent sched domain to find its buddy.

We don't want to pack in the current sched_domain because it shares power domain. We want to pack at the parent level
Yes. I think we mean the same thing. The packing takes place at the parent sched_domain but the sched_group that we are looking at only contains the cpus of the level below.

...
...
I propose the following fix:
          sd = sd->parent;
          if (cpumask_first(sched_domain_span(sd)) == cpu
                  || !sd->parent)
                  sd = sd->parent;
We always look for the buddy in the parent level whatever the cpu position in the mask is.

...
...
while (sd) {
        struct sched_group *sg = sd->groups;
        struct sched_group *pack = sg;
        struct sched_group *tmp = sg->next;
        /* 1st CPU of the sched domain is a good candidate */
        if (id == -1)
                id = cpumask_first(sched_domain_span(sd));
There is no guarantee that id is in the sched group pointed to by sd->groups, which is implicitly assumed later in the search loop. We need to find the sched group that contains id and point sg to that instead. I haven't found an elegant way to find that group, but the fix below should at least give the right result.
          /* Find sched group of candidate */
          tmp = sd->groups;
          do {
                  if (cpumask_test_cpu(id, sched_group_cpus(tmp)))
                  {
                          sg = tmp;
                          break;
                  }
          } while (tmp = tmp->next, tmp != sd->groups);
          pack = sg;
          tmp = sg->next;
I have a new loop which solves your issue and others. I will use it for the next version
while (sd) {
        struct sched_group *sg = sd->groups;
        struct sched_group *pack = sg;
        struct sched_group *tmp;
        /* The 1st CPU of the local group is a good candidate */
        id = cpumask_first(sched_group_cpus(pack));
You make the assumption that the first sched_group in the list always contains the current cpu. I think that is always the case, but I haven't verified it. Maybe a comment about this would help people to understand the code easier.

yes, the first sched_group contains the cpu. I will add a comment

...

        /* loop the sched groups to find the best one */

        for (tmp = sg->next; tmp != sg; tmp = tmp->next) {

                if (tmp->sgp->power * pack->group_weight >

                                pack->sgp->power * tmp->group_weight)

```
                        continue;
```

                if ((tmp->sgp->power * pack->group_weight ==

                                pack->sgp->power * tmp->group_weight)

                 && (cpumask_first(sched_group_cpus(tmp)) >= id))

```
                        continue;
```

                /* we have found a better group */

```
                pack = tmp;
```

                /* Take the 1st CPU of the new group */

                id = cpumask_first(sched_group_cpus(pack));

```
        }
```

Works great on my setup.

...

        /* Look for another CPU than itself */

```
        if ((id != cpu)
```

         || ((sd->parent) && !(sd->parent->flags && SD_LOAD_BALANCE)))

```
                break;
```

you're right, i'm going to fix it

Thanks

...

Regards, Morten

...
        sd = sd->parent;
}
Regards, Vincent

...
Regards, Morten

...
        /* loop the sched groups to find the best one */
        while (tmp != sg) {
                if (tmp->sgp->power * sg->group_weight <
                                sg->sgp->power * tmp->group_weight)
                        pack = tmp;
                tmp = tmp->next;
        }
        /* we have found a better group */
        if (pack != sg)
                id = cpumask_first(sched_group_cpus(pack));
        /* Look for another CPU than itself */
        if ((id != cpu)
         || ((sd->parent) && !(sd->parent->flags && SD_LOAD_BALANCE)))
                break;
        sd = sd->parent;
}
pr_info(KERN_INFO "CPU%d packing on CPU%d\n", cpu, id);
per_cpu(sd_pack_buddy, cpu) = id;
+}

#if BITS_PER_LONG == 32 # define WMULT_CONST (~0UL) #else @@ -3073,6 +3130,55 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }

+static inline bool is_buddy_busy(int cpu) +{
struct rq *rq = cpu_rq(cpu);
/*
 * A busy buddy is a CPU with a high load or a small load with a lot of
 * running tasks.
 */
return ((rq->avg.usage_avg_sum << rq->nr_running) >
                rq->avg.runnable_avg_period);
+}

+static inline bool is_light_task(struct task_struct *p) +{
/* A light task runs less than 25% in average */
return ((p->se.avg.usage_avg_sum << 2) < p->se.avg.runnable_avg_period);
+}

+static int check_pack_buddy(int cpu, struct task_struct *p) +{
int buddy = per_cpu(sd_pack_buddy, cpu);
/* No pack buddy for this CPU */
if (buddy == -1)
        return false;
/*
 * If a task is waiting for running on the CPU which is its own buddy,
 * let the default behavior to look for a better CPU if available
 * The threshold has been set to 37.5%
 */
if ((buddy == cpu)
 && ((p->se.avg.usage_avg_sum << 3) < (p->se.avg.runnable_avg_sum * 5)))
        return false;
/* buddy is not an allowed CPU */
if (!cpumask_test_cpu(buddy, tsk_cpus_allowed(p)))
        return false;
/*
 * If the task is a small one and the buddy is not overloaded,
 * we use buddy cpu
 */
 if (!is_light_task(p) || is_buddy_busy(buddy))
        return false;
return true;
+}

/*

sched_balance_self: balance the current task (running on cpu) in domains

that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and

@@ -3098,6 +3204,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) if (p->nr_cpus_allowed == 1) return prev_cpu;
if (check_pack_buddy(cpu, p))
        return per_cpu(sd_pack_buddy, cpu);
if (sd_flag & SD_BALANCE_WAKE) {
        if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
                want_affine = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index a95d5c1..086d8bf 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -875,6 +875,7 @@ static inline void idle_balance(int cpu, struct rq *rq)

extern void sysrq_sched_debug_show(void); extern void sched_init_granularity(void); +extern void update_packing_domain(int cpu); extern void update_max_interval(void); extern void update_group_power(struct sched_domain *sd, int cpu); extern int update_runtime(struct notifier_block *nfb, unsigned long action, void *hcpu); -- 1.7.9.5

linaro-dev mailing list linaro-dev@lists.linaro.org http://lists.linaro.org/mailman/listinfo/linaro-dev

Vincent Guittot

7 Oct 7 Oct

7:43 a.m.

New subject: [RFC 4/6] sched: secure access to other CPU statistics

The atomic update of runnable_avg_sum and runnable_avg_period are ensured by their size and the toolchain. But we must ensure to not read an old value for one field and a newly updated value for the other field. As we don't want to lock other CPU while reading these fields, we read twice each fields and check that no change have occured in the middle.

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8c9d3ed..6df53b5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3133,13 +3133,28 @@ static int select_idle_sibling(struct task_struct *p, int target) static inline bool is_buddy_busy(int cpu) { struct rq *rq = cpu_rq(cpu); + volatile u32 *psum = &rq->avg.runnable_avg_sum; + volatile u32 *pperiod = &rq->avg.runnable_avg_period; + u32 sum, new_sum, period, new_period; + int timeout = 10; + + while (timeout) { + sum = *psum; + period = *pperiod; + new_sum = *psum; + new_period = *pperiod; + + if ((sum == new_sum) && (period == new_period)) + break; + + timeout--; + }

/* * A busy buddy is a CPU with a high load or a small load with a lot of * running tasks. */ - return ((rq->avg.usage_avg_sum << rq->nr_running) > - rq->avg.runnable_avg_period); + return ((new_sum << rq->nr_running) > new_period); }

static inline bool is_light_task(struct task_struct *p)

-- 1.7.9.5

Santosh Shilimkar

24 Oct 24 Oct

3:21 p.m.

New subject: [RFC 4/6] sched: secure access to other CPU statistics

$subject is bit confusing here.

On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...

The atomic update of runnable_avg_sum and runnable_avg_period are ensured by their size and the toolchain. But we must ensure to not read an old value for one field and a newly updated value for the other field. As we don't want to lock other CPU while reading these fields, we read twice each fields and check that no change have occured in the middle.

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

kernel/sched/fair.c | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8c9d3ed..6df53b5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3133,13 +3133,28 @@ static int select_idle_sibling(struct task_struct *p, int target) static inline bool is_buddy_busy(int cpu) { struct rq *rq = cpu_rq(cpu);

volatile u32 *psum = &rq->avg.runnable_avg_sum;

volatile u32 *pperiod = &rq->avg.runnable_avg_period;

u32 sum, new_sum, period, new_period;

int timeout = 10;

So it can be 2 times read or more as well.

...

while (timeout) {
```
sum = *psum;
```
```
period = *pperiod;
```
```
new_sum = *psum;
```
```
new_period = *pperiod;
```

if ((sum == new_sum) && (period == new_period))

```
	break;
```
```
timeout--;
```
}

Seems like you did notice incorrect pair getting read for rq runnable_avg_sum and runnable_avg_period. Seems like the fix is to update them together under some lock to avoid such issues.

Regards Santosh

Vincent Guittot

29 Oct 29 Oct

1:18 p.m.

New subject: [RFC 4/6] sched: secure access to other CPU statistics

On 24 October 2012 17:21, Santosh Shilimkar santosh.shilimkar@ti.com wrote:

...

$subject is bit confusing here.

On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...
The atomic update of runnable_avg_sum and runnable_avg_period are ensured by their size and the toolchain. But we must ensure to not read an old value for one field and a newly updated value for the other field. As we don't want to lock other CPU while reading these fields, we read twice each fields and check that no change have occured in the middle.

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

kernel/sched/fair.c | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8c9d3ed..6df53b5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3133,13 +3133,28 @@ static int select_idle_sibling(struct task_struct *p, int target) static inline bool is_buddy_busy(int cpu) { struct rq *rq = cpu_rq(cpu);
  volatile u32 *psum = &rq->avg.runnable_avg_sum;
  volatile u32 *pperiod = &rq->avg.runnable_avg_period;
  u32 sum, new_sum, period, new_period;
  int timeout = 10;
So it can be 2 times read or more as well.

...
  while (timeout) {
          sum = *psum;
          period = *pperiod;
          new_sum = *psum;
          new_period = *pperiod;
          if ((sum == new_sum) && (period == new_period))
                  break;
          timeout--;
  }
Seems like you did notice incorrect pair getting read for rq runnable_avg_sum and runnable_avg_period. Seems like the fix is to update them together under some lock to avoid such issues.

My goal is to have a lock free mechanism because I don't want to lock another CPU while reading its statistic

...

Regards Santosh

Vincent Guittot

7 Oct 7 Oct

7:43 a.m.

New subject: [RFC 5/6] sched: pack the idle load balance

Look for an idle CPU close the pack buddy CPU whenever possible. The goal is to prevent the wake up of a CPU which doesn't share the power line of the pack CPU

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- kernel/sched/fair.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6df53b5..f76acdc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5158,7 +5158,25 @@ static struct {

static inline int find_new_ilb(int call_cpu) { + struct sched_domain *sd; int ilb = cpumask_first(nohz.idle_cpus_mask); + int buddy = per_cpu(sd_pack_buddy, call_cpu); + + /* + * If we have a pack buddy CPU, we try to run load balance on a CPU + * that is close to the buddy. + */ + if (buddy != -1) + for_each_domain(buddy, sd) { + if (sd->flags & SD_SHARE_CPUPOWER) + continue; + + ilb = cpumask_first_and(sched_domain_span(sd), + nohz.idle_cpus_mask); + + if (ilb < nr_cpu_ids) + break; + }

if (ilb < nr_cpu_ids && idle_cpu(ilb)) return ilb;

-- 1.7.9.5

Santosh Shilimkar

24 Oct 24 Oct

3:21 p.m.

New subject: [RFC 5/6] sched: pack the idle load balance

On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...

Look for an idle CPU close the pack buddy CPU whenever possible.

s/close/close to

...

The goal is to prevent the wake up of a CPU which doesn't share the power line of the pack CPU

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

kernel/sched/fair.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6df53b5..f76acdc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5158,7 +5158,25 @@ static struct {

static inline int find_new_ilb(int call_cpu) {
struct sched_domain *sd; int ilb = cpumask_first(nohz.idle_cpus_mask);

int buddy = per_cpu(sd_pack_buddy, call_cpu);

/*
* If we have a pack buddy CPU, we try to run load balance on a CPU
* that is close to the buddy.
*/
if (buddy != -1)
for_each_domain(buddy, sd) {
	if (sd->flags & SD_SHARE_CPUPOWER)
		continue;

Do you mean SD_SHARE_POWERLINE here ?

...

	ilb = cpumask_first_and(sched_domain_span(sd),

```
			nohz.idle_cpus_mask);
```
```
	if (ilb < nr_cpu_ids)
```
```
		break;
```
```
}
```
if (ilb < nr_cpu_ids && idle_cpu(ilb)) return ilb;

Can you please expand "idle CPU _close_ the pack buddy CPU" ?

Regards santosh

Vincent Guittot

29 Oct 29 Oct

1:27 p.m.

New subject: [RFC 5/6] sched: pack the idle load balance

On 24 October 2012 17:21, Santosh Shilimkar santosh.shilimkar@ti.com wrote:

...

On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...
Look for an idle CPU close the pack buddy CPU whenever possible.

s/close/close to

yes

...

...
The goal is to prevent the wake up of a CPU which doesn't share the power line of the pack CPU

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

kernel/sched/fair.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6df53b5..f76acdc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5158,7 +5158,25 @@ static struct {

static inline int find_new_ilb(int call_cpu) {
  struct sched_domain *sd;
  int ilb = cpumask_first(nohz.idle_cpus_mask);
  int buddy = per_cpu(sd_pack_buddy, call_cpu);
  /*
   * If we have a pack buddy CPU, we try to run load balance on a
CPU
   * that is close to the buddy.
   */
  if (buddy != -1)
          for_each_domain(buddy, sd) {
                  if (sd->flags & SD_SHARE_CPUPOWER)
                          continue;
Do you mean SD_SHARE_POWERLINE here ?

No, I just don't want to take hyperthread level for ILB

...

                  ilb = cpumask_first_and(sched_domain_span(sd),

                                  nohz.idle_cpus_mask);

                  if (ilb < nr_cpu_ids)

```
                          break;
```

          }

  if (ilb < nr_cpu_ids && idle_cpu(ilb))
          return ilb;

Can you please expand "idle CPU _close_ the pack buddy CPU" ?

The goal is to packed the tasks on the pack buddy CPU so when the scheduler needs to start ILB, I try to wake up a CPU that is close to the buddy and preferably in the same cluster

...

Regards santosh

Santosh Shilimkar

2 Nov 2 Nov

10:59 a.m.

New subject: [RFC 5/6] sched: pack the idle load balance

On Monday 29 October 2012 06:57 PM, Vincent Guittot wrote:

...

On 24 October 2012 17:21, Santosh Shilimkar santosh.shilimkar@ti.com wrote:

...
On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...
Look for an idle CPU close the pack buddy CPU whenever possible.

s/close/close to

yes

...
...
The goal is to prevent the wake up of a CPU which doesn't share the power line of the pack CPU

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

kernel/sched/fair.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6df53b5..f76acdc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5158,7 +5158,25 @@ static struct {

static inline int find_new_ilb(int call_cpu) {
  struct sched_domain *sd;
   int ilb = cpumask_first(nohz.idle_cpus_mask);
  int buddy = per_cpu(sd_pack_buddy, call_cpu);
  /*
   * If we have a pack buddy CPU, we try to run load balance on a
CPU
   * that is close to the buddy.
   */
  if (buddy != -1)
          for_each_domain(buddy, sd) {
                  if (sd->flags & SD_SHARE_CPUPOWER)
                          continue;
Do you mean SD_SHARE_POWERLINE here ?
No, I just don't want to take hyperthread level for ILB

...
...
                  ilb = cpumask_first_and(sched_domain_span(sd),
                                  nohz.idle_cpus_mask);
                  if (ilb < nr_cpu_ids)
                          break;
          }

   if (ilb < nr_cpu_ids && idle_cpu(ilb))
           return ilb;
Can you please expand "idle CPU _close_ the pack buddy CPU" ?
The goal is to packed the tasks on the pack buddy CPU so when the scheduler needs to start ILB, I try to wake up a CPU that is close to the buddy and preferably in the same cluster

I see your point now. Thanks for clarification.

Regards santosh

Vincent Guittot

7 Oct 7 Oct

7:43 a.m.

New subject: [RFC 6/6] ARM: sched: clear SD_SHARE_POWERLINE

The ARM platforms take advantage of packing small tasks on few cores. This is true even when the cores of a cluster can't be powergated independantly.

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org --- arch/arm/kernel/topology.c | 5 +++++ 1 file changed, 5 insertions(+)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 26c12c6..00511d0 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -226,6 +226,11 @@ static inline void update_cpu_power(unsigned int cpuid, unsigned int mpidr) {} */ struct cputopo_arm cpu_topology[NR_CPUS];

+int arch_sd_share_power_line(void) +{ + return 0*SD_SHARE_POWERLINE; +} + const struct cpumask *cpu_coregroup_mask(int cpu) { return &cpu_topology[cpu].core_sibling;

-- 1.7.9.5

Santosh Shilimkar

24 Oct 24 Oct

3:21 p.m.

New subject: [RFC 6/6] ARM: sched: clear SD_SHARE_POWERLINE

On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...

The ARM platforms take advantage of packing small tasks on few cores. This is true even when the cores of a cluster can't be powergated independently.

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

arch/arm/kernel/topology.c | 5 +++++ 1 file changed, 5 insertions(+)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 26c12c6..00511d0 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -226,6 +226,11 @@ static inline void update_cpu_power(unsigned int cpuid, unsigned int mpidr) {} */ struct cputopo_arm cpu_topology[NR_CPUS];

+int arch_sd_share_power_line(void) +{

return 0*SD_SHARE_POWERLINE;

+}

Making this selection of policy based on sched domain will better. Just gives the flexibility to choose a separate scheme for big and little systems which will be very convenient.

Regards Santosh

Vincent Guittot

29 Oct 29 Oct

1:28 p.m.

New subject: [RFC 6/6] ARM: sched: clear SD_SHARE_POWERLINE

On 24 October 2012 17:21, Santosh Shilimkar santosh.shilimkar@ti.com wrote:

...

On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...
The ARM platforms take advantage of packing small tasks on few cores. This is true even when the cores of a cluster can't be powergated independently.

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

arch/arm/kernel/topology.c | 5 +++++ 1 file changed, 5 insertions(+)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 26c12c6..00511d0 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -226,6 +226,11 @@ static inline void update_cpu_power(unsigned int cpuid, unsigned int mpidr) {} */ struct cputopo_arm cpu_topology[NR_CPUS];

+int arch_sd_share_power_line(void) +{
  return 0*SD_SHARE_POWERLINE;
+}
Making this selection of policy based on sched domain will better. Just gives the flexibility to choose a separate scheme for big and little systems which will be very convenient.

I agree that it would be more flexible to be able to set it for each level

...

Regards Santosh

Santosh Shilimkar

2 Nov 2 Nov

11 a.m.

New subject: [RFC 6/6] ARM: sched: clear SD_SHARE_POWERLINE

On Monday 29 October 2012 06:58 PM, Vincent Guittot wrote:

...

On 24 October 2012 17:21, Santosh Shilimkar santosh.shilimkar@ti.com wrote:

...
On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...
The ARM platforms take advantage of packing small tasks on few cores. This is true even when the cores of a cluster can't be powergated independently.

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

arch/arm/kernel/topology.c | 5 +++++ 1 file changed, 5 insertions(+)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 26c12c6..00511d0 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -226,6 +226,11 @@ static inline void update_cpu_power(unsigned int cpuid, unsigned int mpidr) {} */ struct cputopo_arm cpu_topology[NR_CPUS];

+int arch_sd_share_power_line(void) +{
  return 0*SD_SHARE_POWERLINE;
+}
Making this selection of policy based on sched domain will better. Just gives the flexibility to choose a separate scheme for big and little systems which will be very convenient.
I agree that it would be more flexible to be able to set it for each level

Will you be addressing that in next version then ?

Regards santosh

Vincent Guittot

12 Nov 12 Nov

8:23 a.m.

New subject: [RFC 6/6] ARM: sched: clear SD_SHARE_POWERLINE

On 2 November 2012 12:00, Santosh Shilimkar santosh.shilimkar@ti.com wrote:

...

On Monday 29 October 2012 06:58 PM, Vincent Guittot wrote:

...
On 24 October 2012 17:21, Santosh Shilimkar santosh.shilimkar@ti.com wrote:

...
On Sunday 07 October 2012 01:13 PM, Vincent Guittot wrote:

...
The ARM platforms take advantage of packing small tasks on few cores. This is true even when the cores of a cluster can't be powergated independently.

Signed-off-by: Vincent Guittot vincent.guittot@linaro.org

arch/arm/kernel/topology.c | 5 +++++ 1 file changed, 5 insertions(+)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 26c12c6..00511d0 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -226,6 +226,11 @@ static inline void update_cpu_power(unsigned int cpuid, unsigned int mpidr) {} */ struct cputopo_arm cpu_topology[NR_CPUS];

+int arch_sd_share_power_line(void) +{
  return 0*SD_SHARE_POWERLINE;
+}
Making this selection of policy based on sched domain will better. Just gives the flexibility to choose a separate scheme for big and little systems which will be very convenient.
I agree that it would be more flexible to be able to set it for each level
Will you be addressing that in next version then ?

Hi Santosh,

yes, I will try to address this point for the next version.

Vincent

...

Regards santosh

4890

days inactive

4934

days old

linaro-dev@lists.linaro.org

29 comments

participants

tags (0)

participants (3)

Morten Rasmussen
Santosh Shilimkar
Vincent Guittot