This patchset was previously part of the larger tasks packing patchset [1]. I have split the latter into (at least) three different patchsets to make things easier:
- configuration of sched_domain topology (this patchset)
- update and consolidation of cpu_power
- tasks packing algorithm
Based on Peter Z's proposal [2][3], this patchset modifies the way the sched_domain levels are configured, in order to let architectures add specific levels like the current BOOK level or the proposed power-gating level for the ARM architecture.
[1] https://lkml.org/lkml/2013/10/18/121
[2] https://lkml.org/lkml/2013/11/5/239
[3] https://lkml.org/lkml/2013/11/5/449
Vincent Guittot (6):
  sched: remove unused SCHED_INIT_NODE
  sched: rework of sched_domain topology definition
  sched: s390: create a dedicated topology table
  sched: powerpc: create a dedicated topology table
  sched: add a new SD_SHARE_POWERDOMAIN for sched_domain
  sched: ARM: create a dedicated scheduler topology table
 arch/arm/kernel/topology.c        |  26 ++++
 arch/ia64/include/asm/topology.h  |  24 ----
 arch/metag/include/asm/topology.h |  27 -----
 arch/powerpc/kernel/smp.c         |  35 ++++--
 arch/s390/include/asm/topology.h  |  13 +-
 arch/s390/kernel/topology.c       |  25 ++++
 arch/tile/include/asm/topology.h  |  33 ------
 include/linux/sched.h             |  30 +++++
 include/linux/topology.h          | 128 ++------------------
 kernel/sched/core.c               | 235 ++++++++++++++++++-------------------
 10 files changed, 237 insertions(+), 339 deletions(-)
Not used anymore since the new NUMA scheduler init sequence.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/metag/include/asm/topology.h | 27 ---------------------------
 1 file changed, 27 deletions(-)
diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h
index 8e9c0b3..e95f874 100644
--- a/arch/metag/include/asm/topology.h
+++ b/arch/metag/include/asm/topology.h
@@ -3,33 +3,6 @@
 
 #ifdef CONFIG_NUMA
 
-/* sched_domains SD_NODE_INIT for Meta machines */
-#define SD_NODE_INIT (struct sched_domain) {		\
-	.parent			= NULL,			\
-	.child			= NULL,			\
-	.groups			= NULL,			\
-	.min_interval		= 8,			\
-	.max_interval		= 32,			\
-	.busy_factor		= 32,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 2,			\
-	.busy_idx		= 3,			\
-	.idle_idx		= 2,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_FORK	\
-				| SD_BALANCE_EXEC	\
-				| SD_BALANCE_NEWIDLE	\
-				| SD_SERIALIZE,		\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 1,			\
-	.nr_balance_failed	= 0,			\
-	.max_newidle_lb_cost	= 0,			\
-	.next_decay_max_lb_cost	= jiffies,		\
-}
-
 #define cpu_to_node(cpu) ((void)(cpu), 0)
 #define parent_node(node) ((void)(node), 0)
We replace the old way to configure the scheduler topology with a new method which enables a platform to declare additional levels (if needed).

We still have a default topology table definition that can be used by platforms that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can be overwritten by an arch which wants to add new levels where load balancing makes sense, like BOOK or powergating levels.

For each level, we need a function pointer that returns the cpumask for each cpu, the flags configuration and a name. Each level must be a subset of the next one. The build sequence of the sched_domains will take care of removing useless levels, like those with only 1 CPU and those with the same CPU span and no more relevant information for load balancing than their child.
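As an illustration only (this sketch is not part of the patch; the PWR level and the cpu_powergate_mask() helper are made-up names), an arch that wants an extra power-gating level between MC and DIE would simply provide its own table and point sched_domain_topology at it:

static struct sched_domain_topology_level my_arch_topology[] = {
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
#endif
	/* hypothetical extra level: CPUs that are power-gated together */
	{ cpu_powergate_mask, SD_INIT_NAME(PWR) },
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};

sched_domain_topology = my_arch_topology;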
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/ia64/include/asm/topology.h |  24 ----
 arch/s390/include/asm/topology.h |   2 -
 arch/tile/include/asm/topology.h |  33 ------
 include/linux/sched.h            |  29 +++++
 include/linux/topology.h         | 128 +++------------------
 kernel/sched/core.c              | 227 +++++++++++++++++++-------------------
 6 files changed, 156 insertions(+), 287 deletions(-)
diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index a2496e4..20d12fa 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -46,30 +46,6 @@
 
 void build_cpu_to_node_map(void);
 
-#define SD_CPU_INIT (struct sched_domain) {		\
-	.parent			= NULL,			\
-	.child			= NULL,			\
-	.groups			= NULL,			\
-	.min_interval		= 1,			\
-	.max_interval		= 4,			\
-	.busy_factor		= 64,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 2,			\
-	.busy_idx		= 2,			\
-	.idle_idx		= 1,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_NEWIDLE	\
-				| SD_BALANCE_EXEC	\
-				| SD_BALANCE_FORK	\
-				| SD_WAKE_AFFINE,	\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 1,			\
-	.nr_balance_failed	= 0,			\
-}
-
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_SMP
diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index 05425b1..07763bd 100644
--- a/arch/s390/include/asm/topology.h
+++ b/arch/s390/include/asm/topology.h
@@ -64,8 +64,6 @@ static inline void s390_init_cpu_topology(void)
 };
 #endif
 
-#define SD_BOOK_INIT	SD_CPU_INIT
-
 #include <asm-generic/topology.h>
 
 #endif /* _ASM_S390_TOPOLOGY_H */
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index d15c0d8..9383118 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -44,39 +44,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
 /* For now, use numa node -1 for global allocation. */
 #define pcibus_to_node(bus) ((void)(bus), -1)
 
-/*
- * TILE architecture has many cores integrated in one processor, so we need
- * setup bigger balance_interval for both CPU/NODE scheduling domains to
- * reduce process scheduling costs.
- */
-
-/* sched_domains SD_CPU_INIT for TILE architecture */
-#define SD_CPU_INIT (struct sched_domain) {		\
-	.min_interval		= 4,			\
-	.max_interval		= 128,			\
-	.busy_factor		= 64,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 1,			\
-	.busy_idx		= 2,			\
-	.idle_idx		= 1,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-							\
-	.flags			= 1*SD_LOAD_BALANCE	\
-				| 1*SD_BALANCE_NEWIDLE	\
-				| 1*SD_BALANCE_EXEC	\
-				| 1*SD_BALANCE_FORK	\
-				| 0*SD_BALANCE_WAKE	\
-				| 0*SD_WAKE_AFFINE	\
-				| 0*SD_SHARE_CPUPOWER	\
-				| 0*SD_SHARE_PKG_RESOURCES \
-				| 0*SD_SERIALIZE	\
-				,			\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 32,			\
-}
-
 /* By definition, we create nodes based on online memory. */
 #define node_has_online_mem(nid) 1
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 825ed83..dbc35dd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -976,6 +976,35 @@ void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
 
 bool cpus_share_cache(int this_cpu, int that_cpu);
 
+typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
+
+#define SDTL_OVERLAP		0x01
+
+struct sd_data {
+	struct sched_domain **__percpu sd;
+	struct sched_group **__percpu sg;
+	struct sched_group_power **__percpu sgp;
+};
+
+struct sched_domain_topology_level {
+	sched_domain_mask_f mask;
+	int		    sd_flags;
+	int		    flags;
+	int		    numa_level;
+	struct sd_data      data;
+#ifdef CONFIG_SCHED_DEBUG
+	char                *name;
+#endif
+};
+
+extern struct sched_domain_topology_level *sched_domain_topology;
+
+#ifdef CONFIG_SCHED_DEBUG
+# define SD_INIT_NAME(type)		.name = #type
+#else
+# define SD_INIT_NAME(type)
+#endif
+
 #else /* CONFIG_SMP */
 
 struct sched_domain_attr;
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 12ae6ce..3a9db05 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -66,121 +66,6 @@ int arch_update_cpu_topology(void);
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
 #endif
 
-/*
- * Below are the 3 major initializers used in building sched_domains:
- * SD_SIBLING_INIT, for SMT domains
- * SD_CPU_INIT, for SMP domains
- *
- * Any architecture that cares to do any tuning to these values should do so
- * by defining their own arch-specific initializer in include/asm/topology.h.
- * A definition there will automagically override these default initializers
- * and allow arch-specific performance tuning of sched_domains.
- * (Only non-zero and non-null fields need be specified.)
- */
-
-#ifdef CONFIG_SCHED_SMT
-/* MCD - Do we really need this?  It is always on if CONFIG_SCHED_SMT is,
- * so can't we drop this in favor of CONFIG_SCHED_SMT?
- */
-#define ARCH_HAS_SCHED_WAKE_IDLE
-/* Common values for SMT siblings */
-#ifndef SD_SIBLING_INIT
-#define SD_SIBLING_INIT (struct sched_domain) {			\
-	.min_interval		= 1,					\
-	.max_interval		= 2,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 110,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 1*SD_SHARE_CPUPOWER			\
-				| 1*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				| 0*SD_PREFER_SIBLING			\
-				| arch_sd_sibling_asym_packing()	\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-	.smt_gain		= 1178,	/* 15% */			\
-	.max_newidle_lb_cost	= 0,					\
-	.next_decay_max_lb_cost	= jiffies,				\
-}
-#endif
-#endif /* CONFIG_SCHED_SMT */
-
-#ifdef CONFIG_SCHED_MC
-/* Common values for MC siblings. for now mostly derived from SD_CPU_INIT */
-#ifndef SD_MC_INIT
-#define SD_MC_INIT (struct sched_domain) {				\
-	.min_interval		= 1,					\
-	.max_interval		= 4,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 1*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-	.max_newidle_lb_cost	= 0,					\
-	.next_decay_max_lb_cost	= jiffies,				\
-}
-#endif
-#endif /* CONFIG_SCHED_MC */
-
-/* Common values for CPUs */
-#ifndef SD_CPU_INIT
-#define SD_CPU_INIT (struct sched_domain) {				\
-	.min_interval		= 1,					\
-	.max_interval		= 4,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				| 1*SD_PREFER_SIBLING			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-	.max_newidle_lb_cost	= 0,					\
-	.next_decay_max_lb_cost	= jiffies,				\
-}
-#endif
-
-#ifdef CONFIG_SCHED_BOOK
-#ifndef SD_BOOK_INIT
-#error Please define an appropriate SD_BOOK_INIT in include/asm/topology.h!!!
-#endif
-#endif /* CONFIG_SCHED_BOOK */
-
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DECLARE_PER_CPU(int, numa_node);
 
@@ -295,4 +180,17 @@ static inline int cpu_to_mem(int cpu)
 #define topology_core_cpumask(cpu)	cpumask_of(cpu)
 #endif
 
+#ifdef CONFIG_SCHED_SMT
+static inline const struct cpumask *cpu_smt_mask(int cpu)
+{
+	return topology_thread_cpumask(cpu);
+}
+#endif
+
+static inline const struct cpumask *cpu_cpu_mask(int cpu)
+{
+	return cpumask_of_node(cpu_to_node(cpu));
+}
+
 #endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee8004c..6f6a8f0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5588,17 +5588,6 @@ static int __init isolated_cpu_setup(char *str)
 __setup("isolcpus=", isolated_cpu_setup);
 
-static const struct cpumask *cpu_cpu_mask(int cpu)
-{
-	return cpumask_of_node(cpu_to_node(cpu));
-}
-
-struct sd_data {
-	struct sched_domain **__percpu sd;
-	struct sched_group **__percpu sg;
-	struct sched_group_power **__percpu sgp;
-};
-
 struct s_data {
 	struct sched_domain ** __percpu sd;
 	struct root_domain	*rd;
@@ -5611,21 +5600,6 @@ enum s_alloc {
 	sa_none,
 };
 
-struct sched_domain_topology_level;
-
-typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
-typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
-
-#define SDTL_OVERLAP		0x01
-
-struct sched_domain_topology_level {
-	sched_domain_init_f init;
-	sched_domain_mask_f mask;
-	int		    flags;
-	int		    numa_level;
-	struct sd_data      data;
-};
-
 /*
  * Build an iteration mask that can exclude certain CPUs from the upwards
  * domain traversal.
@@ -5854,34 +5828,6 @@ int __weak arch_sd_sibling_asym_packing(void)
  * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
  */
 
-#ifdef CONFIG_SCHED_DEBUG
-# define SD_INIT_NAME(sd, type)		sd->name = #type
-#else
-# define SD_INIT_NAME(sd, type)		do { } while (0)
-#endif
-
-#define SD_INIT_FUNC(type)						\
-static noinline struct sched_domain *					\
-sd_init_##type(struct sched_domain_topology_level *tl, int cpu)	\
-{									\
-	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);	\
-	*sd = SD_##type##_INIT;						\
-	SD_INIT_NAME(sd, type);						\
-	sd->private = &tl->data;					\
-	return sd;							\
-}
-
-SD_INIT_FUNC(CPU)
-#ifdef CONFIG_SCHED_SMT
- SD_INIT_FUNC(SIBLING)
-#endif
-#ifdef CONFIG_SCHED_MC
- SD_INIT_FUNC(MC)
-#endif
-#ifdef CONFIG_SCHED_BOOK
- SD_INIT_FUNC(BOOK)
-#endif
-
 static int default_relax_domain_level = -1;
 int sched_domain_level_max;
 
@@ -5969,97 +5915,148 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
 	*per_cpu_ptr(sdd->sgp, cpu) = NULL;
 }
 
-#ifdef CONFIG_SCHED_SMT
-static const struct cpumask *cpu_smt_mask(int cpu)
-{
-	return topology_thread_cpumask(cpu);
-}
-#endif
-
-/*
- * Topology list, bottom-up.
- */
-static struct sched_domain_topology_level default_topology[] = {
-#ifdef CONFIG_SCHED_SMT
-	{ sd_init_SIBLING, cpu_smt_mask, },
-#endif
-#ifdef CONFIG_SCHED_MC
-	{ sd_init_MC, cpu_coregroup_mask, },
-#endif
-#ifdef CONFIG_SCHED_BOOK
-	{ sd_init_BOOK, cpu_book_mask, },
-#endif
-	{ sd_init_CPU, cpu_cpu_mask, },
-	{ NULL, },
-};
-
-static struct sched_domain_topology_level *sched_domain_topology = default_topology;
-
-#define for_each_sd_topology(tl)			\
-	for (tl = sched_domain_topology; tl->init; tl++)
+static int sched_domains_curr_level;
 
 #ifdef CONFIG_NUMA
-
 static int sched_domains_numa_levels;
 static int *sched_domains_numa_distance;
 static struct cpumask ***sched_domains_numa_masks;
-static int sched_domains_curr_level;
-
-static inline int sd_local_flags(int level)
-{
-	if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE)
-		return 0;
+#endif
 
-	return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
-}
+/*
+ * SD_flags allowed in topology descriptions.
+ *
+ * SD_SHARE_CPUPOWER      - describes SMT topologies
+ * SD_SHARE_PKG_RESOURCES - describes shared caches
+ * SD_NUMA                - describes NUMA topologies
+ *
+ * Odd one out:
+ * SD_ASYM_PACKING        - describes SMT quirks
+ */
+#define TOPOLOGY_SD_FLAGS		\
+	(SD_SHARE_CPUPOWER |		\
+	 SD_SHARE_PKG_RESOURCES |	\
+	 SD_NUMA |			\
+	 SD_ASYM_PACKING)
 
 static struct sched_domain *
-sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
+sd_init(struct sched_domain_topology_level *tl, int cpu)
 {
 	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
-	int level = tl->numa_level;
-	int sd_weight = cpumask_weight(
-			sched_domains_numa_masks[level][cpu_to_node(cpu)]);
+	int sd_weight;
+
+	/*
+	 * Ugly hack to pass state to sd_numa_mask()...
+	 */
+	sched_domains_curr_level = tl->numa_level;
+
+	sd_weight = cpumask_weight(tl->mask(cpu));
+
+	if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
+			"wrong sd_flags in topology description\n"))
+		tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
 
 	*sd = (struct sched_domain){
 		.min_interval		= sd_weight,
 		.max_interval		= 2*sd_weight,
 		.busy_factor		= 32,
 		.imbalance_pct		= 125,
-		.cache_nice_tries	= 2,
-		.busy_idx		= 3,
-		.idle_idx		= 2,
+
+		.cache_nice_tries	= 0,
+		.busy_idx		= 0,
+		.idle_idx		= 0,
 		.newidle_idx		= 0,
 		.wake_idx		= 0,
 		.forkexec_idx		= 0,
 
 		.flags			= 1*SD_LOAD_BALANCE
 					| 1*SD_BALANCE_NEWIDLE
-					| 0*SD_BALANCE_EXEC
-					| 0*SD_BALANCE_FORK
+					| 1*SD_BALANCE_EXEC
+					| 1*SD_BALANCE_FORK
 					| 0*SD_BALANCE_WAKE
-					| 0*SD_WAKE_AFFINE
+					| 1*SD_WAKE_AFFINE
 					| 0*SD_SHARE_CPUPOWER
 					| 0*SD_SHARE_PKG_RESOURCES
-					| 1*SD_SERIALIZE
+					| 0*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
-					| 1*SD_NUMA
-					| sd_local_flags(level)
+					| 0*SD_NUMA
+					| tl->sd_flags
 					,
+
 		.last_balance		= jiffies,
 		.balance_interval	= sd_weight,
+		.smt_gain		= 0,
+		.max_newidle_lb_cost	= 0,
+		.next_decay_max_lb_cost	= jiffies,
+#ifdef CONFIG_SCHED_DEBUG
+		.name			= tl->name,
+#endif
 	};
-	SD_INIT_NAME(sd, NUMA);
-	sd->private = &tl->data;
 
 	/*
-	 * Ugly hack to pass state to sd_numa_mask()...
+	 * Convert topological properties into behaviour.
 	 */
-	sched_domains_curr_level = tl->numa_level;
+
+	if (sd->flags & SD_SHARE_CPUPOWER) {
+		sd->imbalance_pct = 110;
+		sd->smt_gain = 1178; /* ~15% */
+		sd->flags |= arch_sd_sibling_asym_packing();
+
+	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
+		sd->imbalance_pct = 117;
+		sd->cache_nice_tries = 1;
+		sd->busy_idx = 2;
+
+#ifdef CONFIG_NUMA
+	} else if (sd->flags & SD_NUMA) {
+		sd->cache_nice_tries = 2;
+		sd->busy_idx = 3;
+		sd->idle_idx = 2;
+
+		sd->flags |= SD_SERIALIZE;
+		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+			sd->flags &= ~(SD_BALANCE_EXEC |
+				       SD_BALANCE_FORK |
+				       SD_WAKE_AFFINE);
+		}
+
+#endif
+	} else {
+		sd->flags |= SD_PREFER_SIBLING;
+		sd->cache_nice_tries = 1;
+		sd->busy_idx = 2;
+		sd->idle_idx = 1;
+	}
+
+	sd->private = &tl->data;
 
 	return sd;
 }
 
+/*
+ * Topology list, bottom-up.
+ */
+static struct sched_domain_topology_level default_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_MC
+	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
+#endif
+#ifdef CONFIG_SCHED_BOOK
+	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
+struct sched_domain_topology_level *sched_domain_topology = default_topology;
+
+#define for_each_sd_topology(tl)			\
+	for (tl = sched_domain_topology; tl->mask; tl++)
+
+#ifdef CONFIG_NUMA
+
 static const struct cpumask *sd_numa_mask(int cpu)
 {
 	return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)];
@@ -6203,7 +6200,10 @@ static void sched_init_numa(void)
 		}
 	}
 
-	tl = kzalloc((ARRAY_SIZE(default_topology) + level) *
+	/* Compute default topology size */
+	for (i = 0; sched_domain_topology[i].mask; i++);
+
+	tl = kzalloc((i + level) *
 			sizeof(struct sched_domain_topology_level), GFP_KERNEL);
 	if (!tl)
 		return;
@@ -6211,18 +6211,19 @@ static void sched_init_numa(void)
 	/*
 	 * Copy the default topology bits..
 	 */
-	for (i = 0; default_topology[i].init; i++)
-		tl[i] = default_topology[i];
+	for (i = 0; sched_domain_topology[i].mask; i++)
+		tl[i] = sched_domain_topology[i];
 
 	/*
	 * .. and append 'j' levels of NUMA goodness.
	 */
 	for (j = 0; j < level; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
-			.init = sd_numa_init,
 			.mask = sd_numa_mask,
+			.sd_flags = SD_NUMA,
 			.flags = SDTL_OVERLAP,
 			.numa_level = j,
+			SD_INIT_NAME(NUMA)
 		};
 	}
 
@@ -6380,7 +6381,7 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 		const struct cpumask *cpu_map, struct sched_domain_attr *attr,
 		struct sched_domain *child, int cpu)
 {
-	struct sched_domain *sd = tl->init(tl, cpu);
+	struct sched_domain *sd = sd_init(tl, cpu);
 	if (!sd)
 		return child;
On 05/03/14 07:18, Vincent Guittot wrote:
We replace the old way to configure the scheduler topology with a new method which enables a platform to declare additional levels (if needed).

We still have a default topology table definition that can be used by platforms that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can be overwritten by an arch which wants to add new levels where load balancing makes sense, like BOOK or powergating levels.

For each level, we need a function pointer that returns the cpumask for each cpu, the flags configuration and a name. Each level must be a subset of
Maybe it's worth mentioning here that those flags are from the set of sd topology flags, to distinguish them from the set of sd behavioural flags. The latter can't be set via this interface.
the next one. The build sequence of the sched_domains will take care of removing useless levels, like those with only 1 CPU and those with the same CPU span and no more relevant information for load balancing than their child.
[snip]
-#define for_each_sd_topology(tl)			\
-	for (tl = sched_domain_topology; tl->init; tl++)
Why is sched_domains_curr_level now outside #ifdef CONFIG_NUMA?
+static int sched_domains_curr_level;
[snip]
 static struct sched_domain *
-sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
+sd_init(struct sched_domain_topology_level *tl, int cpu)
 {
 	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
-	int level = tl->numa_level;
-	int sd_weight = cpumask_weight(
-			sched_domains_numa_masks[level][cpu_to_node(cpu)]);
+	int sd_weight;
The next line could be guarded by #ifdef CONFIG_NUMA. We still use #ifdef CONFIG_NUMA later in sd_init() though.
+	/*
+	 * Ugly hack to pass state to sd_numa_mask()...
+	 */
+	sched_domains_curr_level = tl->numa_level;
+
+	sd_weight = cpumask_weight(tl->mask(cpu));
+
+	if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
+			"wrong sd_flags in topology description\n"))
+		tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
 
 	*sd = (struct sched_domain){
 		.min_interval		= sd_weight,
 		.max_interval		= 2*sd_weight,
 		.busy_factor		= 32,
 		.imbalance_pct		= 125,
-		.cache_nice_tries	= 2,
-		.busy_idx		= 3,
-		.idle_idx		= 2,
+
+		.cache_nice_tries	= 0,
+		.busy_idx		= 0,
+		.idle_idx		= 0,
 		.newidle_idx		= 0,
 		.wake_idx		= 0,
 		.forkexec_idx		= 0,
Why do we want to explicitly set those indexes to 0 here? IMHO, the memory for *sd is zeroed out beforehand. This is true for all data members which are set to 0 later in this function, including the | 0*SD_FOO terms. Dropping them would IMHO make the code more readable.
[snip]
Never got the new name DIE for CPU? Might confuse people when they use /proc/sys/kernel/sched_domain/cpuX/domainY/name or sched_domain_debug_one().
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
[snip]
On 6 March 2014 01:09, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 05/03/14 07:18, Vincent Guittot wrote:
We replace the old way to configure the scheduler topology with a new method which enables a platform to declare additional levels (if needed).

We still have a default topology table definition that can be used by platforms that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can be overwritten by an arch which wants to add new levels where load balancing makes sense, like BOOK or powergating levels.

For each level, we need a function pointer that returns the cpumask for each cpu, the flags configuration and a name. Each level must be a subset of
Maybe it's worth mentioning here that those flags are from the set of sd topology flags, to distinguish them from the set of sd behavioural flags. The latter can't be set via this interface.
Yes, I will add the list of flags that can be set with the table.
the next one. The build sequence of the sched_domains will take care of removing useless levels, like those with only 1 CPU and those with the same CPU span and no more relevant information for load balancing than their child.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[snip]
-#define for_each_sd_topology(tl)			\
-	for (tl = sched_domain_topology; tl->init; tl++)
Why is sched_domains_curr_level now outside #ifdef CONFIG_NUMA?
It should not be, and the same goes for its use in sd_init().
+static int sched_domains_curr_level;
[snip]
 static struct sched_domain *
-sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
+sd_init(struct sched_domain_topology_level *tl, int cpu)
 {
 	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
-	int level = tl->numa_level;
-	int sd_weight = cpumask_weight(
-			sched_domains_numa_masks[level][cpu_to_node(cpu)]);
+	int sd_weight;
The next line could be guarded by #ifdef CONFIG_NUMA. We still use #ifdef CONFIG_NUMA later in sd_init() though.
+	/*
+	 * Ugly hack to pass state to sd_numa_mask()...
+	 */
+	sched_domains_curr_level = tl->numa_level;
+
+	sd_weight = cpumask_weight(tl->mask(cpu));
+
+	if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
+			"wrong sd_flags in topology description\n"))
+		tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
 
 	*sd = (struct sched_domain){
 		.min_interval		= sd_weight,
 		.max_interval		= 2*sd_weight,
 		.busy_factor		= 32,
 		.imbalance_pct		= 125,
-		.cache_nice_tries	= 2,
-		.busy_idx		= 3,
-		.idle_idx		= 2,
+
+		.cache_nice_tries	= 0,
+		.busy_idx		= 0,
+		.idle_idx		= 0,
 		.newidle_idx		= 0,
 		.wake_idx		= 0,
 		.forkexec_idx		= 0,
Why do we want to explicitly set those indexes to 0 here? IMHO, the memory for *sd is zeroed out beforehand. This is true for all data members which are set to 0 later in this function, including the | 0*SD_FOO terms. Dropping them would IMHO make the code more readable.
I would say that it makes the configuration more readable and modifiable, because you have the full list of possible flags to set.
[snip]
+/*
+ * Topology list, bottom-up.
+ */
+static struct sched_domain_topology_level default_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_MC
+	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
+#endif
+#ifdef CONFIG_SCHED_BOOK
+	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
+#endif
Never got the new name DIE for CPU? Might confuse people when they use /proc/sys/kernel/sched_domain/cpuX/domainY/name or sched_domain_debug_one().
In fact, CPU is also confusing because it's used for different things. But if it makes things even more confusing, I can come back to CPU.
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
+struct sched_domain_topology_level *sched_domain_topology = default_topology;
On Thu, Mar 06, 2014 at 04:32:35PM +0800, Vincent Guittot wrote:
Never got the new name DIE for CPU? Might confuse people when they use /proc/sys/kernel/sched_domain/cpuX/domainY/name or sched_domain_debug_one().
In fact, CPU is also confusing because it's used for different things. But if it makes things even more confusing, I can come back to CPU.
Yeah, not sure DIE is the right thing either; because there's multi-die packages that get classified under CPU :-)
Take for example the Core 2 Quad, which was 2 dual core dies glued together in a single package.
There's also the AMD bulldozer which glued two dies into a single package; but for those its not a problem because each die is a separate numa node, so there DIE would actually be the correct term and PACKAGE would be wrong.
So while CPU sucks, I'm not sure we can come up with anything that's actually correct. That said; we could try for something less wrong than CPU :-)
I'm not sure there are a lot of people who see/know the names of these domains to be bothered by a change in them; it might be limited to just us for all I know.
On 11 March 2014 11:31, Peter Zijlstra peterz@infradead.org wrote:
On Thu, Mar 06, 2014 at 04:32:35PM +0800, Vincent Guittot wrote:
Never got the new name DIE for CPU? Might confuse people when they use /proc/sys/kernel/sched_domain/cpuX/domainY/name or sched_domain_debug_one().
In fact, CPU is also confusing because it's used for different things. But if it makes things even more confusing, I can come back to CPU.
Yeah, not sure DIE is the right thing either; because there's multi-die packages that get classified under CPU :-)
Take for example the Core 2 Quad, which was 2 dual core dies glued together in a single package.
There's also the AMD bulldozer which glued two dies into a single package; but for those its not a problem because each die is a separate numa node, so there DIE would actually be the correct term and PACKAGE would be wrong.
So while CPU sucks, I'm not sure we can come up with anything that's actually correct. That said; we could try for something less wrong than CPU :-)
OK
Dietmar,
Have you got another naming than DIE that could suit better? Otherwise I will keep it.
Vincent
I'm not sure there are a lot of people who see/know the names of these domains to be bothered by a change in them; it might be limited to just us for all I know.
On 11/03/14 13:27, Vincent Guittot wrote:
On 11 March 2014 11:31, Peter Zijlstra peterz@infradead.org wrote:
On Thu, Mar 06, 2014 at 04:32:35PM +0800, Vincent Guittot wrote:
Never got the new name DIE for CPU? Might confuse people when they use /proc/sys/kernel/sched_domain/cpuX/domainY/name or sched_domain_debug_one().
In fact, CPU is also confusing because it's used for different things. But if it makes things even more confusing, I can come back to CPU.
Yeah, not sure DIE is the right thing either; because there's multi-die packages that get classified under CPU :-)
Take for example the Core 2 Quad, which was 2 dual core dies glued together in a single package.
There's also the AMD bulldozer which glued two dies into a single package; but for those its not a problem because each die is a separate numa node, so there DIE would actually be the correct term and PACKAGE would be wrong.
So while CPU sucks, I'm not sure we can come up with anything that's actually correct. That said; we could try for something less wrong than CPU :-)
OK
Dietmar,
Have you got another naming than DIE that could suit better? Otherwise I will keep it.
If backward compatibility is not an issue here, keep it.
-- Dietmar
Vincent
I'm not sure there are a lot of people who see/know the names of these domains to be bothered by a change in them; it might be limited to just us for all I know.
The BOOK level is only relevant for s390, so we create a dedicated topology table with the BOOK level and remove it from the default table.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/s390/include/asm/topology.h | 11 +----------
 arch/s390/kernel/topology.c      | 25 +++++++++++++++++++++++++
 kernel/sched/core.c              |  3 ---
 3 files changed, 26 insertions(+), 13 deletions(-)
diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index 07763bd..56af530 100644
--- a/arch/s390/include/asm/topology.h
+++ b/arch/s390/include/asm/topology.h
@@ -26,21 +26,12 @@ extern struct cpu_topology_s390 cpu_topology[NR_CPUS];
 
 #define mc_capable() 1
 
-static inline const struct cpumask *cpu_coregroup_mask(int cpu)
-{
-	return &cpu_topology[cpu].core_mask;
-}
-
-static inline const struct cpumask *cpu_book_mask(int cpu)
-{
-	return &cpu_topology[cpu].book_mask;
-}
-
 int topology_cpu_init(struct cpu *);
 int topology_set_cpu_management(int fc);
 void topology_schedule_update(void);
 void store_topology(struct sysinfo_15_1_x *info);
 void topology_expect_change(void);
+const struct cpumask *cpu_coregroup_mask(int cpu);
 
 #else /* CONFIG_SCHED_BOOK */
 
diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c
index 4b2e3e3..8a7ac47 100644
--- a/arch/s390/kernel/topology.c
+++ b/arch/s390/kernel/topology.c
@@ -443,6 +443,28 @@ int topology_cpu_init(struct cpu *cpu)
 	return sysfs_create_group(&cpu->dev.kobj, &topology_cpu_attr_group);
 }
 
+const struct cpumask *cpu_coregroup_mask(int cpu)
+{
+	return &cpu_topology[cpu].core_mask;
+}
+
+static const struct cpumask *cpu_book_mask(int cpu)
+{
+	return &cpu_topology[cpu].book_mask;
+}
+
+static struct sched_domain_topology_level s390_topology[] = {
+	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
+	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
+static void __init set_sched_topology(void)
+{
+	sched_domain_topology = s390_topology;
+}
+
 static int __init topology_init(void)
 {
 	if (!MACHINE_HAS_TOPOLOGY) {
@@ -452,6 +474,9 @@ static int __init topology_init(void)
 		set_topology_timer();
 out:
 	update_cpu_masks();
+
+	set_sched_topology();
+
 	return device_create_file(cpu_subsys.dev_root, &dev_attr_dispatching);
 }
 device_initcall(topology_init);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6f6a8f0..3479467 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6043,9 +6043,6 @@ static struct sched_domain_topology_level default_topology[] = {
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
 #endif
-#ifdef CONFIG_SCHED_BOOK
-	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
-#endif
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
Create a dedicated topology table for handling the asymmetric SMT feature. The current proposal creates a new level which describes which groups of CPUs take advantage of SD_ASYM_PACKING. The useless level will be removed during the build of the sched_domain topology.

Another solution would be to set SD_ASYM_PACKING in the sd_flags of the SMT level during the boot sequence, before the build of the sched_domain topology.
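A rough sketch of that second option (illustrative only, assuming the SMT entry is the first one of powerpc_topology[]) would be to patch the sd_flags of the SMT level at boot, before the sched_domains are built:

static void __init set_sched_topology(void)
{
#ifdef CONFIG_SCHED_SMT
	/* hypothetical variant: fold the asym quirk into the SMT level */
	if (cpu_has_feature(CPU_FTR_ASYM_SMT))
		powerpc_topology[0].sd_flags |= SD_ASYM_PACKING;
#endif
	sched_domain_topology = powerpc_topology;
}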
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/powerpc/kernel/smp.c | 35 +++++++++++++++++++++++++++--------
 kernel/sched/core.c       |  6 ------
 2 files changed, 27 insertions(+), 14 deletions(-)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ac2621a..75da054 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -755,6 +755,32 @@ int setup_profiling_timer(unsigned int multiplier)
 	return 0;
 }
 
+#ifdef CONFIG_SCHED_SMT
+/* cpumask of CPUs with asymetric SMT dependancy */
+static const struct cpumask *cpu_asmt_mask(int cpu)
+{
+	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
+		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
+		return topology_thread_cpumask(cpu);
+	}
+	return cpumask_of(cpu);
+}
+#endif
+
+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_asmt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING, SD_INIT_NAME(ASMT) },
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
+static void __init set_sched_topology(void)
+{
+	sched_domain_topology = powerpc_topology;
+}
+
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	cpumask_var_t old_mask;
@@ -779,15 +805,8 @@ void __init smp_cpus_done(unsigned int max_cpus)
 
 	dump_numa_cpu_topology();
 
-}
+	set_sched_topology();
 
-int arch_sd_sibling_asym_packing(void)
-{
-	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
-		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
-		return SD_ASYM_PACKING;
-	}
-	return 0;
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3479467..7606de0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5818,11 +5818,6 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 	atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
 }
 
-int __weak arch_sd_sibling_asym_packing(void)
-{
-	return 0*SD_ASYM_PACKING;
-}
-
 /*
  * Initializers for schedule domains
  * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
@@ -6000,7 +5995,6 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 	if (sd->flags & SD_SHARE_CPUPOWER) {
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */
-		sd->flags |= arch_sd_sibling_asym_packing();
 
 	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
 		sd->imbalance_pct = 117;
Hi Vincent,
On 03/05/2014 12:48 PM, Vincent Guittot wrote:
Create a dedicated topology table for handling the asymmetric SMT feature. The current proposal creates a new level which describes which groups of CPUs take advantage of SD_ASYM_PACKING. The useless level will be removed during the build of the sched_domain topology.

Another solution would be to set SD_ASYM_PACKING in the sd_flags of the SMT level during the boot sequence, before the build of the sched_domain topology.
Is the below what you mean by the other solution? If so, I would strongly recommend this approach rather than adding another level to the topology table to represent the asymmetric behaviour.
+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
+	  SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
Regards Preeti U Murthy
[snip]
On 11 March 2014 11:08, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
Hi Vincent,
On 03/05/2014 12:48 PM, Vincent Guittot wrote:
Create a dedicated topology table for handling the asymmetric SMT feature. The current proposal creates a new level which describes which groups of CPUs take advantage of SD_ASYM_PACKING. The useless level will be removed during the build of the sched_domain topology.

Another solution would be to set SD_ASYM_PACKING in the sd_flags of the SMT level during the boot sequence, before the build of the sched_domain topology.
Is the below what you mean by the other solution? If so, I would strongly recommend this approach rather than adding another level to the topology table to represent the asymmetric behaviour.
+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
+	  SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
Not exactly like that, but something like below:
+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
+static void __init set_sched_topology(void)
+{
+#ifdef CONFIG_SCHED_SMT
+	powerpc_topology[0].sd_flags |= arch_sd_sibling_asym_packing();
+#endif
+	sched_domain_topology = powerpc_topology;
+}
Regards Preeti U Murthy
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
arch/powerpc/kernel/smp.c | 35 +++++++++++++++++++++++++++-------- kernel/sched/core.c | 6 ------ 2 files changed, 27 insertions(+), 14 deletions(-)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ac2621a..75da054 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -755,6 +755,32 @@ int setup_profiling_timer(unsigned int multiplier)
 	return 0;
 }
 
+#ifdef CONFIG_SCHED_SMT
+/* cpumask of CPUs with asymmetric SMT dependency */
+static const struct cpumask *cpu_asmt_mask(int cpu)
+{
+	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
+		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
+		return topology_thread_cpumask(cpu);
+	}
+	return cpumask_of(cpu);
+}
+#endif
+
+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_asmt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING, SD_INIT_NAME(ASMT) },
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
+static void __init set_sched_topology(void)
+{
+	sched_domain_topology = powerpc_topology;
+}
+
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	cpumask_var_t old_mask;
@@ -779,15 +805,8 @@ void __init smp_cpus_done(unsigned int max_cpus)
 
 	dump_numa_cpu_topology();
 
-}
+	set_sched_topology();
 
-int arch_sd_sibling_asym_packing(void)
-{
-	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
-		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
-		return SD_ASYM_PACKING;
-	}
-	return 0;
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3479467..7606de0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5818,11 +5818,6 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 	atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
 }
 
-int __weak arch_sd_sibling_asym_packing(void)
-{
-	return 0*SD_ASYM_PACKING;
-}
-
 /*
  * Initializers for schedule domains
  * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
@@ -6000,7 +5995,6 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 	if (sd->flags & SD_SHARE_CPUPOWER) {
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */
-		sd->flags |= arch_sd_sibling_asym_packing();
 	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
 		sd->imbalance_pct = 117;
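A side note on why the extra ASMT level is harmless on machines without the feature: when CPU_FTR_ASYM_SMT is not set, cpu_asmt_mask() returns cpumask_of(cpu), so the ASMT domain spans a single CPU and the sched_domain build drops it. Below is a minimal stand-alone sketch of that pruning rule in plain C; it is a simplified model for illustration, not the kernel's sd_degenerate(), and the mask helpers are invented stand-ins.

#include <stdio.h>

/* Simplified model of a 4-CPU, 2-threads-per-core system. A level is
 * dropped for a CPU when the mask it returns contains only that CPU,
 * which is what happens to ASMT when CPU_FTR_ASYM_SMT is absent. */
static int asym_smt;	/* stands in for cpu_has_feature(CPU_FTR_ASYM_SMT) */

static unsigned long asmt_mask(int cpu)
{
	return asym_smt ? 0x3UL << (cpu & ~1) : 1UL << cpu;
}

static unsigned long smt_mask(int cpu) { return 0x3UL << (cpu & ~1); }
static unsigned long die_mask(int cpu) { (void)cpu; return 0xfUL; }

int main(void)
{
	struct { const char *name; unsigned long (*mask)(int); } tl[] = {
		{ "ASMT", asmt_mask }, { "SMT", smt_mask }, { "DIE", die_mask },
	};

	for (int i = 0; i < 3; i++) {
		unsigned long span = tl[i].mask(0);

		/* a single-CPU span has nothing to balance */
		printf("%-4s span=%#04lx -> %s\n", tl[i].name, span,
		       __builtin_popcountl(span) > 1 ? "kept" : "degenerates");
	}
	return 0;
}

With asym_smt left at 0, ASMT spans only CPU0 and degenerates while SMT and DIE survive; with asym_smt set, ASMT spans the same CPUs as SMT and the redundant one of the two is merged away by the same build logic.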
On 03/11/2014 06:48 PM, Vincent Guittot wrote:
On 11 March 2014 11:08, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
[snip]
Not exactly like that but something like below
+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
+static void __init set_sched_topology(void)
+{
+#ifdef CONFIG_SCHED_SMT
+	powerpc_topology[0].sd_flags |= arch_sd_sibling_asym_packing();
+#endif
+	sched_domain_topology = powerpc_topology;
+}
I think we can set it in powerpc_topology[] and not bother about setting additional flags outside of this array. It is clearer this way.
On an additional note, on powerpc we would want to clear the SD_SHARE_POWERDOMAIN flag at the CPU domain. On Power8, given that we have 8 threads per core, we would want to consolidate tasks to at least 4 threads without significant performance impact before spilling over to the other cores. By doing so, besides making use of the higher power of the core, we could do cpuidle management at the core level for the cores left idle as a result of this consolidation.
So the powerpc_topology[] would be something like the below:
+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
+	  SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() | SD_SHARE_POWERDOMAIN },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
We will probably take care of the amount of consolidation onto the threads of a core via cpu capacity or a similar parameter, but the above topology level would be required to ask the scheduler to try consolidating tasks to cores until the cpu capacity (3/4/5 threads) has been exceeded.
Regards Preeti U Murthy
On 12 March 2014 05:42, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
[snip]
I think we can set it in powerpc_topology[] and not bother about setting additional flags outside of this array. It is clearer this way.
IIRC, the arch_sd_sibling_asym_packing of powerpc can be set at runtime, which prevents it from being put directly in the table. Otherwise, it means that we should use a function pointer in the table for setting flags instead of a static value like in the current proposal.
On an additional note, on powerpc we would want to clear the SD_SHARE_POWERDOMAIN flag at the CPU domain. On Power8, given that we have 8 threads per core, we would want to consolidate tasks to at least 4 threads without significant performance impact before spilling over to the other cores. By doing so, besides making use of the higher power of the core, we could do cpuidle management at the core level for the cores left idle as a result of this consolidation.
OK. I will add the SD_SHARE_POWERDOMAIN like below.
Thanks Vincent
So the powerpc_topology[] would be something like the below:
+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
+	  SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() | SD_SHARE_POWERDOMAIN },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
We will probably take care of the amount of consolidation onto the threads of a core via cpu capacity or a similar parameter, but the above topology level would be required to ask the scheduler to try consolidating tasks to cores until the cpu capacity (3/4/5 threads) has been exceeded.
Regards Preeti U Murthy
On 12/03/14 07:44, Vincent Guittot wrote:
[snip]
I think we can set it in powerpc_topology[] and not bother about setting additional flags outside of this array. It is clearer this way.
IIRC, the arch_sd_sibling_asym_packing of powerpc can be set at runtime, which prevents it from being put directly in the table. Otherwise, it means that we should use a function pointer in the table for setting flags instead of a static value like in the current proposal.
Right,
static struct sched_domain_topology_level powerpc_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_asmt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING, SD_INIT_NAME(ASMT) },
	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | arch_sd_sibling_asym_packing() | SD_SHARE_POWERDOMAIN, SD_INIT_NAME(SMT) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};
is not compiling:
  CC      arch/powerpc/kernel/smp.o
arch/powerpc/kernel/smp.c:787:2: error: initializer element is not constant
arch/powerpc/kernel/smp.c:787:2: error: (near initialization for 'powerpc_topology[1].sd_flags')
So I'm in favour of a function pointer, even one w/o the 'int cpu' parameter, to circumvent the issue that it is too easy to create broken sd setups.
-- Dietmar
On 03/12/2014 04:34 PM, Dietmar Eggemann wrote:
On 12/03/14 07:44, Vincent Guittot wrote:
[snip]
So I'm in favour of a function pointer, even one w/o the 'int cpu' parameter, to circumvent the issue that it is too easy to create broken sd setups.
Alright, this looks fine to me. You could use the function pointer to retrieve flags and have all initializations of sched domain features consolidated in the table.
Regards Preeti U Murthy
On 03/12/2014 01:14 PM, Vincent Guittot wrote:
[snip]
IIRC, the arch_sd_sibling_asym_packing of powerpc can be set at runtime, which prevents it from being put directly in the table. Otherwise, it means that we should use a function pointer in the table for setting flags instead of a static value like in the current proposal.
Oh yes you are right. Then the above looks good to me. So we define the table as usual and set the asymmetric flag in set_sched_topology().
On an additional note, on powerpc we would want to clear the SD_SHARE_POWERDOMAIN flag at the CPU domain. On Power8, given that we have 8 threads per core, we would want to consolidate tasks to at least 4 threads without significant performance impact before spilling over to the other cores. By doing so, besides making use of the higher power of the core, we could do cpuidle management at the core level for the cores left idle as a result of this consolidation.
OK. I will add the SD_SHARE_POWERDOMAIN like below.
Thanks!
Regards Preeti U Murthy
A new flag SD_SHARE_POWERDOMAIN is created to reflect whether groups of CPUs in a sched_domain level can reach a different power state or not. As an example, the flag should be cleared at CPU level if groups of cores can be power gated independently. This information can be used to add a load balancing level between groups of CPUs that can power gate independently. The default behavior of the scheduler is to spread tasks across CPUs and groups of CPUs, so the flag is set in all sched_domains.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 9 ++++++---
 2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index dbc35dd..182a080 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -861,6 +861,7 @@ enum cpu_idle_type {
 #define SD_BALANCE_WAKE		0x0010	/* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
 #define SD_SHARE_CPUPOWER	0x0080	/* Domain members share cpu power */
+#define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
 #define SD_ASYM_PACKING		0x0800	/* Place busy groups earlier in the domain */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7606de0..b28cff0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5283,7 +5283,8 @@ static int sd_degenerate(struct sched_domain *sd)
 			 SD_BALANCE_FORK |
 			 SD_BALANCE_EXEC |
 			 SD_SHARE_CPUPOWER |
-			 SD_SHARE_PKG_RESOURCES)) {
+			 SD_SHARE_PKG_RESOURCES |
+			 SD_SHARE_POWERDOMAIN)) {
 		if (sd->groups != sd->groups->next)
 			return 0;
 	}
@@ -5314,7 +5315,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 				SD_BALANCE_EXEC |
 				SD_SHARE_CPUPOWER |
 				SD_SHARE_PKG_RESOURCES |
-				SD_PREFER_SIBLING);
+				SD_PREFER_SIBLING |
+				SD_SHARE_POWERDOMAIN);
 		if (nr_node_ids == 1)
 			pflags &= ~SD_SERIALIZE;
 	}
@@ -5932,7 +5934,8 @@ static struct cpumask		***sched_domains_numa_masks;
 	(SD_SHARE_CPUPOWER |		\
 	 SD_SHARE_PKG_RESOURCES |	\
 	 SD_NUMA |			\
-	 SD_ASYM_PACKING)
+	 SD_ASYM_PACKING |		\
+	 SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
 sd_init(struct sched_domain_topology_level *tl, int cpu)
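To illustrate how the new flag interacts with the degenerate code touched above, here is a minimal stand-alone sketch in plain C; it is a simplified model of the sd_parent_degenerate() idea, not the kernel code, with flag values copied from the sched.h hunk and spans reduced to plain bitmaps.

#include <stdio.h>

#define SD_SHARE_CPUPOWER	0x0080
#define SD_SHARE_POWERDOMAIN	0x0100
#define SD_SHARE_PKG_RESOURCES	0x0200

/* Simplified model: a parent domain collapses into its child when it
 * spans the same CPUs and owns no flag (of the compared set) that the
 * child lacks. */
static int parent_degenerates(unsigned long pspan, unsigned long cspan,
			      int pflags, int cflags)
{
	return pspan == cspan && (pflags & ~cflags) == 0;
}

int main(void)
{
	int gmc = SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
	int mc  = SD_SHARE_PKG_RESOURCES;

	/* all cores power gate together: GMC and MC span the same CPUs
	 * and MC adds no flag, so the MC parent collapses */
	printf("same span:      MC degenerates = %d\n",
	       parent_degenerates(0xf, 0xf, mc, gmc));

	/* cores power gate independently: GMC spans one core while MC
	 * spans the cluster, so both levels are kept */
	printf("different span: MC degenerates = %d\n",
	       parent_degenerates(0xf, 0x1, mc, gmc));
	return 0;
}

The hunks above add the flag to the sets that sd_degenerate()/sd_parent_degenerate() and sd_init() know about, so levels that differ only in power gating behaviour are recognised and collapsed consistently.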
Create a dedicated topology table for ARM which will create a new level to differentiate CPUs that can power gate independently from others.

The patch gives an example of how to add a domain that will take advantage of SD_SHARE_POWERDOMAIN.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/arm/kernel/topology.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 0bc94b1..ae8ffbc 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -185,6 +185,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return &cpu_topology[cpu].core_sibling;
 }
 
+/*
+ * The current assumption is that we can power gate each core independently.
+ * This will be superseded by DT binding once available.
+ */
+const struct cpumask *cpu_corepower_mask(int cpu)
+{
+	return &cpu_topology[cpu].thread_sibling;
+}
+
 static void update_siblings_masks(unsigned int cpuid)
 {
 	struct cputopo_arm *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
@@ -266,6 +275,20 @@ void store_cpu_topology(unsigned int cpuid)
 		cpu_topology[cpuid].socket_id, mpidr);
 }
 
+static struct sched_domain_topology_level arm_topology[] = {
+#ifdef CONFIG_SCHED_MC
+	{ cpu_corepower_mask, SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN, SD_INIT_NAME(GMC) },
+	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
+static void __init set_sched_topology(void)
+{
+	sched_domain_topology = arm_topology;
+}
+
 /*
  * init_cpu_topology is called at boot when only one cpu is running
  * which prevent simultaneous write access to cpu_topology array
@@ -289,4 +312,7 @@ void __init init_cpu_topology(void)
 	smp_wmb();
 
 	parse_dt_topology();
+
+	/* Set scheduler topology descriptor */
+	set_sched_topology();
 }
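The comment in cpu_corepower_mask() says the per-core power gating assumption will be superseded by a DT binding. Purely as an illustration of that direction, a DT-aware version could look like the sketch below; the power_sibling field and the DT parsing that would fill it are invented here and are not part of this patch.

/* Hypothetical only: assumes parse_dt_topology() filled a per-cpu
 * power_sibling cpumask from a future DT binding. */
const struct cpumask *cpu_corepower_mask(int cpu)
{
	if (!cpumask_empty(&cpu_topology[cpu].power_sibling))
		return &cpu_topology[cpu].power_sibling;

	/* fall back to the current per-core assumption */
	return &cpu_topology[cpu].thread_sibling;
}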
On 05/03/14 07:18, Vincent Guittot wrote:
Create a dedicated topology table for ARM which will create a new level to differentiate CPUs that can power gate independently from others.

The patch gives an example of how to add a domain that will take advantage of SD_SHARE_POWERDOMAIN.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/arm/kernel/topology.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 0bc94b1..ae8ffbc 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -185,6 +185,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return &cpu_topology[cpu].core_sibling;
 }
 
+/*
+ * The current assumption is that we can power gate each core independently.
+ * This will be superseded by DT binding once available.
+ */
+const struct cpumask *cpu_corepower_mask(int cpu)
+{
+	return &cpu_topology[cpu].thread_sibling;
+}
Although you already explained this to me in a private conversation, it's important to note that when running this set-up on a dual cluster TC2 (2 Cortex-A15 + 3 Cortex-A7, no independent core power gating), we don't see the SD_SHARE_POWERDOMAIN topology flag set in the sds on MC level, because this patch-set doesn't contain the appropriate DT parsing. Like you said back then, the comment above mentions this.
 static void update_siblings_masks(unsigned int cpuid)
 {
 	struct cputopo_arm *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
@@ -266,6 +275,20 @@ void store_cpu_topology(unsigned int cpuid)
 		cpu_topology[cpuid].socket_id, mpidr);
 }
 
+static struct sched_domain_topology_level arm_topology[] = {
+#ifdef CONFIG_SCHED_MC
+	{ cpu_corepower_mask, SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN, SD_INIT_NAME(GMC) },
+	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
+static void __init set_sched_topology(void)
+{
+	sched_domain_topology = arm_topology;
+}
+
 /*
  * init_cpu_topology is called at boot when only one cpu is running
  * which prevent simultaneous write access to cpu_topology array
@@ -289,4 +312,7 @@ void __init init_cpu_topology(void)
 	smp_wmb();
 
 	parse_dt_topology();
+
+	/* Set scheduler topology descriptor */
+	set_sched_topology();
 }
How about the core scheduler provides an interface set_sched_topology() instead of each arch having its own __init function? Sketched out below for ARM.
-- >8 --
Subject: [PATCH] sched: set_sched_topology() as an interface of core scheduler

---
 arch/arm/kernel/topology.c | 7 +------
 include/linux/sched.h      | 2 +-
 kernel/sched/core.c        | 5 +++++
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index ae8ffbc..89d5592 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -284,11 +284,6 @@ static struct sched_domain_topology_level arm_topology[] = {
 	{ NULL, },
 };
 
-static void __init set_sched_topology(void)
-{
-	sched_domain_topology = arm_topology;
-}
-
 /*
  * init_cpu_topology is called at boot when only one cpu is running
  * which prevent simultaneous write access to cpu_topology array
@@ -314,5 +309,5 @@ void __init init_cpu_topology(void)
 	parse_dt_topology();
 
 	/* Set scheduler topology descriptor */
-	set_sched_topology();
+	set_sched_topology(arm_topology);
 }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8831413..fefd4e7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -998,7 +998,7 @@ struct sched_domain_topology_level {
 #endif
 };
 
-extern struct sched_domain_topology_level *sched_domain_topology;
+extern void set_sched_topology(struct sched_domain_topology_level *tl);
 
 #ifdef CONFIG_SCHED_DEBUG
 # define SD_INIT_NAME(type)		.name = #type
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b4fb0df..a748c92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6105,6 +6105,11 @@ static struct sched_domain_topology_level default_topology[] = {
 
 struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
+void set_sched_topology(struct sched_domain_topology_level *tl)
+{
+	sched_domain_topology = tl;
+}
+
 #define for_each_sd_topology(tl)			\
 	for (tl = sched_domain_topology; tl->mask; tl++)
On 6 March 2014 06:38, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 05/03/14 07:18, Vincent Guittot wrote:
Create a dedicated topology table for ARM which will create a new level to differentiate CPUs that can power gate independently from others.

The patch gives an example of how to add a domain that will take advantage of SD_SHARE_POWERDOMAIN.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
[snip]
 	parse_dt_topology();
+
+	/* Set scheduler topology descriptor */
+	set_sched_topology();
 }
How about the core scheduler provides an interface set_sched_topology() instead of each arch having its own __init function? Sketched out below for ARM.
Yes, I will add it.
[snip]
On 05/03/14 07:18, Vincent Guittot wrote:
[snip]
Hi Vincent,
I reviewed your patch-set carefully (including test runs on TC2), especially due to the fact that we want to build our sd_energy stuff on top of it.
One thing I'm still not convinced of is that specifying additional sd levels in the struct sched_domain_topology_level table has an advantage over a function pointer for sd topology flags, similar to the one we're already using for the cpu mask in struct sched_domain_topology_level.
int (*sched_domain_flags_f)(int cpu);
This function pointer would simply be another member of struct sched_domain_topology_level and would replace int sd_flags. AFAICS, you have to create additional cpu mask functions anyway for the additional sd levels, like cpu_corepower_mask() for the GMC level in the ARM case. There could be a set of standard sd topology flags functions for the default sd layer, and archs which want to pass in something special would define those functions locally, since they will use them only in their arch specific struct sched_domain_topology_level table anyway.

I know that you use the existing sd degenerate functionality for this, and that the freeing of the redundant data structures (sched_domain, sched_group and sched_group_power) is there too, but it still doesn't seem to me to be the right thing to do.
The problem that we now expose internal data structures (struct sd_data and struct sched_domain_topology_level) could be dealt with later.
-- Dietmar
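For reference, a minimal sketch of the alternative Dietmar describes, with the static int sd_flags member replaced by a flags callback shaped like the existing mask callback; the member layout is an assumption based on this thread, not code from the patchset.

typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
typedef int (*sched_domain_flags_f)(int cpu);

struct sched_domain_topology_level {
	sched_domain_mask_f	mask;
	sched_domain_flags_f	sd_flags;	/* was: int sd_flags */
	/* ... remaining members (sd_data, name, ...) unchanged ... */
};

An MC entry would then read { cpu_coregroup_mask, cpu_coregroup_flags, SD_INIT_NAME(MC) }, with the flags function defined by the arch, as in the cpu_coregroup_flags() example further down the thread.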
On 6 March 2014 07:17, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
[snip]
Hi Vincent,
I reviewed your patch-set carefully (including test runs on TC2), especially due to the fact that we want to build our sd_energy stuff on top of it.
Thanks
One thing I'm still not convinced of is the fact that specifying additional sd levels in the struct sched_domain_topology_level table has an advantage over a function pointer for sd topology flags similar to the one we're already using for the cpu mask in struct sched_domain_topology_level.
int (*sched_domain_flags_f)(int cpu);
We have to create an additional level for some kinds of topology, as described in my trial https://lkml.org/lkml/2013/12/18/279, which is not possible with a function pointer.
Have you got a situation in mind where it would be necessary to use the function pointer with the cpu number as an argument?
In the current example of this patchset, the flags are statically set in the table, but nothing prevents an architecture from updating the flags value before it is given to the scheduler.
[snip]
On 06/03/14 09:04, Vincent Guittot wrote:
On 6 March 2014 07:17, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
[snip]
One thing I'm still not convinced of is the fact that specifying additional sd levels in the struct sched_domain_topology_level table has an advantage over a function pointer for sd topology flags similar to the one we're already using for the cpu mask in struct sched_domain_topology_level.
int (*sched_domain_flags_f)(int cpu);
We have to create an additional level for some kinds of topology, as described in my trial https://lkml.org/lkml/2013/12/18/279, which is not possible with a function pointer.
This is what I don't understand at the moment. In your example in the link above (2 clusters of 4 cores with SMT; cpus 0-7 can power gate while cpus 8-15 can't), why can't we have
static inline int cpu_coregroup_flags(int cpu)
{
	int flags = SD_SHARE_PKG_RESOURCES;

	if (cpu >= 8)
		flags |= SD_SHARE_POWERDOMAIN;

	return flags;
}
inside the arch specific topology file and use it in the MC level as the int (*sched_domain_flags_f)(int cpu) member of struct sched_domain_topology_level?
Have you got a situation in mind where it would be necessary to use the function pointer with the cpu number as an argument?
The one above. Fundamentally, all situations where you want to set sd topology flags differently for different cpus in the same sd level. big.LITTLE is another example, but it's the same as powergated/!powergated from this perspective.
In the current example of this patchset, the flags are statically set in the table, but nothing prevents an architecture from updating the flags value before it is given to the scheduler.
What will be the infrastructure if the arch wants to do so?
Thanks,
-- Dietmar
On 6 March 2014 20:31, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
[snip]
We have to create an additional level for some kinds of topology, as described in my trial https://lkml.org/lkml/2013/12/18/279, which is not possible with a function pointer.
This is what I don't understand at the moment. In your example in the link above (2 clusters of 4 cores with SMT; cpus 0-7 can power gate while cpus 8-15 can't), why can't we have
The 2nd example is a bit more complicated and needs an additional level to describe the topology
[snip]
The one above. Fundamentally, all situations where you want to set sd topology flags differently for different cpus in the same sd level. big.LITTLE is another example, but it's the same as powergated/!powergated from this perspective.
You see the flags of a level as something that can be changed per cpu, whereas the proposal is to define a number of levels with properties that are interesting for the scheduler and to associate a mask of cpus with these properties.

I don't have a strong opinion about using a cpu argument for setting the flags of a level or not (it was part of the initial proposal before we started to completely rework the build of the sched_domain). Nevertheless, I see one potential concern: you can have completely different flags configurations for the same sd level on 2 cpus.
Peter,
Was the use of the cpu as a parameter in the initialization of the sched_domain's flags a reason for asking for a rework of the sched_domain initialization?
In the current example of this patchset, the flags are statically set in the table, but nothing prevents an architecture from updating the flags value before it is given to the scheduler.
What will be the infrastructure if the arch wants to do so?
I'm not sure I catch your point. The arch is now able to define its own table and fill it before giving it to the scheduler, so nothing prevents it from setting/updating the flags field according to the system configuration during the boot sequence.
Thanks, Vincent
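To make the concern about broken setups concrete, here is a deliberately broken per-cpu flags function, invented for illustration with the flag macros from include/linux/sched.h: the two threads of one core would report contradictory flags for the single SMT domain that spans them both.

/* Deliberately broken: cpu0 and its SMT sibling cpu1 sit in the same
 * SMT-level domain but disagree on SD_SHARE_CPUPOWER. */
static int broken_smt_flags(int cpu)
{
	if (cpu & 1)
		return SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES;

	return SD_SHARE_PKG_RESOURCES;
}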
On 07/03/14 02:47, Vincent Guittot wrote:
[snip]
The 2nd example is a bit more complicated and needs an additional level to describe the topology
I see. In the 2nd example you want to have an additional MC level for cpus 2-7 because you want to load balance between those cpus more often than across cpus 0-7 for dst cpus from the set 2-7. Not sure if such topologies (where only a subset of cpus inside a cluster can't be powergated) exist today in the real world though?
from https://lkml.org/lkml/2013/12/18/279:
CPU2:
 domain 0: span 2-3 level: SMT
     flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
     groups: 0 1   <-- Doesn't this have to be 'groups: 2 3'?
  domain 1: span 2-7 level: MC
     flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
     groups: 2-7 4-5 6-7
   domain 2: span 0-7 level: MC
     flags: SD_SHARE_PKG_RESOURCES
     groups: 2-7 0-1
    domain 3: span 0-15 level: CPU
     flags:
     groups: 0-7 8-15
[snip]
You see the flags of a level as something that can be changed per cpu, whereas the proposal is to define a number of levels with properties that are interesting for the scheduler and to associate a mask of cpus with these properties.
That's right.
I don't have a strong opinion about using a cpu argument for setting the flags of a level or not (it was part of the initial proposal before we started to completely rework the build of the sched_domain). Nevertheless, I see one potential concern: you can have completely different flags configurations for the same sd level on 2 cpus.
Could you elaborate a little bit further on the last sentence? Do you think that such completely different flags configurations would make it impossible for the load-balance code to work at this sd at all?
[snip]
I'm not sure I catch your point. The arch is now able to define its own table and fill it before giving it to the scheduler, so nothing prevents it from setting/updating the flags field according to the system configuration during the boot sequence.
Sorry, misunderstanding on my side in the first place. You just said that the arch is able to change those flags (due to the arch specific topology table) and I related it to the possibility for the arch to set the topology flags per cpu.
Thanks,
-- Dietmar
On 8 March 2014 13:40, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
[snip]
I see. In the 2nd example you want to have an additional MC level for cpus 2-7 because you want to load balance between those cpus more often than across cpus 0-7 for dst cpus from the set 2-7. Not sure if such topologies (where only a subset of cpus inside a cluster can't be powergated) exist today in the real world though?
Be sure that such a topology is studied and considered by HW guys.
[snip]
Could you elaborate a little bit further on the last sentence? Do you think that those completely different flags configurations would make it impossible for the load-balance code to work at all at this sd?
With such a function, an arch could set some topology flags, like SD_SHARE_PKG_RESOURCES and SD_SHARE_CPUPOWER, differently within the same sd level, which makes the notion of an sd level meaningless.
Peter,
Was the use of the cpu as a parameter in the initialization of the sched_domain's flags a reason for asking for a rework of the sched_domain initialization?
In the current example in this patchset, the flags are statically set in the table, but nothing prevents an architecture from updating the flags value before the table is given to the scheduler.
What will be the infrastructure if the arch wants to do so?
I'm not sure I catch your point. The arch is now able to define its own table and fill it before giving it to the scheduler, so nothing prevents it from setting/updating the flags field according to the system configuration during the boot sequence.
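Something like this during boot, for instance (a sketch only: cpu_can_powergate() is a made-up helper, the static int sd_flags member is the one of this series' table, and set_sched_topology() is the hook the series adds to install an arch table):

static void __init arm_update_topology(void)
{
	/* drop the powergating property when cores can't powergate
	 * individually on this platform */
	if (!cpu_can_powergate())
		arm_topology[0].sd_flags &= ~SD_SHARE_POWERDOMAIN;

	/* install the arch-specific table */
	set_sched_topology(arm_topology);
}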
Sorry, misunderstanding from my side in the first place. You just said that the arch is able to change those flags (due to the arch specific topology table) and I related it to the possibility for the arch to set the topology flags per cpu.
OK, I was not sure that your point was only about the cpu argument for the SD's flags configuration.
Vincent
Thanks, Vincent
Thanks,
-- Dietmar
This function pointer would simply be another member of struct sched_domain_topology_level and would replace int sd_flags. AFAICS, you have to create additional cpu mask functions anyway for the additional sd levels, like cpu_corepower_mask() for the GMC level in the ARM case. There could be a set of standard sd topology flags functions for the default sd layer, and archs which want to pass in something special can define those functions locally, since they will use them only in their arch-specific struct sched_domain_topology_level table anyway. I know that you use the existing sd degenerate functionality for this, and that the freeing of the redundant data structures (sched_domain, sched_group and sched_group_power) is there too, but it still doesn't seem to me to be the right thing to do.
The problem that we now expose internal data structures (struct sd_data and struct sched_domain_topology_level) could be dealt with later.
-- Dietmar
On Sat, Mar 08, 2014 at 12:40:58PM +0000, Dietmar Eggemann wrote:
I don't have a strong opinion about using a cpu argument or not for setting the flags of a level (it was part of the initial proposal before we started to completely rework the build of the sched_domain). Nevertheless, I see one potential concern: you can end up with completely different flags configurations at the same sd level for 2 cpus.
Could you elaborate a little bit further on the last sentence? Do you think that those completely different flags configurations would make it impossible for the load-balance code to work at all at this sd?
So a problem with such an interface is that it makes it far too easy to generate completely broken domains.
You can, for two cpus in the same domain, provide different flags; such a configuration doesn't make any sense at all.
Now I see why people would like to have this; but unless we can make it robust I'd be very hesitant to go this route.
On 11/03/14 13:17, Peter Zijlstra wrote:
On Sat, Mar 08, 2014 at 12:40:58PM +0000, Dietmar Eggemann wrote:
I don't have a strong opinion about using a cpu argument or not for setting the flags of a level (it was part of the initial proposal before we started to completely rework the build of the sched_domain). Nevertheless, I see one potential concern: you can end up with completely different flags configurations at the same sd level for 2 cpus.
Could you elaborate a little bit further on the last sentence? Do you think that those completely different flags configurations would make it impossible for the load-balance code to work at all at this sd?
So a problem with such an interface is that it makes it far too easy to generate completely broken domains.
I see the point. What I'm still struggling to understand is why this interface is worse than the one where we set up additional, adjacent sd levels with new cpu_foo_mask functions plus different static sd-flags configurations and rely on the sd degenerate functionality in the core scheduler to fold these levels together to achieve different per-cpu sd flags configurations.
IMHO, exposing struct sched_domain_topology_level bar_topology[] to the arch is the reason why the core scheduler has to check if the arch provides a sane sd setup in both cases.
You can, for two cpus in the same domain, provide different flags; such a configuration doesn't make any sense at all.
Now I see why people would like to have this; but unless we can make it robust I'd be very hesitant to go this route.
By making it robust, I guess you mean that the core scheduler has to check that the provided set-ups are sane, something like the following code snippet in sd_init()
if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
	      "wrong sd_flags in topology description\n"))
	tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
but for per-cpu set-ups. Obviously, this check has to be in sync with the usage of these flags in the core scheduler algorithms. This probably means that a subset of these topology sd flags has to be set for all cpus in an sd level, whereas others can be set only for some cpus.
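For reference, the whitelist that this check uses looks roughly like this in the series (kernel/sched/core.c):

/*
 * SD flags allowed in topology descriptions.
 *
 * SD_SHARE_CPUPOWER      - describes SMT topologies
 * SD_SHARE_PKG_RESOURCES - describes shared caches
 * SD_NUMA                - describes NUMA topologies
 * SD_SHARE_POWERDOMAIN   - describes shared power domain
 */
#define TOPOLOGY_SD_FLAGS		\
	(SD_SHARE_CPUPOWER |		\
	 SD_SHARE_PKG_RESOURCES |	\
	 SD_NUMA |			\
	 SD_ASYM_PACKING |		\
	 SD_SHARE_POWERDOMAIN)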
On 12 March 2014 14:28, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 11/03/14 13:17, Peter Zijlstra wrote:
On Sat, Mar 08, 2014 at 12:40:58PM +0000, Dietmar Eggemann wrote:
I don't have a strong opinion about using a cpu argument or not for setting the flags of a level (it was part of the initial proposal before we started to completely rework the build of the sched_domain). Nevertheless, I see one potential concern: you can end up with completely different flags configurations at the same sd level for 2 cpus.
Could you elaborate a little bit further on the last sentence? Do you think that those completely different flags configurations would make it impossible for the load-balance code to work at all at this sd?
So a problem with such an interface is that it makes it far too easy to generate completely broken domains.
I see the point. What I'm still struggling to understand is why this interface is worse than the one where we set up additional, adjacent sd levels with new cpu_foo_mask functions plus different static sd-flags configurations and rely on the sd degenerate functionality in the core scheduler to fold these levels together to achieve different per-cpu sd flags configurations.
The main difference is that all CPUs have got the same levels in the initial state, and the degenerate sequence can then decide whether it's worth removing a level and whether doing so will not create unusable domains.
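For reference, the degenerate check itself is roughly this (abridged from kernel/sched/core.c of that era; a level whose domain spans a single cpu, or whose flags don't need more than the single group, simply gets dropped):

static int sd_degenerate(struct sched_domain *sd)
{
	if (cpumask_weight(sched_domain_span(sd)) == 1)
		return 1;

	/* Following flags need at least 2 groups */
	if (sd->flags & (SD_LOAD_BALANCE |
			 SD_BALANCE_NEWIDLE |
			 SD_BALANCE_FORK |
			 SD_BALANCE_EXEC |
			 SD_SHARE_CPUPOWER |
			 SD_SHARE_PKG_RESOURCES)) {
		if (sd->groups != sd->groups->next)
			return 0;
	}

	/* Following flags don't use groups */
	if (sd->flags & (SD_WAKE_AFFINE))
		return 0;

	return 1;
}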
IMHO, exposing struct sched_domain_topology_level bar_topology[] to the arch is the reason why the core scheduler has to check if the arch provides a sane sd setup in both cases.
You can, for two cpus in the same domain, provide different flags; such a configuration doesn't make any sense at all.
Now I see why people would like to have this; but unless we can make it robust I'd be very hesitant to go this route.
By making it robust, I guess you mean that the core scheduler has to check that the provided set-ups are sane, something like the following code snippet in sd_init()
if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
	      "wrong sd_flags in topology description\n"))
	tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
but for per-cpu set-ups. Obviously, this check has to be in sync with the usage of these flags in the core scheduler algorithms. This probably means that a subset of these topology sd flags has to be set for all cpus in an sd level, whereas others can be set only for some cpus.
On 12/03/14 13:47, Vincent Guittot wrote:
On 12 March 2014 14:28, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 11/03/14 13:17, Peter Zijlstra wrote:
On Sat, Mar 08, 2014 at 12:40:58PM +0000, Dietmar Eggemann wrote:
I don't have a strong opinion about using a cpu argument or not for setting the flags of a level (it was part of the initial proposal before we started to completely rework the build of the sched_domain). Nevertheless, I see one potential concern: you can end up with completely different flags configurations at the same sd level for 2 cpus.
Could you elaborate a little bit further on the last sentence? Do you think that those completely different flags configurations would make it impossible for the load-balance code to work at all at this sd?
So a problem with such an interface is that it makes it far too easy to generate completely broken domains.
I see the point. What I'm still struggling to understand is why this interface is worse than the one where we set up additional, adjacent sd levels with new cpu_foo_mask functions plus different static sd-flags configurations and rely on the sd degenerate functionality in the core scheduler to fold these levels together to achieve different per-cpu sd flags configurations.
The main difference is that all CPUs have got the same levels in the initial state, and the degenerate sequence can then decide whether it's worth removing a level and whether doing so will not create unusable domains.
Agreed. But what I'm trying to say is that using the approach of multiple adjacent sd levels with different cpu_mask(int cpu) functions and static sd topology flags will not prevent us from coding the enforcement of sane sd topology flags set-ups somewhere inside the core scheduler.
It is possible to easily introduce erroneous set-ups from the standpoint of sd topology flags with this approach too.
For the sake of an example on the ARM TC2 platform, I changed cpu_corepower_mask(int cpu) [arch/arm/kernel/topology.c] to simulate that in socket 1 (3 Cortex-A7 cores) cores can powergate individually, whereas in socket 0 (2 Cortex-A15 cores) they can't:
 const struct cpumask *cpu_corepower_mask(int cpu)
 {
-	return &cpu_topology[cpu].thread_sibling;
+	return cpu_topology[cpu].socket_id ? &cpu_topology[cpu].thread_sibling :
+	       &cpu_topology[cpu].core_sibling;
 }
With this I get the following cpu mask configuration:
dmesg snippet (w/ additional debug in cpu_coregroup_mask(), cpu_corepower_mask()):
...
CPU0: cpu_corepower_mask=0-1
CPU0: cpu_coregroup_mask=0-1
CPU1: cpu_corepower_mask=0-1
CPU1: cpu_coregroup_mask=0-1
CPU2: cpu_corepower_mask=2
CPU2: cpu_coregroup_mask=2-4
CPU3: cpu_corepower_mask=3
CPU3: cpu_coregroup_mask=2-4
CPU4: cpu_corepower_mask=4
CPU4: cpu_coregroup_mask=2-4
...
And I deliberately introduced the following error into the arm_topology[] table:
 static struct sched_domain_topology_level arm_topology[] = {
 #ifdef CONFIG_SCHED_MC
-	{ cpu_corepower_mask, SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN, SD_INIT_NAME(GMC) },
+	{ cpu_corepower_mask, SD_SHARE_POWERDOMAIN, SD_INIT_NAME(GMC) },
 	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
With this set-up, I get GMC & DIE levels for CPU0,1 and MC & DIE levels for CPU2,3,4, i.e. the SD_SHARE_PKG_RESOURCES flag is only set at the MC level for CPU2,3,4.
dmesg snippet (w/ adapted sched_domain_debug_one(), only CPU0 and CPU2 shown here):
...
CPU0 attaching sched-domain:
 domain 0: span 0-1 level GMC
  SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_POWERDOMAIN SD_PREFER_SIBLING
  groups: 0 1
  ...
  domain 1: span 0-4 level DIE
   SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
   groups: 0-1 (cpu_power = 2048) 2-4 (cpu_power = 3072)
...
CPU2 attaching sched-domain:
 domain 0: span 2-4 level MC
  SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES
  groups: 2 3 4
  ...
  domain 1: span 0-4 level DIE
   SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
   groups: 2-4 (cpu_power = 3072) 0-1 (cpu_power = 2048)
...
What I wanted to say is that, IMHO, it doesn't matter which approach we take (multiple adjacent sd levels or a per-cpu topology sd flags function); we have to enforce a sane sd topology flags set-up inside the core scheduler anyway.
-- Dietmar
IMHO, exposing struct sched_domain_topology_level bar_topology[] to the arch is the reason why the core scheduler has to check if the arch provides a sane sd setup in both cases.
You can, for two cpus in the same domain, provide different flags; such a configuration doesn't make any sense at all.
Now I see why people would like to have this; but unless we can make it robust I'd be very hesitant to go this route.
By making it robust, I guess you mean that the core scheduler has to check that the provided set-ups are sane, something like the following code snippet in sd_init()
if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
	      "wrong sd_flags in topology description\n"))
	tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
but for per-cpu set-ups. Obviously, this check has to be in sync with the usage of these flags in the core scheduler algorithms. This probably means that a subset of these topology sd flags has to be set for all cpus in an sd level, whereas others can be set only for some cpus.
[...]
On Wed, Mar 12, 2014 at 01:28:07PM +0000, Dietmar Eggemann wrote:
On 11/03/14 13:17, Peter Zijlstra wrote:
On Sat, Mar 08, 2014 at 12:40:58PM +0000, Dietmar Eggemann wrote:
I don't have a strong opinion about using a cpu argument or not for setting the flags of a level (it was part of the initial proposal before we started to completely rework the build of the sched_domain). Nevertheless, I see one potential concern: you can end up with completely different flags configurations at the same sd level for 2 cpus.
Could you elaborate a little bit further on the last sentence? Do you think that those completely different flags configurations would make it impossible for the load-balance code to work at all at this sd?
So a problem with such an interface is that it makes it far too easy to generate completely broken domains.
I see the point. What I'm still struggling to understand is why this interface is worse than the one where we set up additional, adjacent sd levels with new cpu_foo_mask functions plus different static sd-flags configurations and rely on the sd degenerate functionality in the core scheduler to fold these levels together to achieve different per-cpu sd flags configurations.
Well, the folding of SD levels is 'safe' in that it keeps domains internally consistent.
IMHO, exposing struct sched_domain_topology_level bar_topology[] to the arch is the reason why the core scheduler has to check if the arch provides a sane sd setup in both cases.
Up to a point, yes. On the other hand, the reason we have the degenerate stuff is that the topology was generic and might contain pointless levels the architecture didn't actually have.
By moving the topology setup into the arch; that could be made to go away (not sure you want to do that, but you could).
But yes, by moving the topology setup out of the core code, you need some extra validation to make sure that whatever you're fed makes some kind of sense.
You can, for two cpus in the same domain, provide different flags; such a configuration doesn't make any sense at all.
Now I see why people would like to have this; but unless we can make it robust I'd be very hesitant to go this route.
By making it robust, I guess you mean that the core scheduler has to check that the provided set-ups are sane, something like the following code snippet in sd_init()
if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
	      "wrong sd_flags in topology description\n"))
	tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
but for per-cpu set-ups.
So a domain is principally a group of CPUs with the same properties. However, per-cpu domain attributes allow you to specify different domain properties within the one domain mask.
That's completely broken.
So the way to validate something like that would be:
cpu = cpumask_first(tl->mask());
flags = tl->flags(cpu);

for (; cpu = cpumask_next(cpu, tl->mask()), cpu < nr_cpu_ids;)
	BUG_ON(tl->flags(cpu) != flags);
Or something along those lines.
But for me it's far easier to think in the simple one-domain-one-flags scenario. The whole degenerate folding is a very simple optimization that simply removes redundant levels.
On 17/03/14 11:52, Peter Zijlstra wrote:
On Wed, Mar 12, 2014 at 01:28:07PM +0000, Dietmar Eggemann wrote:
[...]
By making it robust, I guess you mean that the core scheduler has to check that the provided set-ups are sane, something like the following code snippet in sd_init()
if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
	      "wrong sd_flags in topology description\n"))
	tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
but for per-cpu set-ups.
So a domain is principally a group of CPUs with the same properties. However, per-cpu domain attributes allow you to specify different domain properties within the one domain mask.
That's completely broken.
So the way to validate something like that would be:
cpu = cpumask_first(tl->mask());
flags = tl->flags(cpu);

for (; cpu = cpumask_next(cpu, tl->mask()), cpu < nr_cpu_ids;)
	BUG_ON(tl->flags(cpu) != flags);
Or something along those lines.
I tried this idea inside sd_init() on top of Vincent's V3 and it's doing its job.
But for me it's far easier to think in the simple one-domain-one-flags scenario. The whole degenerate folding is a very simple optimization that simply removes redundant levels.
For me, the approach with the 'int cpu' parameter in the flag function is easier to understand. One of the things I had to grasp though was the fact that we can only specify SD_SHARE_FOO flags and not SD_NOT_SHARE_FOO per domain.
-- >8 --
Subject: [PATCH] sched: check that the sd_flags are consistent in one domain
---
 arch/arm/kernel/topology.c | 13 +++++++++----
 include/linux/sched.h      |  6 +++---
 kernel/sched/core.c        | 11 +++++++++--
 3 files changed, 21 insertions(+), 9 deletions(-)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 71e1fec6d31a..425f133c690d 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -275,15 +275,20 @@ void store_cpu_topology(unsigned int cpuid)
 		cpu_topology[cpuid].socket_id, mpidr);
 }
 
-static inline const int cpu_corepower_flags(void)
+//static inline const int cpu_corepower_flags(void)
+//{
+//	return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
+//}
+
+static inline const int arm_cpu_core_flags(int cpu)
 {
-	return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
+	return (cpu < 2) ? SD_SHARE_PKG_RESOURCES : SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
 }
 
 static struct sched_domain_topology_level arm_topology[] = {
 #ifdef CONFIG_SCHED_MC
-	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
-	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+//	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
+	{ cpu_coregroup_mask, arm_cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 05ce264e5144..45e5aa3d3e80 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -870,14 +870,14 @@ enum cpu_idle_type {
 #define SD_NUMA			0x4000	/* cross-node balancing */
 
 #ifdef CONFIG_SCHED_SMT
-static inline const int cpu_smt_flags(void)
+static inline const int cpu_smt_flags(int cpu)
 {
 	return SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES;
 }
 #endif
 
 #ifdef CONFIG_SCHED_MC
-static inline const int cpu_core_flags(void)
+static inline const int cpu_core_flags(int cpu)
 {
 	return SD_SHARE_PKG_RESOURCES;
 }
@@ -990,7 +990,7 @@ void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
 bool cpus_share_cache(int this_cpu, int that_cpu);
 
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
-typedef const int (*sched_domain_flags_f)(void);
+typedef const int (*sched_domain_flags_f)(int cpu);
 
 #define SDTL_OVERLAP		0x01
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f2ee6c72b13a..6b8ba837977c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5968,7 +5968,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 	sd_weight = cpumask_weight(tl->mask(cpu));
 
 	if (tl->sd_flags)
-		sd_flags = (*tl->sd_flags)();
+		sd_flags = (*tl->sd_flags)(cpu);
 
 	if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
 			"wrong sd_flags in topology description\n"))
 		sd_flags &= ~TOPOLOGY_SD_FLAGS;
@@ -6044,9 +6044,16 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		sd->idle_idx = 1;
 	}
 
+	if (tl->sd_flags) {
+		int flags = (*tl->sd_flags)(cpumask_first(tl->mask(cpu)));
+
+		for (; cpu = cpumask_next(cpu, tl->mask(cpu)), cpu < nr_cpu_ids;)
+			BUG_ON((*tl->sd_flags)(cpu) != flags);
+	}
+
 	sd->private = &tl->data;
 
-	return sd;
+	return sd;
 }
/*
On 19 March 2014 20:15, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 17/03/14 11:52, Peter Zijlstra wrote:
On Wed, Mar 12, 2014 at 01:28:07PM +0000, Dietmar Eggemann wrote:
[...]
By making it robust, I guess you mean that the core scheduler has to check that the provided set-ups are sane, something like the following code snippet in sd_init()
if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
	      "wrong sd_flags in topology description\n"))
	tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
but for per-cpu set-ups.
So a domain is principally a group of CPUs with the same properties. However, per-cpu domain attributes allow you to specify different domain properties within the one domain mask.
That's completely broken.
So the way to validate something like that would be:
cpu = cpumask_first(tl->mask());
flags = tl->flags(cpu);

for (; cpu = cpumask_next(cpu, tl->mask()), cpu < nr_cpu_ids;)
	BUG_ON(tl->flags(cpu) != flags);
Or something along those lines.
I tried this idea inside sd_init() on top of Vincent's V3 and it's doing its job.
But for me it's far easier to think in the simple one-domain-one-flags scenario. The whole degenerate folding is a very simple optimization that simply removes redundant levels.
For me, the approach with the 'int cpu' parameter in the flag function is easier to understand. One of the things I had to grasp though was the fact that we can only specify SD_SHARE_FOO flags and not SD_NOT_SHARE_FOO per domain.
Looking at your test below, the solution without a cpu argument is more readable to me because you don't have to handle 2 cpu args that can vary when setting up your level. I'm afraid the flags function will become quite complex and unreadable with a cpu arg, and this additional cpu arg doesn't give any benefit.
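For comparison, this is the ARM table as posted in the series (the lines your diff above removes); each level keeps a trivial, per-level flags function:

static struct sched_domain_topology_level arm_topology[] = {
#ifdef CONFIG_SCHED_MC
	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};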
Vincent
[...]
On Fri, Mar 07, 2014 at 10:47:49AM +0800, Vincent Guittot wrote:
Peter,
Was the use of the cpu as a parameter in the initialization of the sched_domain's flags a reason for asking for a rework of the sched_domain initialization?
/me tries very hard to remember.. and fails.
Reading back the linked thread also doesn't seem to jog memory.