This patchset was previously part of the larger tasks packing patchset [1]. I have split the latter into (at least) 3 different patchsets to make things easier:
- configuration of sched_domain topology (this patchset)
- update and consolidation of cpu_power
- tasks packing algorithm
Based on Peter Z's proposal [2][3], this patchset modifies the way the sched_domain levels are configured, in order to let architectures add specific levels like the current BOOK level or the proposed power-gating level for the ARM architecture.
[1] https://lkml.org/lkml/2013/10/18/121
[2] https://lkml.org/lkml/2013/11/5/239
[3] https://lkml.org/lkml/2013/11/5/449
Changes since v3:
- remove the last patch, which was adding SD_SHARE_POWERDOMAIN for powerpc's SMT level
Changes since v2:
- remove patch 1, as it has already been applied to the metag tree for v3.15
- update some commit messages
- add the new flag's description to the TOPOLOGY_SD_FLAGS description
Changes since v1:
- move sched_domains_curr_level back under #ifdef CONFIG_NUMA
- use a function pointer to set flags instead of a plain value
- add the list of tunable flags to the commit message of patch 2
- add the SD_SHARE_POWERDOMAIN flag for powerpc's SMT level
Vincent Guittot (5):
  sched: rework of sched_domain topology definition
  sched: s390: create a dedicated topology table
  sched: powerpc: create a dedicated topology table
  sched: add a new SD_SHARE_POWERDOMAIN for sched_domain
  sched: ARM: create a dedicated scheduler topology table
 arch/arm/kernel/topology.c       |  26 +++++
 arch/ia64/include/asm/topology.h |  24 ----
 arch/powerpc/kernel/smp.c        |  31 +++--
 arch/s390/include/asm/topology.h |  13 +--
 arch/s390/kernel/topology.c      |  20 ++++
 arch/tile/include/asm/topology.h |  33 ------
 include/linux/sched.h            |  49 +++++++-
 include/linux/topology.h         | 128 +++----------------
 kernel/sched/core.c              | 244 ++++++++++++++++++++-------------------
 9 files changed, 255 insertions(+), 313 deletions(-)
We replace the old way to configure the scheduler topology with a new method which enables a platform to declare additional levels (if needed).
We still have a default topology table definition that can be used by platforms that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can be overwritten by an arch which either wants to add new levels where load balancing makes sense, like BOOK or power-gating levels, or wants to change the flags configuration of some levels.
For each level, we need a function pointer that returns the cpumask for each cpu, a function pointer that returns the flags for the level, and a name. Only flags that describe topology can be set by an architecture. The current topology flags are:
  SD_SHARE_CPUPOWER
  SD_SHARE_PKG_RESOURCES
  SD_NUMA
  SD_ASYM_PACKING
Then, each level must be a subset of the next one. The build sequence of the sched_domain will take care of removing useless levels like those with 1 CPU and those with the same CPU span and no more relevant information for load balancing than their children.
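As an illustration only (not part of the patch; the real examples follow in the next patches of the series), an arch override with one extra level between MC and DIE could look like the sketch below, where cpu_cluster_mask() and cpu_cluster_flags() are hypothetical helpers, not existing kernel symbols:

/* Hypothetical arch topology table (sketch) */
static struct sched_domain_topology_level my_arch_topology[] = {
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	/* extra level; cpu_cluster_mask/cpu_cluster_flags are assumed helpers */
	{ cpu_cluster_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};

/* installed once from arch boot code */
set_sched_topology(my_arch_topology);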
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 arch/ia64/include/asm/topology.h |  24 ----
 arch/s390/include/asm/topology.h |   2 -
 arch/tile/include/asm/topology.h |  33 ------
 include/linux/sched.h            |  48 ++++++++
 include/linux/topology.h         | 128 +++------------------
 kernel/sched/core.c              | 235 ++++++++++++++++++++-------------------
 6 files changed, 183 insertions(+), 287 deletions(-)
diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index 5cb55a1..3202aa7 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -46,30 +46,6 @@
void build_cpu_to_node_map(void);
-#define SD_CPU_INIT (struct sched_domain) {		\
-	.parent			= NULL,			\
-	.child			= NULL,			\
-	.groups			= NULL,			\
-	.min_interval		= 1,			\
-	.max_interval		= 4,			\
-	.busy_factor		= 64,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 2,			\
-	.busy_idx		= 2,			\
-	.idle_idx		= 1,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_NEWIDLE	\
-				| SD_BALANCE_EXEC	\
-				| SD_BALANCE_FORK	\
-				| SD_WAKE_AFFINE,	\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 1,			\
-	.nr_balance_failed	= 0,			\
-}
-
 #endif /* CONFIG_NUMA */
 #ifdef CONFIG_SMP

diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index 05425b1..07763bd 100644
--- a/arch/s390/include/asm/topology.h
+++ b/arch/s390/include/asm/topology.h
@@ -64,8 +64,6 @@ static inline void s390_init_cpu_topology(void)
 };
 #endif
-#define SD_BOOK_INIT		SD_CPU_INIT
-
 #include <asm-generic/topology.h>
 #endif /* _ASM_S390_TOPOLOGY_H */

diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index d15c0d8..9383118 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -44,39 +44,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
 /* For now, use numa node -1 for global allocation. */
 #define pcibus_to_node(bus) ((void)(bus), -1)
-/*
- * TILE architecture has many cores integrated in one processor, so we need
- * setup bigger balance_interval for both CPU/NODE scheduling domains to
- * reduce process scheduling costs.
- */
-
-/* sched_domains SD_CPU_INIT for TILE architecture */
-#define SD_CPU_INIT (struct sched_domain) {		\
-	.min_interval		= 4,			\
-	.max_interval		= 128,			\
-	.busy_factor		= 64,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 1,			\
-	.busy_idx		= 2,			\
-	.idle_idx		= 1,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-							\
-	.flags			= 1*SD_LOAD_BALANCE	\
-				| 1*SD_BALANCE_NEWIDLE	\
-				| 1*SD_BALANCE_EXEC	\
-				| 1*SD_BALANCE_FORK	\
-				| 0*SD_BALANCE_WAKE	\
-				| 0*SD_WAKE_AFFINE	\
-				| 0*SD_SHARE_CPUPOWER	\
-				| 0*SD_SHARE_PKG_RESOURCES \
-				| 0*SD_SERIALIZE	\
-				,			\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 32,			\
-}
-
 /* By definition, we create nodes based on online memory. */
 #define node_has_online_mem(nid) 1
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7cb07fd..3161701 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -874,6 +874,20 @@ enum cpu_idle_type {
extern int __weak arch_sd_sibiling_asym_packing(void);
+#ifdef CONFIG_SCHED_SMT
+static inline const int cpu_smt_flags(void)
+{
+	return SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES;
+}
+#endif
+
+#ifdef CONFIG_SCHED_MC
+static inline const int cpu_core_flags(void)
+{
+	return SD_SHARE_PKG_RESOURCES;
+}
+#endif
+
 struct sched_domain_attr {
 	int relax_domain_level;
 };
@@ -980,6 +994,38 @@ void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
bool cpus_share_cache(int this_cpu, int that_cpu);
+typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
+typedef const int (*sched_domain_flags_f)(void);
+
+#define SDTL_OVERLAP	0x01
+
+struct sd_data {
+	struct sched_domain **__percpu sd;
+	struct sched_group **__percpu sg;
+	struct sched_group_power **__percpu sgp;
+};
+
+struct sched_domain_topology_level {
+	sched_domain_mask_f mask;
+	sched_domain_flags_f sd_flags;
+	int		    flags;
+	int		    numa_level;
+	struct sd_data      data;
+#ifdef CONFIG_SCHED_DEBUG
+	char                *name;
+#endif
+};
+
+extern struct sched_domain_topology_level *sched_domain_topology;
+
+extern void set_sched_topology(struct sched_domain_topology_level *tl);
+
+#ifdef CONFIG_SCHED_DEBUG
+# define SD_INIT_NAME(type)		.name = #type
+#else
+# define SD_INIT_NAME(type)
+#endif
+
 #else /* CONFIG_SMP */
 struct sched_domain_attr;

@@ -995,6 +1041,8 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
 	return true;
 }
+static inline void set_sched_topology(struct sched_domain_topology_level *tl) { }
+
 #endif /* !CONFIG_SMP */
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 12ae6ce..3a9db05 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -66,121 +66,6 @@ int arch_update_cpu_topology(void);
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
 #endif
-/*
- * Below are the 3 major initializers used in building sched_domains:
- * SD_SIBLING_INIT, for SMT domains
- * SD_CPU_INIT, for SMP domains
- *
- * Any architecture that cares to do any tuning to these values should do so
- * by defining their own arch-specific initializer in include/asm/topology.h.
- * A definition there will automagically override these default initializers
- * and allow arch-specific performance tuning of sched_domains.
- * (Only non-zero and non-null fields need be specified.)
- */
-
-#ifdef CONFIG_SCHED_SMT
-/* MCD - Do we really need this?  It is always on if CONFIG_SCHED_SMT is,
- * so can't we drop this in favor of CONFIG_SCHED_SMT?
- */
-#define ARCH_HAS_SCHED_WAKE_IDLE
-/* Common values for SMT siblings */
-#ifndef SD_SIBLING_INIT
-#define SD_SIBLING_INIT (struct sched_domain) {			\
-	.min_interval		= 1,					\
-	.max_interval		= 2,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 110,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 1*SD_SHARE_CPUPOWER			\
-				| 1*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				| 0*SD_PREFER_SIBLING			\
-				| arch_sd_sibling_asym_packing()	\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-	.smt_gain		= 1178,	/* 15% */			\
-	.max_newidle_lb_cost	= 0,					\
-	.next_decay_max_lb_cost	= jiffies,				\
-}
-#endif
-#endif /* CONFIG_SCHED_SMT */
-
-#ifdef CONFIG_SCHED_MC
-/* Common values for MC siblings. for now mostly derived from SD_CPU_INIT */
-#ifndef SD_MC_INIT
-#define SD_MC_INIT (struct sched_domain) {				\
-	.min_interval		= 1,					\
-	.max_interval		= 4,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 1*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-	.max_newidle_lb_cost	= 0,					\
-	.next_decay_max_lb_cost	= jiffies,				\
-}
-#endif
-#endif /* CONFIG_SCHED_MC */
-
-/* Common values for CPUs */
-#ifndef SD_CPU_INIT
-#define SD_CPU_INIT (struct sched_domain) {				\
-	.min_interval		= 1,					\
-	.max_interval		= 4,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				| 1*SD_PREFER_SIBLING			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-	.max_newidle_lb_cost	= 0,					\
-	.next_decay_max_lb_cost	= jiffies,				\
-}
-#endif
-
-#ifdef CONFIG_SCHED_BOOK
-#ifndef SD_BOOK_INIT
-#error Please define an appropriate SD_BOOK_INIT in include/asm/topology.h!!!
-#endif
-#endif /* CONFIG_SCHED_BOOK */
-
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DECLARE_PER_CPU(int, numa_node);
@@ -295,4 +180,17 @@ static inline int cpu_to_mem(int cpu)
 #define topology_core_cpumask(cpu)	cpumask_of(cpu)
 #endif
+#ifdef CONFIG_SCHED_SMT
+static inline const struct cpumask *cpu_smt_mask(int cpu)
+{
+	return topology_thread_cpumask(cpu);
+}
+#endif
+
+static inline const struct cpumask *cpu_cpu_mask(int cpu)
+{
+	return cpumask_of_node(cpu_to_node(cpu));
+}
+
+
 #endif /* _LINUX_TOPOLOGY_H */

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9d9776..0714574 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5600,17 +5600,6 @@ static int __init isolated_cpu_setup(char *str)
__setup("isolcpus=", isolated_cpu_setup);
-static const struct cpumask *cpu_cpu_mask(int cpu)
-{
-	return cpumask_of_node(cpu_to_node(cpu));
-}
-
-struct sd_data {
-	struct sched_domain **__percpu sd;
-	struct sched_group **__percpu sg;
-	struct sched_group_power **__percpu sgp;
-};
-
 struct s_data {
 	struct sched_domain ** __percpu sd;
 	struct root_domain	*rd;
@@ -5623,21 +5612,6 @@ enum s_alloc {
 	sa_none,
 };
-struct sched_domain_topology_level;
-
-typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
-typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
-
-#define SDTL_OVERLAP	0x01
-
-struct sched_domain_topology_level {
-	sched_domain_init_f init;
-	sched_domain_mask_f mask;
-	int		    flags;
-	int		    numa_level;
-	struct sd_data      data;
-};
-
 /*
  * Build an iteration mask that can exclude certain CPUs from the upwards
  * domain traversal.
@@ -5866,34 +5840,6 @@ int __weak arch_sd_sibling_asym_packing(void)
  * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
  */
-#ifdef CONFIG_SCHED_DEBUG
-# define SD_INIT_NAME(sd, type)		sd->name = #type
-#else
-# define SD_INIT_NAME(sd, type)		do { } while (0)
-#endif
-
-#define SD_INIT_FUNC(type)						\
-static noinline struct sched_domain *					\
-sd_init_##type(struct sched_domain_topology_level *tl, int cpu)	\
-{									\
-	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);	\
-	*sd = SD_##type##_INIT;						\
-	SD_INIT_NAME(sd, type);						\
-	sd->private = &tl->data;					\
-	return sd;							\
-}
-
-SD_INIT_FUNC(CPU)
-#ifdef CONFIG_SCHED_SMT
- SD_INIT_FUNC(SIBLING)
-#endif
-#ifdef CONFIG_SCHED_MC
- SD_INIT_FUNC(MC)
-#endif
-#ifdef CONFIG_SCHED_BOOK
- SD_INIT_FUNC(BOOK)
-#endif
-
 static int default_relax_domain_level = -1;
 int sched_domain_level_max;
@@ -5981,97 +5927,156 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
 	*per_cpu_ptr(sdd->sgp, cpu) = NULL;
 }
-#ifdef CONFIG_SCHED_SMT
-static const struct cpumask *cpu_smt_mask(int cpu)
-{
-	return topology_thread_cpumask(cpu);
-}
-#endif
-
-/*
- * Topology list, bottom-up.
- */
-static struct sched_domain_topology_level default_topology[] = {
-#ifdef CONFIG_SCHED_SMT
-	{ sd_init_SIBLING, cpu_smt_mask, },
-#endif
-#ifdef CONFIG_SCHED_MC
-	{ sd_init_MC, cpu_coregroup_mask, },
-#endif
-#ifdef CONFIG_SCHED_BOOK
-	{ sd_init_BOOK, cpu_book_mask, },
-#endif
-	{ sd_init_CPU, cpu_cpu_mask, },
-	{ NULL, },
-};
-
-static struct sched_domain_topology_level *sched_domain_topology = default_topology;
-
-#define for_each_sd_topology(tl)			\
-	for (tl = sched_domain_topology; tl->init; tl++)
-
 #ifdef CONFIG_NUMA
-
 static int sched_domains_numa_levels;
 static int *sched_domains_numa_distance;
 static struct cpumask ***sched_domains_numa_masks;
 static int sched_domains_curr_level;
+#endif
-static inline int sd_local_flags(int level)
-{
-	if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE)
-		return 0;
-
-	return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
-}
+/*
+ * SD_flags allowed in topology descriptions.
+ *
+ * SD_SHARE_CPUPOWER      - describes SMT topologies
+ * SD_SHARE_PKG_RESOURCES - describes shared caches
+ * SD_NUMA                - describes NUMA topologies
+ *
+ * Odd one out:
+ * SD_ASYM_PACKING        - describes SMT quirks
+ */
+#define TOPOLOGY_SD_FLAGS		\
+	(SD_SHARE_CPUPOWER |		\
+	 SD_SHARE_PKG_RESOURCES |	\
+	 SD_NUMA |			\
+	 SD_ASYM_PACKING)
 static struct sched_domain *
-sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
+sd_init(struct sched_domain_topology_level *tl, int cpu)
 {
 	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
-	int level = tl->numa_level;
-	int sd_weight = cpumask_weight(
-			sched_domains_numa_masks[level][cpu_to_node(cpu)]);
+	int sd_weight, sd_flags = 0;
+
+#ifdef CONFIG_NUMA
+	/*
+	 * Ugly hack to pass state to sd_numa_mask()...
+	 */
+	sched_domains_curr_level = tl->numa_level;
+#endif
+
+	sd_weight = cpumask_weight(tl->mask(cpu));
+
+	if (tl->sd_flags)
+		sd_flags = (*tl->sd_flags)();
+	if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
+			"wrong sd_flags in topology description\n"))
+		sd_flags &= ~TOPOLOGY_SD_FLAGS;
 	*sd = (struct sched_domain){
 		.min_interval		= sd_weight,
 		.max_interval		= 2*sd_weight,
 		.busy_factor		= 32,
 		.imbalance_pct		= 125,
-		.cache_nice_tries	= 2,
-		.busy_idx		= 3,
-		.idle_idx		= 2,
+
+		.cache_nice_tries	= 0,
+		.busy_idx		= 0,
+		.idle_idx		= 0,
 		.newidle_idx		= 0,
 		.wake_idx		= 0,
 		.forkexec_idx		= 0,
 		.flags			= 1*SD_LOAD_BALANCE
 					| 1*SD_BALANCE_NEWIDLE
-					| 0*SD_BALANCE_EXEC
-					| 0*SD_BALANCE_FORK
+					| 1*SD_BALANCE_EXEC
+					| 1*SD_BALANCE_FORK
 					| 0*SD_BALANCE_WAKE
-					| 0*SD_WAKE_AFFINE
+					| 1*SD_WAKE_AFFINE
 					| 0*SD_SHARE_CPUPOWER
 					| 0*SD_SHARE_PKG_RESOURCES
-					| 1*SD_SERIALIZE
+					| 0*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
-					| 1*SD_NUMA
-					| sd_local_flags(level)
+					| 0*SD_NUMA
+					| sd_flags
 					,
+
 		.last_balance		= jiffies,
 		.balance_interval	= sd_weight,
+		.smt_gain		= 0,
+		.max_newidle_lb_cost	= 0,
+		.next_decay_max_lb_cost	= jiffies,
+#ifdef CONFIG_SCHED_DEBUG
+		.name			= tl->name,
+#endif
 	};
-	SD_INIT_NAME(sd, NUMA);
-	sd->private = &tl->data;
 	/*
-	 * Ugly hack to pass state to sd_numa_mask()...
+	 * Convert topological properties into behaviour.
 	 */
-	sched_domains_curr_level = tl->numa_level;
+
+	if (sd->flags & SD_SHARE_CPUPOWER) {
+		sd->imbalance_pct = 110;
+		sd->smt_gain = 1178; /* ~15% */
+		sd->flags |= arch_sd_sibling_asym_packing();
+
+	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
+		sd->imbalance_pct = 117;
+		sd->cache_nice_tries = 1;
+		sd->busy_idx = 2;
+
+#ifdef CONFIG_NUMA
+	} else if (sd->flags & SD_NUMA) {
+		sd->cache_nice_tries = 2;
+		sd->busy_idx = 3;
+		sd->idle_idx = 2;
+
+		sd->flags |= SD_SERIALIZE;
+		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+			sd->flags &= ~(SD_BALANCE_EXEC |
+				       SD_BALANCE_FORK |
+				       SD_WAKE_AFFINE);
+		}
+
+#endif
+	} else {
+		sd->flags |= SD_PREFER_SIBLING;
+		sd->cache_nice_tries = 1;
+		sd->busy_idx = 2;
+		sd->idle_idx = 1;
+	}
+
+	sd->private = &tl->data;
 	return sd;
 }
+/*
+ * Topology list, bottom-up.
+ */
+static struct sched_domain_topology_level default_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_MC
+	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+#endif
+#ifdef CONFIG_SCHED_BOOK
+	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
+struct sched_domain_topology_level *sched_domain_topology = default_topology;
+
+#define for_each_sd_topology(tl)			\
+	for (tl = sched_domain_topology; tl->mask; tl++)
+
+void set_sched_topology(struct sched_domain_topology_level *tl)
+{
+	sched_domain_topology = tl;
+}
+
+#ifdef CONFIG_NUMA
+
 static const struct cpumask *sd_numa_mask(int cpu)
 {
 	return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)];
@@ -6215,7 +6220,10 @@ static void sched_init_numa(void)
 		}
 	}
-	tl = kzalloc((ARRAY_SIZE(default_topology) + level) *
+	/* Compute default topology size */
+	for (i = 0; sched_domain_topology[i].mask; i++);
+
+	tl = kzalloc((i + level) *
 			sizeof(struct sched_domain_topology_level), GFP_KERNEL);
 	if (!tl)
 		return;
@@ -6223,18 +6231,19 @@ static void sched_init_numa(void)
 	/*
 	 * Copy the default topology bits..
 	 */
-	for (i = 0; default_topology[i].init; i++)
-		tl[i] = default_topology[i];
+	for (i = 0; sched_domain_topology[i].mask; i++)
+		tl[i] = sched_domain_topology[i];
 	/*
 	 * .. and append 'j' levels of NUMA goodness.
 	 */
 	for (j = 0; j < level; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
-			.init = sd_numa_init,
 			.mask = sd_numa_mask,
+			.sd_flags = SD_NUMA,
 			.flags = SDTL_OVERLAP,
 			.numa_level = j,
+			SD_INIT_NAME(NUMA)
 		};
 	}
@@ -6392,7 +6401,7 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 		const struct cpumask *cpu_map, struct sched_domain_attr *attr,
 		struct sched_domain *child, int cpu)
 {
-	struct sched_domain *sd = tl->init(tl, cpu);
+	struct sched_domain *sd = sd_init(tl, cpu);
 	if (!sd)
 		return child;
On Fri, Apr 11, 2014 at 11:44:37AM +0200, Vincent Guittot wrote:
[snip]
On x86_64 defconfig this gets me:
kernel/sched/core.c: In function 'sched_init_numa':
kernel/sched/core.c:6197:4: warning: initialization makes pointer from integer without a cast [enabled by default]
kernel/sched/core.c:6197:4: warning: (near initialization for '(anonymous).sd_flags') [enabled by default]
Fix the missing change from int to function pointer for the NUMA flags.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h | 7 +++++++
 kernel/sched/core.c   | 2 +-
 2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 18464cb..d2b981c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -887,6 +887,13 @@ static inline const int cpu_core_flags(void)
 }
 #endif
+#ifdef CONFIG_NUMA
+static inline const int cpu_numa_flags(void)
+{
+	return SD_NUMA;
+}
+#endif
+
 struct sched_domain_attr {
 	int relax_domain_level;
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a30b0b4..24d0831 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6235,7 +6235,7 @@ static void sched_init_numa(void)
 	for (j = 0; j < level; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
 			.mask = sd_numa_mask,
-			.sd_flags = SD_NUMA,
+			.sd_flags = cpu_numa_flags,
 			.flags = SDTL_OVERLAP,
 			.numa_level = j,
 			SD_INIT_NAME(NUMA)
On Fri, Apr 18, 2014 at 01:34:06PM +0200, Vincent Guittot wrote:
[snip]
Thanks, folded it into the original patch.
On 18 April 2014 12:56, Peter Zijlstra peterz@infradead.org wrote:
On Fri, Apr 11, 2014 at 11:44:37AM +0200, Vincent Guittot wrote:
[snip]
On x86_64 defconfig this gets me:
[snip]
Sorry, while moving from int to function pointer, I forgot to change the NUMA part.
I'm going to send a fix right now
The BOOK level is only relevant for s390, so we create a dedicated topology table with BOOK level and remove it from the default table.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/s390/include/asm/topology.h | 11 +----------
 arch/s390/kernel/topology.c      | 20 ++++++++++++++++++++
 kernel/sched/core.c              |  3 ---
 3 files changed, 21 insertions(+), 13 deletions(-)
diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index 07763bd..56af530 100644
--- a/arch/s390/include/asm/topology.h
+++ b/arch/s390/include/asm/topology.h
@@ -26,21 +26,12 @@ extern struct cpu_topology_s390 cpu_topology[NR_CPUS];
#define mc_capable() 1
-static inline const struct cpumask *cpu_coregroup_mask(int cpu)
-{
-	return &cpu_topology[cpu].core_mask;
-}
-
-static inline const struct cpumask *cpu_book_mask(int cpu)
-{
-	return &cpu_topology[cpu].book_mask;
-}
-
 int topology_cpu_init(struct cpu *);
 int topology_set_cpu_management(int fc);
 void topology_schedule_update(void);
 void store_topology(struct sysinfo_15_1_x *info);
 void topology_expect_change(void);
+const struct cpumask *cpu_coregroup_mask(int cpu);
#else /* CONFIG_SCHED_BOOK */
diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c
index 4b2e3e3..ceddd77 100644
--- a/arch/s390/kernel/topology.c
+++ b/arch/s390/kernel/topology.c
@@ -443,6 +443,23 @@ int topology_cpu_init(struct cpu *cpu)
 	return sysfs_create_group(&cpu->dev.kobj, &topology_cpu_attr_group);
 }
+const struct cpumask *cpu_coregroup_mask(int cpu)
+{
+	return &cpu_topology[cpu].core_mask;
+}
+
+static const struct cpumask *cpu_book_mask(int cpu)
+{
+	return &cpu_topology[cpu].book_mask;
+}
+
+static struct sched_domain_topology_level s390_topology[] = {
+	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 static int __init topology_init(void)
 {
 	if (!MACHINE_HAS_TOPOLOGY) {
@@ -452,6 +469,9 @@ static int __init topology_init(void)
 		set_topology_timer();
 out:
 	update_cpu_masks();
+
+	set_sched_topology(s390_topology);
+
 	return device_create_file(cpu_subsys.dev_root, &dev_attr_dispatching);
 }
 device_initcall(topology_init);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0714574..b15b994 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6058,9 +6058,6 @@ static struct sched_domain_topology_level default_topology[] = {
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
-#ifdef CONFIG_SCHED_BOOK
-	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
-#endif
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
Create a dedicated topology table for handling the asymmetric features of powerpc.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 31 +++++++++++++++++++++++--------
 include/linux/sched.h     |  2 --
 kernel/sched/core.c       |  6 ------
 3 files changed, 23 insertions(+), 16 deletions(-)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ac2621a..c9cade5 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -755,6 +755,28 @@ int setup_profiling_timer(unsigned int multiplier)
 	return 0;
 }
+#ifdef CONFIG_SCHED_SMT
+/* cpumask of CPUs with asymetric SMT dependancy */
+static const int powerpc_smt_flags(void)
+{
+	int flags = SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES;
+
+	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
+		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
+		flags |= SD_ASYM_PACKING;
+	}
+	return flags;
+}
+#endif
+
+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	cpumask_var_t old_mask;
@@ -779,15 +801,8 @@ void __init smp_cpus_done(unsigned int max_cpus)
dump_numa_cpu_topology();
-}
+	set_sched_topology(powerpc_topology);

-int arch_sd_sibling_asym_packing(void)
-{
-	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
-		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
-		return SD_ASYM_PACKING;
-	}
-	return 0;
-}
 }
 #ifdef CONFIG_HOTPLUG_CPU

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3161701..7df3ca2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -872,8 +872,6 @@ enum cpu_idle_type {
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
 #define SD_NUMA			0x4000	/* cross-node balancing */
-extern int __weak arch_sd_sibiling_asym_packing(void);
-
 #ifdef CONFIG_SCHED_SMT
 static inline const int cpu_smt_flags(void)
 {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b15b994..3743a6c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5830,11 +5830,6 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 	atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
 }
-int __weak arch_sd_sibling_asym_packing(void)
-{
-	return 0*SD_ASYM_PACKING;
-}
-
 /*
  * Initializers for schedule domains
  * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
@@ -6015,7 +6010,6 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 	if (sd->flags & SD_SHARE_CPUPOWER) {
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */
-		sd->flags |= arch_sd_sibling_asym_packing();
 	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
 		sd->imbalance_pct = 117;
A new flag SD_SHARE_POWERDOMAIN is created to reflect whether groups of CPUs in a sched_domain level can reach different power states or not. As an example, the flag should be cleared at CPU level if groups of cores can be power gated independently. This information can be used in load balance decisions or to add a load balancing level between groups of CPUs that can power gate independently. This flag is part of the topology flags that can be set by an arch.
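As a sketch of how an arch could use the flag (patch 5 of this series does this for ARM with cpu_corepower_flags()), a topology flags function for a level whose CPUs share a power domain might look like the following; cpu_powerdomain_flags() is a hypothetical name, not part of this series:

/* Sketch only: members of this hypothetical level share caches and
 * cannot be power gated independently of each other. */
static inline const int cpu_powerdomain_flags(void)
{
	return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
}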
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 10 +++++++---
 2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7df3ca2..18464cb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -865,6 +865,7 @@ enum cpu_idle_type {
 #define SD_BALANCE_WAKE		0x0010	/* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
 #define SD_SHARE_CPUPOWER	0x0080	/* Domain members share cpu power */
+#define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
 #define SD_ASYM_PACKING		0x0800	/* Place busy groups earlier in the domain */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3743a6c..a30b0b4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5295,7 +5295,8 @@ static int sd_degenerate(struct sched_domain *sd)
 			 SD_BALANCE_FORK |
 			 SD_BALANCE_EXEC |
 			 SD_SHARE_CPUPOWER |
-			 SD_SHARE_PKG_RESOURCES)) {
+			 SD_SHARE_PKG_RESOURCES |
+			 SD_SHARE_POWERDOMAIN)) {
 		if (sd->groups != sd->groups->next)
 			return 0;
 	}
@@ -5326,7 +5327,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 				SD_BALANCE_EXEC |
 				SD_SHARE_CPUPOWER |
 				SD_SHARE_PKG_RESOURCES |
-				SD_PREFER_SIBLING);
+				SD_PREFER_SIBLING |
+				SD_SHARE_POWERDOMAIN);
 		if (nr_node_ids == 1)
 			pflags &= ~SD_SERIALIZE;
 	}
@@ -5935,6 +5937,7 @@ static int sched_domains_curr_level;
  * SD_SHARE_CPUPOWER      - describes SMT topologies
  * SD_SHARE_PKG_RESOURCES - describes shared caches
  * SD_NUMA                - describes NUMA topologies
+ * SD_SHARE_POWERDOMAIN   - describes shared power domain
  *
  * Odd one out:
  * SD_ASYM_PACKING        - describes SMT quirks
@@ -5943,7 +5946,8 @@ static int sched_domains_curr_level;
 	(SD_SHARE_CPUPOWER |		\
 	 SD_SHARE_PKG_RESOURCES |	\
 	 SD_NUMA |			\
-	 SD_ASYM_PACKING)
+	 SD_ASYM_PACKING |		\
+	 SD_SHARE_POWERDOMAIN)
 static struct sched_domain *
 sd_init(struct sched_domain_topology_level *tl, int cpu)
On Fri, Apr 11, 2014 at 11:44:40AM +0200, Vincent Guittot wrote:
[snip]
Building a cris-defconfig this gets me an endless stream of misery:
In file included from /usr/src/linux-2.6/init/do_mounts.c:12:0:
/usr/src/linux-2.6/include/linux/sched.h:1048:46: warning: 'struct sched_domain_topology_level' declared inside parameter list [enabled by default]
/usr/src/linux-2.6/include/linux/sched.h:1048:46: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default]
In file included from /usr/src/linux-2.6/include/linux/utsname.h:5:0,
                 from /usr/src/linux-2.6/include/linux/init_task.h:6,
                 from /usr/src/linux-2.6/init/init_task.c:1:
Might be something about UP builds, didn't check x86-UP.
set_sched_topology() doesn't need to be defined for non-SMP builds because there is no sched_domain there.
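The resulting layout in include/linux/sched.h is then roughly (sketch, abbreviated):

#ifdef CONFIG_SMP
/* ... */
extern void set_sched_topology(struct sched_domain_topology_level *tl);
#else /* !CONFIG_SMP */
/* no stub: struct sched_domain_topology_level is not defined on UP */
#endif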
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h | 2 --
 1 file changed, 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2b981c..674b1fa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1047,8 +1047,6 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
 	return true;
 }
-static inline void set_sched_topology(struct sched_domain_topology_level *tl) { }
-
 #endif /* !CONFIG_SMP */
On 18 April 2014 12:58, Peter Zijlstra peterz@infradead.org wrote:
On Fri, Apr 11, 2014 at 11:44:40AM +0200, Vincent Guittot wrote:
[snip]
Building a cris-defconfig this gets me an endless stream of misery:
[snip]
Might be something about UP builds, didn't check x86-UP.
Yes, you're right. An empty set_sched_topology() function has been defined for UP, whereas it can't be, because of the sched_domain_topology_level parameter, which is only available on SMP.
I'm going to send a fix.
Sorry for the dumb error.
Create a dedicated topology table for ARM, which will create a new level to differentiate CPUs that can power gate independently from others and those that cannot.

The patch gives an example of how to add a domain that will take advantage of SD_SHARE_POWERDOMAIN.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/arm/kernel/topology.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 0bc94b1..71e1fec 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -185,6 +185,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return &cpu_topology[cpu].core_sibling;
 }
+/*
+ * The current assumption is that we can power gate each core independently.
+ * This will be superseded by DT binding once available.
+ */
+const struct cpumask *cpu_corepower_mask(int cpu)
+{
+	return &cpu_topology[cpu].thread_sibling;
+}
+
 static void update_siblings_masks(unsigned int cpuid)
 {
 	struct cputopo_arm *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
@@ -266,6 +275,20 @@ void store_cpu_topology(unsigned int cpuid)
 		cpu_topology[cpuid].socket_id, mpidr);
 }
+static inline const int cpu_corepower_flags(void)
+{
+	return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
+}
+
+static struct sched_domain_topology_level arm_topology[] = {
+#ifdef CONFIG_SCHED_MC
+	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
+	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 /*
  * init_cpu_topology is called at boot when only one cpu is running
  * which prevent simultaneous write access to cpu_topology array
@@ -289,4 +312,7 @@ void __init init_cpu_topology(void)
 	smp_wmb();
 	parse_dt_topology();
+
+	/* Set scheduler topology descriptor */
+	set_sched_topology(arm_topology);
 }
Hi,
I'm trying to use this approach of specifying different per-cpu views on sd flags at DIE level on a TC2 platform (cluster 0 w/ CPU0/1 and cluster 1 w/ CPU2/3/4, w/o SMT). It doesn't work like in the case of the GMC/MC sd level.
If I use the following patch (just to illustrate the issue) on top (returning SD_SHARE_POWERDOMAIN in cpu_cpupower_flags() is bogus ... I just needed a flag function for GDIE level):
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 71e1fec6d31a..803330d89c09 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -194,6 +194,14 @@ const struct cpumask *cpu_corepower_mask(int cpu)
 	return &cpu_topology[cpu].thread_sibling;
 }
+const struct cpumask *cpu_cpupower_mask(int cpu)
+{
+	return cpu_topology[cpu].socket_id ?
+	       cpumask_of_node(cpu_to_node(cpu)) :
+	       &cpu_topology[cpu].core_sibling;
+}
+
+
 static void update_siblings_masks(unsigned int cpuid)
 {
 	struct cputopo_arm *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
@@ -280,11 +288,18 @@ static inline const int cpu_corepower_flags(void)
 	return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
 }
+static inline const int cpu_cpupower_flags(void)
+{
+	return SD_SHARE_POWERDOMAIN;
+}
+
+
 static struct sched_domain_topology_level arm_topology[] = {
 #ifdef CONFIG_SCHED_MC
 	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
 	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
+	{ cpu_cpupower_mask, cpu_cpupower_flags, SD_INIT_NAME(GDIE) },
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
so I get the following topology:
CPU0: cpu_corepower_mask=0 (GMC)
CPU0: cpu_coregroup_mask=0-1 (MC)
CPU0: cpu_cpupower_mask=0-1 (GDIE)
CPU0: cpu_cpu_mask=0-4 (DIE)
CPU1: cpu_corepower_mask=1 ...
CPU1: cpu_coregroup_mask=0-1
CPU1: cpu_cpupower_mask=0-1
CPU1: cpu_cpu_mask=0-4
CPU2: cpu_corepower_mask=2
CPU2: cpu_coregroup_mask=2-4
CPU2: cpu_cpupower_mask=0-4
CPU2: cpu_cpu_mask=0-4
CPU3: cpu_corepower_mask=3
CPU3: cpu_coregroup_mask=2-4
CPU3: cpu_cpupower_mask=0-4
CPU3: cpu_cpu_mask=0-4
CPU4: cpu_corepower_mask=4
CPU4: cpu_coregroup_mask=2-4
CPU4: cpu_cpupower_mask=0-4
CPU4: cpu_cpu_mask=0-4
Firstly, I had to get rid of the cpumask_equal(cpu_map, sched_domain_span(sd)) condition in build_sched_domains() to allow that I can have two sd levels which span CPU 0-4 (for CPU2/3/4).
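For reference, the check in question sits in the per-cpu level loop of build_sched_domains(); roughly (abbreviated sketch of kernel/sched/core.c of that era):

	for_each_cpu(i, cpu_map) {
		struct sched_domain_topology_level *tl;

		sd = NULL;
		for_each_sd_topology(tl) {
			sd = build_sched_domain(tl, cpu_map, attr, sd, i);
			/* ... */
			/* stop stacking once a level spans all of cpu_map */
			if (cpumask_equal(cpu_map, sched_domain_span(sd)))
				break;
		}
	}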
But it still doesn't work correctly:
dmesg snippet 2:
CPU0 attaching sched-domain:
 domain 0: span 0-1 level MC
  groups: 0 1
  domain 1: span 0-4 level DIE  <-- error (there's only one group)
   groups: 0-4 (cpu_power = 2048)
...
CPU2 attaching sched-domain:
 domain 0: span 2-4 level MC
  groups: 2 3 4
  domain 1: span 0-4 level GDIE
ERROR: domain->groups does not contain CPU2
   groups:
ERROR: domain->cpu_power not set
ERROR: groups don't span domain->span ...
It turns out that the function get_group(), which is used a couple of times in build_sched_groups(), uses a reference to sd->child, and even though the degenerate functionality gets rid of GDIE for CPU0/1 and of DIE for CPU2/3/4, the group set-up doesn't work as expected, since sd->child for DIE is now GDIE and not MC any more. So it looks like the GMC/MC level is somehow special (GMC has no sd->child on TC2, or GMC/MC contains only one cpu per group?).
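For reference, get_group() of that era looks roughly like this (quoted from memory and abbreviated, so treat it as a sketch); the cpumask_first(sched_domain_span(child)) step is where an unexpected sd->child changes which per-cpu group gets picked:

static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
{
	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
	struct sched_domain *child = sd->child;

	/* the group is keyed by the first cpu of the child's span */
	if (child)
		cpu = cpumask_first(sched_domain_span(child));

	if (sg) {
		*sg = *per_cpu_ptr(sdd->sg, cpu);
		(*sg)->sgp = *per_cpu_ptr(sdd->sgp, cpu);
	}

	return cpu;
}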
Although this problem does not affect the current patch-set, people might think that they can apply this degenerate trick to other sd levels as well.
I'm trying to fix get_group()/build_sched_groups() in such a way that my example would work but so far I haven't succeeded. The question for me remains ... is this application of the degenerate trick feasible at all in all sd levels, i.e. does it scale? What about platforms w/ SMT sd level which want to use this trick in GMC/MC level?
Any hints are highly appreciated here.
-- Dietmar
On 11/04/14 10:44, Vincent Guittot wrote:
[snip]
On 23 April 2014 13:46, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
Hi,
I'm trying to use this approach of specifying different per-cpu views on sd flags on DIE level on a TC2 platform (cluster 0 w/ CPU0/1 and cluster 1 w/ CPU2/3/4 w/o SMT). It doesn't work like in the case for the GMC/MC sd level.
If I use the following patch (just to illustrate the issue) on top (returning SD_SHARE_POWERDOMAIN in cpu_cpupower_flags() is bogus ... I just needed a flag function for GDIE level):
[snip]
You have an inconsistency in your topology description:
At GDIE level: CPU1 cpu_cpupower_mask=0-1 but CPU2 cpu_cpupower_mask=0-4
so CPU2 says that it shares its power domain with CPU0, but CPU1 says the opposite.
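In other words, the per-level mask() must be symmetric: if CPU a appears in mask(b), then b must appear in mask(a). A sanity check one could run over a level (sketch only, not part of the series):

static bool topology_level_consistent(sched_domain_mask_f mask)
{
	int a, b;

	for_each_possible_cpu(a) {
		for_each_cpu(b, mask(a)) {
			/* a sees b, so b must also see a */
			if (!cpumask_test_cpu(a, mask(b)))
				return false;
		}
	}
	return true;
}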
Regards Vincent
[snip]
On 23/04/14 15:46, Vincent Guittot wrote:
On 23 April 2014 13:46, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
Hi,
I'm trying to use this approach of specifying different per-cpu views on sd flags on DIE level on a TC2 platform (cluster 0 w/ CPU0/1 and cluster 1 w/ CPU2/3/4 w/o SMT). It doesn't work like in the case for the GMC/MC sd level.
[snip]
You have an inconsistency in your topology description:
That's true functional-wise but I think that this is not the reason why the code in get_group()/build_sched_groups() isn't working correctly any more with this set-up.
Like I said above, cpu_cpupower_flags() is just bogus and I only added it to illustrate the issue (I shouldn't have used SD_SHARE_POWERDOMAIN in the first place). Essentially what I want to do is bind an SD_SHARE_*FOO* flag to the GDIE-related sd's of CPU2/3/4 and not to the DIE-related sd's of CPU0/1.
I thought so far that I could achieve that by getting rid of the GDIE sd level for CPU0/1, simply by choosing the cpu_foo_mask() function in such a way that it returns the same cpu mask as its child sd level (MC), and of the DIE sd level for CPU2/3/4, because there it returns the same cpu mask as its child sd level's (GDIE) cpu mask function. This lets sd degenerate do its job of folding sd levels, which it does. The only problem I have is that the groups are not created correctly any more.
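The folding itself hinges on the span-equality test in sd_parent_degenerate(); roughly (abbreviated sketch of kernel/sched/core.c):

static int
sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
{
	/* only a parent with the same span as its child can be folded */
	if (!cpumask_equal(sched_domain_span(sd), sched_domain_span(parent)))
		return 0;

	/* ... and only if it adds no flags/behaviour over the child ... */
}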
I don't see right now how the flag SD_SHARE_FOO affects the code in get_group()/build_sched_groups().
Think of SD_SHARE_FOO as something I would like to have for all sd's of the CPUs of cluster 1 (CPU2/3/4) and not of cluster 0 (CPU0/1), in the sd level where each CPU sees two groups (group0 containing CPU0/1 and group1 containing CPU2/3/4, or vice versa) (GDIE/DIE).
-- Dietmar
[snip]
On 23 April 2014 17:26, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 23/04/14 15:46, Vincent Guittot wrote:
On 23 April 2014 13:46, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
Hi,
[snip]
You have an inconsistency in your topology description:
That's true functional-wise but I think that this is not the reason why the code in get_group()/build_sched_groups() isn't working correctly any more with this set-up.
Like I said above, cpu_cpupower_flags() is just bogus and I only added it to illustrate the issue (I shouldn't have used SD_SHARE_POWERDOMAIN in the first place).
More than the flag that is used for the example, it's about the cpumasks, which are inconsistent across CPUs for the same level; the build_sched_domain sequence relies on this consistency to build the sched_groups.
Essentially what I want to do is bind an SD_SHARE_*FOO* flag to the GDIE related sd's of CPU2/3/4 and not to the DIE related sd's of CPU0/1.
I thought so far that I can achieve that by getting rid of the GDIE sd level for CPU0/1 (simply by choosing the cpu_foo_mask() function in such a way that it returns the same cpu mask as its child sd level (MC)) and of the DIE sd level for CPU2/3/4 (because its cpu mask function returns the same cpu mask as that of its child sd level (GDIE)). This lets sd degenerate do its job of folding sd levels, which it does. The only problem I have is that the groups are not created correctly any more.
I don't see right now how the flag SD_SHARE_FOO affects the code in get_group()/build_sched_groups().
Think of SD_SHARE_FOO as something I would like to have for all sd's of the CPUs of cluster 1 (CPU2/3/4), and not of cluster 0 (CPU0/1), in the sd level where each CPU sees two groups (group0 containing CPU0/1 and group1 containing CPU2/3/4, or vice versa) (GDIE/DIE).
I'm not sure that it's feasible, because it's not possible from a topology PoV to have different flags if the span includes all cpus. Could you give us more details about what you want to achieve with this flag?
Vincent
-- Dietmar
At GDIE level: CPU1 cpu_cpupower_mask=0-1 but CPU2 cpu_cpupower_mask=0-4
so CPU2 says that it shares a power domain with CPU0, but CPU1 says the opposite
Regards Vincent
Firstly, I had to get rid of the cpumask_equal(cpu_map, sched_domain_span(sd)) condition in build_sched_domains() so that I can have two sd levels which span CPUs 0-4 (for CPU2/3/4).
But it still doesn't work correctly:
dmesg snippet 2:
CPU0 attaching sched-domain:
 domain 0: span 0-1 level MC
  groups: 0 1
  domain 1: span 0-4 level DIE   <-- error (there's only one group)
   groups: 0-4 (cpu_power = 2048)
...
CPU2 attaching sched-domain:
 domain 0: span 2-4 level MC
  groups: 2 3 4
  domain 1: span 0-4 level GDIE
ERROR: domain->groups does not contain CPU2
   groups:
ERROR: domain->cpu_power not set
ERROR: groups don't span domain->span
...
[snip]
On 24/04/14 08:30, Vincent Guittot wrote:
On 23 April 2014 17:26, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 23/04/14 15:46, Vincent Guittot wrote:
On 23 April 2014 13:46, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
Hi,
[...]
More than the flag that is used for the example, it's about the cpumasks, which are inconsistent across CPUs for the same level; the build_sched_domain sequence relies on this consistency to build the sched_groups.
Now I'm lost here. I thought so far that by specifying different cpu masks per CPU in an sd level, we get the sd level folding functionality in sd degenerate?
We discussed this here for an example on TC2 for the GMC level: https://lkml.org/lkml/2014/3/21/126
Back then I had CPU0: cpu_corepower_mask=0-1 and CPU2: cpu_corepower_mask=2, so for the GMC level the cpumasks are inconsistent across CPUs, and it worked.
The header of '[PATCH v4 1/5] sched: rework of sched_domain topology definition' mentions only the requirement "Then, each level must be a subset on the next one" and this one I haven't broken w/ my GMC/MC/GDIE/DIE set-up.
Do I miss something else here?
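As a sketch, the subset requirement could be checked per cpu for two adjacent levels of a topology table -- illustrative code only, not something the kernel actually runs:

/*
 * Sketch: "each level must be a subset of the next one" for one cpu
 * and two adjacent topology levels tl and tl_next.
 */
static bool level_is_subset(struct sched_domain_topology_level *tl,
			    struct sched_domain_topology_level *tl_next,
			    int cpu)
{
	return cpumask_subset(tl->mask(cpu), tl_next->mask(cpu));
}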
Essentially what I want to do is bind an SD_SHARE_*FOO* flag to the GDIE related sd's of CPU2/3/4 and not to the DIE related sd's of CPU0/1.
I thought so far that I can achieve that by getting rid of the GDIE sd level for CPU0/1 (simply by choosing the cpu_foo_mask() function in such a way that it returns the same cpu mask as its child sd level (MC)) and of the DIE sd level for CPU2/3/4 (because its cpu mask function returns the same cpu mask as that of its child sd level (GDIE)). This lets sd degenerate do its job of folding sd levels, which it does. The only problem I have is that the groups are not created correctly any more.
I don't see right now how the flag SD_SHARE_FOO affects the code in get_group()/build_sched_groups().
Think of SD_SHARE_FOO as something I would like to have for all sd's of the CPUs of cluster 1 (CPU2/3/4), and not of cluster 0 (CPU0/1), in the sd level where each CPU sees two groups (group0 containing CPU0/1 and group1 containing CPU2/3/4, or vice versa) (GDIE/DIE).
I'm not sure that it's feasible, because it's not possible from a topology PoV to have different flags if the span includes all cpus. Could you give us more details about what you want to achieve with this flag?
IMHO, the flag is not important for this discussion. OTOH, information like "you can't use the sd degenerate functionality to fold adjacent sd levels (GFOO/FOO) at an sd level which spans all CPUs" would be. I want to make sure we understand what the limitations are when using folding of adjacent sd levels based on per-cpu differences in the return value of the cpu_mask functions.
-- Dietmar
[...]
On 24 April 2014 14:48, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 24/04/14 08:30, Vincent Guittot wrote:
On 23 April 2014 17:26, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 23/04/14 15:46, Vincent Guittot wrote:
On 23 April 2014 13:46, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
Hi,
[...]
More than the flag that is used for the example, it's about the cpumasks, which are inconsistent across CPUs for the same level; the build_sched_domain sequence relies on this consistency to build the sched_groups.
Now I'm lost here. I thought so far that by specifying different cpu masks per CPU in an sd level, we get the sd level folding functionality in sd degenerate?
We discussed this here for an example on TC2 for the GMC level: https://lkml.org/lkml/2014/3/21/126
Back then I had CPU0: cpu_corepower_mask=0-1 and CPU2: cpu_corepower_mask=2, so for the GMC level the cpumasks are inconsistent across CPUs, and it worked.
The example above is consistent because CPU2 mask and CPU0 mask are fully exclusive
so CPU0: cpu_corepower_mask=0-1 CPU2: cpu_corepower_mask=2 are consistent
CPU0: cpu_corepower_mask=0-2 CPU2: cpu_corepower_mask=0-2 are also consistent
but
CPU0: cpu_corepower_mask=0-1 CPU2: cpu_corepower_mask=0-2 are not consistent
and your example uses the last configuration
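Stated as code, the rule is roughly the following -- a sketch only, not an explicit check in build_sched_domains():

/*
 * Sketch of the consistency rule: for any two CPUs, the cpumasks
 * returned by a level's mask function must be either equal or fully
 * exclusive (the SD_OVERLAP case below is the exception).
 */
static bool level_masks_consistent(const struct cpumask *a,
				   const struct cpumask *b)
{
	return cpumask_equal(a, b) || !cpumask_intersects(a, b);
}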
To be more precise, the rule above applies to the default SDT definition, but the flag SD_OVERLAP enables this kind of overlap between groups. Have you tried it?
Vincent
[...]
On 25/04/14 08:45, Vincent Guittot wrote:
[...]
Back then I had CPU0: cpu_corepower_mask=0-1 and CPU2: cpu_corepower_mask=2, so for the GMC level the cpumasks are inconsistent across CPUs, and it worked.
The example above is consistent because CPU2 mask and CPU0 mask are fully exclusive
OK, got it now. The cpu mask functions on an sd level can return different (but then exclusive) cpu masks, or they can all return the same cpu mask (the DIE level in the example). Like you said, we still have to respect the topology of the system.
This essentially excludes the DIE level (i.e. the sd level which spans all CPUs) from playing this 'sd level folding via sd degenerate' game on a system which sets FORCE_SD_OVERLAP to false or doesn't use the SDTL_OVERLAP tl flag.
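For illustration, a level can opt in to overlapping groups the same way the NUMA levels do, via SDTL_OVERLAP in the table -- a hypothetical sketch reusing names from this thread (cpu_gdie_mask/cpu_gdie_flags are made up):

/*
 * Hypothetical sketch: marking the GDIE level with SDTL_OVERLAP makes
 * build_sched_domains() use build_overlap_sched_groups() for it, as it
 * does for the NUMA levels.
 */
static struct sched_domain_topology_level foo_topology[] = {
	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
	{ cpu_gdie_mask, cpu_gdie_flags, SDTL_OVERLAP, SD_INIT_NAME(GDIE) },
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};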
[...]
To be more precise, the rule above applies to the default SDT definition, but the flag SD_OVERLAP enables this kind of overlap between groups. Have you tried it?
Setting FORCE_SD_OVERLAP indeed changes the scenario a bit (we're now using build_overlap_sched_groups() instead of build_sched_groups()). It looks better, but the groups for CPU0/1 at DIE level are wrong (to get this far I still had to comment out the 'if cpu_map is equal to sd span of sd then break' check in build_sched_domains(), though).
dmesg snippet:
CPU0 attaching sched-domain:
 domain 0: span 0-1 level MC
  groups: 0 1
  domain 1: span 0-4 level DIE
   groups: 0-4 (cpu_power = 5120) 0-1 (cpu_power = 2048)   <-- error !!!
CPU1 attaching sched-domain:
 domain 0: span 0-1 level MC
  groups: 1 0
  domain 1: span 0-4 level DIE
   groups: 0-1 (cpu_power = 2048) 0-4 (cpu_power = 5120)   <-- error !!!
CPU2 attaching sched-domain:
 domain 0: span 2-4 level GMC
  groups: 2 3 4
  domain 1: span 0-4 level GDIE
   groups: 2-4 (cpu_power = 3072) 0-1 (cpu_power = 2048)
...
The feature I'm currently working on adds sd energy information to the sd levels of the sd topology level table. I essentially added another column of sd-energy-related func ptrs (next to the flags-related one) and wanted to use 'sd level folding via sd degenerate' at MC and DIE level to have different sd energy information per CPU.
In any case, this dependency on FORCE_SD_OVERLAP would be less than nice for this feature though :-( A way out would be an 'int cpu' parameter, but we already discussed this back then for the flag function.
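A sketch of that alternative -- purely hypothetical, and not what the v4 patchset implements (its flags functions take no arguments):

/*
 * Hypothetical: a per-cpu flags function; this would require changing
 * the sched_domain_flags_f typedef in include/linux/sched.h.
 */
typedef int (*sched_domain_flags_f)(int cpu);

static int cpu_cluster_flags(int cpu)
{
	/* e.g. only cluster 1 (CPU2/3/4) shares a power domain */
	return cpu_topology[cpu].socket_id == 1 ? SD_SHARE_POWERDOMAIN : 0;
}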
Thanks,
-- Dietmar
Vincent
[...]
[...]
To be more precise, the rule above applies to the default SDT definition, but the flag SD_OVERLAP enables this kind of overlap between groups. Have you tried it?
I've never tried degenerate stuff with SD_OVERLAP; it might horribly explode -- it's not actually meant to work.
The SD_OVERLAP comes from not fully connected NUMA topologies; suppose something like:
0------1
|      |
|      |
2------3
or:
( 10 20 20  0 )
( 20 10  0 20 )
( 20  0 10 20 )
(  0 20 20 10 )
Your domain level that models the single-hop/20 distance has overlapping masks:
N0: 0-2
N1: 0,1,3
N2: 0,2,3
N3: 1-3
I've never tried to construct a NUMA topology that would be overlapping and have redundant bits in.
On Fri, Apr 25, 2014 at 06:04:19PM +0200, Peter Zijlstra wrote:
[...]
I've never tried degenerate stuff with SD_OVERLAP; it might horribly explode -- it's not actually meant to work.
The SD_OVERLAP comes from not fully connected NUMA topologies; suppose something like:
0------1
|      |
|      |
2------3
or:
( 10 20 20  0 )
( 20 10  0 20 )
( 20  0 10 20 )
(  0 20 20 10 )
d'0h: s/0/30/
0 <-> 3 is 2 hops; I was too focused on the single-hop case.
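To make the corrected example concrete, here is a standalone sketch (illustrative only; the kernel builds the equivalent masks in sched_init_numa()) that derives the single-hop masks from the distance table with the s/0/30/ fix applied:

#include <stdio.h>

/* Distance table of the 4-node ring, with 30 for the 2-hop pairs. */
static const int dist[4][4] = {
	{ 10, 20, 20, 30 },
	{ 20, 10, 30, 20 },
	{ 20, 30, 10, 20 },
	{ 30, 20, 20, 10 },
};

int main(void)
{
	/* Print each node's mask at the distance <= 20 level. */
	for (int n = 0; n < 4; n++) {
		printf("N%d:", n);
		for (int m = 0; m < 4; m++)
			if (dist[n][m] <= 20)
				printf(" %d", m);
		printf("\n");
	}
	return 0;
}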
[...]
On 11/04/14 10:44, Vincent Guittot wrote:
[...]
Hi Vincent,
given the discussion we had for v1-v3 and a short boot test of v4:
For patch 1/5, 4/5, 5/5 on ARM TC2 (heterogeneous dual socket w/o SMT machine):
Reviewed-by: Dietmar Eggemann dietmar.eggemann@arm.com
Tested-by: Dietmar Eggemann dietmar.eggemann@arm.com
Cheers,
-- Dietmar
[...]
On 12 April 2014 14:56, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 11/04/14 10:44, Vincent Guittot wrote:
[...]
Hi Vincent,
given the discussion we had for v1-v3 and a short boot test of v4:
For patch 1/5, 4/5, 5/5 on ARM TC2 (heterogeneous dual socket w/o SMT machine):
Reviewed-by: Dietmar Eggemann dietmar.eggemann@arm.com
Tested-by: Dietmar Eggemann dietmar.eggemann@arm.com
Thanks
Cheers,
-- Dietmar
[...]
On Sat, Apr 12, 2014 at 01:56:16PM +0100, Dietmar Eggemann wrote:
On 11/04/14 10:44, Vincent Guittot wrote:
This patchset was previously part of the larger tasks packing patchset [1]. I have splitted the latter in 3 different patchsets (at least) to make the thing easier. -configuration of sched_domain topology (this patchset) -update and consolidation of cpu_power -tasks packing algorithm
Based on Peter Z's proposal [2][3], this patchset modifies the way to configure the sched_domain level in order to let architectures to add specific level like the current BOOK level or the proposed power gating level for ARM architecture.
[1] https://lkml.org/lkml/2013/10/18/121 [2] https://lkml.org/lkml/2013/11/5/239 [3] https://lkml.org/lkml/2013/11/5/449
Hi Vincent,
given the discussion we had for v1-v3 and a short boot test of v4:
For patch 1/5, 4/5, 5/5 on ARM TC2 (heterogeneous dual socket w/o SMT machine):
Reviewed-by: Dietmar Eggemann dietmar.eggemann@arm.com
Tested-by: Dietmar Eggemann dietmar.eggemann@arm.com
Thanks guys!