On 6 March 2014 01:09, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 05/03/14 07:18, Vincent Guittot wrote:
We replace the old way to configure the scheduler topology with a new method which enables a platform to declare additional levels (if needed).
We still have a default topology table definition that can be used by platforms that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can be overwritten by an arch which wants to add new levels where load balancing makes sense, like a BOOK or power-gating level.
For each level, we need a function pointer that returns the cpumask for each cpu, the flags configuration and a name. Each level must be a subset of
Maybe it's worth mentioning here that those flags are from the set of sd topology flags, to distinguish them from the set of sd behavioural flags. The latter can't be set via this interface.
Yes, I will add the list of flags that can be set with the table.
the next one. The build sequence of the sched_domain will take care of removing useless levels, like those with only 1 CPU and those with the same CPU span and no more relevant information for load balancing than their child.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
 arch/ia64/include/asm/topology.h |  24 ----
 arch/s390/include/asm/topology.h |   2 -
 arch/tile/include/asm/topology.h |  33 ------
 include/linux/sched.h            |  29 +++++
 include/linux/topology.h         | 128 +++------------------
 kernel/sched/core.c              | 227 +++++++++++++++++++-------------------
 6 files changed, 156 insertions(+), 287 deletions(-)
[snip]
-#define for_each_sd_topology(tl) \
for (tl = sched_domain_topology; tl->init; tl++)
Why is sched_domains_curr_level now outside #ifdef CONFIG_NUMA?
It should not, and neither should its use in sd_init().
+static int sched_domains_curr_level;
+
 #ifdef CONFIG_NUMA
 static int sched_domains_numa_levels;
 static int *sched_domains_numa_distance;
 static struct cpumask ***sched_domains_numa_masks;
-static int sched_domains_curr_level;
-
-static inline int sd_local_flags(int level)
-{
-	if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE)
-		return 0;
-
-	return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
-}
+#endif
+
+/*
+ * SD_flags allowed in topology descriptions.
+ *
+ * SD_SHARE_CPUPOWER      - describes SMT topologies
+ * SD_SHARE_PKG_RESOURCES - describes shared caches
+ * SD_NUMA                - describes NUMA topologies
+ *
+ * Odd one out:
+ * SD_ASYM_PACKING        - describes SMT quirks
+ */
+#define TOPOLOGY_SD_FLAGS		\
+	(SD_SHARE_CPUPOWER |		\
+	 SD_SHARE_PKG_RESOURCES |	\
+	 SD_NUMA |			\
+	 SD_ASYM_PACKING)
 static struct sched_domain *
-sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
+sd_init(struct sched_domain_topology_level *tl, int cpu)
 {
 	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
-	int level = tl->numa_level;
-	int sd_weight = cpumask_weight(
-			sched_domains_numa_masks[level][cpu_to_node(cpu)]);
+	int sd_weight;
The next line could be guarded by #ifdef CONFIG_NUMA. We still use #ifdef CONFIG_NUMA later in sd_init() though.
/*
* Ugly hack to pass state to sd_numa_mask()...
*/
sched_domains_curr_level = tl->numa_level;
sd_weight = cpumask_weight(tl->mask(cpu));
	if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
			"wrong sd_flags in topology description\n"))
		tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;

	*sd = (struct sched_domain){
		.min_interval		= sd_weight,
		.max_interval		= 2*sd_weight,
		.busy_factor		= 32,
		.imbalance_pct		= 125,
-		.cache_nice_tries	= 2,
-		.busy_idx		= 3,
-		.idle_idx		= 2,
+		.cache_nice_tries	= 0,
+		.busy_idx		= 0,
+		.idle_idx		= 0,
 		.newidle_idx		= 0,
 		.wake_idx		= 0,
 		.forkexec_idx		= 0,
Why do we want to explicitly set those indexes to 0 here? IMHO, the memory for *sd is zeroed out beforehand. This is true for all data members which are set to 0 later in this function, including the `| 0*SD_FOO` entries. Dropping them would IMHO make the code more readable.
I would say that it makes the configuration more readable and modifiable, because you have the full list of flags that can be set.
 		.flags			= 1*SD_LOAD_BALANCE
 					| 1*SD_BALANCE_NEWIDLE
-					| 0*SD_BALANCE_EXEC
-					| 0*SD_BALANCE_FORK
+					| 1*SD_BALANCE_EXEC
+					| 1*SD_BALANCE_FORK
 					| 0*SD_BALANCE_WAKE
-					| 0*SD_WAKE_AFFINE
+					| 1*SD_WAKE_AFFINE
 					| 0*SD_SHARE_CPUPOWER
 					| 0*SD_SHARE_PKG_RESOURCES
-					| 1*SD_SERIALIZE
+					| 0*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
-					| 1*SD_NUMA
-					| sd_local_flags(level)
+					| 0*SD_NUMA
+					| tl->sd_flags
 					,
 		.last_balance		= jiffies,
 		.balance_interval	= sd_weight,
 		.smt_gain		= 0,
 		.max_newidle_lb_cost	= 0,
 		.next_decay_max_lb_cost	= jiffies,
+#ifdef CONFIG_SCHED_DEBUG
+		.name			= tl->name,
+#endif
 	};
-	SD_INIT_NAME(sd, NUMA);
 	sd->private = &tl->data;
[snip]
+/*
+ * Topology list, bottom-up.
+ */
+static struct sched_domain_topology_level default_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
+		SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_MC
+	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
+#endif
+#ifdef CONFIG_SCHED_BOOK
+	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
+#endif
I never got why CPU is renamed DIE here. Might confuse people when they use /proc/sys/kernel/sched_domain/cpuX/domainY/name or sched_domain_debug_one().
In fact, CPU is also confusing because it's used for different things. But if it makes things even more confusing, I can go back to CPU.
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+struct sched_domain_topology_level *sched_domain_topology = default_topology;