On 6 March 2014 01:09, Dietmar Eggemann dietmar.eggemann@arm.com wrote:
On 05/03/14 07:18, Vincent Guittot wrote:
We replace the old way to configure the scheduler topology with a new method which enables a platform to declare additional levels (if needed).
We still have a default topology table definition that can be used by platforms that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can be overwritten by an arch which wants to add new levels where load balancing makes sense, like a BOOK or power-gating level.
For each level, we need a function pointer that returns the cpumask for each cpu, the flags configuration and a name. Each level must be a subset of
Maybe it's worth mentioning here that those flags are from the set of sd topology flags, to distinguish them from the set of sd behavioural flags. The latter can't be set via this interface.
Yes, I will add the list of flags that can be set with the table.
the next one. The build sequence of the sched_domain will take care of removing useless levels, like those with only 1 CPU and those with the same CPU span and no more relevant information for load balancing than their child.
Signed-off-by: Vincent Guittot vincent.guittot@linaro.org
 arch/ia64/include/asm/topology.h |  24 ----
 arch/s390/include/asm/topology.h |   2 -
 arch/tile/include/asm/topology.h |  33 ------
 include/linux/sched.h            |  29 +++++
 include/linux/topology.h         | 128 +++------------------
 kernel/sched/core.c              | 227 +++++++++++++++++++-------------------
 6 files changed, 156 insertions(+), 287 deletions(-)
[snip]
-#define for_each_sd_topology(tl) \
for (tl = sched_domain_topology; tl->init; tl++)
Why is sched_domains_curr_level now outside #ifdef CONFIG_NUMA?
It should not, and neither should its use in sd_init().
+static int sched_domains_curr_level;
+
 #ifdef CONFIG_NUMA
 static int sched_domains_numa_levels;
 static int *sched_domains_numa_distance;
 static struct cpumask ***sched_domains_numa_masks;
-static int sched_domains_curr_level;
-
-static inline int sd_local_flags(int level)
-{
-	if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE)
-		return 0;
-
-	return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
-}
+#endif
+
+/*
+ * SD_flags allowed in topology descriptions.
+ *
+ * SD_SHARE_CPUPOWER      - describes SMT topologies
+ * SD_SHARE_PKG_RESOURCES - describes shared caches
+ * SD_NUMA                - describes NUMA topologies
+ *
+ * Odd one out:
+ * SD_ASYM_PACKING        - describes SMT quirks
+ */
+#define TOPOLOGY_SD_FLAGS		\
+	(SD_SHARE_CPUPOWER |		\
+	 SD_SHARE_PKG_RESOURCES |	\
+	 SD_NUMA |			\
+	 SD_ASYM_PACKING)
 static struct sched_domain *
-sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
+sd_init(struct sched_domain_topology_level *tl, int cpu)
 {
 	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
-	int level = tl->numa_level;
-	int sd_weight = cpumask_weight(
-			sched_domains_numa_masks[level][cpu_to_node(cpu)]);
+	int sd_weight;
The next line could be guarded by #ifdef CONFIG_NUMA. We still use #ifdef CONFIG_NUMA later in sd_init() though.
/*
* Ugly hack to pass state to sd_numa_mask()...
*/
sched_domains_curr_level = tl->numa_level;
sd_weight = cpumask_weight(tl->mask(cpu));
	if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
			"wrong sd_flags in topology description\n"))
		tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;

	*sd = (struct sched_domain){
		.min_interval		= sd_weight,
		.max_interval		= 2*sd_weight,
		.busy_factor		= 32,
		.imbalance_pct		= 125,
-		.cache_nice_tries	= 2,
-		.busy_idx		= 3,
-		.idle_idx		= 2,
+		.cache_nice_tries	= 0,
+		.busy_idx		= 0,
+		.idle_idx		= 0,
 		.newidle_idx		= 0,
 		.wake_idx		= 0,
 		.forkexec_idx		= 0,
Why do we want to explicitly set those indexes to 0 here? IMHO, the memory for *sd is zeroed out beforehand. This is true for all data members which are set to 0 later in this function, including the `| 0*SD_FOO` entries. Dropping them would IMHO make the code more readable.
I would say that it makes the configuration more readable and modifiable, because you have the full list of flags that can be set.
 		.flags			= 1*SD_LOAD_BALANCE
 					| 1*SD_BALANCE_NEWIDLE
-					| 0*SD_BALANCE_EXEC
-					| 0*SD_BALANCE_FORK
+					| 1*SD_BALANCE_EXEC
+					| 1*SD_BALANCE_FORK
 					| 0*SD_BALANCE_WAKE
-					| 0*SD_WAKE_AFFINE
+					| 1*SD_WAKE_AFFINE
 					| 0*SD_SHARE_CPUPOWER
 					| 0*SD_SHARE_PKG_RESOURCES
-					| 1*SD_SERIALIZE
+					| 0*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
-					| 1*SD_NUMA
-					| sd_local_flags(level)
+					| 0*SD_NUMA
+					| tl->sd_flags
 					,
 		.last_balance		= jiffies,
 		.balance_interval	= sd_weight,
 		.smt_gain		= 0,
 		.max_newidle_lb_cost	= 0,
 		.next_decay_max_lb_cost	= jiffies,
+#ifdef CONFIG_SCHED_DEBUG
+		.name			= tl->name,
+#endif
 	};
-	SD_INIT_NAME(sd, NUMA);
 	sd->private = &tl->data;
[snip]
+/*
+ * Topology list, bottom-up.
+ */
+static struct sched_domain_topology_level default_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
+		SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_MC
+	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
+#endif
+#ifdef CONFIG_SCHED_BOOK
+	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
+#endif
I never got why CPU is renamed DIE here. Might confuse people when they use /proc/sys/kernel/sched_domain/cpuX/domainY/name or sched_domain_debug_one().
In fact, CPU is also confusing because it's used for different things. But if it makes things even more confusing, I can go back to CPU.
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+struct sched_domain_topology_level *sched_domain_topology = default_topology;