The goal of the power aware scheduler design is to integrate all cpu power management in the scheduler. As a first step, idle state selection was moved into the scheduler. This lets the scheduler decide which idle state to enter using metrics it already tracks. In addition, knowing the cost of entering and exiting an idle state can help the scheduler do load balancing better. It would be even better if the idle states could let the scheduler know about the impact on the cache contents when the cpu enters that state. The scheduler can make use of this data while waking up tasks or scheduling new tasks. To make way for such information to be propagated to the scheduler, enumerate idle states in the scheduler topology levels.
Doing so will also let the scheduler know the idle states that a *sched_group* can enter at a given level of scheduling domain. This means the scheduler is implicitly made aware that an idle state is not necessarily a per-cpu state; it can be a per-core state or a state shared by the group of cpus that a sched_group specifies. This higher level cpuidle information is also missing from the scheduler today.
The low level platform cpuidle drivers must expose to the scheduler the idle states at the different topology levels. This patch takes up the powernv cpuidle driver to illustrate this. The scheduling topology is left to the arch to decide. Commit 143e1e28cb40bed836 introduced this. The platform idle drivers are thus in a better position to fill up the topology levels with appropriate cpuidle state information while they discover it themselves.
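The idea of attaching idle states to topology levels can be modelled outside the kernel. The following is a minimal userspace sketch, not the kernel implementation: `struct idle_state`, `struct topology_level`, `level_add_state()`, `level_deepest_state()` and `populate_example_topology()` are hypothetical stand-ins, and the latency and residency numbers are illustrative assumptions rather than values taken from this patch.

```c
#include <stddef.h>

#define STATE_MAX 4	/* hypothetical stand-in for CPUIDLE_STATE_MAX */

/* Simplified, hypothetical stand-in for struct cpuidle_state. */
struct idle_state {
	const char *name;
	unsigned int exit_latency;	/* microseconds, illustrative */
	unsigned int target_residency;	/* microseconds, illustrative */
};

/* Simplified, hypothetical stand-in for struct sched_domain_topology_level. */
struct topology_level {
	const char *name;
	struct idle_state states[STATE_MAX];
	int nr_states;
};

/* Level 0 models the thread (SMT) level, level 1 the core level. */
static struct topology_level topology[2] = {
	{ .name = "SMT" },
	{ .name = "MC" },
};

/* Append a state to a level; returns -1 if the level is already full. */
static int level_add_state(struct topology_level *tl, struct idle_state s)
{
	if (tl->nr_states >= STATE_MAX)
		return -1;
	tl->states[tl->nr_states++] = s;
	return 0;
}

/*
 * Return the state with the largest exit latency that still fits within
 * latency_limit at this level, or NULL if no state qualifies.
 */
static const struct idle_state *
level_deepest_state(const struct topology_level *tl, unsigned int latency_limit)
{
	const struct idle_state *best = NULL;
	int i;

	for (i = 0; i < tl->nr_states; i++) {
		const struct idle_state *s = &tl->states[i];

		if (s->exit_latency > latency_limit)
			continue;
		if (!best || s->exit_latency > best->exit_latency)
			best = s;
	}
	return best;
}

/* Fill the levels the way a platform driver might: snooze per thread, the rest per core. */
static void populate_example_topology(void)
{
	topology[0].nr_states = 0;
	topology[1].nr_states = 0;

	level_add_state(&topology[0], (struct idle_state){ "snooze", 0, 0 });
	level_add_state(&topology[1], (struct idle_state){ "Nap", 10, 100 });
	level_add_state(&topology[1], (struct idle_state){ "FastSleep", 300, 1000000 });
}
```

With the table populated, a caller that can tolerate 50us of wakeup latency at the core level would get "Nap" from `level_deepest_state(&topology[1], 50)`, while a limit of 500us would yield "FastSleep". The point of the sketch is only that a per-level table lets a group-wide query be answered without walking per-cpu driver data.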
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle-powernv.c | 8 ++++++++
 include/linux/sched.h             | 3 +++
 2 files changed, 11 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 95ef533..4232fbc 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -184,6 +184,11 @@ static int powernv_add_idle_states(void)
 
 	dt_idle_states = len_flags / sizeof(u32);
 
+#ifdef CONFIG_SCHED_POWER
+	/* Snooze is a thread level idle state; the rest are core level idle states */
+	sched_domain_topology[0].states[0] = powernv_states[0];
+#endif
+
 	for (i = 0; i < dt_idle_states; i++) {
 
 		flags = be32_to_cpu(idle_state_flags[i]);
@@ -209,6 +214,9 @@ static int powernv_add_idle_states(void)
 			powernv_states[nr_idle_states].enter = &fastsleep_loop;
 			nr_idle_states++;
 		}
+#ifdef CONFIG_SCHED_POWER
+		sched_domain_topology[1].states[i] = powernv_states[nr_idle_states];
+#endif
 	}
 
 	return nr_idle_states;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5dd99b5..009da6a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1027,6 +1027,9 @@ struct sched_domain_topology_level {
 #ifdef CONFIG_SCHED_DEBUG
 	char *name;
 #endif
+#ifdef CONFIG_SCHED_POWER
+	struct cpuidle_state states[CPUIDLE_STATE_MAX];
+#endif
 };
 
 extern struct sched_domain_topology_level *sched_domain_topology;