On 23 December 2013 18:22, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
Hi Vincent,
On 18/12/13 14:13, Vincent Guittot wrote:
This patch applies on top of the two patches [1][2] that Peter has proposed for creating a new way to initialize sched_domains. It includes some minor compilation fixes and a first attempt at using this new method on an ARM platform.

[1] https://lkml.org/lkml/2013/11/5/239
[2] https://lkml.org/lkml/2013/11/5/449
I came up w/ a similar implementation proposal for an arch specific interface for scheduler domain set-up a couple of days ago:
[1] https://lkml.org/lkml/2013/12/13/182
I had the following requirements in mind:

(1) The arch should not be able to fine-tune individual scheduler
    behaviour, i.e. get rid of the arch-specific SD_FOO_INIT macros.

(2) Unify the set-up code for conventional and NUMA scheduler domains.

(3) The arch is able to specify additional scheduler domain levels, other
    than SMT, MC, BOOK, and CPU.

(4) Allow the integration of additional topology-related data (e.g.
    energy information) into the scheduler.
Moreover, I think now that:
(5) Something like the existing default set-up via default_topology[] is
    needed to avoid code duplication for archs not interested in (3) or (4).
Hi Dietmar,
I agree. This default array is available in Peter's patch, and my patches overwrite it only if the arch wants to add more/new levels.
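For reference, the override mechanism looks roughly like this. This is a sketch only: the struct layout and helper names are assumptions based on my reading of Peter's patches [1][2], not the verbatim API.

/*
 * Sketch only: the shape of the table-driven set-up from Peter's
 * patches. Field layout and helper names are assumptions, not the
 * verbatim API.
 */
static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};

/* The scheduler builds domains from whatever table this points at. */
struct sched_domain_topology_level *sched_domain_topology = default_topology;

An arch that is happy with the defaults does nothing; an arch that wants more/new levels replaces the pointer with its own table, as the ARM snippet at the end of this mail does.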
[snip]
CPU2:
 domain 0: span 2-3 level: SMT
  flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
  groups: 0 1
 domain 1: span 2-7 level: MC
  flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
  groups: 2-7 4-5 6-7
 domain 2: span 0-7 level: MC
  flags: SD_SHARE_PKG_RESOURCES
  groups: 2-7 0-1
 domain 3: span 0-15 level: CPU
  flags:
  groups: 0-7 8-15
In this case, we have an additional MC-level sched_domain for this subset (2-7) of cores, so we can trigger some load balancing within this subset before doing it across the complete cluster (which is the last level of cache in my example).
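A table along these lines would yield the hierarchy above. Again a sketch: the power-domain mask helper (cpu_corepower_mask) and the flag functions are assumptions based on this patch set, not necessarily its exact code.

static inline int cpu_corepower_flags(void)
{
	return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
}

static struct sched_domain_topology_level arm_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_MC
	/* inner MC: cores sharing cache and a power domain (2-7 above) */
	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(MC) },
	/* outer MC: all cores sharing the last level of cache (0-7 above) */
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(CPU) },
	{ NULL, },
};

Note the two MC entries with different flags: this is the "duplication" discussed further down.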
I think the weakest point right now is the condition in sd_init() where we convert the topology flags into scheduler behaviour. We not only introduce a very tight coupling between topology flags and scheduler domain levels, but we also need to follow a certain order in the initialization. This bit needs more thinking.
IMHO, these settings will disappear sooner or later; as an example, the idle/busy _idx are going to be removed by Alex's patch.
We can add more levels to describe other dependencies/independencies, such as the frequency scaling dependency; as a result, the final sched_domain topology will have additional levels (if they have not been removed during the degenerate sequence).
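The coupling in question is of this shape. This is a simplified sketch of sd_init(), not the exact code from the patches; the values are illustrative, and the _idx settings are among those expected to disappear.

/*
 * Simplified sketch of the flag -> behaviour translation in sd_init();
 * values are illustrative, not the exact ones from the patches.
 */
static void sd_init(struct sched_domain *sd)
{
	if (sd->flags & SD_SHARE_CPUPOWER) {
		/* SMT siblings: cheap migrations, small imbalance allowed */
		sd->imbalance_pct = 110;
	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
		/* cores sharing a cache */
		sd->imbalance_pct = 117;
		sd->cache_nice_tries = 1;
		sd->busy_idx = 2;	/* one of the _idx settings Alex's
					 * patches remove */
	} else {
		/* package/CPU level and above */
		sd->cache_nice_tries = 1;
		sd->busy_idx = 2;
		sd->idle_idx = 1;
	}
}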
My concern is about the configuration of the table that is used to create the sched_domains. Some levels are "duplicated" with different flag configurations, which makes the table harder to read, and we must also take care of the order, because a parent has to gather all the CPUs of its children. So we must choose which capabilities will be a subset of the others. The order is almost straightforward when we describe one or two kinds of capabilities (package resource sharing and power sharing), but it can become complex if we want to add more.
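A hypothetical sanity check (not part of the patches) makes the ordering rule explicit: for every CPU, each level's mask must be a subset of the next level's mask.

/*
 * Hypothetical check illustrating the ordering rule: a parent level
 * must gather all the CPUs of its child level.
 */
static void check_topology_order(struct sched_domain_topology_level *tl)
{
	int cpu, i;

	for_each_possible_cpu(cpu)
		for (i = 0; tl[i + 1].mask; i++)
			WARN_ON(!cpumask_subset(tl[i].mask(cpu),
						tl[i + 1].mask(cpu)));
}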
I'm not sure if the idea to create a dedicated sched_domain level for every topology flag representing a specific functionality will scale. From the perspective of energy-aware scheduling, we need e.g. energy costs (P- and C-state) which can only be populated towards the scheduler via an additional sub-struct and an additional function arch_sd_energy(), as depicted in Morten's email:

[2] https://lkml.org/lkml/2013/11/14/102

It's up to the arch to decide how many levels it wants to add, and whether a dedicated level is needed or whether one level can gather several features/flags. IMHO, having sub-structs for energy information, like what we have for the cpu/group capacity, will not prevent us from having a first and quick topology tree description.
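For context, the proposal in [2] amounts to attaching per-level energy data of roughly this shape; every name below is an illustrative assumption, and the arch_sd_energy() signature is guessed, not quoted.

/*
 * Sketch of per-level energy data along the lines of Morten's email [2];
 * all names are assumptions.
 */
struct capacity_state {
	unsigned long cap;	/* compute capacity at this P-state */
	unsigned long power;	/* power consumed at this P-state */
};

struct sched_group_energy {
	unsigned int nr_cap_states;
	struct capacity_state *cap_states;	/* P-state table */
	unsigned long idle_power;		/* simplified C-state cost */
};

/* Assumed hook for the arch to hand the data to the scheduler. */
const struct sched_group_energy *arch_sd_energy(int cpu, int level);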
[snip]
+static int __init arm_sched_topology(void)
+{
+	sched_domain_topology = arm_topology;
return missing
good catch
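With the missing return added, the function reads:

static int __init arm_sched_topology(void)
{
	/* point the scheduler at the ARM-specific topology table */
	sched_domain_topology = arm_topology;

	return 0;	/* the return Dietmar spotted as missing */
}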
Thanks
Vincent