On Wed, Nov 13, 2013 at 04:29:19PM +0000, Peter Zijlstra wrote:
On Wed, Nov 13, 2013 at 03:47:16PM +0000, Dietmar Eggemann wrote:
On 12/11/13 18:08, Peter Zijlstra wrote:
On Tue, Nov 12, 2013 at 05:43:36PM +0000, Dietmar Eggemann wrote:
This patch removes the sched_domain initializer macros SD_[SIBLING|MC|BOOK|CPU]_INIT in core.c and in archs and replaces them with calls to the new function sd_init(). The function sd_init incorporates the already existing function sd_numa_init().
Your patch retains far too much of the weird behavioural variations we have, nor does it create a proper separation between topology and behaviour.
Could you please explain a little bit further on the weird behavioural variations. Are you referring to the specific SD_ flags or sd_domain levels?
+++ b/arch/ia64/kernel/topology.c @@ -99,6 +99,14 @@ out:
subsys_initcall(topology_init);
+void arch_sd_customize(int level, struct sched_domain *sd, int cpu) +{
if (level == SD_LVL_CPU) {
sd->cache_nice_tries = 2;
sd->flags &= ~SD_PREFER_SIBLING;
}
+}
+++ b/arch/tile/kernel/smp.c @@ -254,3 +254,15 @@ void smp_send_reschedule(int cpu) }
#endif /* CHIP_HAS_IPI() */
+void arch_sd_customize(int level, struct sched_domain *sd, int cpu) +{
if (level == SD_LVL_CPU) {
sd->min_interval = 4;
sd->max_interval = 128;
sd->flags &= ~(SD_WAKE_AFFINE | SD_PREFER_SIBLING);
sd->balance_interval = 32;
}
+}
Many of these differences are just bitrot / accidents, and the different min interval for tile was already taken care of by basing the intervals off of the domain weight.
On that, you should also not rely on these SD_LVL things; if we wanted to inject an extra level they'd go all funny.
I agree that this patch doesn't separate behaviour and topology and I will consider this going forward.
Please take the patch I did and work from there.
We might indeed have to have a single arch_() function that adds SD_flags, but please restrict the flags it can set -- never allow it to set behavioural flags.
Understood. Simply exporting an sd_domain pointer is a no-go.
I was more thinking along the lines of:
unsigned long arch_sd_flags(unsigned long sd_flags) { return 0 }
Used like:
extra_sd_flags = arch_sd_flags(sd->sd_flags); if (extra_sd_flags & FOO) { WARN("silly bugger: %x\n", extra_sd_flags); extra_sd_flags &= ~FOO; } sd->sd_flags |= extra_sd_flags;
Or something.
We need a way to know which group of cpus the flag applies to. If we don't want to pass a pointer to the sched_domain and we want to replace the current named sched_domain levels with something more flexible, the only solution I can think of right now is to pass a cpumask to the arch code. Better suggestions?
If we let arch generate the topology it could set the flags as well. But that means that an arch would have to deal with generating the topology even if it just needs to flip a single flag in the default topology.
Another thing is if we want to put energy related information into the sched_domain hierarchy. If we want various energy costs (P and C state) to be represented here we would need to modify more than just flags.
One way to do that is to put the energy information into a sub-struct and have another arch_sd_energy() call that allows the arch to populate that struct with relevant information.
Morten