Hi Viresh,
On Thu, Oct 04, 2012 at 07:02:03AM +0100, Viresh Kumar wrote:
Hi Morten,
On 22 September 2012 00:02, morten.rasmussen@arm.com wrote:
From: Morten Rasmussen morten.rasmussen@arm.com
This patch introduces the basic SCHED_HMP infrastructure. Each class of cpus is represented by a hmp_domain and tasks will only be moved between these domains when their load profiles suggest it is beneficial.
SCHED_HMP relies heavily on the task load-tracking introduced in Paul Turners fair group scheduling patch set:
https://lkml.org/lkml/2012/8/23/267
SCHED_HMP requires that the platform implements arch_get_hmp_domains() which should set up the platform specific list of hmp_domains. It is also assumed that the platform disables SD_LOAD_BALANCE for the appropriate sched_domains.
An explanation of this requirement would be helpful here.
Yes. This is to prevent the load-balancer from moving tasks between hmp_domains. This will be done exclusively by SCHED_HMP instead to implement a strict task migration policy and avoid changing the load-balancer behaviour. The load-balancer will take care of load-balacing within each hmp_domain.
Tasks placement takes place every time a task is to be inserted into a runqueue based on its load history. The task placement decision is based on load thresholds.
There are no restrictions on the number of hmp_domains, however, multiple (>2) has not been tested and the up/down migration policy is rather simple.
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com
arch/arm/Kconfig | 17 +++++ include/linux/sched.h | 6 ++ kernel/sched/fair.c | 168 +++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 6 ++ 4 files changed, 197 insertions(+)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index f4a5d58..5b09684 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1554,6 +1554,23 @@ config SCHED_SMT MultiThreading at a cost of slightly increased overhead in some places. If unsure say N here.
+config DISABLE_CPU_SCHED_DOMAIN_BALANCE
bool "(EXPERIMENTAL) Disable CPU level scheduler load-balancing"
help
Disables scheduler load-balancing at CPU sched domain level.
Shouldn't this depend on EXPERIMENTAL?
It should. The ongoing discussion about CONFIG_EXPERIMENTAL that Amit is referring to hasn't come to a conclusion yet.
+config SCHED_HMP
bool "(EXPERIMENTAL) Heterogenous multiprocessor scheduling"
ditto.
depends on DISABLE_CPU_SCHED_DOMAIN_BALANCE && SCHED_MC && FAIR_GROUP_SCHED && !SCHED_AUTOGROUP
help
Experimental scheduler optimizations for heterogeneous platforms.
Attempts to introspectively select task affinity to optimize power
and performance. Basic support for multiple (>2) cpu types is in place,
but it has only been tested with two types of cpus.
There is currently no support for migration of task groups, hence
!SCHED_AUTOGROUP. Furthermore, normal load-balancing must be disabled
between cpus of different type (DISABLE_CPU_SCHED_DOMAIN_BALANCE).
config HAVE_ARM_SCU bool help diff --git a/include/linux/sched.h b/include/linux/sched.h index 81e4e82..df971a3 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1039,6 +1039,12 @@ unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu);
bool cpus_share_cache(int this_cpu, int that_cpu);
+#ifdef CONFIG_SCHED_HMP +struct hmp_domain {
struct cpumask cpus;
struct list_head hmp_domains;
Probably need a better name here. domain_list?
Yes. hmp_domain_list would be better and stick with the hmp_* naming convention.
+}; +#endif /* CONFIG_SCHED_HMP */ #else /* CONFIG_SMP */
struct sched_domain_attr; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3e17dd5..d80de46 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3077,6 +3077,125 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }
+#ifdef CONFIG_SCHED_HMP +/*
- Heterogenous multiprocessor (HMP) optimizations
- The cpu types are distinguished using a list of hmp_domains
- which each represent one cpu type using a cpumask.
- The list is assumed ordered by compute capacity with the
- fastest domain first.
- */
+DEFINE_PER_CPU(struct hmp_domain *, hmp_cpu_domain);
+extern void __init arch_get_hmp_domains(struct list_head *hmp_domains_list);
+/* Setup hmp_domains */ +static int __init hmp_cpu_mask_setup(void)
How should we interpret its return value? Can you mention what does 0 & 1 mean here?
Returns 0 if domain setup failed, i.e. the domain list is empty, and 1 otherwise.
+{
char buf[64];
struct hmp_domain *domain;
struct list_head *pos;
int dc, cpu;
pr_debug("Initializing HMP scheduler:\n");
/* Initialize hmp_domains using platform code */
arch_get_hmp_domains(&hmp_domains);
if (list_empty(&hmp_domains)) {
pr_debug("HMP domain list is empty!\n");
return 0;
}
/* Print hmp_domains */
dc = 0;
Should be done during definition of dc.
list_for_each(pos, &hmp_domains) {
domain = list_entry(pos, struct hmp_domain, hmp_domains);
cpulist_scnprintf(buf, 64, &domain->cpus);
pr_debug(" HMP domain %d: %s\n", dc, buf);
Spaces before HMP are intentional?
Yes. It makes the boot log easier to read when the hmp_domain listing is indented.
for_each_cpu_mask(cpu, domain->cpus) {
per_cpu(hmp_cpu_domain, cpu) = domain;
}
Should use hmp_cpu_domain(cpu) here. Also no need of {} for single line loop.
dc++;
You aren't using it... Only for testing? Should we remove it from mainline patchset and keep it locally?
I'm using it in the pr_debug line a little earlier. It is used for enumerating the hmp_domains.
}
return 1;
+}
+/*
- Migration thresholds should be in the range [0..1023]
- hmp_up_threshold: min. load required for migrating tasks to a faster cpu
- hmp_down_threshold: max. load allowed for tasks migrating to a slower cpu
- The default values (512, 256) offer good responsiveness, but may need
- tweaking suit particular needs.
- */
+unsigned int hmp_up_threshold = 512; +unsigned int hmp_down_threshold = 256;
For default values, it is fine. But still we should get user preferred values via DT or CONFIG_*.
Yes, but for now getting the scheduler to do the right thing has higher priority than proper integration with DT.
+static unsigned int hmp_up_migration(int cpu, struct sched_entity *se); +static unsigned int hmp_down_migration(int cpu, struct sched_entity *se);
+/* Check if cpu is in fastest hmp_domain */ +static inline unsigned int hmp_cpu_is_fastest(int cpu) +{
struct list_head *pos;
pos = &hmp_cpu_domain(cpu)->hmp_domains;
return pos == hmp_domains.next;
better create list_is_first() for this.
I had the same thought, but I see that as a separate patch that should be submitted separately.
+}
+/* Check if cpu is in slowest hmp_domain */ +static inline unsigned int hmp_cpu_is_slowest(int cpu) +{
struct list_head *pos;
pos = &hmp_cpu_domain(cpu)->hmp_domains;
return list_is_last(pos, &hmp_domains);
+}
+/* Next (slower) hmp_domain relative to cpu */ +static inline struct hmp_domain *hmp_slower_domain(int cpu) +{
struct list_head *pos;
pos = &hmp_cpu_domain(cpu)->hmp_domains;
return list_entry(pos->next, struct hmp_domain, hmp_domains);
+}
+/* Previous (faster) hmp_domain relative to cpu */ +static inline struct hmp_domain *hmp_faster_domain(int cpu) +{
struct list_head *pos;
pos = &hmp_cpu_domain(cpu)->hmp_domains;
return list_entry(pos->prev, struct hmp_domain, hmp_domains);
+}
For all four routines, first two lines of body can be merged. If u wish :)
I have kept these helper functions fairly generic on purpose. It might be necessary for multi-domain platforms (>2) to modify these functions to implement better multi-domain task migration policies. I don't know any such platform, so for know these functions are very simple.
+/*
- Selects a cpu in previous (faster) hmp_domain
- Note that cpumask_any_and() returns the first cpu in the cpumask
- */
+static inline unsigned int hmp_select_faster_cpu(struct task_struct *tsk,
int cpu)
+{
return cpumask_any_and(&hmp_faster_domain(cpu)->cpus,
tsk_cpus_allowed(tsk));
+}
+/*
- Selects a cpu in next (slower) hmp_domain
- Note that cpumask_any_and() returns the first cpu in the cpumask
- */
+static inline unsigned int hmp_select_slower_cpu(struct task_struct *tsk,
int cpu)
+{
return cpumask_any_and(&hmp_slower_domain(cpu)->cpus,
tsk_cpus_allowed(tsk));
+}
+#endif /* CONFIG_SCHED_HMP */
/*
- sched_balance_self: balance the current task (running on cpu) in domains
- that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -3203,6 +3322,16 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) unlock: rcu_read_unlock();
+#ifdef CONFIG_SCHED_HMP
if (hmp_up_migration(prev_cpu, &p->se))
return hmp_select_faster_cpu(p, prev_cpu);
if (hmp_down_migration(prev_cpu, &p->se))
return hmp_select_slower_cpu(p, prev_cpu);
/* Make sure that the task stays in its previous hmp domain */
if (!cpumask_test_cpu(new_cpu, &hmp_cpu_domain(prev_cpu)->cpus))
Why is this tested?
I don't think it is needed. It is there as an extra guarantee that select_task_rq_fair() doesn't pick a cpu outside the task's current hmp_domain in cases where there is no up or down migration. Disabling SD_LOAD_BALANCE for the appropriate domains should give that guarantee. I just haven't completely convinced myself yet.
return prev_cpu;
+#endif
return new_cpu;
}
@@ -5354,6 +5483,41 @@ need_kick: static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { } #endif
+#ifdef CONFIG_SCHED_HMP +/* Check if task should migrate to a faster cpu */ +static unsigned int hmp_up_migration(int cpu, struct sched_entity *se) +{
struct task_struct *p = task_of(se);
if (hmp_cpu_is_fastest(cpu))
return 0;
if (cpumask_intersects(&hmp_faster_domain(cpu)->cpus,
tsk_cpus_allowed(p))
&& se->avg.load_avg_ratio > hmp_up_threshold) {
return 1;
}
I know all these comparisons are not very costly, still i would prefer
se->avg.load_avg_ratio > hmp_up_threshold
as the first comparison in this routine.
We should check first, if the task needs migration or not. Rather than checking if it can migrate to other cpus or not.
Agree.
return 0;
+}
+/* Check if task should migrate to a slower cpu */ +static unsigned int hmp_down_migration(int cpu, struct sched_entity *se) +{
struct task_struct *p = task_of(se);
if (hmp_cpu_is_slowest(cpu))
return 0;
if (cpumask_intersects(&hmp_slower_domain(cpu)->cpus,
tsk_cpus_allowed(p))
&& se->avg.load_avg_ratio < hmp_down_threshold) {
return 1;
}
same here.
Agree.
return 0;
+}
+#endif /* CONFIG_SCHED_HMP */
/*
- run_rebalance_domains is triggered when needed from the scheduler tick.
- Also triggered for nohz idle balancing (with nohz_balancing_kick set).
@@ -5861,6 +6025,10 @@ __init void init_sched_fair_class(void) zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT); cpu_notifier(sched_ilb_notifier, 0); #endif
+#ifdef CONFIG_SCHED_HMP
hmp_cpu_mask_setup();
Should we check the return value? If not required then should we make fn() declaration return void?
It can be changed to void if we don't add any error handling anyway.
+#endif #endif /* SMP */
} diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 81135f9..4990d9e 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -547,6 +547,12 @@ DECLARE_PER_CPU(int, sd_llc_id);
extern int group_balance_cpu(struct sched_group *sg);
+#ifdef CONFIG_SCHED_HMP +static LIST_HEAD(hmp_domains); +DECLARE_PER_CPU(struct hmp_domain *, hmp_cpu_domain); +#define hmp_cpu_domain(cpu) (per_cpu(hmp_cpu_domain, (cpu)))
can drop "()" around per_cpu().
Both, per_cpu variable and macro to get it, have the same name. Can we try giving them better names. Or atleast add an "_" before per_cpu pointers name?
Yes.
-- viresh