From: Morten Rasmussen morten.rasmussen@arm.com
Hi Paul, Paul, Peter, Suresh, linaro-sched-sig, and LKML,
As a follow-up on my Linux Plumbers Conference talk about my experiments with scheduling on heterogeneous systems I'm posting a proof-of-concept patch set with my modifications. The intention behind the modifications is to tweak scheduling behaviour to only use fast (and power hungry) cores when it is necessary and also improve performance consistency. Without the modifications it is more or less random where tasks are scheduled and so is the execution time.
I'm seeing good improvements in performance consistency for web browsing on Android using Bbench http://www.gem5.org/Bbench on the ARM big.LITTLE TC2 chip, which has two fast cores (Cortex-A15) and three power-efficient cores (Cortex-A7). The total execution time numbers below are for Android's SurfaceFlinger process, which is key for page rendering performance. The average execution time is lower with the patches enabled and the standard deviation is much smaller. Similar improvements can be seen for the Android.Browser and WebViewCoreThread processes.
Total execution time statistics based on 50 runs.
SurfaceFlinger    SMP kernel [s]    HMP modifications [s]
---------------------------------------------------------
Average               14.617             11.012
St. Dev.               4.577              0.902
10% Pctl.              9.343             10.783
90% Pctl.             18.743             11.695
Unfortunately, I cannot share power-efficiency numbers at this stage.
This patch set introduces proof-of-concept scheduler modifications which attempt to improve scheduling decisions on heterogeneous multi-processor systems (HMP) such as ARM big.LITTLE systems. The patch set relies on the entity load-tracking re-work patch set by Paul Turner:
https://lkml.org/lkml/2012/8/23/267
The modifications attempt to migrate tasks between cores with different compute capacity depending on the tracked load and priority. The aim is to only use fast cores for tasks which really need the extra performance and thereby improve power consumption by running everything else on the slow cores.
The patch set introduces hmp_domains to represent the different types of cores that are available on the given platform. More than two hmp_domains are supported, but this has not been tested. hmp_domains must be set up by platform code, and the patch set includes patches for ARM platforms using device-tree.
The patches intentionally try to avoid modifying the existing code paths as much as possible. The aim is to experiment with HMP scheduling and get the overall policy right before integrating it properly with the existing load-balancer.
Morten
Morten Rasmussen (10):
  sched: entity load-tracking load_avg_ratio
  sched: Task placement for heterogeneous systems based on task load-tracking
  sched: Forced task migration on heterogeneous systems
  sched: Introduce priority-based task migration filter
  ARM: Add HMP scheduling support for ARM architecture
  ARM: sched: Use device-tree to provide fast/slow CPU list for HMP
  ARM: sched: Setup SCHED_HMP domains
  sched: Add ftrace events for entity load-tracking
  sched: Add HMP task migration ftrace event
  sched: SCHED_HMP multi-domain task migration control
 arch/arm/Kconfig                |   46 +++++
 arch/arm/include/asm/topology.h |   32 +++
 arch/arm/kernel/topology.c      |   91 ++++++++
 include/linux/sched.h           |   11 +
 include/trace/events/sched.h    |  153 ++++++++++++++
 kernel/sched/core.c             |    4 +
 kernel/sched/fair.c             |  434 ++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h            |    9 +
 8 files changed, 779 insertions(+), 1 deletion(-)
From: Morten Rasmussen morten.rasmussen@arm.com
This patch adds load_avg_ratio to each task. The load_avg_ratio is a variant of load_avg_contrib which is not scaled by the task priority. It is calculated like this:
runnable_avg_sum * NICE_0_LOAD / (runnable_avg_period + 1).
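As an illustration only, the formula above can be tried out in userspace with a sketch like the following. It assumes NICE_0_LOAD is 1024 (its value when no extra load resolution is configured) and uses plain integer arguments in place of the sched_avg fields; the helper name is hypothetical and not part of the patch.

```c
#include <stdint.h>

/* Assumption: NICE_0_LOAD is 1024, i.e. no extra load resolution scaling. */
#define NICE_0_LOAD 1024UL

/*
 * Userspace sketch of the load_avg_ratio formula. The two arguments stand in
 * for the runnable_avg_sum and runnable_avg_period fields of sched_avg.
 * Unlike load_avg_contrib, the result does not depend on the task priority.
 */
static unsigned long load_avg_ratio(uint32_t runnable_avg_sum,
				    uint32_t runnable_avg_period)
{
	uint64_t ratio = (uint64_t)runnable_avg_sum * NICE_0_LOAD;

	/* The "+ 1" keeps the division well defined when the period is 0. */
	ratio /= (uint64_t)runnable_avg_period + 1;
	return (unsigned long)ratio;
}
```

With these assumptions a task that has been runnable for its whole tracked period ends up just below 1024, and a task runnable about half of the time ends up around 512.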
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com
---
 include/linux/sched.h |    1 +
 kernel/sched/fair.c   |    3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4dc4990..81e4e82 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1151,6 +1151,7 @@ struct sched_avg {
 	u64 last_runnable_update;
 	s64 decay_count;
 	unsigned long load_avg_contrib;
+	unsigned long load_avg_ratio;
 	u32 usage_avg_sum;
 };

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 095d86c..3e17dd5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1192,6 +1192,9 @@ static inline void __update_task_entity_contrib(struct sched_entity *se)
 	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
 	contrib /= (se->avg.runnable_avg_period + 1);
 	se->avg.load_avg_contrib = scale_load(contrib);
+	contrib = se->avg.runnable_avg_sum * scale_load_down(NICE_0_LOAD);
+	contrib /= (se->avg.runnable_avg_period + 1);
+	se->avg.load_avg_ratio = scale_load(contrib);
 }

 /* Compute the current contribution to load_avg by se, return any delta */
From: Morten Rasmussen morten.rasmussen@arm.com
This patch introduces the basic SCHED_HMP infrastructure. Each class of cpus is represented by a hmp_domain and tasks will only be moved between these domains when their load profiles suggest it is beneficial.
SCHED_HMP relies heavily on the task load-tracking introduced in Paul Turner's fair group scheduling patch set:
https://lkml.org/lkml/2012/8/23/267
SCHED_HMP requires that the platform implements arch_get_hmp_domains(), which should set up the platform-specific list of hmp_domains. It is also assumed that the platform disables SD_LOAD_BALANCE for the appropriate sched_domains.
Task placement takes place every time a task is to be inserted into a runqueue and is based on the task's load history. The placement decision is made by comparing the tracked load against load thresholds.
There are no restrictions on the number of hmp_domains; however, more than two domains have not been tested and the up/down migration policy is rather simple.
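The threshold-based placement described above can be sketched in plain C. This is only an illustration under stated assumptions: the enum and function names are hypothetical, the task affinity check done by the patch is left out, and the thresholds are the default values from the patch (512 up, 256 down).

```c
#include <stdbool.h>

/* Default thresholds from the patch, on the [0..1023] load scale. */
#define HMP_UP_THRESHOLD   512UL
#define HMP_DOWN_THRESHOLD 256UL

/* Hypothetical result type, used only in this sketch. */
enum hmp_move { HMP_STAY, HMP_UP, HMP_DOWN };

/*
 * Decide where a task should go given its tracked load and the position of
 * its current cpu in the (fastest first) domain list. Loads between the two
 * thresholds leave the task in its current domain.
 */
static enum hmp_move hmp_placement(unsigned long load_avg_ratio,
				   bool on_fastest, bool on_slowest)
{
	if (!on_fastest && load_avg_ratio > HMP_UP_THRESHOLD)
		return HMP_UP;
	if (!on_slowest && load_avg_ratio < HMP_DOWN_THRESHOLD)
		return HMP_DOWN;
	return HMP_STAY;
}
```

The gap between the two thresholds gives some hysteresis: a task whose load hovers around one threshold is not bounced back and forth on every wake-up.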
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com
---
 arch/arm/Kconfig      |   17 +++++
 include/linux/sched.h |    6 ++
 kernel/sched/fair.c   |  168 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  |    6 ++
 4 files changed, 197 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index f4a5d58..5b09684 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1554,6 +1554,23 @@ config SCHED_SMT
 	  MultiThreading at a cost of slightly increased overhead in some
 	  places. If unsure say N here.

+config DISABLE_CPU_SCHED_DOMAIN_BALANCE
+	bool "(EXPERIMENTAL) Disable CPU level scheduler load-balancing"
+	help
+	  Disables scheduler load-balancing at CPU sched domain level.
+
+config SCHED_HMP
+	bool "(EXPERIMENTAL) Heterogenous multiprocessor scheduling"
+	depends on DISABLE_CPU_SCHED_DOMAIN_BALANCE && SCHED_MC && FAIR_GROUP_SCHED && !SCHED_AUTOGROUP
+	help
+	  Experimental scheduler optimizations for heterogeneous platforms.
+	  Attempts to introspectively select task affinity to optimize power
+	  and performance. Basic support for multiple (>2) cpu types is in place,
+	  but it has only been tested with two types of cpus.
+	  There is currently no support for migration of task groups, hence
+	  !SCHED_AUTOGROUP. Furthermore, normal load-balancing must be disabled
+	  between cpus of different type (DISABLE_CPU_SCHED_DOMAIN_BALANCE).
+
 config HAVE_ARM_SCU
 	bool
 	help
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 81e4e82..df971a3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1039,6 +1039,12 @@ unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu);

 bool cpus_share_cache(int this_cpu, int that_cpu);

+#ifdef CONFIG_SCHED_HMP
+struct hmp_domain {
+	struct cpumask cpus;
+	struct list_head hmp_domains;
+};
+#endif /* CONFIG_SCHED_HMP */
 #else /* CONFIG_SMP */

 struct sched_domain_attr;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e17dd5..d80de46 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3077,6 +3077,125 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	return target;
 }

+#ifdef CONFIG_SCHED_HMP
+/*
+ * Heterogenous multiprocessor (HMP) optimizations
+ *
+ * The cpu types are distinguished using a list of hmp_domains
+ * which each represent one cpu type using a cpumask.
+ * The list is assumed ordered by compute capacity with the
+ * fastest domain first.
+ */
+DEFINE_PER_CPU(struct hmp_domain *, hmp_cpu_domain);
+
+extern void __init arch_get_hmp_domains(struct list_head *hmp_domains_list);
+
+/* Setup hmp_domains */
+static int __init hmp_cpu_mask_setup(void)
+{
+	char buf[64];
+	struct hmp_domain *domain;
+	struct list_head *pos;
+	int dc, cpu;
+
+	pr_debug("Initializing HMP scheduler:\n");
+
+	/* Initialize hmp_domains using platform code */
+	arch_get_hmp_domains(&hmp_domains);
+	if (list_empty(&hmp_domains)) {
+		pr_debug("HMP domain list is empty!\n");
+		return 0;
+	}
+
+	/* Print hmp_domains */
+	dc = 0;
+	list_for_each(pos, &hmp_domains) {
+		domain = list_entry(pos, struct hmp_domain, hmp_domains);
+		cpulist_scnprintf(buf, 64, &domain->cpus);
+		pr_debug(" HMP domain %d: %s\n", dc, buf);
+
+		for_each_cpu_mask(cpu, domain->cpus) {
+			per_cpu(hmp_cpu_domain, cpu) = domain;
+		}
+		dc++;
+	}
+
+	return 1;
+}
+
+/*
+ * Migration thresholds should be in the range [0..1023]
+ * hmp_up_threshold: min. load required for migrating tasks to a faster cpu
+ * hmp_down_threshold: max. load allowed for tasks migrating to a slower cpu
+ * The default values (512, 256) offer good responsiveness, but may need
+ * tweaking suit particular needs.
+ */
+unsigned int hmp_up_threshold = 512;
+unsigned int hmp_down_threshold = 256;
+
+static unsigned int hmp_up_migration(int cpu, struct sched_entity *se);
+static unsigned int hmp_down_migration(int cpu, struct sched_entity *se);
+
+/* Check if cpu is in fastest hmp_domain */
+static inline unsigned int hmp_cpu_is_fastest(int cpu)
+{
+	struct list_head *pos;
+
+	pos = &hmp_cpu_domain(cpu)->hmp_domains;
+	return pos == hmp_domains.next;
+}
+
+/* Check if cpu is in slowest hmp_domain */
+static inline unsigned int hmp_cpu_is_slowest(int cpu)
+{
+	struct list_head *pos;
+
+	pos = &hmp_cpu_domain(cpu)->hmp_domains;
+	return list_is_last(pos, &hmp_domains);
+}
+
+/* Next (slower) hmp_domain relative to cpu */
+static inline struct hmp_domain *hmp_slower_domain(int cpu)
+{
+	struct list_head *pos;
+
+	pos = &hmp_cpu_domain(cpu)->hmp_domains;
+	return list_entry(pos->next, struct hmp_domain, hmp_domains);
+}
+
+/* Previous (faster) hmp_domain relative to cpu */
+static inline struct hmp_domain *hmp_faster_domain(int cpu)
+{
+	struct list_head *pos;
+
+	pos = &hmp_cpu_domain(cpu)->hmp_domains;
+	return list_entry(pos->prev, struct hmp_domain, hmp_domains);
+}
+
+/*
+ * Selects a cpu in previous (faster) hmp_domain
+ * Note that cpumask_any_and() returns the first cpu in the cpumask
+ */
+static inline unsigned int hmp_select_faster_cpu(struct task_struct *tsk,
+							int cpu)
+{
+	return cpumask_any_and(&hmp_faster_domain(cpu)->cpus,
+				tsk_cpus_allowed(tsk));
+}
+
+/*
+ * Selects a cpu in next (slower) hmp_domain
+ * Note that cpumask_any_and() returns the first cpu in the cpumask
+ */
+static inline unsigned int hmp_select_slower_cpu(struct task_struct *tsk,
+							int cpu)
+{
+	return cpumask_any_and(&hmp_slower_domain(cpu)->cpus,
+				tsk_cpus_allowed(tsk));
+}
+
+#endif /* CONFIG_SCHED_HMP */
+
 /*
  * sched_balance_self: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -3203,6 +3322,16 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 unlock:
 	rcu_read_unlock();

+#ifdef CONFIG_SCHED_HMP
+	if (hmp_up_migration(prev_cpu, &p->se))
+		return hmp_select_faster_cpu(p, prev_cpu);
+	if (hmp_down_migration(prev_cpu, &p->se))
+		return hmp_select_slower_cpu(p, prev_cpu);
+	/* Make sure that the task stays in its previous hmp domain */
+	if (!cpumask_test_cpu(new_cpu, &hmp_cpu_domain(prev_cpu)->cpus))
+		return prev_cpu;
+#endif
+
 	return new_cpu;
 }

@@ -5354,6 +5483,41 @@ need_kick:
 static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { }
 #endif

+#ifdef CONFIG_SCHED_HMP
+/* Check if task should migrate to a faster cpu */
+static unsigned int hmp_up_migration(int cpu, struct sched_entity *se)
+{
+	struct task_struct *p = task_of(se);
+
+	if (hmp_cpu_is_fastest(cpu))
+		return 0;
+
+	if (cpumask_intersects(&hmp_faster_domain(cpu)->cpus,
+					tsk_cpus_allowed(p))
+		&& se->avg.load_avg_ratio > hmp_up_threshold) {
+		return 1;
+	}
+	return 0;
+}
+
+/* Check if task should migrate to a slower cpu */
+static unsigned int hmp_down_migration(int cpu, struct sched_entity *se)
+{
+	struct task_struct *p = task_of(se);
+
+	if (hmp_cpu_is_slowest(cpu))
+		return 0;
+
+	if (cpumask_intersects(&hmp_slower_domain(cpu)->cpus,
+					tsk_cpus_allowed(p))
+		&& se->avg.load_avg_ratio < hmp_down_threshold) {
+		return 1;
+	}
+	return 0;
+}
+
+#endif /* CONFIG_SCHED_HMP */
+
 /*
  * run_rebalance_domains is triggered when needed from the scheduler tick.
  * Also triggered for nohz idle balancing (with nohz_balancing_kick set).
@@ -5861,6 +6025,10 @@ __init void init_sched_fair_class(void)
 	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
 	cpu_notifier(sched_ilb_notifier, 0);
 #endif
+
+#ifdef CONFIG_SCHED_HMP
+	hmp_cpu_mask_setup();
+#endif
 #endif /* SMP */

 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 81135f9..4990d9e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -547,6 +547,12 @@ DECLARE_PER_CPU(int, sd_llc_id);

 extern int group_balance_cpu(struct sched_group *sg);

+#ifdef CONFIG_SCHED_HMP
+static LIST_HEAD(hmp_domains);
+DECLARE_PER_CPU(struct hmp_domain *, hmp_cpu_domain);
+#define hmp_cpu_domain(cpu)	(per_cpu(hmp_cpu_domain, (cpu)))
+#endif /* CONFIG_SCHED_HMP */
+
 #endif /* CONFIG_SMP */

 #include "stats.h"
Hi Morten,
On 22 September 2012 00:02, morten.rasmussen@arm.com wrote:
From: Morten Rasmussen morten.rasmussen@arm.com
This patch introduces the basic SCHED_HMP infrastructure. Each class of cpus is represented by a hmp_domain and tasks will only be moved between these domains when their load profiles suggest it is beneficial.
SCHED_HMP relies heavily on the task load-tracking introduced in Paul Turners fair group scheduling patch set:
https://lkml.org/lkml/2012/8/23/267
SCHED_HMP requires that the platform implements arch_get_hmp_domains() which should set up the platform specific list of hmp_domains. It is also assumed that the platform disables SD_LOAD_BALANCE for the appropriate sched_domains.
An explanation of this requirement would be helpful here.
Tasks placement takes place every time a task is to be inserted into a runqueue based on its load history. The task placement decision is based on load thresholds.
There are no restrictions on the number of hmp_domains, however, multiple (>2) has not been tested and the up/down migration policy is rather simple.
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com
arch/arm/Kconfig | 17 +++++ include/linux/sched.h | 6 ++ kernel/sched/fair.c | 168 +++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 6 ++ 4 files changed, 197 insertions(+)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index f4a5d58..5b09684 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1554,6 +1554,23 @@ config SCHED_SMT MultiThreading at a cost of slightly increased overhead in some places. If unsure say N here.
+config DISABLE_CPU_SCHED_DOMAIN_BALANCE
bool "(EXPERIMENTAL) Disable CPU level scheduler load-balancing"
help
Disables scheduler load-balancing at CPU sched domain level.
Shouldn't this depend on EXPERIMENTAL?
+config SCHED_HMP
bool "(EXPERIMENTAL) Heterogenous multiprocessor scheduling"
ditto.
depends on DISABLE_CPU_SCHED_DOMAIN_BALANCE && SCHED_MC && FAIR_GROUP_SCHED && !SCHED_AUTOGROUP
help
Experimental scheduler optimizations for heterogeneous platforms.
Attempts to introspectively select task affinity to optimize power
and performance. Basic support for multiple (>2) cpu types is in place,
but it has only been tested with two types of cpus.
There is currently no support for migration of task groups, hence
!SCHED_AUTOGROUP. Furthermore, normal load-balancing must be disabled
between cpus of different type (DISABLE_CPU_SCHED_DOMAIN_BALANCE).
config HAVE_ARM_SCU bool help diff --git a/include/linux/sched.h b/include/linux/sched.h index 81e4e82..df971a3 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1039,6 +1039,12 @@ unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu);
bool cpus_share_cache(int this_cpu, int that_cpu);
+#ifdef CONFIG_SCHED_HMP +struct hmp_domain {
struct cpumask cpus;
struct list_head hmp_domains;
Probably need a better name here. domain_list?
+}; +#endif /* CONFIG_SCHED_HMP */ #else /* CONFIG_SMP */
struct sched_domain_attr; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3e17dd5..d80de46 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3077,6 +3077,125 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }
+#ifdef CONFIG_SCHED_HMP +/*
- Heterogenous multiprocessor (HMP) optimizations
- The cpu types are distinguished using a list of hmp_domains
- which each represent one cpu type using a cpumask.
- The list is assumed ordered by compute capacity with the
- fastest domain first.
- */
+DEFINE_PER_CPU(struct hmp_domain *, hmp_cpu_domain);
+extern void __init arch_get_hmp_domains(struct list_head *hmp_domains_list);
+/* Setup hmp_domains */ +static int __init hmp_cpu_mask_setup(void)
How should we interpret its return value? Can you mention what 0 and 1 mean here?
+{
char buf[64];
struct hmp_domain *domain;
struct list_head *pos;
int dc, cpu;
pr_debug("Initializing HMP scheduler:\n");
/* Initialize hmp_domains using platform code */
arch_get_hmp_domains(&hmp_domains);
if (list_empty(&hmp_domains)) {
pr_debug("HMP domain list is empty!\n");
return 0;
}
/* Print hmp_domains */
dc = 0;
Should be done during definition of dc.
list_for_each(pos, &hmp_domains) {
domain = list_entry(pos, struct hmp_domain, hmp_domains);
cpulist_scnprintf(buf, 64, &domain->cpus);
pr_debug(" HMP domain %d: %s\n", dc, buf);
Spaces before HMP are intentional?
for_each_cpu_mask(cpu, domain->cpus) {
per_cpu(hmp_cpu_domain, cpu) = domain;
}
Should use hmp_cpu_domain(cpu) here. Also no need of {} for single line loop.
dc++;
You aren't using it... Only for testing? Should we remove it from mainline patchset and keep it locally?
}
return 1;
+}
+/*
- Migration thresholds should be in the range [0..1023]
- hmp_up_threshold: min. load required for migrating tasks to a faster cpu
- hmp_down_threshold: max. load allowed for tasks migrating to a slower cpu
- The default values (512, 256) offer good responsiveness, but may need
- tweaking suit particular needs.
- */
+unsigned int hmp_up_threshold = 512; +unsigned int hmp_down_threshold = 256;
For default values, it is fine. But still we should get user preferred values via DT or CONFIG_*.
+static unsigned int hmp_up_migration(int cpu, struct sched_entity *se); +static unsigned int hmp_down_migration(int cpu, struct sched_entity *se);
+/* Check if cpu is in fastest hmp_domain */ +static inline unsigned int hmp_cpu_is_fastest(int cpu) +{
struct list_head *pos;
pos = &hmp_cpu_domain(cpu)->hmp_domains;
return pos == hmp_domains.next;
better create list_is_first() for this.
+}
+/* Check if cpu is in slowest hmp_domain */ +static inline unsigned int hmp_cpu_is_slowest(int cpu) +{
struct list_head *pos;
pos = &hmp_cpu_domain(cpu)->hmp_domains;
return list_is_last(pos, &hmp_domains);
+}
+/* Next (slower) hmp_domain relative to cpu */ +static inline struct hmp_domain *hmp_slower_domain(int cpu) +{
struct list_head *pos;
pos = &hmp_cpu_domain(cpu)->hmp_domains;
return list_entry(pos->next, struct hmp_domain, hmp_domains);
+}
+/* Previous (faster) hmp_domain relative to cpu */ +static inline struct hmp_domain *hmp_faster_domain(int cpu) +{
struct list_head *pos;
pos = &hmp_cpu_domain(cpu)->hmp_domains;
return list_entry(pos->prev, struct hmp_domain, hmp_domains);
+}
For all four routines, the first two lines of the body can be merged, if you wish :)
+/*
- Selects a cpu in previous (faster) hmp_domain
- Note that cpumask_any_and() returns the first cpu in the cpumask
- */
+static inline unsigned int hmp_select_faster_cpu(struct task_struct *tsk,
int cpu)
+{
return cpumask_any_and(&hmp_faster_domain(cpu)->cpus,
tsk_cpus_allowed(tsk));
+}
+/*
- Selects a cpu in next (slower) hmp_domain
- Note that cpumask_any_and() returns the first cpu in the cpumask
- */
+static inline unsigned int hmp_select_slower_cpu(struct task_struct *tsk,
int cpu)
+{
return cpumask_any_and(&hmp_slower_domain(cpu)->cpus,
tsk_cpus_allowed(tsk));
+}
+#endif /* CONFIG_SCHED_HMP */
/*
- sched_balance_self: balance the current task (running on cpu) in domains
- that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -3203,6 +3322,16 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) unlock: rcu_read_unlock();
+#ifdef CONFIG_SCHED_HMP
if (hmp_up_migration(prev_cpu, &p->se))
return hmp_select_faster_cpu(p, prev_cpu);
if (hmp_down_migration(prev_cpu, &p->se))
return hmp_select_slower_cpu(p, prev_cpu);
/* Make sure that the task stays in its previous hmp domain */
if (!cpumask_test_cpu(new_cpu, &hmp_cpu_domain(prev_cpu)->cpus))
Why is this tested?
return prev_cpu;
+#endif
return new_cpu;
}
@@ -5354,6 +5483,41 @@ need_kick: static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { } #endif
+#ifdef CONFIG_SCHED_HMP +/* Check if task should migrate to a faster cpu */ +static unsigned int hmp_up_migration(int cpu, struct sched_entity *se) +{
struct task_struct *p = task_of(se);
if (hmp_cpu_is_fastest(cpu))
return 0;
if (cpumask_intersects(&hmp_faster_domain(cpu)->cpus,
tsk_cpus_allowed(p))
&& se->avg.load_avg_ratio > hmp_up_threshold) {
return 1;
}
I know these comparisons are not very costly, but I would still prefer
se->avg.load_avg_ratio > hmp_up_threshold
as the first comparison in this routine.
We should first check whether the task needs to migrate at all, rather than whether it is able to migrate to other cpus.
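To make the suggested ordering concrete, here is a rough userspace sketch, not a proposed replacement for the patch. It assumes cpu sets are plain bitmasks rather than struct cpumask, and the function and parameter names are made up for illustration.

```c
/*
 * Sketch of the up-migration test with the load comparison first.
 * faster_cpus:  bitmask of cpus in the next faster domain (assumption)
 * allowed_cpus: bitmask of cpus the task may run on (assumption)
 */
static unsigned int hmp_up_migration_sketch(unsigned long load_avg_ratio,
					    unsigned long up_threshold,
					    int on_fastest_domain,
					    unsigned int faster_cpus,
					    unsigned int allowed_cpus)
{
	/* Does the task need more compute capacity at all? */
	if (load_avg_ratio <= up_threshold)
		return 0;

	/* Nowhere faster to go. */
	if (on_fastest_domain)
		return 0;

	/* Finally, is the task allowed on any cpu of the faster domain? */
	return (faster_cpus & allowed_cpus) != 0;
}
```

The result is the same as with the original ordering; only the order in which the conditions are evaluated changes.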
return 0;
+}
+/* Check if task should migrate to a slower cpu */ +static unsigned int hmp_down_migration(int cpu, struct sched_entity *se) +{
struct task_struct *p = task_of(se);
if (hmp_cpu_is_slowest(cpu))
return 0;
if (cpumask_intersects(&hmp_slower_domain(cpu)->cpus,
tsk_cpus_allowed(p))
&& se->avg.load_avg_ratio < hmp_down_threshold) {
return 1;
}
same here.
return 0;
+}
+#endif /* CONFIG_SCHED_HMP */
/*
- run_rebalance_domains is triggered when needed from the scheduler tick.
- Also triggered for nohz idle balancing (with nohz_balancing_kick set).
@@ -5861,6 +6025,10 @@ __init void init_sched_fair_class(void) zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT); cpu_notifier(sched_ilb_notifier, 0); #endif
+#ifdef CONFIG_SCHED_HMP
hmp_cpu_mask_setup();
Should we check the return value? If not required then should we make fn() declaration return void?
+#endif #endif /* SMP */
} diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 81135f9..4990d9e 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -547,6 +547,12 @@ DECLARE_PER_CPU(int, sd_llc_id);
extern int group_balance_cpu(struct sched_group *sg);
+#ifdef CONFIG_SCHED_HMP +static LIST_HEAD(hmp_domains); +DECLARE_PER_CPU(struct hmp_domain *, hmp_cpu_domain); +#define hmp_cpu_domain(cpu) (per_cpu(hmp_cpu_domain, (cpu)))
can drop "()" around per_cpu().
Both the per_cpu variable and the macro to get it have the same name. Can we try giving them better names, or at least add an "_" before the per_cpu pointer's name?
-- viresh
On Thu, Oct 4, 2012 at 11:32 AM, Viresh Kumar viresh.kumar@linaro.org wrote:
Hi Morten,
On 22 September 2012 00:02, morten.rasmussen@arm.com wrote:
From: Morten Rasmussen morten.rasmussen@arm.com
This patch introduces the basic SCHED_HMP infrastructure. Each class of cpus is represented by a hmp_domain and tasks will only be moved between these domains when their load profiles suggest it is beneficial.
SCHED_HMP relies heavily on the task load-tracking introduced in Paul Turners fair group scheduling patch set:
https://lkml.org/lkml/2012/8/23/267
SCHED_HMP requires that the platform implements arch_get_hmp_domains() which should set up the platform specific list of hmp_domains. It is also assumed that the platform disables SD_LOAD_BALANCE for the appropriate sched_domains.
An explanation of this requirement would be helpful here.
Tasks placement takes place every time a task is to be inserted into a runqueue based on its load history. The task placement decision is based on load thresholds.
There are no restrictions on the number of hmp_domains, however, multiple (>2) has not been tested and the up/down migration policy is rather simple.
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com
arch/arm/Kconfig | 17 +++++ include/linux/sched.h | 6 ++ kernel/sched/fair.c | 168 +++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 6 ++ 4 files changed, 197 insertions(+)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index f4a5d58..5b09684 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1554,6 +1554,23 @@ config SCHED_SMT MultiThreading at a cost of slightly increased overhead in some places. If unsure say N here.
+config DISABLE_CPU_SCHED_DOMAIN_BALANCE
bool "(EXPERIMENTAL) Disable CPU level scheduler load-balancing"
help
Disables scheduler load-balancing at CPU sched domain level.
Shouldn't this depend on EXPERIMENTAL?
EXPERIMENTAL might be on its way out: https://lkml.org/lkml/2012/10/2/398
Hi Viresh,
On Thu, Oct 04, 2012 at 07:02:03AM +0100, Viresh Kumar wrote:
Hi Morten,
On 22 September 2012 00:02, morten.rasmussen@arm.com wrote:
From: Morten Rasmussen morten.rasmussen@arm.com
This patch introduces the basic SCHED_HMP infrastructure. Each class of cpus is represented by a hmp_domain and tasks will only be moved between these domains when their load profiles suggest it is beneficial.
SCHED_HMP relies heavily on the task load-tracking introduced in Paul Turners fair group scheduling patch set:
https://lkml.org/lkml/2012/8/23/267
SCHED_HMP requires that the platform implements arch_get_hmp_domains() which should set up the platform specific list of hmp_domains. It is also assumed that the platform disables SD_LOAD_BALANCE for the appropriate sched_domains.
An explanation of this requirement would be helpful here.
Yes. This is to prevent the load-balancer from moving tasks between hmp_domains. Such migrations will be done exclusively by SCHED_HMP, which implements a strict task migration policy and avoids changing the load-balancer behaviour. The load-balancer will still take care of load-balancing within each hmp_domain.
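The same rule is what the last check added to select_task_rq_fair() enforces on the wake-up path: a cpu picked by the normal selection logic is only accepted if it is in the same hmp_domain as prev_cpu. A minimal userspace sketch of that check follows. The cpu numbering (cpus 0-1 fast, cpus 2-4 slow) and all names are assumptions made for the example, and bitmasks stand in for struct cpumask.

```c
#include <stddef.h>

/*
 * Hypothetical layout for illustration: cpus 0-1 form the fast domain and
 * cpus 2-4 the slow domain. On a real platform this would come from
 * arch_get_hmp_domains(); the numbering here is an assumption.
 */
static const unsigned int domain_cpus[] = { 0x03u, 0x1cu };

/* Return the cpu mask of the domain containing cpu, or 0 if there is none. */
static unsigned int domain_mask_of(int cpu)
{
	size_t i;

	for (i = 0; i < sizeof(domain_cpus) / sizeof(domain_cpus[0]); i++)
		if (domain_cpus[i] & (1u << cpu))
			return domain_cpus[i];
	return 0;
}

/*
 * Accept the proposed cpu only if it stays inside the domain of prev_cpu,
 * otherwise fall back to prev_cpu.
 */
static int keep_in_domain(int prev_cpu, int new_cpu)
{
	if (domain_mask_of(prev_cpu) & (1u << new_cpu))
		return new_cpu;
	return prev_cpu;
}
```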
Tasks placement takes place every time a task is to be inserted into a runqueue based on its load history. The task placement decision is based on load thresholds.
There are no restrictions on the number of hmp_domains, however, multiple (>2) has not been tested and the up/down migration policy is rather simple.
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com
arch/arm/Kconfig | 17 +++++ include/linux/sched.h | 6 ++ kernel/sched/fair.c | 168 +++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/sched.h | 6 ++ 4 files changed, 197 insertions(+)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index f4a5d58..5b09684 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1554,6 +1554,23 @@ config SCHED_SMT MultiThreading at a cost of slightly increased overhead in some places. If unsure say N here.
+config DISABLE_CPU_SCHED_DOMAIN_BALANCE
bool "(EXPERIMENTAL) Disable CPU level scheduler load-balancing"
help
Disables scheduler load-balancing at CPU sched domain level.
Shouldn't this depend on EXPERIMENTAL?
It should. The ongoing discussion about CONFIG_EXPERIMENTAL that Amit is referring to hasn't come to a conclusion yet.
+config SCHED_HMP
bool "(EXPERIMENTAL) Heterogenous multiprocessor scheduling"
ditto.
depends on DISABLE_CPU_SCHED_DOMAIN_BALANCE && SCHED_MC && FAIR_GROUP_SCHED && !SCHED_AUTOGROUP
help
Experimental scheduler optimizations for heterogeneous platforms.
Attempts to introspectively select task affinity to optimize power
and performance. Basic support for multiple (>2) cpu types is in place,
but it has only been tested with two types of cpus.
There is currently no support for migration of task groups, hence
!SCHED_AUTOGROUP. Furthermore, normal load-balancing must be disabled
between cpus of different type (DISABLE_CPU_SCHED_DOMAIN_BALANCE).
config HAVE_ARM_SCU bool help diff --git a/include/linux/sched.h b/include/linux/sched.h index 81e4e82..df971a3 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1039,6 +1039,12 @@ unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu);
bool cpus_share_cache(int this_cpu, int that_cpu);
+#ifdef CONFIG_SCHED_HMP +struct hmp_domain {
struct cpumask cpus;
struct list_head hmp_domains;
Probably need a better name here. domain_list?
Yes. hmp_domain_list would be better and stick with the hmp_* naming convention.
+}; +#endif /* CONFIG_SCHED_HMP */ #else /* CONFIG_SMP */
struct sched_domain_attr; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3e17dd5..d80de46 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3077,6 +3077,125 @@ static int select_idle_sibling(struct task_struct *p, int target) return target; }
+#ifdef CONFIG_SCHED_HMP +/*
- Heterogenous multiprocessor (HMP) optimizations
- The cpu types are distinguished using a list of hmp_domains
- which each represent one cpu type using a cpumask.
- The list is assumed ordered by compute capacity with the
- fastest domain first.
- */
+DEFINE_PER_CPU(struct hmp_domain *, hmp_cpu_domain);
+extern void __init arch_get_hmp_domains(struct list_head *hmp_domains_list);
+/* Setup hmp_domains */ +static int __init hmp_cpu_mask_setup(void)
How should we interpret its return value? Can you mention what does 0 & 1 mean here?
Returns 0 if domain setup failed, i.e. the domain list is empty, and 1 otherwise.
+{
char buf[64];
struct hmp_domain *domain;
struct list_head *pos;
int dc, cpu;
pr_debug("Initializing HMP scheduler:\n");
/* Initialize hmp_domains using platform code */
arch_get_hmp_domains(&hmp_domains);
if (list_empty(&hmp_domains)) {
pr_debug("HMP domain list is empty!\n");
return 0;
}
/* Print hmp_domains */
dc = 0;
Should be done during definition of dc.
list_for_each(pos, &hmp_domains) {
domain = list_entry(pos, struct hmp_domain, hmp_domains);
cpulist_scnprintf(buf, 64, &domain->cpus);
pr_debug(" HMP domain %d: %s\n", dc, buf);
Spaces before HMP are intentional?
Yes. It makes the boot log easier to read when the hmp_domain listing is indented.
for_each_cpu_mask(cpu, domain->cpus) {
per_cpu(hmp_cpu_domain, cpu) = domain;
}
Should use hmp_cpu_domain(cpu) here. Also no need of {} for single line loop.
dc++;
You aren't using it... Only for testing? Should we remove it from mainline patchset and keep it locally?
I'm using it in the pr_debug line a little earlier. It is used for enumerating the hmp_domains.
}
return 1;
+}
+/*
- Migration thresholds should be in the range [0..1023]
- hmp_up_threshold: min. load required for migrating tasks to a faster cpu
- hmp_down_threshold: max. load allowed for tasks migrating to a slower cpu
- The default values (512, 256) offer good responsiveness, but may need
- tweaking suit particular needs.
- */
+unsigned int hmp_up_threshold = 512; +unsigned int hmp_down_threshold = 256;
For default values, it is fine. But still we should get user preferred values via DT or CONFIG_*.
Yes, but for now getting the scheduler to do the right thing has higher priority than proper integration with DT.
+static unsigned int hmp_up_migration(int cpu, struct sched_entity *se); +static unsigned int hmp_down_migration(int cpu, struct sched_entity *se);
+/* Check if cpu is in fastest hmp_domain */ +static inline unsigned int hmp_cpu_is_fastest(int cpu) +{
struct list_head *pos;
pos = &hmp_cpu_domain(cpu)->hmp_domains;
return pos == hmp_domains.next;
better create list_is_first() for this.
I had the same thought, but I see that as a separate patch that should be submitted separately.
+}
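For reference, a list_is_first() helper along the lines suggested above could look like the following userspace sketch. It assumes a minimal stand-in for struct list_head; the sketch_* helpers are hypothetical and exist only to make the example self-contained. It simply mirrors list_is_last() and is not taken from the patch set.

```c
#include <stdbool.h>

/* Minimal stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical helper: make an empty list that points at itself. */
static void sketch_list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Hypothetical helper: append entry just before head (i.e. at the tail). */
static void sketch_list_add_tail(struct list_head *entry,
				 struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* An entry is first when the list head is its predecessor. */
static bool list_is_first(const struct list_head *list,
			  const struct list_head *head)
{
	return list->prev == head;
}

/* Counterpart used by hmp_cpu_is_slowest() in the patch. */
static bool list_is_last(const struct list_head *list,
			 const struct list_head *head)
{
	return list->next == head;
}
```

With such a helper, the check in hmp_cpu_is_fastest() could read list_is_first(pos, &hmp_domains), matching the style of hmp_cpu_is_slowest().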
+/* Check if cpu is in slowest hmp_domain */
+static inline unsigned int hmp_cpu_is_slowest(int cpu)
+{
struct list_head *pos;
pos = &hmp_cpu_domain(cpu)->hmp_domains;
return list_is_last(pos, &hmp_domains);
+}
+/* Next (slower) hmp_domain relative to cpu */
+static inline struct hmp_domain *hmp_slower_domain(int cpu)
+{
struct list_head *pos;
pos = &hmp_cpu_domain(cpu)->hmp_domains;
return list_entry(pos->next, struct hmp_domain, hmp_domains);
+}
+/* Previous (faster) hmp_domain relative to cpu */
+static inline struct hmp_domain *hmp_faster_domain(int cpu)
+{
struct list_head *pos;
pos = &hmp_cpu_domain(cpu)->hmp_domains;
return list_entry(pos->prev, struct hmp_domain, hmp_domains);
+}
For all four routines, the first two lines of the body can be merged, if you wish :)
I have kept these helper functions fairly generic on purpose. It might be necessary for multi-domain platforms (>2) to modify these functions to implement better multi-domain task migration policies. I don't know of any such platform, so for now these functions are very simple.
+/*
+ * Selects a cpu in previous (faster) hmp_domain
+ * Note that cpumask_any_and() returns the first cpu in the cpumask
+ */
+static inline unsigned int hmp_select_faster_cpu(struct task_struct *tsk,
int cpu)
+{
return cpumask_any_and(&hmp_faster_domain(cpu)->cpus,
tsk_cpus_allowed(tsk));
+}
+/*
+ * Selects a cpu in next (slower) hmp_domain
+ * Note that cpumask_any_and() returns the first cpu in the cpumask
+ */
+static inline unsigned int hmp_select_slower_cpu(struct task_struct *tsk,
int cpu)
+{
return cpumask_any_and(&hmp_slower_domain(cpu)->cpus,
tsk_cpus_allowed(tsk));
+}
+#endif /* CONFIG_SCHED_HMP */
/*
- sched_balance_self: balance the current task (running on cpu) in domains
- that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -3203,6 +3322,16 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags) unlock: rcu_read_unlock();
+#ifdef CONFIG_SCHED_HMP
if (hmp_up_migration(prev_cpu, &p->se))
return hmp_select_faster_cpu(p, prev_cpu);
if (hmp_down_migration(prev_cpu, &p->se))
return hmp_select_slower_cpu(p, prev_cpu);
/* Make sure that the task stays in its previous hmp domain */
if (!cpumask_test_cpu(new_cpu, &hmp_cpu_domain(prev_cpu)->cpus))
Why is this tested?
I don't think it is needed. It is there as an extra guarantee that select_task_rq_fair() doesn't pick a cpu outside the task's current hmp_domain in cases where there is no up or down migration. Disabling SD_LOAD_BALANCE for the appropriate domains should give that guarantee. I just haven't completely convinced myself yet.
return prev_cpu;
+#endif
return new_cpu;
}
@@ -5354,6 +5483,41 @@ need_kick: static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { } #endif
+#ifdef CONFIG_SCHED_HMP
+/* Check if task should migrate to a faster cpu */
+static unsigned int hmp_up_migration(int cpu, struct sched_entity *se)
+{
struct task_struct *p = task_of(se);
if (hmp_cpu_is_fastest(cpu))
return 0;
if (cpumask_intersects(&hmp_faster_domain(cpu)->cpus,
tsk_cpus_allowed(p))
&& se->avg.load_avg_ratio > hmp_up_threshold) {
return 1;
}
I know all these comparisons are not very costly, but I would still prefer
se->avg.load_avg_ratio > hmp_up_threshold
as the first comparison in this routine.
We should first check whether the task needs migration at all, rather than whether it can migrate to other cpus.
Agree.
return 0;
+}
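The reordering agreed on above can be sketched as follows. This is a userspace approximation under stated assumptions: cpumasks are modelled as plain bitmasks, and struct sketch_task and sketch_up_migration() are hypothetical names, not kernel API. The point is only the order of the tests: the cheap load comparison comes before the affinity intersection.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the fields of task_struct used here. */
struct sketch_task {
	unsigned int load_avg_ratio;	/* tracked load, range [0..1023] */
	unsigned long cpus_allowed;	/* affinity as a bitmask of cpu ids */
};

static unsigned int hmp_up_threshold = 512;

/*
 * Decide whether a task should move to a faster cpu.
 * on_fastest: the task already runs in the fastest domain.
 * faster_cpus: bitmask of the cpus in the next faster domain.
 */
static bool sketch_up_migration(const struct sketch_task *t, bool on_fastest,
				unsigned long faster_cpus)
{
	if (on_fastest)
		return false;

	/* First: does the task need more compute capacity at all? */
	if (t->load_avg_ratio <= hmp_up_threshold)
		return false;

	/* Only then: is it allowed to run on any faster cpu? */
	return (faster_cpus & t->cpus_allowed) != 0;
}
```

The result is the same as with the original ordering; only the common case of a lightly loaded task returns earlier.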
+/* Check if task should migrate to a slower cpu */
+static unsigned int hmp_down_migration(int cpu, struct sched_entity *se)
+{
struct task_struct *p = task_of(se);
if (hmp_cpu_is_slowest(cpu))
return 0;
if (cpumask_intersects(&hmp_slower_domain(cpu)->cpus,
tsk_cpus_allowed(p))
&& se->avg.load_avg_ratio < hmp_down_threshold) {
return 1;
}
same here.
Agree.
return 0;
+}
+#endif /* CONFIG_SCHED_HMP */
/*
- run_rebalance_domains is triggered when needed from the scheduler tick.
- Also triggered for nohz idle balancing (with nohz_balancing_kick set).
@@ -5861,6 +6025,10 @@ __init void init_sched_fair_class(void) zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT); cpu_notifier(sched_ilb_notifier, 0); #endif
+#ifdef CONFIG_SCHED_HMP
hmp_cpu_mask_setup();
Should we check the return value? If not required then should we make fn() declaration return void?
It can be changed to void if we don't add any error handling anyway.
+#endif #endif /* SMP */
} diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 81135f9..4990d9e 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -547,6 +547,12 @@ DECLARE_PER_CPU(int, sd_llc_id);
extern int group_balance_cpu(struct sched_group *sg);
+#ifdef CONFIG_SCHED_HMP
+static LIST_HEAD(hmp_domains);
+DECLARE_PER_CPU(struct hmp_domain *, hmp_cpu_domain);
+#define hmp_cpu_domain(cpu) (per_cpu(hmp_cpu_domain, (cpu)))
can drop "()" around per_cpu().
Both the per_cpu variable and the macro to get it have the same name. Can we try giving them better names, or at least add an "_" before the per_cpu pointer's name?
Yes.
-- viresh
On 9 October 2012 21:26, Morten Rasmussen Morten.Rasmussen@arm.com wrote:
On Thu, Oct 04, 2012 at 07:02:03AM +0100, Viresh Kumar wrote:
On 22 September 2012 00:02, morten.rasmussen@arm.com wrote:
SCHED_HMP requires that the platform implements arch_get_hmp_domains() which should set up the platform specific list of hmp_domains. It is also assumed that the platform disables SD_LOAD_BALANCE for the appropriate sched_domains.
An explanation of this requirement would be helpful here.
Yes. This is to prevent the load-balancer from moving tasks between hmp_domains. This will be done exclusively by SCHED_HMP instead to implement a strict task migration policy and avoid changing the load-balancer behaviour. The load-balancer will take care of load-balancing within each hmp_domain.
Honestly speaking, I understand this point now; earlier it wasn't clear to me :)
What would be ideal is to put this information in the comment just before we re-define other SCHED_*** domains where we disable balancing. And keep it in the commit log too.
+struct hmp_domain {
struct cpumask cpus;
struct list_head hmp_domains;
Probably need a better name here. domain_list?
Yes. hmp_domain_list would be better and stick with the hmp_* naming convention.
IMHO hmp_ would be better for global names, but names of variables enclosed within another hmp_*** data type don't actually need hmp_**, as this is implicit.
i.e. struct hmp_domain { struct cpumask cpus; struct list_head domain_list; }
would be better than struct list_head hmp_domain_list;
as the parent structure already has hmp_***. So whatever is inside the struct is obviously hmp specific.
+/* Setup hmp_domains */
+static int __init hmp_cpu_mask_setup(void)
How should we interpret its return value? Can you mention what 0 and 1 mean here?
Returns 0 if domain setup failed, i.e. the domain list is empty, and 1 otherwise.
Helpful. Please mention this in function comment in your next revision.
+{
char buf[64];
struct hmp_domain *domain;
struct list_head *pos;
int dc, cpu;
/* Print hmp_domains */
dc = 0;
Should be done during definition of dc.
You missed this ??
for_each_cpu_mask(cpu, domain->cpus) {
per_cpu(hmp_cpu_domain, cpu) = domain;
}
Should use hmp_cpu_domain(cpu) here. Also no need of {} for single line loop.
??
dc++;
You aren't using it... Only for testing? Should we remove it from mainline patchset and keep it locally?
I'm using it in the pr_debug line a little earlier. It is used for enumerating the hmp_domains.
My mistake :(
+/* Check if cpu is in fastest hmp_domain */
+static inline unsigned int hmp_cpu_is_fastest(int cpu)
+{
struct list_head *pos;
pos = &hmp_cpu_domain(cpu)->hmp_domains;
return pos == hmp_domains.next;
better create list_is_first() for this.
I had the same thought, but I see that as a separate patch that should be submitted separately.
Correct. So better send it now, so that it is included before you send your next version. :)
-- viresh
From: Morten Rasmussen morten.rasmussen@arm.com
This patch introduces forced task migration for moving suitable currently running tasks between hmp_domains. Task behaviour is likely to change over time. Tasks running in a less capable hmp_domain may change to become more demanding and should therefore be migrated up. They are unlikely to go through the select_task_rq_fair() path anytime soon and therefore need special attention.
This patch introduces a periodic check (SCHED_TICK) of the currently running task on all runqueues and sets up a forced migration using stop_one_cpu_nowait() if the task needs to be migrated.
Ideally, this should not be implemented by polling all runqueues.
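The polling scheme described above can be modelled in userspace roughly as follows. This is a simplified sketch under stated assumptions: struct sketch_rq and both functions are hypothetical, locking is omitted, and setting active_balance stands in for queueing the cpu stopper work. It only shows the control flow of one pass over all runqueues.

```c
/* Hypothetical, heavily reduced model of a runqueue. */
struct sketch_rq {
	int is_fast;		/* cpu belongs to the fast hmp_domain */
	int has_curr;		/* a fair-class task is currently running */
	unsigned int curr_load;	/* load ratio of that task, [0..1023] */
	int active_balance;	/* a forced migration is already pending */
	int push_cpu;		/* destination cpu of the pending migration */
};

static unsigned int hmp_up_threshold = 512;

/* Pick the first fast cpu, or -1 if there is none (affinity is ignored). */
static int sketch_select_faster_cpu(const struct sketch_rq *rqs, int nr_cpus)
{
	int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		if (rqs[cpu].is_fast)
			return cpu;
	return -1;
}

/*
 * One pass of the periodic check. Returns the number of forced
 * migrations that were requested during this pass.
 */
static int sketch_force_up_migration(struct sketch_rq *rqs, int nr_cpus)
{
	int cpu, target, requested = 0;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		struct sketch_rq *rq = &rqs[cpu];

		if (!rq->has_curr || rq->is_fast)
			continue;
		if (rq->curr_load <= hmp_up_threshold)
			continue;
		if (rq->active_balance)
			continue;	/* a request is already in flight */

		target = sketch_select_faster_cpu(rqs, nr_cpus);
		if (target < 0)
			continue;

		rq->active_balance = 1;
		rq->push_cpu = target;
		requested++;
	}
	return requested;
}
```

Because every pass visits every cpu, the cost grows with the number of runqueues, which is the drawback the commit message points out.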
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com --- kernel/sched/fair.c | 196 +++++++++++++++++++++++++++++++++++++++++++++++++- kernel/sched/sched.h | 3 + 2 files changed, 198 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d80de46..490f1f0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3744,7 +3744,6 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) * 1) task is cache cold, or * 2) too many balance attempts have failed. */ - tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd); if (!tsk_cache_hot || env->sd->nr_balance_failed > env->sd->cache_nice_tries) { @@ -5516,6 +5515,199 @@ static unsigned int hmp_down_migration(int cpu, struct sched_entity *se) return 0; }
+/* + * hmp_can_migrate_task - may task p from runqueue rq be migrated to this_cpu? + * Ideally this function should be merged with can_migrate_task() to avoid + * redundant code. + */ +static int hmp_can_migrate_task(struct task_struct *p, struct lb_env *env) +{ + int tsk_cache_hot = 0; + + /* + * We do not migrate tasks that are: + * 1) running (obviously), or + * 2) cannot be migrated to this CPU due to cpus_allowed + */ + if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) { + schedstat_inc(p, se.statistics.nr_failed_migrations_affine); + return 0; + } + env->flags &= ~LBF_ALL_PINNED; + + if (task_running(env->src_rq, p)) { + schedstat_inc(p, se.statistics.nr_failed_migrations_running); + return 0; + } + + /* + * Aggressive migration if: + * 1) task is cache cold, or + * 2) too many balance attempts have failed. + */ + + tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd); + if (!tsk_cache_hot || + env->sd->nr_balance_failed > env->sd->cache_nice_tries) { +#ifdef CONFIG_SCHEDSTATS + if (tsk_cache_hot) { + schedstat_inc(env->sd, lb_hot_gained[env->idle]); + schedstat_inc(p, se.statistics.nr_forced_migrations); + } +#endif + return 1; + } + + return 1; +} + +/* + * move_specific_task tries to move a specific task. + * Returns 1 if successful and 0 otherwise. + * Called with both runqueues locked. + */ +static int move_specific_task(struct lb_env *env, struct task_struct *pm) +{ + struct task_struct *p, *n; + + list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) { + if (throttled_lb_pair(task_group(p), env->src_rq->cpu, + env->dst_cpu)) + continue; + + if (!hmp_can_migrate_task(p, env)) + continue; + /* Check if we found the right task */ + if (p != pm) + continue; + + move_task(p, env); + /* + * Right now, this is only the third place move_task() + * is called, so we can safely collect move_task() + * stats here rather than inside move_task(). 
+ */ + schedstat_inc(env->sd, lb_gained[env->idle]); + return 1; + } + return 0; +} + +/* + * hmp_active_task_migration_cpu_stop is run by cpu stopper and used to + * migrate a specific task from one runqueue to another. + * hmp_force_up_migration uses this to push a currently running task + * off a runqueue. + * Based on active_load_balance_stop_cpu and can potentially be merged. + */ +static int hmp_active_task_migration_cpu_stop(void *data) +{ + struct rq *busiest_rq = data; + struct task_struct *p = busiest_rq->migrate_task; + int busiest_cpu = cpu_of(busiest_rq); + int target_cpu = busiest_rq->push_cpu; + struct rq *target_rq = cpu_rq(target_cpu); + struct sched_domain *sd; + + raw_spin_lock_irq(&busiest_rq->lock); + /* make sure the requested cpu hasn't gone down in the meantime */ + if (unlikely(busiest_cpu != smp_processor_id() || + !busiest_rq->active_balance)) { + goto out_unlock; + } + /* Is there any task to move? */ + if (busiest_rq->nr_running <= 1) + goto out_unlock; + /* Task has migrated meanwhile, abort forced migration */ + if (task_rq(p) != busiest_rq) + goto out_unlock; + /* + * This condition is "impossible", if it occurs + * we need to fix it. Originally reported by + * Bjorn Helgaas on a 128-cpu setup. + */ + BUG_ON(busiest_rq == target_rq); + + /* move a task from busiest_rq to target_rq */ + double_lock_balance(busiest_rq, target_rq); + + /* Search for an sd spanning us and the target CPU. 
*/ + rcu_read_lock(); + for_each_domain(target_cpu, sd) { + if (cpumask_test_cpu(busiest_cpu, sched_domain_span(sd))) + break; + } + + if (likely(sd)) { + struct lb_env env = { + .sd = sd, + .dst_cpu = target_cpu, + .dst_rq = target_rq, + .src_cpu = busiest_rq->cpu, + .src_rq = busiest_rq, + .idle = CPU_IDLE, + }; + + schedstat_inc(sd, alb_count); + + if (move_specific_task(&env, p)) + schedstat_inc(sd, alb_pushed); + else + schedstat_inc(sd, alb_failed); + } + rcu_read_unlock(); + double_unlock_balance(busiest_rq, target_rq); +out_unlock: + busiest_rq->active_balance = 0; + raw_spin_unlock_irq(&busiest_rq->lock); + return 0; +} + +static DEFINE_SPINLOCK(hmp_force_migration); + +/* + * hmp_force_up_migration checks runqueues for tasks that need to + * be actively migrated to a faster cpu. + */ +static void hmp_force_up_migration(int this_cpu) +{ + int cpu; + struct sched_entity *curr; + struct rq *target; + unsigned long flags; + unsigned int force; + struct task_struct *p; + + if (!spin_trylock(&hmp_force_migration)) + return; + for_each_online_cpu(cpu) { + force = 0; + target = cpu_rq(cpu); + raw_spin_lock_irqsave(&target->lock, flags); + curr = target->cfs.curr; + if (!curr || !entity_is_task(curr)) { + raw_spin_unlock_irqrestore(&target->lock, flags); + continue; + } + p = task_of(curr); + if (hmp_up_migration(cpu, curr)) { + if (!target->active_balance) { + target->active_balance = 1; + target->push_cpu = hmp_select_faster_cpu(p, cpu); + target->migrate_task = p; + force = 1; + } + } + raw_spin_unlock_irqrestore(&target->lock, flags); + if (force) + stop_one_cpu_nowait(cpu_of(target), + hmp_active_task_migration_cpu_stop, + target, &target->active_balance_work); + } + spin_unlock(&hmp_force_migration); +} +#else +static void hmp_force_up_migration(int this_cpu) { } #endif /* CONFIG_SCHED_HMP */
/* @@ -5529,6 +5721,8 @@ static void run_rebalance_domains(struct softirq_action *h) enum cpu_idle_type idle = this_rq->idle_balance ? CPU_IDLE : CPU_NOT_IDLE;
+ hmp_force_up_migration(this_cpu); + rebalance_domains(this_cpu, idle);
/* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 4990d9e..92858e9 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -425,6 +425,9 @@ struct rq { int active_balance; int push_cpu; struct cpu_stop_work active_balance_work; +#ifdef CONFIG_SCHED_HMP + struct task_struct *migrate_task; +#endif /* cpu of this runqueue: */ int cpu; int online;
Minor comments here :)
On 22 September 2012 00:02, morten.rasmussen@arm.com wrote:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d80de46..490f1f0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3744,7 +3744,6 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) * 1) task is cache cold, or * 2) too many balance attempts have failed. */
:(
tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd); if (!tsk_cache_hot || env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
@@ -5516,6 +5515,199 @@ static unsigned int hmp_down_migration(int cpu, struct sched_entity *se) return 0; }
+static int hmp_can_migrate_task(struct task_struct *p, struct lb_env *env) +{
<...>
+static int move_specific_task(struct lb_env *env, struct task_struct *pm) +{
struct task_struct *p, *n;
list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
if (throttled_lb_pair(task_group(p), env->src_rq->cpu,
env->dst_cpu))
continue;
Please fix indentation of above if statement.
<...>
+#else +static void hmp_force_up_migration(int this_cpu) { }
inline?
-- viresh
From: Morten Rasmussen morten.rasmussen@arm.com
Introduces a priority threshold which prevents low priority tasks from migrating to faster hmp_domains (cpus). This is useful for user-space software which assigns lower task priority to background tasks.
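As a rough illustration of how a NICE threshold maps onto the priority comparison, consider the sketch below. It assumes the usual kernel definitions MAX_RT_PRIO = 100 and NICE_TO_PRIO(nice) = MAX_RT_PRIO + nice + 20, reproduced here for userspace. SKETCH_PRIO_FILTER_VAL and sketch_prio_allows_up() are hypothetical names; the value 5 is the Kconfig default from this patch.

```c
/* Assumed to mirror the kernel's definitions; reproduced for userspace. */
#define MAX_RT_PRIO		100
#define NICE_TO_PRIO(nice)	(MAX_RT_PRIO + (nice) + 20)

/* Stand-in for CONFIG_SCHED_HMP_PRIO_FILTER_VAL (Kconfig default: 5). */
#define SKETCH_PRIO_FILTER_VAL	5

static int hmp_up_prio = NICE_TO_PRIO(SKETCH_PRIO_FILTER_VAL);

/*
 * A numerically lower prio means a higher priority. Only tasks with a
 * prio strictly below hmp_up_prio may be migrated to a faster cpu.
 */
static int sketch_prio_allows_up(int prio)
{
	return prio < hmp_up_prio;
}
```

Under these assumptions the default threshold corresponds to prio 125, so tasks at nice 5 or higher stay on the slow cpus while nice 4 and below remain eligible for up-migration.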
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com --- arch/arm/Kconfig | 13 +++++++++++++ kernel/sched/fair.c | 15 +++++++++++++++ 2 files changed, 28 insertions(+)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 5b09684..05de193 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1571,6 +1571,19 @@ config SCHED_HMP !SCHED_AUTOGROUP. Furthermore, normal load-balancing must be disabled between cpus of different type (DISABLE_CPU_SCHED_DOMAIN_BALANCE).
+config SCHED_HMP_PRIO_FILTER + bool "(EXPERIMENTAL) Filter HMP migrations by task priority" + depends on SCHED_HMP + help + Enables task priority based HMP migration filter. Any task with + a NICE value above the threshold will always be on low-power cpus + with less compute capacity. + +config SCHED_HMP_PRIO_FILTER_VAL + int "NICE priority threshold" + default 5 + depends on SCHED_HMP_PRIO_FILTER + config HAVE_ARM_SCU bool help diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 490f1f0..8f0f3b9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3129,9 +3129,12 @@ static int __init hmp_cpu_mask_setup(void) * hmp_down_threshold: max. load allowed for tasks migrating to a slower cpu * The default values (512, 256) offer good responsiveness, but may need * tweaking suit particular needs. + * + * hmp_up_prio: Only up migrate task with high priority (<hmp_up_prio) */ unsigned int hmp_up_threshold = 512; unsigned int hmp_down_threshold = 256; +unsigned int hmp_up_prio = NICE_TO_PRIO(CONFIG_SCHED_HMP_PRIO_FILTER_VAL);
static unsigned int hmp_up_migration(int cpu, struct sched_entity *se); static unsigned int hmp_down_migration(int cpu, struct sched_entity *se); @@ -5491,6 +5494,12 @@ static unsigned int hmp_up_migration(int cpu, struct sched_entity *se) if (hmp_cpu_is_fastest(cpu)) return 0;
+#ifdef CONFIG_SCHED_HMP_PRIO_FILTER + /* Filter by task priority */ + if (p->prio >= hmp_up_prio) + return 0; +#endif + if (cpumask_intersects(&hmp_faster_domain(cpu)->cpus, tsk_cpus_allowed(p)) && se->avg.load_avg_ratio > hmp_up_threshold) { @@ -5507,6 +5516,12 @@ static unsigned int hmp_down_migration(int cpu, struct sched_entity *se) if (hmp_cpu_is_slowest(cpu)) return 0;
+#ifdef CONFIG_SCHED_HMP_PRIO_FILTER + /* Filter by task priority */ + if (p->prio >= hmp_up_prio) + return 1; +#endif + if (cpumask_intersects(&hmp_slower_domain(cpu)->cpus, tsk_cpus_allowed(p)) && se->avg.load_avg_ratio < hmp_down_threshold) {
On 22 September 2012 00:02, morten.rasmussen@arm.com wrote:
Hi Morten,
I will try to review your patches in the coming days. For now, I am just reporting a problem which I encountered during a routine build.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 490f1f0..8f0f3b9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3129,9 +3129,12 @@ static int __init hmp_cpu_mask_setup(void)
- hmp_down_threshold: max. load allowed for tasks migrating to a slower cpu
- The default values (512, 256) offer good responsiveness, but may need
- tweaking suit particular needs.
*/
- hmp_up_prio: Only up migrate task with high priority (<hmp_up_prio)
unsigned int hmp_up_threshold = 512; unsigned int hmp_down_threshold = 256;
#ifdef CONFIG_SCHED_HMP_PRIO_FILTER
+unsigned int hmp_up_prio = NICE_TO_PRIO(CONFIG_SCHED_HMP_PRIO_FILTER_VAL);
#endif
is required here for successful build without CONFIG_SCHED_HMP_PRIO_FILTER_VAL.
-- viresh
On 22 September 2012 00:02, morten.rasmussen@arm.com wrote:
+config SCHED_HMP_PRIO_FILTER
bool "(EXPERIMENTAL) Filter HMP migrations by task priority"
depends on SCHED_HMP
Should it depend on EXPERIMENTAL?
help
Enables task priority based HMP migration filter. Any task with
a NICE value above the threshold will always be on low-power cpus
with less compute capacity.
+config SCHED_HMP_PRIO_FILTER_VAL
int "NICE priority threshold"
default 5
depends on SCHED_HMP_PRIO_FILTER
config HAVE_ARM_SCU bool help diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 490f1f0..8f0f3b9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3129,9 +3129,12 @@ static int __init hmp_cpu_mask_setup(void)
- hmp_down_threshold: max. load allowed for tasks migrating to a slower cpu
- The default values (512, 256) offer good responsiveness, but may need
- tweaking suit particular needs.
*/
- hmp_up_prio: Only up migrate task with high priority (<hmp_up_prio)
unsigned int hmp_up_threshold = 512; unsigned int hmp_down_threshold = 256; +unsigned int hmp_up_prio = NICE_TO_PRIO(CONFIG_SCHED_HMP_PRIO_FILTER_VAL);
static unsigned int hmp_up_migration(int cpu, struct sched_entity *se); static unsigned int hmp_down_migration(int cpu, struct sched_entity *se); @@ -5491,6 +5494,12 @@ static unsigned int hmp_up_migration(int cpu, struct sched_entity *se) if (hmp_cpu_is_fastest(cpu)) return 0;
+#ifdef CONFIG_SCHED_HMP_PRIO_FILTER
/* Filter by task priority */
if (p->prio >= hmp_up_prio)
return 0;
+#endif
if (cpumask_intersects(&hmp_faster_domain(cpu)->cpus, tsk_cpus_allowed(p)) && se->avg.load_avg_ratio > hmp_up_threshold) {
@@ -5507,6 +5516,12 @@ static unsigned int hmp_down_migration(int cpu, struct sched_entity *se) if (hmp_cpu_is_slowest(cpu)) return 0;
+#ifdef CONFIG_SCHED_HMP_PRIO_FILTER
/* Filter by task priority */
if (p->prio >= hmp_up_prio)
return 1;
+#endif
Even if below cpumask_intersects() fails?
if (cpumask_intersects(&hmp_slower_domain(cpu)->cpus, tsk_cpus_allowed(p)) && se->avg.load_avg_ratio < hmp_down_threshold) {
-- viresh
On Thu, Oct 04, 2012 at 07:27:00AM +0100, Viresh Kumar wrote:
On 22 September 2012 00:02, morten.rasmussen@arm.com wrote:
+config SCHED_HMP_PRIO_FILTER
bool "(EXPERIMENTAL) Filter HMP migrations by task priority"
depends on SCHED_HMP
Should it depend on EXPERIMENTAL?
help
Enables task priority based HMP migration filter. Any task with
a NICE value above the threshold will always be on low-power cpus
with less compute capacity.
+config SCHED_HMP_PRIO_FILTER_VAL
int "NICE priority threshold"
default 5
depends on SCHED_HMP_PRIO_FILTER
config HAVE_ARM_SCU bool help diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 490f1f0..8f0f3b9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3129,9 +3129,12 @@ static int __init hmp_cpu_mask_setup(void)
- hmp_down_threshold: max. load allowed for tasks migrating to a slower cpu
- The default values (512, 256) offer good responsiveness, but may need
- tweaking suit particular needs.
*/
- hmp_up_prio: Only up migrate task with high priority (<hmp_up_prio)
unsigned int hmp_up_threshold = 512; unsigned int hmp_down_threshold = 256; +unsigned int hmp_up_prio = NICE_TO_PRIO(CONFIG_SCHED_HMP_PRIO_FILTER_VAL);
static unsigned int hmp_up_migration(int cpu, struct sched_entity *se); static unsigned int hmp_down_migration(int cpu, struct sched_entity *se); @@ -5491,6 +5494,12 @@ static unsigned int hmp_up_migration(int cpu, struct sched_entity *se) if (hmp_cpu_is_fastest(cpu)) return 0;
+#ifdef CONFIG_SCHED_HMP_PRIO_FILTER
/* Filter by task priority */
if (p->prio >= hmp_up_prio)
return 0;
+#endif
if (cpumask_intersects(&hmp_faster_domain(cpu)->cpus, tsk_cpus_allowed(p)) && se->avg.load_avg_ratio > hmp_up_threshold) {
@@ -5507,6 +5516,12 @@ static unsigned int hmp_down_migration(int cpu, struct sched_entity *se) if (hmp_cpu_is_slowest(cpu)) return 0;
+#ifdef CONFIG_SCHED_HMP_PRIO_FILTER
/* Filter by task priority */
if (p->prio >= hmp_up_prio)
return 1;
+#endif
Even if below cpumask_intersects() fails?
No. Good catch :)
if (cpumask_intersects(&hmp_slower_domain(cpu)->cpus, tsk_cpus_allowed(p)) && se->avg.load_avg_ratio < hmp_down_threshold) {
-- viresh
Thanks, Morten
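One possible way to address the point raised above is to let the affinity check gate both the priority filter and the load check. The following is only a userspace sketch under stated assumptions: cpumasks are plain bitmasks, struct sketch_task and sketch_down_migration() are hypothetical, and the thread does not show what the next revision of the patch actually does.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the fields of task_struct used here. */
struct sketch_task {
	int prio;			/* lower value means higher priority */
	unsigned int load_avg_ratio;	/* tracked load, range [0..1023] */
	unsigned long cpus_allowed;	/* affinity as a bitmask of cpu ids */
};

static unsigned int hmp_down_threshold = 256;
static int hmp_up_prio = 125;	/* prio for nice 5, the Kconfig default */

/*
 * Decide whether a task should move to a slower cpu.
 * on_slowest: the task already runs in the slowest domain.
 * slower_cpus: bitmask of the cpus in the next slower domain.
 */
static bool sketch_down_migration(const struct sketch_task *t, bool on_slowest,
				  unsigned long slower_cpus)
{
	if (on_slowest)
		return false;

	/* Never ask for a migration the task's affinity cannot satisfy. */
	if (!(slower_cpus & t->cpus_allowed))
		return false;

	/* Low priority tasks always go down, regardless of their load. */
	if (t->prio >= hmp_up_prio)
		return true;

	return t->load_avg_ratio < hmp_down_threshold;
}
```

In this form a low priority task that is pinned to the fast cpus stays where it is instead of being handed to hmp_select_slower_cpu() with an empty intersection.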
On Tue, 2012-10-09 at 17:40 +0100, Morten Rasmussen wrote:
On Thu, Oct 04, 2012 at 07:27:00AM +0100, Viresh Kumar wrote:
On 22 September 2012 00:02, morten.rasmussen@arm.com wrote:
+config SCHED_HMP_PRIO_FILTER
bool "(EXPERIMENTAL) Filter HMP migrations by task priority"
depends on SCHED_HMP
Should it depend on EXPERIMENTAL?
help
Enables task priority based HMP migration filter. Any task with
a NICE value above the threshold will always be on low-power cpus
with less compute capacity.
+config SCHED_HMP_PRIO_FILTER_VAL
int "NICE priority threshold"
default 5
depends on SCHED_HMP_PRIO_FILTER
config HAVE_ARM_SCU bool help diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 490f1f0..8f0f3b9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3129,9 +3129,12 @@ static int __init hmp_cpu_mask_setup(void)
- hmp_down_threshold: max. load allowed for tasks migrating to a slower cpu
- The default values (512, 256) offer good responsiveness, but may need
- tweaking suit particular needs.
*/
- hmp_up_prio: Only up migrate task with high priority (<hmp_up_prio)
unsigned int hmp_up_threshold = 512; unsigned int hmp_down_threshold = 256;
The hmp_*_threshold variables could be named sysctl_hmp_*_threshold and exposed under /proc/sys/kernel, so that they can be tuned to sensible values at run time.
+unsigned int hmp_up_prio = NICE_TO_PRIO(CONFIG_SCHED_HMP_PRIO_FILTER_VAL);
static unsigned int hmp_up_migration(int cpu, struct sched_entity *se); static unsigned int hmp_down_migration(int cpu, struct sched_entity *se); @@ -5491,6 +5494,12 @@ static unsigned int hmp_up_migration(int cpu, struct sched_entity *se) if (hmp_cpu_is_fastest(cpu)) return 0;
+#ifdef CONFIG_SCHED_HMP_PRIO_FILTER
/* Filter by task priority */
if (p->prio >= hmp_up_prio)
return 0;
+#endif
if (cpumask_intersects(&hmp_faster_domain(cpu)->cpus, tsk_cpus_allowed(p)) && se->avg.load_avg_ratio > hmp_up_threshold) {
@@ -5507,6 +5516,12 @@ static unsigned int hmp_down_migration(int cpu, struct sched_entity *se) if (hmp_cpu_is_slowest(cpu)) return 0;
+#ifdef CONFIG_SCHED_HMP_PRIO_FILTER
/* Filter by task priority */
if (p->prio >= hmp_up_prio)
return 1;
+#endif
Even if below cpumask_intersects() fails?
No. Good catch :)
if (cpumask_intersects(&hmp_slower_domain(cpu)->cpus, tsk_cpus_allowed(p)) && se->avg.load_avg_ratio < hmp_down_threshold) {
-- viresh
Thanks, Morten
From: Morten Rasmussen morten.rasmussen@arm.com
Adds Kconfig entries to enable HMP scheduling on ARM platforms. Currently, it disables CPU level sched_domain load-balancing in order to simplify things. This needs fixing in a later revision. HMP scheduling will do the load-balancing at this level instead.
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com --- arch/arm/Kconfig | 14 ++++++++++++++ arch/arm/include/asm/topology.h | 32 ++++++++++++++++++++++++++++++++ 2 files changed, 46 insertions(+)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 05de193..cb80846 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1584,6 +1584,20 @@ config SCHED_HMP_PRIO_FILTER_VAL
 	default 5
 	depends on SCHED_HMP_PRIO_FILTER
 
+config HMP_FAST_CPU_MASK
+	string "HMP scheduler fast CPU mask"
+	depends on SCHED_HMP
+	help
+	  Specify the cpuids of the fast CPUs in the system as a list string,
+	  e.g. cpuid 0+1 should be specified as 0-1.
+
+config HMP_SLOW_CPU_MASK
+	string "HMP scheduler slow CPU mask"
+	depends on SCHED_HMP
+	help
+	  Specify the cpuids of the slow CPUs in the system as a list string,
+	  e.g. cpuid 0+1 should be specified as 0-1.
+
 config HAVE_ARM_SCU
 	bool
 	help
diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 58b8b84..13a03de 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -27,6 +27,38 @@ void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
+#ifdef CONFIG_DISABLE_CPU_SCHED_DOMAIN_BALANCE
+/* Common values for CPUs */
+#ifndef SD_CPU_INIT
+#define SD_CPU_INIT (struct sched_domain) { \
+	.min_interval = 1, \
+	.max_interval = 4, \
+	.busy_factor = 64, \
+	.imbalance_pct = 125, \
+	.cache_nice_tries = 1, \
+	.busy_idx = 2, \
+	.idle_idx = 1, \
+	.newidle_idx = 0, \
+	.wake_idx = 0, \
+	.forkexec_idx = 0, \
+	\
+	.flags = 0*SD_LOAD_BALANCE \
+		| 1*SD_BALANCE_NEWIDLE \
+		| 1*SD_BALANCE_EXEC \
+		| 1*SD_BALANCE_FORK \
+		| 0*SD_BALANCE_WAKE \
+		| 1*SD_WAKE_AFFINE \
+		| 0*SD_PREFER_LOCAL \
+		| 0*SD_SHARE_CPUPOWER \
+		| 0*SD_SHARE_PKG_RESOURCES \
+		| 0*SD_SERIALIZE \
+		, \
+	.last_balance = jiffies, \
+	.balance_interval = 1, \
+}
+#endif
+#endif /* CONFIG_DISABLE_CPU_SCHED_DOMAIN_BALANCE */
+
 #else
static inline void init_cpu_topology(void) { }
From: Morten Rasmussen morten.rasmussen@arm.com
We can't rely on Kconfig options to set the fast and slow CPU lists for HMP scheduling if we want a single kernel binary to support multiple devices with different CPU topology. E.g. TC2 (ARM's Test-Chip-2 big.LITTLE system), Fast Models, or even non big.LITTLE devices.
This patch adds the function arch_get_fast_and_slow_cpus() to generate the lists at run-time by parsing the CPU nodes in device-tree; it assumes slow cores are A7s and everything else is fast. The function still supports the old Kconfig options as this is useful for testing the HMP scheduler on devices without big.LITTLE.
This patch reuses a patch by Jon Medhurst tixy@linaro.org with a few bits left out.
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com --- arch/arm/Kconfig | 4 ++- arch/arm/kernel/topology.c | 69 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 72 insertions(+), 1 deletion(-)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index cb80846..f1271bc 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1588,13 +1588,15 @@ config HMP_FAST_CPU_MASK string "HMP scheduler fast CPU mask" depends on SCHED_HMP help - Specify the cpuids of the fast CPUs in the system as a list string, + Leave empty to use device tree information. + Specify the cpuids of the fast CPUs in the system as a list string, e.g. cpuid 0+1 should be specified as 0-1.
config HMP_SLOW_CPU_MASK string "HMP scheduler slow CPU mask" depends on SCHED_HMP help + Leave empty to use device tree information. Specify the cpuids of the slow CPUs in the system as a list string, e.g. cpuid 0+1 should be specified as 0-1.
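To make the list string format concrete ("0-1" for cpus 0 and 1), here is a small userspace parser sketch. It is a hypothetical stand-in and not the kernel's cpulist_parse(): it handles only decimal cpu ids 0..63, comma-separated entries and a-b ranges, and stores the result in a 64-bit mask. An empty string yields an empty mask, matching the "leave empty" convention of the help text.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Parse a cpu list string such as "0-1" or "0,2-4" into a bitmask.
 * Returns 0 on success and -1 on malformed input.
 */
static int sketch_cpulist_parse(const char *s, unsigned long long *mask)
{
	*mask = 0;

	while (*s) {
		char *end;
		unsigned long first, last;

		if (!isdigit((unsigned char)*s))
			return -1;
		first = strtoul(s, &end, 10);
		last = first;

		/* Optional "-last" part of a range. */
		if (*end == '-') {
			s = end + 1;
			if (!isdigit((unsigned char)*s))
				return -1;
			last = strtoul(s, &end, 10);
		}
		if (first > last || last >= 64)
			return -1;

		for (; first <= last; first++)
			*mask |= 1ULL << first;

		/* Entries are separated by commas; no trailing comma. */
		if (*end == ',') {
			end++;
			if (*end == '\0')
				return -1;
		} else if (*end != '\0') {
			return -1;
		}
		s = end;
	}
	return 0;
}
```

For a TC2-like layout with two fast and three slow cores, the two options would then be set to something like "0-1" and "2-4", assuming that cpu numbering.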
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 26c12c6..7682e12 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -317,6 +317,75 @@ void store_cpu_topology(unsigned int cpuid) cpu_topology[cpuid].socket_id, mpidr); }
+ +#ifdef CONFIG_SCHED_HMP + +static const char * const little_cores[] = { + "arm,cortex-a7", + NULL, +}; + +static bool is_little_cpu(struct device_node *cn) +{ + const char * const *lc; + for (lc = little_cores; *lc; lc++) + if (of_device_is_compatible(cn, *lc)) + return true; + return false; +} + +void __init arch_get_fast_and_slow_cpus(struct cpumask *fast, + struct cpumask *slow) +{ + struct device_node *cn = NULL; + int cpu = 0; + + cpumask_clear(fast); + cpumask_clear(slow); + + /* + * Use the config options if they are given. This helps testing + * HMP scheduling on systems without a big.LITTLE architecture. + */ + if (strlen(CONFIG_HMP_FAST_CPU_MASK) && strlen(CONFIG_HMP_SLOW_CPU_MASK)) { + if (cpulist_parse(CONFIG_HMP_FAST_CPU_MASK, fast)) + WARN(1, "Failed to parse HMP fast cpu mask!\n"); + if (cpulist_parse(CONFIG_HMP_SLOW_CPU_MASK, slow)) + WARN(1, "Failed to parse HMP slow cpu mask!\n"); + return; + } + + /* + * Else, parse device tree for little cores. + */ + while ((cn = of_find_node_by_type(cn, "cpu"))) { + + if (cpu >= num_possible_cpus()) + break; + + if (is_little_cpu(cn)) + cpumask_set_cpu(cpu, slow); + else + cpumask_set_cpu(cpu, fast); + + cpu++; + } + + if (!cpumask_empty(fast) && !cpumask_empty(slow)) + return; + + /* + * We didn't find both big and little cores so let's call all cores + * fast as this will keep the system running, with all cores being + * treated equal. + */ + cpumask_setall(fast); + cpumask_clear(slow); +} + +#endif /* CONFIG_SCHED_HMP */ + + /* * init_cpu_topology is called at boot when only one cpu is running * which prevent simultaneous write access to cpu_topology array
On 22 September 2012 00:02, morten.rasmussen@arm.com wrote:
From: Morten Rasmussen morten.rasmussen@arm.com
We can't rely on Kconfig options to set the fast and slow CPU lists for HMP scheduling if we want a single kernel binary to support multiple devices with different CPU topology. E.g. TC2 (ARM's Test-Chip-2 big.LITTLE system), Fast Models, or even non big.LITTLE devices.
This patch adds the function arch_get_fast_and_slow_cpus() to generate the lists at run-time by parsing the CPU nodes in device-tree; it assumes slow cores are A7s and everything else is fast. The function still supports the old Kconfig options as this is useful for testing the HMP scheduler on devices without big.LITTLE.
But this code is handling this case too at the end, with following logic:
cpumask_setall(fast);
cpumask_clear(slow);
Am i missing something?
This patch is reuse of a patch by Jon Medhurst tixy@linaro.org with a few bits left out.
Then probably he must be the author of this commit? Also a SOB is required from him here.
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com
arch/arm/Kconfig | 4 ++- arch/arm/kernel/topology.c | 69 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 72 insertions(+), 1 deletion(-)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index cb80846..f1271bc 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1588,13 +1588,15 @@ config HMP_FAST_CPU_MASK string "HMP scheduler fast CPU mask" depends on SCHED_HMP help
Specify the cpuids of the fast CPUs in the system as a list string,
Leave empty to use device tree information.
Specify the cpuids of the fast CPUs in the system as a list string, e.g. cpuid 0+1 should be specified as 0-1.
config HMP_SLOW_CPU_MASK string "HMP scheduler slow CPU mask" depends on SCHED_HMP help
Leave empty to use device tree information. Specify the cpuids of the slow CPUs in the system as a list string, e.g. cpuid 0+1 should be specified as 0-1.
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 26c12c6..7682e12 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -317,6 +317,75 @@ void store_cpu_topology(unsigned int cpuid) cpu_topology[cpuid].socket_id, mpidr); }
+#ifdef CONFIG_SCHED_HMP
+static const char * const little_cores[] = {
"arm,cortex-a7",
NULL,
+};
+static bool is_little_cpu(struct device_node *cn) +{
const char * const *lc;
for (lc = little_cores; *lc; lc++)
if (of_device_is_compatible(cn, *lc))
return true;
return false;
+}
+void __init arch_get_fast_and_slow_cpus(struct cpumask *fast,
struct cpumask *slow)
+{
struct device_node *cn = NULL;
int cpu = 0;
cpumask_clear(fast);
cpumask_clear(slow);
/*
* Use the config options if they are given. This helps testing
* HMP scheduling on systems without a big.LITTLE architecture.
*/
if (strlen(CONFIG_HMP_FAST_CPU_MASK) && strlen(CONFIG_HMP_SLOW_CPU_MASK)) {
if (cpulist_parse(CONFIG_HMP_FAST_CPU_MASK, fast))
WARN(1, "Failed to parse HMP fast cpu mask!\n");
if (cpulist_parse(CONFIG_HMP_SLOW_CPU_MASK, slow))
WARN(1, "Failed to parse HMP slow cpu mask!\n");
return;
}
/*
* Else, parse device tree for little cores.
*/
while ((cn = of_find_node_by_type(cn, "cpu"))) {
if (cpu >= num_possible_cpus())
break;
if (is_little_cpu(cn))
cpumask_set_cpu(cpu, slow);
else
cpumask_set_cpu(cpu, fast);
cpu++;
}
if (!cpumask_empty(fast) && !cpumask_empty(slow))
return;
/*
* We didn't find both big and little cores so let's call all cores
* fast as this will keep the system running, with all cores being
* treated equal.
*/
cpumask_setall(fast);
cpumask_clear(slow);
+}
+#endif /* CONFIG_SCHED_HMP */
All above calls to of_*() routines have dependency on CONFIG_OF
-- viresh
On Thu, Oct 04, 2012 at 07:49:32AM +0100, Viresh Kumar wrote:
On 22 September 2012 00:02, morten.rasmussen@arm.com wrote:
From: Morten Rasmussen morten.rasmussen@arm.com
We can't rely on Kconfig options to set the fast and slow CPU lists for HMP scheduling if we want a single kernel binary to support multiple devices with different CPU topology. E.g. TC2 (ARM's Test-Chip-2 big.LITTLE system), Fast Models, or even non big.LITTLE devices.
This patch adds the function arch_get_fast_and_slow_cpus() to generate the lists at run-time by parsing the CPU nodes in device-tree; it assumes slow cores are A7s and everything else is fast. The function still supports the old Kconfig options as this is useful for testing the HMP scheduler on devices without big.LITTLE.
But this code is handling this case too at the end, with following logic:
cpumask_setall(fast);
cpumask_clear(slow);
Am i missing something?
The HMP setup can be defined using Kconfig or DT. If both fail, it will set all cpus to be fast cpus and effectively disable SCHED_HMP. The Kconfig option is kept to allow testing of alternative HMP setups without having to change the DT or use DT at all, which might be handy for non-ARM platforms. I hope that answers your question.
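The selection order described above (Kconfig strings first, then the device tree, then the all-fast fallback) can be sketched in plain userspace C. This is only an illustrative model, not the kernel code: mask_t, parse_cpulist() and get_fast_and_slow() are hypothetical names, a 32-bit word stands in for struct cpumask, an array of flags stands in for the device-tree walk, and the parser is a simplified stand-in for cpulist_parse().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct cpumask: one bit per CPU, max 32 CPUs. */
typedef uint32_t mask_t;

/*
 * Simplified stand-in for cpulist_parse(): accepts "a", "a-b" and
 * comma-separated combinations such as "0-1,4". Returns 0 on success.
 */
static int parse_cpulist(const char *s, mask_t *out)
{
	*out = 0;
	while (*s) {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi = lo;

		if (end == s)
			return -1;
		if (*end == '-') {
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s)
				return -1;
		}
		if (lo < 0 || hi > 31 || lo > hi)
			return -1;
		for (long c = lo; c <= hi; c++)
			*out |= (mask_t)1 << c;
		if (*end == ',')
			end++;
		else if (*end)
			return -1;
		s = end;
	}
	return 0;
}

/*
 * Model of the three-stage selection. is_little[] replaces the
 * of_find_node_by_type()/is_little_cpu() walk of the real patch.
 */
static void get_fast_and_slow(const char *kcfg_fast, const char *kcfg_slow,
			      const bool *is_little, int ncpus,
			      mask_t *fast, mask_t *slow)
{
	mask_t all;

	assert(ncpus > 0 && ncpus <= 32);
	all = ncpus == 32 ? ~(mask_t)0 : (((mask_t)1 << ncpus) - 1);
	*fast = *slow = 0;

	/* Stage 1: both Kconfig strings given. As in the posted patch,
	 * return even if parsing failed (the kernel code only WARNs). */
	if (strlen(kcfg_fast) && strlen(kcfg_slow)) {
		(void)parse_cpulist(kcfg_fast, fast);
		(void)parse_cpulist(kcfg_slow, slow);
		return;
	}

	/* Stage 2: classify each CPU from the (modelled) device tree. */
	for (int cpu = 0; cpu < ncpus; cpu++) {
		if (is_little[cpu])
			*slow |= (mask_t)1 << cpu;
		else
			*fast |= (mask_t)1 << cpu;
	}
	if (*fast && *slow)
		return;

	/* Stage 3: not both kinds found, treat every CPU as fast. */
	*fast = all;
	*slow = 0;
}
```

With empty Kconfig strings and a made-up five-CPU layout where CPUs 2-4 are little, this yields fast = 0x03 and slow = 0x1c; with no little CPUs at all, every CPU ends up in the fast mask and the slow mask stays empty, which is the "effectively disable SCHED_HMP" case.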
This patch is reuse of a patch by Jon Medhurst tixy@linaro.org with a few bits left out.
Then probably he must be the author of this commit? Also a SOB is required from him here.
I don't know what the correct procedure is for this sort of partial patch reuse. Since I didn't know better, I adopted Tixy's own reference style that he used in one of his patches which is an extension of a previous patch by me. I will of course fix it to follow normal procedure if there is one.
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com
arch/arm/Kconfig | 4 ++- arch/arm/kernel/topology.c | 69 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 72 insertions(+), 1 deletion(-)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index cb80846..f1271bc 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1588,13 +1588,15 @@ config HMP_FAST_CPU_MASK string "HMP scheduler fast CPU mask" depends on SCHED_HMP help
Specify the cpuids of the fast CPUs in the system as a list string,
Leave empty to use device tree information.
Specify the cpuids of the fast CPUs in the system as a list string, e.g. cpuid 0+1 should be specified as 0-1.
config HMP_SLOW_CPU_MASK string "HMP scheduler slow CPU mask" depends on SCHED_HMP help
Leave empty to use device tree information. Specify the cpuids of the slow CPUs in the system as a list string, e.g. cpuid 0+1 should be specified as 0-1.
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 26c12c6..7682e12 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -317,6 +317,75 @@ void store_cpu_topology(unsigned int cpuid) cpu_topology[cpuid].socket_id, mpidr); }
+#ifdef CONFIG_SCHED_HMP
+static const char * const little_cores[] = {
"arm,cortex-a7",
NULL,
+};
+static bool is_little_cpu(struct device_node *cn) +{
const char * const *lc;
for (lc = little_cores; *lc; lc++)
if (of_device_is_compatible(cn, *lc))
return true;
return false;
+}
+void __init arch_get_fast_and_slow_cpus(struct cpumask *fast,
struct cpumask *slow)
+{
struct device_node *cn = NULL;
int cpu = 0;
cpumask_clear(fast);
cpumask_clear(slow);
/*
* Use the config options if they are given. This helps testing
* HMP scheduling on systems without a big.LITTLE architecture.
*/
if (strlen(CONFIG_HMP_FAST_CPU_MASK) && strlen(CONFIG_HMP_SLOW_CPU_MASK)) {
if (cpulist_parse(CONFIG_HMP_FAST_CPU_MASK, fast))
WARN(1, "Failed to parse HMP fast cpu mask!\n");
if (cpulist_parse(CONFIG_HMP_SLOW_CPU_MASK, slow))
WARN(1, "Failed to parse HMP slow cpu mask!\n");
return;
}
/*
* Else, parse device tree for little cores.
*/
while ((cn = of_find_node_by_type(cn, "cpu"))) {
if (cpu >= num_possible_cpus())
break;
if (is_little_cpu(cn))
cpumask_set_cpu(cpu, slow);
else
cpumask_set_cpu(cpu, fast);
cpu++;
}
if (!cpumask_empty(fast) && !cpumask_empty(slow))
return;
/*
* We didn't find both big and little cores so let's call all cores
* fast as this will keep the system running, with all cores being
* treated equal.
*/
cpumask_setall(fast);
cpumask_clear(slow);
+}
+#endif /* CONFIG_SCHED_HMP */
All above calls to of_*() routines have dependency on CONFIG_OF
It would be very easy to blame someone else here... :) I will fix it.
Thanks, Morten
-- viresh
On 10 October 2012 15:47, Morten Rasmussen Morten.Rasmussen@arm.com wrote:
On Thu, Oct 04, 2012 at 07:49:32AM +0100, Viresh Kumar wrote:
This patch is reuse of a patch by Jon Medhurst tixy@linaro.org with a few bits left out.
Then probably he must be the author of this commit? Also a SOB is required from him here.
I don't know what the correct procedure is for this sort of partial patch reuse. Since I didn't know better, I adopted Tixy's own reference style that he used in one of his patches which is an extension of a previous patch by me. I will of course fix it to follow normal procedure if there is one.
AFAIK, if you have used only some part of the earlier patch, then you just need to add an SOB of the original author. But if you have picked most of the stuff from the original patch, which I feel is the case here, you must have the original author as author, plus his SOB and your SOB.
It would be very easy to blame someone else here... :) I will fix it.
:)
Hi Tixy,
Could you have a look at my code stealing patch below? Since it is basically a trimmed version of one of your patches I would prefer to put you as author and have your SOB on it. What is your opinion?
Thanks, Morten
On Fri, Sep 21, 2012 at 07:32:21PM +0100, Morten Rasmussen wrote:
From: Morten Rasmussen morten.rasmussen@arm.com
We can't rely on Kconfig options to set the fast and slow CPU lists for HMP scheduling if we want a single kernel binary to support multiple devices with different CPU topology. E.g. TC2 (ARM's Test-Chip-2 big.LITTLE system), Fast Models, or even non big.LITTLE devices.
This patch adds the function arch_get_fast_and_slow_cpus() to generate the lists at run-time by parsing the CPU nodes in device-tree; it assumes slow cores are A7s and everything else is fast. The function still supports the old Kconfig options as this is useful for testing the HMP scheduler on devices without big.LITTLE.
This patch is reuse of a patch by Jon Medhurst tixy@linaro.org with a few bits left out.
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com
arch/arm/Kconfig | 4 ++- arch/arm/kernel/topology.c | 69 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 72 insertions(+), 1 deletion(-)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index cb80846..f1271bc 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1588,13 +1588,15 @@ config HMP_FAST_CPU_MASK string "HMP scheduler fast CPU mask" depends on SCHED_HMP help
Specify the cpuids of the fast CPUs in the system as a list string,
Leave empty to use device tree information.
Specify the cpuids of the fast CPUs in the system as a list string, e.g. cpuid 0+1 should be specified as 0-1.
config HMP_SLOW_CPU_MASK string "HMP scheduler slow CPU mask" depends on SCHED_HMP help
Leave empty to use device tree information. Specify the cpuids of the slow CPUs in the system as a list string, e.g. cpuid 0+1 should be specified as 0-1.
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index 26c12c6..7682e12 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -317,6 +317,75 @@ void store_cpu_topology(unsigned int cpuid) cpu_topology[cpuid].socket_id, mpidr); }
+#ifdef CONFIG_SCHED_HMP
+static const char * const little_cores[] = {
+	"arm,cortex-a7",
+	NULL,
+};
+
+static bool is_little_cpu(struct device_node *cn)
+{
+	const char * const *lc;
+	for (lc = little_cores; *lc; lc++)
+		if (of_device_is_compatible(cn, *lc))
+			return true;
+	return false;
+}
+
+void __init arch_get_fast_and_slow_cpus(struct cpumask *fast,
+					struct cpumask *slow)
+{
+	struct device_node *cn = NULL;
+	int cpu = 0;
+
+	cpumask_clear(fast);
+	cpumask_clear(slow);
+
+	/*
+	 * Use the config options if they are given. This helps testing
+	 * HMP scheduling on systems without a big.LITTLE architecture.
+	 */
+	if (strlen(CONFIG_HMP_FAST_CPU_MASK) && strlen(CONFIG_HMP_SLOW_CPU_MASK)) {
+		if (cpulist_parse(CONFIG_HMP_FAST_CPU_MASK, fast))
+			WARN(1, "Failed to parse HMP fast cpu mask!\n");
+		if (cpulist_parse(CONFIG_HMP_SLOW_CPU_MASK, slow))
+			WARN(1, "Failed to parse HMP slow cpu mask!\n");
+		return;
+	}
+
+	/*
+	 * Else, parse device tree for little cores.
+	 */
+	while ((cn = of_find_node_by_type(cn, "cpu"))) {
+
+		if (cpu >= num_possible_cpus())
+			break;
+
+		if (is_little_cpu(cn))
+			cpumask_set_cpu(cpu, slow);
+		else
+			cpumask_set_cpu(cpu, fast);
+
+		cpu++;
+	}
+
+	if (!cpumask_empty(fast) && !cpumask_empty(slow))
+		return;
+
+	/*
+	 * We didn't find both big and little cores so let's call all cores
+	 * fast as this will keep the system running, with all cores being
+	 * treated equal.
+	 */
+	cpumask_setall(fast);
+	cpumask_clear(slow);
+}
+
+#endif /* CONFIG_SCHED_HMP */
+
+
 /*
  * init_cpu_topology is called at boot when only one cpu is running
  * which prevent simultaneous write access to cpu_topology array
-- 1.7.9.5
On Wed, 2012-10-10 at 12:04 +0100, Morten Rasmussen wrote:
Hi Tixy,
Could you have a look at my code stealing patch below? Since it is basically a trimmed version of one of your patches I would prefer to put you as author and have your SOB on it. What is your opinion?
Yes, I can agree with that opinion, (and my employer likes to count their patch totals ;-) so please feel free to add:
From: Jon Medhurst tixy@linaro.org Signed-off-by: Jon Medhurst tixy@linaro.org
Thanks
From: Morten Rasmussen morten.rasmussen@arm.com
SCHED_HMP requires the different cpu types to be represented by an ordered list of hmp_domains. Each hmp_domain represents all cpus of a particular type using a cpumask.
The list is platform specific and therefore must be generated by platform code by implementing arch_get_hmp_domains().
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com --- arch/arm/kernel/topology.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 7682e12..ec8ad5c 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -383,6 +383,28 @@ void __init arch_get_fast_and_slow_cpus(struct cpumask *fast,
 	cpumask_clear(slow);
 }
 
+void __init arch_get_hmp_domains(struct list_head *hmp_domains_list)
+{
+	struct cpumask hmp_fast_cpu_mask;
+	struct cpumask hmp_slow_cpu_mask;
+	struct hmp_domain *domain;
+
+	arch_get_fast_and_slow_cpus(&hmp_fast_cpu_mask, &hmp_slow_cpu_mask);
+
+	/*
+	 * Initialize hmp_domains
+	 * Must be ordered with respect to compute capacity.
+	 * Fastest domain at head of list.
+	 */
+	domain = (struct hmp_domain *)
+		kmalloc(sizeof(struct hmp_domain), GFP_KERNEL);
+	cpumask_copy(&domain->cpus, &hmp_slow_cpu_mask);
+	list_add(&domain->hmp_domains, hmp_domains_list);
+	domain = (struct hmp_domain *)
+		kmalloc(sizeof(struct hmp_domain), GFP_KERNEL);
+	cpumask_copy(&domain->cpus, &hmp_fast_cpu_mask);
+	list_add(&domain->hmp_domains, hmp_domains_list);
+}
 #endif /* CONFIG_SCHED_HMP */
On 22 September 2012 00:02, morten.rasmussen@arm.com wrote:
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
+void __init arch_get_hmp_domains(struct list_head *hmp_domains_list) +{
struct cpumask hmp_fast_cpu_mask;
struct cpumask hmp_slow_cpu_mask;
can be merged to single line.
struct hmp_domain *domain;
arch_get_fast_and_slow_cpus(&hmp_fast_cpu_mask, &hmp_slow_cpu_mask);
/*
* Initialize hmp_domains
* Must be ordered with respect to compute capacity.
* Fastest domain at head of list.
*/
domain = (struct hmp_domain *)
kmalloc(sizeof(struct hmp_domain), GFP_KERNEL);
should be:
domain = kmalloc(sizeof(*domain), GFP_KERNEL);
cpumask_copy(&domain->cpus, &hmp_slow_cpu_mask);
what if kmalloc failed?
list_add(&domain->hmp_domains, hmp_domains_list);
domain = (struct hmp_domain *)
kmalloc(sizeof(struct hmp_domain), GFP_KERNEL);
would be better to kmalloc only once with size 2* sizeof(*domain)
cpumask_copy(&domain->cpus, &hmp_fast_cpu_mask);
list_add(&domain->hmp_domains, hmp_domains_list);
Also would be better to create a macro for above two lines to remove code redundancy.
On Thu, Oct 04, 2012 at 07:58:45AM +0100, Viresh Kumar wrote:
On 22 September 2012 00:02, morten.rasmussen@arm.com wrote:
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
+void __init arch_get_hmp_domains(struct list_head *hmp_domains_list) +{
struct cpumask hmp_fast_cpu_mask;
struct cpumask hmp_slow_cpu_mask;
can be merged to single line.
struct hmp_domain *domain;
arch_get_fast_and_slow_cpus(&hmp_fast_cpu_mask, &hmp_slow_cpu_mask);
/*
* Initialize hmp_domains
* Must be ordered with respect to compute capacity.
* Fastest domain at head of list.
*/
domain = (struct hmp_domain *)
kmalloc(sizeof(struct hmp_domain), GFP_KERNEL);
should be:
domain = kmalloc(sizeof(*domain), GFP_KERNEL);
cpumask_copy(&domain->cpus, &hmp_slow_cpu_mask);
what if kmalloc failed?
list_add(&domain->hmp_domains, hmp_domains_list);
domain = (struct hmp_domain *)
kmalloc(sizeof(struct hmp_domain), GFP_KERNEL);
would be better to kmalloc only once with size 2* sizeof(*domain)
cpumask_copy(&domain->cpus, &hmp_fast_cpu_mask);
list_add(&domain->hmp_domains, hmp_domains_list);
Also would be better to create a macro for above two lines to remove code redundancy.
Agree on all of the above.
Thanks, Morten
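For reference, the review comments agreed on above (sizeof(*domain) without the cast, a single allocation for both domains, a check for allocation failure, and a helper for the repeated copy-and-add step) could be combined roughly as follows. This is a userspace sketch under stated assumptions, not the follow-up patch: calloc() stands in for kmalloc(), mask_t for struct cpumask, the list helpers are minimal re-implementations, hmp_domain_add() is a hypothetical static function used in place of the suggested macro, and the int return value is an assumption (the posted function returns void).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Minimal userspace stand-ins for the kernel types (assumptions). */
struct list_head { struct list_head *next, *prev; };
typedef uint32_t mask_t;

struct hmp_domain {
	mask_t cpus;
	struct list_head hmp_domains;
};

static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

/* Insert right after head, which is what the kernel's list_add() does. */
static void list_add_head(struct list_head *n, struct list_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Recover the hmp_domain that embeds a given list node. */
static struct hmp_domain *domain_of(struct list_head *n)
{
	return (struct hmp_domain *)((char *)n -
				     offsetof(struct hmp_domain, hmp_domains));
}

/* Hypothetical helper replacing the duplicated copy + list_add pair. */
static void hmp_domain_add(struct hmp_domain *d, mask_t cpus,
			   struct list_head *list)
{
	d->cpus = cpus;
	list_add_head(&d->hmp_domains, list);
}

/* Returns 0 on success, -1 if the allocation failed (list untouched). */
static int get_hmp_domains(struct list_head *list, mask_t fast, mask_t slow)
{
	struct hmp_domain *domain;

	/* One allocation for both domains, sized from the pointer type. */
	domain = calloc(2, sizeof(*domain));
	if (!domain)
		return -1;

	/* Slow first, then fast, so the fastest ends up at the head. */
	hmp_domain_add(&domain[0], slow, list);
	hmp_domain_add(&domain[1], fast, list);
	return 0;
}
```

Because each new entry is inserted at the head, adding the slow domain before the fast one keeps the "fastest domain at head of list" ordering that the original comment requires.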
From: Morten Rasmussen morten.rasmussen@arm.com
Adds ftrace events for key variables related to the entity load-tracking to help debugging scheduler behaviour. Allows tracing of load contribution and runqueue residency ratio for both entities and runqueues as well as entity CPU usage ratio.
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com --- include/trace/events/sched.h | 125 ++++++++++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 7 +++ 2 files changed, 132 insertions(+)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 5a8671e..847eb76 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -430,6 +430,131 @@ TRACE_EVENT(sched_pi_setprio, __entry->oldprio, __entry->newprio) );
+/* + * Tracepoint for showing tracked load contribution. + */ +TRACE_EVENT(sched_task_load_contrib, + + TP_PROTO(struct task_struct *tsk, unsigned long load_contrib), + + TP_ARGS(tsk, load_contrib), + + TP_STRUCT__entry( + __array(char, comm, TASK_COMM_LEN) + __field(pid_t, pid) + __field(unsigned long, load_contrib) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->pid = tsk->pid; + __entry->load_contrib = load_contrib; + ), + + TP_printk("comm=%s pid=%d load_contrib=%lu", + __entry->comm, __entry->pid, + __entry->load_contrib) +); + +/* + * Tracepoint for showing tracked task runnable ratio [0..1023]. + */ +TRACE_EVENT(sched_task_runnable_ratio, + + TP_PROTO(struct task_struct *tsk, unsigned long ratio), + + TP_ARGS(tsk, ratio), + + TP_STRUCT__entry( + __array(char, comm, TASK_COMM_LEN) + __field(pid_t, pid) + __field(unsigned long, ratio) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->pid = tsk->pid; + __entry->ratio = ratio; + ), + + TP_printk("comm=%s pid=%d ratio=%lu", + __entry->comm, __entry->pid, + __entry->ratio) +); + +/* + * Tracepoint for showing tracked rq runnable ratio [0..1023]. + */ +TRACE_EVENT(sched_rq_runnable_ratio, + + TP_PROTO(int cpu, unsigned long ratio), + + TP_ARGS(cpu, ratio), + + TP_STRUCT__entry( + __field(int, cpu) + __field(unsigned long, ratio) + ), + + TP_fast_assign( + __entry->cpu = cpu; + __entry->ratio = ratio; + ), + + TP_printk("cpu=%d ratio=%lu", + __entry->cpu, + __entry->ratio) +); + +/* + * Tracepoint for showing tracked rq runnable load. + */ +TRACE_EVENT(sched_rq_runnable_load, + + TP_PROTO(int cpu, u64 load), + + TP_ARGS(cpu, load), + + TP_STRUCT__entry( + __field(int, cpu) + __field(u64, load) + ), + + TP_fast_assign( + __entry->cpu = cpu; + __entry->load = load; + ), + + TP_printk("cpu=%d load=%llu", + __entry->cpu, + __entry->load) +); + +/* + * Tracepoint for showing tracked task cpu usage ratio [0..1023]. 
+ */ +TRACE_EVENT(sched_task_usage_ratio, + + TP_PROTO(struct task_struct *tsk, unsigned long ratio), + + TP_ARGS(tsk, ratio), + + TP_STRUCT__entry( + __array(char, comm, TASK_COMM_LEN) + __field(pid_t, pid) + __field(unsigned long, ratio) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->pid = tsk->pid; + __entry->ratio = ratio; + ), + + TP_printk("comm=%s pid=%d ratio=%lu", + __entry->comm, __entry->pid, + __entry->ratio) +); #endif /* _TRACE_SCHED_H */
 /* This part must be outside protection */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f0f3b9..0be53be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1192,9 +1192,11 @@ static inline void __update_task_entity_contrib(struct sched_entity *se)
 	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
 	contrib /= (se->avg.runnable_avg_period + 1);
 	se->avg.load_avg_contrib = scale_load(contrib);
+	trace_sched_task_load_contrib(task_of(se), se->avg.load_avg_contrib);
 	contrib = se->avg.runnable_avg_sum * scale_load_down(NICE_0_LOAD);
 	contrib /= (se->avg.runnable_avg_period + 1);
 	se->avg.load_avg_ratio = scale_load(contrib);
+	trace_sched_task_runnable_ratio(task_of(se), se->avg.load_avg_ratio);
 }
 
 /* Compute the current contribution to load_avg by se, return any delta */
@@ -1286,9 +1288,14 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
+	u32 contrib;
 	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable,
 				     runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
+	contrib = rq->avg.runnable_avg_sum * scale_load_down(1024);
+	contrib /= (rq->avg.runnable_avg_period + 1);
+	trace_sched_rq_runnable_ratio(cpu_of(rq), scale_load(contrib));
+	trace_sched_rq_runnable_load(cpu_of(rq), rq->cfs.runnable_load_avg);
 }
 
 /* Add the load generated by se into cfs_rq's child load-average */
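The traced ratios all follow the same arithmetic: the runnable sum is scaled by 1024 and divided by the period plus one. A small userspace sketch of that calculation is given below. It assumes scale_load() and scale_load_down() are identity operations (no extra load resolution), and runnable_ratio() is a hypothetical name, not a function from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * ratio = sum * 1024 / (period + 1)
 *
 * Assumes sum <= period, as a runnable sum cannot exceed the period it
 * was accumulated over. Under that assumption the result is at most
 * 1023, which matches the [0..1023] range named in the tracepoint
 * comments. The "+ 1" also avoids a division by zero for a new entity.
 */
static uint32_t runnable_ratio(uint32_t sum, uint32_t period)
{
	uint64_t contrib;

	assert(sum <= period);
	contrib = (uint64_t)sum * 1024;
	contrib /= (uint64_t)period + 1;
	return (uint32_t)contrib;
}
```

For example, an entity that was runnable for 2048 out of 4095 time units gives 2048 * 1024 / 4096 = 512, i.e. roughly half of the full-scale value, and a freshly forked entity with sum = period = 0 gives 0.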
From: Morten Rasmussen morten.rasmussen@arm.com
Adds ftrace event for tracing task migrations using HMP optimized scheduling.
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com --- include/trace/events/sched.h | 28 ++++++++++++++++++++++++++++ kernel/sched/fair.c | 15 +++++++++++---- 2 files changed, 39 insertions(+), 4 deletions(-)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 847eb76..501aa32 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -555,6 +555,34 @@ TRACE_EVENT(sched_task_usage_ratio, __entry->comm, __entry->pid, __entry->ratio) ); + +/* + * Tracepoint for HMP (CONFIG_SCHED_HMP) task migrations. + */ +TRACE_EVENT(sched_hmp_migrate, + + TP_PROTO(struct task_struct *tsk, int dest, int force), + + TP_ARGS(tsk, dest, force), + + TP_STRUCT__entry( + __array(char, comm, TASK_COMM_LEN) + __field(pid_t, pid) + __field(int, dest) + __field(int, force) + ), + + TP_fast_assign( + memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); + __entry->pid = tsk->pid; + __entry->dest = dest; + __entry->force = force; + ), + + TP_printk("comm=%s pid=%d dest=%d force=%d", + __entry->comm, __entry->pid, + __entry->dest, __entry->force) +); #endif /* _TRACE_SCHED_H */
/* This part must be outside protection */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0be53be..811b2b9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3333,10 +3333,16 @@ unlock: rcu_read_unlock();
#ifdef CONFIG_SCHED_HMP - if (hmp_up_migration(prev_cpu, &p->se)) - return hmp_select_faster_cpu(p, prev_cpu); - if (hmp_down_migration(prev_cpu, &p->se)) - return hmp_select_slower_cpu(p, prev_cpu); + if (hmp_up_migration(prev_cpu, &p->se)) { + new_cpu = hmp_select_faster_cpu(p, prev_cpu); + trace_sched_hmp_migrate(p, new_cpu, 0); + return new_cpu; + } + if (hmp_down_migration(prev_cpu, &p->se)) { + new_cpu = hmp_select_slower_cpu(p, prev_cpu); + trace_sched_hmp_migrate(p, new_cpu, 0); + return new_cpu; + } /* Make sure that the task stays in its previous hmp domain */ if (!cpumask_test_cpu(new_cpu, &hmp_cpu_domain(prev_cpu)->cpus)) return prev_cpu; @@ -5718,6 +5724,7 @@ static void hmp_force_up_migration(int this_cpu) target->push_cpu = hmp_select_faster_cpu(p, cpu); target->migrate_task = p; force = 1; + trace_sched_hmp_migrate(p, target->push_cpu, 1); } } raw_spin_unlock_irqrestore(&target->lock, flags);
From: Morten Rasmussen morten.rasmussen@arm.com
We need a way to prevent tasks that are migrating up and down the hmp_domains from migrating straight on through before the load has adapted to the new compute capacity of the CPU on the new hmp_domain. This patch adds a next up/down migration delay that prevents the task from doing another migration in the same direction until the delay has expired.
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com --- include/linux/sched.h | 4 ++++ kernel/sched/core.c | 4 ++++ kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 46 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h index df971a3..ca3890a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1158,6 +1158,10 @@ struct sched_avg { s64 decay_count; unsigned long load_avg_contrib; unsigned long load_avg_ratio; +#ifdef CONFIG_SCHED_HMP + u64 hmp_last_up_migration; + u64 hmp_last_down_migration; +#endif u32 usage_avg_sum; };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 652b86b..a3b1ff6 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1723,6 +1723,10 @@ static void __sched_fork(struct task_struct *p) #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED) p->se.avg.runnable_avg_period = 0; p->se.avg.runnable_avg_sum = 0; +#ifdef CONFIG_SCHED_HMP + p->se.avg.hmp_last_up_migration = 0; + p->se.avg.hmp_last_down_migration = 0; +#endif #endif #ifdef CONFIG_SCHEDSTATS memset(&p->se.statistics, 0, sizeof(p->se.statistics)); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 811b2b9..56cbda1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3138,10 +3138,14 @@ static int __init hmp_cpu_mask_setup(void) * tweaking suit particular needs. * * hmp_up_prio: Only up migrate task with high priority (<hmp_up_prio) + * hmp_next_up_threshold: Delay before next up migration (1024 ~= 1 ms) + * hmp_next_down_threshold: Delay before next down migration (1024 ~= 1 ms) */ unsigned int hmp_up_threshold = 512; unsigned int hmp_down_threshold = 256; unsigned int hmp_up_prio = NICE_TO_PRIO(CONFIG_SCHED_HMP_PRIO_FILTER_VAL); +unsigned int hmp_next_up_threshold = 4096; +unsigned int hmp_next_down_threshold = 4096;
static unsigned int hmp_up_migration(int cpu, struct sched_entity *se); static unsigned int hmp_down_migration(int cpu, struct sched_entity *se); @@ -3204,6 +3208,21 @@ static inline unsigned int hmp_select_slower_cpu(struct task_struct *tsk, tsk_cpus_allowed(tsk)); }
+static inline void hmp_next_up_delay(struct sched_entity *se, int cpu) +{ + struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs; + + se->avg.hmp_last_up_migration = cfs_rq_clock_task(cfs_rq); + se->avg.hmp_last_down_migration = 0; +} + +static inline void hmp_next_down_delay(struct sched_entity *se, int cpu) +{ + struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs; + + se->avg.hmp_last_down_migration = cfs_rq_clock_task(cfs_rq); + se->avg.hmp_last_up_migration = 0; +} #endif /* CONFIG_SCHED_HMP */
/* @@ -3335,11 +3354,13 @@ unlock: #ifdef CONFIG_SCHED_HMP if (hmp_up_migration(prev_cpu, &p->se)) { new_cpu = hmp_select_faster_cpu(p, prev_cpu); + hmp_next_up_delay(&p->se, new_cpu); trace_sched_hmp_migrate(p, new_cpu, 0); return new_cpu; } if (hmp_down_migration(prev_cpu, &p->se)) { new_cpu = hmp_select_slower_cpu(p, prev_cpu); + hmp_next_down_delay(&p->se, new_cpu); trace_sched_hmp_migrate(p, new_cpu, 0); return new_cpu; } @@ -5503,6 +5524,8 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle) { } static unsigned int hmp_up_migration(int cpu, struct sched_entity *se) { struct task_struct *p = task_of(se); + struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs; + u64 now;
if (hmp_cpu_is_fastest(cpu)) return 0; @@ -5513,6 +5536,12 @@ static unsigned int hmp_up_migration(int cpu, struct sched_entity *se) return 0; #endif
+ /* Let the task load settle before doing another up migration */ + now = cfs_rq_clock_task(cfs_rq); + if (((now - se->avg.hmp_last_up_migration) >> 10) + < hmp_next_up_threshold) + return 0; + if (cpumask_intersects(&hmp_faster_domain(cpu)->cpus, tsk_cpus_allowed(p)) && se->avg.load_avg_ratio > hmp_up_threshold) { @@ -5525,6 +5554,8 @@ static unsigned int hmp_up_migration(int cpu, struct sched_entity *se) static unsigned int hmp_down_migration(int cpu, struct sched_entity *se) { struct task_struct *p = task_of(se); + struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs; + u64 now;
if (hmp_cpu_is_slowest(cpu)) return 0; @@ -5535,6 +5566,12 @@ static unsigned int hmp_down_migration(int cpu, struct sched_entity *se) return 1; #endif
+ /* Let the task load settle before doing another down migration */ + now = cfs_rq_clock_task(cfs_rq); + if (((now - se->avg.hmp_last_down_migration) >> 10) + < hmp_next_down_threshold) + return 0; + if (cpumask_intersects(&hmp_slower_domain(cpu)->cpus, tsk_cpus_allowed(p)) && se->avg.load_avg_ratio < hmp_down_threshold) { @@ -5725,6 +5762,7 @@ static void hmp_force_up_migration(int this_cpu) target->migrate_task = p; force = 1; trace_sched_hmp_migrate(p, target->push_cpu, 1); + hmp_next_up_delay(&p->se, target->push_cpu); } } raw_spin_unlock_irqrestore(&target->lock, flags);
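The interaction between the load thresholds and the new migration delays can be modelled in a few lines of userspace C. This is a sketch only: struct hmp_task and the helper names are hypothetical, timestamps are assumed to be in nanoseconds as returned by the task clock, and the cpumask and priority checks of the real functions are left out. The default values are the ones from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Defaults taken from the patch. */
static const unsigned int hmp_up_threshold = 512;
static const unsigned int hmp_down_threshold = 256;
static const unsigned int hmp_next_up_threshold = 4096;   /* 1024 ~= 1 ms */
static const unsigned int hmp_next_down_threshold = 4096; /* 1024 ~= 1 ms */

/* Hypothetical stand-in for the relevant sched_avg fields. */
struct hmp_task {
	uint64_t last_up_migration;   /* ns, 0 means no delay pending */
	uint64_t last_down_migration; /* ns, 0 means no delay pending */
	unsigned long load_avg_ratio; /* 0..1023 */
};

/* ns >> 10 gives units of 1024 ns, so 4096 units are about 4.2 ms. */
static bool delay_expired(uint64_t now, uint64_t last, unsigned int threshold)
{
	return ((now - last) >> 10) >= threshold;
}

static bool may_migrate_up(const struct hmp_task *t, uint64_t now)
{
	if (!delay_expired(now, t->last_up_migration, hmp_next_up_threshold))
		return false;
	return t->load_avg_ratio > hmp_up_threshold;
}

static bool may_migrate_down(const struct hmp_task *t, uint64_t now)
{
	if (!delay_expired(now, t->last_down_migration,
			   hmp_next_down_threshold))
		return false;
	return t->load_avg_ratio < hmp_down_threshold;
}

/* Mirrors hmp_next_up_delay(): arm the up delay, clear the down delay. */
static void note_up_migration(struct hmp_task *t, uint64_t now)
{
	t->last_up_migration = now;
	t->last_down_migration = 0;
}

/* Mirrors hmp_next_down_delay(): arm the down delay, clear the up delay. */
static void note_down_migration(struct hmp_task *t, uint64_t now)
{
	t->last_down_migration = now;
	t->last_up_migration = 0;
}
```

Note that each helper clears the timestamp for the opposite direction, so in this model the delay only blocks a second migration in the same direction; a task that was just moved up can still be moved down again immediately if its load ratio falls below the down threshold.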