This series demonstrates cpu frequency scaling via a simple policy driven by the scheduler. Specifically, the policy evaluates cpu frequency when cpu utilization is updated from enqueue_task_fair and dequeue_task_fair. The policy itself uses a simple up/down threshold scheme using the same 80%/20% cpu utilization boundaries that are used by default in the ondemand cpufreq governor.
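For reference, the decision reduces to something like the following sketch (all names here are illustrative, not the actual code; the real implementation is in the energy_model patch at the end of the series):

/*
 * Minimal sketch of the 80%/20% threshold scheme described above.
 * Frequencies are in kHz, as in cpufreq. util and cap are on the
 * same scale (e.g. out of SCHED_CAPACITY_SCALE).
 */
#define UP_THRESHOLD	80
#define DOWN_THRESHOLD	20

static unsigned int eval_freq(unsigned long util, unsigned long cap,
			      unsigned int cur_freq, unsigned int max_freq)
{
	if (util * 100 > cap * UP_THRESHOLD)
		return max_freq;	/* ramp straight to the maximum */
	if (util * 100 < cap * DOWN_THRESHOLD)
		return cur_freq - 1;	/* next lower OPP via CPUFREQ_RELATION_H */
	return cur_freq;		/* utilization within bounds, hold */
}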
This series is not intended for merging, but instead to ignite some discussion around scheduler-driven cpu frequency selection. Of particular interest to me is the policy itself and how it might integrate with task placement in CFS's load_balance. Additionally I'd like to ask the scheduler experts about which call sites in CFS are right for evaluating cpu frequency selection; maybe {en,de}queue_task_fair are not such a good idea?
The messiest part of this series is the cpumask stuff, where I tried to track which cpus have updated statistics in the case of a sched_entity which contains several other sched_entities that are spread across cpus. As discussed at Linux Plumbers Conference 2014, I will replace this complexity with simpler logic that ignores scheduler cgroups in the next version. In any case I am posting the code I have now.
This code is experimental and bugs are guaranteed.
These patches are based on the scale invariance series from Morten[0]. The variable names in this RFC will doubtless change once that work is rebased onto Vincent's series[1].
[0] http://lkml.kernel.org/r/1411403047-32010-1-git-send-email-morten.rasmussen@arm.com
[1] http://lkml.kernel.org/r/1412684017-16595-1-git-send-email-vincent.guittot@linaro.org
Mike Turquette (6):
  sched: cfs: declare capacity_of in sched.h
  sched: fair: add usage_util_of helper
  cpufreq: add per-governor private data
  sched: cfs: cpu frequency scaling arch functions
  sched: cfs: cpu frequency scaling based on task placement
  sched: energy_model: simple cpu frequency scaling policy

Morten Rasmussen (1):
  sched: Make energy awareness a sched feature

 drivers/cpufreq/Kconfig     |  21 +++
 include/linux/cpufreq.h     |   6 +
 kernel/sched/Makefile       |   1 +
 kernel/sched/energy_model.c | 341 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c         |  69 ++++++++-
 kernel/sched/features.h     |   6 +
 kernel/sched/sched.h        |   3 +
 7 files changed, 445 insertions(+), 2 deletions(-)
 create mode 100644 kernel/sched/energy_model.c
From: Morten Rasmussen <morten.rasmussen@arm.com>
This patch introduces the ENERGY_AWARE sched feature, which is implemented using jump labels when SCHED_DEBUG is defined. It is statically set to false when SCHED_DEBUG is not defined, so energy awareness cannot be enabled without SCHED_DEBUG. This sched_feature knob will be replaced later with a more appropriate control knob when things have matured a bit.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Mike Turquette <mturquette@linaro.org>
[mturquette@linaro.org: moved energy_aware above enqueue_task_fair]
---
 kernel/sched/fair.c     | 5 +++++
 kernel/sched/features.h | 6 ++++++
 2 files changed, 11 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6738160..90b36cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3978,6 +3978,11 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+static inline bool energy_aware(void)
+{
+	return sched_feat(ENERGY_AWARE);
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 90284d1..199ee3a 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,3 +83,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
  */
 SCHED_FEAT(NUMA_RESIST_LOWER, false)
 #endif
+
+/*
+ * Energy aware scheduling. Use platform energy model to guide scheduling
+ * decisions optimizing for energy efficiency.
+ */
+SCHED_FEAT(ENERGY_AWARE, false)
capacity_of is useful for cpu frequency scaling policies. Share it via sched.h so that selectable cpu frequency scaling policies can make use of it.
Signed-off-by: Mike Turquette <mturquette@linaro.org>
---
 kernel/sched/fair.c  | 7 +++++--
 kernel/sched/sched.h | 2 ++
 2 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 90b36cc..15f5638 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1018,7 +1018,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 static unsigned long weighted_cpuload(const int cpu);
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
-static unsigned long capacity_of(int cpu);
 static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
 
 /* Cached statistics for all CPUs within a node */
@@ -2056,6 +2055,10 @@ static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_SMP
+unsigned long capacity_of(int cpu);
+#endif /* CONFIG_SMP */
+
 static void
 account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -4132,7 +4135,7 @@ static unsigned long target_load(int cpu, int type)
 	return max(rq->cpu_load[type-1], total);
 }
 
-static unsigned long capacity_of(int cpu)
+unsigned long capacity_of(int cpu)
 {
 	return cpu_rq(cpu)->cpu_capacity;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 04940f8..9a28d38 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -309,6 +309,8 @@ struct cfs_bandwidth {
 };
 
 #endif /* CONFIG_CGROUP_SCHED */
 
+extern unsigned long capacity_of(int cpu);
+
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight load;
Signed-off-by: Mike Turquette <mturquette@linaro.org>
---
 kernel/sched/fair.c  | 6 ++++++
 kernel/sched/sched.h | 1 +
 2 files changed, 7 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 15f5638..0930ad8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2057,6 +2057,7 @@ static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
 
 #ifdef CONFIG_SMP
 unsigned long capacity_of(int cpu);
+unsigned long usage_util_of(int cpu);
 #endif /* CONFIG_SMP */
 
 static void
@@ -4140,6 +4141,11 @@ unsigned long capacity_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity;
 }
 
+unsigned long usage_util_of(int cpu)
+{
+	return cpu_rq(cpu)->cfs.usage_util_avg;
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9a28d38..c34cbfc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -310,6 +310,7 @@ struct cfs_bandwidth {
 };
 #endif /* CONFIG_CGROUP_SCHED */
 
 extern unsigned long capacity_of(int cpu);
+extern unsigned long usage_util_of(int cpu);
 
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: Mike Turquette <mturquette@linaro.org>
---
 include/linux/cpufreq.h | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 138336b..91d173c 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -115,6 +115,9 @@ struct cpufreq_policy {
 
 	/* For cpufreq driver's internal use */
 	void *driver_data;
+
+	/* For cpufreq governor's internal use */
+	void *gov_data;
 };
 
 /* Only for ACPI */
On 22 October 2014 11:37, Mike Turquette <mturquette@linaro.org> wrote:
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: Mike Turquette <mturquette@linaro.org>

 include/linux/cpufreq.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 138336b..91d173c 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -115,6 +115,9 @@ struct cpufreq_policy {
 
 	/* For cpufreq driver's internal use */
 	void *driver_data;
+
+	/* For cpufreq governor's internal use */
+	void *gov_data;
It's already there: governor_data.
Am I missing something ?
On Tue, Oct 21, 2014 at 11:26 PM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
On 22 October 2014 11:37, Mike Turquette <mturquette@linaro.org> wrote:
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: Mike Turquette <mturquette@linaro.org>

 include/linux/cpufreq.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 138336b..91d173c 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -115,6 +115,9 @@ struct cpufreq_policy {
 
 	/* For cpufreq driver's internal use */
 	void *driver_data;
+
+	/* For cpufreq governor's internal use */
+	void *gov_data;
It's already there: governor_data.
Am I missing something ?
Oops. That's what I get for hacking while jetlagged. Please disregard the noise.
Regards, Mike
arch_eval_cpu_freq and arch_scale_cpu_freq are added to allow the scheduler to evaluate if cpu frequency should change and to invoke that change from a safe context.
They are weakly defined arch functions that do nothing by default. A CPUfreq governor could use these functions to implement a frequency scaling policy based on updates to per-task statistics or updates to per-cpu utilization.
As discussed at Linux Plumbers Conference 2014, the goal will be to focus on a single cpu frequency scaling policy that works for everyone. That may mean that the weak arch functions definitions can be removed entirely and a single policy implements that logic for all architectures.
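For anyone not familiar with the mechanism, the weak/strong override works roughly like this (sketch only, the real hooks are in the diff below):

/* kernel/sched/fair.c: no-op defaults, used when nothing overrides them */
void __weak arch_eval_cpu_freq(struct cpumask *cpus) { }
void __weak arch_scale_cpu_freq(void) { }

/*
 * A governor then provides strong definitions which the linker picks
 * over the weak ones, e.g.:
 */
void arch_eval_cpu_freq(struct cpumask *cpus)
{
	/* policy-specific evaluation of the cpus in @cpus goes here */
}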
Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
---
 kernel/sched/fair.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0930ad8..1af6f6d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2265,6 +2265,8 @@ static u32 __compute_runnable_contrib(u64 n)
 }
 
 unsigned long arch_scale_load_capacity(int cpu);
+void arch_eval_cpu_freq(struct cpumask *cpus);
+void arch_scale_cpu_freq(void);
 
 /*
  * We can represent the historical contribution to runnable average as the
@@ -5805,6 +5807,16 @@ unsigned long __weak arch_scale_load_capacity(int cpu)
 	return default_scale_load_capacity(cpu);
 }
 
+void __weak arch_eval_cpu_freq(struct cpumask *cpus)
+{
+	return;
+}
+
+void __weak arch_scale_cpu_freq(void)
+{
+	return;
+}
+
 static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
On 10/22/2014 02:07 AM, Mike Turquette wrote:
arch_eval_cpu_freq and arch_scale_cpu_freq are added to allow the scheduler to evaluate if cpu frequency should change and to invoke that change from a safe context.
They are weakly defined arch functions that do nothing by default. A CPUfreq governor could use these functions to implement a frequency scaling policy based on updates to per-task statistics or updates to per-cpu utilization.
As discussed at Linux Plumbers Conference 2014, the goal will be to focus on a single cpu frequency scaling policy that works for everyone. That may mean that the weak arch functions definitions can be removed entirely and a single policy implements that logic for all architectures.
On virtual machines, we probably want to use both frequency and steal time to calculate the factor.
--
All rights reversed
On Wed, Oct 22, 2014 at 1:06 PM, Rik van Riel <riel@redhat.com> wrote:
On 10/22/2014 02:07 AM, Mike Turquette wrote:
arch_eval_cpu_freq and arch_scale_cpu_freq are added to allow the scheduler to evaluate if cpu frequency should change and to invoke that change from a safe context.
They are weakly defined arch functions that do nothing by default. A CPUfreq governor could use these functions to implement a frequency scaling policy based on updates to per-task statistics or updates to per-cpu utilization.
As discussed at Linux Plumbers Conference 2014, the goal will be to focus on a single cpu frequency scaling policy that works for everyone. That may mean that the weak arch functions definitions can be removed entirely and a single policy implements that logic for all architectures.
On virtual machines, we probably want to use both frequency and steal time to calculate the factor.
You mean for calculating desired cpu frequency on a virtual guest? Is that something we want to do?
Thanks, Mike
On 10/22/2014 07:20 PM, Mike Turquette wrote:
On Wed, Oct 22, 2014 at 1:06 PM, Rik van Riel <riel@redhat.com> wrote:
On 10/22/2014 02:07 AM, Mike Turquette wrote:
arch_eval_cpu_freq and arch_scale_cpu_freq are added to allow the scheduler to evaluate if cpu frequency should change and to invoke that change from a safe context.
They are weakly defined arch functions that do nothing by default. A CPUfreq governor could use these functions to implement a frequency scaling policy based on updates to per-task statistics or updates to per-cpu utilization.
As discussed at Linux Plumbers Conference 2014, the goal will be to focus on a single cpu frequency scaling policy that works for everyone. That may mean that the weak arch functions definitions can be removed entirely and a single policy implements that logic for all architectures.
On virtual machines, we probably want to use both frequency and steal time to calculate the factor.
You mean for calculating desired cpu frequency on a virtual guest? Is that something we want to do?
A guest will be unable to set the cpu frequency, but it should know what the frequency is, so it can take the capacity of each CPU into account when doing things like load balancing.
This has little impact on this patch series, the impact is more in the load balancer, which can see how much compute capacity is available on each CPU, and adjust the load accordingly.
I have seen some code come by that adjusts each cpu's compute_capacity, but do not remember whether it looks at cpu frequency, and am pretty sure it does not look at steal time currently :)
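Something like the following, perhaps (purely illustrative; steal_fraction() is a made-up helper that would have to be derived from the paravirt steal clock):

/*
 * Illustrative sketch: discount a guest cpu's capacity by the fraction
 * of time stolen by the hypervisor, on top of whatever frequency
 * scaling is already applied. steal_fraction() is hypothetical and
 * returns a value in [0..SCHED_CAPACITY_SCALE].
 */
static unsigned long guest_capacity(int cpu)
{
	unsigned long cap = arch_scale_load_capacity(cpu); /* freq-scaled */
	unsigned long stolen = steal_fraction(cpu);        /* hypothetical */

	return cap * (SCHED_CAPACITY_SCALE - stolen) / SCHED_CAPACITY_SCALE;
}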
--
All rights reversed
On Wed, 2014-10-22 at 21:42 -0400, Rik van Riel wrote:
On 10/22/2014 07:20 PM, Mike Turquette wrote:
On Wed, Oct 22, 2014 at 1:06 PM, Rik van Riel <riel@redhat.com> wrote:
On 10/22/2014 02:07 AM, Mike Turquette wrote:
arch_eval_cpu_freq and arch_scale_cpu_freq are added to allow the scheduler to evaluate if cpu frequency should change and to invoke that change from a safe context.
They are weakly defined arch functions that do nothing by default. A CPUfreq governor could use these functions to implement a frequency scaling policy based on updates to per-task statistics or updates to per-cpu utilization.
As discussed at Linux Plumbers Conference 2014, the goal will be to focus on a single cpu frequency scaling policy that works for everyone. That may mean that the weak arch functions definitions can be removed entirely and a single policy implements that logic for all architectures.
On virtual machines, we probably want to use both frequency and steal time to calculate the factor.
You mean for calculating desired cpu frequency on a virtual guest? Is that something we want to do?
A guest will be unable to set the cpu frequency, but it should know what the frequency is, so it can take the capacity of each CPU into account when doing things like load balancing.
Hm. Why does using vaporite freq/capacity/whatever make any sense, the silicon under the V(aporite)PU can/does change at the drop of a hat, no?
-Mike
On 10/22/2014 10:12 PM, Mike Galbraith wrote:
On Wed, 2014-10-22 at 21:42 -0400, Rik van Riel wrote:
On 10/22/2014 07:20 PM, Mike Turquette wrote:
On Wed, Oct 22, 2014 at 1:06 PM, Rik van Riel <riel@redhat.com> wrote:
On 10/22/2014 02:07 AM, Mike Turquette wrote:
arch_eval_cpu_freq and arch_scale_cpu_freq are added to allow the scheduler to evaluate if cpu frequency should change and to invoke that change from a safe context.
They are weakly defined arch functions that do nothing by default. A CPUfreq governor could use these functions to implement a frequency scaling policy based on updates to per-task statistics or updates to per-cpu utilization.
As discussed at Linux Plumbers Conference 2014, the goal will be to focus on a single cpu frequency scaling policy that works for everyone. That may mean that the weak arch functions definitions can be removed entirely and a single policy implements that logic for all architectures.
On virtual machines, we probably want to use both frequency and steal time to calculate the factor.
You mean for calculating desired cpu frequency on a virtual guest? Is that something we want to do?
A guest will be unable to set the cpu frequency, but it should know what the frequency is, so it can take the capacity of each CPU into account when doing things like load balancing.
Hm. Why does using vaporite freq/capacity/whatever make any sense, the silicon under the V(aporite)PU can/does change at the drop of a hat, no?
It can, but IIRC that should cause the kvmclock data for that VCPU to be regenerated, and the VCPU should be able to use that to figure out that the frequency changed the next time it runs the scheduler code on that VCPU.
--
All rights reversed
{en,de}queue_task_fair are updated to track which cpus will have changed utilization values as a function of task queueing. The affected cpus are passed on to arch_eval_cpu_freq for further machine-specific processing based on a selectable policy.
arch_scale_cpu_freq is called from run_rebalance_domains as a way to kick off the scaling process (via wake_up_process), so as to prevent re-entering the {en,de}queue code.
All of the call sites in this patch are up for discussion. Does it make sense to track which cpus have updated statistics in enqueue_task_fair? I chose this because I wanted to gather statistics for all cpus affected in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14, the next version of this patch will focus on the simpler case of not using scheduler cgroups, which should remove a good chunk of this code, including the cpumask stuff.
Also discussed at LPC14 is the fact that load_balance is a very interesting place to do this, as frequency can be considered in concert with task placement. Please put forth any ideas on a sensible way to do this.
Is run_rebalance_domains a logical place to change cpu frequency? What other call sites make sense?
Even for platforms that can target a cpu frequency without sleeping (x86, some ARM platforms with PM microcontrollers) it is currently necessary to always kick the frequency target work out into a kthread. This is because of the rw_sem usage in the cpufreq core which might sleep. Replacing that lock type is probably a good idea.
Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
---
 kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1af6f6d..3619f63 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3999,6 +3999,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	struct cpumask update_cpus;
+
+	cpumask_clear(&update_cpus);
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
@@ -4028,12 +4031,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		/* track cpus that need to be re-evaluated */
+		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
 	}
 
+	/* !CONFIG_FAIR_GROUP_SCHED */
 	if (!se) {
 		update_rq_runnable_avg(rq, rq->nr_running);
 		add_nr_running(rq, 1);
+
+		/*
+		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
+		 * typedef update_cpus into an int and skip all of the cpumask
+		 * stuff
+		 */
+		cpumask_set_cpu(cpu_of(rq), &update_cpus);
 	}
+
+	if (energy_aware())
+		if (!cpumask_empty(&update_cpus))
+			arch_eval_cpu_freq(&update_cpus);
+
 	hrtick_update(rq);
 }
@@ -4049,6 +4067,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
 	int task_sleep = flags & DEQUEUE_SLEEP;
+	struct cpumask update_cpus;
+
+	cpumask_clear(&update_cpus);
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
@@ -4089,12 +4110,27 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		/* track runqueues/cpus that need to be re-evaluated */
+		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
 	}
 
+	/* !CONFIG_FAIR_GROUP_SCHED */
 	if (!se) {
 		sub_nr_running(rq, 1);
 		update_rq_runnable_avg(rq, 1);
+
+		/*
+		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
+		 * typedef update_cpus into an int and skip all of the cpumask
+		 * stuff
+		 */
+		cpumask_set_cpu(cpu_of(rq), &update_cpus);
 	}
+
+	if (energy_aware())
+		if (!cpumask_empty(&update_cpus))
+			arch_eval_cpu_freq(&update_cpus);
+
 	hrtick_update(rq);
 }
@@ -7536,6 +7572,9 @@ static void run_rebalance_domains(struct softirq_action *h)
 	 * stopped.
 	 */
 	nohz_idle_balance(this_rq, idle);
+
+	if (energy_aware())
+		arch_scale_cpu_freq();
 }
 
 /*
Hi Mike,
On 10/22/2014 11:37 AM, Mike Turquette wrote:
{en,de}queue_task_fair are updated to track which cpus will have changed utilization values as a function of task queueing. The affected cpus are passed on to arch_eval_cpu_freq for further machine-specific processing based on a selectable policy.
arch_scale_cpu_freq is called from run_rebalance_domains as a way to kick off the scaling process (via wake_up_process), so as to prevent re-entering the {en,de}queue code.
All of the call sites in this patch are up for discussion. Does it make sense to track which cpus have updated statistics in enqueue_task_fair? I chose this because I wanted to gather statistics for all cpus affected in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14, the
Can you explain how pstate selection can get affected by the presence of task groups? We are after all concerned with the cpu load. So when we enqueue/dequeue a task, we update the cpu load and pass it on for cpu pstate scaling. How does this change if we have task groups? I know that this issue was brought up during LPC, but I have not yet managed to gain clarity here.
next version of this patch will focus on the simpler case of not using scheduler cgroups, which should remove a good chunk of this code, including the cpumask stuff.
Also discussed at LPC14 is the fact that load_balance is a very interesting place to do this, as frequency can be considered in concert with task placement. Please put forth any ideas on a sensible way to do this.
Is run_rebalance_domains a logical place to change cpu frequency? What other call sites make sense?
Even for platforms that can target a cpu frequency without sleeping (x86, some ARM platforms with PM microcontrollers) it is currently necessary to always kick the frequency target work out into a kthread. This is because of the rw_sem usage in the cpufreq core which might sleep. Replacing that lock type is probably a good idea.
Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
---
 kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1af6f6d..3619f63 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3999,6 +3999,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	struct cpumask update_cpus;
+
+	cpumask_clear(&update_cpus);
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
@@ -4028,12 +4031,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		/* track cpus that need to be re-evaluated */
+		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
All the cfs_rqs that you iterate through here will belong to the same rq/cpu, right?
Regards Preeti U Murthy
On Tue, Oct 21, 2014 at 11:07:30PM -0700, Mike Turquette wrote:
{en,de}queue_task_fair are updated to track which cpus will have changed utilization values as a function of task queueing. The affected cpus are passed on to arch_eval_cpu_freq for further machine-specific processing based on a selectable policy.
Yeah, I'm not sure about the arch eval hook, ideally it'd be all integrated with the energy model.
arch_scale_cpu_freq is called from run_rebalance_domains as a way to kick off the scaling process (via wake_up_process), so as to prevent re-entering the {en,de}queue code.
We might want a better name for that :-) dvfs_set_freq() or whatnot, or maybe preserve the cpufreq_*() namespace, people seem to know that that is the linux dvfs name.
All of the call sites in this patch are up for discussion. Does it make sense to track which cpus have updated statistics in enqueue_task_fair?
Like I said, I don't think so, we guestimate and approximate everything anyhow, don't bother trying to be 'perfect' here, its excessively expensive.
I chose this because I wanted to gather statistics for all cpus affected in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the next version of this patch will focus on the simpler case of not using scheduler cgroups, which should remove a good chunk of this code, including the cpumask stuff.
Yes please, make the cpumask stuff go away :-)
Also discussed at LPC14 is the fact that load_balance is a very interesting place to do this, as frequency can be considered in concert with task placement. Please put forth any ideas on a sensible way to do this.
Ideally it'd be natural fallout of Morten's energy model.
If you take a multi-core energy model, find its bifurcations and map its solution spaces I suspect there to be a fairly small set of actual behaviours.
The problem is, nobody seems to have done this yet so we don't know.
Once you've done this, you can try and minimize the model by proving you retain all behaviour modes, but for now Morten has a rather full parameter space (not complete though, and the impact of the missing parameters might or might not be relevant, impossible to prove until we have the above done).
Is run_rebalance_domains a logical place to change cpu frequency? What other call sites make sense?
For the legacy systems, maybe.
Even for platforms that can target a cpu frequency without sleeping (x86, some ARM platforms with PM microcontrollers) it is currently necessary to always kick the frequency target work out into a kthread. This is because of the rw_sem usage in the cpufreq core which might sleep. Replacing that lock type is probably a good idea.
I think it would be best to start with this, ideally we'd be able to RCU free the thing such that either holding the rwsem or rcu_read_lock is sufficient for usage, that way the sleeping muck can grab the rwsem, the non-sleeping stuff can grab rcu_read_lock.
But I've not looked at the cpufreq stuff at all.
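In sketch form, the suggested locking scheme could look like this (not actual cpufreq code; cpufreq_cpu_data here is just an example of the per-cpu policy pointer):

/* reader from scheduler context: must not sleep */
rcu_read_lock();
policy = rcu_dereference(per_cpu(cpufreq_cpu_data, cpu));
if (policy) {
	/* read policy fields, record the new frequency target */
}
rcu_read_unlock();

/* writer (e.g. policy teardown): may sleep */
down_write(&policy->rwsem);
rcu_assign_pointer(per_cpu(cpufreq_cpu_data, cpu), NULL);
up_write(&policy->rwsem);
synchronize_rcu();	/* wait out current readers before freeing */
kfree(policy);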
Hi Mike,
On 22/10/14 07:07, Mike Turquette wrote:
{en,de}queue_task_fair are updated to track which cpus will have changed utilization values as a function of task queueing.
The sentence is a little bit misleading. We update the se utilization contrib and the cfs_rq utilization in {en,de}queue_task_fair for a specific se and a specific cpu = rq_of(cfs_rq_of(se))->cpu.
The affected cpus are passed on to arch_eval_cpu_freq for further machine-specific processing based on a selectable policy.
I'm not sure if separating the evaluation and the setting of the cpu frequency makes sense. You could evaluate and possibly set the cpu frequency in one go. Right now you evaluate if the cfs_rq utilization exceeds the thresholds for the current index every time a task is enqueued or dequeued but that's not necessary since you only try to set the cpu frequency in the softirq. The history (and the future if we consider blocked utilization) is already captured in the cfs_rq utilization itself.
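In other words, something like this sketch (helper names reused from earlier in the series; wake_freq_kthread() is hypothetical and the evaluation logic is abbreviated):

/*
 * Sketch of evaluating and setting in one go: the cfs_rq utilization
 * signal already captures the history, so it can be sampled once from
 * the softirq instead of at every enqueue/dequeue.
 */
void arch_scale_cpu_freq(void)
{
	int cpu = smp_processor_id();
	unsigned long util = usage_util_of(cpu);
	unsigned long cap = capacity_of(cpu);

	if (util * 100 > cap * UP_THRESHOLD)
		wake_freq_kthread(cpu);	/* hypothetical: kick the ramp-up */
}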
arch_scale_cpu_freq is called from run_rebalance_domains as a way to kick off the scaling process (via wake_up_process), so as to prevent re-entering the {en,de}queue code.
The name is misleading from the viewpoint of the CFS sched class. The original scaling functions of the CFS scheduler (arch_scale_{freq,smt/cpu,rt}_capacity) scale capacity based on frequency, uarch or rt. So your function should be called arch_scale_util_cpu_freq or, even better, arch_set_cpu_freq.
All of the call sites in this patch are up for discussion. Does it make sense to track which cpus have updated statistics in enqueue_task_fair?
Not really because cfs_rq utilization tracks the history/(future) of cpu utilization and you can evaluate the signal when you want to set the cpu frequency.
I chose this because I wanted to gather statistics for all cpus affected in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the next version of this patch will focus on the simpler case of not using scheduler cgroups, which should remove a good chunk of this code, including the cpumask stuff.
I don't understand why you should care about task groups at all. The task groups' contribution to the utilization of a cpu should already be accounted for in the appropriate cpu's cfs_rq utilization signal.
But I can see a dependency on the fact that there is a difference between systems with per-cluster (package) and per-cpu frequency scaling. There is no SD_SHARE_FREQDOMAIN (sched domain flag) today which, applied to the MC sd level, could tell you that we deal with per-cluster frequency scaling. On systems with per-cpu frequency scaling you can set the frequency for individual cpus by hooking into the scheduler, but on systems with per-cluster frequency scaling you would have to respect the maximum cpu utilization of all cpus in the cluster.
A similar problem occurs with hardware threads (SMT sd level).
But I don't know right now how the sd topology hierarchy can become handy here.
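To make the per-cluster case concrete, frequency selection for a shared domain would have to follow the most utilized cpu, roughly like this (SD_SHARE_FREQDOMAIN is the hypothetical flag mentioned above):

/*
 * Sketch: with a hypothetical SD_SHARE_FREQDOMAIN flag, the OPP chosen
 * for a shared frequency domain must satisfy its most utilized cpu.
 */
static unsigned long freqdomain_max_util(const struct cpumask *domain)
{
	unsigned long util, max_util = 0;
	int cpu;

	for_each_cpu(cpu, domain) {
		util = usage_util_of(cpu);
		if (util > max_util)
			max_util = util;
	}

	return max_util;	/* drives the shared OPP for the cluster */
}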
Also discussed at LPC14 is that fact that load_balance is a very interesting place to do this as frequency can be considered in concert with task placement. Please put forth any ideas on a sensible way to do this.
Is run_rebalance_domains a logical place to change cpu frequency? What other call sites make sense?
At least it's a good place to test this feature for now.
Even for platforms that can target a cpu frequency without sleeping (x86, some ARM platforms with PM microcontrollers) it is currently necessary to always kick the frequency target work out into a kthread. This is because of the rw_sem usage in the cpufreq core which might sleep. Replacing that lock type is probably a good idea.
Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
---
 kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1af6f6d..3619f63 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3999,6 +3999,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
+	struct cpumask update_cpus;
+
+	cpumask_clear(&update_cpus);
 
 	for_each_sched_entity(se) {
 		if (se->on_rq)
@@ -4028,12 +4031,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		/* track cpus that need to be re-evaluated */
+		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
 	}
 
+	/* !CONFIG_FAIR_GROUP_SCHED */
 	if (!se) {
 		update_rq_runnable_avg(rq, rq->nr_running);
 		add_nr_running(rq, 1);
+
+		/*
+		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
+		 * typedef update_cpus into an int and skip all of the cpumask
+		 * stuff
+		 */
+		cpumask_set_cpu(cpu_of(rq), &update_cpus);
 	}
+
+	if (energy_aware())
+		if (!cpumask_empty(&update_cpus))
+			arch_eval_cpu_freq(&update_cpus);
+
 	hrtick_update(rq);
 }
@@ -4049,6 +4067,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
 	int task_sleep = flags & DEQUEUE_SLEEP;
+	struct cpumask update_cpus;
+
+	cpumask_clear(&update_cpus);
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
@@ -4089,12 +4110,27 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		update_cfs_shares(cfs_rq);
 		update_entity_load_avg(se, 1);
+		/* track runqueues/cpus that need to be re-evaluated */
+		cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
 	}
 
+	/* !CONFIG_FAIR_GROUP_SCHED */
 	if (!se) {
 		sub_nr_running(rq, 1);
 		update_rq_runnable_avg(rq, 1);
+
+		/*
+		 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
+		 * typedef update_cpus into an int and skip all of the cpumask
+		 * stuff
+		 */
+		cpumask_set_cpu(cpu_of(rq), &update_cpus);
 	}
+
+	if (energy_aware())
+		if (!cpumask_empty(&update_cpus))
+			arch_eval_cpu_freq(&update_cpus);
+
 	hrtick_update(rq);
 }
@@ -7536,6 +7572,9 @@ static void run_rebalance_domains(struct softirq_action *h)
 	 * stopped.
 	 */
 	nohz_idle_balance(this_rq, idle);
+
+	if (energy_aware())
+		arch_scale_cpu_freq();
 }
 
 /*
On 10/22/2014 11:37 AM, Mike Turquette wrote:
{en,de}queue_task_fair are updated to track which cpus will have changed utilization values as a function of task queueing. The affected cpus are passed on to arch_eval_cpu_freq for further machine-specific processing based on a selectable policy.
arch_scale_cpu_freq is called from run_rebalance_domains as a way to kick off the scaling process (via wake_up_process), so as to prevent re-entering the {en,de}queue code.
All of the call sites in this patch are up for discussion. Does it make sense to track which cpus have updated statistics in enqueue_task_fair? I chose this because I wanted to gather statistics for all cpus affected in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14, the next version of this patch will focus on the simpler case of not using scheduler cgroups, which should remove a good chunk of this code, including the cpumask stuff.
Also discussed at LPC14 is the fact that load_balance is a very interesting place to do this, as frequency can be considered in concert with task placement. Please put forth any ideas on a sensible way to do this.
I believe load balancing would be the right place to evaluate the frequency at which CPUs must run. find_busiest_group() is already iterating through all the CPUs and calculating the load on them. So this information is readily available; what remains is to check which of the CPUs in the group have load above some threshold and to queue a kthread on each such cpu to scale its frequency, while the current cpu continues with its load balancing.
There is another positive I see in evaluating cpu frequency in load balancing. The rate at which load balancing runs is already optimized for scalability. One of the factors considered is whether any sibling cpu has carried out load balancing in the recent past, in which case the current cpu defers doing the same. This means it is naturally ensured that only one cpu in the power domain takes care of frequency scaling each time, and there is no need for explicit synchronization between the policy cpus to do this. A sketch of the idea follows below.
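/*
 * Sketch of the suggestion above: once find_busiest_group() has
 * computed per-cpu load, flag the cpus that exceed the threshold.
 * queue_freq_work() is a hypothetical stand-in for waking the
 * per-policy kthread; UP_THRESHOLD comes from the governor patch.
 */
static void eval_freq_from_lb(struct sched_group *busiest)
{
	int cpu;

	for_each_cpu(cpu, sched_group_cpus(busiest)) {
		if (usage_util_of(cpu) * 100 >
		    capacity_of(cpu) * UP_THRESHOLD)
			queue_freq_work(cpu);	/* hypothetical */
	}
}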
Regards Preeti U Murthy
Building on top of the scale invariant capacity patches and earlier patches in this series that prepare CFS for scaling cpu frequency, this patch implements a simple, naive ondemand-like cpu frequency scaling policy that is driven by enqueue_task_fair and dequeue_task_fair. This new policy is named "energy_model" as an homage to the on-going work in that area. It is NOT an actual energy model.
This policy is implemented using the CPUfreq governor interface for two main reasons:
1) re-using the CPUfreq machine drivers without using the governor interface is hard. I do not foresee any issue continuing to use the governor interface going forward, but it is worth making clear what this patch does up front.
2) using the CPUfreq interface allows us to switch between the energy_model governor and other CPUfreq governors (such as ondemand) at run-time. This is very useful for comparative testing and tuning.
A caveat to #2 above is that the weak arch function used by the governor means that only one scheduler-driven policy can be linked at a time. This limitation does not apply to "traditional" governors. I raised this in my previous capacity_ops patches[0] but as discussed at LPC14 last week, it seems desirable to pursue a single cpu frequency scaling policy at first, and try to make that work for everyone interested in using it. If that model breaks down then we can revisit the idea of dynamic selection of scheduler-driven cpu frequency scaling.
Unlike legacy CPUfreq governors, this policy does not implement its own logic loop (such as a workqueue triggered by a timer), but instead uses an event-driven design. Frequency is evaluated by entering {en,de}queue_task_fair and then a kthread is woken from run_rebalance_domains which scales cpu frequency based on the latest evaluation.
The policy implemented in this patch takes the highest cpu utilization from policy->cpus and uses that to select a frequency target based on the same 80%/20% thresholds used as defaults in ondemand. Frequency-scaled thresholds are pre-computed when energy_model inits. The frequency selection is a simple comparison of cpu utilization (as defined in Morten's latest RFC) to the threshold values. In the future this logic could be replaced with something more sophisticated that uses PELT to get a historical overview. Ideas are welcome.
Note that the pre-computed thresholds above do not take into account micro-architecture differences (SMT or big.LITTLE hardware), only frequency invariance.
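A worked example of the pre-computation, assuming a cpu with policy->max = 2000000 kHz evaluating its 1000000 kHz OPP (the formula is the one used by em_start() below):

/*
 * capacity       = 1000000 * 1024 / 2000000 = 512
 * up_threshold   = 512 * 80 / 100           = 409
 * down_threshold = 512 * 20 / 100           = 102
 *
 * i.e. at the 1.0 GHz OPP the policy ramps to policy->max once
 * utilization exceeds 409/1024, and steps down one OPP when it
 * drops below 102/1024.
 */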
Not-signed-off-by: Mike Turquette <mturquette@linaro.org>
---
 drivers/cpufreq/Kconfig     |  21 +++
 include/linux/cpufreq.h     |   3 +
 kernel/sched/Makefile       |   1 +
 kernel/sched/energy_model.c | 341 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 366 insertions(+)
 create mode 100644 kernel/sched/energy_model.c
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 22b42d5..78a2caa 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
 	  Be aware that not all cpufreq drivers support the conservative
 	  governor. If unsure have a look at the help section of the
 	  driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL
+	bool "energy_model"
+	select CPU_FREQ_GOV_ENERGY_MODEL
+	select CPU_FREQ_GOV_PERFORMANCE
+	help
+	  Use the CPUfreq governor 'energy_model' as default. This
+	  scales cpu frequency from the scheduler as per-task statistics
+	  are updated.
 endchoice
 
 config CPU_FREQ_GOV_PERFORMANCE
@@ -183,6 +192,18 @@ config CPU_FREQ_GOV_CONSERVATIVE
 
 	  If in doubt, say N.
 
+config CPU_FREQ_GOV_ENERGY_MODEL
+	tristate "'energy model' cpufreq governor"
+	depends on CPU_FREQ
+	select CPU_FREQ_GOV_COMMON
+	help
+	  'energy_model' - this governor scales cpu frequency from the
+	  scheduler as a function of cpu utilization. It does not
+	  evaluate utilization on a periodic basis (unlike ondemand) but
+	  instead is invoked from CFS when updating per-task statistics.
+
+	  If in doubt, say N.
+
 config CPUFREQ_GENERIC
 	tristate "Generic cpufreq driver"
 	depends on HAVE_CLK && OF
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 91d173c..69cbbec 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -482,6 +482,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
 #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
 extern struct cpufreq_governor cpufreq_gov_conservative;
 #define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL)
+extern struct cpufreq_governor cpufreq_gov_energy_model;
+#define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_energy_model)
 #endif
 
 /*********************************************************************
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index ab32b7b..7cd404c 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_ENERGY_MODEL) += energy_model.o
diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c
new file mode 100644
index 0000000..5cdea9a
--- /dev/null
+++ b/kernel/sched/energy_model.c
@@ -0,0 +1,341 @@
+/*
+ * Copyright (C) 2014 Michael Turquette <mturquette@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+
+#include "sched.h"
+
+#define THROTTLE_MSEC		50
+#define UP_THRESHOLD		80
+#define DOWN_THRESHOLD		20
+
+/**
+ * em_data - per-policy data used by energy_model
+ * @throttle: bail if current time is less than ktime_throttle.
+ *	      Derived from THROTTLE_MSEC
+ * @up_threshold: table of normalized capacity states to determine if cpu
+ *		  should run faster. Derived from UP_THRESHOLD
+ * @down_threshold: table of normalized capacity states to determine if cpu
+ *		    should run slower. Derived from DOWN_THRESHOLD
+ *
+ * struct em_data is the per-policy energy_model-specific data structure. A
+ * per-policy instance of it is created when the energy_model governor receives
+ * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
+ * member of struct cpufreq_policy.
+ *
+ * Readers of this data must call down_read(policy->rwsem). Writers must
+ * call down_write(policy->rwsem).
+ */
+struct em_data {
+	/* per-policy throttling */
+	ktime_t throttle;
+	unsigned int *up_threshold;
+	unsigned int *down_threshold;
+	struct task_struct *task;
+	atomic_long_t target_freq;
+	atomic_t need_wake_task;
+};
+
+/*
+ * we pass in struct cpufreq_policy. This is safe because changing out the
+ * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
+ * which tears all of the data structures down and __cpufreq_governor(policy,
+ * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
+ * new policy pointer
+ */
+static int energy_model_thread(void *data)
+{
+	struct sched_param param;
+	struct cpufreq_policy *policy;
+	struct em_data *em;
+	int ret;
+
+	policy = (struct cpufreq_policy *) data;
+	if (!policy) {
+		pr_warn("%s: missing policy\n", __func__);
+		do_exit(-EINVAL);
+	}
+
+	em = policy->gov_data;
+	if (!em) {
+		pr_warn("%s: missing governor data\n", __func__);
+		do_exit(-EINVAL);
+	}
+
+	/* SCHED_FIFO requires a priority of at least 1 */
+	param.sched_priority = 1;
+	sched_setscheduler(current, SCHED_FIFO, &param);
+
+	do {
+		down_write(&policy->rwsem);
+		if (!atomic_read(&em->need_wake_task)) {
+			up_write(&policy->rwsem);
+			set_current_state(TASK_INTERRUPTIBLE);
+			schedule();
+			continue;
+		}
+
+		ret = __cpufreq_driver_target(policy,
+					      atomic_read(&em->target_freq),
+					      CPUFREQ_RELATION_H);
+		if (ret)
+			pr_debug("%s: __cpufreq_driver_target returned %d\n",
+				 __func__, ret);
+
+		em->throttle = ktime_get();
+		atomic_set(&em->need_wake_task, 0);
+		up_write(&policy->rwsem);
+	} while (!kthread_should_stop());
+
+	do_exit(0);
+}
+
+static void em_wake_up_process(struct task_struct *task)
+{
+	/* this is null during early boot */
+	if (IS_ERR_OR_NULL(task))
+		return;
+
+	wake_up_process(task);
+}
+
+void arch_scale_cpu_freq(void)
+{
+	struct cpufreq_policy *policy;
+	struct em_data *em;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		policy = cpufreq_cpu_get(cpu);
+		if (IS_ERR_OR_NULL(policy))
+			continue;
+
+		em = policy->gov_data;
+		if (!em) {
+			cpufreq_cpu_put(policy);
+			continue;
+		}
+
+		/*
+		 * FIXME replace the atomic stuff by holding write-locks
+		 * in arch_eval_cpu_freq?
+		 */
+		if (atomic_read(&em->need_wake_task))
+			em_wake_up_process(em->task);
+
+		cpufreq_cpu_put(policy);
+	}
+}
+
+/**
+ * arch_eval_cpu_freq - scale cpu frequency based on CFS utilization
+ * @update_cpus: mask of CPUs with updated utilization and capacity
+ *
+ * Declared and weakly defined in kernel/sched/fair.c. This definition
+ * overrides the default. In the case of CONFIG_FAIR_GROUP_SCHED, update_cpus
+ * may contain cpus that are not in the same policy. Otherwise update_cpus
+ * will be a single cpu.
+ *
+ * Holds read lock for policy->rwsem.
+ *
+ * FIXME weak arch function means that only one definition of this function
+ * can be linked. How to support multiple energy model policies?
+ */
+void arch_eval_cpu_freq(struct cpumask *update_cpus)
+{
+	struct cpufreq_policy *policy;
+	struct em_data *em;
+	int index;
+	unsigned int cpu, tmp;
+	unsigned long max_util = 0, cap = 0, util = 0;
+
+	/*
+	 * In the case of CONFIG_FAIR_GROUP_SCHED, policy->cpus may be a subset
+	 * of update_cpus. In such a case take the first cpu in update_cpus,
+	 * get its policy and try to scale the affected cpus. Then we clear the
+	 * corresponding bits from update_cpus and try again. If a policy does
+	 * not exist for a cpu then we remove that bit as well, preventing an
+	 * infinite loop.
+	 */
+	while (!cpumask_empty(update_cpus)) {
+		max_util = 0;
+		cap = 0;
+		util = 0;
+
+		cpu = cpumask_first(update_cpus);
+		policy = cpufreq_cpu_get(cpu);
+		if (IS_ERR_OR_NULL(policy)) {
+			cpumask_clear_cpu(cpu, update_cpus);
+			continue;
+		}
+
+		if (!policy->gov_data) {
+			cpumask_clear_cpu(cpu, update_cpus);
+			cpufreq_cpu_put(policy);
+			continue;
+		}
+
+		em = policy->gov_data;
+
+		if (ktime_before(ktime_get(), em->throttle)) {
+			trace_printk("THROTTLED");
+			goto bail;
+		}
+
+		/*
+		 * try scaling cpus
+		 *
+		 * algorithm assumptions & description:
+		 * all cpus in a policy run at the same rate/capacity.
+		 * choose frequency target based on most utilized cpu.
+		 * do not care about aggregating cpu utilization.
+		 * do not track any historical trends beyond utilization
+		 * if max_util > 80% of current capacity,
+		 *	go to max capacity
+		 * if max_util < 20% of current capacity,
+		 *	go to the next lowest capacity
+		 * otherwise, stay at the same capacity state
+		 */
+		for_each_cpu(tmp, policy->cpus) {
+			util = usage_util_of(tmp);
+			if (util > max_util)
+				max_util = util;
+		}
+
+		cap = capacity_of(cpu);
+		if (!cap)
+			goto bail;
+
+		index = cpufreq_frequency_table_get_index(policy, policy->cur);
+		if (max_util > em->up_threshold[index]) {
+			/* write em->target_freq with read lock held */
+			atomic_long_set(&em->target_freq, policy->max);
+			/*
+			 * FIXME this is gross. convert arch_eval_cpu_freq to
+			 * hold the write lock?
+			 */
+			atomic_set(&em->need_wake_task, 1);
+		} else if (max_util < em->down_threshold[index]) {
+			/* write em->target_freq with read lock held */
+			atomic_long_set(&em->target_freq, policy->cur - 1);
+			/*
+			 * FIXME this is gross. convert arch_eval_cpu_freq to
+			 * hold the write lock?
+			 */
+			atomic_set(&em->need_wake_task, 1);
+		}
+
+bail:
+		/* remove policy->cpus from update_cpus */
+		cpumask_andnot(update_cpus, update_cpus, policy->cpus);
+		cpufreq_cpu_put(policy);
+	}
+
+	return;
+}
+
+static void em_start(struct cpufreq_policy *policy)
+{
+	int index = 0, count = 0;
+	unsigned int capacity;
+	struct em_data *em;
+	struct cpufreq_frequency_table *pos;
+
+	/* prepare per-policy private data */
+	em = kzalloc(sizeof(*em), GFP_KERNEL);
+	if (!em) {
+		pr_debug("%s: failed to allocate private data\n", __func__);
+		return;
+	}
+
+	policy->gov_data = em;
+
+	/* how many entries in the frequency table? */
+	cpufreq_for_each_entry(pos, policy->freq_table)
+		count++;
+
+	/* pre-compute thresholds */
+	em->up_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);
+	em->down_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);
+
+	cpufreq_for_each_entry(pos, policy->freq_table) {
+		/* FIXME capacity below is not scaled for uarch */
+		capacity = pos->frequency * SCHED_CAPACITY_SCALE / policy->max;
+		em->up_threshold[index] = capacity * UP_THRESHOLD / 100;
+		em->down_threshold[index] = capacity * DOWN_THRESHOLD / 100;
+		pr_debug("%s: cpu = %u index = %d capacity = %u up = %u down = %u\n",
+			 __func__, cpumask_first(policy->cpus), index,
+			 capacity, em->up_threshold[index],
+			 em->down_threshold[index]);
+		index++;
+	}
+
+	/* init per-policy kthread */
+	em->task = kthread_create(energy_model_thread, policy,
+				  "kenergy_model_task");
+	if (IS_ERR_OR_NULL(em->task))
+		pr_err("%s: failed to create kenergy_model_task thread\n",
+		       __func__);
+}
+
+static void em_stop(struct cpufreq_policy *policy)
+{
+	struct em_data *em;
+
+	em = policy->gov_data;
+
+	kthread_stop(em->task);
+
+	/* replace with devm counterparts */
+	kfree(em->up_threshold);
+	kfree(em->down_threshold);
+	kfree(em);
+}
+
+static int energy_model_setup(struct cpufreq_policy *policy, unsigned int event)
+{
+	switch (event) {
+	case CPUFREQ_GOV_START:
+		/* Start managing the frequency */
+		em_start(policy);
+		return 0;
+
+	case CPUFREQ_GOV_STOP:
+		em_stop(policy);
+		return 0;
+
+	case CPUFREQ_GOV_LIMITS:	/* unused */
+	case CPUFREQ_GOV_POLICY_INIT:	/* unused */
+	case CPUFREQ_GOV_POLICY_EXIT:	/* unused */
+		break;
+	}
+	return 0;
+}
+
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL
+static
+#endif
+struct cpufreq_governor cpufreq_gov_energy_model = {
+	.name		= "energy_model",
+	.governor	= energy_model_setup,
+	.owner		= THIS_MODULE,
+};
+
+static int __init energy_model_init(void)
+{
+	return cpufreq_register_governor(&cpufreq_gov_energy_model);
+}
+
+static void __exit energy_model_exit(void)
+{
+	cpufreq_unregister_governor(&cpufreq_gov_energy_model);
+}
+
+/* Try to make this the default governor */
+fs_initcall(energy_model_init);
+
+MODULE_LICENSE("GPL");
On 22/10/14 07:07, Mike Turquette wrote:
Building on top of the scale invariant capacity patches and earlier
We don't have scale invariant capacity yet but scale invariant load/utilization.
patches in this series that prepare CFS for scaling cpu frequency, this patch implements a simple, naive ondemand-like cpu frequency scaling policy that is driven by enqueue_task_fair and dequeue_task_fair. This new policy is named "energy_model" as an homage to the on-going work in that area. It is NOT an actual energy model.
Maybe it's worth mentioning that you simply take SCHED_CAPACITY_SCALE and multiply it with the OPP frequency/max frequency of that cpu to get the capacity at that OPP. You're not using the capacity related energy values 'struct capacity:cap' from the energy model which would have to be measured for the particular platform.
[...]
The policy implemented in this patch takes the highest cpu utilization from policy->cpus and uses that to select a frequency target based on the same 80%/20% thresholds used as defaults in ondemand. Frequency-scaled thresholds are pre-computed when energy_model inits. The frequency selection is a simple comparison of cpu utilization (as defined in Morten's latest RFC) to the threshold values. In the future this logic could be replaced with something more sophisticated that uses PELT to get a historical overview. Ideas are welcome.
This is what I don't grasp. The se utilization contrib and the cfs_rq utilization are PELT signals, and they already provide history information, don't they? I mean, comparing the cfs_rq utilization PELT signal with a number from an energy model, that's essentially EAS.
Note that the pre-computed thresholds above do not take into account micro-architecture differences (SMT or big.LITTLE hardware), only frequency invariance.
Not-signed-off-by: Mike Turquette mturquette@linaro.org
drivers/cpufreq/Kconfig | 21 +++ include/linux/cpufreq.h | 3 + kernel/sched/Makefile | 1 + kernel/sched/energy_model.c | 341 ++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 366 insertions(+) create mode 100644 kernel/sched/energy_model.c
[...]
+/**
+ * em_data - per-policy data used by energy_model
+ * @throttle: bail if current time is less than ktime_throttle.
+ *	      Derived from THROTTLE_MSEC
+ * @up_threshold: table of normalized capacity states to determine if cpu
+ *		  should run faster. Derived from UP_THRESHOLD
+ * @down_threshold: table of normalized capacity states to determine if cpu
+ *		    should run slower. Derived from DOWN_THRESHOLD
+ *
+ * struct em_data is the per-policy energy_model-specific data structure. A
+ * per-policy instance of it is created when the energy_model governor receives
+ * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
+ * member of struct cpufreq_policy.
+ *
+ * Readers of this data must call down_read(policy->rwsem). Writers must
+ * call down_write(policy->rwsem).
+ */
+struct em_data {
+	/* per-policy throttling */
+	ktime_t throttle;
+	unsigned int *up_threshold;
+	unsigned int *down_threshold;
+	struct task_struct *task;
+	atomic_long_t target_freq;
+	atomic_t need_wake_task;
+};
On my Chromebook2 (Exynos 5 Octa 5800) I end up with 2 kernel threads (one for each cluster). There is a 'for_each_online_cpu' in arch_scale_cpu_freq and I can see that the em data thread is invoked for both clusters every time. Is this the intended behaviour?
It looks like you achieve the desired behaviour for freq-scaling per cluster for this system but it's not clear to me how this is done from the design perspective and what would have to be changed if we want to run it on a per-cpu frequency scaling system.
Coming back to your question of where you should call arch_scale_cpu_freq: another issue is for which cpu you should call it. For EAS we want to be able to either raise the cpu frequency of the busiest cpu or migrate tasks away from the busiest cpu. So maybe arch_scale_cpu_freq should be called later in load_balance, once we have figured out which one is the busiest cpu? This would map nicely to load balance in the MC sd level for per-cpu frequency scaling and in the DIE sd level for per-cluster frequency scaling. But then, where do you hook in to lower the frequency eventually? And what happens in load balance for all the other 'sd level <-> per-foo frequency scaling' combinations?
[...]
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL
+static
+#endif
+struct cpufreq_governor cpufreq_gov_energy_model = {
+	.name		= "energy_model",
+	.governor	= energy_model_setup,
+	.owner		= THIS_MODULE,
+};
+
+static int __init energy_model_init(void)
+{
+	return cpufreq_register_governor(&cpufreq_gov_energy_model);
+}
Probably not that important at this stage. I always hit
[    8.601824] ------------[ cut here ]------------
[    8.601869] WARNING: CPU: 6 PID: 3229 at drivers/cpufreq/cpufreq_governor.c:266 cpufreq_governor_dbs+0x6f4/0x6f8()
[    8.601884] Modules linked in:
[    8.601912] CPU: 6 PID: 3229 Comm: cpufreq-set Not tainted 3.17.0-rc3-00293-g5cf54ebcaea6 #16
[    8.601953] [<c0015224>] (unwind_backtrace) from [<c0011cd4>] (show_stack+0x18/0x1c)
[    8.601982] [<c0011cd4>] (show_stack) from [<c04c5b28>] (dump_stack+0x80/0xc0)
[    8.602011] [<c04c5b28>] (dump_stack) from [<c0022fd8>] (warn_slowpath_common+0x78/0x94)
[    8.602041] [<c0022fd8>] (warn_slowpath_common) from [<c00230a8>] (warn_slowpath_null+0x24/0x2c)
[    8.602071] [<c00230a8>] (warn_slowpath_null) from [<c03a74c8>] (cpufreq_governor_dbs+0x6f4/0x6f8)
[    8.602100] [<c03a74c8>] (cpufreq_governor_dbs) from [<c03a1b58>] (__cpufreq_governor+0x140/0x240)
[    8.602126] [<c03a1b58>] (__cpufreq_governor) from [<c03a31b0>] (cpufreq_set_policy+0x18c/0x20c)
[    8.602153] [<c03a31b0>] (cpufreq_set_policy) from [<c03a3400>] (store_scaling_governor+0x78/0xa4)
[    8.602179] [<c03a3400>] (store_scaling_governor) from [<c03a149c>] (store+0x94/0xc0)
[    8.602207] [<c03a149c>] (store) from [<c015c268>] (kernfs_fop_write+0xc8/0x188)
[    8.602236] [<c015c268>] (kernfs_fop_write) from [<c00ffc00>] (vfs_write+0xac/0x1b8)
[    8.602263] [<c00ffc00>] (vfs_write) from [<c010023c>] (SyS_write+0x48/0x9c)
[    8.602290] [<c010023c>] (SyS_write) from [<c000e600>] (ret_fast_syscall+0x0/0x30)
[    8.602307] ---[ end trace bedc9e3b94a57ef2 ]---
when I configure CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL=y during initial system start.
[...]
On Tue, Oct 21, 2014 at 11:07:31PM -0700, Mike Turquette wrote:
Unlike legacy CPUfreq governors, this policy does not implement its own logic loop (such as a workqueue triggered by a timer), but instead uses an event-driven design. Frequency is evaluated by entering {en,de}queue_task_fair and then a kthread is woken from run_rebalance_domains which scales cpu frequency based on the latest evaluation.
Also note that we probably want to extend the governor to include the other sched classes, deadline for example is a good candidate to include as it already explicitly provides utilization requirements from which you can compute a hard minimum frequency, below which the task set is unschedulable.
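A sketch of that computation (names are illustrative; deadline bandwidth is tracked as a fixed-point fraction where 1 << BW_SHIFT represents 100% of a cpu):

/*
 * Illustrative only: the total admitted SCHED_DEADLINE bandwidth gives
 * a hard floor for cpu frequency, below which the task set cannot meet
 * its deadlines.
 */
static unsigned long dl_min_freq(u64 dl_bw, unsigned long max_freq)
{
	/* dl_bw: admitted bandwidth, with (1 << BW_SHIFT) meaning 100% */
	return (max_freq * dl_bw) >> BW_SHIFT;
}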
fifo/rr are far harder to do since for them we don't have anything useful; the best we can do, I suppose, is some statistical over-provisioning, but with no guarantees.