On Fri, Apr 17, 2015 at 03:34:30PM +0100, Ashwin Chaugule wrote:
Hi Mike,
On 16 April 2015 at 01:29, Michael Turquette <mturquette@linaro.org> wrote:
Scheduler-driven cpu frequency selection is desirable as part of the on-going effort to make the scheduler better aware of energy consumption. No piece of the Linux kernel has a better view of the factors that affect a cpu frequency selection policy than the scheduler[0], and this patch is an attempt to get that discussion going again.
This patch implements a cpufreq governor, cap_gov, that directly accesses scheduler statistics, in particular the pelt data from cfs via the get_cpu_usage() function.
Put plainly, cap_gov selects the lowest cpu frequency that will prevent a runqueue from being over-utilized (until we hit the highest frequency of course).
cap_gov converts available cpu frequencies into capacity states. When the utilization of a cfs runqueue changes then the policy selects the capacity state which is the floor of the new usage.
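To make the capacity-state conversion concrete, here is a minimal userspace sketch, not the patch's code. The names freq_to_cap(), pick_freq() and example_table_khz are hypothetical, and the table values are made up for illustration. It scales each frequency so that the maximum maps to SCHED_CAPACITY_SCALE (1024), then picks the lowest frequency whose capacity covers the usage, without assuming the table is sorted. It uses >= as described in the patch comment, whereas the loop in the patch compares with a strict <.

```c
#include <assert.h>
#include <stddef.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* Illustrative, deliberately unsorted frequency table in kHz (made-up values). */
static const unsigned long example_table_khz[] = {
	1000000, 500000, 750000, 250000,
};

/* Hypothetical helper: capacity of a frequency, with max_khz mapping to 1024. */
static unsigned long freq_to_cap(unsigned long freq_khz, unsigned long max_khz)
{
	assert(max_khz > 0);	/* guard the division below */
	return (unsigned long)((unsigned long long)freq_khz *
			       SCHED_CAPACITY_SCALE / max_khz);
}

/*
 * Hypothetical helper: lowest frequency in the table whose capacity is
 * >= usage. Falls back to max_khz when no entry is large enough.
 */
static unsigned long pick_freq(const unsigned long *table, size_t n,
			       unsigned long max_khz, unsigned long usage)
{
	unsigned long best = max_khz;
	size_t i;

	for (i = 0; i < n; i++) {
		if (freq_to_cap(table[i], max_khz) >= usage && table[i] < best)
			best = table[i];
	}
	return best;
}
```

With this table the capacity states are 1024, 512, 768 and 256, so a usage of 300 would select 500000 kHz.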
Unlike the previous posting from 2014[1] this governor implements no policy of its own (e.g. with tunable thresholds for determining when to scale frequency), but instead implements a "follow the usage" method, where usage is defined as the cpu frequency-invariant product of utilization_load_avg and cpu_capacity_orig.
This governor is event-driven. There is no polling loop to check cpu idle time, or any other method which is unsynchronized with the scheduler. The entry points for this policy are in fair.c: enqueue_task_fair, dequeue_task_fair and task_tick_fair. run_rebalance_domains is used to kick the worker thread, to avoid fatally re-entering the scheduler.
This policy is implemented using the cpufreq governor interface for two main reasons:
- re-using the cpufreq machine drivers without using the governor
interface is hard.
- using the cpufreq interface allows us to switch between the
scheduler-driven policy and legacy cpufreq governors such as ondemand at run-time. This is very useful for comparative testing and tuning.
Finally, it is worth mentioning that this approach neglects all scheduling classes except for cfs. It is possible to add support for deadline and other classes here, but I also wonder if a multi-governor approach would be a more maintainable solution, where the cpufreq core aggregates the constraints set by multiple governors. Supporting such an approach in the cpufreq core would also allow for peripheral devices to place constraints on cpu frequency without having to hack such behavior in at the governor level.
Thanks to Juri Lelli <juri.lelli@arm.com> for doing a good bit of testing, bug fixing and contributing towards the design.
[0] http://article.gmane.org/gmane.linux.kernel/1499836
[1] https://lkml.org/lkml/2014/10/22/22
Signed-off-by: Michael Turquette <mturquette@linaro.org>
 drivers/cpufreq/Kconfig |  22 +++
 include/linux/cpufreq.h |   3 +
 kernel/sched/Makefile   |   1 +
 kernel/sched/cap_gov.c  | 361 ++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c     |  19 +++
 kernel/sched/sched.h    |   8 ++
 6 files changed, 414 insertions(+)
 create mode 100644 kernel/sched/cap_gov.c
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index a171fef..654d70a 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
Be aware that not all cpufreq drivers support the conservative
governor. If unsure have a look at the help section of the
driver. Fallback governor will be the performance governor.
+config CPU_FREQ_DEFAULT_GOV_CAP_GOV
bool "cap_gov"
select CPU_FREQ_GOV_CAP_GOV
select CPU_FREQ_GOV_PERFORMANCE
help
Use the CPUfreq governor 'cap_gov' as default. This scales cpu
frequency from the scheduler as per-entity load tracking
statistics are updated.
endchoice
config CPU_FREQ_GOV_PERFORMANCE
@@ -183,6 +192,19 @@ config CPU_FREQ_GOV_CONSERVATIVE
If in doubt, say N.
+config CPU_FREQ_GOV_CAP_GOV
tristate "'capacity governor' cpufreq governor"
depends on CPU_FREQ
select CPU_FREQ_GOV_COMMON
help
'cap_gov' - this governor scales cpu frequency from the
scheduler as a function of cpu capacity utilization. It does
not evaluate utilization on a periodic basis (unlike ondemand)
but instead is invoked from CFS when updating per-entity load
tracking statistics.
If in doubt, say N.
comment "CPU frequency scaling drivers"
config CPUFREQ_DT
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 7cdf63a..4fc066f 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -488,6 +488,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
extern struct cpufreq_governor cpufreq_gov_conservative;
#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV)
+extern struct cpufreq_governor cpufreq_gov_cap_gov;
+#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_cap_gov)
#endif
/*********************************************************************
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 46be870..da601d5 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_CAP_GOV) += cap_gov.o
diff --git a/kernel/sched/cap_gov.c b/kernel/sched/cap_gov.c
new file mode 100644
index 0000000..72873ab
--- /dev/null
+++ b/kernel/sched/cap_gov.c
@@ -0,0 +1,361 @@
+/*
- Copyright (C) 2014 Michael Turquette mturquette@linaro.org
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License version 2 as
- published by the Free Software Foundation.
- */
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/percpu.h>
+#include "sched.h"
+#define UP_THRESHOLD 95
+#define THROTTLE_NSEC 50000000 /* 50ms default */
+/*
- per-cpu pointer to atomic_t gov_data->cap_gov_wake_task
- used in scheduler hot paths {en,de}queue, task_tick without having to
- access struct cpufreq_policy and struct gov_data
- */
+static DEFINE_PER_CPU(atomic_t *, cap_gov_wake_task);
+/**
- gov_data - per-policy data internal to the governor
- @throttle: time until throttling period expires. Derived from THROTTLE_NSEC
- @task: worker task for dvfs transition that may block/sleep
- @need_wake_task: flag the governor to wake this policy's worker thread
- struct gov_data is the per-policy cap_gov-specific data structure. A
- per-policy instance of it is created when the cap_gov governor receives
- the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
- member of struct cpufreq_policy.
- Readers of this data must call down_read(policy->rwsem). Writers must
- call down_write(policy->rwsem).
- */
+struct gov_data {
ktime_t throttle;
unsigned int throttle_nsec;
struct task_struct *task;
atomic_t need_wake_task;
+};
+/**
- cap_gov_select_freq - pick the next frequency for a cpu
- @cpu: the cpu whose frequency may be changed
- cap_gov_select_freq works in a way similar to the ondemand governor. First
- we inspect the utilization of all of the cpus in this policy to find the
- most utilized cpu. This is achieved by calling get_cpu_usage, which returns
- frequency-invariant capacity utilization.
- This max utilization is compared against the up_threshold (default 95%
- utilization). If the max cpu utilization is greater than this threshold then
- we scale the policy up to the max frequency. Otherwise we find the lowest
- frequency (smallest cpu capacity) that is still larger than the max capacity
- utilization for this policy.
- Returns frequency selected.
- */
+static unsigned long cap_gov_select_freq(struct cpufreq_policy *policy)
+{
int cpu = 0;
struct gov_data *gd;
int index;
unsigned long freq = 0, max_usage = 0, cap = 0, usage = 0;
struct cpufreq_frequency_table *pos;
if (!policy->gov_data)
goto out;
gd = policy->gov_data;
/*
* get_cpu_usage is called without locking the runqueues. This is the
* same behavior used by find_busiest_cpu in load_balance. We are
* willing to accept occasionally stale data here in exchange for
* lockless behavior.
*/
for_each_cpu(cpu, policy->cpus) {
usage = get_cpu_usage(cpu);
trace_printk("cpu = %d usage = %lu", cpu, usage);
if (usage > max_usage)
max_usage = usage;
}
trace_printk("max_usage = %lu", max_usage);
/* find the utilization threshold at which we scale up frequency */
index = cpufreq_frequency_table_get_index(policy, policy->cur);
/*
* converge towards max_usage. We want the lowest frequency whose
* capacity is >= to max_usage. In other words:
*
* find capacity == floor(usage)
*
* Sadly cpufreq freq tables are not guaranteed to be ordered by
* frequency...
*/
freq = policy->max;
cpufreq_for_each_entry(pos, policy->freq_table) {
cap = pos->frequency * SCHED_CAPACITY_SCALE /
policy->max;
if (max_usage < cap && pos->frequency < freq)
freq = pos->frequency;
trace_printk("cpu = %u max_usage = %lu cap = %lu \
table_freq = %u freq = %lu",
cpumask_first(policy->cpus), max_usage, cap,
pos->frequency, freq);
}
This code assumes all backend drivers will support a frequency table. I think this may not always be true (e.g. pcc-cpufreq.c, which is already upstream, or even the WIP CPPC driver). What do you think about detecting whether the backend supports a freq table and, if it does not, just passing max_usage down to the driver and letting the driver handle it from there?
Unless you change the driver interface it has to be a frequency passed to the driver.
Do you actually need to consider the OPP frequencies at all? It looks for the lowest frequency with enough capacity for max_usage. But I don't think it is necessary to find a valid OPP frequency. __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_H) should be able to handle that as long as we can provide it a minimum frequency (freq) which is sufficient for max_usage.
Can't we get away with just returning:
(max_usage * policy->max)/SCHED_CAPACITY_SCALE
and avoid having to look at the frequency table?
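For illustration, the expression above can be tried in isolation with a small userspace sketch. The helper name usage_to_min_freq() and the kHz values are hypothetical. SCHED_CAPACITY_SCALE is defined locally as 1024 to mirror the kernel constant, and the clamp on max_usage is an added assumption so the result never exceeds the policy maximum.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024UL

/*
 * Hypothetical helper: map a frequency-invariant usage value (0..1024)
 * straight to a minimum frequency in kHz, with no frequency table lookup.
 */
static unsigned long usage_to_min_freq(unsigned long max_usage,
				       unsigned long policy_max_khz)
{
	assert(policy_max_khz > 0);	/* a policy without a max is a caller bug */

	/* Assumption: usage above full scale is treated as full scale. */
	if (max_usage > SCHED_CAPACITY_SCALE)
		max_usage = SCHED_CAPACITY_SCALE;

	/* Widen to 64 bits so the product cannot overflow on 32-bit targets. */
	return (unsigned long)((unsigned long long)max_usage * policy_max_khz /
			       SCHED_CAPACITY_SCALE);
}
```

The returned value is generally not a table entry, so the relation passed to the driver decides the rounding. In cpufreq.h, CPUFREQ_RELATION_L is documented as "lowest frequency at or above target" and CPUFREQ_RELATION_H as "highest frequency below or at target".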
It isn't as simple as that though, some margin or threshold has to be added so we don't have to wait until max_usage == cap before we ask for a higher OPP. It takes ages to get there (100+ ms).
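One possible way to express such a margin, purely as a sketch: pad the usage by a percentage before converting it to a frequency, so a higher OPP is requested before usage reaches the current capacity. The helper name usage_to_freq_with_margin() and the margin_pct parameter are hypothetical and not part of the patch, which uses a fixed UP_THRESHOLD instead.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024UL

/*
 * Hypothetical helper: inflate usage by margin_pct percent, clamp it to
 * full scale, then convert it to a frequency in kHz.
 */
static unsigned long usage_to_freq_with_margin(unsigned long usage,
					       unsigned long policy_max_khz,
					       unsigned int margin_pct)
{
	unsigned long long padded;

	assert(policy_max_khz > 0);	/* a policy without a max is a caller bug */

	/* Add headroom: e.g. margin_pct = 25 turns 512 into 640. */
	padded = (unsigned long long)usage * (100 + margin_pct) / 100;
	if (padded > SCHED_CAPACITY_SCALE)
		padded = SCHED_CAPACITY_SCALE;

	return (unsigned long)(padded * policy_max_khz / SCHED_CAPACITY_SCALE);
}
```

With a 25% margin and a 1000000 kHz maximum, a usage of 512 asks for 625000 kHz rather than 500000 kHz, and anything from about 820 upwards already requests the maximum.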
Btw, I noted that __cpufreq_driver_target() is deprecated. Are you planning on using that for the CPPC driver? Based on my rather limited understanding of CPPC I would have guessed you would go with setpolicy() instead?
Morten