Quoting Amit Kucheria (2015-04-16 00:11:22)
On Thu, Apr 16, 2015 at 10:59 AM, Michael Turquette <mturquette@linaro.org> wrote:
+config CPU_FREQ_GOV_CAP_GOV
Two GOVs are redundant here and make it hard to read. A few name suggestions for your baby:
CPU_FREQ_GOV_SCHED_CAP
CPU_FREQ_GOV_SCHED_STATS
I don't want the name to be too generic, since we're only dealing with cfs right now. Perhaps your SCHED_CAP variant or maybe CPU_FREQ_GOV_SCHED_CFS? That leaves room for SCHED_DL and others later on.
+	tristate "'capacity governor' cpufreq governor"
+	depends on CPU_FREQ
+	select CPU_FREQ_GOV_COMMON
+	help
+	  'cap_gov' - this governor scales cpu frequency from the
same as above
+	  scheduler as a function of cpu capacity utilization. It does
+	  not evaluate utilization on a periodic basis (unlike ondemand)
+	  but instead is invoked from CFS when updating per-entity load
+	  tracking statistics.
perhaps add something to the effect that it is more responsive than existing governors to really sell it? :)
Good idea. I'll add,
"Response to changes in load is improved over polling governors due to its event-driven design"
+	  If in doubt, say N.
+
 comment "CPU frequency scaling drivers"
 config CPUFREQ_DT

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 7cdf63a..4fc066f 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -488,6 +488,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
 #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
 extern struct cpufreq_governor cpufreq_gov_conservative;
 #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV)
+extern struct cpufreq_governor cpufreq_gov_cap_gov;
+#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_cap_gov)
 #endif
/*********************************************************************

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 46be870..da601d5 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_CAP_GOV) += cap_gov.o

diff --git a/kernel/sched/cap_gov.c b/kernel/sched/cap_gov.c
new file mode 100644
index 0000000..72873ab
--- /dev/null
+++ b/kernel/sched/cap_gov.c
@@ -0,0 +1,361 @@
+/*
+ * Copyright (C) 2014 Michael Turquette <mturquette@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/percpu.h>
+#include "sched.h"
+#define UP_THRESHOLD 95
Perhaps add a comment that this probably belongs as a sysfs tunable?
Doh, this shouldn't be here at all. I don't use any up or down thresholds in this version.
+#define THROTTLE_NSEC 50000000 /* 50ms default */
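If throttle_nsec does end up as a sysfs tunable, a minimal sketch using cpufreq's freq_attr machinery could look like the following (attribute names are hypothetical, locking is elided, and the attribute would still need to be registered against the policy):

static ssize_t show_throttle_nsec(struct cpufreq_policy *policy, char *buf)
{
	struct gov_data *gd = policy->gov_data;

	/* gov_data may not exist until CPUFREQ_GOV_START has run */
	return sprintf(buf, "%u\n", gd ? gd->throttle_nsec : 0);
}

static ssize_t store_throttle_nsec(struct cpufreq_policy *policy,
				   const char *buf, size_t count)
{
	struct gov_data *gd = policy->gov_data;
	unsigned int val;

	if (!gd || kstrtouint(buf, 10, &val))
		return -EINVAL;

	gd->throttle_nsec = val;
	return count;
}
/* expands to a rw freq_attr named throttle_nsec */
cpufreq_freq_attr_rw(throttle_nsec);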
+/*
+ * per-cpu pointer to atomic_t gov_data->cap_gov_wake_task
s/cap_gov_wake_task/need_wake_task/
Ack. I might also be able to get rid of this entirely with the irq_work stuff.
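For reference, the irq_work variant might look something like this minimal sketch (function names here are made up, not from the patch; each per-cpu work would be set up once via init_irq_work during governor init):

#include <linux/irq_work.h>

static DEFINE_PER_CPU(struct irq_work, cap_gov_irq_work);

/* runs from the irq_work self-IPI; a safe context to wake the kthread */
static void cap_gov_irq_work_fn(struct irq_work *work)
{
	cap_gov_kick_thread(smp_processor_id());
}

/* scheduler hot path: queue the irq_work instead of setting a flag */
static inline void cap_gov_queue_update(void)
{
	irq_work_queue(this_cpu_ptr(&cap_gov_irq_work));
}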
+ * used in scheduler hot paths {en,de}queue, task_tick without having to
+ * access struct cpufreq_policy and struct gov_data
+ */
+static DEFINE_PER_CPU(atomic_t *, cap_gov_wake_task);
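To illustrate the intent, a hot-path caller might do no more than the sketch below (the hook name is hypothetical; the point is that it never touches policy->rwsem):

/* e.g. from enqueue/dequeue in fair.c; must stay lockless */
static inline void cap_gov_request_freq_eval(int cpu)
{
	atomic_t *wake = per_cpu(cap_gov_wake_task, cpu);

	/* NULL until CPUFREQ_GOV_START wires the pointer up for this cpu */
	if (wake)
		atomic_set(wake, 1);
}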
+/**
+ * gov_data - per-policy data internal to the governor
+ * @throttle: time until throttling period expires. Derived from THROTTLE_NSEC
@throttle_nsec ?
Ack.
+ * @task: worker task for dvfs transition that may block/sleep
+ * @need_wake_task: flag the governor to wake this policy's worker thread
+ *
+ * struct gov_data is the per-policy cap_gov-specific data structure. A
+ * per-policy instance of it is created when the cap_gov governor receives
+ * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
+ * member of struct cpufreq_policy.
+ *
+ * Readers of this data must call down_read(policy->rwsem). Writers must
+ * call down_write(policy->rwsem).
+ */
+struct gov_data {
+	ktime_t throttle;
+	unsigned int throttle_nsec;
+	struct task_struct *task;
+	atomic_t need_wake_task;
+};
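A reader following the locking rule above might look like this sketch (the helper name is made up for illustration):

static bool cap_gov_is_throttled(struct cpufreq_policy *policy)
{
	struct gov_data *gd;
	bool throttled = false;

	down_read(&policy->rwsem);
	gd = policy->gov_data;
	if (gd)
		throttled = ktime_before(ktime_get(), gd->throttle);
	up_read(&policy->rwsem);

	return throttled;
}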
+/**
+ * cap_gov_select_freq - pick the next frequency for a cpu
+ * @cpu: the cpu whose frequency may be changed
+ *
+ * cap_gov_select_freq works in a way similar to the ondemand governor. First
+ * we inspect the utilization of all of the cpus in this policy to find the
+ * most utilized cpu. This is achieved by calling get_cpu_usage, which returns
+ * frequency-invariant capacity utilization.
+ *
+ * This max utilization is compared against the up_threshold (default 95%
+ * utilization). If the max cpu utilization is greater than this threshold then
+ * we scale the policy up to the max frequency. Otherwise we find the lowest
+ * frequency (smallest cpu capacity) that is still larger than the max capacity
+ * utilization for this policy.
+ *
+ * Returns frequency selected.
+ */
+static unsigned long cap_gov_select_freq(struct cpufreq_policy *policy)
+{
+	int cpu = 0;
+	struct gov_data *gd;
+	int index;
+	unsigned long freq = 0, max_usage = 0, cap = 0, usage = 0;
+	struct cpufreq_frequency_table *pos;
+
+	if (!policy->gov_data)
+		goto out;
+
+	gd = policy->gov_data;
+
+	/*
+	 * get_cpu_usage is called without locking the runqueues. This is the
+	 * same behavior used by find_busiest_cpu in load_balance. We are
+	 * willing to accept occasionally stale data here in exchange for
+	 * lockless behavior.
+	 */
+	for_each_cpu(cpu, policy->cpus) {
+		usage = get_cpu_usage(cpu);
+		trace_printk("cpu = %d usage = %lu", cpu, usage);
+		if (usage > max_usage)
+			max_usage = usage;
+	}
+	trace_printk("max_usage = %lu", max_usage);
+
+	/* find the utilization threshold at which we scale up frequency */
+	index = cpufreq_frequency_table_get_index(policy, policy->cur);
+
+	/*
+	 * converge towards max_usage. We want the lowest frequency whose
+	 * capacity is >= to max_usage. In other words:
+	 *
+	 * find capacity == floor(usage)
+	 *
+	 * Sadly cpufreq freq tables are not guaranteed to be ordered by
+	 * frequency...
+	 */
+	freq = policy->max;
+	cpufreq_for_each_entry(pos, policy->freq_table) {
+		cap = pos->frequency * SCHED_CAPACITY_SCALE /
+			policy->max;
+		if (max_usage < cap && pos->frequency < freq)
+			freq = pos->frequency;
+		trace_printk("cpu = %u max_usage = %lu cap = %lu \
+				table_freq = %u freq = %lu",
+				cpumask_first(policy->cpus), max_usage, cap,
+				pos->frequency, freq);
+	}
+
+out:
+	trace_printk("cpu %d final freq %lu", cpu, freq);
+	return freq;
+}
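To make the capacity mapping concrete, here is a worked example with a hypothetical frequency table (not from the patch):

/*
 * Hypothetical policy: policy->max = 2000000 kHz,
 * SCHED_CAPACITY_SCALE = 1024, table = {500000, 1000000, 1500000, 2000000}.
 *
 *   cap(500000)  = 500000  * 1024 / 2000000 = 256
 *   cap(1000000) = 1000000 * 1024 / 2000000 = 512
 *   cap(1500000) = 1500000 * 1024 / 2000000 = 768
 *   cap(2000000) = 2000000 * 1024 / 2000000 = 1024
 *
 * With max_usage = 600, the lowest entry whose capacity exceeds it is
 * 1500000 kHz, so cap_gov_select_freq returns 1500000.
 */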
+/*
+ * we pass in struct cpufreq_policy. This is safe because changing out the
+ * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
+ * which tears down all of the data structures and __cpufreq_governor(policy,
+ * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
+ * new policy pointer
+ */
+static int cap_gov_thread(void *data)
+{
+	struct sched_param param;
+	struct cpufreq_policy *policy;
+	struct gov_data *gd;
+	unsigned long freq;
+	int ret;
+
+	policy = (struct cpufreq_policy *) data;
+	if (!policy) {
+		pr_warn("%s: missing policy\n", __func__);
+		do_exit(-EINVAL);
+	}
+
+	gd = policy->gov_data;
+	if (!gd) {
+		pr_warn("%s: missing governor data\n", __func__);
+		do_exit(-EINVAL);
+	}
+
+	param.sched_priority = 0;
+	sched_setscheduler(current, SCHED_FIFO, &param);
+	set_cpus_allowed_ptr(current, policy->related_cpus);
+
+	/* main loop of the per-policy kthread */
+	do {
+		down_write(&policy->rwsem);
+		if (!atomic_read(&gd->need_wake_task)) {
+			if (kthread_should_stop())
+				break;
+			trace_printk("NOT waking up kthread (%d)", gd->task->pid);
+			up_write(&policy->rwsem);
+			set_current_state(TASK_INTERRUPTIBLE);
+			schedule();
+			continue;
+		}
+
+		trace_printk("kthread %d requested freq switch", gd->task->pid);
+
+		freq = cap_gov_select_freq(policy);
+
+		ret = __cpufreq_driver_target(policy, freq,
+				CPUFREQ_RELATION_H);
+		if (ret)
+			pr_debug("%s: __cpufreq_driver_target returned %d\n",
+					__func__, ret);
+
+		trace_printk("kthread %d requested freq switch", gd->task->pid);
+
+		gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
+		atomic_set(&gd->need_wake_task, 0);
+		up_write(&policy->rwsem);
+	} while (!kthread_should_stop());
+
+	do_exit(0);
+}
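The CPUFREQ_GOV_START/CPUFREQ_GOV_STOP handlers are not part of this hunk; a rough sketch of the lifecycle described above could look like the following (names, the kthread name format, and error paths are all assumptions, not the patch's actual code):

static int cap_gov_start(struct cpufreq_policy *policy)
{
	struct gov_data *gd;

	gd = kzalloc(sizeof(*gd), GFP_KERNEL);
	if (!gd)
		return -ENOMEM;

	gd->throttle_nsec = THROTTLE_NSEC;
	policy->gov_data = gd;

	/* the kthread binds itself to policy->related_cpus when it runs */
	gd->task = kthread_run(cap_gov_thread, policy, "cap_gov:%d",
			       cpumask_first(policy->related_cpus));
	if (IS_ERR(gd->task)) {
		policy->gov_data = NULL;
		kfree(gd);
		return PTR_ERR(gd->task);
	}

	return 0;
}

static void cap_gov_stop(struct cpufreq_policy *policy)
{
	struct gov_data *gd = policy->gov_data;

	kthread_stop(gd->task);
	policy->gov_data = NULL;
	kfree(gd);
}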
+static void cap_gov_wake_up_process(struct task_struct *task)
+{
+	/* this is null during early boot */
+	if (IS_ERR_OR_NULL(task))
+		return;
+
+	wake_up_process(task);
+}
+void cap_gov_kick_thread(int cpu)
+{
+	struct cpufreq_policy *policy;
+	struct gov_data *gd = NULL;
+
+	policy = cpufreq_cpu_get(cpu);
+	if (IS_ERR_OR_NULL(policy))
+		return;
+
+	gd = policy->gov_data;
+	if (!gd)
+		goto out;
+
+	/* per-cpu access not needed here since we have gd */
+	if (atomic_read(&gd->need_wake_task)) {
+		trace_printk("waking up kthread (%d)", gd->task->pid);
+		cap_gov_wake_up_process(gd->task);
+	}
+
+out:
+	cpufreq_cpu_put(policy);
+}
+/**
+ * cap_gov_update_cpu - interface to scheduler for changing capacity values
+ * @cpu: cpu whose capacity utilization has recently changed
+ *
+ * cap_gov_update_cpu is an interface exposed to the scheduler so that the
+ * scheduler may inform the governor of updates to capacity utilization and
+ * make changes to cpu frequency. Currently this interface is designed around
+ * PELT values in CFS. It can be expanded to other scheduling classes in the
+ * future if needed.
+ *
+ * The semantics of this call vary based on the cpu frequency scaling
+ * characteristics of the hardware.
+ *
+ * If kicking off a dvfs transition is an operation that might block or sleep
+ * in the cpufreq driver then we set the need_wake_task flag in this function
The comment here isn't obvious, since at first glance you don't touch need_wake_task. Perhaps clarify it as follows?
we set the need_wake_task (cap_gov_wake_task is a pointer to it)
I can do that. Additionally, the kerneldoc description should remove all of the text about hardware that has async/non-blocking dvfs transitions. This version of the patch ALWAYS kicks the kthread, and the previous "driver_might_sleep" bool has been removed.
Trying to keep the submission as simple and not-over-engineered as possible.
Regards,
Mike