Quoting Amit Kucheria (2015-04-16 00:11:22)
On Thu, Apr 16, 2015 at 10:59 AM, Michael Turquette mturquette@linaro.org wrote:
+config CPU_FREQ_GOV_CAP_GOV
Two GOVs are redundant here and make it hard to read. A few name suggestions for your baby:
CPU_FREQ_GOV_SCHED_CAP
CPU_FREQ_GOV_SCHED_STATS
I don't want the name to be too generic, since we're only dealing with cfs right now. Perhaps your SCHED_CAP variant or maybe CPU_FREQ_GOV_SCHED_CFS? That leaves room for SCHED_DL and others later on.
+	tristate "'capacity governor' cpufreq governor"
+	depends on CPU_FREQ
+	select CPU_FREQ_GOV_COMMON
+	help
+	  'cap_gov' - this governor scales cpu frequency from the

same as above

+	  scheduler as a function of cpu capacity utilization. It does
+	  not evaluate utilization on a periodic basis (unlike ondemand)
+	  but instead is invoked from CFS when updating per-entity load
+	  tracking statistics.

Perhaps add something to the effect that it is more responsive than existing governors to really sell it? :)
Good idea. I'll add,
"Response to changes in load is improved over polling governors due to its event-driven design"
+	  If in doubt, say N.
 comment "CPU frequency scaling drivers"
 config CPUFREQ_DT
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 7cdf63a..4fc066f 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -488,6 +488,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
 #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
 extern struct cpufreq_governor cpufreq_gov_conservative;
 #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV)
+extern struct cpufreq_governor cpufreq_gov_cap_gov;
+#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_cap_gov)
 #endif
 /*********************************************************************
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 46be870..da601d5 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_CAP_GOV) += cap_gov.o
diff --git a/kernel/sched/cap_gov.c b/kernel/sched/cap_gov.c
new file mode 100644
index 0000000..72873ab
--- /dev/null
+++ b/kernel/sched/cap_gov.c
@@ -0,0 +1,361 @@
+/*
+ * Copyright (C) 2014 Michael Turquette <mturquette@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/percpu.h>
+#include "sched.h"
+#define UP_THRESHOLD 95
Add a comment that this probably belongs in a sysfs tunable.
Doh, this shouldn't be here at all. I don't use any up or down thresholds in this version.
+#define THROTTLE_NSEC 50000000 /* 50ms default */
+/*
+ * per-cpu pointer to atomic_t gov_data->cap_gov_wake_task

s/cap_gov_wake_task/need_wake_task/

Ack. I might also be able to get rid of this entirely with the irq_work stuff.

+ * used in scheduler hot paths {en,de}queue, task_tick without having to
+ * access struct cpufreq_policy and struct gov_data
+ */
+static DEFINE_PER_CPU(atomic_t *, cap_gov_wake_task);
+/**
+ * gov_data - per-policy data internal to the governor
+ * @throttle: time until throttling period expires. Derived from THROTTLE_NSEC

@throttle_nsec ?

Ack.

+ * @task: worker task for dvfs transition that may block/sleep
+ * @need_wake_task: flag the governor to wake this policy's worker thread
+ *
+ * struct gov_data is the per-policy cap_gov-specific data structure. A
+ * per-policy instance of it is created when the cap_gov governor receives
+ * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
+ * member of struct cpufreq_policy.
+ *
+ * Readers of this data must call down_read(policy->rwsem). Writers must
+ * call down_write(policy->rwsem).
+ */
+struct gov_data {
+	ktime_t throttle;
+	unsigned int throttle_nsec;
+	struct task_struct *task;
+	atomic_t need_wake_task;
+};
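For anyone following along, here is a rough user-space sketch of how I read the @throttle / @throttle_nsec pair: throttle_nsec is the length of the throttling period and throttle is the absolute deadline derived from it. The quoted part of the patch only shows the deadline being set (in the kthread), so the throttle_active() check below is my assumption about how it would be consumed. struct throttle_state and both helper names are hypothetical, and plain uint64_t nanoseconds stand in for ktime_t.

```c
#include <stdbool.h>
#include <stdint.h>

#define THROTTLE_NSEC 50000000ULL /* 50ms default, same value as the patch */

/* hypothetical user-space stand-in for the throttle fields of struct gov_data */
struct throttle_state {
	uint64_t throttle;      /* absolute time (ns) at which throttling ends */
	uint64_t throttle_nsec; /* length of the throttling period (ns) */
};

/* mirrors gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec) */
static void throttle_arm(struct throttle_state *ts, uint64_t now_ns)
{
	ts->throttle = now_ns + ts->throttle_nsec;
}

/* assumed consumer: true while a new frequency change should be held back */
static bool throttle_active(const struct throttle_state *ts, uint64_t now_ns)
{
	return now_ns < ts->throttle;
}
```

Arming at time t means throttle_active() stays true until t + throttle_nsec, which is why documenting both fields in the kerneldoc seems worthwhile.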
+/**
+ * cap_gov_select_freq - pick the next frequency for a cpu
+ * @cpu: the cpu whose frequency may be changed
+ *
+ * cap_gov_select_freq works in a way similar to the ondemand governor. First
+ * we inspect the utilization of all of the cpus in this policy to find the
+ * most utilized cpu. This is achieved by calling get_cpu_usage, which returns
+ * frequency-invariant capacity utilization.
+ *
+ * This max utilization is compared against the up_threshold (default 95%
+ * utilization). If the max cpu utilization is greater than this threshold then
+ * we scale the policy up to the max frequency. Otherwise we find the lowest
+ * frequency (smallest cpu capacity) that is still larger than the max capacity
+ * utilization for this policy.
+ *
+ * Returns frequency selected.
+ */
+static unsigned long cap_gov_select_freq(struct cpufreq_policy *policy)
+{
+	int cpu = 0;
+	struct gov_data *gd;
+	int index;
+	unsigned long freq = 0, max_usage = 0, cap = 0, usage = 0;
+	struct cpufreq_frequency_table *pos;
+
+	if (!policy->gov_data)
+		goto out;
+
+	gd = policy->gov_data;
+
+	/*
+	 * get_cpu_usage is called without locking the runqueues. This is the
+	 * same behavior used by find_busiest_cpu in load_balance. We are
+	 * willing to accept occasionally stale data here in exchange for
+	 * lockless behavior.
+	 */
+	for_each_cpu(cpu, policy->cpus) {
+		usage = get_cpu_usage(cpu);
+		trace_printk("cpu = %d usage = %lu", cpu, usage);
+		if (usage > max_usage)
+			max_usage = usage;
+	}
+	trace_printk("max_usage = %lu", max_usage);
+
+	/* find the utilization threshold at which we scale up frequency */
+	index = cpufreq_frequency_table_get_index(policy, policy->cur);
+
+	/*
+	 * converge towards max_usage. We want the lowest frequency whose
+	 * capacity is >= to max_usage. In other words:
+	 *
+	 * 	find capacity == floor(usage)
+	 *
+	 * Sadly cpufreq freq tables are not guaranteed to be ordered by
+	 * frequency...
+	 */
+	freq = policy->max;
+	cpufreq_for_each_entry(pos, policy->freq_table) {
+		cap = pos->frequency * SCHED_CAPACITY_SCALE /
+			policy->max;
+		if (max_usage < cap && pos->frequency < freq)
+			freq = pos->frequency;
+		trace_printk("cpu = %u max_usage = %lu cap = %lu \
+				table_freq = %u freq = %lu",
+				cpumask_first(policy->cpus), max_usage, cap,
+				pos->frequency, freq);
+	}
+
+out:
+	trace_printk("cpu %d final freq %lu", cpu, freq);
+	return freq;
+}
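To make the selection loop easier to reason about outside the kernel, here is a minimal user-space sketch of the same table walk. It assumes a plain array in place of the cpufreq frequency table and a SCHED_CAPACITY_SCALE of 1024; select_freq() and its parameters are hypothetical names, not part of the patch. Note that it keeps the strict "max_usage < cap" comparison from the code above, even though the comment there says ">=".

```c
#include <stddef.h>

#define SCHED_CAPACITY_SCALE 1024UL /* assumed scale for this sketch */

/*
 * Hypothetical stand-in for the loop in cap_gov_select_freq: return the
 * lowest table frequency whose scaled capacity is strictly greater than
 * max_usage, falling back to max_freq when no entry qualifies. The table
 * does not need to be sorted.
 */
static unsigned long select_freq(const unsigned int *table, size_t n,
				 unsigned long max_freq,
				 unsigned long max_usage)
{
	unsigned long freq = max_freq;
	size_t i;

	for (i = 0; i < n; i++) {
		/* capacity of this entry relative to the maximum frequency */
		unsigned long cap = table[i] * SCHED_CAPACITY_SCALE / max_freq;

		if (max_usage < cap && table[i] < freq)
			freq = table[i];
	}

	return freq;
}
```

With an unsorted table of 1000000, 500000, 2000000 and 1500000 kHz and max_freq of 2000000, the capacities work out to 512, 256, 1024 and 768, so a max_usage of 300 selects 1000000 and a max_usage of exactly 512 selects 1500000 because of the strict comparison.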
+/*
+ * we pass in struct cpufreq_policy. This is safe because changing out the
+ * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
+ * which tears down all of the data structures and __cpufreq_governor(policy,
+ * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
+ * new policy pointer
+ */
+static int cap_gov_thread(void *data)
+{
+	struct sched_param param;
+	struct cpufreq_policy *policy;
+	struct gov_data *gd;
+	unsigned long freq;
+	int ret;
+
+	policy = (struct cpufreq_policy *) data;
+	if (!policy) {
+		pr_warn("%s: missing policy\n", __func__);
+		do_exit(-EINVAL);
+	}
+
+	gd = policy->gov_data;
+	if (!gd) {
+		pr_warn("%s: missing governor data\n", __func__);
+		do_exit(-EINVAL);
+	}
+
+	param.sched_priority = 0;
+	sched_setscheduler(current, SCHED_FIFO, &param);
+	set_cpus_allowed_ptr(current, policy->related_cpus);
+
+	/* main loop of the per-policy kthread */
+	do {
+		down_write(&policy->rwsem);
+		if (!atomic_read(&gd->need_wake_task)) {
+			if (kthread_should_stop())
+				break;
+			trace_printk("NOT waking up kthread (%d)", gd->task->pid);
+			up_write(&policy->rwsem);
+			set_current_state(TASK_INTERRUPTIBLE);
+			schedule();
+			continue;
+		}
+
+		trace_printk("kthread %d requested freq switch", gd->task->pid);
+
+		freq = cap_gov_select_freq(policy);
+
+		ret = __cpufreq_driver_target(policy, freq,
+				CPUFREQ_RELATION_H);
+		if (ret)
+			pr_debug("%s: __cpufreq_driver_target returned %d\n",
+					__func__, ret);
+
+		trace_printk("kthread %d requested freq switch", gd->task->pid);
+
+		gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
+		atomic_set(&gd->need_wake_task, 0);
+		up_write(&policy->rwsem);
+	} while (!kthread_should_stop());
+
+	do_exit(0);
+}
+static void cap_gov_wake_up_process(struct task_struct *task)
+{
+	/* this is null during early boot */
+	if (IS_ERR_OR_NULL(task)) {
+		return;
+	}
+
+	wake_up_process(task);
+}
+void cap_gov_kick_thread(int cpu)
+{
+	struct cpufreq_policy *policy;
+	struct gov_data *gd = NULL;
+
+	policy = cpufreq_cpu_get(cpu);
+	if (IS_ERR_OR_NULL(policy))
+		return;
+
+	gd = policy->gov_data;
+	if (!gd)
+		goto out;
+
+	/* per-cpu access not needed here since we have gd */
+	if (atomic_read(&gd->need_wake_task)) {
+		trace_printk("waking up kthread (%d)", gd->task->pid);
+		cap_gov_wake_up_process(gd->task);
+	}
+
+out:
+	cpufreq_cpu_put(policy);
+}
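The flag handshake between the scheduler side and the worker is easy to lose in the locking details, so here is a small user-space analogue using C11 atomics. It is only a sketch under my own assumptions: struct kick_state, the wakeups counter and all four function names are hypothetical, and counting wakeups stands in for wake_up_process(). The flag is set by the requester, checked by the kick path, and cleared by the worker after the frequency change, as in the kthread loop above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* hypothetical user-space stand-in for the need_wake_task handshake */
struct kick_state {
	atomic_int need_wake_task; /* non-zero when a frequency change is pending */
	int wakeups;               /* counts how often the worker would be woken */
};

static void kick_state_init(struct kick_state *ks)
{
	atomic_init(&ks->need_wake_task, 0);
	ks->wakeups = 0;
}

/* requester side: mark that the worker has something to do */
static void request_freq_change(struct kick_state *ks)
{
	atomic_store(&ks->need_wake_task, 1);
}

/* analogue of cap_gov_kick_thread: wake the worker only if the flag is set */
static bool kick_thread(struct kick_state *ks)
{
	if (!atomic_load(&ks->need_wake_task))
		return false;

	ks->wakeups++; /* stands in for wake_up_process() */
	return true;
}

/* worker side: clear the flag once the frequency change has been issued */
static void worker_done(struct kick_state *ks)
{
	atomic_store(&ks->need_wake_task, 0);
}
```

A kick without a pending request is a no-op, which is what keeps calls from the scheduler hot paths cheap when nothing has changed.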
+/**
+ * cap_gov_update_cpu - interface to scheduler for changing capacity values
+ * @cpu: cpu whose capacity utilization has recently changed
+ *
+ * cap_gov_update_cpu is an interface exposed to the scheduler so that the
+ * scheduler may inform the governor of updates to capacity utilization and
+ * make changes to cpu frequency. Currently this interface is designed around
+ * PELT values in CFS. It can be expanded to other scheduling classes in the
+ * future if needed.
+ *
+ * The semantics of this call vary based on the cpu frequency scaling
+ * characteristics of the hardware.
+ *
+ * If kicking off a dvfs transition is an operation that might block or sleep
+ * in the cpufreq driver then we set the need_wake_task flag in this function
The comment here isn't obvious, since at first glance you don't touch need_wake_task. Perhaps clarify it as follows?
we set the need_wake_task (cap_gov_wake_task is a pointer to it)
I can do that. Additionally the kerneldoc description should remove all of the text about hardware that has async/non-blocking dvfs transition. This version of the patch ALWAYS kicks the kthread and the previous "driver_might_sleep" bool has been removed.
Trying to keep the submission as simple and not-over-engineered as possible.
Regards, Mike