[PATCH 0/6] scheduler-driven cpu frequency scaling

List overview All Threads
Download

newer

older

Updated Invitation: sched-dvfs...

Invitation: sched-dvfs design...

Juri Lelli

9 Mar 2015 9 Mar '15

2:06 p.m.

Hi all,

as agreed with Mike, I'm posting here his sched-dvfs patchset [1] plus my EAS bindings to make it use the energy model. This is meant to re-start discussion on the attempt to drive dvfs directly from the scheduler.

This patchset is based on 4.0-rc2 + EASv3 [2].

Mike is already working on a slightly different, more "reactive", approach. The way we can use EAS infrastructure shouldn't change much anyway, so this works also as a first example of such thing. The core of my little delta resides in the possibility to read the energy tables to get freq and uarch invariance in the new governor (4/6) and the use of get_cpu_usage() to get the utilization signal from the scheduler (that I squashed in 02/06 together with some little fixes).

Best,

- Juri

[1] https://lkml.org/lkml/2014/10/22/21 [2] https://lkml.org/lkml/2015/2/4/537

Juri Lelli (3): sched/energy_model: use EAS energy model sched/energy_model: bound kthreads to run on related_cpus sched/energy_model: fix spurious kicks

Michael Turquette (1): sched: cfs: cpu frequency scaling based on task placement

Mike Turquette (2): cpufreq: add per-governor private data sched: energy_model: simple cpu frequency scaling policy

drivers/cpufreq/Kconfig | 21 +++ include/linux/cpufreq.h | 6 + kernel/sched/Makefile | 1 + kernel/sched/energy_model.c | 364 ++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 51 ++++++- kernel/sched/sched.h | 5 + 6 files changed, 441 insertions(+), 7 deletions(-) create mode 100644 kernel/sched/energy_model.c

-- 2.2.2

Show replies by date

Juri Lelli

9 Mar 9 Mar

2:06 p.m.

New subject: [PATCH 1/6] cpufreq: add per-governor private data

From: Mike Turquette mturquette@linaro.org

Cc: Viresh Kumar viresh.kumar@linaro.org Cc: Rafael J. Wysocki rjw@rjwysocki.net Signed-off-by: Mike Turquette mturquette@linaro.org Signed-off-by: Michael Turquette mturquette@linaro.org --- include/linux/cpufreq.h | 3 +++ 1 file changed, 3 insertions(+)

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index 2ee4888..7cdf63a 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h @@ -116,6 +116,9 @@ struct cpufreq_policy {

/* For cpufreq driver's internal use */ void *driver_data; + + /* For cpufreq governor's internal use */ + void *gov_data; };

/* Only for ACPI */ -- 2.2.2

Juri Lelli

2:06 p.m.

New subject: [PATCH 2/6] sched: energy_model: simple cpu frequency scaling policy

From: Mike Turquette mturquette@linaro.org

Building on top of the scale invariant capacity patches and earlier patches in this series that prepare CFS for scaling cpu frequency, this patch implements a simple, naive ondemand-like cpu frequency scaling policy that is driven by enqueue_task_fair and dequeue_tassk_fair. This new policy is named "energy_model" as an homage to the on-going work in that area. It is NOT an actual energy model.

This policy is implemented using the CPUfreq governor interface for two main reasons:

1) re-using the CPUfreq machine drivers without using the governor interface is hard. I do not forsee any issue continuing to use the governor interface going forward but it is worth making clear what this patch does up front.

2) using the CPUfreq interface allows us to switch between the energy_model governor and other CPUfreq governors (such as ondemand) at run-time. This is very useful for comparative testing and tuning.

A caveat to #2 above is that the weak arch function used by the governor means that only one scheduler-driven policy can be linked at a time. This limitation does not apply to "traditional" governors. I raised this in my previous capacity_ops patches[0] but as discussed at LPC14 last week, it seems desirable to pursue a single cpu frequency scaling policy at first, and try to make that work for everyone interested in using it. If that model breaks down then we can revisit the idea of dynamic selection of scheduler-driven cpu frequency scaling.

Unlike legacy CPUfreq governors, this policy does not implement its own logic loop (such as a workqueue triggered by a timer), but instead uses an event-driven design. Frequency is evaluated by entering {en,de}queue_task_fair and then a kthread is woken from run_rebalance_domains which scales cpu frequency based on the latest evaluation.

The policy implemented in this patch takes the highest cpu utilization from policy->cpus and uses that select a frequency target based on the same 80%/20% thresholds used as defaults in ondemand. Frequenecy-scaled thresholds are pre-computed when energy_model inits. The frequency selection is a simple comparison of cpu utilization (as defined in Morten's latest RFC) to the threshold values. In the future this logic could be replaced with something more sophisticated that uses PELT to get a historical overview. Ideas are welcome.

Note that the pre-computed thresholds above do not take into account micro-architecture differences (SMT or big.LITTLE hardware), only frequency invariance.

Not-signed-off-by: Mike Turquette mturquette@linaro.org

Signed-off-by: Michael Turquette mturquette@linaro.org --- drivers/cpufreq/Kconfig | 21 +++ include/linux/cpufreq.h | 3 + kernel/sched/Makefile | 1 + kernel/sched/energy_model.c | 343 ++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 4 +- kernel/sched/sched.h | 3 + 6 files changed, 373 insertions(+), 2 deletions(-) create mode 100644 kernel/sched/energy_model.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index a171fef..7792646 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE Be aware that not all cpufreq drivers support the conservative governor. If unsure have a look at the help section of the driver. Fallback governor will be the performance governor. + +config CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL + bool "energy_model" + select CPU_FREQ_GOV_ENERGY_MODEL + select CPU_FREQ_GOV_PERFORMANCE + help + Use the CPUfreq governor 'energy_model' as default. This + scales cpu frequency from the scheduler as per-task statistics + are updated. endchoice

config CPU_FREQ_GOV_PERFORMANCE @@ -198,6 +207,18 @@ config CPUFREQ_DT

If in doubt, say N.

+config CPU_FREQ_GOV_ENERGY_MODEL + tristate "'energy model' cpufreq governor" + depends on CPU_FREQ + select CPU_FREQ_GOV_COMMON + help + 'energy_model' - this governor scales cpu frequency from the + scheduler as a function of cpu utilization. It does not + evaluate utilization on a periodic basis (unlike ondemand) but + instead is invoked from CFS when updating per-task statistics. + + If in doubt, say N. + if X86 source "drivers/cpufreq/Kconfig.x86" endif diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index 7cdf63a..03b0ff5 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h @@ -488,6 +488,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand; #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE) extern struct cpufreq_governor cpufreq_gov_conservative; #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative) +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL) +extern struct cpufreq_governor cpufreq_gov_energy_model; +#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_energy_model) #endif

/********************************************************************* diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 46be870..b75721d 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o +obj-$(CONFIG_CPU_FREQ_GOV_ENERGY_MODEL) += energy_model.o diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c new file mode 100644 index 0000000..7fec5af --- /dev/null +++ b/kernel/sched/energy_model.c @@ -0,0 +1,343 @@ +/* + * Copyright (C) 2014 Michael Turquette mturquette@linaro.org + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include <linux/cpufreq.h> +#include <linux/module.h> +#include <linux/kthread.h> + +#include "sched.h" + +#define THROTTLE_MSEC 50 +#define UP_THRESHOLD 80 +#define DOWN_THRESHOLD 20 + +/** + * em_data - per-policy data used by energy_mode + * @throttle: bail if current time is less than than ktime_throttle. + * Derived from THROTTLE_MSEC + * @up_threshold: table of normalized capacity states to determine if cpu + * should run faster. Derived from UP_THRESHOLD + * @down_threshold: table of normalized capacity states to determine if cpu + * should run slower. Derived from DOWN_THRESHOLD + * + * struct em_data is the per-policy energy_model-specific data structure. A + * per-policy instance of it is created when the energy_model governor receives + * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data + * member of struct cpufreq_policy. + * + * Readers of this data must call down_read(policy->rwsem). Writers must + * call down_write(policy->rwsem). + */ +struct em_data { + /* per-policy throttling */ + ktime_t throttle; + unsigned int *up_threshold; + unsigned int *down_threshold; + struct task_struct *task; + atomic_long_t target_freq; + atomic_t need_wake_task; +}; + +/* + * we pass in struct cpufreq_policy. This is safe because changing out the + * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP), + * which tears all of the data structures down and __cpufreq_governor(policy, + * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the + * new policy pointer + */ +static int energy_model_thread(void *data) +{ + struct sched_param param; + struct cpufreq_policy *policy; + struct em_data *em; + int ret; + + policy = (struct cpufreq_policy *) data; + if (!policy) { + pr_warn("%s: missing policy\n", __func__); + do_exit(-EINVAL); + } + + em = policy->gov_data; + if (!em) { + pr_warn("%s: missing governor data\n", __func__); + do_exit(-EINVAL); + } + + param.sched_priority = 0; + sched_setscheduler(current, SCHED_FIFO, &param); + + + do { + down_write(&policy->rwsem); + if (!atomic_read(&em->need_wake_task)) { + up_write(&policy->rwsem); + set_current_state(TASK_INTERRUPTIBLE); + schedule(); + continue; + } + + ret = __cpufreq_driver_target(policy, atomic_read(&em->target_freq), + CPUFREQ_RELATION_H); + if (ret) + pr_debug("%s: __cpufreq_driver_target returned %d\n", + __func__, ret); + + em->throttle = ktime_get(); + atomic_set(&em->need_wake_task, 0); + up_write(&policy->rwsem); + } while (!kthread_should_stop()); + + do_exit(0); +} + +static void em_wake_up_process(struct task_struct *task) +{ + /* this is null during early boot */ + if (IS_ERR_OR_NULL(task)) { + return; + } + + wake_up_process(task); +} + +void arch_scale_cpu_freq(void) +{ + struct cpufreq_policy *policy; + struct em_data *em; + int cpu; + + for_each_online_cpu(cpu) { + policy = cpufreq_cpu_get(cpu); + if (IS_ERR_OR_NULL(policy)) + continue; + + em = policy->gov_data; + if (!em) { + cpufreq_cpu_put(policy); + continue; + } + + /* + * FIXME replace the atomic stuff by holding write-locks + * in arch_eval_cpu_freq? + */ + if (atomic_read(&em->need_wake_task)) { + em_wake_up_process(em->task); + } + + cpufreq_cpu_put(policy); + } +} + +/** + * arch_eval_cpu_freq - scale cpu frequency based on CFS utilization + * @update_cpus: mask of CPUs with updated utilization and capacity + * + * Declared and weakly defined in kernel/sched/fair.c This definition overrides + * the default. In the case of CONFIG_FAIR_GROUP_SCHED, update_cpus may + * contains cpus that are not in the same policy. Otherwise update_cpus will be + * a single cpu. + * + * Holds read lock for policy->rw_sem. + * + * FIXME weak arch function means that only one definition of this function can + * be linked. How to support multiple energy model policies? + */ +void arch_eval_cpu_freq(struct cpumask *update_cpus) +{ + struct cpufreq_policy *policy; + struct em_data *em; + int index; + unsigned int cpu, tmp; + unsigned long percent_util = 0, max_util = 0, cap = 0, util = 0; + + /* + * In the case of CONFIG_FAIR_GROUP_SCHED, policy->cpus may be a subset + * of update_cpus. In such case take the first cpu in update_cpus, get + * its policy and try to scale the affects cpus. Then we clear the + * corresponding bits from update_cpus and try again. If a policy does + * not exist for a cpu then we remove that bit as well, preventing an + * infinite loop. + */ + while (!cpumask_empty(update_cpus)) { + percent_util = 0; + max_util = 0; + cap = 0; + util = 0; + + cpu = cpumask_first(update_cpus); + policy = cpufreq_cpu_get(cpu); + if (IS_ERR_OR_NULL(policy)) { + cpumask_clear_cpu(cpu, update_cpus); + continue; + } + + if (!policy->gov_data) + goto bail; + + em = policy->gov_data; + + if (ktime_before(ktime_get(), em->throttle)) { + trace_printk("THROTTLED"); + goto bail; + } + + /* + * try scaling cpus + * + * algorithm assumptions & description: + * all cpus in a policy run at the same rate/capacity. + * choose frequency target based on most utilized cpu. + * do not care about aggregating cpu utilization. + * do not track any historical trends beyond utilization + * if max_util > 80% of current capacity, + * go to max capacity + * if max_util < 20% of current capacity, + * go to the next lowest capacity + * otherwise, stay at the same capacity state + */ + for_each_cpu(tmp, policy->cpus) { + util = get_cpu_usage(tmp); + if (util > max_util) + max_util = util; + } + + cap = capacity_of(cpu); + if (!cap) { + goto bail; + } + + index = cpufreq_frequency_table_get_index(policy, policy->cur); + if (max_util > em->up_threshold[index]) { + /* write em->target_freq with read lock held */ + atomic_long_set(&em->target_freq, policy->max); + /* + * FIXME this is gross. convert arch_eval_cpu_freq to + * hold the write lock? + */ + atomic_set(&em->need_wake_task, 1); + } else if (max_util < em->down_threshold[index]) { + /* write em->target_freq with read lock held */ + atomic_long_set(&em->target_freq, policy->cur - 1); + /* + * FIXME this is gross. convert arch_eval_cpu_freq to + * hold the write lock? + */ + atomic_set(&em->need_wake_task, 1); + } + +bail: + /* remove policy->cpus fromm update_cpus */ + cpumask_andnot(update_cpus, update_cpus, policy->cpus); + cpufreq_cpu_put(policy); + } + + return; +} + +static void em_start(struct cpufreq_policy *policy) +{ + int index = 0, count = 0; + unsigned int capacity; + struct em_data *em; + struct cpufreq_frequency_table *pos; + + /* prepare per-policy private data */ + em = kzalloc(sizeof(*em), GFP_KERNEL); + if (!em) { + pr_debug("%s: failed to allocate private data\n", __func__); + return; + } + + policy->gov_data = em; + + /* how many entries in the frequency table? */ + cpufreq_for_each_entry(pos, policy->freq_table) + count++; + + /* pre-compute thresholds */ + em->up_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL); + em->down_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL); + + cpufreq_for_each_entry(pos, policy->freq_table) { + /* FIXME capacity below is not scaled for uarch */ + capacity = pos->frequency * SCHED_CAPACITY_SCALE / policy->max; + em->up_threshold[index] = capacity * UP_THRESHOLD / 100; + em->down_threshold[index] = capacity * DOWN_THRESHOLD / 100; + pr_debug("%s: cpu = %u index = %d capacity = %u up = %u down = %u\n", + __func__, cpumask_first(policy->cpus), index, + capacity, em->up_threshold[index], + em->down_threshold[index]); + index++; + } + + /* init per-policy kthread */ + em->task = kthread_create(energy_model_thread, policy, "kenergy_model_task"); + if (IS_ERR_OR_NULL(em->task)) + pr_err("%s: failed to create kenergy_model_task thread\n", __func__); +} + + +static void em_stop(struct cpufreq_policy *policy) +{ + struct em_data *em; + + em = policy->gov_data; + + kthread_stop(em->task); + + /* replace with devm counterparts */ + kfree(em->up_threshold); + kfree(em->down_threshold); + kfree(em); +} + +static int energy_model_setup(struct cpufreq_policy *policy, unsigned int event) +{ + switch (event) { + case CPUFREQ_GOV_START: + /* Start managing the frequency */ + em_start(policy); + return 0; + + case CPUFREQ_GOV_STOP: + em_stop(policy); + return 0; + + case CPUFREQ_GOV_LIMITS: /* unused */ + case CPUFREQ_GOV_POLICY_INIT: /* unused */ + case CPUFREQ_GOV_POLICY_EXIT: /* unused */ + break; + } + return 0; +} + +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL +static +#endif +struct cpufreq_governor cpufreq_gov_energy_model = { + .name = "energy_model", + .governor = energy_model_setup, + .owner = THIS_MODULE, +}; + +static int __init energy_model_init(void) +{ + return cpufreq_register_governor(&cpufreq_gov_energy_model); +} + +static void __exit energy_model_exit(void) +{ + cpufreq_unregister_governor(&cpufreq_gov_energy_model); +} + +/* Try to make this the default governor */ +fs_initcall(energy_model_init); + +MODULE_LICENSE("GPL"); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 188aa13..d9386b0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4385,7 +4385,7 @@ static unsigned long target_load(int cpu, int type) return max(rq->cpu_load[type-1], total); }

-static unsigned long capacity_of(int cpu) +unsigned long capacity_of(int cpu) { return cpu_rq(cpu)->cpu_capacity; } @@ -4609,7 +4609,7 @@ static int __get_cpu_usage(int cpu, int delta) return sum; }

-static int get_cpu_usage(int cpu) +int get_cpu_usage(int cpu) { return __get_cpu_usage(cpu, 0); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 23b4c11..c357c77 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -806,6 +806,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) return sd; }

+unsigned long capacity_of(int cpu); +int get_cpu_usage(int cpu); + DECLARE_PER_CPU(struct sched_domain *, sd_llc); DECLARE_PER_CPU(int, sd_llc_size); DECLARE_PER_CPU(int, sd_llc_id); -- 2.2.2

Daniel Lezcano

11 Mar 11 Mar

10:08 a.m.

New subject: [PATCH 2/6] sched: energy_model: simple cpu frequency scaling policy

On 03/09/2015 03:06 PM, Juri Lelli wrote:

...

From: Mike Turquette mturquette@linaro.org

Building on top of the scale invariant capacity patches and earlier patches in this series that prepare CFS for scaling cpu frequency, this patch implements a simple, naive ondemand-like cpu frequency scaling policy that is driven by enqueue_task_fair and dequeue_tassk_fair. This new policy is named "energy_model" as an homage to the on-going work in that area. It is NOT an actual energy model.

Well, I would change the name to something more trivial, eg: load-driven or whatever. Introducing 'energy model' here has no sense from my pov except than buzzing :)

...

This policy is implemented using the CPUfreq governor interface for two main reasons:

re-using the CPUfreq machine drivers without using the governor

interface is hard. I do not forsee any issue continuing to use the governor interface going forward but it is worth making clear what this patch does up front.

using the CPUfreq interface allows us to switch between the

energy_model governor and other CPUfreq governors (such as ondemand) at run-time. This is very useful for comparative testing and tuning.

A caveat to #2 above is that the weak arch function used by the governor means that only one scheduler-driven policy can be linked at a time. This limitation does not apply to "traditional" governors. I raised this in my previous capacity_ops patches[0] but as discussed at LPC14 last week, it seems desirable to pursue a single cpu frequency scaling policy at first, and try to make that work for everyone interested in using it. If that model breaks down then we can revisit the idea of dynamic selection of scheduler-driven cpu frequency scaling.

Unlike legacy CPUfreq governors, this policy does not implement its own logic loop (such as a workqueue triggered by a timer), but instead uses an event-driven design. Frequency is evaluated by entering {en,de}queue_task_fair and then a kthread is woken from run_rebalance_domains which scales cpu frequency based on the latest evaluation.

The policy implemented in this patch takes the highest cpu utilization from policy->cpus and uses that select a frequency target based on the same 80%/20% thresholds used as defaults in ondemand. Frequenecy-scaled thresholds are pre-computed when energy_model inits. The frequency selection is a simple comparison of cpu utilization (as defined in Morten's latest RFC) to the threshold values. In the future this logic could be replaced with something more sophisticated that uses PELT to get a historical overview. Ideas are welcome.

Note that the pre-computed thresholds above do not take into account micro-architecture differences (SMT or big.LITTLE hardware), only frequency invariance.

Not-signed-off-by: Mike Turquette mturquette@linaro.org

Signed-off-by: Michael Turquette mturquette@linaro.org

Please remove the Not-signed-off which means "I don't testify this is open source code" [1].

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documen...

And Juri, as you are in the delivery path, you should add your signed-off-by.

...

drivers/cpufreq/Kconfig | 21 +++ include/linux/cpufreq.h | 3 + kernel/sched/Makefile | 1 + kernel/sched/energy_model.c | 343 ++++++++++++++++++++++++++++++++++++++++++++

I remember Peter said he doesn't want a new file.

...

kernel/sched/fair.c | 4 +- kernel/sched/sched.h | 3 + 6 files changed, 373 insertions(+), 2 deletions(-) create mode 100644 kernel/sched/energy_model.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index a171fef..7792646 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE Be aware that not all cpufreq drivers support the conservative governor. If unsure have a look at the help section of the driver. Fallback governor will be the performance governor.

+config CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL

bool "energy_model"

select CPU_FREQ_GOV_ENERGY_MODEL

select CPU_FREQ_GOV_PERFORMANCE

Why is needed gov perf ?

...

help
 Use the CPUfreq governor 'energy_model' as default. This
 scales cpu frequency from the scheduler as per-task statistics
 are updated.
endchoice

config CPU_FREQ_GOV_PERFORMANCE
@@ -198,6 +207,18 @@ config CPUFREQ_DT
 If in doubt, say N.
+config CPU_FREQ_GOV_ENERGY_MODEL
tristate "'energy model' cpufreq governor"

depends on CPU_FREQ

select CPU_FREQ_GOV_COMMON

help
 'energy_model' - this governor scales cpu frequency from the
 scheduler as a function of cpu utilization. It does not
 evaluate utilization on a periodic basis (unlike ondemand) but
 instead is invoked from CFS when updating per-task statistics.
 If in doubt, say N.
if X86 source "drivers/cpufreq/Kconfig.x86" endif
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index 7cdf63a..03b0ff5 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h @@ -488,6 +488,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand; #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE) extern struct cpufreq_governor cpufreq_gov_conservative; #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative) +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL) +extern struct cpufreq_governor cpufreq_gov_energy_model; +#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_energy_model) #endif

/********************************************************************* diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 46be870..b75721d 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o +obj-$(CONFIG_CPU_FREQ_GOV_ENERGY_MODEL) += energy_model.o diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c new file mode 100644 index 0000000..7fec5af --- /dev/null +++ b/kernel/sched/energy_model.c @@ -0,0 +1,343 @@ +/*

Copyright (C) 2014 Michael Turquette mturquette@linaro.org

This program is free software; you can redistribute it and/or modify

it under the terms of the GNU General Public License version 2 as

published by the Free Software Foundation.

*/

+#include <linux/cpufreq.h> +#include <linux/module.h> +#include <linux/kthread.h>

+#include "sched.h"

+#define THROTTLE_MSEC 50 +#define UP_THRESHOLD 80 +#define DOWN_THRESHOLD 20

+/**
em_data - per-policy data used by energy_mode

@throttle: bail if current time is less than than ktime_throttle.
    Derived from THROTTLE_MSEC
@up_threshold: table of normalized capacity states to determine if cpu
    should run faster. Derived from UP_THRESHOLD
@down_threshold: table of normalized capacity states to determine if cpu
    should run slower. Derived from DOWN_THRESHOLD
struct em_data is the per-policy energy_model-specific data structure. A

per-policy instance of it is created when the energy_model governor receives

the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data

member of struct cpufreq_policy.

Readers of this data must call down_read(policy->rwsem). Writers must

call down_write(policy->rwsem).

*/
+struct em_data {

/* per-policy throttling */

ktime_t throttle;

unsigned int *up_threshold;

unsigned int *down_threshold;

struct task_struct *task;

atomic_long_t target_freq;

atomic_t need_wake_task;

+};

+/*

we pass in struct cpufreq_policy. This is safe because changing out the

policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),

which tears all of the data structures down and __cpufreq_governor(policy,

CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the

new policy pointer

*/

+static int energy_model_thread(void *data) +{
struct sched_param param;

struct cpufreq_policy *policy;

struct em_data *em;

int ret;

policy = (struct cpufreq_policy *) data;

if (!policy) {
pr_warn("%s: missing policy\n", __func__);
do_exit(-EINVAL);
}

em = policy->gov_data;

if (!em) {
pr_warn("%s: missing governor data\n", __func__);
do_exit(-EINVAL);
}

param.sched_priority = 0;

sched_setscheduler(current, SCHED_FIFO, &param);

extra line.

...

do {
```
down_write(&policy->rwsem);
```

Why a semaphore is needed here ? They are very expensive, no ? Well I guess this is coming the cpufreq framework itself, and you can do anything about that, right ?

...

if (!atomic_read(&em->need_wake_task))  {

I don't get the purpose of this test. Assuming we skip the first pass leading us to 'schedule()', when we exit schedule() it is because we have been switched on with wakeup_task which was invoked by em_wake_up_process because em->need_wake_task == 1. Right after this test, you set em->need_wake_task = 0, entering again in schedule.

Why not just call schedule after setting the freq and get rid of this test ?

...

```
	up_write(&policy->rwsem);
```

	set_current_state(TASK_INTERRUPTIBLE);

```
	schedule();
```
```
	continue;
```
```
}
```

Can't you move the atomic test before holding the semaphore ?

...

ret = __cpufreq_driver_target(policy, atomic_read(&em->target_freq),

```
		CPUFREQ_RELATION_H);
```
```
if (ret)
```

	pr_debug("%s: __cpufreq_driver_target returned %d\n",

```
			__func__, ret);
```
```
em->throttle = ktime_get();
```
```
atomic_set(&em->need_wake_task, 0);
```
```
up_write(&policy->rwsem);
```

... and release the semaphore after __cpufreq_driver_target ?

...

} while (!kthread_should_stop());

do_exit(0);

+}

Assuming what I am saying above is right, why not write:

while (!kthread_should_stop()) {

set_current_state(TASK_INTERRUPTIBLE); schedule();

down_write(&policy->rwsem); ret = __cpufreq_driver_target(policy, atomic_read(&em->target_freq), CPUFREQ_RELATION_H); up_write(&policy->rwsem); em->throttle = ktime_get(); };

...

+static void em_wake_up_process(struct task_struct *task) +{
/* this is null during early boot */

if (IS_ERR_OR_NULL(task)) {
return;
}

IMO, this test means somewhere in the code we are putting the cart before the horse.

...

wake_up_process(task);

+}

+void arch_scale_cpu_freq(void) +{
struct cpufreq_policy *policy;

struct em_data *em;

int cpu;

for_each_online_cpu(cpu) {
policy = cpufreq_cpu_get(cpu);
if (IS_ERR_OR_NULL(policy))
	continue;
em = policy->gov_data;
if (!em) {
	cpufreq_cpu_put(policy);
	continue;
}
/*
 * FIXME replace the atomic stuff by holding write-locks
 * in arch_eval_cpu_freq?
 */
if (atomic_read(&em->need_wake_task)) {
	em_wake_up_process(em->task);
}
cpufreq_cpu_put(policy);
}
+}

+/**

arch_eval_cpu_freq - scale cpu frequency based on CFS utilization

@update_cpus: mask of CPUs with updated utilization and capacity

Declared and weakly defined in kernel/sched/fair.c This definition overrides

the default. In the case of CONFIG_FAIR_GROUP_SCHED, update_cpus may

contains cpus that are not in the same policy. Otherwise update_cpus will be

a single cpu.

Holds read lock for policy->rw_sem.

FIXME weak arch function means that only one definition of this function can

be linked. How to support multiple energy model policies?

using ops ?

...

+void arch_eval_cpu_freq(struct cpumask *update_cpus) +{

struct cpufreq_policy *policy;
struct em_data *em;
int index;
unsigned int cpu, tmp;
unsigned long percent_util = 0, max_util = 0, cap = 0, util = 0;
/*

* In the case of CONFIG_FAIR_GROUP_SCHED, policy->cpus may be a subset

* of update_cpus. In such case take the first cpu in update_cpus, get

* its policy and try to scale the affects cpus. Then we clear the

* corresponding bits from update_cpus and try again. If a policy does

* not exist for a cpu then we remove that bit as well, preventing an

```
* infinite loop.
```
```
*/
```
while (!cpumask_empty(update_cpus)) {
```
percent_util = 0;
```
```
max_util = 0;
```
```
cap = 0;
```
```
util = 0;
```
```
cpu = cpumask_first(update_cpus);
```
```
policy = cpufreq_cpu_get(cpu);
```
```
if (IS_ERR_OR_NULL(policy)) {
```
```
	cpumask_clear_cpu(cpu, update_cpus);
```
```
	continue;
```
```
}
```
```
if (!policy->gov_data)
```
```
	goto bail;
```
```
em = policy->gov_data;
```

if (ktime_before(ktime_get(), em->throttle)) {

```
	trace_printk("THROTTLED");
```
```
	goto bail;
```
```
}
```

Can you explain the throttle thing ?

...

```
/*
```
```
 * try scaling cpus
```
```
 *
```

 * algorithm assumptions & description:

 * 	all cpus in a policy run at the same rate/capacity.

 * 	choose frequency target based on most utilized cpu.

 * 	do not care about aggregating cpu utilization.

 * 	do not track any historical trends beyond utilization

 * 	if max_util > 80% of current capacity,

```
 * 		go to max capacity
```

 * 	if max_util < 20% of current capacity,

```
 * 		go to the next lowest capacity
```

 * 	otherwise, stay at the same capacity state

```
 */
```
```
for_each_cpu(tmp, policy->cpus) {
```
```
	util = get_cpu_usage(tmp);
```
```
	if (util > max_util)
```
```
		max_util = util;
```
```
}
```
```
cap = capacity_of(cpu);
```
```
if (!cap) {
```
```
	goto bail;
```
```
}
```

When 'cpu' does not have a capacity ?

...

index = cpufreq_frequency_table_get_index(policy, policy->cur);
if (max_util > em->up_threshold[index]) {
	/* write em->target_freq with read lock held */
	atomic_long_set(&em->target_freq, policy->max);
	/*
	 * FIXME this is gross. convert arch_eval_cpu_freq to
	 * hold the write lock?
	 */
	atomic_set(&em->need_wake_task, 1);
} else if (max_util < em->down_threshold[index]) {
	/* write em->target_freq with read lock held */
	atomic_long_set(&em->target_freq, policy->cur - 1);
	/*
	 * FIXME this is gross. convert arch_eval_cpu_freq to
	 * hold the write lock?
	 */
	atomic_set(&em->need_wake_task, 1);
}
+bail:
/* remove policy->cpus fromm update_cpus */
cpumask_andnot(update_cpus, update_cpus, policy->cpus);
cpufreq_cpu_put(policy);
}

return;
+}

+static void em_start(struct cpufreq_policy *policy) +{
int index = 0, count = 0;

unsigned int capacity;

struct em_data *em;

struct cpufreq_frequency_table *pos;

/* prepare per-policy private data */

em = kzalloc(sizeof(*em), GFP_KERNEL);

if (!em) {
pr_debug("%s: failed to allocate private data\n", __func__);
return;
}

policy->gov_data = em;

/* how many entries in the frequency table? */

cpufreq_for_each_entry(pos, policy->freq_table)
count++;
/* pre-compute thresholds */

em->up_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);

em->down_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);

Allocation check missing.

...

cpufreq_for_each_entry(pos, policy->freq_table) {
/* FIXME capacity below is not scaled for uarch */
capacity = pos->frequency * SCHED_CAPACITY_SCALE / policy->max;
em->up_threshold[index] = capacity * UP_THRESHOLD / 100;
em->down_threshold[index] = capacity * DOWN_THRESHOLD / 100;
pr_debug("%s: cpu = %u index = %d capacity = %u up = %u down = %u\n",
		__func__, cpumask_first(policy->cpus), index,
		capacity, em->up_threshold[index],
		em->down_threshold[index]);
index++;
}

/* init per-policy kthread */

em->task = kthread_create(energy_model_thread, policy, "kenergy_model_task");

if (IS_ERR_OR_NULL(em->task))
pr_err("%s: failed to create kenergy_model_task thread\n", __func__);
+}

+static void em_stop(struct cpufreq_policy *policy) +{

struct em_data *em;

em = policy->gov_data;

kthread_stop(em->task);

/* replace with devm counterparts */

kfree(em->up_threshold);

kfree(em->down_threshold);

kfree(em);

+}

+static int energy_model_setup(struct cpufreq_policy *policy, unsigned int event) +{
switch (event) {
case CPUFREQ_GOV_START:
	/* Start managing the frequency */
	em_start(policy);
	return 0;
case CPUFREQ_GOV_STOP:
	em_stop(policy);
	return 0;
case CPUFREQ_GOV_LIMITS:	/* unused */
case CPUFREQ_GOV_POLICY_INIT:	/* unused */
case CPUFREQ_GOV_POLICY_EXIT:	/* unused */
	break;
}

return 0;
+}

+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL +static +#endif +struct cpufreq_governor cpufreq_gov_energy_model = {

.name = "energy_model",

.governor = energy_model_setup,

.owner = THIS_MODULE,

+};

+static int __init energy_model_init(void) +{

return cpufreq_register_governor(&cpufreq_gov_energy_model);

+}

+static void __exit energy_model_exit(void) +{

cpufreq_unregister_governor(&cpufreq_gov_energy_model);

+}

+/* Try to make this the default governor */ +fs_initcall(energy_model_init);

+MODULE_LICENSE("GPL"); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 188aa13..d9386b0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4385,7 +4385,7 @@ static unsigned long target_load(int cpu, int type) return max(rq->cpu_load[type-1], total); }

-static unsigned long capacity_of(int cpu) +unsigned long capacity_of(int cpu) { return cpu_rq(cpu)->cpu_capacity; } @@ -4609,7 +4609,7 @@ static int __get_cpu_usage(int cpu, int delta) return sum; }

-static int get_cpu_usage(int cpu) +int get_cpu_usage(int cpu) { return __get_cpu_usage(cpu, 0); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 23b4c11..c357c77 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -806,6 +806,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) return sd; }

+unsigned long capacity_of(int cpu); +int get_cpu_usage(int cpu);

DECLARE_PER_CPU(struct sched_domain *, sd_llc); DECLARE_PER_CPU(int, sd_llc_size); DECLARE_PER_CPU(int, sd_llc_id);

-- http://www.linaro.org/ Linaro.org │ Open source software for ARM SoCs

Follow Linaro: http://www.facebook.com/pages/Linaro Facebook | http://twitter.com/#!/linaroorg Twitter | http://www.linaro.org/linaro-blog/ Blog

Juri Lelli

11:08 a.m.

New subject: [PATCH 2/6] sched: energy_model: simple cpu frequency scaling policy

Hi Daniel,

On 11/03/15 10:08, Daniel Lezcano wrote:

...

On 03/09/2015 03:06 PM, Juri Lelli wrote:

...
From: Mike Turquette mturquette@linaro.org

Building on top of the scale invariant capacity patches and earlier patches in this series that prepare CFS for scaling cpu frequency, this patch implements a simple, naive ondemand-like cpu frequency scaling policy that is driven by enqueue_task_fair and dequeue_tassk_fair. This new policy is named "energy_model" as an homage to the on-going work in that area. It is NOT an actual energy model.

Well, I would change the name to something more trivial, eg: load-driven or whatever. Introducing 'energy model' here has no sense from my pov except than buzzing :)

Or something like cpufreq_scheduler with "scheduler" as the new policy name. This thing has to be driven by the scheduler after all, and we might change our minds in the future about which signal we use to actually drive it.

...

...
This policy is implemented using the CPUfreq governor interface for two main reasons:

re-using the CPUfreq machine drivers without using the governor

interface is hard. I do not forsee any issue continuing to use the governor interface going forward but it is worth making clear what this patch does up front.

using the CPUfreq interface allows us to switch between the

energy_model governor and other CPUfreq governors (such as ondemand) at run-time. This is very useful for comparative testing and tuning.

A caveat to #2 above is that the weak arch function used by the governor means that only one scheduler-driven policy can be linked at a time. This limitation does not apply to "traditional" governors. I raised this in my previous capacity_ops patches[0] but as discussed at LPC14 last week, it seems desirable to pursue a single cpu frequency scaling policy at first, and try to make that work for everyone interested in using it. If that model breaks down then we can revisit the idea of dynamic selection of scheduler-driven cpu frequency scaling.

Unlike legacy CPUfreq governors, this policy does not implement its own logic loop (such as a workqueue triggered by a timer), but instead uses an event-driven design. Frequency is evaluated by entering {en,de}queue_task_fair and then a kthread is woken from run_rebalance_domains which scales cpu frequency based on the latest evaluation.

The policy implemented in this patch takes the highest cpu utilization from policy->cpus and uses that select a frequency target based on the same 80%/20% thresholds used as defaults in ondemand. Frequenecy-scaled thresholds are pre-computed when energy_model inits. The frequency selection is a simple comparison of cpu utilization (as defined in Morten's latest RFC) to the threshold values. In the future this logic could be replaced with something more sophisticated that uses PELT to get a historical overview. Ideas are welcome.

Note that the pre-computed thresholds above do not take into account micro-architecture differences (SMT or big.LITTLE hardware), only frequency invariance.

Not-signed-off-by: Mike Turquette mturquette@linaro.org

Signed-off-by: Michael Turquette mturquette@linaro.org

Please remove the Not-signed-off which means "I don't testify this is open source code" [1].

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documen...

And Juri, as you are in the delivery path, you should add your signed-off-by.

Oh, sure. Didn't added/removed anything here as this is simply a re-post with a little delta of Mike's original patchset. This wasn't mean for LKML. It's just a way to restart discussion.

Thanks for your feedback anyway. I guess I'll let Mike answer your points as he might already completely reworked this code :).

Thanks,

- Juri

...

...

drivers/cpufreq/Kconfig | 21 +++ include/linux/cpufreq.h | 3 + kernel/sched/Makefile | 1 + kernel/sched/energy_model.c | 343 ++++++++++++++++++++++++++++++++++++++++++++

I remember Peter said he doesn't want a new file.

...
kernel/sched/fair.c | 4 +- kernel/sched/sched.h | 3 + 6 files changed, 373 insertions(+), 2 deletions(-) create mode 100644 kernel/sched/energy_model.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index a171fef..7792646 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE Be aware that not all cpufreq drivers support the conservative governor. If unsure have a look at the help section of the driver. Fallback governor will be the performance governor.

+config CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL
bool "energy_model"
select CPU_FREQ_GOV_ENERGY_MODEL
select CPU_FREQ_GOV_PERFORMANCE
Why is needed gov perf ?

...
help
  Use the CPUfreq governor 'energy_model' as default. This
  scales cpu frequency from the scheduler as per-task statistics
  are updated.
endchoice

config CPU_FREQ_GOV_PERFORMANCE
@@ -198,6 +207,18 @@ config CPUFREQ_DT
    If in doubt, say N.
+config CPU_FREQ_GOV_ENERGY_MODEL
tristate "'energy model' cpufreq governor"
depends on CPU_FREQ
select CPU_FREQ_GOV_COMMON
help
  'energy_model' - this governor scales cpu frequency from the
  scheduler as a function of cpu utilization. It does not
  evaluate utilization on a periodic basis (unlike ondemand) but
  instead is invoked from CFS when updating per-task statistics.
  If in doubt, say N.
if X86 source "drivers/cpufreq/Kconfig.x86" endif
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index 7cdf63a..03b0ff5 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h @@ -488,6 +488,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand; #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE) extern struct cpufreq_governor cpufreq_gov_conservative; #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative) +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL) +extern struct cpufreq_governor cpufreq_gov_energy_model; +#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_energy_model) #endif

/********************************************************************* diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 46be870..b75721d 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o +obj-$(CONFIG_CPU_FREQ_GOV_ENERGY_MODEL) += energy_model.o diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c new file mode 100644 index 0000000..7fec5af --- /dev/null +++ b/kernel/sched/energy_model.c @@ -0,0 +1,343 @@ +/*

Copyright (C) 2014 Michael Turquette mturquette@linaro.org

This program is free software; you can redistribute it and/or modify

it under the terms of the GNU General Public License version 2 as

published by the Free Software Foundation.

*/

+#include <linux/cpufreq.h> +#include <linux/module.h> +#include <linux/kthread.h>

+#include "sched.h"

+#define THROTTLE_MSEC 50 +#define UP_THRESHOLD 80 +#define DOWN_THRESHOLD 20

+/**
em_data - per-policy data used by energy_mode

@throttle: bail if current time is less than than ktime_throttle.
          Derived from THROTTLE_MSEC
@up_threshold: table of normalized capacity states to determine if cpu
          should run faster. Derived from UP_THRESHOLD
@down_threshold: table of normalized capacity states to determine if cpu
          should run slower. Derived from DOWN_THRESHOLD
struct em_data is the per-policy energy_model-specific data structure. A

per-policy instance of it is created when the energy_model governor receives

the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data

member of struct cpufreq_policy.

Readers of this data must call down_read(policy->rwsem). Writers must

call down_write(policy->rwsem).

*/
+struct em_data {
/* per-policy throttling */
ktime_t throttle;
unsigned int *up_threshold;
unsigned int *down_threshold;
struct task_struct *task;
atomic_long_t target_freq;
atomic_t need_wake_task;
+};

+/*

we pass in struct cpufreq_policy. This is safe because changing out the

policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),

which tears all of the data structures down and __cpufreq_governor(policy,

CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the

new policy pointer

*/

+static int energy_model_thread(void *data) +{
struct sched_param param;
struct cpufreq_policy *policy;
struct em_data *em;
int ret;
policy = (struct cpufreq_policy *) data;
if (!policy) {
        pr_warn("%s: missing policy\n", __func__);
        do_exit(-EINVAL);
}
em = policy->gov_data;
if (!em) {
        pr_warn("%s: missing governor data\n", __func__);
        do_exit(-EINVAL);
}
param.sched_priority = 0;
sched_setscheduler(current, SCHED_FIFO, &param);
extra line.

...
do {
        down_write(&policy->rwsem);
Why a semaphore is needed here ? They are very expensive, no ? Well I guess this is coming the cpufreq framework itself, and you can do anything about that, right ?

...
        if (!atomic_read(&em->need_wake_task))  {
I don't get the purpose of this test. Assuming we skip the first pass leading us to 'schedule()', when we exit schedule() it is because we have been switched on with wakeup_task which was invoked by em_wake_up_process because em->need_wake_task == 1. Right after this test, you set em->need_wake_task = 0, entering again in schedule.

Why not just call schedule after setting the freq and get rid of this test ?

...
                up_write(&policy->rwsem);
                set_current_state(TASK_INTERRUPTIBLE);
                schedule();
                continue;
        }
Can't you move the atomic test before holding the semaphore ?

...
        ret = __cpufreq_driver_target(policy, atomic_read(&em->target_freq),
                        CPUFREQ_RELATION_H);
        if (ret)
                pr_debug("%s: __cpufreq_driver_target returned %d\n",
                                __func__, ret);
        em->throttle = ktime_get();
        atomic_set(&em->need_wake_task, 0);
        up_write(&policy->rwsem);
... and release the semaphore after __cpufreq_driver_target ?

...
} while (!kthread_should_stop());
do_exit(0);
+}
Assuming what I am saying above is right, why not write:

while (!kthread_should_stop()) {
    set_current_state(TASK_INTERRUPTIBLE);
    schedule();

    down_write(&policy->rwsem);
    ret = __cpufreq_driver_target(policy,
                            atomic_read(&em->target_freq),
                            CPUFREQ_RELATION_H);
    up_write(&policy->rwsem);
    em->throttle = ktime_get();
};

...
+static void em_wake_up_process(struct task_struct *task) +{
/* this is null during early boot */
if (IS_ERR_OR_NULL(task)) {
        return;
}
IMO, this test means somewhere in the code we are putting the cart before the horse.

...
wake_up_process(task);
+}

+void arch_scale_cpu_freq(void) +{
struct cpufreq_policy *policy;
struct em_data *em;
int cpu;
for_each_online_cpu(cpu) {
        policy = cpufreq_cpu_get(cpu);
        if (IS_ERR_OR_NULL(policy))
                continue;
        em = policy->gov_data;
        if (!em) {
                cpufreq_cpu_put(policy);
                continue;
        }
        /*
         * FIXME replace the atomic stuff by holding write-locks
         * in arch_eval_cpu_freq?
         */
        if (atomic_read(&em->need_wake_task)) {
                em_wake_up_process(em->task);
        }
        cpufreq_cpu_put(policy);
}
+}

+/**

arch_eval_cpu_freq - scale cpu frequency based on CFS utilization

@update_cpus: mask of CPUs with updated utilization and capacity

Declared and weakly defined in kernel/sched/fair.c This definition overrides

the default. In the case of CONFIG_FAIR_GROUP_SCHED, update_cpus may

contains cpus that are not in the same policy. Otherwise update_cpus will be

a single cpu.

Holds read lock for policy->rw_sem.

FIXME weak arch function means that only one definition of this function can

be linked. How to support multiple energy model policies?
using ops ?

...
*/

+void arch_eval_cpu_freq(struct cpumask *update_cpus) +{
struct cpufreq_policy *policy;
struct em_data *em;
int index;
unsigned int cpu, tmp;
unsigned long percent_util = 0, max_util = 0, cap = 0, util = 0;
/*
 * In the case of CONFIG_FAIR_GROUP_SCHED, policy->cpus may be a subset
 * of update_cpus. In such case take the first cpu in update_cpus, get
 * its policy and try to scale the affects cpus. Then we clear the
 * corresponding bits from update_cpus and try again. If a policy does
 * not exist for a cpu then we remove that bit as well, preventing an
 * infinite loop.
 */
while (!cpumask_empty(update_cpus)) {
        percent_util = 0;
        max_util = 0;
        cap = 0;
        util = 0;
        cpu = cpumask_first(update_cpus);
        policy = cpufreq_cpu_get(cpu);
        if (IS_ERR_OR_NULL(policy)) {
                cpumask_clear_cpu(cpu, update_cpus);
                continue;
        }
        if (!policy->gov_data)
                goto bail;
        em = policy->gov_data;
        if (ktime_before(ktime_get(), em->throttle)) {
                trace_printk("THROTTLED");
                goto bail;
        }
Can you explain the throttle thing ?

...
        /*
         * try scaling cpus
         *
         * algorithm assumptions & description:
         *      all cpus in a policy run at the same rate/capacity.
         *      choose frequency target based on most utilized cpu.
         *      do not care about aggregating cpu utilization.
         *      do not track any historical trends beyond utilization
         *      if max_util > 80% of current capacity,
         *              go to max capacity
         *      if max_util < 20% of current capacity,
         *              go to the next lowest capacity
         *      otherwise, stay at the same capacity state
         */
        for_each_cpu(tmp, policy->cpus) {
                util = get_cpu_usage(tmp);
                if (util > max_util)
                        max_util = util;
        }
        cap = capacity_of(cpu);
        if (!cap) {
                goto bail;
        }
When 'cpu' does not have a capacity ?

...
        index = cpufreq_frequency_table_get_index(policy, policy->cur);
        if (max_util > em->up_threshold[index]) {
                /* write em->target_freq with read lock held */
                atomic_long_set(&em->target_freq, policy->max);
                /*
                 * FIXME this is gross. convert arch_eval_cpu_freq to
                 * hold the write lock?
                 */
                atomic_set(&em->need_wake_task, 1);
        } else if (max_util < em->down_threshold[index]) {
                /* write em->target_freq with read lock held */
                atomic_long_set(&em->target_freq, policy->cur - 1);
                /*
                 * FIXME this is gross. convert arch_eval_cpu_freq to
                 * hold the write lock?
                 */
                atomic_set(&em->need_wake_task, 1);
        }
+bail:
        /* remove policy->cpus fromm update_cpus */
        cpumask_andnot(update_cpus, update_cpus, policy->cpus);
        cpufreq_cpu_put(policy);
}
return;
+}

+static void em_start(struct cpufreq_policy *policy) +{
int index = 0, count = 0;
unsigned int capacity;
struct em_data *em;
struct cpufreq_frequency_table *pos;
/* prepare per-policy private data */
em = kzalloc(sizeof(*em), GFP_KERNEL);
if (!em) {
        pr_debug("%s: failed to allocate private data\n", __func__);
        return;
}
policy->gov_data = em;
/* how many entries in the frequency table? */
cpufreq_for_each_entry(pos, policy->freq_table)
        count++;
/* pre-compute thresholds */
em->up_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);
em->down_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);
Allocation check missing.

...
cpufreq_for_each_entry(pos, policy->freq_table) {
        /* FIXME capacity below is not scaled for uarch */
        capacity = pos->frequency * SCHED_CAPACITY_SCALE / policy->max;
        em->up_threshold[index] = capacity * UP_THRESHOLD / 100;
        em->down_threshold[index] = capacity * DOWN_THRESHOLD / 100;
        pr_debug("%s: cpu = %u index = %d capacity = %u up = %u down = %u\n",
                        __func__, cpumask_first(policy->cpus), index,
                        capacity, em->up_threshold[index],
                        em->down_threshold[index]);
        index++;
}
/* init per-policy kthread */
em->task = kthread_create(energy_model_thread, policy, "kenergy_model_task");
if (IS_ERR_OR_NULL(em->task))
        pr_err("%s: failed to create kenergy_model_task thread\n", __func__);
+}

+static void em_stop(struct cpufreq_policy *policy) +{
struct em_data *em;
em = policy->gov_data;
kthread_stop(em->task);
/* replace with devm counterparts */
kfree(em->up_threshold);
kfree(em->down_threshold);
kfree(em);
+}

+static int energy_model_setup(struct cpufreq_policy *policy, unsigned int event) +{
switch (event) {
        case CPUFREQ_GOV_START:
                /* Start managing the frequency */
                em_start(policy);
                return 0;
        case CPUFREQ_GOV_STOP:
                em_stop(policy);
                return 0;
        case CPUFREQ_GOV_LIMITS:        /* unused */
        case CPUFREQ_GOV_POLICY_INIT:   /* unused */
        case CPUFREQ_GOV_POLICY_EXIT:   /* unused */
                break;
}
return 0;
+}

+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL +static +#endif +struct cpufreq_governor cpufreq_gov_energy_model = {
.name                   = "energy_model",
.governor               = energy_model_setup,
.owner                  = THIS_MODULE,
+};

+static int __init energy_model_init(void) +{
return cpufreq_register_governor(&cpufreq_gov_energy_model);
+}

+static void __exit energy_model_exit(void) +{
cpufreq_unregister_governor(&cpufreq_gov_energy_model);
+}

+/* Try to make this the default governor */ +fs_initcall(energy_model_init);

+MODULE_LICENSE("GPL"); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 188aa13..d9386b0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4385,7 +4385,7 @@ static unsigned long target_load(int cpu, int type) return max(rq->cpu_load[type-1], total); }

-static unsigned long capacity_of(int cpu) +unsigned long capacity_of(int cpu) { return cpu_rq(cpu)->cpu_capacity; } @@ -4609,7 +4609,7 @@ static int __get_cpu_usage(int cpu, int delta) return sum; }

-static int get_cpu_usage(int cpu) +int get_cpu_usage(int cpu) { return __get_cpu_usage(cpu, 0); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 23b4c11..c357c77 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -806,6 +806,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) return sd; }

+unsigned long capacity_of(int cpu); +int get_cpu_usage(int cpu);

DECLARE_PER_CPU(struct sched_domain *, sd_llc); DECLARE_PER_CPU(int, sd_llc_size); DECLARE_PER_CPU(int, sd_llc_id);
-- http://www.linaro.org/ Linaro.org │ Open source software for ARM SoCs

Follow Linaro: http://www.facebook.com/pages/Linaro Facebook | http://twitter.com/#!/linaroorg Twitter | http://www.linaro.org/linaro-blog/ Blog

Daniel Lezcano

1:09 p.m.

New subject: [PATCH 2/6] sched: energy_model: simple cpu frequency scaling policy

On 03/11/2015 12:08 PM, Juri Lelli wrote:

...

Hi Daniel,

On 11/03/15 10:08, Daniel Lezcano wrote:

...
On 03/09/2015 03:06 PM, Juri Lelli wrote:

...
From: Mike Turquette mturquette@linaro.org

Building on top of the scale invariant capacity patches and earlier patches in this series that prepare CFS for scaling cpu frequency, this patch implements a simple, naive ondemand-like cpu frequency scaling policy that is driven by enqueue_task_fair and dequeue_tassk_fair. This new policy is named "energy_model" as an homage to the on-going work in that area. It is NOT an actual energy model.

Well, I would change the name to something more trivial, eg: load-driven or whatever. Introducing 'energy model' here has no sense from my pov except than buzzing :)

Or something like cpufreq_scheduler with "scheduler" as the new policy name. This thing has to be driven by the scheduler after all, and we might change our minds in the future about which signal we use to actually drive it.

Yes, why not.

[ ... ]

...

...
...
Not-signed-off-by: Mike Turquette mturquette@linaro.org

Signed-off-by: Michael Turquette mturquette@linaro.org

Please remove the Not-signed-off which means "I don't testify this is open source code" [1].

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documen...

And Juri, as you are in the delivery path, you should add your signed-off-by.

Oh, sure. Didn't added/removed anything here as this is simply a re-post with a little delta of Mike's original patchset. This wasn't mean for LKML. It's just a way to restart discussion. Thanks for your feedback anyway. I guess I'll let Mike answer your points as he might already completely reworked this code :).

Ok, thanks

-- Daniel

-- http://www.linaro.org/ Linaro.org │ Open source software for ARM SoCs

Follow Linaro: http://www.facebook.com/pages/Linaro Facebook | http://twitter.com/#!/linaroorg Twitter | http://www.linaro.org/linaro-blog/ Blog

Michael Turquette

16 Mar 16 Mar

4:28 p.m.

New subject: [PATCH 2/6] sched: energy_model: simple cpu frequency scaling policy

Quoting Juri Lelli (2015-03-11 04:08:55)

...

Hi Daniel,

On 11/03/15 10:08, Daniel Lezcano wrote:

...
On 03/09/2015 03:06 PM, Juri Lelli wrote:

...
From: Mike Turquette mturquette@linaro.org

Building on top of the scale invariant capacity patches and earlier patches in this series that prepare CFS for scaling cpu frequency, this patch implements a simple, naive ondemand-like cpu frequency scaling policy that is driven by enqueue_task_fair and dequeue_tassk_fair. This new policy is named "energy_model" as an homage to the on-going work in that area. It is NOT an actual energy model.

Well, I would change the name to something more trivial, eg: load-driven or whatever. Introducing 'energy model' here has no sense from my pov except than buzzing :)

Or something like cpufreq_scheduler with "scheduler" as the new policy name. This thing has to be driven by the scheduler after all, and we might change our minds in the future about which signal we use to actually drive it.

In my tree this C file is cap_gov.c, and the govenor is referred to as the "cpu capacity governor".

...

...
...
This policy is implemented using the CPUfreq governor interface for two main reasons:

re-using the CPUfreq machine drivers without using the governor

interface is hard. I do not forsee any issue continuing to use the governor interface going forward but it is worth making clear what this patch does up front.

using the CPUfreq interface allows us to switch between the

energy_model governor and other CPUfreq governors (such as ondemand) at run-time. This is very useful for comparative testing and tuning.

A caveat to #2 above is that the weak arch function used by the governor means that only one scheduler-driven policy can be linked at a time. This limitation does not apply to "traditional" governors. I raised this in my previous capacity_ops patches[0] but as discussed at LPC14 last week, it seems desirable to pursue a single cpu frequency scaling policy at first, and try to make that work for everyone interested in using it. If that model breaks down then we can revisit the idea of dynamic selection of scheduler-driven cpu frequency scaling.

Unlike legacy CPUfreq governors, this policy does not implement its own logic loop (such as a workqueue triggered by a timer), but instead uses an event-driven design. Frequency is evaluated by entering {en,de}queue_task_fair and then a kthread is woken from run_rebalance_domains which scales cpu frequency based on the latest evaluation.

The policy implemented in this patch takes the highest cpu utilization from policy->cpus and uses that select a frequency target based on the same 80%/20% thresholds used as defaults in ondemand. Frequenecy-scaled thresholds are pre-computed when energy_model inits. The frequency selection is a simple comparison of cpu utilization (as defined in Morten's latest RFC) to the threshold values. In the future this logic could be replaced with something more sophisticated that uses PELT to get a historical overview. Ideas are welcome.

Note that the pre-computed thresholds above do not take into account micro-architecture differences (SMT or big.LITTLE hardware), only frequency invariance.

Not-signed-off-by: Mike Turquette mturquette@linaro.org

Signed-off-by: Michael Turquette mturquette@linaro.org

Please remove the Not-signed-off which means "I don't testify this is open source code" [1].

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documen...

And Juri, as you are in the delivery path, you should add your signed-off-by.

Oh, sure. Didn't added/removed anything here as this is simply a re-post with a little delta of Mike's original patchset. This wasn't mean for LKML. It's just a way to restart discussion.

Thanks for your feedback anyway. I guess I'll let Mike answer your points as he might already completely reworked this code :).

Not-signed-off-by is something I do when posting a very raw RFC. I use it to remind people that this is experimental and I need as much feedback as possible.

Regards, Mike

...

Thanks,

Juri

...
...

drivers/cpufreq/Kconfig | 21 +++ include/linux/cpufreq.h | 3 + kernel/sched/Makefile | 1 + kernel/sched/energy_model.c | 343 ++++++++++++++++++++++++++++++++++++++++++++

I remember Peter said he doesn't want a new file.

...
kernel/sched/fair.c | 4 +- kernel/sched/sched.h | 3 + 6 files changed, 373 insertions(+), 2 deletions(-) create mode 100644 kernel/sched/energy_model.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index a171fef..7792646 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE Be aware that not all cpufreq drivers support the conservative governor. If unsure have a look at the help section of the driver. Fallback governor will be the performance governor.

+config CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL
bool "energy_model"
select CPU_FREQ_GOV_ENERGY_MODEL
select CPU_FREQ_GOV_PERFORMANCE
Why is needed gov perf ?

...
help
  Use the CPUfreq governor 'energy_model' as default. This
  scales cpu frequency from the scheduler as per-task statistics
  are updated.
endchoice

config CPU_FREQ_GOV_PERFORMANCE
@@ -198,6 +207,18 @@ config CPUFREQ_DT
    If in doubt, say N.
+config CPU_FREQ_GOV_ENERGY_MODEL
tristate "'energy model' cpufreq governor"
depends on CPU_FREQ
select CPU_FREQ_GOV_COMMON
help
  'energy_model' - this governor scales cpu frequency from the
  scheduler as a function of cpu utilization. It does not
  evaluate utilization on a periodic basis (unlike ondemand) but
  instead is invoked from CFS when updating per-task statistics.
  If in doubt, say N.
if X86 source "drivers/cpufreq/Kconfig.x86" endif
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index 7cdf63a..03b0ff5 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h @@ -488,6 +488,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand; #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE) extern struct cpufreq_governor cpufreq_gov_conservative; #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative) +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL) +extern struct cpufreq_governor cpufreq_gov_energy_model; +#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_energy_model) #endif

/********************************************************************* diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 46be870..b75721d 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o +obj-$(CONFIG_CPU_FREQ_GOV_ENERGY_MODEL) += energy_model.o diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c new file mode 100644 index 0000000..7fec5af --- /dev/null +++ b/kernel/sched/energy_model.c @@ -0,0 +1,343 @@ +/*

Copyright (C) 2014 Michael Turquette mturquette@linaro.org

This program is free software; you can redistribute it and/or modify

it under the terms of the GNU General Public License version 2 as

published by the Free Software Foundation.

*/

+#include <linux/cpufreq.h> +#include <linux/module.h> +#include <linux/kthread.h>

+#include "sched.h"

+#define THROTTLE_MSEC 50 +#define UP_THRESHOLD 80 +#define DOWN_THRESHOLD 20

+/**
em_data - per-policy data used by energy_mode

@throttle: bail if current time is less than than ktime_throttle.
          Derived from THROTTLE_MSEC
@up_threshold: table of normalized capacity states to determine if cpu
          should run faster. Derived from UP_THRESHOLD
@down_threshold: table of normalized capacity states to determine if cpu
          should run slower. Derived from DOWN_THRESHOLD
struct em_data is the per-policy energy_model-specific data structure. A

per-policy instance of it is created when the energy_model governor receives

the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data

member of struct cpufreq_policy.

Readers of this data must call down_read(policy->rwsem). Writers must

call down_write(policy->rwsem).

*/
+struct em_data {
/* per-policy throttling */
ktime_t throttle;
unsigned int *up_threshold;
unsigned int *down_threshold;
struct task_struct *task;
atomic_long_t target_freq;
atomic_t need_wake_task;
+};

+/*

we pass in struct cpufreq_policy. This is safe because changing out the

policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),

which tears all of the data structures down and __cpufreq_governor(policy,

CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the

new policy pointer

*/

+static int energy_model_thread(void *data) +{
struct sched_param param;
struct cpufreq_policy *policy;
struct em_data *em;
int ret;
policy = (struct cpufreq_policy *) data;
if (!policy) {
        pr_warn("%s: missing policy\n", __func__);
        do_exit(-EINVAL);
}
em = policy->gov_data;
if (!em) {
        pr_warn("%s: missing governor data\n", __func__);
        do_exit(-EINVAL);
}
param.sched_priority = 0;
sched_setscheduler(current, SCHED_FIFO, &param);
extra line.

...
do {
        down_write(&policy->rwsem);
Why a semaphore is needed here ? They are very expensive, no ? Well I guess this is coming the cpufreq framework itself, and you can do anything about that, right ?

...
        if (!atomic_read(&em->need_wake_task))  {
I don't get the purpose of this test. Assuming we skip the first pass leading us to 'schedule()', when we exit schedule() it is because we have been switched on with wakeup_task which was invoked by em_wake_up_process because em->need_wake_task == 1. Right after this test, you set em->need_wake_task = 0, entering again in schedule.

Why not just call schedule after setting the freq and get rid of this test ?

...
                up_write(&policy->rwsem);
                set_current_state(TASK_INTERRUPTIBLE);
                schedule();
                continue;
        }
Can't you move the atomic test before holding the semaphore ?

...
        ret = __cpufreq_driver_target(policy, atomic_read(&em->target_freq),
                        CPUFREQ_RELATION_H);
        if (ret)
                pr_debug("%s: __cpufreq_driver_target returned %d\n",
                                __func__, ret);
        em->throttle = ktime_get();
        atomic_set(&em->need_wake_task, 0);
        up_write(&policy->rwsem);
... and release the semaphore after __cpufreq_driver_target ?

...
} while (!kthread_should_stop());
do_exit(0);
+}
Assuming what I am saying above is right, why not write:

while (!kthread_should_stop()) {
    set_current_state(TASK_INTERRUPTIBLE);
    schedule();

    down_write(&policy->rwsem);
    ret = __cpufreq_driver_target(policy,
                            atomic_read(&em->target_freq),
                            CPUFREQ_RELATION_H);
    up_write(&policy->rwsem);
    em->throttle = ktime_get();
};

...
+static void em_wake_up_process(struct task_struct *task) +{
/* this is null during early boot */
if (IS_ERR_OR_NULL(task)) {
        return;
}
IMO, this test means somewhere in the code we are putting the cart before the horse.

...
wake_up_process(task);
+}

+void arch_scale_cpu_freq(void) +{
struct cpufreq_policy *policy;
struct em_data *em;
int cpu;
for_each_online_cpu(cpu) {
        policy = cpufreq_cpu_get(cpu);
        if (IS_ERR_OR_NULL(policy))
                continue;
        em = policy->gov_data;
        if (!em) {
                cpufreq_cpu_put(policy);
                continue;
        }
        /*
         * FIXME replace the atomic stuff by holding write-locks
         * in arch_eval_cpu_freq?
         */
        if (atomic_read(&em->need_wake_task)) {
                em_wake_up_process(em->task);
        }
        cpufreq_cpu_put(policy);
}
+}

+/**

arch_eval_cpu_freq - scale cpu frequency based on CFS utilization

@update_cpus: mask of CPUs with updated utilization and capacity

Declared and weakly defined in kernel/sched/fair.c This definition overrides

the default. In the case of CONFIG_FAIR_GROUP_SCHED, update_cpus may

contains cpus that are not in the same policy. Otherwise update_cpus will be

a single cpu.

Holds read lock for policy->rw_sem.

FIXME weak arch function means that only one definition of this function can

be linked. How to support multiple energy model policies?
using ops ?

...
*/

+void arch_eval_cpu_freq(struct cpumask *update_cpus) +{
struct cpufreq_policy *policy;
struct em_data *em;
int index;
unsigned int cpu, tmp;
unsigned long percent_util = 0, max_util = 0, cap = 0, util = 0;
/*
 * In the case of CONFIG_FAIR_GROUP_SCHED, policy->cpus may be a subset
 * of update_cpus. In such case take the first cpu in update_cpus, get
 * its policy and try to scale the affects cpus. Then we clear the
 * corresponding bits from update_cpus and try again. If a policy does
 * not exist for a cpu then we remove that bit as well, preventing an
 * infinite loop.
 */
while (!cpumask_empty(update_cpus)) {
        percent_util = 0;
        max_util = 0;
        cap = 0;
        util = 0;
        cpu = cpumask_first(update_cpus);
        policy = cpufreq_cpu_get(cpu);
        if (IS_ERR_OR_NULL(policy)) {
                cpumask_clear_cpu(cpu, update_cpus);
                continue;
        }
        if (!policy->gov_data)
                goto bail;
        em = policy->gov_data;
        if (ktime_before(ktime_get(), em->throttle)) {
                trace_printk("THROTTLED");
                goto bail;
        }
Can you explain the throttle thing ?

...
        /*
         * try scaling cpus
         *
         * algorithm assumptions & description:
         *      all cpus in a policy run at the same rate/capacity.
         *      choose frequency target based on most utilized cpu.
         *      do not care about aggregating cpu utilization.
         *      do not track any historical trends beyond utilization
         *      if max_util > 80% of current capacity,
         *              go to max capacity
         *      if max_util < 20% of current capacity,
         *              go to the next lowest capacity
         *      otherwise, stay at the same capacity state
         */
        for_each_cpu(tmp, policy->cpus) {
                util = get_cpu_usage(tmp);
                if (util > max_util)
                        max_util = util;
        }
        cap = capacity_of(cpu);
        if (!cap) {
                goto bail;
        }
When 'cpu' does not have a capacity ?

...
        index = cpufreq_frequency_table_get_index(policy, policy->cur);
        if (max_util > em->up_threshold[index]) {
                /* write em->target_freq with read lock held */
                atomic_long_set(&em->target_freq, policy->max);
                /*
                 * FIXME this is gross. convert arch_eval_cpu_freq to
                 * hold the write lock?
                 */
                atomic_set(&em->need_wake_task, 1);
        } else if (max_util < em->down_threshold[index]) {
                /* write em->target_freq with read lock held */
                atomic_long_set(&em->target_freq, policy->cur - 1);
                /*
                 * FIXME this is gross. convert arch_eval_cpu_freq to
                 * hold the write lock?
                 */
                atomic_set(&em->need_wake_task, 1);
        }
+bail:
        /* remove policy->cpus fromm update_cpus */
        cpumask_andnot(update_cpus, update_cpus, policy->cpus);
        cpufreq_cpu_put(policy);
}
return;
+}

+static void em_start(struct cpufreq_policy *policy) +{
int index = 0, count = 0;
unsigned int capacity;
struct em_data *em;
struct cpufreq_frequency_table *pos;
/* prepare per-policy private data */
em = kzalloc(sizeof(*em), GFP_KERNEL);
if (!em) {
        pr_debug("%s: failed to allocate private data\n", __func__);
        return;
}
policy->gov_data = em;
/* how many entries in the frequency table? */
cpufreq_for_each_entry(pos, policy->freq_table)
        count++;
/* pre-compute thresholds */
em->up_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);
em->down_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);
Allocation check missing.

...
cpufreq_for_each_entry(pos, policy->freq_table) {
        /* FIXME capacity below is not scaled for uarch */
        capacity = pos->frequency * SCHED_CAPACITY_SCALE / policy->max;
        em->up_threshold[index] = capacity * UP_THRESHOLD / 100;
        em->down_threshold[index] = capacity * DOWN_THRESHOLD / 100;
        pr_debug("%s: cpu = %u index = %d capacity = %u up = %u down = %u\n",
                        __func__, cpumask_first(policy->cpus), index,
                        capacity, em->up_threshold[index],
                        em->down_threshold[index]);
        index++;
}
/* init per-policy kthread */
em->task = kthread_create(energy_model_thread, policy, "kenergy_model_task");
if (IS_ERR_OR_NULL(em->task))
        pr_err("%s: failed to create kenergy_model_task thread\n", __func__);
+}

+static void em_stop(struct cpufreq_policy *policy) +{
struct em_data *em;
em = policy->gov_data;
kthread_stop(em->task);
/* replace with devm counterparts */
kfree(em->up_threshold);
kfree(em->down_threshold);
kfree(em);
+}

+static int energy_model_setup(struct cpufreq_policy *policy, unsigned int event) +{
switch (event) {
        case CPUFREQ_GOV_START:
                /* Start managing the frequency */
                em_start(policy);
                return 0;
        case CPUFREQ_GOV_STOP:
                em_stop(policy);
                return 0;
        case CPUFREQ_GOV_LIMITS:        /* unused */
        case CPUFREQ_GOV_POLICY_INIT:   /* unused */
        case CPUFREQ_GOV_POLICY_EXIT:   /* unused */
                break;
}
return 0;
+}

+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_ENERGY_MODEL +static +#endif +struct cpufreq_governor cpufreq_gov_energy_model = {
.name                   = "energy_model",
.governor               = energy_model_setup,
.owner                  = THIS_MODULE,
+};

+static int __init energy_model_init(void) +{
return cpufreq_register_governor(&cpufreq_gov_energy_model);
+}

+static void __exit energy_model_exit(void) +{
cpufreq_unregister_governor(&cpufreq_gov_energy_model);
+}

+/* Try to make this the default governor */ +fs_initcall(energy_model_init);

+MODULE_LICENSE("GPL"); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 188aa13..d9386b0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4385,7 +4385,7 @@ static unsigned long target_load(int cpu, int type) return max(rq->cpu_load[type-1], total); }

-static unsigned long capacity_of(int cpu) +unsigned long capacity_of(int cpu) { return cpu_rq(cpu)->cpu_capacity; } @@ -4609,7 +4609,7 @@ static int __get_cpu_usage(int cpu, int delta) return sum; }

-static int get_cpu_usage(int cpu) +int get_cpu_usage(int cpu) { return __get_cpu_usage(cpu, 0); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 23b4c11..c357c77 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -806,6 +806,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag) return sd; }

+unsigned long capacity_of(int cpu); +int get_cpu_usage(int cpu);

DECLARE_PER_CPU(struct sched_domain *, sd_llc); DECLARE_PER_CPU(int, sd_llc_size); DECLARE_PER_CPU(int, sd_llc_id);
-- http://www.linaro.org/ Linaro.org │ Open source software for ARM SoCs

Follow Linaro: http://www.facebook.com/pages/Linaro Facebook | http://twitter.com/#!/linaroorg Twitter | http://www.linaro.org/linaro-blog/ Blog

Juri Lelli

9 Mar 9 Mar

2:06 p.m.

New subject: [PATCH 3/6] sched: cfs: cpu frequency scaling based on task placement

From: Michael Turquette mturquette@linaro.org

{en,de}queue_task_fair are updated to track which cpus will have changed utilization values as function of task queueing. The affected cpus are passed on to arch_eval_cpu_freq for further machine-specific processing based on a selectable policy.

arch_scale_cpu_freq is called from run_rebalance_domains as a way to kick off the scaling process (via wake_up_process), so as to prevent re-entering the {en,de}queue code.

All of the call sites in this patch are up for discussion. Does it make sense to track which cpus have updated statistics in enqueue_fair_task? I chose this because I wanted to gather statistics for all cpus affected in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the next version of this patch will focus on the simpler case of not using scheduler cgroups, which should remove a good chunk of this code, including the cpumask stuff.

Also discussed at LPC14 is that fact that load_balance is a very interesting place to do this as frequency can be considered in concert with task placement. Please put forth any ideas on a sensible way to do this.

Is run_rebalance_domains a logical place to change cpu frequency? What other call sites make sense?

Even for platforms that can target a cpu frequency without sleeping (x86, some ARM platforms with PM microcontrollers) it is currently necessary to always kick the frequency target work out into a kthread. This is because of the rw_sem usage in the cpufreq core which might sleep. Replacing that lock type is probably a good idea.

Not-signed-off-by: Mike Turquette mturquette@linaro.org

Signed-off-by: Michael Turquette mturquette@linaro.org --- kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++----- kernel/sched/sched.h | 2 ++ 2 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d9386b0..1043266 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4236,6 +4236,11 @@ static inline void hrtick_update(struct rq *rq) } #endif

+static inline bool energy_aware(void) +{ + return sched_feat(ENERGY_AWARE); +} + /* * The enqueue_task method is called before nr_running is * increased. Here we update the fair scheduling stats and @@ -4246,6 +4251,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) { struct cfs_rq *cfs_rq; struct sched_entity *se = &p->se; + struct cpumask update_cpus; + + cpumask_clear(&update_cpus);

for_each_sched_entity(se) { if (se->on_rq) @@ -4275,12 +4283,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

update_cfs_shares(cfs_rq); update_entity_load_avg(se, 1); + /* track cpus that need to be re-evaluated */ + cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus); }

+ /* !CONFIG_FAIR_GROUP_SCHED */ if (!se) { update_rq_runnable_avg(rq, rq->nr_running); add_nr_running(rq, 1); + + /* + * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to + * typedef update_cpus into an int and skip all of the cpumask + * stuff + */ + cpumask_set_cpu(cpu_of(rq), &update_cpus); } + + if (energy_aware() && !cpumask_empty(&update_cpus)) + arch_eval_cpu_freq(&update_cpus); + hrtick_update(rq); }

@@ -4296,6 +4318,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) struct cfs_rq *cfs_rq; struct sched_entity *se = &p->se; int task_sleep = flags & DEQUEUE_SLEEP; + struct cpumask update_cpus; + + cpumask_clear(&update_cpus);

for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); @@ -4336,12 +4361,26 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

update_cfs_shares(cfs_rq); update_entity_load_avg(se, 1); + /* track runqueues/cpus that need to be re-evaluated */ + cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus); }

+ /* !CONFIG_FAIR_GROUP_SCHED */ if (!se) { sub_nr_running(rq, 1); update_rq_runnable_avg(rq, 1); + + /* + * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to + * typedef update_cpus into an int and skip all of the cpumask + * stuff + */ + cpumask_set_cpu(cpu_of(rq), &update_cpus); } + + if (energy_aware() && !cpumask_empty(&update_cpus)) + arch_eval_cpu_freq(&update_cpus); + hrtick_update(rq); }

@@ -4615,11 +4654,6 @@ int get_cpu_usage(int cpu) }

-static inline bool energy_aware(void) -{ - return sched_feat(ENERGY_AWARE); -} - struct energy_env { struct sched_group *sg_top; struct sched_group *sg_cap; @@ -8292,6 +8326,9 @@ static void run_rebalance_domains(struct softirq_action *h) * stopped. */ nohz_idle_balance(this_rq, idle); + + if (energy_aware()) + arch_scale_cpu_freq(); }

/* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index c357c77..167ba2a 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -808,6 +808,8 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)

unsigned long capacity_of(int cpu); int get_cpu_usage(int cpu); +void arch_scale_cpu_freq(void); +void arch_eval_cpu_freq(struct cpumask *update_cpus);

DECLARE_PER_CPU(struct sched_domain *, sd_llc); DECLARE_PER_CPU(int, sd_llc_size); -- 2.2.2

Morten Rasmussen

11 Mar 11 Mar

5 p.m.

New subject: [PATCH 3/6] sched: cfs: cpu frequency scaling based on task placement

On Mon, Mar 09, 2015 at 02:06:17PM +0000, Juri Lelli wrote:

...

From: Michael Turquette mturquette@linaro.org

{en,de}queue_task_fair are updated to track which cpus will have changed utilization values as function of task queueing. The affected cpus are passed on to arch_eval_cpu_freq for further machine-specific processing based on a selectable policy.

arch_scale_cpu_freq is called from run_rebalance_domains as a way to kick off the scaling process (via wake_up_process), so as to prevent re-entering the {en,de}queue code.

All of the call sites in this patch are up for discussion. Does it make sense to track which cpus have updated statistics in enqueue_fair_task? I chose this because I wanted to gather statistics for all cpus affected in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the next version of this patch will focus on the simpler case of not using scheduler cgroups, which should remove a good chunk of this code, including the cpumask stuff.

Also discussed at LPC14 is that fact that load_balance is a very interesting place to do this as frequency can be considered in concert with task placement. Please put forth any ideas on a sensible way to do this.

Is run_rebalance_domains a logical place to change cpu frequency? What other call sites make sense?

Even for platforms that can target a cpu frequency without sleeping (x86, some ARM platforms with PM microcontrollers) it is currently necessary to always kick the frequency target work out into a kthread. This is because of the rw_sem usage in the cpufreq core which might sleep. Replacing that lock type is probably a good idea.

Not-signed-off-by: Mike Turquette mturquette@linaro.org

Signed-off-by: Michael Turquette mturquette@linaro.org

kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++----- kernel/sched/sched.h | 2 ++ 2 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d9386b0..1043266 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c

[...]

...

@@ -4275,12 +4283,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) update_cfs_shares(cfs_rq); update_entity_load_avg(se, 1);
/* track cpus that need to be re-evaluated */
cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
}
/* !CONFIG_FAIR_GROUP_SCHED */ if (!se) { update_rq_runnable_avg(rq, rq->nr_running); add_nr_running(rq, 1);
/*
 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
 * typedef update_cpus into an int and skip all of the cpumask
 * stuff
 */
cpumask_set_cpu(cpu_of(rq), &update_cpus);

You are only affecting one cpu in {en,de}queue_task_fair() so update_cpus should never have more than one cpu set. The for_each_sched_entity() loops are just to ensure that all the parent sched groups' sched_entities are enqueued on their parents rqs. They all belong to one cpu and that is cpu_of(rq).

I was about to says that you don't need more than just an int, but if you want to hook into rebalance_domains() or somewhere in that path where we move tasks between cpus you would probably want to update both source and destination cpus. It gets more complicated in nohz_idle_balance() where you may move tasks between multiple sources and destinations in one go. So having a mask might not be a bad idea after all.

Since you only periodically check the cpus in the update_cpus mask you may have had tasks {en,de}queue on multiple cpus in the meantime. So you need the cpumask anyway.

If you keep the cpumask here in {en,de}queue_task_fair() you can get away with just a single cpumask_set_cpu(cpu_of(rq)).

If you want to experiment with hooking into load_balance() which would cover periodic, idle, and nohz_idle balancing I think it could worth a try setting env->src_cpu and env->dst_cpu in the update_cpus mask just after attach_tasks() in load_balance(). You get there every time tasks have been migrated due to 'normal' load-balancing.

You can do something very similar after attach_one_task() in active_load_balance_cpu_stop() to get active balance migrations too (not 'normal').

I guess that should cover all changes in utilization due to task migrations. The last bit missing is changes to utilization caused by changing task behaviour. For example, a small task suddenly turning heavy. Since it is now runnable all the time it won't be dequeued and load_balance() may not find a reason to migrate it. If we don't check the utilization of busy cpus every now and again the frequency we picked a while ago may no longer match the current utilization.

A naive solution could be to mark a cpu in update_cpus every time it receives a tick. At the end of task_tick_fair() for example. But then it isn't really event driven anymore. It needs some more thinking.

Morten

Juri Lelli

13 Mar 13 Mar

10:37 a.m.

New subject: [PATCH 3/6] sched: cfs: cpu frequency scaling based on task placement

On 11/03/15 17:00, Morten Rasmussen wrote:

...

On Mon, Mar 09, 2015 at 02:06:17PM +0000, Juri Lelli wrote:

...
From: Michael Turquette mturquette@linaro.org

{en,de}queue_task_fair are updated to track which cpus will have changed utilization values as function of task queueing. The affected cpus are passed on to arch_eval_cpu_freq for further machine-specific processing based on a selectable policy.

arch_scale_cpu_freq is called from run_rebalance_domains as a way to kick off the scaling process (via wake_up_process), so as to prevent re-entering the {en,de}queue code.

All of the call sites in this patch are up for discussion. Does it make sense to track which cpus have updated statistics in enqueue_fair_task? I chose this because I wanted to gather statistics for all cpus affected in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the next version of this patch will focus on the simpler case of not using scheduler cgroups, which should remove a good chunk of this code, including the cpumask stuff.

Also discussed at LPC14 is that fact that load_balance is a very interesting place to do this as frequency can be considered in concert with task placement. Please put forth any ideas on a sensible way to do this.

Is run_rebalance_domains a logical place to change cpu frequency? What other call sites make sense?

Even for platforms that can target a cpu frequency without sleeping (x86, some ARM platforms with PM microcontrollers) it is currently necessary to always kick the frequency target work out into a kthread. This is because of the rw_sem usage in the cpufreq core which might sleep. Replacing that lock type is probably a good idea.

Not-signed-off-by: Mike Turquette mturquette@linaro.org

Signed-off-by: Michael Turquette mturquette@linaro.org

kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++----- kernel/sched/sched.h | 2 ++ 2 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d9386b0..1043266 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c

[...]

...
@@ -4275,12 +4283,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) update_cfs_shares(cfs_rq); update_entity_load_avg(se, 1);
/* track cpus that need to be re-evaluated */
cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
}
/* !CONFIG_FAIR_GROUP_SCHED */ if (!se) { update_rq_runnable_avg(rq, rq->nr_running); add_nr_running(rq, 1);
/*
 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
 * typedef update_cpus into an int and skip all of the cpumask
 * stuff
 */
cpumask_set_cpu(cpu_of(rq), &update_cpus);
You are only affecting one cpu in {en,de}queue_task_fair() so update_cpus should never have more than one cpu set. The for_each_sched_entity() loops are just to ensure that all the parent sched groups' sched_entities are enqueued on their parents rqs. They all belong to one cpu and that is cpu_of(rq).

I was about to says that you don't need more than just an int, but if you want to hook into rebalance_domains() or somewhere in that path where we move tasks between cpus you would probably want to update both source and destination cpus. It gets more complicated in nohz_idle_balance() where you may move tasks between multiple sources and destinations in one go. So having a mask might not be a bad idea after all.

Since you only periodically check the cpus in the update_cpus mask you may have had tasks {en,de}queue on multiple cpus in the meantime. So you need the cpumask anyway.

Right. We need a way to keep track of "pending requests".

...

If you keep the cpumask here in {en,de}queue_task_fair() you can get away with just a single cpumask_set_cpu(cpu_of(rq)).

If you want to experiment with hooking into load_balance() which would cover periodic, idle, and nohz_idle balancing I think it could worth a try setting env->src_cpu and env->dst_cpu in the update_cpus mask just after attach_tasks() in load_balance(). You get there every time tasks have been migrated due to 'normal' load-balancing.

You can do something very similar after attach_one_task() in active_load_balance_cpu_stop() to get active balance migrations too (not 'normal').

{attach,detach}_tasks() (and attach_one_task()) end up calling {activate,deactivate}_task, so we have {enqueue,dequeue}_task for source and destination runqueues. So, even in these cases it seems fine to just evaluate the new situation from {en,de}queue_task_fair() (or even the core calling sites if we want to be more general).

Am I missing something?

...

I guess that should cover all changes in utilization due to task migrations. The last bit missing is changes to utilization caused by changing task behaviour. For example, a small task suddenly turning heavy. Since it is now runnable all the time it won't be dequeued and load_balance() may not find a reason to migrate it. If we don't check the utilization of busy cpus every now and again the frequency we picked a while ago may no longer match the current utilization.

A naive solution could be to mark a cpu in update_cpus every time it receives a tick. At the end of task_tick_fair() for example. But then it isn't really event driven anymore. It needs some more thinking.

Right, this bit is missing. Well, the event would be a task crossing the up threshold at the current cap. We kick the thing only if this happened. But yeah, maybe something more clever is possible.

Thanks,

- Juri

Morten Rasmussen

5:07 p.m.

New subject: [PATCH 3/6] sched: cfs: cpu frequency scaling based on task placement

On Fri, Mar 13, 2015 at 10:37:53AM +0000, Juri Lelli wrote:

...

On 11/03/15 17:00, Morten Rasmussen wrote:

...
On Mon, Mar 09, 2015 at 02:06:17PM +0000, Juri Lelli wrote:

...
From: Michael Turquette mturquette@linaro.org

{en,de}queue_task_fair are updated to track which cpus will have changed utilization values as function of task queueing. The affected cpus are passed on to arch_eval_cpu_freq for further machine-specific processing based on a selectable policy.

arch_scale_cpu_freq is called from run_rebalance_domains as a way to kick off the scaling process (via wake_up_process), so as to prevent re-entering the {en,de}queue code.

All of the call sites in this patch are up for discussion. Does it make sense to track which cpus have updated statistics in enqueue_fair_task? I chose this because I wanted to gather statistics for all cpus affected in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the next version of this patch will focus on the simpler case of not using scheduler cgroups, which should remove a good chunk of this code, including the cpumask stuff.

Also discussed at LPC14 is that fact that load_balance is a very interesting place to do this as frequency can be considered in concert with task placement. Please put forth any ideas on a sensible way to do this.

Is run_rebalance_domains a logical place to change cpu frequency? What other call sites make sense?

Even for platforms that can target a cpu frequency without sleeping (x86, some ARM platforms with PM microcontrollers) it is currently necessary to always kick the frequency target work out into a kthread. This is because of the rw_sem usage in the cpufreq core which might sleep. Replacing that lock type is probably a good idea.

Not-signed-off-by: Mike Turquette mturquette@linaro.org

Signed-off-by: Michael Turquette mturquette@linaro.org

kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++----- kernel/sched/sched.h | 2 ++ 2 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d9386b0..1043266 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c

[...]

...
@@ -4275,12 +4283,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_shares(cfs_rq);
update_entity_load_avg(se, 1);
/* track cpus that need to be re-evaluated */
cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
}
/* !CONFIG_FAIR_GROUP_SCHED */ if (!se) { update_rq_runnable_avg(rq, rq->nr_running); add_nr_running(rq, 1);
/*
 * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
 * typedef update_cpus into an int and skip all of the cpumask
 * stuff
 */
cpumask_set_cpu(cpu_of(rq), &update_cpus);
You are only affecting one cpu in {en,de}queue_task_fair() so update_cpus should never have more than one cpu set. The for_each_sched_entity() loops are just to ensure that all the parent sched groups' sched_entities are enqueued on their parents rqs. They all belong to one cpu and that is cpu_of(rq).

I was about to says that you don't need more than just an int, but if you want to hook into rebalance_domains() or somewhere in that path where we move tasks between cpus you would probably want to update both source and destination cpus. It gets more complicated in nohz_idle_balance() where you may move tasks between multiple sources and destinations in one go. So having a mask might not be a bad idea after all.

Since you only periodically check the cpus in the update_cpus mask you may have had tasks {en,de}queue on multiple cpus in the meantime. So you need the cpumask anyway.
Right. We need a way to keep track of "pending requests".

...
If you keep the cpumask here in {en,de}queue_task_fair() you can get away with just a single cpumask_set_cpu(cpu_of(rq)).

If you want to experiment with hooking into load_balance() which would cover periodic, idle, and nohz_idle balancing I think it could worth a try setting env->src_cpu and env->dst_cpu in the update_cpus mask just after attach_tasks() in load_balance(). You get there every time tasks have been migrated due to 'normal' load-balancing.

You can do something very similar after attach_one_task() in active_load_balance_cpu_stop() to get active balance migrations too (not 'normal').

{attach,detach}_tasks() (and attach_one_task()) end up calling {activate,deactivate}_task, so we have {enqueue,dequeue}_task for source and destination runqueues. So, even in these cases it seems fine to just evaluate the new situation from {en,de}queue_task_fair() (or even the core calling sites if we want to be more general).

Am I missing something?

No, it is me not doing my homework well enough. You are right.

You would however potentially get multiple enqueue/dequeue calls to the same cpus if you are doing a load_balance() but I guess the overhead would be small.

...

...
I guess that should cover all changes in utilization due to task migrations. The last bit missing is changes to utilization caused by changing task behaviour. For example, a small task suddenly turning heavy. Since it is now runnable all the time it won't be dequeued and load_balance() may not find a reason to migrate it. If we don't check the utilization of busy cpus every now and again the frequency we picked a while ago may no longer match the current utilization.

A naive solution could be to mark a cpu in update_cpus every time it receives a tick. At the end of task_tick_fair() for example. But then it isn't really event driven anymore. It needs some more thinking.

Right, this bit is missing. Well, the event would be a task crossing the up threshold at the current cap. We kick the thing only if this happened. But yeah, maybe something more clever is possible.

To complete the picture we also need to sort out the problems with blocked load and nohz that you encountered so we don't keep the frequency up unnecessarily.

Michael Turquette

16 Mar 16 Mar

10:08 p.m.

New subject: [PATCH 3/6] sched: cfs: cpu frequency scaling based on task placement

Quoting Juri Lelli (2015-03-13 03:37:53)

...

On 11/03/15 17:00, Morten Rasmussen wrote:

...
On Mon, Mar 09, 2015 at 02:06:17PM +0000, Juri Lelli wrote:

...
From: Michael Turquette mturquette@linaro.org

{en,de}queue_task_fair are updated to track which cpus will have changed utilization values as function of task queueing. The affected cpus are passed on to arch_eval_cpu_freq for further machine-specific processing based on a selectable policy.

arch_scale_cpu_freq is called from run_rebalance_domains as a way to kick off the scaling process (via wake_up_process), so as to prevent re-entering the {en,de}queue code.

All of the call sites in this patch are up for discussion. Does it make sense to track which cpus have updated statistics in enqueue_fair_task? I chose this because I wanted to gather statistics for all cpus affected in the event CONFIG_FAIR_GROUP_SCHED is enabled. As agreed at LPC14 the next version of this patch will focus on the simpler case of not using scheduler cgroups, which should remove a good chunk of this code, including the cpumask stuff.

Also discussed at LPC14 is that fact that load_balance is a very interesting place to do this as frequency can be considered in concert with task placement. Please put forth any ideas on a sensible way to do this.

Is run_rebalance_domains a logical place to change cpu frequency? What other call sites make sense?

Even for platforms that can target a cpu frequency without sleeping (x86, some ARM platforms with PM microcontrollers) it is currently necessary to always kick the frequency target work out into a kthread. This is because of the rw_sem usage in the cpufreq core which might sleep. Replacing that lock type is probably a good idea.

Not-signed-off-by: Mike Turquette mturquette@linaro.org

Signed-off-by: Michael Turquette mturquette@linaro.org

kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++----- kernel/sched/sched.h | 2 ++ 2 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d9386b0..1043266 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c

[...]

...
@@ -4275,12 +4283,26 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) update_cfs_shares(cfs_rq); update_entity_load_avg(se, 1);
       /* track cpus that need to be re-evaluated */
       cpumask_set_cpu(cpu_of(rq_of(cfs_rq)), &update_cpus);
}
/* !CONFIG_FAIR_GROUP_SCHED */ if (!se) { update_rq_runnable_avg(rq, rq->nr_running); add_nr_running(rq, 1);
       /*
        * FIXME for !CONFIG_FAIR_GROUP_SCHED it might be nice to
        * typedef update_cpus into an int and skip all of the cpumask
        * stuff
        */
       cpumask_set_cpu(cpu_of(rq), &update_cpus);
You are only affecting one cpu in {en,de}queue_task_fair() so update_cpus should never have more than one cpu set. The for_each_sched_entity() loops are just to ensure that all the parent sched groups' sched_entities are enqueued on their parents rqs. They all belong to one cpu and that is cpu_of(rq).

I was about to says that you don't need more than just an int, but if you want to hook into rebalance_domains() or somewhere in that path where we move tasks between cpus you would probably want to update both source and destination cpus. It gets more complicated in nohz_idle_balance() where you may move tasks between multiple sources and destinations in one go. So having a mask might not be a bad idea after all.

Since you only periodically check the cpus in the update_cpus mask you may have had tasks {en,de}queue on multiple cpus in the meantime. So you need the cpumask anyway.
Right. We need a way to keep track of "pending requests".

Instead of tracking "requests" by tagging cpus that have updated stats in a cpumask we could track "constraints", where a constraint is the return value of get_cpu_usage stored in a per-cpu variable whenever a task is migrated. (or whenever the load contribution for a task change significantly. more on that at the bottom)

In the case of scaling cpu frequency up we can do as Juri says below and kick the logic whenever we cross the up_threshold.

For scaling down we need to know usage (from get_cpu_usage) of every cpu in the frequency domain, or more specifically we need to know the max usage in the freq domain. This is enough to pick the best capacity state based on floor(max_usage).

...

...
If you keep the cpumask here in {en,de}queue_task_fair() you can get away with just a single cpumask_set_cpu(cpu_of(rq)).

If you want to experiment with hooking into load_balance() which would cover periodic, idle, and nohz_idle balancing I think it could worth a try setting env->src_cpu and env->dst_cpu in the update_cpus mask just after attach_tasks() in load_balance(). You get there every time tasks have been migrated due to 'normal' load-balancing.

You can do something very similar after attach_one_task() in active_load_balance_cpu_stop() to get active balance migrations too (not 'normal').

{attach,detach}_tasks() (and attach_one_task()) end up calling {activate,deactivate}_task, so we have {enqueue,dequeue}_task for source and destination runqueues. So, even in these cases it seems fine to just evaluate the new situation from {en,de}queue_task_fair() (or even the core calling sites if we want to be more general).

Am I missing something?

...
I guess that should cover all changes in utilization due to task migrations. The last bit missing is changes to utilization caused by changing task behaviour. For example, a small task suddenly turning heavy. Since it is now runnable all the time it won't be dequeued and load_balance() may not find a reason to migrate it. If we don't check the utilization of busy cpus every now and again the frequency we picked a while ago may no longer match the current utilization.

How do we normally detect a significant change in task behavior? If we have an answer to this question then we have a call site to potentially evaluate cpu frequency.

...

...
A naive solution could be to mark a cpu in update_cpus every time it receives a tick. At the end of task_tick_fair() for example. But then it isn't really event driven anymore. It needs some more thinking.

Right, this bit is missing. Well, the event would be a task crossing the up threshold at the current cap. We kick the thing only if this happened. But yeah, maybe something more clever is possible.

If we update constraints in per-cpu vars (as I outlined above) then could we move the code that does that out of {en,de}queue_fair_task and into some place common that solves this problem? I need some help on this one, but I was imagining something like:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6fce9e0..266f4aa 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2785,6 +2785,7 @@ static inline void update_entity_load_avg(struct sched_entity *se, struct cfs_rq *cfs_rq = cfs_rq_of(se); long contrib_delta, utilization_delta; int cpu = cpu_of(rq_of(cfs_rq)); + int cap; u64 now;

/* @@ -2818,6 +2819,10 @@ static inline void update_entity_load_avg(struct sched_entity *se, -utilization_delta); }

+ /* store updated per-cpu usage when update_cfs_rq is true */ + cap = get_cpu_usage(cpu); + atomic_long_set(&per_cpu(usage, cpu), cap); + trace_sched_load_avg_cpu(cpu, cfs_rq); }

The above untested snippet only cares about the CFS usage. In the future we might want to aggregate it with DL and RT usage as well, or we might have a per-class, per-cpu atomic.

Any ideas on this approach?

Further thoughts:

Assuming that the above idea is agreeable, one of the questions I am grappling with is whether we should store the usage in a per-cpu variable or if we should add it as a member to struct rq (in which case it would be a sum of usage from every sched class).

I am currently only looking at the CFS-centric, per-cpu atomic variable method, following the philosophy of Keep It Simple, Stupid. However putting it in struct rq has benefits:

1) locking is already managed for us when we need to update with a new usage value, as _THIS_ cpu is the one that will update its own usage. This means we can avoid costly atomic operations.

2) locking is already managed for us when walking the rq's looking for a max usage in places like load_balance. It is NOT managed when a new task is placed on the rq though, so this would need some synchronization? Or we could ignore the possibility of scaling cpu when a new task is enqueued.

Regards, Mike

Dietmar Eggemann

17 Mar 17 Mar

5:13 p.m.

New subject: [PATCH 3/6] sched: cfs: cpu frequency scaling based on task placement

Hi Mike,

On 16/03/15 22:08, Michael Turquette wrote:

...

Quoting Juri Lelli (2015-03-13 03:37:53)

...
On 11/03/15 17:00, Morten Rasmussen wrote:

...
On Mon, Mar 09, 2015 at 02:06:17PM +0000, Juri Lelli wrote:

...
From: Michael Turquette mturquette@linaro.org

[...]

...

...
...
I was about to says that you don't need more than just an int, but if you want to hook into rebalance_domains() or somewhere in that path where we move tasks between cpus you would probably want to update both source and destination cpus. It gets more complicated in nohz_idle_balance() where you may move tasks between multiple sources and destinations in one go. So having a mask might not be a bad idea after all.

Since you only periodically check the cpus in the update_cpus mask you may have had tasks {en,de}queue on multiple cpus in the meantime. So you need the cpumask anyway.

Right. We need a way to keep track of "pending requests".

Instead of tracking "requests" by tagging cpus that have updated stats in a cpumask we could track "constraints", where a constraint is the return value of get_cpu_usage stored in a per-cpu variable whenever a task is migrated. (or whenever the load contribution for a task change significantly. more on that at the bottom)

I'm not sure this is necessary. The information that we have a cpu with a request is sufficient because we can always get the cpu usage just before we kick the kthread. (exactly what you do right now).

The issue is that we could see a lot of requests during the time the em mechanism is throttled.

Why don't you just add:

cpumask_set_cpu(cpu_of(rq), &update_cpus);

if (energy_aware() && !cpumask_empty(&update_cpus)) arch_eval_cpu_freq(&update_cpus);

into task_tick_fair() to detect cpu usage changes due to a single task running in case there is no enqueue/dequeue happening?

...

In the case of scaling cpu frequency up we can do as Juri says below and kick the logic whenever we cross the up_threshold.

For scaling down we need to know usage (from get_cpu_usage) of every cpu in the frequency domain, or more specifically we need to know the max usage in the freq domain. This is enough to pick the best capacity state based on floor(max_usage).

[...]

...

How do we normally detect a significant change in task behavior? If we have an answer to this question then we have a call site to potentially evaluate cpu frequency.

If the task is running out flat, it would be the cfs tick function.

...

...
...
A naive solution could be to mark a cpu in update_cpus every time it receives a tick. At the end of task_tick_fair() for example. But then it isn't really event driven anymore. It needs some more thinking.

Right, this bit is missing. Well, the event would be a task crossing the up threshold at the current cap. We kick the thing only if this happened. But yeah, maybe something more clever is possible.

If we update constraints in per-cpu vars (as I outlined above) then could we move the code that does that out of {en,de}queue_fair_task and into some place common that solves this problem? I need some help on this one, but I was imagining something like:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6fce9e0..266f4aa 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2785,6 +2785,7 @@ static inline void update_entity_load_avg(struct sched_entity *se, struct cfs_rq *cfs_rq = cfs_rq_of(se); long contrib_delta, utilization_delta; int cpu = cpu_of(rq_of(cfs_rq));
  int cap;
   u64 now;

   /*
@@ -2818,6 +2819,10 @@ static inline void update_entity_load_avg(struct sched_entity *se, -utilization_delta); }
  /* store updated per-cpu usage when update_cfs_rq is true */
  cap = get_cpu_usage(cpu);
  atomic_long_set(&per_cpu(usage, cpu), cap);
   trace_sched_load_avg_cpu(cpu, cfs_rq);
}

I guess it could work but why changing the current approach (calling arch_eval_cpu_freq() from CFS's enqueue, dequeue and *tick*)?

update_entity_load_avg() is all about updating PELT for the sched_entity se and the two cfs_rq PELT signals (cfs_rq->utilization_load_avg (blue*), cfs_rq->utilization_blocked_avg (green*)).

I just don't see the advantage right now to save cpu usage (red*) per CFS's enqueue, dequeue and tick operation comparing to save the information that a cpu requested a freq change. In case of 'pending requests' happening in a throttle period, what is the value of the per (cfs_)rq signal 'cpu usage' any way? It could be a stale value once the throttle period is over.

* colors in attached diagram (data from TC2 (2 A15, 2 A7) in update_entity_load_avg()

...

The above untested snippet only cares about the CFS usage. In the future we might want to aggregate it with DL and RT usage as well, or we might have a per-class, per-cpu atomic.

IMHO,we should focus on CFS (with the underlaying PELT mechanism) for now and solve this smaller set of problems first:

(1) Fixup the 'struct cpumask update_cpus' problem. (2) Fix the problem that we evaluate the cpu_usage exactly at the end of a throttling period if there were 'pending requests' even if there are no events at this moment. (3) kick the logic whenever we cross the up_threshold for scaling cpu frequency up.

...

Any ideas on this approach?

Further thoughts:

Assuming that the above idea is agreeable, one of the questions I am grappling with is whether we should store the usage in a per-cpu variable or if we should add it as a member to struct rq (in which case it would be a sum of usage from every sched class).

I am currently only looking at the CFS-centric, per-cpu atomic variable method, following the philosophy of Keep It Simple, Stupid. However putting it in struct rq has benefits:

I like this 'CFS-centric, per-cpu atomic variable method' because IMHO it gives the playground for everything we need right now to let it work with EAS.

...

locking is already managed for us when we need to update with a new

usage value, as _THIS_ cpu is the one that will update its own usage. This means we can avoid costly atomic operations.

locking is already managed for us when walking the rq's looking for a

max usage in places like load_balance. It is NOT managed when a new task is placed on the rq though, so this would need some synchronization? Or we could ignore the possibility of scaling cpu when a new task is enqueued.

IMHO, we can life with the little overhead of some atomics for now to get it working in CFS with EAS for the end of march RFC.

-- Dietmar

...

Regards, Mike

Michael Turquette

18 Mar 18 Mar

5:23 a.m.

New subject: [PATCH 3/6] sched: cfs: cpu frequency scaling based on task placement

Hi Dietmar,

Quoting Dietmar Eggemann (2015-03-17 10:13:39)

...

Hi Mike,

On 16/03/15 22:08, Michael Turquette wrote:

...
Instead of tracking "requests" by tagging cpus that have updated stats in a cpumask we could track "constraints", where a constraint is the return value of get_cpu_usage stored in a per-cpu variable whenever a task is migrated. (or whenever the load contribution for a task change significantly. more on that at the bottom)

I'm not sure this is necessary. The information that we have a cpu with a request is sufficient because we can always get the cpu usage just before we kick the kthread. (exactly what you do right now).

Well, I've never thought that calling get_cpu_usage from the kthread was really a good idea. Shouldn't there be some locking around accesses to the cfs_rq data structures?

This is one reason why I was considering migrating the data that the governor wants (per-cpu usage data) into per-cpu atomics, as opposed to iterating over every cpu with get_cpu_usage or utilization_load_avg_of (as it was called in my original series). It is basically a poor-mans IPC mechanism.

If accessing cfs_rq without locking is actually safe then please let me know.

...

The issue is that we could see a lot of requests during the time the em mechanism is throttled.

Why don't you just add:

cpumask_set_cpu(cpu_of(rq), &update_cpus);

if (energy_aware() && !cpumask_empty(&update_cpus)) arch_eval_cpu_freq(&update_cpus);

into task_tick_fair() to detect cpu usage changes due to a single task running in case there is no enqueue/dequeue happening?

Sure, and putting the change into update_entity_load_avg as I suggested achieves the same end result. That data will never be more stale than the rest of the PELT values.

Also I've been thinking a lot about the combination of throttling and the kthread. I'm starting to think that this design would end up being periodic (based on the throttle period), which isn't so different from how legacy CPUfreq governors work today. Do you have time to chat on the phone about this? I'll send out an invite.

[snip]

...

...
If we update constraints in per-cpu vars (as I outlined above) then could we move the code that does that out of {en,de}queue_fair_task and into some place common that solves this problem? I need some help on this one, but I was imagining something like:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6fce9e0..266f4aa 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2785,6 +2785,7 @@ static inline void update_entity_load_avg(struct sched_entity *se, struct cfs_rq *cfs_rq = cfs_rq_of(se); long contrib_delta, utilization_delta; int cpu = cpu_of(rq_of(cfs_rq));
  int cap;
   u64 now;

   /*
@@ -2818,6 +2819,10 @@ static inline void update_entity_load_avg(struct sched_entity *se, -utilization_delta); }
  /* store updated per-cpu usage when update_cfs_rq is true */
  cap = get_cpu_usage(cpu);
  atomic_long_set(&per_cpu(usage, cpu), cap);
   trace_sched_load_avg_cpu(cpu, cfs_rq);
}
I guess it could work but why changing the current approach (calling arch_eval_cpu_freq() from CFS's enqueue, dequeue and *tick*)?

update_entity_load_avg() is all about updating PELT for the sched_entity se and the two cfs_rq PELT signals (cfs_rq->utilization_load_avg (blue*), cfs_rq->utilization_blocked_avg (green*)).

I just don't see the advantage right now to save cpu usage (red*) per CFS's enqueue, dequeue and tick operation comparing to save the

Thanks for the graph. I'll state some assumptions. Please let me know if I am wrong.

cpu0 & cpu1 are A15s whose max capacity is 1024. You are running some use case that mostly keeps those cpus fully utilized (or more correctly, over-utilized) at the highest frequency, which is why the red dot stays capped at SCHED_LOAD_SCALE.

cpu2 & cpu3 seem to show the same behavior but the different scales make it hard to be certain. I'll assume that they are both running at their highest frequency and over-utilized for most of the use case.

Assuming what I stated above is correct, this seems like useful data to me. The point of having the per-cpu atomic storage is to have lockless access to this useful info. If the cpufreq thread needs to iterate over this data to find the most utilized cpu in the freq domain then the per-cpu version of this removes any requirement to lock the cfs_rq structures.

...

information that a cpu requested a freq change. In case of 'pending requests' happening in a throttle period, what is the value of the per (cfs_)rq signal 'cpu usage' any way? It could be a stale value once the throttle period is over.

My proposal was to update the per-cpu data in update_entity_load_avg every time the load contribution changes. Thus it would never be stale (or no more or less stale than the rest of PELT values). Please let me know if I have any holes in that assumption.

When the governor thread finally is ready to perform a cpu frequency transition it has already had the latest data *pushed* to it via the shared per-cpu atomic.

...

colors in attached diagram (data from TC2 (2 A15, 2 A7) in

update_entity_load_avg()

...
The above untested snippet only cares about the CFS usage. In the future we might want to aggregate it with DL and RT usage as well, or we might have a per-class, per-cpu atomic.

IMHO,we should focus on CFS (with the underlaying PELT mechanism) for now and solve this smaller set of problems first:

(1) Fixup the 'struct cpumask update_cpus' problem.

Can you elaborate on what this problem is? I wanted to replace this cpumask since I am not trying to solve the CONFIG_FAIR_GROUP_SCHED problem where a sched_entity is not a task. Is there another problem besides that?

...

(2) Fix the problem that we evaluate the cpu_usage exactly at the end of a throttling period if there were 'pending requests' even if there are no events at this moment.

Yes, I agree this is a problem.

...

(3) kick the logic whenever we cross the up_threshold for scaling cpu frequency up.

This can be taken care of by either kicking the thread in update_entity_load_avg or by adding the thread-kick to every caller of that function, namely task_tick_fair as you pointed out.

...

...
Any ideas on this approach?

Further thoughts:

Assuming that the above idea is agreeable, one of the questions I am grappling with is whether we should store the usage in a per-cpu variable or if we should add it as a member to struct rq (in which case it would be a sum of usage from every sched class).

I am currently only looking at the CFS-centric, per-cpu atomic variable method, following the philosophy of Keep It Simple, Stupid. However putting it in struct rq has benefits:

I like this 'CFS-centric, per-cpu atomic variable method' because IMHO it gives the playground for everything we need right now to let it work with EAS.

I am confused. At the top of this mail you said you didn't see the advantage, but you like it here. Hopefully we can discuss it over the call.

Thanks, Mike

...

...

locking is already managed for us when we need to update with a new

usage value, as _THIS_ cpu is the one that will update its own usage. This means we can avoid costly atomic operations.

locking is already managed for us when walking the rq's looking for a

max usage in places like load_balance. It is NOT managed when a new task is placed on the rq though, so this would need some synchronization? Or we could ignore the possibility of scaling cpu when a new task is enqueued.

IMHO, we can life with the little overhead of some atomics for now to get it working in CFS with EAS for the end of march RFC.

-- Dietmar

...
Regards, Mike

Dietmar Eggemann

12:33 p.m.

New subject: [PATCH 3/6] sched: cfs: cpu frequency scaling based on task placement

Hi Mike,

On 18/03/15 05:23, Michael Turquette wrote:

...

Hi Dietmar,

Quoting Dietmar Eggemann (2015-03-17 10:13:39)

...
Hi Mike,

On 16/03/15 22:08, Michael Turquette wrote:

...
Instead of tracking "requests" by tagging cpus that have updated stats in a cpumask we could track "constraints", where a constraint is the return value of get_cpu_usage stored in a per-cpu variable whenever a task is migrated. (or whenever the load contribution for a task change significantly. more on that at the bottom)

I'm not sure this is necessary. The information that we have a cpu with a request is sufficient because we can always get the cpu usage just before we kick the kthread. (exactly what you do right now).

Well, I've never thought that calling get_cpu_usage from the kthread was really a good idea. Shouldn't there be some locking around accesses to the cfs_rq data structures?

Heaven forbid, me neither :-)

...

This is one reason why I was considering migrating the data that the governor wants (per-cpu usage data) into per-cpu atomics, as opposed to iterating over every cpu with get_cpu_usage or utilization_load_avg_of (as it was called in my original series). It is basically a poor-mans IPC mechanism.

OK, for me it sounds like you want to store cpu usage in per-cpu atomics in update_entity_load_avg() to avoid iterating the 'policy->cpus) in arch_eval_cpu_freq().

IMHO, iterating over cpus to read stuff is fine.

E.g. in:

load_balance()->find_busiest_group()->update_sd_lb_stats()->update_sg_lb_stats()

for_each_cpu_and(i, sched_group_cpus(group), env->cpus)

We don't hold the locks for 'i' when we read e.g. the weighted cpuload (cpu_rq(cpu)->cfs.runnable_load_avg) (or the utilization for that matter).

...

If accessing cfs_rq without locking is actually safe then please let me know.

...
The issue is that we could see a lot of requests during the time the em mechanism is throttled.

Why don't you just add:

cpumask_set_cpu(cpu_of(rq), &update_cpus);

if (energy_aware() && !cpumask_empty(&update_cpus)) arch_eval_cpu_freq(&update_cpus);

into task_tick_fair() to detect cpu usage changes due to a single task running in case there is no enqueue/dequeue happening?

Sure, and putting the change into update_entity_load_avg as I suggested achieves the same end result. That data will never be more stale than the rest of the PELT values.

True, I was talking about the fact that updating PELT might happen way before the throttling time is over so before you kick the kthread again you would have to update cpu usage any way. But on second thought this might not be a problem (the staleness of cpu load) after all. If nothing is running on that cpu it's probably (nohz) idle and if there is, then we get (de/en-queues or ticks) which then could set the stage for arch_scale_cpu_freq() to do its thing. Not sure if this could be an issue for scaling frequency down though?

...

Also I've been thinking a lot about the combination of throttling and the kthread. I'm starting to think that this design would end up being periodic (based on the throttle period), which isn't so different from how legacy CPUfreq governors work today. Do you have time to chat on the phone about this? I'll send out an invite.

Sure.

[...]

...

Thanks for the graph. I'll state some assumptions. Please let me know if I am wrong.

cpu0 & cpu1 are A15s whose max capacity is 1024. You are running some use case that mostly keeps those cpus fully utilized (or more correctly, over-utilized) at the highest frequency, which is why the red dot stays capped at SCHED_LOAD_SCALE.

You're right here.

The workload is 'sysbench --test=cpu --num-threads=8 --max-time=3 --max-requests=100000 run

I further attached the plots of 'cap', 'util', 'max_util' (trace in arch_eval_cpu_freq()) and 'em->target_freq' (trace in arch_scale_cpu_freq()) for this particular test run.

...

cpu2 & cpu3 seem to show the same behavior but the different scales make it hard to be certain. I'll assume that they are both running at their highest frequency and over-utilized for most of the use case.

Yes.

...

Assuming what I stated above is correct, this seems like useful data to me. The point of having the per-cpu atomic storage is to have lockless access to this useful info. If the cpufreq thread needs to iterate over this data to find the most utilized cpu in the freq domain then the per-cpu version of this removes any requirement to lock the cfs_rq structures.

IMHO, just to read cfs_rq load/utilization data, we do not need to hold the cfs_rq lock.

...

...
information that a cpu requested a freq change. In case of 'pending requests' happening in a throttle period, what is the value of the per (cfs_)rq signal 'cpu usage' any way? It could be a stale value once the throttle period is over.

My proposal was to update the per-cpu data in update_entity_load_avg every time the load contribution changes. Thus it would never be stale (or no more or less stale than the rest of PELT values). Please let me know if I have any holes in that assumption.

OK, but since reading of cfs_rq utilization on all cpu's of the frequency domain is possible in arch_eval_cpu_freq(), why bother?

...

When the governor thread finally is ready to perform a cpu frequency transition it has already had the latest data *pushed* to it via the shared per-cpu atomic.

IMHO, we don't need this per-cpu atomic.

...

...

colors in attached diagram (data from TC2 (2 A15, 2 A7) in

update_entity_load_avg()

...
The above untested snippet only cares about the CFS usage. In the future we might want to aggregate it with DL and RT usage as well, or we might have a per-class, per-cpu atomic.

IMHO,we should focus on CFS (with the underlaying PELT mechanism) for now and solve this smaller set of problems first:

(1) Fixup the 'struct cpumask update_cpus' problem.

Can you elaborate on what this problem is? I wanted to replace this cpumask since I am not trying to solve the CONFIG_FAIR_GROUP_SCHED problem where a sched_entity is not a task. Is there another problem besides that?

No, I don't think so. 'update_cpus' is right now only a single cpu. And even for group scheduling we're only dealing with one cpu (cpu_of(rq).

Morten pointed out earlier that we might need a mask if we want to tell the Sched-DVFS to change the frequency for more than 1 freq domain with kicks of both kthreads (e.g. in load balance on DIE level, where dst_cpu and src_cpu belong to two different freq domains). Maybe not for this iteration ...

...

...
(2) Fix the problem that we evaluate the cpu_usage exactly at the end of a throttling period if there were 'pending requests' even if there are no events at this moment.

Yes, I agree this is a problem.

Cool.

...

...
(3) kick the logic whenever we cross the up_threshold for scaling cpu frequency up.

This can be taken care of by either kicking the thread in update_entity_load_avg or by adding the thread-kick to every caller of that function, namely task_tick_fair as you pointed out.

OK.

...

...
...
Any ideas on this approach?

Further thoughts:

Assuming that the above idea is agreeable, one of the questions I am grappling with is whether we should store the usage in a per-cpu variable or if we should add it as a member to struct rq (in which case it would be a sum of usage from every sched class).

I am currently only looking at the CFS-centric, per-cpu atomic variable method, following the philosophy of Keep It Simple, Stupid. However putting it in struct rq has benefits:

I like this 'CFS-centric, per-cpu atomic variable method' because IMHO it gives the playground for everything we need right now to let it work with EAS.

I am confused. At the top of this mail you said you didn't see the advantage, but you like it here. Hopefully we can discuss it over the call.

Sorry about that, I thought you were referring to the existing approach (the atomic em->need_wake_task). I should have spotted the ' ... per-cpu ...' in the sentence above :-)

Like I said before, I don't see the advantage of having a per-cpu atomic for the cpu load.

Cheers,

-- Dietmar

[...]

Juri Lelli

9 Mar 9 Mar

2:06 p.m.

New subject: [PATCH 4/6] sched/energy_model: use EAS energy model

We already have capacity information in the EAS energy model tables. Let's use it to both avoid duplication and achieve freq/uarch invariance.

Signed-off-by: Juri Lelli juri.lelli@arm.com --- kernel/sched/energy_model.c | 29 +++++++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c index 7fec5af..64bd50e 100644 --- a/kernel/sched/energy_model.c +++ b/kernel/sched/energy_model.c @@ -247,6 +247,8 @@ static void em_start(struct cpufreq_policy *policy) unsigned int capacity; struct em_data *em; struct cpufreq_frequency_table *pos; + struct sched_domain *sd; + struct sched_group_energy *sge = NULL;

/* prepare per-policy private data */ em = kzalloc(sizeof(*em), GFP_KERNEL); @@ -265,16 +267,35 @@ static void em_start(struct cpufreq_policy *policy) em->up_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL); em->down_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);

- cpufreq_for_each_entry(pos, policy->freq_table) { - /* FIXME capacity below is not scaled for uarch */ - capacity = pos->frequency * SCHED_CAPACITY_SCALE / policy->max; + for_each_domain(cpumask_first(policy->cpus), sd) + if (!sd->child) { +#ifdef CONFIG_SCHED_DEBUG + pr_debug("%s: using %s sched_group_energy\n", + __func__, + sd->name); +#endif + sge = sd->groups->sge; + break; + } + + if (!sge) { + pr_debug("%s: failed to access sched_group_energy\n", + __func__); + kfree(em->up_threshold); + kfree(em->down_threshold); + kfree(em); + return; + } + + for (index = 0; index < sge->nr_cap_states; index++) { + /* capacity below is both freq and uarch scaled */ + capacity = sge->cap_states[index].cap; em->up_threshold[index] = capacity * UP_THRESHOLD / 100; em->down_threshold[index] = capacity * DOWN_THRESHOLD / 100; pr_debug("%s: cpu = %u index = %d capacity = %u up = %u down = %u\n", __func__, cpumask_first(policy->cpus), index, capacity, em->up_threshold[index], em->down_threshold[index]); - index++; }

/* init per-policy kthread */ -- 2.2.2

Mike Turquette

10 Mar 10 Mar

1:38 a.m.

New subject: [PATCH 4/6] sched/energy_model: use EAS energy model

Quoting Juri Lelli (2015-03-09 07:06:18)

...

We already have capacity information in the EAS energy model tables. Let's use it to both avoid duplication and achieve freq/uarch invariance.

Signed-off-by: Juri Lelli juri.lelli@arm.com

kernel/sched/energy_model.c | 29 +++++++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c index 7fec5af..64bd50e 100644 --- a/kernel/sched/energy_model.c +++ b/kernel/sched/energy_model.c @@ -247,6 +247,8 @@ static void em_start(struct cpufreq_policy *policy) unsigned int capacity; struct em_data *em; struct cpufreq_frequency_table *pos;
  struct sched_domain *sd;
  struct sched_group_energy *sge = NULL;
/* prepare per-policy private data */ em = kzalloc(sizeof(*em), GFP_KERNEL); @@ -265,16 +267,35 @@ static void em_start(struct cpufreq_policy *policy) em->up_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL); em->down_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);
  cpufreq_for_each_entry(pos, policy->freq_table) {
          /* FIXME capacity below is not scaled for uarch */
          capacity = pos->frequency * SCHED_CAPACITY_SCALE / policy->max;
  for_each_domain(cpumask_first(policy->cpus), sd)
          if (!sd->child) {
+#ifdef CONFIG_SCHED_DEBUG
                  pr_debug("%s: using %s sched_group_energy\n",
                           __func__,
                           sd->name);
+#endif
                  sge = sd->groups->sge;
                  break;
          }
  if (!sge) {
          pr_debug("%s: failed to access sched_group_energy\n",
                   __func__);
          kfree(em->up_threshold);
          kfree(em->down_threshold);
          kfree(em);
          return;
  }
  for (index = 0; index < sge->nr_cap_states; index++) {
          /* capacity below is both freq and uarch scaled */
          capacity = sge->cap_states[index].cap;

Hi Juri,

Thanks for sharing the patch. Is there a functional change in this patch? From what I can tell this is a different way of doing the same thing. Please let me know if I am wrong.

Also, I'm trying to understand the full dependency on the energy model. I see get_cpu_usage is now used here. Assuming that we use the old code that uses the internal cpufreq frequency table (instead of the sge cap state table) then the only tangible dependency is on Vincent's V10 cpu capacity rework, correct? Put another way, you are not yet using any energy_diff stuff in the dvfs policy.

Thanks, Mike

...

            em->up_threshold[index] = capacity * UP_THRESHOLD / 100;
            em->down_threshold[index] = capacity * DOWN_THRESHOLD / 100;
            pr_debug("%s: cpu = %u index = %d capacity = %u up = %u down = %u\n",
                            __func__, cpumask_first(policy->cpus), index,
                            capacity, em->up_threshold[index],
                            em->down_threshold[index]);

```
          index++;
  }
```

/* init per-policy kthread */ -- 2.2.2

Juri Lelli

10:10 a.m.

New subject: [PATCH 4/6] sched/energy_model: use EAS energy model

Hi Mike,

On 10/03/15 01:38, Mike Turquette wrote:

...

Quoting Juri Lelli (2015-03-09 07:06:18)

...
We already have capacity information in the EAS energy model tables. Let's use it to both avoid duplication and achieve freq/uarch invariance.

Signed-off-by: Juri Lelli juri.lelli@arm.com

kernel/sched/energy_model.c | 29 +++++++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c index 7fec5af..64bd50e 100644 --- a/kernel/sched/energy_model.c +++ b/kernel/sched/energy_model.c @@ -247,6 +247,8 @@ static void em_start(struct cpufreq_policy *policy) unsigned int capacity; struct em_data *em; struct cpufreq_frequency_table *pos;
  struct sched_domain *sd;
  struct sched_group_energy *sge = NULL;
/* prepare per-policy private data */ em = kzalloc(sizeof(*em), GFP_KERNEL); @@ -265,16 +267,35 @@ static void em_start(struct cpufreq_policy *policy) em->up_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL); em->down_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);
  cpufreq_for_each_entry(pos, policy->freq_table) {
          /* FIXME capacity below is not scaled for uarch */
          capacity = pos->frequency * SCHED_CAPACITY_SCALE / policy->max;
  for_each_domain(cpumask_first(policy->cpus), sd)
          if (!sd->child) {
+#ifdef CONFIG_SCHED_DEBUG
                  pr_debug("%s: using %s sched_group_energy\n",
                           __func__,
                           sd->name);
+#endif
                  sge = sd->groups->sge;
                  break;
          }
  if (!sge) {
          pr_debug("%s: failed to access sched_group_energy\n",
                   __func__);
          kfree(em->up_threshold);
          kfree(em->down_threshold);
          kfree(em);
          return;
  }
  for (index = 0; index < sge->nr_cap_states; index++) {
          /* capacity below is both freq and uarch scaled */
          capacity = sge->cap_states[index].cap;
Hi Juri,

Thanks for sharing the patch. Is there a functional change in this patch? From what I can tell this is a different way of doing the same thing. Please let me know if I am wrong.

Yes, you are right, no functional change here. But I needed to create these tables using the energy model because the utilization signal we get from get_cpu_usage is both uarch and freq capped to the current capacity of the cpu. So the thresholds had to take this into account.

...

Also, I'm trying to understand the full dependency on the energy model. I see get_cpu_usage is now used here. Assuming that we use the old code that uses the internal cpufreq frequency table (instead of the sge cap state table) then the only tangible dependency is on Vincent's V10 cpu capacity rework, correct? Put another way, you are not yet using any energy_diff stuff in the dvfs policy.

No energy_diff stuff in there. But you still got a dependency from the energy model tables to create the internal tables.

I'm not sure if we want to ever use energy_diff stuff in there, actually. We may want to be as general as possible, and just act on changes of the utilization signal of the various cpus.

Best,

- Juri

...

Thanks, Mike

...

            em->up_threshold[index] = capacity * UP_THRESHOLD / 100;
            em->down_threshold[index] = capacity * DOWN_THRESHOLD / 100;
            pr_debug("%s: cpu = %u index = %d capacity = %u up = %u down = %u\n",
                            __func__, cpumask_first(policy->cpus), index,
                            capacity, em->up_threshold[index],
                            em->down_threshold[index]);

```
          index++;
  }
```

/* init per-policy kthread */ -- 2.2.2

Mike Turquette

11 p.m.

New subject: [PATCH 4/6] sched/energy_model: use EAS energy model

On Tue, Mar 10, 2015 at 3:10 AM, Juri Lelli juri.lelli@arm.com wrote:

...

Hi Mike,

On 10/03/15 01:38, Mike Turquette wrote:

...
Quoting Juri Lelli (2015-03-09 07:06:18)

...
We already have capacity information in the EAS energy model tables. Let's use it to both avoid duplication and achieve freq/uarch invariance.

Signed-off-by: Juri Lelli juri.lelli@arm.com

kernel/sched/energy_model.c | 29 +++++++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c index 7fec5af..64bd50e 100644 --- a/kernel/sched/energy_model.c +++ b/kernel/sched/energy_model.c @@ -247,6 +247,8 @@ static void em_start(struct cpufreq_policy *policy) unsigned int capacity; struct em_data *em; struct cpufreq_frequency_table *pos;
  struct sched_domain *sd;
  struct sched_group_energy *sge = NULL;

  /* prepare per-policy private data */
  em = kzalloc(sizeof(*em), GFP_KERNEL);
@@ -265,16 +267,35 @@ static void em_start(struct cpufreq_policy *policy) em->up_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL); em->down_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);
  cpufreq_for_each_entry(pos, policy->freq_table) {
          /* FIXME capacity below is not scaled for uarch */
          capacity = pos->frequency * SCHED_CAPACITY_SCALE / policy->max;
  for_each_domain(cpumask_first(policy->cpus), sd)
          if (!sd->child) {
+#ifdef CONFIG_SCHED_DEBUG
                  pr_debug("%s: using %s sched_group_energy\n",
                           __func__,
                           sd->name);
+#endif
                  sge = sd->groups->sge;
                  break;
          }
  if (!sge) {
          pr_debug("%s: failed to access sched_group_energy\n",
                   __func__);
          kfree(em->up_threshold);
          kfree(em->down_threshold);
          kfree(em);
          return;
  }
  for (index = 0; index < sge->nr_cap_states; index++) {
          /* capacity below is both freq and uarch scaled */
          capacity = sge->cap_states[index].cap;
Hi Juri,

Thanks for sharing the patch. Is there a functional change in this patch? From what I can tell this is a different way of doing the same thing. Please let me know if I am wrong.
Yes, you are right, no functional change here. But I needed to create these tables using the energy model because the utilization signal we get from get_cpu_usage is both uarch and freq capped to the current capacity of the cpu. So the thresholds had to take this into account.

I had forgotten about the uarch invariance bit. I'll need to take that into account as well if building the threshold cap values from scratch.

However the thresholds at a certain capacity will always be less than that capacity (e.g. 20% or 80% of that frequency/capacity). So besides taking into account the uarch invariance was there another reason to pull the energy model cap state table in?

...

...
Also, I'm trying to understand the full dependency on the energy model. I see get_cpu_usage is now used here. Assuming that we use the old code that uses the internal cpufreq frequency table (instead of the sge cap state table) then the only tangible dependency is on Vincent's V10 cpu capacity rework, correct? Put another way, you are not yet using any energy_diff stuff in the dvfs policy.

No energy_diff stuff in there. But you still got a dependency from the energy model tables to create the internal tables.

I'm not sure if we want to ever use energy_diff stuff in there, actually. We may want to be as general as possible, and just act on changes of the utilization signal of the various cpus.

For a first step I completely agree. But in the future I can imagine a scenario where we want to compare the energy cost of migrating a task from src_cpu to dst_cpu versus running src_cpu at a faster frequency.

Regards, Mike

...

Best,

Juri

...

Thanks, Mike

...

            em->up_threshold[index] = capacity * UP_THRESHOLD / 100;
            em->down_threshold[index] = capacity * DOWN_THRESHOLD / 100;
            pr_debug("%s: cpu = %u index = %d capacity = %u up = %u down = %u\n",
                            __func__, cpumask_first(policy->cpus), index,
                            capacity, em->up_threshold[index],
                            em->down_threshold[index]);

          index++;
  }

  /* init per-policy kthread */

-- 2.2.2

Juri Lelli

11 Mar 11 Mar

11 a.m.

New subject: [PATCH 4/6] sched/energy_model: use EAS energy model

On 10/03/15 23:00, Mike Turquette wrote:

...

On Tue, Mar 10, 2015 at 3:10 AM, Juri Lelli juri.lelli@arm.com wrote:

...
Hi Mike,

On 10/03/15 01:38, Mike Turquette wrote:

...
Quoting Juri Lelli (2015-03-09 07:06:18)

...
We already have capacity information in the EAS energy model tables. Let's use it to both avoid duplication and achieve freq/uarch invariance.

Signed-off-by: Juri Lelli juri.lelli@arm.com

kernel/sched/energy_model.c | 29 +++++++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c index 7fec5af..64bd50e 100644 --- a/kernel/sched/energy_model.c +++ b/kernel/sched/energy_model.c @@ -247,6 +247,8 @@ static void em_start(struct cpufreq_policy *policy) unsigned int capacity; struct em_data *em; struct cpufreq_frequency_table *pos;
  struct sched_domain *sd;
  struct sched_group_energy *sge = NULL;

  /* prepare per-policy private data */
  em = kzalloc(sizeof(*em), GFP_KERNEL);
@@ -265,16 +267,35 @@ static void em_start(struct cpufreq_policy *policy) em->up_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL); em->down_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);
  cpufreq_for_each_entry(pos, policy->freq_table) {
          /* FIXME capacity below is not scaled for uarch */
          capacity = pos->frequency * SCHED_CAPACITY_SCALE / policy->max;
  for_each_domain(cpumask_first(policy->cpus), sd)
          if (!sd->child) {
+#ifdef CONFIG_SCHED_DEBUG
                  pr_debug("%s: using %s sched_group_energy\n",
                           __func__,
                           sd->name);
+#endif
                  sge = sd->groups->sge;
                  break;
          }
  if (!sge) {
          pr_debug("%s: failed to access sched_group_energy\n",
                   __func__);
          kfree(em->up_threshold);
          kfree(em->down_threshold);
          kfree(em);
          return;
  }
  for (index = 0; index < sge->nr_cap_states; index++) {
          /* capacity below is both freq and uarch scaled */
          capacity = sge->cap_states[index].cap;
Hi Juri,

Thanks for sharing the patch. Is there a functional change in this patch? From what I can tell this is a different way of doing the same thing. Please let me know if I am wrong.
Yes, you are right, no functional change here. But I needed to create these tables using the energy model because the utilization signal we get from get_cpu_usage is both uarch and freq capped to the current capacity of the cpu. So the thresholds had to take this into account.
I had forgotten about the uarch invariance bit. I'll need to take that into account as well if building the threshold cap values from scratch.

However the thresholds at a certain capacity will always be less than that capacity (e.g. 20% or 80% of that frequency/capacity). So besides taking into account the uarch invariance was there another reason to pull the energy model cap state table in?

Nope. I just had to build the thresholds around the max capacity you can get at a certain OPP on a certain CPU.

...

...
...
Also, I'm trying to understand the full dependency on the energy model. I see get_cpu_usage is now used here. Assuming that we use the old code that uses the internal cpufreq frequency table (instead of the sge cap state table) then the only tangible dependency is on Vincent's V10 cpu capacity rework, correct? Put another way, you are not yet using any energy_diff stuff in the dvfs policy.

No energy_diff stuff in there. But you still got a dependency from the energy model tables to create the internal tables.

I'm not sure if we want to ever use energy_diff stuff in there, actually. We may want to be as general as possible, and just act on changes of the utilization signal of the various cpus.

For a first step I completely agree. But in the future I can imagine a scenario where we want to compare the energy cost of migrating a task from src_cpu to dst_cpu versus running src_cpu at a faster frequency.

Not sure I'm following you here. Isn't this up to the scheduler?

Best,

- Juri

...

Regards, Mike

...

Best,

Juri

...

Thanks, Mike

...

            em->up_threshold[index] = capacity * UP_THRESHOLD / 100;
            em->down_threshold[index] = capacity * DOWN_THRESHOLD / 100;
            pr_debug("%s: cpu = %u index = %d capacity = %u up = %u down = %u\n",
                            __func__, cpumask_first(policy->cpus), index,
                            capacity, em->up_threshold[index],
                            em->down_threshold[index]);

          index++;
  }

  /* init per-policy kthread */

-- 2.2.2

Dietmar Eggemann

16 Mar 16 Mar

4:45 p.m.

New subject: [PATCH 4/6] sched/energy_model: use EAS energy model

Hi Juri,

On 09/03/15 14:06, Juri Lelli wrote:

...

We already have capacity information in the EAS energy model tables. Let's use it to both avoid duplication and achieve freq/uarch invariance.

[...]

...

cpufreq_for_each_entry(pos, policy->freq_table) {

/* FIXME capacity below is not scaled for uarch */

capacity = pos->frequency * SCHED_CAPACITY_SCALE / policy->max;

With CONFIG_PROVE_RCU I get:

[ 7.908644] kernel/sched/energy_model.c:271 suspicious rcu_dereference_check() usage! [ 8.024003] [ 8.024003] other info that might help us debug this: [ 8.024003] [ 8.184381] [ 8.184381] rcu_scheduler_active = 1, debug_locks = 1 [ 8.291042] 5 locks held by swapper/0/1: [ 8.345822] #0: (&dev->mutex){......}, at: [<c031837b>] __driver_attach+0x27/0x58 [ 8.455530] #1: (&dev->mutex){......}, at: [<c0318387>] __driver_attach+0x33/0x58 [ 8.566741] #2: (subsys mutex#7){+.+.+.}, at: [<c0317213>] subsys_interface_register+0x2b/0x80 [ 8.635648] scsi 0:0:0:0: Direct-Access Kingston DT 100 G2 PMAP PQ: 0 ANSI: 0 CCS [ 8.803312] #3: (cpufreq_rwsem){.+.+.+}, at: [<c038c915>] __cpufreq_add_dev.isra.27+0x6d/0x6e8 [ 8.926736] #4: (&policy->rwsem){+.+.+.}, at: [<c038cb79>] __cpufreq_add_dev.isra.27+0x2d1/0x6e8 [ 9.053076] [ 9.053076] stack backtrace: [ 9.160686] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 3.19.0-rc7+ #10 [ 9.229420] Hardware name: ARM-Versatile Express [ 9.293019] [<c0012ced>] (unwind_backtrace) from [<c000fbc9>] (show_stack+0x11/0x14) [ 9.414683] [<c000fbc9>] (show_stack) from [<c04bd991>] (dump_stack+0x65/0x70) [ 9.540294] [<c04bd991>] (dump_stack) from [<c00528c1>] (energy_model_setup+0x101/0x1b4) [ 9.675339] [<c00528c1>] (energy_model_setup) from [<c038b60b>] (__cpufreq_governor+0x63/0x180) [ 9.818187] [<c038b60b>] (__cpufreq_governor) from [<c038c639>] (cpufreq_set_policy+0x119/0x144) [ 9.962893] [<c038c639>] (cpufreq_set_policy) from [<c038c895>] (cpufreq_init_policy+0x5d/0x70) [ 10.108145] [<c038c895>] (cpufreq_init_policy) from [<c038cf15>] (__cpufreq_add_dev.isra.27+0x66d/0x6e8) [ 10.256520] [<c038cf15>] (__cpufreq_add_dev.isra.27) from [<c0317243>] (subsys_interface_register+0x5b/0x80) [ 10.407772] [<c0317243>] (subsys_interface_register) from [<c038bd93>] (cpufreq_register_driver+0xa7/0x1b0)

IMHO, some rcu locking is needed to prevent this when dealing with sd pointers:

diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c index 78902d9f75e0..e17173ccfaa5 100644 --- a/kernel/sched/energy_model.c +++ b/kernel/sched/energy_model.c @@ -268,6 +268,7 @@ static void em_start(struct cpufreq_policy *policy) em->up_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL); em->down_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);

+ rcu_read_lock(); for_each_domain(cpumask_first(policy->cpus), sd) if (!sd->child) { #ifdef CONFIG_SCHED_DEBUG @@ -285,6 +286,7 @@ static void em_start(struct cpufreq_policy *policy) kfree(em->up_threshold); kfree(em->down_threshold); kfree(em); + rcu_read_unlock(); return; }

@@ -299,6 +301,8 @@ static void em_start(struct cpufreq_policy *policy) em->down_threshold[index]); }

+ rcu_read_unlock(); + /* init per-policy kthread */ em->task = kthread_create(energy_model_thread, policy, "kenergy_model_task"); if (IS_ERR_OR_NULL(em->task))

-- Dietmar

...

for_each_domain(cpumask_first(policy->cpus), sd)
if (!sd->child) {
+#ifdef CONFIG_SCHED_DEBUG

[...]

Juri Lelli

4:52 p.m.

New subject: [PATCH 4/6] sched/energy_model: use EAS energy model

Hi Dietmar,

On 16/03/15 16:45, Dietmar Eggemann wrote:

...

Hi Juri,

On 09/03/15 14:06, Juri Lelli wrote:

...
We already have capacity information in the EAS energy model tables. Let's use it to both avoid duplication and achieve freq/uarch invariance.

[...]

...
cpufreq_for_each_entry(pos, policy->freq_table) {
/* FIXME capacity below is not scaled for uarch */
capacity = pos->frequency * SCHED_CAPACITY_SCALE / policy->max;
With CONFIG_PROVE_RCU I get:

[ 7.908644] kernel/sched/energy_model.c:271 suspicious rcu_dereference_check() usage! [ 8.024003] [ 8.024003] other info that might help us debug this: [ 8.024003] [ 8.184381] [ 8.184381] rcu_scheduler_active = 1, debug_locks = 1 [ 8.291042] 5 locks held by swapper/0/1: [ 8.345822] #0: (&dev->mutex){......}, at: [<c031837b>] __driver_attach+0x27/0x58 [ 8.455530] #1: (&dev->mutex){......}, at: [<c0318387>] __driver_attach+0x33/0x58 [ 8.566741] #2: (subsys mutex#7){+.+.+.}, at: [<c0317213>] subsys_interface_register+0x2b/0x80 [ 8.635648] scsi 0:0:0:0: Direct-Access Kingston DT 100 G2 PMAP PQ: 0 ANSI: 0 CCS [ 8.803312] #3: (cpufreq_rwsem){.+.+.+}, at: [<c038c915>] __cpufreq_add_dev.isra.27+0x6d/0x6e8 [ 8.926736] #4: (&policy->rwsem){+.+.+.}, at: [<c038cb79>] __cpufreq_add_dev.isra.27+0x2d1/0x6e8 [ 9.053076] [ 9.053076] stack backtrace: [ 9.160686] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 3.19.0-rc7+ #10 [ 9.229420] Hardware name: ARM-Versatile Express [ 9.293019] [<c0012ced>] (unwind_backtrace) from [<c000fbc9>] (show_stack+0x11/0x14) [ 9.414683] [<c000fbc9>] (show_stack) from [<c04bd991>] (dump_stack+0x65/0x70) [ 9.540294] [<c04bd991>] (dump_stack) from [<c00528c1>] (energy_model_setup+0x101/0x1b4) [ 9.675339] [<c00528c1>] (energy_model_setup) from [<c038b60b>] (__cpufreq_governor+0x63/0x180) [ 9.818187] [<c038b60b>] (__cpufreq_governor) from [<c038c639>] (cpufreq_set_policy+0x119/0x144) [ 9.962893] [<c038c639>] (cpufreq_set_policy) from [<c038c895>] (cpufreq_init_policy+0x5d/0x70) [ 10.108145] [<c038c895>] (cpufreq_init_policy) from [<c038cf15>] (__cpufreq_add_dev.isra.27+0x66d/0x6e8) [ 10.256520] [<c038cf15>] (__cpufreq_add_dev.isra.27) from [<c0317243>] (subsys_interface_register+0x5b/0x80) [ 10.407772] [<c0317243>] (subsys_interface_register) from [<c038bd93>] (cpufreq_register_driver+0xa7/0x1b0)

IMHO, some rcu locking is needed to prevent this when dealing with sd pointers:

diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c index 78902d9f75e0..e17173ccfaa5 100644 --- a/kernel/sched/energy_model.c +++ b/kernel/sched/energy_model.c @@ -268,6 +268,7 @@ static void em_start(struct cpufreq_policy *policy) em->up_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL); em->down_threshold = kcalloc(count, sizeof(unsigned int), GFP_KERNEL);
  rcu_read_lock();
  for_each_domain(cpumask_first(policy->cpus), sd)
          if (!sd->child) {
#ifdef CONFIG_SCHED_DEBUG @@ -285,6 +286,7 @@ static void em_start(struct cpufreq_policy *policy) kfree(em->up_threshold); kfree(em->down_threshold); kfree(em);
          rcu_read_unlock();
          return;
  }
@@ -299,6 +301,8 @@ static void em_start(struct cpufreq_policy *policy) em->down_threshold[index]); }
  rcu_read_unlock();
  /* init per-policy kthread */
  em->task = kthread_create(energy_model_thread, policy, "kenergy_model_task");
  if (IS_ERR_OR_NULL(em->task))

Right. I already have the same fix on my wip branch. Unfortunately, I posted this before fixing that :(.

Thanks,

- Juri

...

-- Dietmar

...
for_each_domain(cpumask_first(policy->cpus), sd)
if (!sd->child) {
+#ifdef CONFIG_SCHED_DEBUG
[...]

Juri Lelli

9 Mar 9 Mar

2:06 p.m.

New subject: [PATCH 5/6] sched/energy_model: bound kthreads to run on related_cpus

There is one thread for each cluster (cpufreq policy domain). Bound them to run only on the related_cpus of that domain, as to avoid spurious wakeups on unrelated cpus.

Signed-off-by: Juri Lelli juri.lelli@arm.com --- kernel/sched/energy_model.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c index 64bd50e..df18dac 100644 --- a/kernel/sched/energy_model.c +++ b/kernel/sched/energy_model.c @@ -71,7 +71,7 @@ static int energy_model_thread(void *data)

param.sched_priority = 0; sched_setscheduler(current, SCHED_FIFO, &param); - + sched_setaffinity(0, policy->related_cpus);

do { down_write(&policy->rwsem); -- 2.2.2

Dietmar Eggemann

16 Mar 16 Mar

5:48 p.m.

New subject: [PATCH 5/6] sched/energy_model: bound kthreads to run on related_cpus

Hi Juri,

On 09/03/15 14:06, Juri Lelli wrote:

...

There is one thread for each cluster (cpufreq policy domain). Bound them to run only on the related_cpus of that domain, as to avoid spurious wakeups on unrelated cpus.

Signed-off-by: Juri Lelli juri.lelli@arm.com

kernel/sched/energy_model.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c index 64bd50e..df18dac 100644 --- a/kernel/sched/energy_model.c +++ b/kernel/sched/energy_model.c @@ -71,7 +71,7 @@ static int energy_model_thread(void *data)

param.sched_priority = 0; sched_setscheduler(current, SCHED_FIFO, &param);

sched_setaffinity(0, policy->related_cpus);

Why not using set_cpus_allowed_ptr(current, ...) here? All the checks in sched_setscheduler() seem unnecessary if you want to change current.

diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c index 78902d9f75e0..35477d1d6fff 100644 --- a/kernel/sched/energy_model.c +++ b/kernel/sched/energy_model.c @@ -72,7 +72,7 @@ static int energy_model_thread(void *data) param.sched_priority = 0; sched_setscheduler(current, SCHED_FIFO, &param);

- sched_setaffinity(0, policy->related_cpus); + set_cpus_allowed_ptr(current, policy->related_cpus);

do { down_write(&policy->rwsem);

root@linaro-nano:~# ./ea_on.sh GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION ARCH_CAPACITY NO_HRTICK NO_DOUBLE_TICK LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_FORCE_SD_OVERLAP RT_RUNTIME_SHARE NO_LB_MIN ENERGY_AWARE

TC2:

root@linaro-nano:~# ps -ef | grep -v grep | grep energy root 1054 2 0 17:40 ? 00:00:00 [kenergy_model_t] root 1055 2 0 17:40 ? 00:00:00 [kenergy_model_t]

root@linaro-nano:~# taskset -p 1054 pid 1054's current affinity mask: 3 root@linaro-nano:~# taskset -p 1055 pid 1055's current affinity mask: 1c

-- Dietmar

...

do { down_write(&policy->rwsem);

Juri Lelli

9 Mar 9 Mar

2:06 p.m.

New subject: [PATCH 6/6] sched/energy_model: fix spurious kicks

No need to kick the kthread if we are already at the lowest OPP.

Signed-off-by: Juri Lelli juri.lelli@arm.com --- kernel/sched/energy_model.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/energy_model.c b/kernel/sched/energy_model.c index df18dac..960c2d0 100644 --- a/kernel/sched/energy_model.c +++ b/kernel/sched/energy_model.c @@ -222,7 +222,7 @@ void arch_eval_cpu_freq(struct cpumask *update_cpus) * hold the write lock? */ atomic_set(&em->need_wake_task, 1); - } else if (max_util < em->down_threshold[index]) { + } else if (max_util < em->down_threshold[index] && index != 0) { /* write em->target_freq with read lock held */ atomic_long_set(&em->target_freq, policy->cur - 1); /* -- 2.2.2

Mike Turquette

10 Mar 10 Mar

12:34 a.m.

Quoting Juri Lelli (2015-03-09 07:06:14)

...

Hi all,

as agreed with Mike, I'm posting here his sched-dvfs patchset [1] plus my EAS bindings to make it use the energy model. This is meant to re-start discussion on the attempt to drive dvfs directly from the scheduler.

This patchset is based on 4.0-rc2 + EASv3 [2].

Mike is already working on a slightly different, more "reactive", approach. The way we can use EAS infrastructure shouldn't change much anyway, so this works also as a first example of such thing. The core of my little delta resides in the possibility to read the energy tables to get freq and uarch invariance in the new governor (4/6) and the use of get_cpu_usage() to get the utilization signal from the scheduler (that I squashed in 02/06 together with some little fixes).

Thanks for sharing Juri. Can you push this to your ARM tree? I was able to fetch it today but this branch wasn't around. I've started to do the rebase myself with:

git rebase --onto v4.0-rc3 v3.18 arm/wip/ea_v3_dvfs

Unfortunately I'm getting lots of little merge conflicts. Nothing too scary yet but it would save me a lot of time to just fetch your branch.

Thanks! Mike

...

Best,

Juri

[1] https://lkml.org/lkml/2014/10/22/21 [2] https://lkml.org/lkml/2015/2/4/537

Juri Lelli (3): sched/energy_model: use EAS energy model sched/energy_model: bound kthreads to run on related_cpus sched/energy_model: fix spurious kicks

Michael Turquette (1): sched: cfs: cpu frequency scaling based on task placement

Mike Turquette (2): cpufreq: add per-governor private data sched: energy_model: simple cpu frequency scaling policy

drivers/cpufreq/Kconfig | 21 +++ include/linux/cpufreq.h | 6 + kernel/sched/Makefile | 1 + kernel/sched/energy_model.c | 364 ++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 51 ++++++- kernel/sched/sched.h | 5 + 6 files changed, 441 insertions(+), 7 deletions(-) create mode 100644 kernel/sched/energy_model.c

-- 2.2.2

Juri Lelli

9:58 a.m.

Hi Mike,

On 10/03/15 00:34, Mike Turquette wrote:

...

Quoting Juri Lelli (2015-03-09 07:06:14)

...
Hi all,

as agreed with Mike, I'm posting here his sched-dvfs patchset [1] plus my EAS bindings to make it use the energy model. This is meant to re-start discussion on the attempt to drive dvfs directly from the scheduler.

This patchset is based on 4.0-rc2 + EASv3 [2].

Mike is already working on a slightly different, more "reactive", approach. The way we can use EAS infrastructure shouldn't change much anyway, so this works also as a first example of such thing. The core of my little delta resides in the possibility to read the energy tables to get freq and uarch invariance in the new governor (4/6) and the use of get_cpu_usage() to get the utilization signal from the scheduler (that I squashed in 02/06 together with some little fixes).

Thanks for sharing Juri. Can you push this to your ARM tree? I was able to fetch it today but this branch wasn't around. I've started to do the rebase myself with:

git rebase --onto v4.0-rc3 v3.18 arm/wip/ea_v3_dvfs

Unfortunately I'm getting lots of little merge conflicts. Nothing too scary yet but it would save me a lot of time to just fetch your branch.

Yeah, I know, easv3 has small conflicts on 4.0-rc3. We are fixing this sharing thing. I hope to get back to you soon with a reliable way to do it (next week I guess).

Thanks,

- Juri

...

Thanks! Mike

...
Best,

Juri

[1] https://lkml.org/lkml/2014/10/22/21 [2] https://lkml.org/lkml/2015/2/4/537

Juri Lelli (3): sched/energy_model: use EAS energy model sched/energy_model: bound kthreads to run on related_cpus sched/energy_model: fix spurious kicks

Michael Turquette (1): sched: cfs: cpu frequency scaling based on task placement

Mike Turquette (2): cpufreq: add per-governor private data sched: energy_model: simple cpu frequency scaling policy

drivers/cpufreq/Kconfig | 21 +++ include/linux/cpufreq.h | 6 + kernel/sched/Makefile | 1 + kernel/sched/energy_model.c | 364 ++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 51 ++++++- kernel/sched/sched.h | 5 + 6 files changed, 441 insertions(+), 7 deletions(-) create mode 100644 kernel/sched/energy_model.c

-- 2.2.2

3339

days inactive

3348

days old

eas-dev@lists.linaro.org

26 comments

participants

tags (0)

participants (7)

Daniel Lezcano
Dietmar Eggemann
Juri Lelli
Michael Turquette
Michael Turquette
Mike Turquette
Morten Rasmussen